Transpose Messy Data

I have a messy, pipe-delimited ("|") input dataset.

I would like to create a file of ID plus each component of field 4 which is delimited by ";" into a long, skinny shape for easier processing.

A couple of complications are that field 4 may contain both commas and linefeed characters from the source.

Sample data looks like:

ID1|VAR2|VAR3|VAR4|VAR5
ID2|VAR2|VAR3|PART1;PART2|1;2
ID3|VAR2|VAR3|A, B, C;PART2;BEFORE LF\nAFTER LF|1;2;3
ID4|VAR2|VAR3|1;2;3,;4|1;2;3;4

I would something like data like:

I

D1|VAR4
ID2|PART1
ID2|PART2
ID3|A, B, C
ID3|PART2
ID3|BEFORE LF  AFTER LF
ID4|1
ID4|2
ID4|3
ID4|4

Is there an elegant way to do this at the command line?

Thanks!

To keep the forums high quality for all users, please take the time to format your posts correctly.

First of all, use Code Tags when you post any code or data samples so others can easily read your code. You can easily do this by highlighting your code and then clicking on the # in the editing menu. (You can also type code tags

```text
 and 
```

by hand.)

Second, avoid adding color or different fonts and font size to your posts. Selective use of color to highlight a single word or phrase can be useful at times, but using color, in general, makes the forums harder to read, especially bright colors like red.

Third, be careful when you cut-and-paste, edit any odd characters and make sure all links are working property.

Thank You.

The UNIX and Linux Forums

1 Like

What have you tried to solve this problem?

I don't see anything in your description that explains why the transformations shown in red above happened. What input characters are supposed to be changed to spaces in the output? (The string "\n" is not a linefeed character, but it can be used in a format string to cause some programs to print a linefeed character.) What input characters are supposed to be deleted from the output?

1 Like

Thanks for the reply.

The data file is a large data file from the Federal government.

I would like to read the data using stat software packages.

The lines end with \r\n .

Field 4 is a text field. It can have a \n character embedded in it. Stat software package tends to incorrectly split a record with a \n character into two records. Commas are a legitimate part of field 4 (so I should not have had it removed in the sample output). Field 4 is all caps.

I have tried things like

awk -F"|" '{ print $1f"|"$4F }' < patienttest2.txt | sed 's/,/comma/g' | sed 's/|/,/' | sed 's/;/,/g' | awk -F , '{for (i=2;i<=NF;i++) if ($i>=0) print $1 FS $i}' | sed 's/co
mma/,/g' |

extracting field 1 (the id variable) and field 4 (the value of intereset) from a file

changing "," to "comma" because lowercase letters are not in the datafile

change the pipe to a comma.

the 2nd awk statement was to reshape fields 1 and 4 so that there a line with the id variable for each field 4.

the last sed is to swap "comma" out for "," .

Thanks!

I don't see why you need to modify commas for what you seem to be trying to do.

How is the <newline> (or <carriage-return><newline>) incorrectly inserted in field 4 supposed to be modified? Should it/they be removed, replaced by a single space, or replaced by two spaces (as in your sample)?

Does the software that creates your input ever incorrectly insert <newline> characters in field 5? Does it ever incorrectly insert <newline> characters in fields 1, 2, or 3?

Can there be more than one <newline> character incorrectly inserted in field 4 for what should be a single input line?

1 Like

Thanks for the reply!

There is no need to modify the commas if the second awk statement could split the lines on the pipe characters.

The <newline> should be replaced by a single space.

Fields 1, 2, and 3 are numeric fields, so they never have <newline> characters.

Field 5 is a text field and can have <newline> characters incorrectly inserted.

Yes, fields 4 and 5 can each have multiple <newline> characters.

Thanks!

If I understand your problem correctly, I don't see any need for anything but one awk script for this problem. Try:

awk '
BEGIN {	FS = OFS = "|"
}
{	while(NF < 5) {
		if(NF <= 1) {
			# Read a continuation line for field 5 or 1st line
			# of next record.
			if(getline != 1) {
				# Break out on EOF
				break
			}
		} else {# Read continuation line for field 4.
			if((getline x) != 1) {
				# We should not hit EOF in the middle of a
				# continued line, but check for it anyway.
				break
			}
			$0 = $0 " " x	# Replace incorrect <newline> with a
					# space.
			$1 = $1		# Reset NF after combining lines.
		}
	}
	# Discard <carriage-return>s.
	gsub(/\r/, "")
	n = split($4, sf, ";")
	for(i = 1; i <= n; i++)
		print $1, sf
}' patienttest2.txt

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk .

If patienttest2.txt contains:

ID1|VAR2|VAR3|VAR4|VAR5
ID2|VAR2|VAR3|PART1;PART2|1;2
ID3|VAR2|VAR3|A, B, C;PART2;BEFORE LF
AFTER LF|1;2;3
ID4|VAR2|VAR3|1;2;3,;4|1;2;3;4
ID5|VAR2|VAR3|1,
2;3
4;5
6|f5
f6
f7
ID6|VAR2|VAR3|A,b;C,d|a
con

(with <carriage-return><newline> line terminators or <newline> line terminators), produces the output:

ID1|VAR4
ID2|PART1
ID2|PART2
ID3|A, B, C
ID3|PART2
ID3|BEFORE LF AFTER LF
ID4|1
ID4|2
ID4|3,
ID4|4
ID5|1, 2
ID5|3 4
ID5|5 6
ID6|A,b
ID6|C,d

Does this match what you're trying to do?

1 Like

Yes!

This looks great! Thanks so much!

This code was giving a record for each ID plus a blank field 4 line after the records with data, so I changed

for(i = 1; i <= n; i++)

to

for(i = 1; i < n; i++)

and that seemed to give the desired results and that seemed worthwhile given millions of records.

One additional aspect is that the federal government delivers this data monthly as four files:

thisyear.txt
thismonth.txt
changes.txt
previousyears.txt

I know I could make a copy of this code for each file and change the input datafile, but I was wondering how to loop over the four input files in one program.

Thanks so much for all your help!

---------- Post updated at 12:45 PM ---------- Previous update was at 10:08 AM ----------

I got rid of the leading

awk '

and the close single quote from the last line and ran the script via

awk -f myscript.awk thismonth.txt 

and that seemed to work.

Thanks!

If field 4 contains semicolon terminated fields instead of semicolon separated fields, that is a good change. If field 4 contains semicolon separated fields as shown in you sample input, this change will discard the last subfield in field 4 for each line and will completely skip lines that only have one sub-field terminated by the field separator ( | ). If some lines have an empty subfield after the last semicolon, you could check for an empty subfield before printing an output line.

If you want to produce one output file from the concatenated four input files, just change:

awk -f myscript.awk thismonth.txt

to:

awk -f myscript.awk thisyear.txt thismonth.txt changes.txt previousyears.txt

But, if changes.txt contains additions and deletions, you'll need to modify the script to ignore deletions and only print additions (or if you have a combined file and need to remove other lines that have already been processed).

And, if you want the script to switch output files when it starts processing a new input file, you'll also need to make some minor changes to the script for that.

If you need help with additional changes like this, give detailed descriptions of how field 4 is formatted, how output file names are related in input file names, etc. for the changes that you want to make.

1 Like

Thanks so much for the helpful reply!

Yes, field 4 is terminated by

;

and not merely separated.