To keep the forums high quality for all users, please take the time to format your posts correctly.
First of all, use Code Tags when you post any code or data samples so others can easily read your code. You can easily do this by highlighting your code and then clicking on the # in the editing menu. (You can also type code tags
```text
and
```
by hand.)
Second, avoid adding color or different fonts and font size to your posts. Selective use of color to highlight a single word or phrase can be useful at times, but using color, in general, makes the forums harder to read, especially bright colors like red.
Third, be careful when you cut-and-paste, edit any odd characters and make sure all links are working property.
I don't see anything in your description that explains why the transformations shown in red above happened. What input characters are supposed to be changed to spaces in the output? (The string "\n" is not a linefeed character, but it can be used in a format string to cause some programs to print a linefeed character.) What input characters are supposed to be deleted from the output?
The data file is a large data file from the Federal government.
I would like to read the data using stat software packages.
The lines end with \r\n .
Field 4 is a text field. It can have a \n character embedded in it. Stat software package tends to incorrectly split a record with a \n character into two records. Commas are a legitimate part of field 4 (so I should not have had it removed in the sample output). Field 4 is all caps.
I have tried things like
awk -F"|" '{ print $1f"|"$4F }' < patienttest2.txt | sed 's/,/comma/g' | sed 's/|/,/' | sed 's/;/,/g' | awk -F , '{for (i=2;i<=NF;i++) if ($i>=0) print $1 FS $i}' | sed 's/co
mma/,/g' |
extracting field 1 (the id variable) and field 4 (the value of intereset) from a file
changing "," to "comma" because lowercase letters are not in the datafile
change the pipe to a comma.
the 2nd awk statement was to reshape fields 1 and 4 so that there a line with the id variable for each field 4.
I don't see why you need to modify commas for what you seem to be trying to do.
How is the <newline> (or <carriage-return><newline>) incorrectly inserted in field 4 supposed to be modified? Should it/they be removed, replaced by a single space, or replaced by two spaces (as in your sample)?
Does the software that creates your input ever incorrectly insert <newline> characters in field 5? Does it ever incorrectly insert <newline> characters in fields 1, 2, or 3?
Can there be more than one <newline> character incorrectly inserted in field 4 for what should be a single input line?
If I understand your problem correctly, I don't see any need for anything but one awk script for this problem. Try:
awk '
BEGIN { FS = OFS = "|"
}
{ while(NF < 5) {
if(NF <= 1) {
# Read a continuation line for field 5 or 1st line
# of next record.
if(getline != 1) {
# Break out on EOF
break
}
} else {# Read continuation line for field 4.
if((getline x) != 1) {
# We should not hit EOF in the middle of a
# continued line, but check for it anyway.
break
}
$0 = $0 " " x # Replace incorrect <newline> with a
# space.
$1 = $1 # Reset NF after combining lines.
}
}
# Discard <carriage-return>s.
gsub(/\r/, "")
n = split($4, sf, ";")
for(i = 1; i <= n; i++)
print $1, sf
}' patienttest2.txt
If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk .
If patienttest2.txt contains:
ID1|VAR2|VAR3|VAR4|VAR5
ID2|VAR2|VAR3|PART1;PART2|1;2
ID3|VAR2|VAR3|A, B, C;PART2;BEFORE LF
AFTER LF|1;2;3
ID4|VAR2|VAR3|1;2;3,;4|1;2;3;4
ID5|VAR2|VAR3|1,
2;3
4;5
6|f5
f6
f7
ID6|VAR2|VAR3|A,b;C,d|a
con
(with <carriage-return><newline> line terminators or <newline> line terminators), produces the output:
I know I could make a copy of this code for each file and change the input datafile, but I was wondering how to loop over the four input files in one program.
Thanks so much for all your help!
---------- Post updated at 12:45 PM ---------- Previous update was at 10:08 AM ----------
I got rid of the leading
awk '
and the close single quote from the last line and ran the script via
If field 4 contains semicolon terminated fields instead of semicolon separated fields, that is a good change. If field 4 contains semicolon separated fields as shown in you sample input, this change will discard the last subfield in field 4 for each line and will completely skip lines that only have one sub-field terminated by the field separator ( | ). If some lines have an empty subfield after the last semicolon, you could check for an empty subfield before printing an output line.
If you want to produce one output file from the concatenated four input files, just change:
But, if changes.txt contains additions and deletions, you'll need to modify the script to ignore deletions and only print additions (or if you have a combined file and need to remove other lines that have already been processed).
And, if you want the script to switch output files when it starts processing a new input file, you'll also need to make some minor changes to the script for that.
If you need help with additional changes like this, give detailed descriptions of how field 4 is formatted, how output file names are related in input file names, etc. for the changes that you want to make.