carriage returns within quotation marks causing new lines in csv

koeji · September 2, 2012, 10:17pm

I have a csv file with 3 columns. Fields are comma delimited and strings are enclosed with quotation marks "". About 40% of the time, the line of values will start a new line thanks to carriage return characters within a string.

Example:

"apple","banana","orange"
"pineapple","grape","straw
berry"
"apple","banana","cherry"

Should be:

"apple","banana","orange"
"pineapple","grape","strawberry"
"apple","banana","cherry"

How to create a bash script that will catch this issue and combine the line and the line below?

Don_Cragun · September 2, 2012, 11:18pm

Try:

awk '/^["]/{if(out != "") print out;out = $0;next}
        {out = out $0}
END     {if(out != "") print out}' input

koeji · September 3, 2012, 1:21am

While that worked for the sample provided, the actual sample includes strings:

"Smith", "Margie", "200 Elm St."
"Doe", "John", "N", "123 Maple St.
Apt. C."
"Cox", "William", "234 Church St."

How would the awk script combine lines 2 & 3 and add the appropriate space?

"Smith", "Margie", "200 Elm St."
"Doe", "John", "N", "123 Maple St. Apt. C."
"Cox", "William", "234 Church St.

"

pamu · September 3, 2012, 2:24am

Just add quotes to Don's solution....

awk '/^["]/{if(out != "") print out;out = $0;next}
        {out = out" "$0}
END     {if(out != "") print out}' input

Don_Cragun · September 3, 2012, 7:43am

Adding quotes would produce:

"apple","banana","orange"
"pineapple","grape","straw berry"
"apple","banana","cherry"

instead of:

"apple","banana","orange"
"pineapple","grape","strawberry"
"apple","banana","cherry"

for the sample input that was given in the first post in this thread. You have to choose whether you want "strawberry" and "123 Maple St.Apt. C." OR "straw berry" and "123 Maple St. Apt. C." . If you want "strawberry" and "123 Maple St. Apt. C." , you need to specify the logic that should be used to determine when a space is needed and when it isn't.

Note also that in koeji's original post explicitly stated: "I have a csv file with 3 columns." , but the "actual sample" provided in a later post has:

"Doe", "John", "N", "123 Maple St.
Apt. C."

which not only has 4 fields (instead of 3), but also has <comma><space> separated fields instead of <comma> separated fields.

With the original problem statement, I had considered pairing three sets of <double-quote> characters instead of just looking for lines that didn't start with a <double-quote>. In the future, please post real examples, instead of simplified examples that don't match the actual input you expect to be processing.

Franklin52 · September 3, 2012, 8:04am

You can give this a try:

awk -F\" '!(NF%2){s=$0;getline;$0=s OFS $0}1' file

koeji · September 4, 2012, 2:03am

Thanks to all. While I could not provide actual examples due to the content, I should have been more clear in the requirements and provided a better example. It has helped my learning tremendously.