Remove Carriage Return (CRLF) within double quotes

How can I remove carriage returns (CRLF) within double quotes in a file? There are multiple CRLFs within double quotes. We are on Ubuntu 14.04.2 LTS.

The file that we are importing is a CSV file transferred from Unix to Windows, and it was converted with unix2dos, so every line in the file ends with CRLF. However, there is a comment field in which we see multiple CRLFs inside the single field, and the contents of the field are enclosed in double quotes. So when we open the file in Notepad, a single record looks as if it is broken into several lines. If the file is opened in Excel, it comes out OK.

My requirement is that every CRLF within double quotes be replaced by a single space.

Hello Covina,

Welcome to the forums; I hope you will enjoy learning here. Here are some examples that may help you remove carriage return characters.

tr -d '\r' < Input_file > Output_file
awk '{gsub(/\r/, " ");print}' Input_file > Output_file
 

Apart from this, you can also install the dos2unix utility on your box. Hope this helps.
Enjoy learning :b:

Thanks,
R. Singh

I'd be surprised if a field in a .csv file had <CR><LF> chars in it, as Excel usually uses single <LF> chars for in-cell line breaks.
Please post (attach) a small but meaningful sample file on which we can work.

The problem started when the file was transferred to Unix and all CRLF became LF, now no longer distinguishable from the embedded LF.
Converting all to CRLF does not help.
Maybe you can redo the original download/transfer to Windows?
In case of ftp transfer use binary mode!
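
To check which line endings a file really has before trying any fix, something like the following may help. It is only a sketch: the function name count_line_endings and the sample data are made up for illustration, and it reads from standard input.

```shell
# Count lines that still carry a CR before the LF versus lines ending in a bare LF.
count_line_endings() {
    awk '
        /\r$/ { crlf++; next }   # line ends with CR (CRLF terminated)
              { lf++ }           # line ends with a bare LF
        END   { printf "CRLF=%d LF=%d\n", crlf, lf }
    '
}

# Made-up sample: one quoted field with an embedded bare LF, records ending in CRLF.
printf 'a,"x\ny",b\r\nc,d\r\n' | count_line_endings
# prints: CRLF=2 LF=1
```

On a real file this would be run as count_line_endings < yourfile.csv. If it reports only CRLF or only LF, the embedded and the terminating line breaks can no longer be told apart by the line ending alone, and the quotes have to be used instead.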

Please copy and paste it into Notepad++ and you will see it (display the end-of-line symbols to see the CRLFs).
Here is the sample file:

6,,G0570405,037112,13907187-P06241,P06241,,B1249998,,,03/10/2014,12:42:00,03/11/2014 10:00:00,,,,,,,,Cool/Acceptable,1301,DNA PaxGene Tube,,,"-- Added 3/11/2014 2:36 PM by abc --

Processed in purification",Small Box,,WB for DNA,Yes,3/11/2014 11:55 AM,130100010,,,,,,0,,
9,,,,12112008,BAF312A2304,EH07583,G0521371,2024001,D-D,02/21/2014,09:22:00,03/11/2014 09:00:00,,,,,Male,1240924436,,Frozen Acceptable ,2024,DNA PaxGene Tube,V1CYP,,"Partsample Barcode: 01240924436011
Material: BA
Sample Description: CYP2C9 Haplotype
Eurofins Study Code: 2181261098

-- Added 3/11/2014 2:36 PM by abc --

Processed in purification",Small Box,,WB for DNA,No,3/11/2014 12:18 PM,,,,,,,0,,
10,,,,CL15312003,12-102-13,,G0573245,0135-1306,,03/10/2014,,03/11/2014 10:00:00,,,,,,,,Cool/Acceptable,135,DNA PaxGene Tube,,,"-- Added 3/11/2014 2:36 PM by abc --

Processed in purification",Small Box,3064,WB for DNA,Yes,3/11/2014 12:36 PM,,,,,,,0,,
11,,,,CL15312003,12-102-13,,G0576563,0135-1308,,03/10/2014,15:45:00,03/11/2014 10:00:00,,,,,,,,Cool/Acceptable,135,DNA PaxGene Tube,,,"-- Added 3/11/2014 2:36 PM by abc --

Processed in purification",Small Box,3064,WB for DNA,Yes,3/11/2014 12:36 PM,,,,,,,0,,
12,,,,CL15312003,12-102-13,,G0576562,0135-1307,,03/10/2014,12:48:00,03/11/2014 10:00:00,,,,,,,,Cool/Acceptable,135,DNA PaxGene Tube,,,"-- Added 3/11/2014 2:36 PM by abc --

Processed in purification",Small Box,3064,WB for DNA,Yes,3/11/2014 12:36 PM,,,,,,,0,,
24,,,,19012034,BAN2401-G000-201,I6720150214,G0521407,10901054,,03/04/2014,,03/11/2014 10:00:00,,,,,Female,,,Frozen Acceptable ,1090,EDTA Tube,2 (Baseline),,"Sample Name: PG (APOE4)
Year of Birth: 1933",Small Box,,WB for DNA,No,3/11/2014 3:20 PM,,,,,,,0,28456103,

I want it to be formatted like this:

5,,G0570409,037112,13907187-P06241,P06241,,B1249997,,,03/10/2014,12:42:00,03/11/2014 10:00:00,,,,,,,,Cool/Acceptable,1301,SST Tube,,,,Large Box,,Serum,Yes,3/11/2014 11:07 AM,130100010,,,,,,0,,
6,,G0570405,037112,13907187-P06241,P06241,,B1249998,,,03/10/2014,12:42:00,03/11/2014 10:00:00,,,,,,,,Cool/Acceptable,1301,DNA PaxGene Tube,,,"-- Added 3/11/2014 2:36 PM by abc --Processed in purification",Small Box,,WB for DNA,Yes,3/11/2014 11:55 AM,130100010,,,,,,0,,
9,,,,12112008,BAF312A2304,EH07583,G0521371,2024001,D-D,02/21/2014,09:22:00,03/11/2014 09:00:00,,,,,Male,1240924436,,Frozen Acceptable ,2024,DNA PaxGene Tube,V1CYP,,"Partsample Barcode: 01240924436011 Material: BA Sample Description: CYP2C9 Haplotype Eurofins Study Code: 2181261098 -- Added 3/11/2014 2:36 PM by abc -- Processed in purification",Small Box,,WB for DNA,No,3/11/2014 12:18 PM,,,,,,,0,,
10,,,,CL15312003,12-102-13,,G0573245,0135-1306,,03/10/2014,,03/11/2014 10:00:00,,,,,,,,Cool/Acceptable,135,DNA PaxGene Tube,,,"-- Added 3/11/2014 2:36 PM by abc --Processed in purification",Small Box,3064,WB for DNA,Yes,3/11/2014 12:36 PM,,,,,,,0,,
11,,,,CL15312003,12-102-13,,G0576563,0135-1308,,03/10/2014,15:45:00,03/11/2014 10:00:00,,,,,,,,Cool/Acceptable,135,DNA PaxGene Tube,,,"-- Added 3/11/2014 2:36 PM by abc --Processed in purification",Small Box,3064,WB for DNA,Yes,3/11/2014 12:36 PM,,,,,,,0,,
12,,,,CL15312003,12-102-13,,G0576562,0135-1307,,03/10/2014,12:48:00,03/11/2014 10:00:00,,,,,,,,Cool/Acceptable,135,DNA PaxGene Tube,,,"-- Added 3/11/2014 2:36 PM by abc --Processed in purification",Small Box,3064,WB for DNA,Yes,3/11/2014 12:36 PM,,,,,,,0,,
24,,,,19012034,BAN2401-G000-201,I6720150214,G0521407,10901054,,03/04/2014,,03/11/2014 10:00:00,,,,,Female,,,Frozen Acceptable ,1090,EDTA Tube,2 (Baseline),,"Sample Name: PG (APOE4) Year of Birth: 1933",Small Box,,WB for DNA,No,3/11/2014 3:20 PM,,,,,,,0,28456103

perl -0777 -pe 's/\n\n//g' broken_cvs > fixed_cvs

No, I am still not getting it. There should be only one CRLF, at the end of each line. :mad:

Then, try:

perl -0777 -pe 's/\n^(?!\d)/ /smg; s/(-- ) /$1/g' broken_cvs > fixed_cvs

There seem to be no CRLF in your sample, only LF

Try:

awk 'NR%2-1{gsub(/\r?\n/, FS)} NR>1{printf RS}1' RS=\" ORS= file
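
In case the one-liner looks cryptic, here is the same logic spelled out with comments. This is only a sketch of how I read it; the function name join_quoted_lines is made up, and it assumes every double quote in the file is a field delimiter (no escaped "" inside a field).

```shell
# Replace each line break (CRLF or LF) inside double quotes with one space.
join_quoted_lines() {
    awk '
        # RS is a double quote, so records alternate: outside, inside, outside, ...
        # Even-numbered records are the text between a pair of quotes.
        NR % 2 == 0 { gsub(/\r?\n/, " ") }   # line break inside quotes becomes a space
        NR > 1      { printf "%s", RS }      # put back the quote that RS consumed
                    { print }                # ORS is empty, so nothing extra is appended
    ' RS='"' ORS= "$@"
}
```

It can be used as join_quoted_lines file > newfile. Line breaks outside the quotes are passed through untouched, so the CRLF at the end of each record stays as it is.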

Anybody here? I need help.

We have provided you with two more suggestions. We are waiting for you. Please post some feedback first.

I got it; the following works:

awk 'NR%2-1{gsub(/\r?\n/, FS)} NR>1{printf RS}1' RS=\" ORS= file


I want to edit the file further: there are some lines with LF within double quotes, though all lines end with CRLF. How do I get rid of the LF?

I think your code handles LF within quotes well.
The only potential problem is a "line too long" error if there is a long sequence of lines without quotes, because with RS=\" awk has to hold all of them as a single record.
The following looks a bit complicated but avoids the problem:

awk -F\" '(NF && NF%2==0) {if (append) {append=0; print buf sep $0; next} else {append=1; buf=sep=""}} (append) {sub("\r$",""); buf=buf sep $0; sep=OFS; next} {print}' file

Replace sep=OFS with sep=ORS if you would rather keep the line breaks within quotes as bare LF (so that only the CR is removed) in a CRLF-delimited file.
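
For anyone who wants to follow the line-by-line version step by step, here it is laid out with comments. Treat it as a sketch: the function name join_quoted_lines_by_line is made up, and it also guards against completely empty lines (where NF is 0), so that they are not taken for a line with an odd number of quotes.

```shell
# Join the physical lines of a multi-line quoted field, reading one line at a time.
join_quoted_lines_by_line() {
    awk -F'"' '
        # NF is the number of quotes on the line plus one, so an even NF
        # means an odd number of quotes: a quoted field opens or closes here.
        NF && NF % 2 == 0 {
            if (append) {          # closing line: flush the buffer and this line
                append = 0
                print buf sep $0
                next
            }
            append = 1             # opening line: start a new buffer
            buf = sep = ""
        }
        append {                   # opening or continuation line
            sub(/\r$/, "")         # drop the CR of an embedded CRLF
            buf = buf sep $0
            sep = OFS              # later pieces are joined with a single space
            next
        }
        { print }                  # complete line: pass it through unchanged
    ' "$@"
}
```

Usage would be join_quoted_lines_by_line file > newfile. Only the lines belonging to one quoted field are buffered, so memory use stays small even when most of the file contains no quotes at all.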