Change the delimiter from Comma to Pipe

Hello All,
I need to convert a CSV file to a pipe-delimited file in UNIX. Apart from the comma separator, the data itself contains commas inside double-quoted qualifiers. Please let me know how to do it; any help using awk would be appreciated.

Below is a sample record from the file.

Sample Data:

Menstrain,A0022699,"Woodstock,IL,2",A0013195,"Rockford,IL",A0013243,"Chicago,IL","EMERGING,QUALITY"

Here is one method :slight_smile:

$ awk -F '"' '{ for(i=1;i<=NF;i+=2){gsub(",","|",$i)}}1' OFS='"' file
Menstrain|A0022699|"Woodstock,IL,2"|A0013195|"Rockford,IL"|A0013243|"Chicago,IL"|"EMERGING,QUALITY"

Will that work in all cases? I mean, this file will be extracted on a monthly basis, and it's not guaranteed that the columns which have double quotes now will still have double quotes when I download the file next month. Is that generic code?

Now try

$ cat file
Menstrain,A0022699,"Woodstock,IL,2",A0013195,"Rockford,IL",A0013243,"Chicago,IL","EMERGING,QUALITY"
Menstrain,A0022699,Woodstock,IL,2,A0013195,Rockford,ILA0013243,Chicago,EMERGING,QUALITY

$ awk -F '"' 'NF>1{ for(i=1;i<=NF;i+=2){gsub(",","|",$i)}}
NF==1{gsub(",","|")}1' OFS='"' file

Menstrain|A0022699|"Woodstock,IL,2"|A0013195|"Rockford,IL"|A0013243|"Chicago,IL"|"EMERGING,QUALITY"
Menstrain|A0022699|Woodstock|IL|2|A0013195|Rockford|ILA0013243|Chicago|EMERGING|QUALITY

Pamu, thanks very much for your answer.
Please check the details of the requirements below:

Next month I might get a record like this:

"Menstrain,arr",A0022699,"Woodstock,IL",2,"A001,3195",RockfordIL,"A001,3243","Chicago,america",EMERGINGQUALITY

Will your script work for the above?

My point is that the script should replace commas with pipes, but at the same time it should check, for each record, whether a column has commas inside the double-quoted qualifier, and it shouldn't replace those. A column will have double quotes in the data only if it has a comma inside it. Next month I may get a different column which has a comma inside it and is therefore double-quoted.

Why don't you try it yourself...?

Please use code tags for code and data samples.
Always provide full details about the input and the desired output.
Please try to give your desired output.

Please check:

$ cat file
Menstrain,A0022699,"Woodstock,IL,2",A0013195,"Rockford,IL",A0013243,"Chicago,IL","EMERGING,QUALITY"
Menstrain,A0022699,Woodstock,IL,2,A0013195,Rockford,ILA0013243,Chicago,EMERGING,QUALITY
"Menstrain,arr",A0022699,"Woodstock,IL",2,"A001,3195",RockfordIL,"A001,3243","Chicago,america",EMERG INGQUALITY

$ awk -F '"' 'NF>1{ for(i=1;i<=NF;i+=2){gsub(",","|",$i)}}
NF==1{gsub(",","|")}1' OFS='"' file
Menstrain|A0022699|"Woodstock,IL,2"|A0013195|"Rockford,IL"|A0013243|"Chicago,IL"|"EMERGING,QUALITY"
Menstrain|A0022699|Woodstock|IL|2|A0013195|Rockford|ILA0013243|Chicago|EMERGING|QUALITY
"Menstrain,arr"|A0022699|"Woodstock,IL"|2|"A001,3195"|RockfordIL|"A001,3243"|"Chicago,america"|EMERG INGQUALITY

Sorry, I didn't try it that time; I will take care to do so in future before posting any thread. Wow, Pamu... awesome, great work, that really worked!
Just one small additional request: can the script also erase the double quotes after the file has been converted to pipe-delimited?

And if you have time, please could you explain how the script works? :slight_smile:

Slight variation:

awk 'NR%2{gsub(/,/,"|")}1' RS=\" ORS= file
awk -F '"' 'NF>1{ for(i=1;i<=NF;i+=2){gsub(",","|",$i)}}
NF==1{gsub(",","|")}1' OFS="" file
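
On the three-line sample file above, both variations should print the same thing, with the quotes stripped and the commas inside the formerly quoted fields preserved (expected output shown for illustration):

$ awk 'NR%2{gsub(/,/,"|")}1' RS=\" ORS= file
Menstrain|A0022699|Woodstock,IL,2|A0013195|Rockford,IL|A0013243|Chicago,IL|EMERGING,QUALITY
Menstrain|A0022699|Woodstock|IL|2|A0013195|Rockford|ILA0013243|Chicago|EMERGING|QUALITY
Menstrain,arr|A0022699|Woodstock,IL|2|A001,3195|RockfordIL|A001,3243|Chicago,america|EMERGINGQUALITY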
awk -F '"' 'NF>1{ for(i=1;i<=NF;i+=2){gsub(",","|",$i)}}    # For FS='"' (doble quotes), If NF(number of fields) is more than 1 then use for loop and replace "," with "|" for all odd entries of fields.
Odd entries marked as Red.
Menstrain,A0022699,"Woodstock,IL,2",A0013195,"Rockford,IL"

NF==1{gsub(",","|")}1' OFS='"' file   # When file don't have any FS('"') in it then NF==1. then replace all the entries of "," with "|". and Set OFS='"' to regain all double quotes.
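
If it helps to see the split, here is a small illustration (not part of the script) that prints each field as awk sees it:

$ echo 'a,b,"c,d",e' | awk -F '"' '{for(i=1;i<=NF;i++) printf "$%d = [%s]\n", i, $i}'
$1 = [a,b,]
$2 = [c,d]
$3 = [,e]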

@Scrutinizer - That's a cool piece of awk, with the help of RS and ORS :slight_smile:

Thanks very much, Pamu, for taking your time and explaining.
@Scrutinizer... great work; it seems you guys are the Einsteins of awk.
Thanks very much for the help. I have one more script to be created. Can I raise a different thread or continue in this one? Sorry, I am pretty new to this very helpful website.

Hi Arun, you would need to create a new thread for a new problem.

OK, thanks @Scrutinizer. And can you let me know what the difference is between your code and Pamu's code?

The difference is that Pamu uses a double quote to split every record, or line, into fields ( FS=\" ), plus a for loop to visit every odd field of a record and replace the commas, whereas my version uses RS=\" to split the input file into small records and only replaces the commas in odd records ( NR%2 ), which are the stretches that lie outside the double quotes...
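
A minimal illustration of that record splitting:

$ printf '%s\n' 'a,b,"c,d",e' | awk 'NR%2{gsub(/,/,"|")}1' RS=\" ORS=
a|b|c,d|e

The input is split on the quotes into three records -- [a,b,] (odd), [c,d] (even) and [,e] plus the newline (odd) -- and only the odd records have their commas replaced.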

--
@pamu: thanks :wink:

urx02:mmadusr:/development/mma/az/data> awk 'NR%2{gsub(/,/,"|")}1' RS=\" ORS= DCIA_C150_R01_PLN_SEG_DATA.TXT > pp1.txt
awk: Input line
0000030075,IPA,4,AE cannot be longer than 3,000 bytes.
 The input line number is 44. The file is DCIA_C150_R01_PLN_SEG_DATA.TXT.
 The source line number is 1.

Sample records

0000000000,,3,MAIL ORDER,MAIL ORDER DATA,N/A,MAIL ORDER,N/A,
0000010001,CASH,1,CASH,CASH,N/A,CASH,N/A,
0000020002,FFS MED,2,AL MEDICAID,MEDICAID,N/A,AL MEDICAID,N/A,"Tier2,PA,NPF"
0000020003,FFS MED,2,ALASKA MEDICAID,MEDICAID,N/A,ALASKA MEDICAID,N/A,"Tier1,ST,PA"

Please find above the error I got when I used your code, @Scrutinizer; maybe I have not used it properly :frowning:
Please help me out with this.

It seems it may be because I am using HP-UX. If there is any other reason, please specify.

It could be that there are many consecutive lines that do not contain double quotes, so that with RS=\" the record length (everything between two quotes) exceeds the 3,000-byte limit of this version of awk on HP-UX.

Try this variation on Pamu's approach:

awk '{for(i=1;i<=NF;i+=2)gsub(/,/,"|",$i)}1' FS=\" OFS= file
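
On the sample records above, this should give (quotes removed, and the commas inside the quoted Tier fields preserved):

$ awk '{for(i=1;i<=NF;i+=2)gsub(/,/,"|",$i)}1' FS=\" OFS= DCIA_C150_R01_PLN_SEG_DATA.TXT
0000000000||3|MAIL ORDER|MAIL ORDER DATA|N/A|MAIL ORDER|N/A|
0000010001|CASH|1|CASH|CASH|N/A|CASH|N/A|
0000020002|FFS MED|2|AL MEDICAID|MEDICAID|N/A|AL MEDICAID|N/A|Tier2,PA,NPF
0000020003|FFS MED|2|ALASKA MEDICAID|MEDICAID|N/A|ALASKA MEDICAID|N/A|Tier1,ST,PA

Because this version splits each newline-delimited line into fields instead of treating everything between two quotes as one record, the records stay well under the 3,000-byte limit.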