Text qualifier issue causing data alignment problem

Hello Everyone,

I have a csv file with text qualifier as "" and data similar to below:

"1","abc","address1","US"
"2","def","address1 "characters in double-quote" address2","IND"
"3","ghi","address1","UK"

In above example, for record 2, we have an issue as in column3 contains double quotes within double quotes. Its causing issue in reading file and data is getting wrongly alligned for that particular record.

Is there a way to use awk or sed so that we can find the record with above issue and reject them to create a new file having only good records.

why dont you use the , (comma) as delimiter ?

what you are trying to do ?

The solution is quite simple: you search for records with more than 2 double-quotes between any two delimiter-characters (in your case ","). The following regexp should do this. Use grep to search for such lines and grep -v to search for valid lines:

/,[^,]*"[^,]*"[^,]*"/

Note that this will not find such lines if the problematic field is the first one. Search for such lines with the slightly altered regexp:

/^[^,]*"[^,]*"[^,]*"/

I hope this helps.

bakunin