Hi pchang,
As has been stated many times, your simple request is ambiguous. With your stated requirements and the 4-line input sample:
0000005335,"IBD","601725","6017257002503849","0430","153854","007907","0079070
00000","E0107725","2995 BL.DAGENAIS "H" LAVAL QCCA","0200","","","WD","
CH","","4001857090","","124","124",,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,"
APP","00","EXC","5"
there are at least 2**46 (i.e., 2 raised to the 46th power) different answers that meet your stated criteria. For example, one possible result that meets all of your stated requirements is:
0000005335,"IBD','601725','6017257002503849','0430','153854','007907','0079070
00000','E0107725','2995 BL.DAGENAIS "H" LAVAL QCCA','0200','','','WD','
CH','','4001857090','','124','124',,60.00,60.00,60.00,0.00,0.45,60.45,0.037500,'
APP','00','EXC','5"
You have to tell us what constitutes a quoted string. In most CSV files (using comma as the field separator) with quoted strings, a comma or a newline can appear as a regular character in a quoted string; in fact, anything except an unescaped quoting character or an unescaped escape character can appear there. But you don't have an escape character, and you have unescaped quoting characters inside your quoted strings. So we need unambiguous rules that specify which double-quote characters start a quoted field and which double-quote characters end a quoted field.
For example, if the following rules correctly state your requirements, I can give you an awk script that will do what you want:
- The input file is a text file. (By definition this means there are no null bytes in the file, there are no lines longer than LINE_MAX bytes, and (unless the file is an empty file) the last character in the file is a newline character.)
- The start of a double-quoted field occurs when the first character of a field is a double-quote character. (This character can be referred to as an opening double-quote.)
- The end of a double-quoted field is delimited by a double-quote character that is not an opening double-quote and that is immediately followed by a comma or a newline character. (This character can be referred to as a closing double-quote.)
- Fields shall be separated by a comma that is not in a double-quoted field.
- Any double-quote character in a double-quoted field other than the opening double-quote and the closing double-quote shall be converted to a single-quote character.
- Double-quote characters that are not an opening double-quote, not a closing double-quote, and not in a double-quoted field shall not be changed.
- A record shall be terminated by a newline character that is not in a double-quoted field.
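To make the rules above concrete, here is a rough Python sketch of how I read them. It is not the awk script I mentioned; the function name fix_inner_quotes is made up for illustration, and it assumes the whole file fits in memory and treats end of input as if it were a newline.

```python
def fix_inner_quotes(text):
    """Convert double-quotes inside double-quoted fields to single-quotes.

    Hypothetical helper: one reading of the rules listed above,
    not a general CSV parser.
    """
    out = []
    in_quoted = False       # currently inside a double-quoted field?
    at_field_start = True   # is this character the first one of a field?
    for i, ch in enumerate(text):
        # Assumption: end of input behaves like a newline.
        nxt = text[i + 1] if i + 1 < len(text) else "\n"
        if in_quoted:
            if ch == '"':
                if nxt in ",\n":
                    in_quoted = False   # closing double-quote
                else:
                    ch = "'"            # embedded double-quote
        else:
            if ch == '"' and at_field_start:
                in_quoted = True        # opening double-quote
            # A comma or newline outside a quoted field ends the field
            # (a newline also ends the record).
            at_field_start = ch in ",\n"
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    import sys
    sys.stdout.write(fix_inner_quotes(sys.stdin.read()))
```

Under these rules, an embedded double-quote that happens to be followed by a comma or a newline is taken as the closing double-quote, so data like that would still be split in the wrong place.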
Do these rules accurately describe your input file format?
If they do, I'll clean up my awk script and post it.
If they don't, give us your set of UNAMBIGUOUS rules and maybe we'll be able to help you.