Problem with changing field separators in a file

I have a file with content as shown below.

cat t2 :

100,100,"X",1234,"12A",,,"ab,c"

Comma is the field seperator, however string fields will be within double quotes and comma within double quotes should not be treated as field seperator.

I am trying to replace this field seperator to a distinct character like a pipe or \001 and then perform some analysis.

I have used below perl command which is working correctly, but has some problem with performance. My file has about 7 Million rows and this command is taking about 45 mins.

cat t2 | perl -M'Text::ParseWords' -ne 'print (join("\001" => quotewords(",",0, $_)))' | cat -v
 
100^A100^AX^A1234^A12A^A^A^Aab,c

Looking forward for some advise on making this script run faster or if there is alternate approach using unix commands like AWK or SED..

.Do you want to replace ALL commas in your script to something else?

sed -i "s/\,/\|/g" infile

Changes all commas to the pipe

Before using sed -i, do it without the -i so you can see what it'll do prior to making the change

So

sed "s/\,/\|/g" infile
1 Like

Yes, only the field seperators. If there is a comma within double quotes, then leave it as it is.

Give a try to :

sed 's:\("[^",][^",]*\),\([^"]*"\):\1\2:g' infile

Note that if you have more than 1 coma within the double quote, it will only remove 1 so you may have to run it several times

# echo '100,100,"X",1234,"12A",,,"ab,c"'
100,100,"X",1234,"12A",,,"ab,c"
# echo '100,100,"X",1234,"12A",,,"ab,c"' | sed 's:\("[^",][^",]*\),\([^"]*"\):\1\2:g'
100,100,"X",1234,"12A",,,"abc"
1 Like

Thanks for this,

100,100,"X",1234,"12A",,,"abc"

but what I am trying is to replace the field seperator comma alone with character like pipe or \001.

 
Original data
100,100,"X",1234,"12A",,,"ab,c"

So the output should look like

100|100|X|1234|12A|||ab,c

[/CODE]

Any further advise please ?

# echo '100,100,"X",1234,"12A",,,"ab,c"' | sed 's:\("[^",][^",]*\),\([^"]*"\):\1#\2:g' | tr ',#' '|,'
100|100|"X"|1234|"12A"|||"ab,c"
# echo '100,100,"X",1234,"12A",,,"ab,c"' | sed 's:\("[^",][^",]*\),\([^"]*"\):\1#\2:g;s:,:|:g;s:#:,:'
100|100|"X"|1234|"12A"|||"ab,c"
# echo '100,100,"X",1234,"12A",,,"ab,c"' | sed 's:,:|:g;s:\("[^"|][^"|]*\)|\([^"]*"\):\1,\2:g'
100|100|"X"|1234|"12A"|||"ab,c"

---------- Post updated at 04:03 PM ---------- Previous update was at 03:57 PM ----------

Try this :

sed 's:,:|:g;s:\("[^"|][^"|]*\)|\([^"]*"\):\1,\2:g' infile

The following assumes that the file format is as simple as it appears (no special rules such as how to quote quotes, etc):

BEGIN {
    delimiter = ","
    new_delimiter = "|"
}

{
    len = length($0)
    in_quotes = 0
    for (i = 1; i <= len; i++) {
        char = substr($0, i, 1)
        if (char == "\"") {
            in_quotes = (in_quotes ? 0 : 1)
            continue
        }
        if (char == delimiter && !in_quotes)
            char = new_delimiter
        printf("%s", char)
    }
    printf("\n")
}
$ echo '100,100,"X",1234,"12A",,,"ab,c"' | awk -f csv.awk 
100|100|X|1234|12A|||ab,c

Regards,
Alister

Another idea that could be dig a bit more :

fold -w 1 infile | awk '/"/{f=!f} f{ .... }' 
1 Like