Problem with changing field separators in a file

mk1216 · February 17, 2011, 7:33am

I have a file with content as shown below.

cat t2 :

100,100,"X",1234,"12A",,,"ab,c"

Comma is the field seperator, however string fields will be within double quotes and comma within double quotes should not be treated as field seperator.

I am trying to replace this field seperator to a distinct character like a pipe or \001 and then perform some analysis.

I have used below perl command which is working correctly, but has some problem with performance. My file has about 7 Million rows and this command is taking about 45 mins.

cat t2 | perl -M'Text::ParseWords' -ne 'print (join("\001" => quotewords(",",0, $_)))' | cat -v
 
100^A100^AX^A1234^A12A^A^A^Aab,c

Looking forward for some advise on making this script run faster or if there is alternate approach using unix commands like AWK or SED..

DC_Slick · February 17, 2011, 7:41am

.Do you want to replace ALL commas in your script to something else?

sed -i "s/\,/\|/g" infile

Changes all commas to the pipe

Before using sed -i, do it without the -i so you can see what it'll do prior to making the change

So

sed "s/\,/\|/g" infile

mk1216 · February 17, 2011, 8:45am

Yes, only the field seperators. If there is a comma within double quotes, then leave it as it is.

ctsgnb · February 17, 2011, 9:29am

Give a try to :

sed 's:\("[^",][^",]*\),\([^"]*"\):\1\2:g' infile

Note that if you have more than 1 coma within the double quote, it will only remove 1 so you may have to run it several times

# echo '100,100,"X",1234,"12A",,,"ab,c"'
100,100,"X",1234,"12A",,,"ab,c"
# echo '100,100,"X",1234,"12A",,,"ab,c"' | sed 's:\("[^",][^",]*\),\([^"]*"\):\1\2:g'
100,100,"X",1234,"12A",,,"abc"

mk1216 · February 17, 2011, 9:41am

Thanks for this,

100,100,"X",1234,"12A",,,"abc"

but what I am trying is to replace the field seperator comma alone with character like pipe or \001.

 
Original data
100,100,"X",1234,"12A",,,"ab,c"

So the output should look like

100|100|X|1234|12A|||ab,c

[/CODE]

Any further advise please ?

ctsgnb · February 17, 2011, 10:03am

# echo '100,100,"X",1234,"12A",,,"ab,c"' | sed 's:\("[^",][^",]*\),\([^"]*"\):\1#\2:g' | tr ',#' '|,'
100|100|"X"|1234|"12A"|||"ab,c"

# echo '100,100,"X",1234,"12A",,,"ab,c"' | sed 's:\("[^",][^",]*\),\([^"]*"\):\1#\2:g;s:,:|:g;s:#:,:'
100|100|"X"|1234|"12A"|||"ab,c"

# echo '100,100,"X",1234,"12A",,,"ab,c"' | sed 's:,:|:g;s:\("[^"|][^"|]*\)|\([^"]*"\):\1,\2:g'
100|100|"X"|1234|"12A"|||"ab,c"

---------- Post updated at 04:03 PM ---------- Previous update was at 03:57 PM ----------

Try this :

sed 's:,:|:g;s:\("[^"|][^"|]*\)|\([^"]*"\):\1,\2:g' infile

alister · February 17, 2011, 10:46am

The following assumes that the file format is as simple as it appears (no special rules such as how to quote quotes, etc):

BEGIN {
    delimiter = ","
    new_delimiter = "|"
}

{
    len = length($0)
    in_quotes = 0
    for (i = 1; i <= len; i++) {
        char = substr($0, i, 1)
        if (char == "\"") {
            in_quotes = (in_quotes ? 0 : 1)
            continue
        }
        if (char == delimiter && !in_quotes)
            char = new_delimiter
        printf("%s", char)
    }
    printf("\n")
}

$ echo '100,100,"X",1234,"12A",,,"ab,c"' | awk -f csv.awk 
100|100|X|1234|12A|||ab,c

Regards,
Alister

ctsgnb · February 17, 2011, 11:08am

Another idea that could be dig a bit more :

fold -w 1 infile | awk '/"/{f=!f} f{ .... }'