Removing commas within semicolon in a flat file

r_t_1601 · June 16, 2009, 8:40am

i am recieving a flat file ( comma seperated ) with comma in between double quotes in any of the source fields . i need to remove the comma in double quotes and process the file thereafter

fields in file

col1,col2,col3,col4

input can be any of the followng

rohan,rahul,kunal,"sw,ati"
rohan,rahul,"kun,al",swati
rohan,"rah,ul",kunal,swati
"ro,han",rahul,kunal,swati

the output should be

rohan,rahul,kunal,swati

how can we achieve this ..
thanks in advance

ghostdog74 · June 16, 2009, 9:25am

if you have Python, you can use its csv module

#!/usr/bin/env python
import csv
filename = "file"
reader = csv.reader(open(filename),delimiter=",")
for row in reader: 
    for n,item in enumerate(row):
        row[n]=row[n].replace(",","")
    print ','.join(row)

output

# more file
rohan,rahul,kunal,"sw,ati"
rohan,rahul,"kun,al",swati
rohan,"rah,ul",kunal,swati
"ro,han",rahul,kunal,swati

# ./test.py
rohan,rahul,kunal,swati
rohan,rahul,kunal,swati
rohan,rahul,kunal,swati
rohan,rahul,kunal,swati

ahmad.diab · June 16, 2009, 9:37am

we already did this using sed before 2 hours !!!!!

sed 's/\(.*\),\(.*\)/\1\2/g'  file.txt

BR

r_t_1601 · June 16, 2009, 9:39am

can you please tell relevant unix code

ghostdog74 · June 16, 2009, 10:18am

its not that simple.

summer_cherry · June 17, 2009, 1:56am

May try below perl.

while(<DATA>){
	s/,(?=[^,"]+")//;
	s/"//g;
	print;
}
__DATA__
rohan,rahul,kunal,"sw,ati"
rohan,rahul,"kun,al",swati
rohan,"rahul",kunal,"sw,ati"
"ro,han",rahul,kunal,swati

balaji_red83 · June 17, 2009, 5:27am

Try the below code. This works even if you have comma more than once in a line.

 
use strict;
use warnings;
while(<DATA>) {
        s/\"(.*?)\,(.*?)\"/$1.$2/ge;
        print;
}
__END__
"ro,han",rahul,kunal,"sw,ati"
rohan,"r,ahul","kun,al",swati
rohan,"rah,ul",kunal,swati
"ro,han",rahul,kunal,"s,wati"

tkleczek · June 18, 2009, 1:53pm

Try the following code:

#!/bin/sed -f
    s/"\([^"]*\)*"/"1\1"2/g
    s/"1\([^",]*\),\([^"]*\)"2/\1\2/g
    s/".//g

The above code will work on assumption that there is at most one coma in one quoted fragment. However, it will correctly handle the case in which there are quoted fragments without comas inside. So, for the following input

"aaa","bbb"

it will output

aaa,bbb

If, necessary i can also provide you with the version that supports multi comas in one fragment.

r_t_1601 · June 19, 2009, 2:26am

can anybdy help me with the relevant unix code for removing multiple instances of comma in double quotes in single row
in a comma delimited file
input

"ro,han","ra,hul","ku,nal","vishe,sh"
output

"rohan","rahul","kunal","vishesh"

ahmad.diab · June 19, 2009, 3:00am

what do you mean by relevant unix code??????

if you can use sed command use the below

sed 's/\"\(.*?\),\(.*?\)\"/\1\2/g'  inputfile.txt

rakeshawasthi · June 19, 2009, 3:04am

echo \"ro,han\",\"ra,hul\",\"ku,nal\",\"vishe,sh\" | tr -d ',' | sed 's/""/","/g'

ghostdog74 · June 19, 2009, 3:36am

this is not the requirement[COLOR="\#738fbf"]

---------- Post updated at 02:36 AM ---------- Previous update was at 02:27 AM ----------

awk 'BEGIN{ FS="," }
{
    s=""    
    for(i=1;i<=NF;i++){        
        if ( $i !~ /,/ && $i !~ /\"/ && !f ){
            printf $i","
        }
        if (f && $i ~ /\"$/){
            printf s $i","            
            s=""
            f=0
        }
        if(i==NF){ print ""}
        if ( $i ~ /^\"/ && $i !~/\"$/)  {
          f=1
          s=$i          
        }
        if (f && $i !~ /^\"|\"$/ ){
          s=s $i
        }
    }
}' file | sed 's/,$//'

output

# more file
rohan,rahul,"kun,,al","sw,a,ti"
rohan,rahul,"k,un,al",swati
rohan,"rah,ul",kunal,swati
"ro,han",rahul,kunal,swati
rohan,rahul,kunal,swati
# ./testnew.sh
rohan,rahul,"kunal","swati"
rohan,rahul,"kunal",swati
rohan,"rahul",kunal,swati
"rohan",rahul,kunal,swati
rohan,rahul,kunal,swati

learn about awk with the gawk manual

tkleczek · June 19, 2009, 5:02am

I've come up with the following code for the multicomas case.

#!/bin/sed -f
	h
 	s/\("[^",]*\),\?\([^"]*"\)/\1\2/g
	x;G;:c;tc
	s/^\(.*\)\n\1$/\1/
	t
	D

It's a bit clumsy, but i cannot think of the other way to check whether a sed substitute command actually changed anything.

ghostdog74 · June 19, 2009, 5:10am

and unreadable.

obviously a wrong choice of tools. Better to use a good programming language.

ahmad.diab · June 19, 2009, 10:18am

r_t_1601 kindly use this code

sed 's/\("[^,]*\),\([^,]*"\)/\1\2/g' input.txt

BR

ghostdog74 · June 19, 2009, 7:24pm

won't work on more than 2 commas

bwhitehd · June 20, 2009, 2:36am

I have a working example using perl. Here is a sample execution.

$ cat input.txt
rohan,rahul,kunal,"sw,ati"
rohan,rahul,"k,un,al",swati
rohan,"rah,ul",kunal,swati
"ro,han",rahul,kunal,swati
rohan,rahul,kunal,swati

$ ./script.pl
rohan,rahul,kunal,swati
rohan,rahul,kunal,swati
rohan,rahul,kunal,swati
rohan,rahul,kunal,swati
rohan,rahul,kunal,swati

I did make sure that I had a case with two commas between the quotes. The regex should handle cases with one or more commas.

I matched the quoted string, then fixed it and substituted it back in.

#!/usr/bin/perl -w
use warnings;
use strict;

open( FH, "input.txt" ) || die "Can't open the input file: $!";

while (<FH>) {
    chomp;
    m/"(\w+[,\w+]+)"/g;
    my $line = $1;
    $line =~ s/,//g;
    s/"(\w+[,\w+]+)"/$line/g;
    print $_. "\n";
}

close(FH);
exit 0;

Hope this helps.

B

ghostdog74 · June 20, 2009, 3:04am

@bwhited , i think OP wants to retain the double quotes

r_t_1601 · June 20, 2009, 3:22am

here is a sample record
50119785,IRFE Asia,IRFE:3123146,18-Jun-09,29-Jun-09,MODFOLLOWING,29-Jun-12,MODFOLLOWING,IDR,USD,30900000000,3000000,13
29758,"THE BANK OF TOKYO-MITSUBISHI UFJ, LTD.",N,,,,,GREEN,B,FLOAT,FIXED,--,10.25,,,,,,,,,6M,6M,,6M,A/360,A/360,MODFOLLOWING,M
ODFOLLOWING,USD,GBP,IDR,USD,GBP,IDR,MODFOLLOWING,MODFOLLOWING,USD,GBP,IDR,USD,GBP,IDR,,,,GBP,,,--,6M-USD-LIBOR-BBA,,0,,,,,0,0,
,2,NO,1329758,0,0,0,0,23100,0,0,23100,IDOSW,JKT,665,Cross-Currency Swap,NEW,D121360,N, x

there is a comma in ,"THE BANK OF TOKYO-MITSUBISHI UFJ, LTD." which i want to delete .. sadly none of the scripts provided has worked for me so far and its kinda urgent .. GURUS !! SOS

---------- Post updated at 02:22 AM ---------- Previous update was at 02:20 AM ----------

just to update its a comma delimited file and arnd 91 colms

tkleczek · June 20, 2009, 4:07am

Well, i checked this code against your example, and it seems to work properly.

#!/bin/sed -f
	h
 	s/\("[^",]*\),\?\([^"]*"\)/\1\2/g
	x;G;:c;tc
	s/^\(.*\)\n\1$/\1/
	t
	D