Removing commas within semicolon in a flat file

i am recieving a flat file ( comma seperated ) with comma in between double quotes in any of the source fields . i need to remove the comma in double quotes and process the file thereafter

fields in file

col1,col2,col3,col4

input can be any of the followng

rohan,rahul,kunal,"sw,ati"
rohan,rahul,"kun,al",swati
rohan,"rah,ul",kunal,swati
"ro,han",rahul,kunal,swati

the output should be

rohan,rahul,kunal,swati

how can we achieve this ..
thanks in advance

if you have Python, you can use its csv module

#!/usr/bin/env python
import csv
filename = "file"
reader = csv.reader(open(filename),delimiter=",")
for row in reader: 
    for n,item in enumerate(row):
        row[n]=row[n].replace(",","")
    print ','.join(row)

output

# more file
rohan,rahul,kunal,"sw,ati"
rohan,rahul,"kun,al",swati
rohan,"rah,ul",kunal,swati
"ro,han",rahul,kunal,swati

# ./test.py
rohan,rahul,kunal,swati
rohan,rahul,kunal,swati
rohan,rahul,kunal,swati
rohan,rahul,kunal,swati

we already did this using sed before 2 hours !!!!!

sed 's/\(.*\),\(.*\)/\1\2/g'  file.txt

BR

can you please tell relevant unix code

its not that simple.

May try below perl.:slight_smile:

while(<DATA>){
	s/,(?=[^,"]+")//;
	s/"//g;
	print;
}
__DATA__
rohan,rahul,kunal,"sw,ati"
rohan,rahul,"kun,al",swati
rohan,"rahul",kunal,"sw,ati"
"ro,han",rahul,kunal,swati

Try the below code. This works even if you have comma more than once in a line.

 
use strict;
use warnings;
while(<DATA>) {
        s/\"(.*?)\,(.*?)\"/$1.$2/ge;
        print;
}
__END__
"ro,han",rahul,kunal,"sw,ati"
rohan,"r,ahul","kun,al",swati
rohan,"rah,ul",kunal,swati
"ro,han",rahul,kunal,"s,wati"

Try the following code:

#!/bin/sed -f
    s/"\([^"]*\)*"/"1\1"2/g
    s/"1\([^",]*\),\([^"]*\)"2/\1\2/g
    s/".//g

The above code will work on assumption that there is at most one coma in one quoted fragment. However, it will correctly handle the case in which there are quoted fragments without comas inside. So, for the following input

"aaa","bbb"

it will output

aaa,bbb

If, necessary i can also provide you with the version that supports multi comas in one fragment.

can anybdy help me with the relevant unix code for removing multiple instances of comma in double quotes in single row
in a comma delimited file
input

"ro,han","ra,hul","ku,nal","vishe,sh"
output

"rohan","rahul","kunal","vishesh"

what do you mean by relevant unix code??????

if you can use sed command use the below

sed 's/\"\(.*?\),\(.*?\)\"/\1\2/g'  inputfile.txt
echo \"ro,han\",\"ra,hul\",\"ku,nal\",\"vishe,sh\" | tr -d ',' | sed 's/""/","/g'

this is not the requirement[COLOR="\#738fbf"]

---------- Post updated at 02:36 AM ---------- Previous update was at 02:27 AM ----------

awk 'BEGIN{ FS="," }
{
    s=""    
    for(i=1;i<=NF;i++){        
        if ( $i !~ /,/ && $i !~ /\"/ && !f ){
            printf $i","
        }
        if (f && $i ~ /\"$/){
            printf s $i","            
            s=""
            f=0
        }
        if(i==NF){ print ""}
        if ( $i ~ /^\"/ && $i !~/\"$/)  {
          f=1
          s=$i          
        }
        if (f && $i !~ /^\"|\"$/ ){
          s=s $i
        }
    }
}' file | sed 's/,$//'

output

# more file
rohan,rahul,"kun,,al","sw,a,ti"
rohan,rahul,"k,un,al",swati
rohan,"rah,ul",kunal,swati
"ro,han",rahul,kunal,swati
rohan,rahul,kunal,swati
# ./testnew.sh
rohan,rahul,"kunal","swati"
rohan,rahul,"kunal",swati
rohan,"rahul",kunal,swati
"rohan",rahul,kunal,swati
rohan,rahul,kunal,swati


learn about awk with the gawk manual

I've come up with the following code for the multicomas case.

#!/bin/sed -f
	h
 	s/\("[^",]*\),\?\([^"]*"\)/\1\2/g
	x;G;:c;tc
	s/^\(.*\)\n\1$/\1/
	t
	D

It's a bit clumsy, but i cannot think of the other way to check whether a sed substitute command actually changed anything.

and unreadable.

obviously a wrong choice of tools. Better to use a good programming language.

r_t_1601 kindly use this code

sed 's/\("[^,]*\),\([^,]*"\)/\1\2/g' input.txt

BR

won't work on more than 2 commas

I have a working example using perl. Here is a sample execution.

$ cat input.txt
rohan,rahul,kunal,"sw,ati"
rohan,rahul,"k,un,al",swati
rohan,"rah,ul",kunal,swati
"ro,han",rahul,kunal,swati
rohan,rahul,kunal,swati

$ ./script.pl
rohan,rahul,kunal,swati
rohan,rahul,kunal,swati
rohan,rahul,kunal,swati
rohan,rahul,kunal,swati
rohan,rahul,kunal,swati

I did make sure that I had a case with two commas between the quotes. The regex should handle cases with one or more commas.

I matched the quoted string, then fixed it and substituted it back in.

#!/usr/bin/perl -w
use warnings;
use strict;

open( FH, "input.txt" ) || die "Can't open the input file: $!";

while (<FH>) {
    chomp;
    m/"(\w+[,\w+]+)"/g;
    my $line = $1;
    $line =~ s/,//g;
    s/"(\w+[,\w+]+)"/$line/g;
    print $_. "\n";
}

close(FH);
exit 0;

Hope this helps.

  • B

@bwhited , i think OP wants to retain the double quotes

here is a sample record
50119785,IRFE Asia,IRFE:3123146,18-Jun-09,29-Jun-09,MODFOLLOWING,29-Jun-12,MODFOLLOWING,IDR,USD,30900000000,3000000,13
29758,"THE BANK OF TOKYO-MITSUBISHI UFJ, LTD.",N,,,,,GREEN,B,FLOAT,FIXED,--,10.25,,,,,,,,,6M,6M,,6M,A/360,A/360,MODFOLLOWING,M
ODFOLLOWING,USD,GBP,IDR,USD,GBP,IDR,MODFOLLOWING,MODFOLLOWING,USD,GBP,IDR,USD,GBP,IDR,,,,GBP,,,--,6M-USD-LIBOR-BBA,,0,,,,,0,0,
,2,NO,1329758,0,0,0,0,23100,0,0,23100,IDOSW,JKT,665,Cross-Currency Swap,NEW,D121360,N, x

there is a comma in ,"THE BANK OF TOKYO-MITSUBISHI UFJ, LTD." which i want to delete .. sadly none of the scripts provided has worked for me so far and its kinda urgent .. GURUS !! SOS

---------- Post updated at 02:22 AM ---------- Previous update was at 02:20 AM ----------

just to update its a comma delimited file and arnd 91 colms

Well, i checked this code against your example, and it seems to work properly.

#!/bin/sed -f
	h
 	s/\("[^",]*\),\?\([^"]*"\)/\1\2/g
	x;G;:c;tc
	s/^\(.*\)\n\1$/\1/
	t
	D