Replacing Comma delimiter coming inside the data.

arunvasu2 · May 11, 2011, 7:27am

Hello,

     I am having flat file \(Comma Delimiter\) and the data in the file is as given below.

EMPNO, ENAME, DESIGNATION, SALARY
10979, Arun Kumar, Cosultant, 35000
13555, Bidhu Shekar, Senior Consultant, 45000
15000, Kiran, Kumar, Senior, Consultant, 40000

           If you notice the 3rd record, there are Comma's presents in the column \(Column 2: Kiran, Kumar   Column 3: Senior, Consultant\). Please help me in fixing this problem?

Thanks in Advance!!

palanisvr · May 11, 2011, 7:37am

what exactly needs to be done

arunvasu2 · May 11, 2011, 7:59am

You can see that it's a comma delimiter. It means each and every record is identified by a comma.
But if you notice the last record, The 2nd column is separated by 2 commas, and also 3rd column is separated by 2 commas which means it contains 6 columns. But actually we are having only 4 columns.
How can we supress the commas that are present inside the data of a column?

aman23 · May 11, 2011, 8:05am

if you just want to suppress comma for this particular text, here's the command that works for me:

cat filename | sed 's/Kiran,/Kiran/g' | sed 's/Senior,/Senior/g'

bartus11 · May 11, 2011, 8:07am

Try:

perl -F, -ape 's/(\d, \w+),/\1/ && s/, (\w+, \d+$)/ \1/ if $#F==5' file

arunvasu2 · May 11, 2011, 8:16am

Thanks for the reply bartus11
Can we do the samething with SED or AWK command? If yes can you please give me the answer?

bartus11 · May 11, 2011, 8:34am

awk -F, 'NF==6{$0=gensub ("([0-9]+, [[:alpha:]]+),","\\1",1);$0=gensub (", ([[:alpha:]]+, [0-9]+$)"," \\1",1)}1' file

arunvasu2 · May 11, 2011, 10:35am

Hi bartus11,

          Can you please explain what the above 'awk' is doing?

Thanks to all who replied to my question

bartus11 · May 11, 2011, 10:45am

It is just the perl code translated to AWK ;). It removes the second comma from beginning, and second comma from the end of the line if number of fields is equal 6.

ctsgnb · May 11, 2011, 12:31pm

# cat tst
EMPNO, ENAME, DESIGNATION, SALARY
10979, Arun Kumar, Cosultant, 35000
13555, Bidhu Shekar, Senior Consultant, 45000
15000, Kiran, Kumar, Senior, Consultant, 40000

# sed '$s/,//4;$s/,//2' tst
EMPNO, ENAME, DESIGNATION, SALARY
10979, Arun Kumar, Cosultant, 35000
13555, Bidhu Shekar, Senior Consultant, 45000
15000, Kiran Kumar, Senior Consultant, 40000

But in my opinion you should care about formatting the file directly at generation step (SQL?) since - according to the example provided, there are no deterministic rule(**) for identifying which of the coma is superfulous (all the more if you have big files with many different values in field $2 and $3 and that those fields may contains some additionnal unexpected comas.)

(**)Example:

32134, Tom, Billy, Joe, Corporation, 90909

Could as well be :

32134, Tom Billy, Joe Corporation, 90909
32134, Tom, Billy Joe Corporation, 90909