How to use regex on particular column (Removing comma from particular column)?

Prathmesh · September 23, 2014, 6:17am

Hi,

I have pipe separated file which contains some data having comma(,) in it. I want to remove the comma(,) only from particular column without changing data in other columns.

Below is the sample data file, I want to remove the comma(,) only from 5th column.

$ cat file1
ABC | DEF, HIJ| LMN| opq,| 12,345|XYZ
PQR,123 | LMN| RWX|pqr| 6,12|LMN
13,67| PQR,123 | LMN| RWX|7,89|XY
LMN,SDF|ABC,S|OPQ|SLM,SD|9,00|P
123,23|45,789|12,567.04|78,765|6,90|9,00,875

I tried using sed as below but, I am not able to figure out why the used command is removing comma(,) from only 5th column for the 1st 4 lines and not removing any comma(,) from last line.

$ sed 's/\(.*\),\(.*\)/\1\2/g' file1
ABC | DEF, HIJ| LMN| opq,| 12345|XYZ
PQR,123 | LMN| RWX|pqr| 612|LMN
13,67| PQR,123 | LMN| RWX|789|XY
LMN,SDF|ABC,S|OPQ|SLM,SD|900|P
123,23|45,789|12,567.04|78,765|6,90|9,00875
$

I also tried using awk as below but, it is not removing comma(,). To be frank, I did not understand why I have to use "\\1" in awk, because in sed as used above I can use \1 for representing groups. It would be helpful if somebody can explain the difference in using regex in sed and awk for grouping data.

$ awk -F"|" '{print gensub(/\(.*\),\(.*\)/,"\\1""\\2",g,$5);}' file1
 12,345
 6,12
7,89
9,00
6,90

I am thinking whether we could use sed inside awk for particular column, but not sure how to do it.

Expected output is as follows:

ABC | DEF, HIJ| LMN| opq,| 12345|XYZ
PQR,123 | LMN| RWX|pqr| 612|LMN
13,67| PQR,123 | LMN| RWX|789|XY
LMN,SDF|ABC,S|OPQ|SLM,SD|900|P
123,23|45,789|12,567.04|78,765|690|9,00,875

Thanks in advance for any help on this.

Don_Cragun · September 23, 2014, 6:48am

The sed utility uses BREs (Basic Regular Expressions); awk uses EREs (Extended ...). But, you're making your RE much too complex for what you're trying to do in awk .

If you want to get rid of the 1st comma in the 5th field, use:

awk -F"|" '{sub(/,/,"",$5)}1' file1

and, if you want to get rid of all commas in the 5th field, use:

awk -F"|" '{gsub(/,/,"",$5)}1' file1

RavinderSingh13 · September 23, 2014, 7:59am

don cragun:

The sed utility uses BREs (Basic Regular Expressions); awk uses EREs (Extended ...). But, you're making your RE much too complex for what you're trying to do in awk .

If you want to get rid of the 1st comma in the 5th field, use:
awk -F"|" '{sub(/,/,"",$5)}1' file1
and, if you want to get rid of all commas in the 5th field, use:
awk -F"|" '{gsub(/,/,"",$5)}1' file1

Hello Don,

Above code will give the requested output but it wouldn't have OFS (Output field seprator) as | . So just adding the same to your suggestion.

awk -F"|" '{gsub(/\,/,X,$5)} 1' OFS="|"  Input_file ## For removing all commas in 5th field
OR
awk -F"|" '{sub(/\,/,X,$5)} 1' OFS="|"  Input_file  ## For removing 1st comma in 5th field

Thanks,
R. Singh

Prathmesh · September 23, 2014, 8:06am

Thanks Don. Solution was much simpler than I thought.

---------- Post updated at 05:36 PM ---------- Previous update was at 05:31 PM ----------

Thanks Ravinder. Can you please advise what X refers to in your code. It removes the comma, but bit of confused why it is used instead of "".

RavinderSingh13 · September 23, 2014, 8:12am

Hello Prathmesh,

It is a variable whose value is NULL so I am simply replaing the comma with a variable whose value is NULL.

We can also say as follows.

Thanks,
R. Singh

Prathmesh · September 23, 2014, 8:19am

Thanks Ravinder.

Don_Cragun · September 23, 2014, 12:44pm

ravindersingh13:

Hello Don,

Above code will give the requested output but it wouldn't have OFS (Output field seprator) as | . So just adding the same to your suggestion.
awk -F"|" '{gsub(/\,/,X,$5)} 1' OFS="|"  Input_file ## For removing all commas in 5th field
OR
awk -F"|" '{sub(/\,/,X,$5)} 1' OFS="|"  Input_file  ## For removing 1st comma in 5th field
Thanks,
R. Singh

Thanks Ravinder,
I obviously need to get more sleep.