Conditional replacement in CSV files

reddyr · July 31, 2014, 8:19pm

Hello,

I have many CSV files with variable number of rows and columns.

Sample of few problematic CSV files.

,,Price,Price,Price,Price,Price,Price,Price,Price,Price,Qty
Date,Sl,AAA,BBB,CCC,DDD,EEE,FFF,GGG,HHH,PriQueue,%busy
30/07/2014,1,AAA,BBB,CCC,DDD,EEE,FFF,GGG,HHH,NA,0
30/07/2014,2,34,64,1,1,0.63,210.5,2.4,73,NA,71
30/07/2014,3,36,62,1,1,0.62,206.7,2.3,72,NA,70
30/07/2014,4,29,70,1,0,0.78,258.5,14.2,91,NA,26

Few CSV files have same values in 2nd and 3rd rows (from column 3rd to 10th). For those problematic CSV files, we need to replace those fields (in 3rd row from 3rd column to 10th column) with 4th row's 3rd column to 10th column.

Desired output

,,Price,Price,Price,Price,Price,Price,Price,Price,Price,Qty
Date,Sl,AAA,BBB,CCC,DDD,EEE,FFF,GGG,HHH,PriQueue,%busy
30/07/2014,1,34,64,1,1,0.63,210.5,2.4,73,NA,0
30/07/2014,2,34,64,1,1,0.63,210.5,2.4,73,NA,71
30/07/2014,3,36,62,1,1,0.62,206.7,2.3,72,NA,70
30/07/2014,4,29,70,1,0,0.78,258.5,14.2,91,NA,26

Please advise!
Thanks a lot

SriniShoo · August 1, 2014, 12:06am

awk -F ',' '
NR == 2 {  p=($3 FS $4 FS $5 FS $6 FS $7 FS $8 FS $9 FS $10); a=$0;
  getline; q=($3 FS $4 FS $5 FS $6 FS $7 FS $8 FS $9 FS $10); b=$0;
  getline; r=($3 FS $4 FS $5 FS $6 FS $7 FS $8 FS $9 FS $10);
  if(p == q) sub(q, r, b); print a; print b}1' file

Don_Cragun · August 1, 2014, 2:41am

srinishoo:

awk -F ',' '
NR == 2 {  p=($3 FS $4 FS $5 FS $6 FS $7 FS $8 FS $9 FS $10); a=$0;
  getline; q=($3 FS $4 FS $5 FS $6 FS $7 FS $8 FS $9 FS $10); b=$0;
  getline; r=($3 FS $4 FS $5 FS $6 FS $7 FS $8 FS $9 FS $10);
  if(p == q) sub(q, r, b); print a; print b}1' file

This is a very nice solution as long as the sample input is representative of the actual data that will be processed. If fields 3 though 10 in line 3 contain any ERE special characters ( [\()*+?{|�$ ) or if those fields in line 4 contain any ampersand characters ( & ), the results will not be what was wanted.

If any of these are a problem, the OP needs to specify more clearly what characters are allowed in those fields on those lines.

reddyr · August 1, 2014, 3:52am

Thank you SriniShoo,

As Don Cragun correctly pointed out, unfortunately few CSVs contain % in those fields and the solution did not work. It works fine for the CSVs without special characters in those fields.

Pl advise. Many thanks!

Don_Cragun · August 1, 2014, 1:34pm

Please show us a sample 4 line file that didn't work. (Maybe I didn't get enough sleep last night, but I don't see why a % should cause a problem.) Please also give us the complete list of characters (other than upper- and lower-case alphabetic and numeric characters) that might appear in those fields on line3 and a similar list of characters that might appear in those fields on line 4.

What OS and version of awk are you using?

reddyr · August 1, 2014, 4:19pm

Apologies... Though the values same but there're spaces in 2nd row which I failed to notice because I opened in excel.

,,Price,Price,Price,Price,Price,Price,Price,Price,Price,Qty
Date,Time,%AAA,    %BBB,    %CCC,   %DDD,   %EEE,   %FFF, %GGG, %HHH, Unit Value
30/07/2014,23:26:00,%AAA,%BBB,%CCC,%DDD,%EEE,%FFF,%GGG,%HHH,NA
30/07/2014,23:31:00,53,37,0,10,0.56,93.7,5.3,88,NA
30/07/2014,23:36:00,52,36,0,12,0.55,92.2,6.4,92,NA
30/07/2014,23:41:00,53,37,0,10,0.56,93.8,4.4,85,NA
30/07/2014,23:46:00,64,33,1,3,0.82,136.7,4.6,91,NA

Desired output

,,Price,Price,Price,Price,Price,Price,Price,Price,Price,Qty
Date,Time,%AAA,    %BBB,    %CCC,   %DDD,   %EEE,   %FFF, %GGG, %HHH, Unit Value
30/07/2014,23:26:00,53,37,0,10,0.56,93.7,5.3,88,NA
30/07/2014,23:31:00,53,37,0,10,0.56,93.7,5.3,88,NA
30/07/2014,23:36:00,52,36,0,12,0.55,92.2,6.4,92,NA
30/07/2014,23:41:00,53,37,0,10,0.56,93.8,4.4,85,NA
30/07/2014,23:46:00,64,33,1,3,0.82,136.7,4.6,91,NA

So I don't think it's issue with awk - I'm using GNU awk 4.1.1

Is it possible to compare strings "without" spaces but NOT removing those spaces in the files.? Please advise.

Many thanks!

Don_Cragun · August 1, 2014, 8:48pm

Building on SriniShoo's code, try:

awk -F ',' '
NR == 2 {  p=($3 FS $4 FS $5 FS $6 FS $7 FS $8 FS $9 FS $10); a=$0; nsp=p; gsub(/ /, "", nsp)
  getline; q=($3 FS $4 FS $5 FS $6 FS $7 FS $8 FS $9 FS $10); b=$0
  getline; r=($3 FS $4 FS $5 FS $6 FS $7 FS $8 FS $9 FS $10)
  if(nsp == q) sub(q, r, b); print a; print b}1' file

reddyr · August 2, 2014, 8:47am

Thanks, that worked.

Scrutinizer · August 4, 2014, 6:24pm

As Don Cragun pointed put, in general this method may fail if there are any characters in the string variable q that need to be escaped in order to be taken literally, since the string is interpreted by sub as an extended regular expression..

An alternative approach using string assignment:

awk '
  NR==2{
    h3=$3
  }
  NR==3 && $3==h3{
    getline p
    split(p,F)
    for(i=3; i<=10; i++) $i=F
    print $0 ORS p
    next
  }
  1
' FS=',[ \t]*' OFS=, file

The input field separator (FS) is used to remove leading space from the comparison.. If there is also trailing space, one could use:

FS='[ \t]*,[ \t]*' OFS=,