Mapping two files

goddevil · November 15, 2012, 7:35pm

I have a large comma delimited file from which i have to extract a specific column, compare it against a static file and replace the column value with the matching column from the static file.

The following are the files:

Main file:
mainfile.csv

test1234,,real,,,,open,,,,EU,,,,23/03/2012 12:03:00,,,,GB,,,,1187,0,B,TRY,10000,15/07...
test1234,,real,,,,open,,,,EU,,,,23/03/2012 12:03:00,,,,GB,,,,1187,0,B,FDG,10000,15/07...
test1234,,real,,,,open,,,,EU,,,,23/03/2012 12:03:00,,,,GB,,,,1187,0,B,HGC,10000,15/07...
test1234,,real,,,,open,,,,EU,,,,23/03/2012 12:03:00,,,,GB,,,,1187,0,B,MDL,10000,15/07...

Static file.
reffile

CODE,PRODUCT
UGE72453,TRY
NGU76290,FDG
HRY82662,HGC
LQY35689,MDL

I now want to replace the bolded 26th column in the main file with the code from the static file
The main file has about 1000 lines but only has the aforementioned [products.
I am using the following code but only a

        OLDIFS=$IFS
        IFS=","
        while read pr1 pr2
        do
                for i in `cat mainfile.csv | awk 'BEGIN { FS = ","; } { print $26}'`
                do
                        if [[ "$i" == *"$pr2"* ]]; then
                                 echo $i $pr1
          #                     sed 's/$i/$pr1/' $i;
                        fi
                done
        done < reffile
        IFS=$OLDIFS

When i execute the above code, only some of the lines get their value replaced.

Can you please advise check for pr2 in the for loop and replace by the corresponding pr1 in from the while? I would appreciate if someone can even advise me on a better way of doing this

Chubler_XL · November 15, 2012, 7:57pm

Try this:

awk -F, 'FNR==NR{C[$2]=$1;next}{$26=C[$26]}1' OFS=, reffile mainfile.csv

Yoda · November 15, 2012, 8:05pm

Using shell script:-

IFS=","
while read pr1 pr2
do
    sed "s/$pr2/$pr1/g" mainfile.csv > tmp; mv tmp mainfile.csv;
done < reffile

goddevil · November 15, 2012, 8:12pm

Cheers bipi.

I am looking for a way to map the code to the specific column as there may be other columns with the same characters and i would not like them touched.

For eg:

test1234,,real,,TRY97837290,,open,,,,EU,,,,23/03/2012 12:03:00,,,,GB,,,,1187,0,B,TRY     ,10000,15/07...
test1234,,real,,FDG09789238,,open,,,,EU,,,,23/03/2012 12:03:00,,,,GB,,,,1187,0,B,FDG           ,10000,15/07...
test1234,,real,,HGC02749267,,open,,,,EU,,,,23/03/2012 12:03:00,,,,GB,,,,1187,0,B,HGC          ,10000,15/07...
test1234,,real,,MDL92773671,,open,,,,EU,,,,23/03/2012 12:03:00,,,,GB,,,,1187,0,B,MDL           ,10000,15/07...

I suppose the fault is mine. I wasnt clear with my initial requirement as to why i am choosing this avenue of replacing a column. Apologies

Chubler_XL · November 15, 2012, 8:14pm

@bipinajith, yuck what happens if "TRY" appears in other fields on the line?

Yoda · November 15, 2012, 8:26pm

Chubler_XL It will mess up the data
I posted a solution which will work based on file content initially posted. I'm sorry if that is not what the requester want

goddevil · November 15, 2012, 8:27pm

I am getting the following error when i execute the command

awk: syntax error near line 1
awk: bailing out near line 1

Chubler_XL · November 15, 2012, 8:30pm

Solaris? Try nawk instead of awk

goddevil · November 15, 2012, 8:37pm

Yep. Solaris bash

I used nawk and am not getting any errors now but the string is not being replaced.

Please see my updated mainfile.csv.

The field 26 has blankspaces and thus will not exactly match. Thats why i am using

[[ ]] and comparing with *

---------- Post updated at 01:37 AM ---------- Previous update was at 01:34 AM ----------

Can you kindly explain what you're doing here?

 'FNR==NR{C[$2]=$1;next}{$26=C[$26]}1'

Chubler_XL · November 15, 2012, 8:43pm

If you don't mind the blank spaces being removed when field 26 is updated, this should work for you

nawk -F, '
FNR==NR{C[$2]=$1;next}
{sub(/ *$/,"",$26);$26=C[$26]}1' OFS=, reffile mainfile.csv

As for the explination:

FNR=Current file row number
NR=Global row number
when FNR==NR first file is being processed (ie reffile)

C[$2]=$1;next
Build a lookup array with refcode (2nd field) as index and replace value (1st field) as data

sub(/ *$/,"",$26)
replace any trailing spaces from field 26 (assumption is they are spaces and not tabs)

$26=C[$26]
lookup array with field 26 value and put data back into $26

goddevil · November 16, 2012, 12:12am

chubler_xl:

If you don't mind the blank spaces being removed when field 26 is updated, this should work for you
nawk -F, '
FNR==NR{C[$2]=$1;next}
{sub(/ *$/,"",$26);$26=C[$26]}1' OFS=, reffile mainfile.csv
As for the explination:

FNR=Current file row number
NR=Global row number
when FNR==NR first file is being processed (ie reffile)

C[$2]=$1;next
Build a lookup array with refcode (2nd field) as index and replace value (1st field) as data

sub(/ *$/,"",$26)
replace any trailing spaces from field 26 (assumption is they are spaces and not tabs)

$26=C[$26]
lookup array with field 26 value and put data back into $26

The code is working now. Thanks a mil for the explanation. It makes a lot of sense now

---------- Post updated at 05:12 AM ---------- Previous update was at 01:52 AM ----------

I am trying to adapt the code so that the replacement is done only if the pattern in $26 matches the pattern C[$2] in the below code.

Where do i insert an IF loop like if ( $26 == C[$2] )

nawk -F, '
FNR==NR{C[$2]=$1;next}
{sub(/ *$/,"",$26);$26=C[$26]}1' OFS=, reffile mainfile.csv

Chubler_XL · November 16, 2012, 1:15am

Again you probably want to trim $26 before doing a compare so:

nawk -F, ' 
    FNR==NR{C[$2]=$1;next} 
    {sub(/ *$/,"",$26);if($26==C[$2])$26=C[$26]}1' OFS=, reffile mainfile.csv

goddevil · November 16, 2012, 2:23am

No joy there.

When i insert the if condition there, the code doesnt work. Nothing is replaced meaning nothing is matched. C[$2] will have a different meaning since NR is now 2 right?