Get values from different columns from file2 when match values of file1

cgkmal · September 27, 2011, 5:24am

Hi everyone,

I have file1 and file2 comma separated both.

file1 is:

Header1,Header2,Header3,Header4,Header5,Header6,Header7,Header8,Header9,Header10
Code7,,,,,,,,,
Code5,,,,,,,,,
Code3,,,,,,,,,
Code9,,,,,,,,,
Code2,,,,,,,,,

file2 is:

CodeA1,Data1,4,Text1,Guide1,FRT,Name1,Prod1,Cat1,TYYF-981,105
CodeA2,Data2,24,Text2,Guide2,CVDA,Name2,Prod2,Cat2,RT-223,23
CodeA3,Data3,13,Text3,Guide3,FRT,Name3,Prod3,Cat3,HHJUI-12,42
CodeA4,Data4,6,Text4,Guide4,CVDA,Name4,Prod4,Cat4,TYYF-980,98
CodeA5,Data5,2,Text5,Guide5,FRT,Name5,Prod5,Cat5,RT-222,22
CodeA6,Data6,3,Text6,Guide6,CVDA,Name6,Prod6,Cat6,HHJUI-11,11
CodeA7,Data7,14,Text7,Guide7,FRT,Name7,Prod7,Cat7,TYYF-979,43
CodeA8,Data8,7,Text8,Guide8,CVDA,Name8,Prod8,Cat8,RT-221,65
CodeA9,Data9,3,Text9,Guide9,FRT,Name9,Prod9,Cat9,HHJUI-10,43
CodeA10,Data10,4,Text10,Guide10,CVDA,Name10,Prod10,Cat10,TYYF-978,99

I want to:
For every value in colum 1 of file1 that is in column 1 of file2 copy into corresponding line of file1 the values that are in some of the columns in file2 in the matching line in this way.

Copy into file1-Colum Value from file2 (for each matching line)
In col2 of file1- copy value from col5 of file2
In col3 of file1 - copy value from col2 of file2
In col4 of file1 - copy value from col4 of file2
In col9 of file1 - copy value from col3 of file2

so the output should be like this:

Header1,Header2,Header3,Header4,Header5,Header6,Header7,Header8,Header9,Header10
Code7,Guide7,Data7,Text7,,,,,14,
Code5,Guide5,Data5,Text5,,,,,22,
Code3,Guide3,Data3,Text3,,,,,42,
Code9,Guide9,Data9,Text9,,,,,43,
Code2,Guide2,Data2,Text2,,,,,23,

Maybe some body coul help me with this.

Many thanks in advance.

Regards

Franklin52 · September 27, 2011, 6:04am

Something like this?

awk -F, '
NR==FNR{idx=substr($1,1,4) substr($1,6,1); a[idx]=$5 FS $2 FS $4 ",,,,," $3 FS; next} 
FNR==1{print; next}
a[$1]{print $1 FS a[$1]}
' file2 file1

cgkmal · September 27, 2011, 6:40am

It's closed, the only issue is that if the file1 contain values not only in column1, in the output the values are being deleted.

example:
if file1 contain this:

Header1,Header2,Header3,Header4,Header5,Header6,Header7,Header8,Header9,Header10
Code7,,,,,455,,Test7,,
Code5,,,,,,,,,
Code3,,,,,FGT1,,,,
Code9,,,,,,,,,
Code2,,,,,lyo,,,,

the output should be as follow:

Header1,Header2,Header3,Header4,Header5,Header6,Header7,Header8,Header9,Header10
Code7,Guide7,Data7,Text7,,455,,Test7,14,
Code5,Guide5,Data5,Text5,,,,,2,
Code3,Guide3,Data3,Text3,,FGT1,,,13,
Code9,Guide9,Data9,Text9,,,,,3,
Code2,Guide2,Data2,Text2,,lyo,,,24,

and currently the script gives this output:

Header1,Header2,Header3,Header4,Header5,Header6,Header7,Header8,Header9,Header10
Code7,Guide7,Data7,Text7,,,,,14,
Code5,Guide5,Data5,Text5,,,,,2,
Code3,Guide3,Data3,Text3,,,,,13,
Code9,Guide9,Data9,Text9,,,,,3,
Code2,Guide2,Data2,Text2,,,,,24,

Many thanks for help so far.

The last column is returning different values, I get this:

polineni · September 27, 2011, 6:46am

@ cgkaml

question is not clear for me. please map the answer correctly

cgkmal · September 27, 2011, 7:11am

Hi polineni, thanks for answer.

I mean,

file1 is:

Header1,Header2,Header3,Header4,Header5,Header6,Header7,Header8,Header9,Header10
Code7,,,,,455,,Test7,,
Code5,,,,,,,,,
Code3,,,,,FGT1,,,,
Code9,,,,,,,,,
Code2,,,,,lyo,,,,

file2 is:

CodeA1,Data1,4,Text1,Guide1,FRT,Name1,Prod1,Cat1,TYYF-981,105
CodeA2,Data2,24,Text2,Guide2,CVDA,Name2,Prod2,Cat2,RT-223,23
CodeA3,Data3,13,Text3,Guide3,FRT,Name3,Prod3,Cat3,HHJUI-12,42
CodeA4,Data4,6,Text4,Guide4,CVDA,Name4,Prod4,Cat4,TYYF-980,98
CodeA5,Data5,2,Text5,Guide5,FRT,Name5,Prod5,Cat5,RT-222,22
CodeA6,Data6,3,Text6,Guide6,CVDA,Name6,Prod6,Cat6,HHJUI-11,11
CodeA7,Data7,14,Text7,Guide7,FRT,Name7,Prod7,Cat7,TYYF-979,43
CodeA8,Data8,7,Text8,Guide8,CVDA,Name8,Prod8,Cat8,RT-221,65
CodeA9,Data9,3,Text9,Guide9,FRT,Name9,Prod9,Cat9,HHJUI-10,43
CodeA10,Data10,4,Text10,Guide10,CVDA,Name10,Prod10,Cat10,TYYF-978,99

The output should be:

Header1,Header2,Header3,Header4,Header5,Header6,Header7,Header8,Header9,Header10
Code7,Guide7,Data7,Text7,,455,,Test7,14,
Code5,Guide5,Data5,Text5,,,,,2,
Code3,Guide3,Data3,Text3,,FGT1,,,13,
Code9,Guide9,Data9,Text9,,,,,3,
Code2,Guide2,Data2,Text2,,lyo,,,24,

I mean, the values in file1 shouldn't be deleted when matched values from file2 are copied.

If you test the current script, it copies in correct columns from file2 into file1 for each matched value in column1 of both files, but deletes any original value that
is present from column2 to column10 in file1 before to run the script.

Hope be clear enough.

Thanks for any help.

Regards

Franklin52 · September 27, 2011, 7:21am

Try this one:

awk -F, '
NR==FNR{idx=substr($1,1,4) substr($1,6,1); a[idx]=$5;b[idx]=$2;c[idx]=$4;d[idx]=$3; next}
FNR==1{print; next}
a[$1]{$2=a[$1]; $3=b[$1];$4=c[$1];$9=d[$1]; print}
' OFS=, file2 file1

cgkmal · September 27, 2011, 7:33am

Yes, yes Franklin, This version works.

The last question. How to delete in file2 the lines that don't return any match between col 1 of both files, I mean, that don't are file1?

It could be added some more code below the same awk script?

Many thanks really.

Regards.

Franklin52 · September 27, 2011, 7:48am

Try this:

awk -F, '
NR==1{next} 
NR==FNR{a[$1]; next} 
{idx=substr($1,1,4) substr($1,6,1)} 
idx in a' file1 file2

cgkmal · September 27, 2011, 9:06am

I'll try it Franklin.

Really thanks for help and time.

Best regards

Franklin52 · September 27, 2011, 9:16am

You're welcome as always cgkmal.

cgkmal · September 27, 2011, 8:36pm

I tested it and works, but only when the values in column A are of the form "CodeN", I don't get results if the values don't have the first 4 words in common.

e.g The values in column 1 of both files could be like this:

SVFDS-1, PPRTY, HHLJ-933.

Like serial numbers.

How can I modify the Franklin's script in order to match any string?

Thanks in advance.

Franklin52 · September 28, 2011, 2:26am

Difficult if some of the keys are equal and some not...

cgkmal · September 28, 2011, 2:45am

Hi Franklin,

I only changed the assignation of idx as below and it works for sample file.

awk -F, ' 
NR==FNR{idx=$1; a[idx]=$5;b[idx]=$2;c[idx]=$4;d[idx]=$3; next} FNR==1{print; next} 
a[$1]{$2=a[$1]; $3=b[$1];$4=c[$1];$9=d[$1]; print} 
' OFS=, file2 file1

I'm confused because with the real files (file1 40KB and file2 with 50MB) is not working, I'm not sure if it is something with encoding within files, because the samples I've created in linux, but the real csv files I don't know if were edited in Excel previously.

can I do something to force the encoding or endlines of real files that could affect that the script doesn't work as expected as it works for sample files?

Thanks again.

Franklin52 · September 28, 2011, 3:32am

If the files are from a windows environment you can try to remove the CR's first with:

tr -d '\r' < dosfile > unixfile

cgkmal · September 28, 2011, 7:00pm

Thanks Franklin, I'll try your suggestion.

Many thanks again.

Best regards

---------- Post updated at 05:00 PM ---------- Previous update was at 03:00 AM ----------

Hi again Fraklin,

It looks it works, but one issue I didn't notice.

In real file some fields appear incomplete because in file2, sometimes there are commas within some fields but they are surounded by double quotes to know that is the same field as below in red.

CodeA1,Data1,4,"Text1,AKS,DDDD",Guide1,"FRT, YYSFF, TERE",Name1,Prod1,Cat1,TYYF-981,105
CodeA2,Data2,24,Text2,Guide2,CVDA,Name2,Prod2,Cat2,RT-223,23
CodeA3,Data3,13,Text3,Guide3,"FRT,TTY, GERT",Name3,Prod3,Cat3,HHJUI-12,42
CodeA4,Data4,6,Text4,Guide4,CVDA,Name4,Prod4,Cat4,TYYF-980,98
CodeA5,Data5,2,Text5,Guide5,FRT,Name5,Prod5,Cat5,RT-222,22
CodeA6,Data6,3,Text6,Guide6,CVDA,Name6,Prod6,Cat6,HHJUI-11,11
CodeA7,Data7,14,Text7,Guide7,FRT,Name7,Prod7,Cat7,TYYF-979,43
CodeA8,Data8,7,Text8,Guide8,CVDA,Name8,Prod8,Cat8,RT-221,65
CodeA9,Data9,3,Text9,Guide9,FRT,Name9,Prod9,Cat9,HHJUI-10,43
CodeA10,Data10,4,Text10,Guide10,CVDA,Name10,Prod10,Cat10,TYYF-978,99

The current script treats that fields as different fields.

How can be fixed this?

Many thanks again.

Regards.

Franklin52 · September 29, 2011, 2:46am

Hmm... it becomes more complicated, you could use the scenario below:

Change the field separator temporary to ";"

awk -F "\"" ' {
  for(i=1;i<=NF;i+=2) {
    gsub(",", ";", $i)
  }
}1' OFS="\"" file2 > temp

Do your stuff here with the new field separator:

awk -F; ' 
NR==FNR{idx=$1; a[idx]=$5;b[idx]=$2;c[idx]=$4;d[idx]=$3; next} FNR==1{print; next} 
a[$1]{$2=a[$1]; $3=b[$1];$4=c[$1];$9=d[$1]; print} 
' OFS=, temp file1 > file3

And change the field separator to ","

awk -F "\"" ' {
  for(i=1;i<=NF;i+=2) {
    gsub(";", ",", $i)
  }
}1' OFS="\"" file3 > newfile

rm file3

cgkmal · September 29, 2011, 4:42am

Hi Franklin,

Nice code to change field separator. I modificated file1 and file2 created 2 temp files, then I run the 2nd code and the results now are correct and I stored in temp3.

Finally I passed a sed script (sed 's/;/,/g' temp3 > Output ) to replace field separator that remains only for first line.

Great help you provided me Franklin, many thanks for all.

Best regards

Franklin52 · September 29, 2011, 5:03am

No problem man, you're welcome!