Vlookup in Linux

shanul_karim · August 15, 2017, 2:56pm

Hello Everybody

I am looking for vlookup like functionality in Linux since two files I have are very big(1000MB each) and its not opening completely in excel.

Here the requirement

file1

 11,12,13
 16,14,12
 28,21,22
 22,23,24

file 3

 18,16,16
 14,12,12
 23,22,24
 16,11,13

here I need to do we lookup using 3rd column of file3

output file

 11,12,13  NA
 16,14,12  14,12,12
 28,21,22  NA
 22,23,24  23,22,24

I have tried below vlookup but its giving wrong output

$ cat vlookup.awk
 FNR==NR{
 a[$1]=$2
 next
 }
 { if ($1 in a) {print $1, a[$1]} else {print $1, "NA"} }

$ awk -f vlookup.awk file3 file1 | column -t
 11,12,13 NA
 16,14,12 NA
 28,21,22 NA
 22,23,24 NA

Request for your help to resolve this.

---------- Post updated at 01:56 PM ---------- Previous update was at 01:21 PM ----------

thanks as advice code tag added..

RudiC · August 15, 2017, 3:11pm

Try

awk '
FNR==NR {a[$1] = $0
         next
        }

        {print $0, a[$2]?a[$2]:"NA"
        }
' FS=, file3 file1
11,12,13 NA
16,14,12 14,12,12
28,21,22 NA
22,23,24 23,22,24

shanul_karim · August 15, 2017, 3:40pm

Hello RudiC

Thanks for the reply..

Just want to know whether this code work same for below files as well
Mean column 3 from CBUEBU_REF as reference.

I tried and got all NA

$ head -n 10 CBUEBU_REF
754123450,820017123454441,7512345201
701234512,820017312345440,7012345122
701234544,820017312345439,7012345443
703123450,820123458284423,7012345401
751234503,820017301234529,7512345032
751234562,820017312345928,7561234523
701234581,820017312345133,7012345811
753123458,820017123457041,7531234582
753123455,820017123459194,7531234553
701234522,820017123453601,7012345221

$ head -n 10 CUDB_REF

751234592,820017123451132,7512345921
701234559,820017312345133,7001123452
701234578,820017301234534,7001234583
700123454,820011234581410,7012345941
701234514,820017312345411,7012345142
701234574,820017123451412,7081234543
751234593,820017301234565,7581234531
755123453,820017312345909,7512345932
701234554,820017301234532,701234543
751234586,820017123456114,7551234561

RudiC · August 15, 2017, 3:51pm

So what would the desired output be? Which field should match which column?

shanul_karim · August 15, 2017, 4:10pm

Here are the files and need to compare column 3 with the same logic

 file3
754123450,820017123454441,7512345201
701234512,820017312345440,7012345122
701234544,820017312345439,7012345443
703123450,820123458284423,7012345401
751234503,820017301234529,7512345032
751234562,820017312345928,7561234523
701234581,820017312345133,7012345811
753123458,820017123457041,7531234582
753123455,820017123459194,7531234553
701234522,820017123453601,7012345221

 file1
751234512,821234501068703,7512345123
712345279,820017121234519,7123452791
712345222,820017301234501,7051234521
723458177,820017301234531,7521234572
751234520,820017312345441,7512345201
756123452,820017312345928,7512345623
751234545,820017123459194,7512345453
701234581,820017123459133,7012345811
701234539,820017123454493,7123459393
712345229,820017123459186,7123452291

The desired output files are attached

RudiC · August 15, 2017, 4:19pm

Applying the corrected script (replaced all field references with $3) to your most recent input samples yields exactly your desired output files. The former samples did NOT have any entry in field 3 in common, so "NA" would be correct.

shanul_karim · August 15, 2017, 4:22pm

ok, What is the corrected code now for these input and outputs..

RudiC · August 15, 2017, 4:37pm

Isn't that clear?Try

awk '
FNR==NR {a[$3] = $0
         next
        }

        {print $0, a[$3]?a[$3]:"NA"
        }
' FS=, file3 file1

shanul_karim · August 15, 2017, 11:33pm

its very much clear and I have tested and its working fine for my small input files.

Really thankful to you for all your support.

But while testing same logic for bigger input files. I saw some strange behaviour.

like

$ awk '
> FNR==NR {a[$3] = $0
>          next
>         }
>
>         {print $0, a[$3]?a[$3]:"NA"
>         }
> ' FS=, file8 file9 > FINAL1

eshaqur@IN00106004 ~/CNG_MSISDN_AUDIT
$ head -n 10 FINAL1
751234592,820011234501132,7512345921 NA
701234559,820011234501133,7012345592 NA
701234578,820011234501134,7012345783 NA
701234594,820012345481410,7123451941 NA
701234514,820012345481411,7071234542 NA
701234574,820012345481412,7012345743 NA
751234593,820012345771965,7512345931 758123453,820017123451965,7123459931
712345693,820012341234509,7123456932 755123453,821234501234509,7512345932
703123454,820012345802232,7035123453 703123454,820017312345232,7012345543
755123456,8200123454461146,71234520861 NA

whereas "712345692,820017123456132,7123456921" exists in both file8 and file9. So result should not be NA.

eshaqur@IN00106004 ~/CNG_MSISDN_AUDIT
$ grep "712345592,820017123451132,7123455921" file8
751234592,820017312345632,7512345921

eshaqur@IN00106004 ~/CNG_MSISDN_AUDIT
$ grep "751234592,821234501501132,7123455921" file9
751234592,821234501501132,7123455921

in file8 its in line 1792131 and in file9 its at 1st line.
what could be the reason of this NA now since data is at coloumn 3 but at row 1792131.

anything can improve in code that it include lines upto 3000000 search for vlookup of column 3.

---------- Post updated at 10:33 PM ---------- Previous update was at 04:28 PM ----------

shanul karim:

its very much clear and I have tested and its working fine for my small input files.

Really thankful to you for all your support.

But while testing same logic for bigger input files. I saw some strange behaviour.

like
$ awk '
> FNR==NR {a[$3] = $0
>          next
>         }
>
>         {print $0, a[$3]?a[$3]:"NA"
>         }
> ' FS=, file8 file9 > FINAL1 
eshaqur@IN00106004 ~/CNG_MSISDN_AUDIT
$ head -n 10 FINAL1
751234592,820011234501132,7512345921 NA
701234559,820011234501133,7012345592 NA
701234578,820011234501134,7012345783 NA
701234594,820012345481410,7123451941 NA
701234514,820012345481411,7071234542 NA
701234574,820012345481412,7012345743 NA
751234593,820012345771965,7512345931 758123453,820017123451965,7123459931
712345693,820012341234509,7123456932 755123453,821234501234509,7512345932
703123454,820012345802232,7035123453 703123454,820017312345232,7012345543
755123456,8200123454461146,71234520861 NA 
whereas "75123456,820012345601132,712345921" exists in both file8 and file9. So result should not be NA.
eshaqur@IN00106004 ~/CNG_MSISDN_AUDIT
$ grep "723450592,820123451501132,7123455921" file8
723450592,820123451501132,7123455921
eshaqur@IN00106004 ~/CNG_MSISDN_AUDIT
$ grep "723450592,820123451501132,7123455921" file9
723450592,820123451501132,7123455921
in file8 its in line 1792131 and in file9 its at 1st line.
what could be the reason of this NA now since data is at coloumn 3 but at row 1792131.

anything can improve in code that it include lines upto 3000000 search for vlookup of column 3.

Finally it's solved..Thanks again for support. Actually it's resolved by copy content of input files in text format manually using ultra edit. And then apply your awk command.