Command for vlookup function

Hello experts,

I have large text files that need to be arranged using a function like excel's vlookup. I have been playing with awk command but didn't really come up with a solution. Could anyone please help me out? Below are my datasets and expected output. Any help would be greatly appreciated.

<File 1>

contig-100_1 5
contig-100_30 10
contig-100_50 20
contig-100_40 15
contig-100_60 25

<File 2>

contig-100_1 "Bacteria"
contig-100_50 "Bacteria A"
contig-100_60 "Bacteria A, Fungi B"

<Expected output>

contig-100_1 5 Bacteria
contig-100_30 10 NA
contig-100_50 20 Bacteria A
contig-100_40 15 NA
contig-100_60 25 Bacteria A, Fungi B

While this is a commonly asked request, but that the quoted field contains the field separator (a space) makes it a little more complicated. But with a little preprocessing:

{
  sed -e 's/ "/"/' ${2} 
  echo '""*'
  sed -e 's/ /"/g'  ${1}
} \
| awk -F\" '$3 == "*" { mark++; next; } mark == "" { X[$1] = $2; next; } { print $1, $2, $1 in X ? X[$1] : "NA"; }'

which generates the expected results:

contig-100_1 5 Bacteria
contig-100_30 10 NA
contig-100_50 20 Bacteria A
contig-100_40 15 NA
contig-100_60 25 Bacteria A, Fungi B

Thanks derekludwig!

May I ask you one more question?
I tried your command with the following.
Script.sh File1 File2 > output

But the output looks odd. It has LN:10832 (any numbers) in the third column and I don't understand where those numbers came from. Please correct me if I am wrong with understanding your command. Sorry that I am so beginner.

Instead of invoking sed twice and awk once, try just using:

awk -F' *"' '
FNR == NR {	v[$1] = $2; next }
{		print $0, ($1 in v ? v[$1] : "NA") }
' "$2" FS=' ' "$1"

You haven't said what OS or shell you're using. If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk .

Thanks Don Cragun!

I am working on Mac OS. I have executed your command with bash shell. But the third column in the output file is all "NA"...

You may need the double quotes in your processing to follow. Try

awk 'NR==FNR {T[$1]=$0;next} $1 in T {X=$1; $1=T[$1]; delete T[X]} 1; END {for (t in T) print T[t], "N/A"}' file1 file2
contig-100_1 5 "Bacteria"
contig-100_50 20 "Bacteria A"
contig-100_60 25 "Bacteria A, Fungi B"
contig-100_40 15 N/A
contig-100_30 10 N/A

Pipe it through a sort ing step if need be.

Dear RudiC,

That worked perfect. Thanks all for your help!

I'm also working on Mac OS X and when I run the script (using either bash or ksh ) I get exactly the output you said you wanted.

Please show us the output from the command:

head -5 File[12] | od -bc

so we can figure out why my suggestion doesn't work with your data.

Dear Don Cragun,

Sorry for my late reply. I am not very sure about what you meant by File[12]. Please see below.

head -5 File1[12] | od -bc
0000000 143 157 156 164 151 147 055 061 060 060 137 063 063 061 065 066
          c   o   n   t   i   g   -   1   0   0   _   3   3   1   5   6
0000020 061 040 066 012 143 157 156 164 151 147 055 061 060 060 137 063
          1       6  \n   c   o   n   t   i   g   -   1   0   0   _   3
0000040 070 064 061 062 061 040 066 012 143 157 156 164 151 147 055 061
          8   4   1   2   1       6  \n   c   o   n   t   i   g   -   1
0000060 060 060 137 063 067 066 070 066 071 040 071 067 012 143 157 156
          0   0   _   3   7   6   8   6   9       9   7  \n   c   o   n
0000100 164 151 147 055 061 060 060 137 064 063 070 061 067 061 040 061
          t   i   g   -   1   0   0   _   4   3   8   1   7   1       1
0000120 063 012 143 157 156 164 151 147 055 061 060 060 137 062 066 071
          3  \n   c   o   n   t   i   g   -   1   0   0   _   2   6   9
0000140 066 061 067 040 062 066 062 012
          6   1   7       2   6   2  \n

head -5 File2[12] | od -bc
0000000 143 157 156 164 151 147 055 061 060 060 137 062 060 062 066 011
          c   o   n   t   i   g   -   1   0   0   _   2   0   2   6  \t
0000020 042 162 157 157 164 042 012 143 157 156 164 151 147 055 061 060
          "   r   o   o   t   "  \n   c   o   n   t   i   g   -   1   0
0000040 060 137 063 063 060 064 011 042 162 157 157 164 042 012 143 157
          0   _   3   3   0   4  \t   "   r   o   o   t   "  \n   c   o
0000060 156 164 151 147 055 061 060 060 137 065 061 064 065 011 042 162
          n   t   i   g   -   1   0   0   _   5   1   4   5  \t   "   r
0000100 157 157 164 042 012 143 157 156 164 151 147 055 061 060 060 137
          o   o   t   "  \n   c   o   n   t   i   g   -   1   0   0   _
0000120 070 060 065 060 011 042 162 157 157 164 042 012 143 157 156 164
          8   0   5   0  \t   "   r   o   o   t   "  \n   c   o   n   t
0000140 151 147 055 061 060 060 137 070 064 066 071 011 042 162 157 157
          i   g   -   1   0   0   _   8   4   6   9  \t   "   r   o   o
0000160 164 042 012
          t   "  \n

head -5 output[12] | od -bc
0000000 143 157 156 164 151 147 055 061 060 060 137 063 063 061 065 066
          c   o   n   t   i   g   -   1   0   0   _   3   3   1   5   6
0000020 061 040 066 040 116 101 012 143 157 156 164 151 147 055 061 060
          1       6       N   A  \n   c   o   n   t   i   g   -   1   0
0000040 060 137 063 070 064 061 062 061 040 066 040 116 101 012 143 157
          0   _   3   8   4   1   2   1       6       N   A  \n   c   o
0000060 156 164 151 147 055 061 060 060 137 063 067 066 070 066 071 040
          n   t   i   g   -   1   0   0   _   3   7   6   8   6   9    
0000100 071 067 040 116 101 012 143 157 156 164 151 147 055 061 060 060
          9   7       N   A  \n   c   o   n   t   i   g   -   1   0   0
0000120 137 064 063 070 061 067 061 040 061 063 040 116 101 012 143 157
          _   4   3   8   1   7   1       1   3       N   A  \n   c   o
0000140 156 164 151 147 055 061 060 060 137 062 066 071 066 061 067 040
          n   t   i   g   -   1   0   0   _   2   6   9   6   1   7    
0000160 062 066 062 040 116 101 012
          2   6   2       N   A  \n

---------- Post updated at 10:29 AM ---------- Previous update was at 10:26 AM ----------

Dear RudiC,

Your command worked great like I said. But when I tried your script with small sized datasets (1-5 M), it worked fine. But it didn't work well on the bigger sized datasets (6 - 10 M). Do you have any suggestion on this? Any advice would be greatly appreciated.