Command for vlookup function

makelifeeasier · December 13, 2014, 2:30pm

Hello experts,

I have large text files that need to be arranged using a function like excel's vlookup. I have been playing with awk command but didn't really come up with a solution. Could anyone please help me out? Below are my datasets and expected output. Any help would be greatly appreciated.

contig-100_1 5
contig-100_30 10
contig-100_50 20
contig-100_40 15
contig-100_60 25

contig-100_1 "Bacteria"
contig-100_50 "Bacteria A"
contig-100_60 "Bacteria A, Fungi B"

contig-100_1 5 Bacteria
contig-100_30 10 NA
contig-100_50 20 Bacteria A
contig-100_40 15 NA
contig-100_60 25 Bacteria A, Fungi B

derekludwig · December 13, 2014, 2:51pm

While this is a commonly asked request, but that the quoted field contains the field separator (a space) makes it a little more complicated. But with a little preprocessing:

{
  sed -e 's/ "/"/' ${2} 
  echo '""*'
  sed -e 's/ /"/g'  ${1}
} \
| awk -F\" '$3 == "*" { mark++; next; } mark == "" { X[$1] = $2; next; } { print $1, $2, $1 in X ? X[$1] : "NA"; }'

which generates the expected results:

contig-100_1 5 Bacteria
contig-100_30 10 NA
contig-100_50 20 Bacteria A
contig-100_40 15 NA
contig-100_60 25 Bacteria A, Fungi B

makelifeeasier · December 13, 2014, 3:19pm

Thanks derekludwig!

May I ask you one more question?
I tried your command with the following.
Script.sh File1 File2 > output

But the output looks odd. It has LN:10832 (any numbers) in the third column and I don't understand where those numbers came from. Please correct me if I am wrong with understanding your command. Sorry that I am so beginner.

Don_Cragun · December 13, 2014, 3:45pm

Instead of invoking sed twice and awk once, try just using:

awk -F' *"' '
FNR == NR {	v[$1] = $2; next }
{		print $0, ($1 in v ? v[$1] : "NA") }
' "$2" FS=' ' "$1"

You haven't said what OS or shell you're using. If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk .

makelifeeasier · December 13, 2014, 3:59pm

Thanks Don Cragun!

I am working on Mac OS. I have executed your command with bash shell. But the third column in the output file is all "NA"...

RudiC · December 13, 2014, 4:17pm

You may need the double quotes in your processing to follow. Try

awk 'NR==FNR {T[$1]=$0;next} $1 in T {X=$1; $1=T[$1]; delete T[X]} 1; END {for (t in T) print T[t], "N/A"}' file1 file2
contig-100_1 5 "Bacteria"
contig-100_50 20 "Bacteria A"
contig-100_60 25 "Bacteria A, Fungi B"
contig-100_40 15 N/A
contig-100_30 10 N/A

Pipe it through a sort ing step if need be.

makelifeeasier · December 13, 2014, 4:29pm

Dear RudiC,

That worked perfect. Thanks all for your help!

Don_Cragun · December 13, 2014, 5:17pm

I'm also working on Mac OS X and when I run the script (using either bash or ksh ) I get exactly the output you said you wanted.

Please show us the output from the command:

head -5 File[12] | od -bc

so we can figure out why my suggestion doesn't work with your data.

makelifeeasier · December 15, 2014, 10:29am

Dear Don Cragun,

Sorry for my late reply. I am not very sure about what you meant by File[12]. Please see below.

head -5 File1[12] | od -bc
0000000 143 157 156 164 151 147 055 061 060 060 137 063 063 061 065 066
          c   o   n   t   i   g   -   1   0   0   _   3   3   1   5   6
0000020 061 040 066 012 143 157 156 164 151 147 055 061 060 060 137 063
          1       6  \n   c   o   n   t   i   g   -   1   0   0   _   3
0000040 070 064 061 062 061 040 066 012 143 157 156 164 151 147 055 061
          8   4   1   2   1       6  \n   c   o   n   t   i   g   -   1
0000060 060 060 137 063 067 066 070 066 071 040 071 067 012 143 157 156
          0   0   _   3   7   6   8   6   9       9   7  \n   c   o   n
0000100 164 151 147 055 061 060 060 137 064 063 070 061 067 061 040 061
          t   i   g   -   1   0   0   _   4   3   8   1   7   1       1
0000120 063 012 143 157 156 164 151 147 055 061 060 060 137 062 066 071
          3  \n   c   o   n   t   i   g   -   1   0   0   _   2   6   9
0000140 066 061 067 040 062 066 062 012
          6   1   7       2   6   2  \n

head -5 File2[12] | od -bc
0000000 143 157 156 164 151 147 055 061 060 060 137 062 060 062 066 011
          c   o   n   t   i   g   -   1   0   0   _   2   0   2   6  \t
0000020 042 162 157 157 164 042 012 143 157 156 164 151 147 055 061 060
          "   r   o   o   t   "  \n   c   o   n   t   i   g   -   1   0
0000040 060 137 063 063 060 064 011 042 162 157 157 164 042 012 143 157
          0   _   3   3   0   4  \t   "   r   o   o   t   "  \n   c   o
0000060 156 164 151 147 055 061 060 060 137 065 061 064 065 011 042 162
          n   t   i   g   -   1   0   0   _   5   1   4   5  \t   "   r
0000100 157 157 164 042 012 143 157 156 164 151 147 055 061 060 060 137
          o   o   t   "  \n   c   o   n   t   i   g   -   1   0   0   _
0000120 070 060 065 060 011 042 162 157 157 164 042 012 143 157 156 164
          8   0   5   0  \t   "   r   o   o   t   "  \n   c   o   n   t
0000140 151 147 055 061 060 060 137 070 064 066 071 011 042 162 157 157
          i   g   -   1   0   0   _   8   4   6   9  \t   "   r   o   o
0000160 164 042 012
          t   "  \n

head -5 output[12] | od -bc
0000000 143 157 156 164 151 147 055 061 060 060 137 063 063 061 065 066
          c   o   n   t   i   g   -   1   0   0   _   3   3   1   5   6
0000020 061 040 066 040 116 101 012 143 157 156 164 151 147 055 061 060
          1       6       N   A  \n   c   o   n   t   i   g   -   1   0
0000040 060 137 063 070 064 061 062 061 040 066 040 116 101 012 143 157
          0   _   3   8   4   1   2   1       6       N   A  \n   c   o
0000060 156 164 151 147 055 061 060 060 137 063 067 066 070 066 071 040
          n   t   i   g   -   1   0   0   _   3   7   6   8   6   9    
0000100 071 067 040 116 101 012 143 157 156 164 151 147 055 061 060 060
          9   7       N   A  \n   c   o   n   t   i   g   -   1   0   0
0000120 137 064 063 070 061 067 061 040 061 063 040 116 101 012 143 157
          _   4   3   8   1   7   1       1   3       N   A  \n   c   o
0000140 156 164 151 147 055 061 060 060 137 062 066 071 066 061 067 040
          n   t   i   g   -   1   0   0   _   2   6   9   6   1   7    
0000160 062 066 062 040 116 101 012
          2   6   2       N   A  \n

---------- Post updated at 10:29 AM ---------- Previous update was at 10:26 AM ----------

Dear RudiC,

Your command worked great like I said. But when I tried your script with small sized datasets (1-5 M), it worked fine. But it didn't work well on the bigger sized datasets (6 - 10 M). Do you have any suggestion on this? Any advice would be greatly appreciated.