I have large text files that need to be arranged using a function like excel's vlookup. I have been playing with awk command but didn't really come up with a solution. Could anyone please help me out? Below are my datasets and expected output. Any help would be greatly appreciated.
While this is a commonly asked request, but that the quoted field contains the field separator (a space) makes it a little more complicated. But with a little preprocessing:
{
sed -e 's/ "/"/' ${2}
echo '""*'
sed -e 's/ /"/g' ${1}
} \
| awk -F\" '$3 == "*" { mark++; next; } mark == "" { X[$1] = $2; next; } { print $1, $2, $1 in X ? X[$1] : "NA"; }'
which generates the expected results:
contig-100_1 5 Bacteria
contig-100_30 10 NA
contig-100_50 20 Bacteria A
contig-100_40 15 NA
contig-100_60 25 Bacteria A, Fungi B
May I ask you one more question?
I tried your command with the following.
Script.sh File1 File2 > output
But the output looks odd. It has LN:10832 (any numbers) in the third column and I don't understand where those numbers came from. Please correct me if I am wrong with understanding your command. Sorry that I am so beginner.
Instead of invoking sed twice and awk once, try just using:
awk -F' *"' '
FNR == NR { v[$1] = $2; next }
{ print $0, ($1 in v ? v[$1] : "NA") }
' "$2" FS=' ' "$1"
You haven't said what OS or shell you're using. If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk .
Sorry for my late reply. I am not very sure about what you meant by File[12]. Please see below.
head -5 File1[12] | od -bc
0000000 143 157 156 164 151 147 055 061 060 060 137 063 063 061 065 066
c o n t i g - 1 0 0 _ 3 3 1 5 6
0000020 061 040 066 012 143 157 156 164 151 147 055 061 060 060 137 063
1 6 \n c o n t i g - 1 0 0 _ 3
0000040 070 064 061 062 061 040 066 012 143 157 156 164 151 147 055 061
8 4 1 2 1 6 \n c o n t i g - 1
0000060 060 060 137 063 067 066 070 066 071 040 071 067 012 143 157 156
0 0 _ 3 7 6 8 6 9 9 7 \n c o n
0000100 164 151 147 055 061 060 060 137 064 063 070 061 067 061 040 061
t i g - 1 0 0 _ 4 3 8 1 7 1 1
0000120 063 012 143 157 156 164 151 147 055 061 060 060 137 062 066 071
3 \n c o n t i g - 1 0 0 _ 2 6 9
0000140 066 061 067 040 062 066 062 012
6 1 7 2 6 2 \n
head -5 File2[12] | od -bc
0000000 143 157 156 164 151 147 055 061 060 060 137 062 060 062 066 011
c o n t i g - 1 0 0 _ 2 0 2 6 \t
0000020 042 162 157 157 164 042 012 143 157 156 164 151 147 055 061 060
" r o o t " \n c o n t i g - 1 0
0000040 060 137 063 063 060 064 011 042 162 157 157 164 042 012 143 157
0 _ 3 3 0 4 \t " r o o t " \n c o
0000060 156 164 151 147 055 061 060 060 137 065 061 064 065 011 042 162
n t i g - 1 0 0 _ 5 1 4 5 \t " r
0000100 157 157 164 042 012 143 157 156 164 151 147 055 061 060 060 137
o o t " \n c o n t i g - 1 0 0 _
0000120 070 060 065 060 011 042 162 157 157 164 042 012 143 157 156 164
8 0 5 0 \t " r o o t " \n c o n t
0000140 151 147 055 061 060 060 137 070 064 066 071 011 042 162 157 157
i g - 1 0 0 _ 8 4 6 9 \t " r o o
0000160 164 042 012
t " \n
head -5 output[12] | od -bc
0000000 143 157 156 164 151 147 055 061 060 060 137 063 063 061 065 066
c o n t i g - 1 0 0 _ 3 3 1 5 6
0000020 061 040 066 040 116 101 012 143 157 156 164 151 147 055 061 060
1 6 N A \n c o n t i g - 1 0
0000040 060 137 063 070 064 061 062 061 040 066 040 116 101 012 143 157
0 _ 3 8 4 1 2 1 6 N A \n c o
0000060 156 164 151 147 055 061 060 060 137 063 067 066 070 066 071 040
n t i g - 1 0 0 _ 3 7 6 8 6 9
0000100 071 067 040 116 101 012 143 157 156 164 151 147 055 061 060 060
9 7 N A \n c o n t i g - 1 0 0
0000120 137 064 063 070 061 067 061 040 061 063 040 116 101 012 143 157
_ 4 3 8 1 7 1 1 3 N A \n c o
0000140 156 164 151 147 055 061 060 060 137 062 066 071 066 061 067 040
n t i g - 1 0 0 _ 2 6 9 6 1 7
0000160 062 066 062 040 116 101 012
2 6 2 N A \n
---------- Post updated at 10:29 AM ---------- Previous update was at 10:26 AM ----------
Dear RudiC,
Your command worked great like I said. But when I tried your script with small sized datasets (1-5 M), it worked fine. But it didn't work well on the bigger sized datasets (6 - 10 M). Do you have any suggestion on this? Any advice would be greatly appreciated.