awk conditional operators- lookup value in 2nd file

Geneanalyst · October 22, 2018, 10:59am

I use conditional operators alot in AWK to print rows from large text files based on values in a certain column. For example:

awk -F '\t' '{ if ($1 == "A" || $1 == "C" ) print $0}'  OFS="\t" file1.txt > file2.txt

In this case every row is printed from file1 to file2 for which the column 1 value in file1 is either A or C.

This is fine if I'm working with a few conditional values in column 1 (2 in the above example), however, if there are many values to look up, say 40, it becomes cumbersome to write out 40 conditional values. In this case it would be more efficient to place all those conditional values in a separate file say file3. For example, file 3 could look like:

File3

BY1038
BY1039
BY2222
BY30142
BY30150
BY30161
BY30209
BY30211
CTS10416
CTS10700
CTS10879

Basically I would like to print all file1 rows to file2 for which the column1 value in file1 is one of those listed in file3

File1

 BY1038    23497067    C->G
 BY1039    14032743    A->T
 BY2222    2720236    A->G
 CTS493    8122429    A->G
 CTS5035    14708870    G->C
 CTS5268    15018138    A->G
 CTS5336    17367361    T->C
 BY30211    7625503    C->A
 CTS10416    16513268    T->C
 CTS10700    16651541    G->T
 CTS10879    14409483    A->C
 CTS11569    17841608    A->G
 CTS12657    15373303    C->T
 CTS150    2913202    G->A
 CTS3585    21225818    A->C
 CTS3837    14043980    T->C
 CTS4014    7064140    G->C
 CTS4652    8891051    C->T
 CTS482    15658416    G->A

Desired output (file2) should look like:

BY1038    23497067    C->G
 BY1039    14032743    A->T
 BY2222    2720236    A->G
CTS10416    16513268    T->C
 CTS10700    16651541    G->T
 CTS10879    14409483    A->C

Thanks in advance...

------ Post updated at 09:59 AM ------

I modified a code that I was using for something else, and it seems to work fine for printing columns1 and column2 of file1 to file2, however, I need the column3 value also printed.

awk 'NR == FNR {REP[$1] = $2; next} $1 in REP {$2 = REP[$1]} 1' OFS="\t" file1 file3 > file2

Here is what I get (only 1st 2 columns are printed)

file2

BY1038
 BY1039
 BY2222
 BY30142
 BY30150
 BY30161
 BY30209
 BY30211
CTS10416    19443432
CTS10700    22712606
CTS10879    22815955

vgersh99 · October 22, 2018, 10:59am

awk -F '\t' 'FNR==NR {f3[$1];next} $1 in f3' OFS="\t" file3 file1.txt > file2.txt

Geneanalyst · October 22, 2018, 11:05am

Sorry vgersh99, but your code doesn't produce the desired output. Only 2 rows were printed

vgersh99 · October 22, 2018, 11:13am

please provide the output of cat -vet file1.txt using code tags.
With the properly formatted file1.txt (tab-delimited fields), I get:

awk -F '\t' 'FNR==NR {f3[$1];next} $1 in f3' OFS="\t" gene3.txt gene1.txt
BY1038  23497067        C->G
BY1039  14032743        A->T
BY2222  2720236 A->G
BY30211 7625503 C->A
CTS10416        16513268        T->C
CTS10700        16651541        G->T
CTS10879        14409483        A->C

Geneanalyst · October 22, 2018, 12:21pm

Sorry vgersh99, my bad. My file3 had some leading spaces which after cleaning up produced the correct output.

Thanks for the code!

------ Post updated at 11:21 AM ------

Do you mind posting an explanation of the steps inside the code

vgersh99 · October 22, 2018, 12:37pm

awk -F '\t' '
# when FNR==NR (only true for the FIRST file on cli),
# populate array f3 indexed by $1 (first field) in the file (file3)
# next - advance to the next input record/line in file3 without taking any
# following actions
 FNR==NR {f3[$1];next} 

# we get here for the file(s) following the first file on cli
# if $1 (first field) in the line/record is present in array f3,
# output the whole record in the current/f1 file
# this could be rewritten (long-hand) as $1 in f3 { print $0}
 $1 in f3' OFS="\t" file3 file1.txt