I thought I had this figured out but was wrong so am humbly asking for help.
The task is to add an additional column to FILE 1 based on records in FILE 2.
The key is in COLUMN 1 for FILE 1 and in COLUMN 1 OR COLUMN 2 for FILE 2.
I want to add the third column from FILE 2 to the end of each matching line in FILE 1, so that the new file shows, for example:
DESIRED FILE 3
1:13109 G_T t g -0.4127 0.1042 7.52e-05 ?---??????-? rs540538026
FILE 1
1:1057989 G_T t g 0.3000 0.0662 5.909e-06 ??++++++???+
1:11007 C_T t c 0.2874 0.0710 5.19e-05 ?????+++???+
1:2190612 A_G a g 1.1252 0.2605 1.561e-05 ???????????+
1:13109 G_T t g -0.4127 0.1042 7.52e-05 ?---??????-?
1:3674534 G_T t g -0.4187 0.1073 9.559e-05 ?---??????-?
1:6932407 A_G a g 1.4977 0.3322 6.535e-06 ???????????+
1:6938780 C_T t c -1.3632 0.3274 3.135e-05 ???????????-
1:7171050 A_G a g 0.0537 0.0134 6.091e-05 ?+++?-++++++
1:8960594 C_T t c -0.9273 0.2319 6.344e-05 ???????????-
1:12203508 C_T t c -1.4228 0.3469 4.111e-05 ???????????-
Thank you, this is closer to a solution than I've been in several days.
I tried out the script and did some manual sanity checks.
The code correctly identifies both column 1 and column 2 values in FILE 2. However, it only seems to add the column 3 value in FILE 2 if the matched value in FILE 2 was found in column 1.
I am wondering whether this part of the code needs to be modified. Does $1 in f22 refer to the first column in the created matrix?
$1 in f21 { print $0, f21[$1];next }
$1 in f22 { print $0, f21[$1] }
If there is no match in any of the columns the row is eliminated from the output, which actually isn't much of a problem though.
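On the question about $1 in f22: that expression tests whether the first field of the current FILE 1 line exists as an index of the array f22, which was filled from column 2 of FILE 2. The second rule quoted above then prints f21[$1], which is empty for exactly those keys, and that would explain the missing values. A minimal sketch of the corrected rules follows. The FILE 2 rows and the temporary file names are made up for illustration; only the 1:13109 / rs540538026 pairing appears in this thread.

```shell
# Hypothetical sample data: FILE 2 layout (key, alternate key, ID) is assumed.
tmp=$(mktemp -d)
printf '%s\n' \
  '1:13109 chr1:13109 rs540538026' \
  'chr1:11007 1:11007 rsEXAMPLE2' > "$tmp/file2.txt"
printf '%s\n' \
  '1:13109 G_T t g -0.4127 0.1042 7.52e-05 ?---??????-?' \
  '1:11007 C_T t c 0.2874 0.0710 5.19e-05 ?????+++???+' \
  '1:2190612 A_G a g 1.1252 0.2605 1.561e-05 ???????????+' > "$tmp/file1.txt"

result=$(awk '
FNR==NR   { f21[$1]=$3; f22[$2]=$3; next }  # first file: index column 3 by column 1 and by column 2
$1 in f21 { print $0, f21[$1]; next }       # key matched FILE 2 column 1
$1 in f22 { print $0, f22[$1] }             # key matched FILE 2 column 2: read f22, not f21
' "$tmp/file2.txt" "$tmp/file1.txt")
printf '%s\n' "$result"
```

With this sample, the 1:2190612 row has no match in either column and is dropped from the output, as described above.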
And now it works like a charm and produced exactly the output I was looking for. Thanks !!
---------- Post updated at 01:01 PM ---------- Previous update was at 12:50 PM ----------
The main question is solved thanks to vgersh99. I have a bonus question: if I would like to run this awk line on multiple FILE 1 inputs using the same reference FILE 2, would something along these lines do the trick?
for i in *.txt ; do
awk '
FNR==NR {
f21[$1]=$3
f22[$2]=$3
next
}
$1 in f21 { print $0, f21[$1];next }
$1 in f22 { print $0, f22[$1] }
' FILE2.txt $i
$i > $i.pruned
done
With a VERY sloppy interpretation of "along these lines" you might come close to the desired result, once you have corrected the syntax / redirection error in the second-to-last line (awk's output has to be redirected; $i on its own is run as a command) and accepted the higher resource cost of running the script once per file.
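For reference, the loop with the redirection attached to the awk command might look like the sketch below. The temporary directory, the a.txt / b.txt names and the single FILE 2 row are assumptions for illustration. Note that *.txt also matches the reference file itself, so the sketch skips it.

```shell
tmp=$(mktemp -d)
printf '%s\n' '1:13109 chr1:13109 rs540538026' > "$tmp/FILE2.txt"   # hypothetical FILE 2 row
printf '%s\n' '1:13109 G_T t g -0.4127 0.1042 7.52e-05 ?---??????-?' > "$tmp/a.txt"
printf '%s\n' '1:1057989 G_T t g 0.3000 0.0662 5.909e-06 ??++++++???+' > "$tmp/b.txt"

for i in "$tmp"/*.txt; do
  # The glob also picks up the reference file; do not match it against itself.
  if [ "${i##*/}" = "FILE2.txt" ]; then continue; fi
  awk '
  FNR==NR   { f21[$1]=$3; f22[$2]=$3; next }
  $1 in f21 { print $0, f21[$1]; next }
  $1 in f22 { print $0, f22[$1] }
  ' "$tmp/FILE2.txt" "$i" > "$i.pruned"   # redirect the awk output; quote "$i"
done
```

Because the shell creates the target of the redirection, an input with no matching rows still gets an (empty) .pruned file here, and FILE 2 is re-read once per input file.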
Why not something along THESE lines:
awk '
FNR==NR {f21[$1]=$3
f22[$2]=$3
next
}
$1 in f21 {print $0, f21[$1] > (FILENAME ".pruned")
next
}
$1 in f22 {print $0, f22[$1] > (FILENAME ".pruned")
}
' file2.txt file[^2].txt
I suspected my suggested code was sloppy; I'm a newbie.
As I understand it, this part of your suggestion looks up column 1 in the f21 table. Does it then add the .pruned extension to the name of the output file, or does it process all files that have the .pruned extension?
$1 in f21 {print $0, f21[$1] > (FILENAME ".pruned"); next}
To be clearer: I have 400 FILE 1 files that should be matched against the FILE 2 table, of which there is only one. The file names look like the examples below. I would like to match all of these FILE 1 files without having to run them one at a time. They all have the same file extension. The resulting files should get an additional extension, .pruned.
FILE1_VEGF.tbl.filtered.tab
FILE1_TL1A.tbl.filtered.tab
FILE1_MMP13.tbl.filtered.tab
FILE1_KYNUR.tbl.filtered.tab
+398 more files
I also don't understand this part
FILE2 FILE1[^2].txt
Does the ^ mean that the files are combined?
Is it possible to use wildcard definition e.g. *.tab to process many different versions of FILE1?
PLEASE start becoming exact and consistent when posting your problems here, also across posts, making samples match what you say in the text.
People (not only) in here tend to refer to the samples if the text is not too clear...
Also, stick to lower or upper case syntax as, in *nix, these are not equivalent: "FILE2" != "file2"!
For example, your for i in *.txt won't match ANY of your "FILE1..." names in post#7.
We don't print to stdout, but redirect it to a filename composed of the original name (which awk provides in the FILENAME variable) and the ".pruned" extension (pls be aware that *nix doesn't have the concept of file name "extensions").
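That's the shell's file name globbing: [^2] matches any single character EXCEPT "2" at that position in the directory's entries, so file[^2].txt expands to file1.txt, file3.txt, ... but not file2.txt. Nothing is combined; awk simply receives the list of matching names. Cf. man bash.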
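A wildcard works the same way for the file names from the later post. Below is a minimal sketch of a single awk run over all of them, assuming the reference file is called FILE2.txt and sits next to the inputs. The temporary directory and the FILE 2 row are made up for illustration. The close() call is an addition not in the suggestion above: with around 400 outputs, some awk implementations may otherwise run into the limit on simultaneously open files.

```shell
tmp=$(mktemp -d)
printf '%s\n' '1:13109 chr1:13109 rs540538026' > "$tmp/FILE2.txt"   # hypothetical FILE 2 row
printf '%s\n' '1:13109 G_T t g -0.4127 0.1042 7.52e-05 ?---??????-?' \
  > "$tmp/FILE1_VEGF.tbl.filtered.tab"
printf '%s\n' \
  '1:1057989 G_T t g 0.3000 0.0662 5.909e-06 ??++++++???+' \
  '1:13109 G_T t g -0.4127 0.1042 7.52e-05 ?---??????-?' \
  > "$tmp/FILE1_TL1A.tbl.filtered.tab"

awk '
FNR==NR   { f21[$1]=$3; f22[$2]=$3; next }   # reference file is read once, first
FNR==1    { if (out) close(out)              # new input file: close the previous output
            out = FILENAME ".pruned" }       # output name = input name plus ".pruned"
$1 in f21 { print $0, f21[$1] > out; next }
$1 in f22 { print $0, f22[$1] > out }
' "$tmp/FILE2.txt" "$tmp"/FILE1_*.tbl.filtered.tab
```

The glob FILE1_*.tbl.filtered.tab cannot match FILE2.txt, so no exclusion such as [^2] is needed. Unlike a shell loop with redirection, awk only creates a .pruned file once at least one row of that input matches.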