Awk: adding fields after matching $1

Dear AWK-experts!

I did get stuck in the task of combining files after matching fields, so I'm still awkward with learning AWK.

There are 2 files: one containing 3 columns with ID, coding status, and score for long noncoding RNAs:

file1 (1.txt) (>5000 lines)

TCONS_00017358;noncoding;0.968717
TCONS_00024657;noncoding;0.995485
TCONS_00005907;noncoding;0.808883 
TCONS_00013445;noncoding;0.998353 
TCONS_00001277;noncoding;0.934182 
TCONS_00047071;noncoding;0.919281 
TCONS_00040152;noncoding;0.988357 

The second file contains the same ID ($1 == $1) and a corresponding Gene name in field $2:

file2 (2.txt)

TCONS_00024657;KCNC2 
TCONS_00005907;CDH11 
TCONS_00013445;TAF12 
TCONS_00047071;MYO19 
TCONS_00040152;KCNC2 
TCONS_00017358;PAIP1 
TCONS_00001277;CHAD4 

I would like to add the Gene name ($2 from file2) as an additional column to file1 in a newly generated file3. Every Gene name is located somewhere in file2. Because of asymmetric distributed duplicates in file2 a simple sort after ID and cat doesn't work. At the moment I'm trying.

awk -F ";" > 3.txt 'NR == FNR {   
               _[$1] = $1; _[$2] = $2   
               next 
               } 
                  if (_[$1] == $1)  {   
                  print $0, _[$2]   
              }' 2.txt 1.txt  

Any help would be awesome!!!

Try:

awk -F";" 'NR==FNR{a[$1]=$2;next}{$4=a[$1]}1' OFS=";" 2.txt 1.txt
1 Like

Many thanks! It's perfect, going to study the magic of the working code.

TCONS_00017358;noncoding;0.968717;PAIP1 
TCONS_00024657;noncoding;0.995485;KCNC2 
TCONS_00005907;noncoding;0.808883;CDH11 
TCONS_00013445;noncoding;0.998353;TAF12 
TCONS_00001277;noncoding;0.934182;CHAD4
TCONS_00047071;noncoding;0.919281;MYO19 
TCONS_00040152;noncoding;0.988357;KCNC2

---------- Post updated at 03:21 PM ---------- Previous update was at 02:18 PM ----------

The duplicates seem to mess it up:( The script perfectly works on my example (now deemed to be incomplete), but if it get's to this:

file1 (ID + coding status + score)

TCONS_00000001;noncoding;0.833737
TCONS_00000002;coding;0.00522615
TCONS_00000004;coding;0.0284272
TCONS_00000006;noncoding;0.982945

file2 (ID + gen names)

TCONS_00000001;DDX11L1
TCONS_00000001;DDX11L1
TCONS_00000001;DDX11L1
TCONS_00000002;OR4F5
TCONS_00000004;LOC100133331
TCONS_00000004;LOC100133331
TCONS_00000004;LOC100133331
TCONS_00000004;LOC100133331
TCONS_00000003;LOC100132062
TCONS_00000003;LOC100132062
TCONS_00000003;LOC100132062
TCONS_00000005;OR4F3
TCONS_00000006;LOC643837
TCONS_00000006;LOC643837
TCONS_00000006;LOC643837
TCONS_00000010;LOC643837
TCONS_00000010;LOC643837
TCONS_00000010;LOC643837
TCONS_00000010;LOC643837
TCONS_00000010;LOC643837
TCONS_00000010;LOC643837
TCONS_00000010;LOC643837
TCONS_00000012;LOC643837
TCONS_00000012;LOC643837
TCONS_00000012;LOC643837
TCONS_00000012;LOC643837
TCONS_00000012;LOC643837
TCONS_00000012;LOC643837
TCONS_00000014;LOC643837
TCONS_00000014;LOC643837
TCONS_00000014;LOC643837
TCONS_00000014;LOC643837
TCONS_00000014;LOC643837
TCONS_00000016;LOC643837
TCONS_00000016;LOC643837
TCONS_00000016;LOC643837
TCONS_00000016;LOC643837
TCONS_00000017;LOC643837
TCONS_00000017;LOC643837
TCONS_00000017;LOC643837
TCONS_00000017;LOC643837
TCONS_00000015;LOC643837
TCONS_00000015;LOC643837
TCONS_00000015;LOC643837
TCONS_00000015;LOC643837
TCONS_00000015;LOC643837
TCONS_00000015;LOC643837
TCONS_00000013;LOC643837
TCONS_00000013;LOC643837
TCONS_00000013;LOC643837
TCONS_00000013;LOC643837
TCONS_00000013;LOC643837
TCONS_00000013;LOC643837
TCONS_00000013;LOC643837
TCONS_00000011;LOC643837
TCONS_00000011;LOC643837
TCONS_00000011;LOC643837
TCONS_00000011;LOC643837
TCONS_00000011;LOC643837
TCONS_00000011;LOC643837
TCONS_00000011;LOC643837
TCONS_00000011;LOC643837
TCONS_00000009;LOC643837
TCONS_00000009;LOC643837
TCONS_00000009;LOC643837
TCONS_00000009;LOC643837
TCONS_00000009;LOC643837
TCONS_00000009;LOC643837
TCONS_00000009;LOC643837
TCONS_00000009;LOC643837
TCONS_00000008;LOC643837
TCONS_00000008;LOC643837
TCONS_00000008;LOC643837
TCONS_00000008;LOC643837

the script output changes to:

TCONS_00000001;noncoding;0.833737   
TCONS_00000002;coding;0.00522615   
TCONS_00000004;coding;0.0284272   
TCONS_00000006;noncoding;0.982945

That should not be - it should use the last occurrence of $1 for the replacement. In your above case, the $2 to the respective $1 are identical, so this would not matter anyhow. Applying Bartus11's proposal to your two files yields

TCONS_00000001;noncoding;0.833737;DDX11L1
TCONS_00000002;coding;0.00522615;OR4F5
TCONS_00000004;coding;0.0284272;LOC100133331
TCONS_00000006;noncoding;0.982945;LOC643837
  • which is fine from my perspective...

Thanks! I'm checking for format issues in the input files!

---------- Post updated at 08:22 PM ---------- Previous update was at 04:18 PM ----------

Still can't get it working:( With 40 lines things run smoothly but with 45000 lines - invain.

---------- Post updated at 08:27 PM ---------- Previous update was at 08:22 PM ----------

Could there be a size limit with AWK? (file1 1.5 MB, file2 10 MB)

Unlikely, 45000 lines is nothing. What is your OS and version? What isn't working? Do you get output like you described in post #3, or is it different ? Does it work when you download your own samples in post #3 and run Bartus11' suggestion on them?

kben, post following output (using your target, "big" files):

head 1.txt | cat -ev
head 2.txt | cat -ev
1 Like

Thanks!!! It's a damn formatting issue. On Textwrangler and Fraise the files appear clean, but "head file | cat -ev" shows "^M" as file seperators in files saved on a Mac (and originating from a linux cluster).

test$ head 1.txt | cat -ev
TCONS_00001761;coding;2,18E-01^MTCONS_00004368;coding;8,83E-01^MTCONS_00009964;coding;1,78E-03^MTCONS_00006220;coding;3,94E-05^MTCONS_00025442;coding;1,17E-03^MTCONS_00022769;coding;8,06E-05^MTCONS_00028104;coding;1,04E-01^M...

test$ head 2.txt | cat -ev
TCONS_00000001;DDX11L1^MTCONS_00000001;DDX11L1^MTCONS_00000001;DDX11L1^MTCONS_00000002;OR4F5^MTCONS_00000004;LOC100133331^M...

Run:

tr -d '\r' < 1.txt > 1.txt.fixed
tr -d '\r' < 2.txt > 2.txt.fixed
1 Like

Now I'm trying to get the "^M"-free file content back into columns.

TCONS_00001761;coding;2,18E-01;TCONS_00004368;coding;8,83E-01;TCONS_00009964;coding;1,78E-03;TCONS_00006220;coding;3,94E-05;TCONS_00025442;coding;1,17E-03;TCONS_00022769;coding;8,06E-05;TCONS_00028104;coding;1,04E-01;TCONS_00001332;coding;0.323523;TCONS_00019267;coding;3,00E-14;TCONS_00047499;coding;1,53E-01;TCONS_00036481;coding;3,00E-14;TCONS_00029157;coding;3,00E-14;TCONS_00006069;coding;3,24E-04;TCONS_00024502;coding;3,00E-14;TCONS_00028328;coding;2,21E-01;TCONS_00015161;noncoding;0.727496;TCONS_00033850;coding;5,49E-03;TCONS_00030020;noncoding;0.900574;TCONS_00003686;coding;0.00313452;TCONS_00042001;coding;3,77E-06;TCONS_00014006;coding;2,31E-03;TCONS_00039845;coding;0.0854184;TCONS_00006099;coding;8,96E-07;TCONS_00026444;noncoding;0.768354;TCONS_00013501;coding;6,58E-01;TCONS_00027619;coding;0.0326396;TCONS_00019013;coding;3,98E-06;TCONS_00030372;coding;3,00E-14;TCONS_00018101;coding;0.453702;TCONS_00014309;coding;3,42E-01;TCONS_00011427;coding;1,48E-05;TCONS_00046933;coding;1,86E-04;TCONS_00030752;coding;4,34E-04;TCONS_00029573;coding;0.02172;TCONS_00022475;coding;0.159075;TCONS_00025188;coding;6,35E-06;TCONS_00028360;coding;0.00379242;TCONS_00006622;coding;0.0117992;TCONS_00044979;coding;0.322987;TCONS_00029922;noncoding;0.975846;TCONS_00006000;noncoding;0.980548;TCONS_00028073;coding;3,00E-14;TCONS_00034565;coding;2,24E-05;TCONS_00042371;coding;3,93E-07;TCONS_00000095;coding;2,83E-02;TCONS_00006227;coding;2,32E-07;TCONS_00012737;coding;8,40E-06;TCONS_00026492;coding;2,68E-02;TCONS_00037061;coding;1,90E-04;TCONS_00037894;coding;1,93E-02;TCONS_00009621;noncoding;0.820937;TCONS_00046675;coding;2,56E-01;TCONS_00010598;coding;0.154984;TCONS_00019870;coding;5,02E-05;TCONS_00036710;coding;0.104934;TCONS_00004992;coding;3,00E-14;TCONS_00030192;noncoding;0.952787;TCONS_00032739;coding;0.119714;TCONS_00013170;coding;3,00E-14;TCONS_00011654;coding;0.345494;TCONS_00017338;coding;0.476682;TCONS_00044312;coding;0.0181988;TCONS_00010433;coding;2,18E-05;TCONS_00024244;coding;7,61E-02;TCONS_00012301;noncoding;0.993394;TCONS_00004278;coding;0.00276243;TCONS_00039637;noncoding;0.607583;TCONS_00036444;coding;3,00E-14;TCONS_00030146;coding;0.00364833;TCONS_00030592;coding;9,39E-03;TCONS_00011816;coding;2,91E-04;TCONS_00044901;coding;0.203805;TCONS_00002945;coding;0.48936;TCONS_00042063;coding;1,20E-08;TCONS_00028949;coding;1,11E-01;TCONS_00022958;coding;0.00317738;TCONS_00029883;coding;3,00E-14;TCONS_00026688;coding;0.0461262;TCONS_00027145;coding;0.0542434;TCONS_00005559;coding;4,60E-05;TCONS_00022170;coding;6,32E-03;TCONS_00016267;coding;0.0355581;TCONS_00047258;coding;2,21E-04;TCONS_00038771;coding;4,89E-10;TCONS_00040979;noncoding;0.971459;TCONS_00032362;coding;1,45E+00;TCONS_00031102;coding;0.0426105;TCONS_00024253;noncoding;0.806975;TCONS_00013717;coding;0.129065;TCONS_00025786;coding;7,29E-04;TCONS_00000435;coding;4,85E-02;TCONS_00032683;coding;7,94E-06;TCONS_00020668;coding;0.00324151;TCONS_00003583;noncoding;0.611441;TCONS_00019680;noncoding;0.662171;TCONS_00017036;coding;7,99E-01;TCONS_00005339;coding;8,64E-03;TCONS_00002753;coding;1,02E+00;TCONS_00029862;coding;5,92E-04;TCONS_00025100;coding;0.258693;TCONS_00039934;coding;7,34E-03;TCONS_00008884;coding;1,61E-02;TCONS_00039152;noncoding;0.551458;TCONS_00033926;coding;6,09E-07;TCONS_00008328;noncoding;0.887499;TCONS_00043013;coding;0.0242812;TCONS_00017807;coding;3,00E-14;TCONS_00031465;coding;1,51E-04;TCONS_00003043;coding;6,12E-07;TCONS_00020339;coding;1,58E-07;TCONS_00011637;coding;4,62E-04;TCONS_00026818;coding;3,92E-02;TCONS_00027277;coding;0.0031664;TCONS_00027121;coding;2,61E-03;TCONS_00014963;coding;4,25E-03;TCONS_00032455;coding;1,27E-01;TCONS_00024237;coding;4,26E-09;TCONS_00024931;coding;1,18E-08;TCONS_00009840;noncoding;0.798276;TCONS_00021981;coding;2,44E-04;TCONS_00043107;coding;0.00564581;TCONS_00031555;coding;0.0115258;TCONS_00003642;noncoding;0.946168;TCONS_00046054;coding;0.131067;TCONS_00009405;noncoding;0.589966;TCONS_00023326;coding;

I'm looking at "column" and "pr" at the moment.

Hmm, it seems that the previous file didn't contain Windows style line ends, but something else (OSX maybe?). If you still have the original files (containing ^M characters), then run this on them:

tr '\r' '\n' < 1.txt > 1.txt.really.fixed
tr '\r' '\n' < 2.txt > 2.txt.really.fixed
1 Like

Many thanks again! :b: Now everything runs smoothly. Your advice was a very big help for a unix novice! This will change (partly) :slight_smile:

TCONS_00033802;coding;1,48E-05;PCDHGA4
TCONS_00012432;coding;1,12E-05;TOX4
TCONS_00041012;coding;5,15E-07;VPS41
TCONS_00020923;noncoding;0.91566;ZNF85
TCONS_00018944;coding;8,62E-04;KRT39
TCONS_00029495;noncoding;0.52531;EBLN2
TCONS_00014097;coding;3,00E-14;MAPKBP1

Note: CR-terminated lines ( \r ) are typically MacOs 9 or older