Match substring from a column of the second file

yifangt · September 8, 2017, 1:01pm

I want to merge the lines by matching substring of the first file with first column of the second file.

file1:

S00739A_ACAGTG_L001_R1.fq.gz
S00739A_ACAGTG_L001_R2.fq.gz
S00739B_GCCAAT_L001_R1.fq.gz
S00739B_GCCAAT_L001_R2.fq.gz
S00739D_GTGAAA_L001_R1.fq.gz
S00739D_GTGAAA_L001_R2.fq.gz
S00739E_ATCACG_L001_R1.fq.gz
S00739E_ATCACG_L001_R2.fq.gz

file2:

S00739A WT-1
S00739B WT-2
S00739D mt-1
S00739E mt-2

output:

S00739A_ACAGTG_L001_R1.fq.gz WT-1
S00739A_ACAGTG_L001_R2.fq.gz WT-1
S00739B_GCCAAT_L001_R1.fq.gz WT-2
S00739B_GCCAAT_L001_R2.fq.gz WT-2
S00739D_GTGAAA_L001_R1.fq.gz mt-1
S00739D_GTGAAA_L001_R2.fq.gz mt-1
S00739E_ATCACG_L001_R1.fq.gz mt-2
S00739E_ATCACG_L001_R2.fq.gz mt-2

This is my try, but this part has problem A[1]=$0 that makes the script invalid:

awk 'FNR==NR {split($1, A, "_"); A[1]=$0; next} {print A[$1], $2}' file1 file2

Thanks for any help!

RavinderSingh13 · September 8, 2017, 1:12pm

Hello yifangt,

Could you please try following and let me know if this helps you.

awk 'FNR==NR{a[$1]=$2;next} ($1 in a){print $0,a[$1]}'  Input_file2  FS="_"  Input_file1

Output will be as follows.

S00739A_ACAGTG_L001_R1.fq.gz WT-1
S00739A_ACAGTG_L001_R2.fq.gz WT-1
S00739B_GCCAAT_L001_R1.fq.gz WT-2
S00739B_GCCAAT_L001_R2.fq.gz WT-2
S00739D_GTGAAA_L001_R1.fq.gz mt-1
S00739D_GTGAAA_L001_R2.fq.gz mt-1
S00739E_ATCACG_L001_R1.fq.gz mt-2
S00739E_ATCACG_L001_R2.fq.gz mt-2

Thanks,
R. Singh

yifangt · September 8, 2017, 1:29pm

No, there is no output.
I am using GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.5, GNU MP 6.1.2)

RavinderSingh13 · September 8, 2017, 1:38pm

Hello yifangt,

Not sure, it worked perfectly fine for me as per POST#2 only, so there could be 2 possibilities in my point of view.

I- Either there could be carriage characters present into your Input_file, you could try with command cat -v Input_file , if you see carriage return characters then you could use command tr -d '\r' < Input_file > temp_file && mv temp_file Input_file .
II- Second option could be in case on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk .

Kindly do let me know how it goes then.

Thanks,
R. Singh

MadeInGermany · September 8, 2017, 1:59pm

Did you give the input files in the correct order?
It is important to first process the "mapping file", then the file that corresponds to the output file.
And the FS="_" is to be set after the "mapping file".

yifangt · September 8, 2017, 2:06pm

Thanks Ravi!
There was an issue with the weird carriage return in one of the files. It worked fine now!
I understand you changed the order of the files, and you used FS="_" for the second file, and hope this respond to what MadeinGermany reminded me.
But, I still have the question why the part A[1]=$0 in my script does not work.

RavinderSingh13 · September 8, 2017, 2:14pm

Hello yifangt,

So in your code why A[$1] is not working because when Input_file1 is being read then $1 will be whole line eg--> S00739A_ACAGTG_L001_R1.fq.gz and when Input_file2 is being read then $1 will be S00739A , so that is why A[$1]'s value will always be NULL and it will not print anything over there, kindly do let me know if you have any queries on same.

Thanks,
R. Singh

yifangt · September 8, 2017, 2:31pm

I meant A[1] = $0 for the mapping part as I thought A is the array from split(). So that later A[$2] will get what I want by $2 as the key/subscript of the array. What did I miss?

RavinderSingh13 · September 8, 2017, 2:36pm

Hello yifangt,

So here A[1] means array named A whose index is 1(digit one) and value is A[1]=current line of Input_file1 . So now when you try to print A[$2] or A[$1] then it means it will look for $2/$1 's value from current line from Input_file2 into array A (eg--> A[S00739A] ) which is NOT present at all in array A . Thus it will NOT print anything then. Kindly do let me know if I was NOT clear, will try to explain more on same then.

Thanks,
R. Singh

MadeInGermany · September 8, 2017, 3:30pm

split makes A[1] A[2] ...
Nothing that you can lookup via a column.
Your initial attempt would need a second array that is string-addressed.

awk 'FNR==NR {split($1, A, "_"); B[A[1]]=$0; next} {print B[$1], $2}' file1 file2

But now the output corresponds to file2.

RudiC · September 8, 2017, 3:35pm

awk 'FNR==NR {split($1, A, "_"); A[1]=$0; next} {print A[$1], $2}' file1 file2

Yes, A is the receiving array of the split() function. It has index values 1 .. 4 (which never will match $1 nor $2 in file2) and is overwritten for every line read from the input file, so after reading the entire file1 it will hold the last line in A[1] and the residual fields in A[2] till A[4], never to be matched by following records from file2.
Plus, with file2 being the last file worked upon, the output - should it be generated at all - would have four lines only.

yifangt · September 8, 2017, 3:47pm

I figured out the problem which should be:

awk 'FNR==NR {split($1, A, "_"); B[A[1]]=$0; next} {print B[$1],$2} ' file1 file2

Thanks again!
Aha, I am soooooo glad I got the same as MadeInGermany!

Thanks for all your input!

RudiC · September 8, 2017, 4:05pm

Are you aware that you don't get your desired output from post#1 with your approach in post#12?
It would yield four lines only, and all R1.fq.gz would have disappeared

yifangt · September 8, 2017, 5:03pm

Thanks RudiC!
I was too excited to notice the problem, which is a serious bug for sure.
Seems the files order must be changed because of the similarity of the _R1/R2.fq.gz file names.

awk 'FNR==NR {B[$1]=$2; next} split($1, A, "_"){print $0, B[A[1]]}' file2 file1

However, if I do not change the files order, how to fix this bug if possible, Please?
Thanks again!

RudiC · September 9, 2017, 5:33am

It is possible, but that solution wouldn't lend itself naturally. You'd need to suck in the entire first file - which I presume is larger - into RAM and either store it in two arrays (key and entire record) or have an algorithm search the key later. Then, for every line in file two, you'd need to run through ALL keys to find ALL occurrences of the key for printout. All that can become lengthy for huge files.