I want to take a string (or substring) from the second column of each record in file1, search for it in file2, and append the first matching record from file2 to the end of that record in file1. Both files are tab-delimited.
Lines with KOG in column 13 do not need to be searched, as the string will not be found.
Here is the logic in my head, which needs to be translated into code.
For every line of file1 that does not contain the keyword 'KOG' in column 13:
Extract the substring to search for. When the second column starts with sp| or tr|, the search string is the part between the first and second pipes.
For example, when the value is sp|P32770|NRP1_YEAST, the string to be searched is P32770;
when the value is tr|N1PNC6|N1PNC6_MYCP1, the string to be searched is N1PNC6.
If the second column does not start with sp| or tr| (it has values like NP_001059837 or AEW46684), then the entire string needs to be searched.
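The extraction rule above could be sketched roughly as follows in Python. The function name extract_key is made up for illustration, and falling back to the whole value when a sp|/tr| entry has no second field is an assumption, not something stated in the question.

```python
def extract_key(col2: str) -> str:
    """Return the string to search for, given the second column of file1."""
    # Values such as sp|P32770|NRP1_YEAST carry the accession between the pipes.
    if col2.startswith(("sp|", "tr|")):
        parts = col2.split("|")
        # Assumption: if the accession field is missing or empty, use the whole value.
        if len(parts) >= 2 and parts[1]:
            return parts[1]
    # Anything else (NP_001059837, AEW46684, ...) is searched as-is.
    return col2
```

For example, extract_key("tr|N1PNC6|N1PNC6_MYCP1") gives N1PNC6, while extract_key("AEW46684") returns the value unchanged.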
If searched string is found, append the entire first matching line in file2 to the end of the corresponding record in file1 after a tab.
A lot of the second column values repeat, so it would be good if the search for the same value is done only once. For example AEW46684 occurs in the second column of file1 119 times, so searching it just once might save computation with these huge files.
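One way to make sure each value is searched only once is to collect the distinct search strings from file1 first. The sketch below is only an illustration: the names extract_key and unique_keys are hypothetical, and it assumes that lines with fewer than 13 columns should still be searched.

```python
def extract_key(col2):
    # sp|P32770|NRP1_YEAST -> P32770; other values are returned unchanged.
    if col2.startswith(("sp|", "tr|")):
        parts = col2.split("|")
        if len(parts) >= 2 and parts[1]:
            return parts[1]
    return col2


def unique_keys(file1_lines):
    """Collect each distinct search string from file1 exactly once."""
    keys = set()
    for line in file1_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # no second column, nothing to search for
        if len(fields) >= 13 and "KOG" in fields[12]:
            continue  # column 13 contains KOG: skip, it will not be found
        keys.add(extract_key(fields[1]))
    return keys
```

With this, a value like AEW46684 that occurs 119 times in file1 ends up in the set a single time, so file2 only has to be checked for it once.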
Since there are many columns in the samples, I have attached sample inputs and output.
It is not very clean, but try this.
You can optimize it to some extent, depending on whichever file you think might be bigger.
The highlighted part is what you were asking for, the substring logic.
ahamed, the code has been running for 6 hrs without any output so far. file1 is 29 MB and file2 is 8 GB. Is there a way to speed things up? Also, I think searching for the same string just once will save a lot of time.
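Since file1 (29 MB) is the small file, one possible speed-up is to hold its distinct search strings in memory and read the 8 GB file2 only once, stopping as soon as every string has been found. This is a rough Python sketch, not the code posted in this thread. The function names are made up, and it assumes each search string appears in file2 as a whole token made of letters, digits and underscores (so NP_001059837.1 still matches NP_001059837). A plain substring search anywhere in the line would need a different approach.

```python
import re

# Assumption: search strings consist only of letters, digits and underscores.
TOKEN = re.compile(r"[A-Za-z0-9_]+")


def first_matches(file2_lines, keys):
    """Scan file2 once and return {search string: first matching line}."""
    pending = set(keys)
    found = {}
    for line in file2_lines:
        if not pending:
            break  # every search string already has its first match
        for token in TOKEN.findall(line):
            if token in pending:
                found[token] = line.rstrip("\n")
                pending.discard(token)
    return found


def append_matches(records, found):
    """records: (file1 line, search string or None) pairs; yields output lines."""
    for line, key in records:
        line = line.rstrip("\n")
        if key is not None and key in found:
            yield line + "\t" + found[key]  # append the file2 line after a tab
        else:
            yield line  # KOG line or no match: leave the record unchanged
```

Passing an open file handle as file2_lines keeps memory use low, because only the set of search strings and the matched lines are held in memory.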
RudiC, I did test with smaller files, and it works fine.
Ahamed, the second code has started producing appropriate output; the first code still hasn't.
Thank you again, I will wait it out. It should be done in 5-6 days, going by the amount of output.