Match strings in 2 different files

Hi,

I am trying to match strings from 2 different files based on position, as shown below:

file1 (tab delimited)

f07270 lololol  fff
u12730 gggddd  dddkkrr  mmm

file2 (not tab delimited)

%f07270 APSLH bl%alalalalallaadsdsfdfdfdgsgfss
%g13450 GDIDFLRIP%ILITEAPPRKgsfgsgsf
%d08880 pve1_%00%39444gsgfsg
0 tog(%s)gsfgfss  

file3 (output)

%f07270 APSLH bl%alalalalallaadsdsfdfdfdgsgfss
0 tog(%s)gsfgfss  

the code:

awk 'FNR==NR{_[substr($0,1,6)]; next}{for(i in _){if(i==substr($0,1,7)){print $0}}}' file1 file2 > file3

Am I missing something? I just get a blank file for the output. Kindly help, thanks.

awk 'FNR==NR{_[substr($0,1,6)]; next}{for(i in _){if(i==substr($0,1,7)){print $0}}}' file1 file2 > file3

Should be:

awk 'FNR==NR{_[substr($0,1,6)]; next}{for(i in _){if(i==substr($0,2,6)){print $0}}}' file1 file2 > file3

Or shorter:

awk 'FNR==NR{_[$1]; next} substr($0,2,6) in _' file1 file2 > file3
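To see why the original attempt printed nothing: the keys stored from file1 are 6 characters long, while substr($0,1,7) on a file2 line returns 7 characters that still include the leading "%", so the two can never be equal. A minimal sketch that reproduces both cases on a trimmed copy of the sample data (the temp directory and the variable names are only for illustration, and the comparison is written with `in` instead of the loop, which tests the same thing):

```shell
#!/bin/sh
# Illustrative only: trimmed sample data from the thread, written to a temp dir.
tmp=$(mktemp -d)
printf 'f07270\tlololol\tfff\nu12730\tgggddd\tdddkkrr\tmmm\n' > "$tmp/file1"
printf '%s\n' '%f07270 APSLH bl%alalalalallaadsdsfdfdfdgsgfss' \
              '%g13450 GDIDFLRIP%ILITEAPPRKgsfgsgsf' > "$tmp/file2"

# Original comparison: 6-char key against 7 chars that include the leading "%".
broken=$(awk 'FNR==NR{k[substr($0,1,6)]; next} substr($0,1,7) in k' \
    "$tmp/file1" "$tmp/file2")

# Corrected comparison: start at position 2 to skip the "%".
fixed=$(awk 'FNR==NR{k[substr($0,1,6)]; next} substr($0,2,6) in k' \
    "$tmp/file1" "$tmp/file2")

printf 'broken: [%s]\n' "$broken"
printf 'fixed:  [%s]\n' "$fixed"
rm -rf "$tmp"
```

The first result should be empty and the second should contain only the %f07270 line.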

OR

$ awk -F'[% ]' 'FNR==NR{A[$1];next}$2 in A' file1 file2

Works only if the second file has a space as field separator as in the provided example, but who knows?
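A related caveat, assuming file1 really contains tab characters as stated in the first post: with -F'[% ]' the first field of a file1 line is the whole tab-joined line, so nothing matches. One possible workaround is to add the tab to the separator set. This is only a sketch on made-up short lines, not tested on the real data:

```shell
#!/bin/sh
# Illustrative only: one tab-delimited file1 line, two space-separated file2 lines.
tmp=$(mktemp -d)
printf 'f07270\tlololol\tfff\n' > "$tmp/file1"
printf '%s\n' '%f07270 APSLH bl%alala' '%g13450 GDIDFLRIP%ILITE' > "$tmp/file2"

# "%" or space only: $1 of file1 is "f07270<TAB>lololol<TAB>fff", so no match.
space_only=$(awk -F'[% ]' 'FNR==NR{A[$1];next}$2 in A' "$tmp/file1" "$tmp/file2")

# "%", space or tab: $1 of file1 is just the id, $2 of file2 is the id after "%".
with_tab=$(awk -F'[% \t]' 'FNR==NR{A[$1];next}$2 in A' "$tmp/file1" "$tmp/file2")

printf 'space only: [%s]\n' "$space_only"
printf 'with tab:   [%s]\n' "$with_tab"
rm -rf "$tmp"
```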


Yes Franklin52, that's true, but I thought it would be better not to use substr($0,2,6), since we don't really know how many characters the id has, or whether it is of fixed length in the real data.

---edit---

Or this could be an alternative if file2 is space-separated:

$ awk  'FNR==NR{A[$1];next}substr($1,2) in A'  file1 file2
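A quick sketch of what this variant buys: substr($1,2) takes everything in the first field after the "%", so ids of different lengths still match. The ids ab1 and zz99 below are hypothetical, made up only to show the length independence:

```shell
#!/bin/sh
# Illustrative only: hypothetical ids of different lengths.
tmp=$(mktemp -d)
printf 'ab1\tx\nf07270\ty\n' > "$tmp/file1"
printf '%s\n' '%ab1 foo' '%f07270 bar' '%zz99 baz' > "$tmp/file2"

# Default field splitting handles both the tabs in file1 and the spaces in file2.
matched=$(awk 'FNR==NR{A[$1];next}substr($1,2) in A' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$matched"
rm -rf "$tmp"
```

Both the 3-character and the 6-character id lines should be printed, and the %zz99 line should not.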

Hi guys,

It took me some time to reply as I am still working on the code and my data. The code actually works fine, but I have another issue. My file2 is huge and contains a variety of data. Not all of the data is printed out because of newlines. I want all data to be printed starting from a "%" and stopping when the next "%" is reached. For example,

correct output file

%f07270 APSLH bl%alalalalallaadsdsfdfdfdgsgfss
ffffffffffgggggggggggggrrrrrrrrrrrrrwwwwww333r22356676
ujhjhkjkuuuuuuuuu2228888fddddddddddddddddererererer30tog
0 tog(%s)gsfgfss
ffffffffddddddddddssssssssssffffffffffgggggggggghhhhhhhhy
ssssssssssssfdfgggggggghhhhhhhhhhhhhhhhhhhhrrrrreee

With the given code, I only get results up to the end of the line. I checked the awk manual in a book and on the internet and tried a couple of things by amending your code to get the desired output, but failed. I am sure there is a simple solution for this, but I just couldn't figure it out. It was my mistake to give you such a short data sample.

Could you post samples of the input files that should produce that output?

Hi,

Sorry for the confusion. Below is another simple sample from my data:

file1

file2

I bolded the matched ids. The output should be all the data in file2 where the id (the first 6 characters after "%") matches an id in file1, as follows:

file3

In other words, I have huge data in file2 with records that can continue over multiple lines separated by "\n". Therefore, once the id matches one of the file1 ids, I need code that prints from the first character starting with "%" and stops when it reaches the next "%". Thanks.

Try this; it works for the input in #8:

$ awk 'FNR == NR{A[$1]; next}/^\%/{f = (substr($1,2,6) in A) ? 1 : 0}f' file1 file2

%f07270 APSLH bl%alalalalallaadsdsfdfdfdgsgfsshhhhddddddddddhd234@678
dffffffffffgggggggggggggrrrrrrrrrrrrrwwwwww333r22356676dddddassfssfsfrdd
ujhjhkjkuuuuuuuuu2228888fddddddddddddddddererererer30
%q5548d alalalaaaaaaaaaaaaaaaa(aaaaaa)faagaaaaadddddddddddddd%68dd
ddddddddddddddd 

Hi Akshay,

Your code works perfectly!!! Thanks so much... I don't understand some of it. Can you please explain the part that I colored in red?

awk 'FNR == NR{A[$1]; next}/^\%/{f = (substr($1,2,6) in A) ? 1 : 0}f' file1 file2

Thanks... :)

f = (substr($1,2,6) in A) ? 1 : 0 --> if the 6 characters starting at the 2nd character of field 1 in file2 are found as a key in array A (built from file1), then f is set to 1, otherwise to 0.

1 --> true, and
0 --> false

Whenever it's true, that is f=1, the line is printed.
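The same one-liner can be spread over several lines with comments, which may make the flag logic easier to follow. This is just a sketch on a made-up four-line file2; the names ids and keep are renamed from A and f for readability, and the ternary is dropped because `in` already yields 1 or 0:

```shell
#!/bin/sh
# Illustrative only: two short records, each with one continuation line.
tmp=$(mktemp -d)
printf 'f07270\tlololol\tfff\n' > "$tmp/file1"
printf '%s\n' '%f07270 APSLH bl%alalal' \
              'ffffffffffggggg' \
              '%g13450 GDIDFLRIP%ILITE' \
              'hhhhhhhhhh' > "$tmp/file2"

out=$(awk '
    # First file: remember every id (first field) as an array key.
    FNR == NR { ids[$1]; next }

    # A line starting with "%" opens a new record: the flag becomes 1
    # if its 6-character id is a known key, otherwise 0.
    /^%/ { keep = (substr($1, 2, 6) in ids) }

    # A bare pattern prints the line whenever it is true, so the header and
    # its continuation lines are printed until the next "%" header resets it.
    keep
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Only the %f07270 header and the continuation line below it should come out; the %g13450 record resets the flag to 0, so its continuation line is skipped too.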


Hi,

Got it.. thanks a lot!! :))