Match strings in 2 different files

redse171 · December 13, 2013, 12:47pm

Hi,

I am trying to match strings from two different files based on position, as shown below:

file1 (tab delimited)

f07270 lololol  fff
u12730 gggddd  dddkkrr  mmm

file2 (not tab delimited)

%f07270 APSLH bl%alalalalallaadsdsfdfdfdgsgfss
%g13450 GDIDFLRIP%ILITEAPPRKgsfgsgsf
%d08880 pve1_%00%39444gsgfsg
0 tog(%s)gsfgfss

file3 (output)

%f07270 APSLH bl%alalalalallaadsdsfdfdfdgsgfss
0 tog(%s)gsfgfss

the code:

awk 'FNR==NR{_[substr($0,1,6)]; next}{for(i in _){if(i==substr($0,1,7)){print $0}}}' file1 file2 > file3

Am I missing something? I just get a blank file as output. Kindly help, thanks.

Franklin52 · December 13, 2013, 1:51pm

awk 'FNR==NR{_[substr($0,1,6)]; next}{for(i in _){if(i==substr($0,1,7)){print $0}}}' file1 file2 > file3

Should be:

awk 'FNR==NR{_[substr($0,1,6)]; next}{for(i in _){if(i==substr($0,2,6)){print $0}}}' file1 file2 > file3

Or shorter:

awk 'FNR==NR{_[$1]; next} substr($0,2,6) in _' file1 file2 > file3
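To see why the original comparison never succeeds, it may help to print the substrings side by side. This is only a minimal sketch: it assumes the first line of each sample file from the question, and the temporary directory and the variable names `key`, `wrong` and `right` are made up for illustration.

```shell
# Illustrative only: temp paths and variable names are not from the thread.
tmp=$(mktemp -d)
printf 'f07270\tlololol\tfff\n' > "$tmp/file1"
cat > "$tmp/file2" <<'EOF'
%f07270 APSLH bl%alalalalallaadsdsfdfdfdgsgfss
EOF

# Key stored from file1: the first 6 characters of the line.
key=$(awk '{print substr($0,1,6)}' "$tmp/file1")
# What the original code compared against: 7 characters, "%" included.
wrong=$(awk '{print substr($0,1,7)}' "$tmp/file2")
# What the corrected code compares against: 6 characters from position 2.
right=$(awk '{print substr($0,2,6)}' "$tmp/file2")

printf 'key=[%s] wrong=[%s] right=[%s]\n' "$key" "$wrong" "$right"
rm -rf "$tmp"
```

This prints `key=[f07270] wrong=[%f07270] right=[f07270]`: the 7-character substring starts with "%" and so can never equal a 6-character key, which would explain the empty output file.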

Akshay_Hegde · December 13, 2013, 2:09pm

OR

$ awk -F'[% \t]' 'FNR==NR{A[$1];next}$2 in A' file1 file2

Franklin52 · December 13, 2013, 2:18pm

This works only if the second file uses a space as the field separator, as in the provided example, but who knows?

Akshay_Hegde · December 13, 2013, 2:34pm

Yes Franklin52, that's true. But I thought it would be better not to use substr($0,2,6), since we don't really know how many characters the ID has, or whether it is of fixed length in the real data.

---edit---

Or this could be an alternative if the file is space separated:

$ awk  'FNR==NR{A[$1];next}substr($1,2) in A'  file1 file2
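A small sketch of the point about ID length. The IDs and record text below are made up (they are not from the thread's data) and deliberately have different lengths, so a fixed `substr($0,2,6)` would not fit them, while `substr($1,2)` takes everything after the "%" in the first field.

```shell
# Hypothetical sample data with IDs of different lengths.
tmp=$(mktemp -d)
printf 'ab1\tx\nlongid42\ty\n' > "$tmp/file1"
cat > "$tmp/file2" <<'EOF'
%ab1 first record
%zz9 no match here
%longid42 second record
EOF

# Store field 1 of file1 as keys; for file2, strip the leading "%"
# from field 1 and print the line if the remainder is a known key.
out=$(awk 'FNR==NR{A[$1];next} substr($1,2) in A' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$out"
rm -rf "$tmp"
```

With this input, only the `%ab1` and `%longid42` lines are printed. As noted above, this relies on whitespace separating the ID from the rest of the line in file2.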

redse171 · December 13, 2013, 7:30pm

Hi guys,

It took me some time to reply as I am still working on the code and my data. Actually, the code works fine, but I have another issue. My file2 is huge and contains a variety of data. The issue is that not all of the data is printed out because of newlines. I want all data to be printed starting from "%" and stopping when the next "%" is reached. For example:

correct output file

%f07270 APSLH bl%alalalalallaadsdsfdfdfdgsgfss
ffffffffffgggggggggggggrrrrrrrrrrrrrwwwwww333r22356676
ujhjhkjkuuuuuuuuu2228888fddddddddddddddddererererer30tog
0 tog(%s)gsfgfss
ffffffffddddddddddssssssssssffffffffffgggggggggghhhhhhhhy
ssssssssssssfdfgggggggghhhhhhhhhhhhhhhhhhhhrrrrreee

With the given code, I get results only up to the end of the line. I checked the awk manual in a book and on the internet and tried a couple of things, amending your code to get the desired output, but failed. I am sure there is a simple solution for this, but I just couldn't figure it out. My mistake for giving you such a short sample of data.

Scrutinizer · December 13, 2013, 9:31pm

Could you specify samples of the input files that would need to lead to that output?

redse171 · December 13, 2013, 10:27pm

Hi,

Sorry for the confusion. Below is another simple sample from my data:

file1

file2

I have bolded the matched IDs. The output should be all the data in file2 where the ID (the first 5 characters after "%") matches an ID in file1, as follows:

file3

In other words, I have huge data in file2 with records that can span multiple lines separated by "\n". Therefore, I need code that prints from the first character, starting at "%", and stops when it reaches the next "%", whenever the ID matches one of the IDs in file1. Thanks.

Akshay_Hegde · December 14, 2013, 12:11am

Try this; it works for the input in post #8:

$ awk 'FNR == NR{A[$1]; next}/^\%/{f = (substr($1,2,6) in A) ? 1 : 0}f' file1 file2

%f07270 APSLH bl%alalalalallaadsdsfdfdfdgsgfsshhhhddddddddddhd234@678
dffffffffffgggggggggggggrrrrrrrrrrrrrwwwwww333r22356676dddddassfssfsfrdd
ujhjhkjkuuuuuuuuu2228888fddddddddddddddddererererer30
%q5548d alalalaaaaaaaaaaaaaaaa(aaaaaa)faagaaaaadddddddddddddd%68dd
ddddddddddddddd

redse171 · December 14, 2013, 12:34am

Hi Akshay,

Your code works perfectly! Thanks so much. I don't understand some of it, though. Can you please explain the part that I colored in red?

awk 'FNR == NR{A[$1]; next}/^\%/{f = (substr($1,2,6) in A) ? 1 : 0}f' file1 file2

thanks...

Akshay_Hegde · December 14, 2013, 12:42am

f = (substr($1,2,6) in A) ? 1 : 0 --> if the six characters of field 1 in file2, starting from the 2nd character, are found in array A (built from file1), then f is set to 1, otherwise 0.

1 --> true and,
0 --> false

Whenever it is true, that is f=1, the line is printed.
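The same flag logic can be written out over several lines with comments. This is a sketch, not a replacement for the one-liner: the sample files are assumed for illustration (the inputs from post #8 are not reproduced here), and the names `ids` and `keep` stand in for `A` and `f`. The backslash before "%" is dropped because "%" needs no escaping in an awk regex.

```shell
# Assumed sample data: two wanted IDs and one record that should be skipped.
tmp=$(mktemp -d)
printf 'f07270\nq5548d\n' > "$tmp/file1"
cat > "$tmp/file2" <<'EOF'
%f07270 APSLH bl%alalal
continuation of f07270
%g13450 GDIDFLRIP%ILITE
continuation of g13450
%q5548d alalala(aaaaaa)%68dd
continuation of q5548d
EOF

out=$(awk '
    # First file: remember every ID (field 1) as an array key.
    FNR == NR { ids[$1]; next }
    # Second file: a line starting with "%" opens a new record,
    # so re-evaluate the flag from the 6 characters after the "%".
    /^%/ { keep = (substr($1, 2, 6) in ids) }
    # Print the line while the flag is set; continuation lines inherit it.
    keep
' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$out"
rm -rf "$tmp"
```

Here the `%g13450` record and its continuation line are skipped, because the flag stays 0 until the next line that begins with "%". A "%" in the middle of a line does not reset the flag, since the pattern is anchored with `^`.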

redse171 · December 14, 2013, 12:46am

Hi,

Got it.. thanks a lot!! :))