Search for a regular expression and match entries in two (or more) files using bash

Dear all,

I have a specific problem that I don't quite understand how to solve. I have two files, both of the same format:

XXXXXX_FIND1 bla bla bla
bla
bla
bla
bla
bla
bla
bla
bla
bla

(return)
XXXXXX_FIND2 bla bla bla
bla
bla
bla
bla
bla
bla
bla
bla
bla

(return)
etc...

The problem is that the entries are in random order: file 1 contains XXXXXX_FIND1, XXXXXX_FIND3, XXXXXX_FINDX and so on, and file 2 contains the same kind of entries, but in a different order.
What I want to do is create a new file in which matching entries from the two files are placed next to each other, like:

XXXXXX_FIND1 bla bla bla
bla
bla
bla
bla
bla
bla
bla
bla
bla

(return)
XXXXXX_FIND1 bla bla bla
bla
bla
bla
bla
bla
bla
bla
bla
bla

(return)
XXXXXX_FIND2 bla bla bla
bla
bla
bla
bla
bla
bla
bla
bla
bla

(return)
XXXXXX_FIND2 bla bla bla
bla
bla
bla
bla
bla
bla
bla
bla
bla

Note that:
1) I don't know the letters/numbers in FIND1, FIND2 etc. in advance, but they match between the files, and they are always five characters long.
2) There are entries that do not match; those should not be included.
3) "bla" stands for information that does not match between the files, and some entries have more lines of "bla" than others!

Is this possible to do with bash or awk?

Thank you in advance!

Try:

cat file1 file2 | perl -n0e 'while(/.{6}_(FIND\d+).*?========\n/sg){$h{$1}.=$&};print $h{$_} for (sort keys %h)'

Thanks for the quick reply, Bartus! However, I should emphasize that FIND1, FIND2 etc. do not literally look like that. For example, they can be ABC1D, RTGQ1 etc. So, a random combination of numerals and letters...

Therefore, the only things the files have in common are the separation between entries and, within each entry, the part after the XXXXXX_ , which is composed of 5 characters and should match between the entries from each file...

Thanks again for the help!

Try this:

cat file1 file2 | perl -n0e '$h{$1}.=$& while(/X{6}_(\w{4}\d).*?========\n/sg);print $h{$_} for (sort keys %h)'
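
A note on the pattern above: X{6} matches six literal X characters, and \w{4}\d requires the fifth character of the ID to be a digit, so an ID such as ABC1D would not match. The difference can be sketched in Python (not Perl), assuming a header line starts with six arbitrary characters, an underscore, and exactly five letters or digits. The names FIVE_ALNUM and extract_id are hypothetical and only serve the illustration:

```python
import re

# Header part of the pattern used in the one-liner above.
ONE_LINER = re.compile(r"X{6}_(\w{4}\d)")

# Assumed alternative: any six characters, "_", then exactly five
# letters or digits (the \b rejects IDs longer than five characters).
FIVE_ALNUM = re.compile(r"^.{6}_([A-Za-z0-9]{5})\b")


def extract_id(header, pattern=FIVE_ALNUM):
    """Return the ID captured from a header line, or None if it does not match."""
    m = pattern.search(header)
    return m.group(1) if m else None


if __name__ == "__main__":
    for line in ("QWERTY_ABC1D bla bla", "XXXXXX_ABC1D bla bla"):
        print(line, "->", extract_id(line), "/", extract_id(line, ONE_LINER))
```

With the assumed pattern, the ID is captured regardless of what the first six characters are or where the digits sit inside the ID.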

still not working...
I would like to show two example output files, so that you can have a better idea of the output (see attached archive). I am new to scripting and I think that I didn't describe the problem precisely.

Thanks a lot for seeing through my problem.

OK, so I can see that those "XXXXXX" weren't literal X characters. So how should those records be sorted? Only based on the part after "_"? If that is the case, then try this:

cat file1 file2 | perl -n0e '$h{$1}.=$& while(/.{6}_(\w+).*?=+\n/sg);print $h{$_} for (sort keys %h)'
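
For readers who do not know Perl, the logic of this one-liner can be sketched in Python: find every record, append it to a bucket named after the text following the underscore, then print the buckets in sorted key order. This is only an illustration under the assumption that every record ends with a line of "=" characters; group_records is a hypothetical name and the sample data is made up:

```python
import re
from collections import defaultdict

# Same pattern as the one-liner: six characters, "_", the key (\w+),
# then lazily everything up to the next run of "=" followed by a newline.
RECORD = re.compile(r".{6}_(\w+).*?=+\n", re.DOTALL)


def group_records(text):
    """Concatenate records that share a key; join the groups in sorted key order."""
    groups = defaultdict(str)
    for m in RECORD.finditer(text):
        groups[m.group(1)] += m.group(0)
    return "".join(groups[key] for key in sorted(groups))


if __name__ == "__main__":
    # Made-up sample standing in for the output of `cat file1 file2`.
    sample = (
        "AAAAAA_RTGQ1 from file 1\nbla\n========\n"
        "BBBBBB_ABC1D from file 1\nbla\n========\n"
        "CCCCCC_ABC1D from file 2\nbla\nbla\n========\n"
        "DDDDDD_RTGQ1 from file 2\nbla\n========\n"
    )
    print(group_records(sample), end="")
```

Because both files are read as one stream, the grouping does not record which file a record came from.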

The thing is that, if the files are concatenated with cat, then I lose the information about where each entry came from (file 1 or file 2).
So, I would like to have matches only between entries from file A and entries from file B.
Furthermore, I can see that the problem has another dimension: the entry after the _ is not unique. Therefore, an additional criterion is to match the string between tabs 7 and 8 of the line where the XXXXX_XXXXX is.
I think this should be matched first and then, once it matches, the matches should be refined according to the _XXXXX. Entries that are not matched should not be included in the output...

Well, I guess we didn't understand each other for now. I thought you wanted to get all the entries from both files sorted based on XXXXXX_RANDOM string. If that is not the case, then please be precise on what you want to get, providing sample data consisting of few records from both files and desired output for that sample.

That's because we come from different disciplines! :)

I have a short example with a readme file attached to this post as you have suggested... Hope it helps more, and we can understand each other!

Try this script:

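#!/usr/bin/perl
open A, "<", $ARGV[0] or die "Cannot open $ARGV[0]: $!";
open B, "<", $ARGV[1] or die "Cannot open $ARGV[1]: $!";
local $/;    # slurp mode: <A> and <B> each return a whole file as one string
$_=<A>;
# One record = a header line up to and including the next line of "=" signs.
# $1 = the ID after the first "_"; $3 = the field that follows five
# tab-terminated fields after the ID; $& = the whole record.
while (/^[^_]+_(\w+)([^\t]+\t){5}([^\t]+).*?=+\n/msg){
  $h{"$1$3"}=$& if ! $h{"$1$3"};    # remember the first file1 record per key
}
$_=<B>;
while (/^[^_]+_(\w+)([^\t]+\t){5}([^\t]+).*?=+\n/msg){
  # print the stored file1 record followed by the matching file2 record
  print $h{"$1$3"} . $& if exists $h{"$1$3"};
}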

Run it as: ./script.pl file1.txt file2.txt > output

This is magic! Thanks, bartus! This is exactly what I wanted!
If you have time, could you please explain the script to me?
Thanks again, man!
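
As a partial answer to the request for an explanation, the logic of the Perl script can be sketched in Python. This is an illustration, not a drop-in replacement: it assumes the same record layout the script relies on (a header containing an underscore, tab-separated fields, and a closing line of "=" characters), it uses the hypothetical name pair_records, and it looks keys up exactly in a dictionary using an (ID, field) pair as the key:

```python
import re
import sys

# The script's pattern, piece by piece:
#   ^[^_]+_           start of a line, everything up to the first "_"
#   (\w+)             group 1: the ID after the underscore
#   (?:[^\t]+\t){5}   five tab-terminated fields following the ID
#   ([^\t]+)          group 2: the next field (the script's $3)
#   .*?=+\n           lazily, the rest of the record up to a line of "="
RECORD = re.compile(
    r"^[^_]+_(\w+)(?:[^\t]+\t){5}([^\t]+).*?=+\n",
    re.MULTILINE | re.DOTALL,
)


def pair_records(text_a, text_b):
    """Return each record of text_b preceded by its partner from text_a."""
    first_seen = {}
    for m in RECORD.finditer(text_a):
        # Keep only the first record of file A for each (ID, field) key.
        first_seen.setdefault((m.group(1), m.group(2)), m.group(0))
    pairs = []
    for m in RECORD.finditer(text_b):
        key = (m.group(1), m.group(2))
        if key in first_seen:  # records without a partner are skipped
            pairs.append(first_seen[key] + m.group(0))
    return "".join(pairs)


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as fa, open(sys.argv[2]) as fb:
        sys.stdout.write(pair_records(fa.read(), fb.read()))
```

In short: the first loop builds a lookup table from file 1, and the second loop walks file 2 in order and emits a pair whenever the table has an entry under the same key, so the output follows the order of file 2 and unmatched entries from either file are left out.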