Shell script to search all files for every string in another file

Hello All

I have a pattern.txt file in the source directory (/project/source/) on a Linux server, and its data looks like:

123abc17
234cdf19
235ifg20

I have multiple log files in the log directory (/project/log/) on the same Linux server, and the data in one log file looks like:

<?xml version="1.0" processid is 123abc17
read successfully at 20161109093456
<?xml version="1.0" process id is 986bng21
read successfully at 20161109093459
message id aazzkk110 is 123abc17
message id aakjahsdk110 is 234cdf19
<?xml version="1.0" processid is 235ifg20
read successfully at 20161109093456
<?xml version="1.0" process id is 987skj29

I want to grep every string in the source file pattern.txt against all log files in the log directory
and write the output to an output.txt file in the directory /project/output/. The data should look like this:

<?xml version="1.0" processid is 123abc17
read successfully at 20161109093456
message id aazzkk110 is 123abc17
message id aakjahsdk110 is 234cdf19
<?xml version="1.0" processid is 235ifg20
read successfully at 20161109093456

If a string matches a line in any log file, I want to check whether that line starts with '<?xml version'. If it does, I want to write the following line to the output file along with the matching line;
otherwise I just want to write the matching line.

Please help me in writing a unix command / shell script to achieve it.

grep can use a pattern file:

grep -f /project/source/pattern.txt $(  find /path/to/files/ -type -name '*.log' )

Hi Jim

I ran below command and I got the following message about arguments.

grep -f /project/source/pattern.txt $(  find /project/log/ -type -name '*.log' ) 

'find: Arguments to -type should contain only one letter'

Please suggest what changes I need to make.

An f argument is missing.

You might get by with:

grep -f /project/source/pattern.txt $(  find /path/to/files/ -name '*.log' )

as long as you only have regular files with names ending with .log , but I would guess that Jim accidentally just dropped the letter f to only apply grep to regular files with names ending with .log .

grep -f /project/source/pattern.txt $(  find /path/to/files/ -type f -name '*.log' )

If there are lots of log files to be searched and/or they have long pathnames, you might be safer using:

find /path/to/files/ -name '*.log' -exec grep -f /project/source/pattern.txt {} +

which should avoid any cases where the first two commands might fail with an "argument list too long" error.

Note, however, that any of these will print only the lines that match a line from your pattern file (prefixed with the file name when more than one file is searched). The grep utility that you requested be used does not contain sufficient power to select and print a varying number of lines following a matched pattern. For that you need something more like awk :

#!/bin/ksh
pattern_file="/project/source/pattern.txt"
log_dir="/project/log"

find "$log_dir" -type f -name '*.log' -exec awk '
FNR == NR {
	patterns[++patternsc] = $0
	next
}
/^<[?]xml version/ {
	n = 2
}
{	for(i = 1; i <= patternsc; i++)
		if($0 ~ patterns[i])
			break
	if(i <= patternsc)
		p = ($0 ~ /^<[?]xml version/) ? 2 : 1
}
p {	print
	p--
}' "$pattern_file" {} +

This was written and tested using a Korn shell on macOS 10.12.1, but will work with any shell that accepts Bourne shell syntax. If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .
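
For anyone who wants to try this without touching the real directories, here is a minimal sketch that rebuilds the sample data from post #1 in a scratch directory and redirects the result to output.txt. The $work directory and the file name sample.log are assumptions made for the demo, not paths from the original posts; the awk logic is meant to mirror the script above.

```shell
#!/bin/sh
# Scratch area standing in for /project/... (hypothetical paths for the demo).
work=$(mktemp -d)
mkdir -p "$work/source" "$work/log" "$work/output"

cat > "$work/source/pattern.txt" <<'EOF'
123abc17
234cdf19
235ifg20
EOF

cat > "$work/log/sample.log" <<'EOF'
<?xml version="1.0" processid is 123abc17
read successfully at 20161109093456
<?xml version="1.0" process id is 986bng21
read successfully at 20161109093459
message id aazzkk110 is 123abc17
message id aakjahsdk110 is 234cdf19
<?xml version="1.0" processid is 235ifg20
read successfully at 20161109093456
<?xml version="1.0" process id is 987skj29
EOF

# The pattern file is read first; every log file found is appended after it.
find "$work/log" -type f -name '*.log' -exec awk '
FNR == NR { patterns[++patternsc] = $0; next }   # first file: load patterns
{
	for (i = 1; i <= patternsc; i++)
		if ($0 ~ patterns[i])
			break
	if (i <= patternsc)                           # some pattern matched
		p = ($0 ~ /^<[?]xml version/) ? 2 : 1
}
p { print; p-- }                                 # print while the counter is positive
' "$work/source/pattern.txt" {} + > "$work/output/output.txt"

cat "$work/output/output.txt"
```

With the real directories, the same redirection (> /project/output/output.txt) after the closing + should be all that is needed to collect the output in a file.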

Hi Don

Thank you for the script. When I executed it, I saw all the lines matching the patterns in the pattern.txt file, but even when a line starts with

 <?xml version 

it's not populating the second line. Could you please help me with that?

Hi,

For the given input and output samples, Don's solution works perfectly.

Here is another try:

awk 'NR==FNR{a[NR]=$0;b=NR;next} { for(i=1;i<=b;i++) { if ( $0 ~ a[i] && $0 ~ /<[?]xml version/) { print $0;getline;print $0; }  else if ( $0 ~ a[i] ) print $0 } } ' pattern.txt file.log

It gives desired output as posted in #1

@Don : Can you help me to understand your code better ?

/^<[?]xml version/ {
	n = 2
}

Is it just for condition/action block because n is not used after that ?

p {	print
	p--
}'

The use of p here is unclear.

p will be set to 2 if the line matching any pattern begins with <?xml... , or to 1 otherwise. When printing, it is counted down until it reaches 0, thus printing 2 lines or 1 line, respectively.
The n = 2 doesn't seem to be necessary for the correct output; maybe it is a relic from an earlier attempt at the solution?
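
To see the countdown in isolation, here is a small sketch on made-up input. The START and hit markers are hypothetical stand-ins for "matching line that begins with <?xml" and "ordinary matching line"; only the counter idiom is taken from Don's script.

```shell
#!/bin/sh
# countdown_demo is a hypothetical helper name used only for this illustration.
countdown_demo() {
	printf '%s\n' 'START a' 'one' 'two' 'hit b' 'three' |
	awk '
	/^START/ { p = 2 }        # header match: this line and the next one
	/^hit/   { p = 1 }        # ordinary match: this line only
	p        { print; p-- }   # fires while p is non-zero, counting down to 0
	'
}

countdown_demo
```

A bare pattern such as p with no comparison is true for any non-zero value, so the last rule stops firing by itself once the counter reaches 0.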

Try also (shamelessly stealing from Don Cragun's proposal)

find "$log_dir" -type f -name '*.log' -exec awk '
FNR == NR       {patterns[$0]
                 next
                }
/^<\?xml v.*n/  {getline X; $0 = $0 RS X
                }
                {for (i in patterns)
                   if ($0 ~ i) print
                }
'  "$pattern_file" {} +

Hi Rüdiger,
Note that the last line of the sample input in post #1 was:

<?xml version="1.0" process id is 987skj29

If 987skj29 had been included in the pattern file, my code would print that line while the code you suggested in post #9 would not. Of course, both of our scripts might pick up the first (perhaps non-matching) line of a subsequent log file if that sample line were not in the last log file processed in a batch by find -exec . To avoid that problem, I probably should have included an additional three lines in my script, changing:

/^<[?]xml version/ {
	n = 2
}

to:

FNR == 1 {
	p = 0
}
/^<[?]xml version/ {
	n = 2
}

to stop an erroneous printing of a continuation line when we switch to a new input log file.
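
A minimal sketch of that boundary case, with two made-up log files (a.log, b.log) and a reset switch added purely for the demonstration; none of these names come from the original posts:

```shell
#!/bin/sh
work=$(mktemp -d)                      # scratch directory for the demo
printf '%s\n' '987skj29' > "$work/pattern.txt"
# a.log ends with a matching <?xml line, so the counter is still 1 at EOF.
printf '%s\n' '<?xml version="1.0" process id is 987skj29' > "$work/a.log"
printf '%s\n' 'unrelated first line' 'unrelated second line' > "$work/b.log"

demo() {   # $1: 1 to reset the counter at each new file, 0 to leave it alone
	awk -v reset="$1" '
	FNR == NR { patterns[++patternsc] = $0; next }
	FNR == 1 && reset { p = 0 }        # new input file: drop any pending line
	{
		for (i = 1; i <= patternsc; i++)
			if ($0 ~ patterns[i])
				break
		if (i <= patternsc)
			p = ($0 ~ /^<[?]xml version/) ? 2 : 1
	}
	p { print; p-- }
	' "$work/pattern.txt" "$work/a.log" "$work/b.log"
}

echo "without reset:"; demo 0
echo "with reset:";    demo 1
```

Without the reset, the first line of b.log is printed as if it were the continuation of the last line of a.log; with the reset, only the matching line appears. The reset rule has to come before the matching rule so that a match on the first line of a file can still set p.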

Hi pred55,
With the sample input you provided in post #1, the script I suggested produces the output:

<?xml version="1.0" processid is 123abc17
read successfully at 20161109093456
message id aazzkk110 is 123abc17
message id aakjahsdk110 is 234cdf19
<?xml version="1.0" processid is 235ifg20
read successfully at 20161109093456

exactly matching the output you requested in post #1.

Please post a sample of a sequence of lines in your real data (from pattern.txt and from one of your *.log files) that was not processed correctly by the code I suggested.

Thanks for pointing out a weakness that would need additional measures to prevent errors. But: it WOULD print that line, only with a false second line, namely the last X read, which keeps its value when getline fails.
Possible remedies: check the getline status, as in if (1 == getline X) $0 = $0 RS X , or set X = "" beforehand.
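
A sketch of the first remedy, under a few assumptions of my own: the sample data is a shortened version of post #1 with 987skj29 added to the patterns, the helper name joined is hypothetical, and a break is added so a record is printed once even if several patterns match.

```shell
#!/bin/sh
work=$(mktemp -d)                      # scratch directory for the demo
printf '%s\n' '123abc17' '987skj29' > "$work/pattern.txt"
cat > "$work/sample.log" <<'EOF'
<?xml version="1.0" processid is 123abc17
read successfully at 20161109093456
<?xml version="1.0" process id is 987skj29
EOF

joined() {
	awk '
	FNR == NR { patterns[$0]; next }
	/^<[?]xml version/ {
		# Append the following line only if getline really read one;
		# at end of input it returns 0 and X would still hold stale data.
		if ((getline X) > 0)
			$0 = $0 RS X
	}
	{
		for (i in patterns)
			if ($0 ~ i) { print; break }   # print once per record
	}
	' "$work/pattern.txt" "$work/sample.log"
}

joined
```

Here the last <?xml line is printed alone instead of being followed by the earlier "read successfully" line. Note that a plain getline continues into the next input file, so with several log files in one awk call the status check by itself does not cover the file-boundary case.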

Thank you all for your suggestions, I got the results.