Finding all files based on pattern

Hi All,

I need to find all files in a directory that contain a specific pattern. The catch is that a file should not be considered if the pattern appears only in a commented area.

All content enclosed in /* */ is commented.
Any line starting with -- is commented; if -- appears in the middle of a line, then everything after -- on that line is commented.

For example, I need to find all files that contain insurance_no.

file1 -- this file should qualify for our search

where insurance_no=TGT.insurance_no 
-- insurance_no is unique no.

select * from 
table t1,t2
where t1.name=t2.name
and t.asset>2000 --and insurance_no <> "2521"

/* based on insuranace_no cutomer full
details can be find out*/

file2 -- this file should not qualify for our search, as insurance_no is in commented areas only

-- insurance_no is unique no.

select * from 
table t1,t2
where t1.name=t2.name
and t.asset>2000 --and insurance_no <> "2521"

/* based on insuranace_no cutomer full
details can be find out*/

The commands that I have tried so far are below, but they are not working, as they also pick up files (in this case file2) that contain only commented insurance_no.

find . -name "*.*" -exec grep -l "insurance_no" {} \; 2>/dev/null

find . -name "*.*" |xargs -n1 -I {} sh -c 'grep insurance_no {}|grep -v ".*--.*insurance_no.*"|grep -v ".*/\*.*insurance_no.*\*/" '

Thanks in advance for all your guidance/help.

Hello Lakshman Gupta,

We can look for the string "where insurance_no"; as I can see from your example, only file1 has that and not file2, so we can search for that string.
The following may help you find these kinds of files; please let me know if you have any queries.

find -type f -exec grep "where insurance_no" {} \; -print 2>/dev/null

Output will be as follows.

where insurance_no=TGT.insurance_no
./search_file1

Thanks,
R. Singh

Hi Ravi,

Thanks for your time. This "where" can change and be replaced with some other characters, as we have to scan through thousands of files, so this will not be a generalised solution.

Hello Lakshman_Gupta,

Could you please try the following and let us know if this helps.

find -type f -exec grep '[^a-zA-Z0-9]insurance_no=[^a-zA-Z0-9]*'  {} \; -print 2>/dev/null

Output is as follows.

where insurance_no=TGT.insurance_no
./search_file1

EDIT: I have also made a file as follows, and the above command is working fine for that too.

cat ./search_file3
908where insurance_no=90TGT.insurance_no
-- insurance_no is unique no.
select * from
table t1,t2
where t1.name=t2.name
and t.asset>2000 --and insurance_no <> "2521"
/* based on insuranace_no cutomer full
details can be find out*/

After running the command we will get the following results.

find -type f -exec grep '[^a-zA-Z0-9]insurance_no=[a-zA-Z0-9]*'  {} \; -print 2>/dev/null
908where insurance_no=90TGT.insurance_no
./search_file3
where insurance_no=TGT.insurance_no

Thanks,
R. Singh

You haven't said much about your definition of "pattern".

Are you performing case sensitive matches?

Can the pattern match any text, or does the pattern have to match entire "words"? If you're limiting it to words, what defines a word boundary?

Will your patterns ever contain any characters that are special in a BRE or ERE?

Will your patterns ever contain any characters that are special in a filename matching pattern?

Will your patterns ever contain any whitespace characters? (If so, does the pattern need to be matched if the pattern extends across line boundaries?)

Do you just need to process all of the regular files in a single directory? Or do you need to process all of the regular files in a file hierarchy rooted in a directory?

Do you just want the names of files that contain the (uncommented) pattern for which you're searching? Or, do you want the filename and the lines that contain the pattern? If you want the lines containing the pattern; do you want entire lines or can it just be lines with the comments discarded?
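
To illustrate why the question about BRE/ERE-special characters matters (a hypothetical example, not something you stated as a requirement): a pattern like a.insurance_no is a regular expression to grep, where the . matches any character, so a fixed-string match behaves differently:

printf 'axinsurance_no\na.insurance_no\n' | grep 'a.insurance_no'     # matches both lines
printf 'axinsurance_no\na.insurance_no\n' | grep -F 'a.insurance_no'  # matches only the literal line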

CentOS 6 / bash
This seems to work also.

find -type f -exec egrep -l '^\w.*[^- ]insurance_no.*$' '{}' \; 
./file3
./file1

Edit:
corrected spelling... didn't change anything here.
...based on insuranace_no cutomer

Thanks Don !!!

Here are my responses to your questions:

Are you performing case sensitive matches? Yes

Can the pattern match any text, or does the pattern have to match entire "words"? If you're limiting it to words, what defines a word boundary? YES, it can match any text; for example it could be insurance_no or a.insurance_no= or <>insurance_no or b.insurance_no or ,insurance_no,

Will your patterns ever contain any characters that are special in a BRE or ERE? NO

Will your patterns ever contain any characters that are special in a filename matching pattern? NO

Will your patterns ever contain any whitespace characters? (If so, does the pattern need to be matched if the pattern extends across line boundaries?) NO whitespace

Do you just need to process all of the regular files in a single directory? Or do you need to process all of the regular files in a file hierarchy rooted in a directory? A file hierarchy rooted in a directory.

Do you just want the names of files that contain the (uncommented) pattern for which you're searching? Or, do you want the filename and the lines that contain the pattern? If you want the lines containing the pattern; do you want entire lines or can it just be lines with the comments discarded? I am looking for the names of files that contain the matched (uncommented) pattern.

Ravi,
I am analyzing your solution with more test cases; thanks for your time.

I just changed the content of test_2.txt as below; its name should now be returned, but it is not.

-- insurance_no is unique no.

select *,insurance_no from
table t1,t2
where t1.name=t2.name
and t.asset>2000 --and insurance_no <> "2521"

/* based on insuranace_no cutomer full
details can be find out*/

Meanwhile, I was trying the below:

find . -name "*.*" |xargs  -n1 -I {} sh -c 'a=`grep insurance_no {}|grep -v ".*--.*insurance_no.*"|grep -v ".*/\*.*insurance_no.*\*/"`;if [ -z "$a" ] ;then echo "1">/dev/null ; else echo {} ;fi'

but it is hitting an xargs limit (error below); for smaller file names it is working fine.

xargs: Maximum argument size with insertion via {}'s exceeded
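
One way to sidestep that insertion limit (a sketch along the same lines, not fully tested; it keeps the same per-line grep filtering and therefore the same limitation with multi-line /* ... */ comments) is to let find hand the file names to a small shell loop as positional arguments instead of inserting {} into the command string:

find . -name "*.*" -type f -exec sh -c '
    for f do
        # the file name arrives as a positional argument, so nothing large
        # has to be inserted into the command string itself
        a=`grep "insurance_no" "$f" | grep -v ".*--.*insurance_no.*" | grep -v ".*/\*.*insurance_no.*\*/"`
        if [ -n "$a" ]; then printf "%s\n" "$f"; fi
    done
' sh {} +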

Try this

awk     '               {sub (/--.*$/,"")}
         /\/\*/,/\*\//  {next}
         $0 ~ PAT       {print FILENAME}
        ' PAT="insurance_no" file*

It removes the comments first (assuming /* ... */ comments out full lines) and then checks for the pattern; it may need refinements to deal with Don Cragun's questions.


Hi RudiC,

Thanks for doing it the awk way.
I just ran this and am getting the error below; I tried to correct the syntax error myself but wasn't able to.

bash-3.2$ awk     '               {sub (/--.*$/,"")}
>          /\/\*/,/\*\//  {next}
>          $0 ~ PAT       {print FILENAME}
>         ' PAT="insurance_no" test_*
awk: syntax error near line 1
awk: illegal statement near line 1
awk: syntax error near line 3
awk: bailing out near line 3

I bet it's a Solaris 10 system, because I get the exact same messages using the default awk.

Try /usr/xpg4/bin/awk or nawk
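
For example, assuming nawk is installed on your system, RudiC's script would become:

nawk    '               {sub (/--.*$/,"")}
         /\/\*/,/\*\//  {next}
         $0 ~ PAT       {print FILENAME}
        ' PAT="insurance_no" test_*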


Yes, exactly, junior-helper. Thanks a lot.

Hopefully, RudiC's suggested awk script got you started down a workable path. Unfortunately, with an input file like:

pattern /* comment */
pattern2 /* start
continue comment
This >>> pattern <<< should never be seen.
continue comment
end */ pattern3
/* comment1 */ pattern4 /* comment 2 */

I believe that if your search pattern is pattern, RudiC's script will not find any of the four occurrences of pattern in the above file that are not in comment fields.

You didn't mention anything about quoted strings. If -- or /* and */ do not denote comments if they are single quoted or double quoted (as in a shell script or C code), the following script won't work either. (If you need something that will ignore comments found in quoted strings, maybe you can use the following as a guide on how to attack that problem; but I won't volunteer to do that for you here. A general parser like that is too much like work for me to offer to do it for free. ;))

The following script will work with any ksh , with /usr/xpg4/bin/sh , /usr/xpg6/bin/sh , or with bash (if bash is installed on your Solaris system). First copy the following into a file named NoCommentPattern.awk :

# Check to see if we already had a match in this file...
nf > 0 {if(FNR == 1)	nf = 0
	else		next
}
d {	printf("===%d%d\t%s\n", nf, ssc, $0)
}
# Strip out any comments (or skip line completely if we're in the middle of a
# multi-line comment.
{	if(ssc) {
		# An earlier line had an unclosed comment starting with "/*"...
		if(s = index($0, "*/")) {
			$0 = substr($0, s + 2)
			if(d) printf("Updated $0:\n\t%s\n", $0)
			ssc = 0
		} else	next
	}
	# Search for "/*...*/" and "--" comments.
	while(match($0, "[-][-]|[/][*]")) {
		if(substr($0, RSTART, 1) == "-") {
			# Found -- comment; throw away the rest of the line...
			if(RSTART == 1) {
				if(d) printf("Comment line deleted.\n")
				next
			}
			$0 = substr($0, 1, RSTART - 1)
			if(d) printf("Updated $0:\n\t%s\n", $0)
			break
		}
		# Found start of "/*" comment; look for the end of comment...
		if(s = index(substr($0, RSTART + 2), "*/")) {
			# End found, delete comment from line and look for more.
			$0 = (RSTART > 1 ? substr($0, 1, RSTART - 1) : "") \
				substr($0, RSTART + s + 3)
			if(d) printf("Updated $0:\n\t%s\n", $0)
		} else {# We found the start of a "/*...*/" comment but not
			# the end.  Process the part of this line before the
			# comment...
			ssc = 1
			if(RSTART == 1) {
				if(d) printf("Comment line deleted.\n")
				next
			}
			$0 = substr($0, 1, RSTART - 1)
			if(d) printf("Updated $0:\n\t%s\n", $0)
			break
		}
	}
}
# Look for pattern in current line after comments have been stripped.
index($0, P) {
	# Found it...
	print FILENAME
	nf = 1
}

and create a script (for this example, call it findpat ) containing:

#!/usr/xpg4/bin/sh
pat=${1:-insurance_no}
if [ $# -gt 1 ]
then	debug=1
else	debug=0
fi
find . -type f -exec /usr/xpg4/bin/awk -v P="$pat" -v d="$debug" -f NoCommentPattern.awk {} +

and make it executable:

chmod +x findpat

Then the command:

./findpat

or:

./findpat "insurance_no"

will search the directory hierarchy rooted in the current directory for regular files containing insurance_no outside of comments and print the names of any files that meet these conditions.

If you invoke it with two or more arguments:

./findpat "Search Pattern" debug

it will print lots of debugging information while it searches for matching files so you can see the lines it is processing and how it strips out comments before looking for the pattern. Once you understand how it works, you can make the script run a little bit faster if you strip out the debugging code.

Note that if you run this script in a directory other than where you place the file NoCommentPattern.awk , you'll need to modify the script to use an absolute pathname to where this file is located. This script should work even if there are spaces or tabs in your search pattern, but it will not find it if your pattern matches text that starts on one line and continues onto the next line.

If someone else wants to try this on a system where awk includes support for the nextfile statement, this script can be made a lot faster by using it instead of setting nf = 1 when a match is found, reading the remainder of the file, and setting nf back to zero when the 1st line of the next file is found.
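
For example, on such an awk the last block of NoCommentPattern.awk might become something like this sketch (untested):

# Look for pattern in current line after comments have been stripped.
index($0, P) {
	# Found it; report the file and move straight on to the next file.
	print FILENAME
	nextfile
}

The nf-based block at the top of the script would then no longer be needed.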


Thanks Don for this useful script!!!

Hi Rudic,

I was just building on your awk commands; there are three things I am still not able to sort out.

1) How to use this awk command recursively in all subdirectories. As of now I am thinking of utilizing find, something like below:

find . -name "*.*" -exec /usr/xpg4/bin/awk     '               {sub (/--.*$/,"")}
         {sub('/\/\*/,/\*\//',"")}
         $0 ~ PAT       {print FILENAME}
        ' PAT="insurance_no" {} \;

2) If there are multiple occurrences of insurance_no (outside the comments), the file name is printed for each occurrence. To resolve this I put sort -u after a pipe, but is there any other way, in awk itself, to print the file name on the first occurrence and then move on to search the next file?

bash-3.2$ find . -name "test_*" -exec /usr/xpg4/bin/awk     '               {sub (/--.*$/,"")}
>          {sub('/\/\*/,/\*\//',"")}
>          $0 ~ PAT       {print FILENAME}
>         ' PAT="RQST_ID" {} \;
./test_2.txt
./test_2.txt
./test_1.txt
./test_1.txt
./test_1.txt

3) As I understand it, this awk command goes through each file, removes the commented parts, and then searches for the pattern; this way performance will degrade. Can we do it the other way around: first search for the pattern literally, then remove the commented parts, and then check whether the pattern is still present in the uncommented part?

1) Yes, that should work
2) If your awk has the nextfile statement, place that right after the print statement. If not, you need to create a logical construct like Don Cragun did (see the sketch below).
3) It has to read the file either way. Not removing the comments first would again raise the need to create a logical construct, and it would not increase performance.
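
As an illustration of such a construct (a minimal sketch, assuming the /usr/xpg4/bin/awk used above and that you only want one file name per matching file):

/usr/xpg4/bin/awk '
FNR == 1           {found = 0}                   # reset the flag for every new file
                   {sub (/--.*$/,"")}            # strip -- comments
/\/\*/,/\*\//      {next}                        # skip /* ... */ block lines
!found && $0 ~ PAT {print FILENAME; found = 1}   # report each file only once
' PAT="insurance_no" test_*

It still reads each file to the end, but each matching file name is printed only once, so the sort -u after the pipe is no longer needed; for the recursive case it can be plugged into the same find ... -exec ... {} \; construct as above.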