Recursive find / grep within a file / count of a string

Hi All,

This is the first time I have posted to this forum, so please bear with me. Thanks also in advance for any help or guidance.

For a project I need to do the following:

  1. There are multiple files in multiple locations, so I need to find them and their locations. So I had planned to use:
    cd LOCATION;
    find . -name "FILENAME.TXT" -type f -print > $HOME/list_of_locations.txt

This gives me paths in this format: ./dir1/dir2/dir3/FILENAME.TXT

  2. Each one of these files is of a different format, and the only way to work out which format is to count the number of occurrences of the "|" character in each file.

I can either use head -1 to take the first row and count the number of occurrences of the "|" character, or else grep for "|" in all rows and divide by wc -l (the number of lines). My preference is for whichever is most efficient (see the sketch after this list).

  3. I want to produce a new file listing the full path and the number of occurrences of the "|" character, so that I can process the .txt files later. The counts could either be concatenated onto the list_of_locations.txt from step 1, or a new file could be created with this information.
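
For step 2, I mean something along these lines (untested; FILENAME.TXT stands for one of the files found in step 1):

head -1 FILENAME.TXT | awk -F'|' '{print NF - 1}'   # fields minus one = number of "|" separators
head -1 FILENAME.TXT | tr -cd '|' | wc -c           # delete everything except "|" and count what is left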

So what I am asking:

Is there a quick way of doing this?
Using find . -name is very slow, but it looks like there is no other way, as I am doing a recursive search across subdirectories.
Is there a better way to interrogate my .txt files to find out how many "|" characters there are?
Is there a better way to put all of this into a UNIX script?

Thanks in advance for any help you can give, either code snippets or advice.

Regards,
Charlie.

You can do all of that in one line:

find /pathA /pathB ... /pathN -name "filename" -print0 |xargs -0 awk -F\| 'FNR==1 {print FILENAME, NF}'

This will search the list of locations for the filename(s) you specified and print the pathnames, separated by NUL ("\0") characters. xargs will collect them all and run awk on this list. awk will open each file and print the full path and the field count from its first line. Redirect as desired.
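
For instance, to produce the list file from the original post (an illustration, using the pathnames from above):

find /pathA /pathB -name "FILENAME.TXT" -print0 |xargs -0 awk -F\| 'FNR==1 {print FILENAME, NF}' > $HOME/list_of_locations.txt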
As I am not aware of how to skip the remainder of the file and go on to the next one, there is some optimization potential. Trials with close("-") right after the print statement showed a small improvement in execution time, but I'm not sure it does the right thing. EDIT: It does not; it returns a -1 error code.
Does anybody out there know how to skip to the next file in awk's argument list?

RudiC's suggestion is close, but misses on a couple of points. Since no pathname operands are given to awk, all of the filenames printed by awk will be empty strings. And, if there are x field separators on a line, there are x+1 fields.
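
For example, a line with two "|" separators has three fields:

printf 'abc|def|efg\n' | awk -F'|' '{print NF, NF-1}'
3 2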

The -print0 find primary and the -0 option to xargs are not defined by the standards, so they might not be available on your implementation.

A portable way to do what I believe was requested is:

find . -name 'FILENAME.TXT' -exec awk -F'|' 'FNR==1{printf("%s %d\n", FILENAME, NF-1)}' {} +

Some implementations of awk have a nextfile statement (like next, but while next restarts processing on the next line, nextfile restarts processing on the first line of the next file). If your awk has this non-standard extension, the following will be much more efficient for long input files:

find . -name 'FILENAME.TXT' -exec awk -F'|' '{printf("%s %d\n", FILENAME, NF-1);nextfile}' {} +

-------------------------------
Note that the comment I made about Rudi's proposal not printing pathnames is totally bogus. The xargs utility will add the pathname operands to awk as it invokes awk. :o

Thank you, Don, for commenting on my proposal.

At least with the combination of find and awk implemented on my Linux system, there's a full path listing available, including filenames containing spaces:

find /var/log -iname \*.log -print0 |xargs -0 awk  -F\| 'FNR==1 {print FILENAME, NF}'
/var/log/auth.log 1
/var/log/dist-upgrade/history.log 0
. . .
/var/log/x y.log 3
/var/log/kern.log 1

Yes. Still, I thought the number of fields would be more relevant than the number of separators. That might have been premature.

Works, and satisfies the standards, but:

time find . . . -print0 |xargs -0 awk  -F\| '. . .'
real    0m0.034s
time find . . . -exec awk -F\| '. . .' {} \;
real    0m0.208s

Special thanks for this; I was looking for that or an equivalent, but unfortunately it's not available on my system.

Hi Rudi,
Yes, but note that by skipping the -print (or -print0) and the invocation of xargs, awk is still given the full pathname as an operand (even if there are spaces, tabs, or newlines included in the pathname).

Agreed. But it wasn't what Charlie6742 asked for.

Not surprising since what you timed runs awk once for each input file.
But note that I specified:

find . . . -exec awk -F\| '. . .' {} +

not:

find . . . -exec awk -F\| '. . .' {} \;

With the + instead of the \;, find shouldn't execute awk any more times than xargs would, and we avoid having to start xargs at all.
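
You can see the difference by replacing awk with a small sh that just reports how many pathnames it was handed (a sketch; the second form prints one line per file):

find . -name '*.txt' -exec sh -c 'echo "one invocation, $# pathname(s)"' sh {} +
find . -name '*.txt' -exec sh -c 'echo "one invocation, $# pathname(s)"' sh {} \;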

Rats ... missed that. Absolutely right, plays in the same league:

time find . . .  -exec awk -F\| ' . . . ' {} +
. . .
real    0m0.034s

Thanks guys. I have played with all the methods you suggested, but they do not seem to give me any output. They run without errors - but just don't produce output. I should have said that I am using the bash shell - could some of these commands not be working properly on my setup? Is there a way I can set things up so they work as you have them?

If it helps, this is the message I get for one of the options that doesn't work:

find . -name "a.txt" -exec /usr/bin/awk -F'|' '{printf("%s %d\n", FILENAME, NF-1);nextfile}' {} +
./dir1/a.txt 40
awk: illegal statement 603430
record number 1

Once again, thanks in advance for looking at this, and so quickly - it's really appreciated.

Charlie

What system are you using? If it is a Solaris system, try using /usr/xpg4/bin/awk or nawk instead of /usr/bin/awk.

If it is a Solaris system, there is also a good chance that nextfile isn't supported, but that should have generated a clearer error message. Have you tried the other form:

find . -name 'a.txt' -exec awk -F'|' 'FNR==1{printf("%s %d\n", FILENAME, NF-1)}' {} +

Hi guys, thanks again for the swift responses on this :). I managed to solve my problem, so I thought I would share it - feel free to comment. I also have another question. Just for everyone's benefit, in the code below I am:

  1. Doing a recursive find of the .txt files
    e.g. /folder1/folder2/folder3/a.txt
  2. Pulling out the first row of each of the .txt files
    e.g. from a.txt, take only the first row: abc|def|efg
  3. Counting the number of "|" characters into the temp2 variable
  4. Counting the number of "/" characters from the path name into temp3
  5. Outputting only paths which have a set number of "/" characters (only 9 and 10)

SO HERE'S THE NEW QUESTION:
For the check on the number of "/" characters I have the OR clause, but what actually happens is that it finds paths which have 10 in them and outputs them, but does not look at any with 9. Equally, if I swap the code around, it finds the paths with 9 in them but does not look at any with 10. Is there a better way to do this other than splitting up the if statement?

find . -name .snapshot -prune -o -name "*.txt" -print|while read i
do
    temp1=`grep "|" "${i}"|head -1`
    temp2=`echo ${temp1} | awk -F"|" '{c += NF - 1} END {print c}'`
    temp3=`echo ${i} | awk -F"/" '{c += NF - 1} END {print c}'`
    echo $temp3
    if [ "$temp3" = "10" -o "&temp3" = "9" ]; then
      echo "${i} , ${temp3} , ${temp2}"
      echo
    fi
done

Does your find provide the -mindepth and -maxdepth options? That would help in the first place. If it does not, why don't you filter find's output before reading it into the i variable?
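
For the second approach, a sketch (with -F'/', a pathname containing 9 or 10 "/" characters has 10 or 11 fields, so the shell loop never sees the other paths, and the if statement disappears entirely):

find . -name .snapshot -prune -o -name "*.txt" -print |
awk -F'/' 'NF == 10 || NF == 11' |
while read i
do
    temp1=`grep "|" "${i}" | head -1`
    temp2=`echo ${temp1} | awk -F"|" '{print NF - 1}'`
    echo "${i} , ${temp2}"
done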