Using awk on multiple files in a directory

So I have a directory:

/data/projects

In this directory there are about 300 files, and on all of them I'm running:

egrep -r 'Customer.*Processed' /data/projects/*

Is there an efficient (fast) awk way of searching through each file in the directory and producing output such as the following?

Desired output format:

/data/projects/file01,300lines,130lines matching 'Customer.*Processed'
/data/projects/file02,40lines,13lines matching 'Customer.*Processed'
/data/projects/file03,3000lines,1879 lines matching 'Customer.*Processed'
......
......

I'm looking for an awk command that is efficient. If I smack something together, I'm pretty sure it won't be efficient, so I'm hoping someone has a better way of doing it than this:

awk '/Customer.*Processed/' /data/projects/*

I would recommend using grep instead of awk.

grep should perform way better than awk in pattern matching.
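To get the per-file summary you described without awk, a small shell loop around grep and wc is the simplest starting point. This is only a sketch: it makes two passes and spawns two extra processes per file, and the output format only approximates your example.

for f in /data/projects/*; do
    total=$(wc -l < "$f")                            # total line count (some wc implementations pad with spaces)
    matches=$(grep -Ec 'Customer.*Processed' "$f")   # number of lines matching the pattern
    printf '%s,%slines,%slines matching %s\n' "$f" "$total" "$matches" "'Customer.*Processed'"
done

With 300 files that is 600 short-lived processes, so a single grep or awk invocation over the whole directory may well end up faster.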


You must be using a very good grep and a very bad awk if you see a significant difference in simply printing matching lines.

Not only do I not see a big difference, but awk wins one of the tests.

Using a pair of GNU implementations (neither of which is renowned for speed):

$ awk --version | head -n1
GNU Awk 4.1.0, API: 1.0 (GNU MPFR 3.1.2, GNU MP 4.3.2)

$ grep --version | head -n1
GNU grep 2.6.3

Fixed string:

$ time seq 500000 | grep -c 434
2484

real    0m15.266s
user    0m14.685s
sys     0m0.061s

$ time seq 500000 | grep -Fc 434
2484

real    0m15.266s
user    0m14.919s
sys     0m0.015s

$ time seq 500000 | awk '/434/ {++i} END {print i}'
2484

real    0m14.813s
user    0m14.888s
sys     0m0.030s

Regular expression with wildcard:

$ time seq 500000 | grep -c '4.*4'
73535

real    0m14.844s
user    0m15.968s
sys     0m0.015s

$ time seq 500000 | awk '/4.*4/ {++i} END {print i}'
73535

real    0m15.047s
user    0m14.998s
sys     0m0.076s

Regards,
Alister


I think this is as close as you can get:

awk     'FNR == 1               {if (NR > 1) {print fn, "text1", fnr, "text2", nl}   # report the file just finished
                                 fn=FILENAME; fnr = 1; nl = 0}                       # start counters for the new file
                                {fnr = FNR}                                          # remember the current line number
         /Customer.*Processed/  {nl++}                                               # count matching lines
         END                    {print fn, "text1", fnr, "text2", nl}                # report the last file
        ' /data/projects/*

As you want the line count per file, you need to read every file entirely; I don't see much chance to improve on speed...
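If your awk happens to be GNU awk 4.0 or later, the same bookkeeping can also be written with its BEGINFILE/ENDFILE blocks. Just a sketch of the equivalent logic (gawk-specific, so check your version first):

gawk 'BEGINFILE              {nl = 0}                                      # new file: reset the match counter
      /Customer.*Processed/  {nl++}
      ENDFILE                {print FILENAME, "text1", FNR, "text2", nl}   # FNR here is the total line count of the file just read
     ' /data/projects/*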


Thank you!!!

This worked perfectly. Is there any way I can instruct awk to do exactly what you're doing here, but treat any file it finds that isn't plain text (e.g. gzip files) differently?

For instance, grepping for the string won't work on files that are gzipped. I do know you can use the following for reading gzip files:

( gunzip -cd /path/to/file.gz ; cat /path/to/file ) | grep

The problem I'm having is incorporating this command into your awk command so that it kicks in ONLY when awk comes across a file that isn't plain text.

Makes sense?

Why don't you gunzip all files upfront and then apply the awk script to the entire directory?
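If you would rather leave the originals untouched, here is a rough sketch of a wrapper that picks the reader per file instead. It assumes gzipped files can be recognized by a .gz suffix, and it costs one awk process per file, so it gives up some of the speed of a single awk run:

for f in /data/projects/*; do
    case $f in
        *.gz) reader="gunzip -cd" ;;    # compressed: stream the decompressed contents
        *)    reader="cat"        ;;    # plain text: read as-is
    esac
    $reader "$f" | awk -v fn="$f" '
        /Customer.*Processed/ {nl++}
        END                   {print fn, "text1", FNR, "text2", nl+0}
    '
done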

Actually, that's the least of my problems now; I believe I'll be able to figure that out at the end. The only other question I have is: let's say the first time I run this command, I get an output similar to this:

first run:
/data/projects/file01,300lines,130lines matching 'Customer.*Processed'

(Note: this is just one file out of many that would be in the output.)

Now, the above output is saved to a file called /tmp/results.txt. The second time I run this command, say 5 minutes later, there'd be a line in the output similar to:

second run:
/data/projects/file01,410lines,139lines matching 'Customer.*Processed'

Now, I don't want to search through each file again. I want to begin from the point where the last scan left off.

In the first run there were 300 lines in the file named /data/projects/file01. I want it so that, the next time I run the script, awk can begin at line 301 and read to the end of the file, and I want this to happen for all the files it finds in the directory. That way only the first run will be slow; all runs after that will be fast.

Here's my attempt to modify your code:


lastlinenumber=$(awk -F"," '{print $2}' /tmp/results.txt | sed 's/lines//g')

awk    -v LLNUM=${lastlinenumber}  'FNR == 1               {if (NR > 1) {print fn, "text1", fnr, "text2", nl}
                                 fn=FILENAME; fnr = 1; nl = 0}
                                {fnr = FNR}
         /Customer.*Processed/  && NR>LLNUM {nl++}
         END                    {print fn, "text1", fnr, "text2", nl}
        ' /data/projects/*

If, while comparing against the list of files from the latest scan, it finds a file that didn't exist in the previous scan, it should scan that file in its entirety because it would be considered new.

a) You have to read the last line number AND the filename when you want to run the script on multiple files. This could be done in awk by reading results.txt as the first file and storing its lines in an array.
b) The influence on execution speed will be negligible (especially doing it the way you proposed above), as awk still needs to read every single line just to count them...
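Something along these lines for a), reading /tmp/results.txt first and skipping the already-counted lines of each file. This is a sketch: it assumes results.txt is non-empty and uses the comma-separated format shown earlier, and, as per b), every line still has to be read:

awk -F, 'NR == FNR            {seen[$1] = $2 + 0; next}                      # results.txt: "300lines" + 0 -> 300
         FNR == 1 && fn != "" {print fn, "text1", fnr, "text2", nl}          # report the previous data file
         FNR == 1             {fn = FILENAME; nl = 0}
                              {fnr = FNR}
         FNR > seen[FILENAME] && /Customer.*Processed/ {nl++}                # unknown files have seen[] == 0, so they are read fully
         END                  {print fn, "text1", fnr, "text2", nl}
        ' /tmp/results.txt /data/projects/*

Note that this counts only the matches found since the previous run; if you want running totals like your example output, you would also have to carry the old match count over from the third field of results.txt.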

What if we feed the last known number of lines of each file to awk? That way awk can begin reading each file from that point on. I can do this in bash, but it'll require a few lines of code... which I think will be inefficient.

Neither bash nor awk can do this because neither provides an interface to the lseek() system call. And, if either did, it would only be useful if all lines were of the same length, or if a byte count is saved instead of a line count.

dd can seek on a shared file descriptor on behalf of a subsequent process (such as awk), but that would only be useful when working with one file at a time.
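For the record, the dd trick looks roughly like this for a single file. It needs a saved byte offset (here a hypothetical $offset, e.g. recorded with wc -c at the end of the previous run), not a line count:

offset=123456    # hypothetical: byte position where the previous scan of this file ended
{
    dd bs=1 skip="$offset" count=0 2>/dev/null    # advance the shared file offset past the already-scanned bytes
    awk '/Customer.*Processed/ {n++} END {print n+0, "new matching lines"}'
} < /data/projects/file01

Both commands read from the same file descriptor opened by the redirection, so awk picks up exactly where dd stopped; you would have to loop over the files one at a time and record each file's size for the next run.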

Regards,
Alister