Passing multiple files to awk for processing in bash script

Hi,

I'm using the awk command in a bash script, and I'm able to pass multiple files to awk for processing. The code I use is as below (sample code):

#!/bin/bash
awk -F "," 'BEGIN { 
...
...
...
}' file1 file2 file3

In the above code I'm passing the file names manually, which is fine while there are only a few files to process. But suppose I have thousands of files in a directory - how can I process them all with awk at once?
Can I point awk at a directory so that it picks up all the files present in that directory, without my having to supply every filename on the awk command line?
Is that possible?

Thanks,
Shree

Use a wildcard. Whenever FNR==1, awk has just moved on to the next file, so you can do your per-file processing there (and, if you're using an array, you can even delete it there). If you want to process everything together, do that in the END block.
Try this and you'll get the idea:

awk 'FNR==1{print FILENAME; ++i}END{print "total files read : ",i}' file*
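
To make that a bit more concrete, here is a rough sketch along those lines (the column numbers and the totals are made-up placeholders for whatever you actually compute): at the first record of every file after the first, the previous file's subtotal is printed and the per-file array is cleared; the grand total is printed in the END block.

awk -F "," '
FNR == 1 && NR > 1 {            # a new input file has just started
    print prevfile, "subtotal:", subtotal
    split("", seen)             # clear the per-file array (portable form of delete seen)
    subtotal = 0
}
{
    prevfile = FILENAME
    seen[$1]++                  # example per-file bookkeeping on column 1
    subtotal += $2              # example: per-file sum of column 2
    grandtotal += $2
}
END {
    print prevfile, "subtotal:", subtotal
    print "grand total:", grandtotal
}' file*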

You could just let the shell expand the filenames:

awk stuff <dirname>/*

If you have a lot of files (enough to exceed the maximum length of an argument list) then you could use find with -exec or xargs:

find <dirname> -type f | xargs awk stuff
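
The -exec form does much the same job; the trailing + makes find batch as many pathnames as will fit on a command line, so, as with xargs, awk may still end up being run more than once:

find <dirname> -type f -exec awk 'stuff' {} +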

Okay, <dirname>/* can be used to process multiple files.

Suppose I want to process files stored in Hadoop HDFS - can I process them directly with an awk script, like below, as you suggested?

awk '{ 
stuff
}'  "hadoop fs -ls /user/user/data/file*"  
awk '{ 
stuff
}'  $(hadoop fs -ls /user/user/data/file*)

That should pass the list of filenames to awk. I'm not familiar with hadoop though - would you need to execute a hadoop command to access the file contents?

If you have many files and ARG_MAX is exceeded (you'll see a message like: awk: arg list too long), then whether xargs can be used depends on the script. xargs may need to call the awk script multiple times depending on the number of files, so the outcome will be wrong if, for example, your script calculates a grand total for which it needs the content of all those files in a single run.

Of course using cat * to concatenate the files and feeding that output into awk's stdin brings no solace either, since cat has the same restrictions.

But these restrictions could be circumvented with a construct like this:

for i in *
do
  cat "$i"
done |
awk -F "," 'BEGIN { 
...
...
...
}'

@CarloM,

I tried with the code:

awk '{ 
stuff
}' file1.txt $(hadoop fs -ls /user/user/data/file2.txt)

where file1.txt -> read from the local filesystem
file2.txt -> read from HDFS (hadoop)

But it gives an error for file1, and I don't see why, since file1.txt exists in the local directory. When I remove the $(hadoop fs -ls /user/user/data/file2.txt) part, copy file2.txt to the local directory and give the local path, like }' file1.txt file2.txt, it works fine.

So awk isn't able to take the file directly from HDFS for processing. How can I do that? Is there any alternative?
NOTE: I don't want to copy the hadoop files to a local directory to achieve this.

Thanks

I imagine the hadoop ls output isn't just a list of filenames - you'd need to pre-process it to get just the names.
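
For example, something along these lines might strip the listing down to just the paths - only a sketch, assuming the path is the last whitespace-separated field of each listing line (so no spaces in the paths) and that lines such as the "Found N items" header should be skipped:

hadoop fs -ls /user/user/data/file* | awk '/^[-d]/ {print $NF}'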

However, a quick look at a hadoop man page seems to imply that you can't access the files directly, so you would need a local copy of each to do it by file anyway.

If you're not using the filename (or other per-file processing) then you could cat the files into awk. Something like:

hadoop fs -cat /user/user/data/file* | awk '{stuff}'

(assuming hadoop cat doesn't add anything to the output, else you'd need to pre-process it)

You could also adapt Scrutinizer's suggestion if you need per-file processing.
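
For example, a rough (and untested) adaptation could pull the paths out of the listing as sketched above and cat each one into awk separately, passing the path in with -v since FILENAME won't be useful when awk reads from a pipe:

hadoop fs -ls /user/user/data/file* | awk '/^-/ {print $NF}' |
while IFS= read -r path
do
  # one awk run per HDFS file; fname carries the path for any per-file logic
  hadoop fs -cat "$path" | awk -F "," -v fname="$path" '{ stuff }'
done

If you also need a grand total across all the files, you'd be back to the single hadoop fs -cat ... | awk pipeline above, and would lose the per-file boundaries unless you tagged each line with its path before feeding it in.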