In the above code I'm passing the file names manually, which is fine while the number of files is small. But suppose a directory contains thousands of files: how can I process them all with awk at once?
Can I point awk at a directory so that it picks up every file present there, without my having to supply all the filenames on the command line?
Is that possible?
Use a wildcard. Whenever FNR==1, awk has started reading a new file, so you can do per-file processing there (and delete any per-file array if you use one). If you want to process everything together instead, do it in the END block.
Try this and you'll get the idea:
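A minimal sketch of the idea (the filenames and the line-count task are made up for illustration): FNR resets to 1 at the start of each file, while NR keeps counting across all of them, so `FNR == 1 && NR > 1` marks every file boundary after the first.

```shell
cd "$(mktemp -d)"             # scratch directory for the demo
printf 'a\nb\n' > f1.txt      # hypothetical sample files
printf 'c\n'    > f2.txt

# The wildcard hands every .txt file to a single awk process.
# At each file boundary, report the previous file and reset the
# per-file counter; the grand total lives in NR.
awk '
    FNR == 1 && NR > 1 { print prev ": " lines " lines"; lines = 0 }
    { lines++; prev = FILENAME }
    END { print prev ": " lines " lines"; print "total: " NR }
' *.txt
```

With the two sample files above this prints the per-file counts followed by `total: 3`.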
That should pass the list of filenames to awk. I'm not familiar with Hadoop, though; would you need to execute a Hadoop command to access the file contents?
If you have many files and ARG_MAX is exceeded (you'll see a message like awk: arg list too long), then whether xargs can help depends on the script. xargs may need to invoke awk several times depending on the number of files, so the outcome will be wrong if, for example, your script calculates a grand total, for which it needs the content of all those files in one run.
Of course, using cat * to concatenate the files and feeding that output into awk's stdin brings no solace either, since cat is subject to the same restriction.
But these restrictions could be circumvented with a construct like this:
for i in *
do
    cat "$i"
done |
awk -F "," 'BEGIN {
    ...
    ...
    ...
}'
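As a concrete (made-up) illustration of that construct, here is a grand total over the second CSV column of hypothetical files, computed in a single awk invocation. The shell expands the glob internally rather than via exec, so ARG_MAX never applies to the loop:

```shell
cd "$(mktemp -d)"                 # scratch directory for the demo
printf 'a,1\na,2\n' > part1.csv   # hypothetical sample files
printf 'b,3\n'      > part2.csv

# The for loop feeds every file through the pipe one at a time;
# awk reads it all from stdin in one process, so the grand total
# is computed correctly no matter how many files there are.
for i in *.csv
do
    cat "$i"
done |
awk -F "," '{ total += $2 } END { print total }'
```

With the sample files above this prints 6.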
where file1 -> read from the local filesystem
file2 -> read from HDFS (Hadoop)
But it's saying:
I don't see why it gives an error for file1, although file1.txt exists in the local directory. When I tried removing the $(hadoop fs -ls /user/user/data/file2.txt) part, copying file2.txt to local, and giving the local path like }' file1.txt file2.txt, it works fine.
So it's not allowing awk to take the file directly from HDFS for processing. How can I do it? Is there any alternative?
NOTE: I don't want to copy the Hadoop files to a local directory to achieve this.
I imagine the hadoop fs -ls output isn't just a list of filenames; you'd need to pre-process it to get just the names.
However, a quick look at a Hadoop man page seems to imply that you can't access the files directly, so you would need a local copy of each to process them by file anyway.
If you're not using the filename (or doing other per-file processing), then you could cat the files into awk. Something like:
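A sketch of that idea, streaming both sources into awk's stdin so nothing is copied to the local filesystem. The HDFS path is the one from the question; a stub `hadoop` function is defined so the snippet runs even without the Hadoop CLI installed (remove it on a real cluster):

```shell
cd "$(mktemp -d)"                         # scratch directory for the demo
printf '1,local\n2,local\n' > file1.txt   # hypothetical local file

# Stub 'hadoop' when the CLI is absent, so the sketch runs anywhere.
command -v hadoop >/dev/null 2>&1 || hadoop() { printf '3,from-hdfs\n'; }

# Concatenate the local file and the HDFS file on stdin; awk never
# needs a local copy of the HDFS file.
{
    cat file1.txt                               # local file
    hadoop fs -cat /user/user/data/file2.txt    # HDFS file
} |
awk -F "," '{ print $1 }'
```

The price is that FILENAME is no longer meaningful inside awk, which is why this only works when you don't need per-file processing.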