perform 3 awk commands to multiple files in multiple directories

amarn · October 26, 2011, 2:40pm

Hi,

I have a directory /home/datasets/ which contains 720 subdirectories called hour_1/, hour_2/, etc. Each of these holds a single text file (hour_1.txt in hour_1/, hour_2.txt in hour_2/, etc.) and I would like to do some text processing on them.

Each of these text files contains records (every record is unique; there are no duplicates). I first want to write each record to its own file, named after the second field ($2 is an identifier of the form cust_xxx_yyy). I'm currently doing this (example for file hour_1/hour_1.txt):

(1)

awk '{print $0 > $2".txt"}' hour_1.txt

which results in multiple .txt files whose names start with cust_

Then I want to turn each of these files into a single-column file, so I do this:

(2)

awk '{print >  "n_"FILENAME}' RS=" " cust_*

Finally, I want to remove the first 3 records of the newly created files, so I do the following:

(3)

awk 'FNR>3 {print > "fin_"FILENAME}' n_cust*

I know that there might be an easier way of doing this even for a single directory, but is there a way to write a universal script and perform these 3 commands in all the directories?

Thanks in advance!

felipe.vinturin · October 26, 2011, 2:48pm

You can try to use the script I sent you in this post as a base script:

find <Path> -name "hour_*.txt" -type f | \
while read fname
do
	fileBaseName=`basename "${fname}"`
	fileDirName=`dirname "${fname}"`

	echo "fileBaseName: [${fileDirName}][${fileBaseName}] - fname[${fname}]"
done

This may help you as a starting point. =o)
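
To illustrate how the loop above behaves, here is a minimal sketch on a throwaway tree. The temporary directory and the two hour_N files are made up for the demonstration; the point is only that a single find call descends into every subdirectory, so the loop body runs once per matching file without find being called again.

```shell
#!/bin/sh
# Hypothetical demo tree; the real data would live under /home/datasets/.
root=$(mktemp -d)
mkdir "$root/hour_1" "$root/hour_2"
echo "a" > "$root/hour_1/hour_1.txt"
echo "b" > "$root/hour_2/hour_2.txt"

# One find call walks all subdirectories; the loop body runs once per file found.
found=$(
    find "$root" -name "hour_*.txt" -type f | sort |
    while read -r fname
    do
        # Print "<parent directory>/<file name>" for each hit.
        echo "$(basename "$(dirname "$fname")")/$(basename "$fname")"
    done
)
echo "$found"

rm -rf "$root"
```

The sort is only there to make the order of the output predictable; find itself does not guarantee any particular order.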

amarn · October 26, 2011, 3:42pm

thank you again felipe.vinturin for your quick response

I had (and still have) your code very much in mind, but if I use the find <Path> part, wouldn't it need to be wrapped in a loop in order to reach each specific directory out of the 720 in the main directory? (Please be aware that I'm a newbie in shell scripting :) )

so the main directory is : /home/datasets/

and it contains 720 directories. If I use find <path> to reach the single .txt file (and then put these 3 awk commands between do and done), wouldn't I have to call find again for every path (e.g. /home/datasets/hour_1/, then /home/datasets/hour_2/, etc.)?

thanks again

---------- Post updated at 02:42 PM ---------- Previous update was at 02:14 PM ----------

hi felipe again, i actually have tried your code by doing this:

!usr/bin/sh

find /home/tester/datasets/ -name "hour_*.txt" -type f | \

while read fname
do
	fileBaseName = `basename "${fname}" `
	fileDirName = `dirname "${fname}" `

#	echo "fileBaseName: [${fileDirName}][${fileBaseName}] - fname[${fname}]"

	awk '{print $0 > $2".txt"}' fileBaseName
	awk '{print > "n_"FILENAME}' RS= " " "cust_*.txt"
	awk 'FNR>3 {print > "fin_"FILENAME}' "n_cust*.txt"

done

I get errors on fileBaseName and fileDirName and, as expected, some errors in the awk commands. Does the fileBaseName and fileDirName failure have to do with the fact that I'm running under Cygwin?

cheers and thank you again!

felipe.vinturin · October 27, 2011, 6:49am

Hi,

You were getting an error because you passed only the filename to awk, not the path plus the filename. Also, a variable must be referenced with a leading $, e.g. ${fileBaseName}; a bare fileBaseName is taken as a literal file name.

find /home/tester/datasets/ -name "hour_*.txt" -type f | \
while read fname
do
	fileBaseName = `basename "${fname}" `
	fileDirName = `dirname "${fname}" `

#	echo "fileBaseName: [${fileDirName}][${fileBaseName}] - fname[${fname}]"

	awk -v outputPath="${fileDirName}" '{print $0 > outputPath "/" $2 ".txt"}' "${fname}"
	awk -v outputPath="${fileDirName}" '{print > outputPath "/" "n_" FILENAME}' RS= " " "${fileDirName}/cust_*.txt"
	awk -v outputPath="${fileDirName}" 'FNR>3 {print > outputPath "/" "fin_" FILENAME}' "${fileDirName}/n_cust*.txt"
done

This version uses the paths and filenames.

One more comment, I have not tested it!

I hope it helps.

amarn · October 27, 2011, 7:04am

Hi,

It seems that the problem is caused by fileBaseName and fileDirName: they don't actually hold any values and I keep getting an error.

Are these variables built into the shell, or are they just your own?

thank you again

CarloM · October 27, 2011, 7:16am

The error is because there are spaces in the variable assignment - change it to:

	fileBaseName=`basename "${fname}" `
	fileDirName=`dirname "${fname}" `
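
As a minimal sketch of the point above, the snippet below assigns both variables without spaces around the equal sign. It uses $(...) instead of backticks, which is an equivalent and easier-to-nest form of command substitution; the path is the example path from this thread and does not need to exist for basename and dirname to work.

```shell
#!/bin/sh
# Example path taken from the thread; the file does not have to exist.
fname="/home/tester/datasets/hour_1/hour_1.txt"

# No spaces around "=": the shell performs an assignment.
fileBaseName=$(basename "$fname")
fileDirName=$(dirname "$fname")

echo "$fileBaseName"   # hour_1.txt
echo "$fileDirName"    # /home/tester/datasets/hour_1

# With spaces, "fileBaseName = x" is parsed as a command named fileBaseName
# with the arguments "=" and "x", which normally fails with "command not found"
# and leaves the variable unset.
```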

felipe.vinturin · October 27, 2011, 7:17am

When I copied your script, I did not see that there was a space between the variable name, equal sign and the command:

find /home/tester/datasets/ -name "hour_*.txt" -type f | \
while read fname
do
	fileBaseName=`basename "${fname}"`
	fileDirName=`dirname "${fname}"`

#	echo "fileBaseName: [${fileDirName}][${fileBaseName}] - fname[${fname}]"

	awk -v outputPath="${fileDirName}" '{print $0 > outputPath "/" $2 ".txt"}' "${fname}"
	awk -v outputPath="${fileDirName}" '{print > outputPath "/" "n_" FILENAME}' RS= " " "${fileDirName}/cust_*.txt"
	awk -v outputPath="${fileDirName}" 'FNR>3 {print > outputPath "/" "fin_" FILENAME}' "${fileDirName}/n_cust*.txt"
done

fileBaseName = `basename "${fname}"` # Wrong
fileBaseName=`basename "${fname}"`   # Correct

amarn · October 27, 2011, 7:35am

Thank you to both CarloM and felipe.vinturin! The first part works fine, but the last two awk commands seem to not find the requested files. The first command successfully generates all the cust_*.txt files but the other two cannot execute...is there another way of expressing it?

thanks again

felipe.vinturin · October 27, 2011, 7:44am

Some suggestions to help you solve your problem/errors:
-----> Try to execute the script for only one "hour"
-----> Put this at the beginning of your script: "set -xv" and debug it
-----> Put some "echo" commands also to debug it
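
A small sketch of what tracing gives you, assuming a POSIX sh. The -x option prints each command to stderr after variable expansion, which makes it easy to see what awk is really being called with (-v additionally echoes the script lines as they are read). The variable name and message below are made up for the demonstration.

```shell
#!/bin/sh
# Run a tiny command list in a child shell with tracing on,
# and capture only what -x writes to stderr.
trace=$(sh -xc 'dir=hour_1; echo "working on $dir"' 2>&1 >/dev/null)

# Each traced command appears on its own line, prefixed with "+ ".
echo "$trace"
```

A whole script can be traced the same way without editing it, by running it as: sh -x script.sh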

amarn · October 27, 2011, 10:00am

Hi,

I've been going through some trial-error procedures...

basically :

!usr/bin/sh

set -xv

find /home/tester/datasets_26_10_11/hour_1/ -name "cust_*.txt" -type f | \


while read fname
do
	fileBaseName=`basename "${fname}"`
	fileDirName=`dirname "${fname}"` 

	echo "fileBaseName: [${fileDirName}][${fileBaseName}] - fname[${fname}]"

	echo "now working on: [${fname}] with [${fileBaseName}]"
	#this is right! - for the shell script being in the same directory , not outside
	awk -v outputPath="${fileDirName}" 'BEGIN{RS =" ";}{print > outputPath "/" "n_"FILENAME}' ${fileBaseName}
	echo "fileBasename AFTER FIRST AWK : [${fileBaseName}]"

        #the following not tested yet
	awk -v outputPath="${fileDirName}" -v inputPath="${fileDirName}" 'FNR>3 {print > outputPath "/" "fin_"FILENAME}' ${fileBaseName}


done

This works only if the script is run from inside the actual hour_1/ directory. If it is run from anywhere else, I get an awk error saying that it cannot read the file, e.g.:

awk: fatal : cannot open `file cust_1209_3000'  for reading (no such file or directory)

so, it can actually find this particular file but cannot read it...i believe it has something to do with the ` or ' surrounding it.....any suggestions?

vgersh99 · October 27, 2011, 10:27am

A couple of things: the shebang line needs its leading # and /, and the file name passed to awk should be quoted:

#!/usr/bin/sh
awk -v outputPath="${fileDirName}" 'BEGIN{RS =" ";}{print > outputPath "/" "n_"FILENAME}' "${fileBaseName}"
awk -v outputPath="${fileDirName}" -v inputPath="${fileDirName}" 'FNR>3 {print > outputPath "/" "fin_"FILENAME}' "${fileBaseName}"

CarloM · October 27, 2011, 10:29am

awk -v outputPath="${fileDirName}" 'BEGIN{RS =" ";}{print > outputPath "/" "n_"FILENAME}' ${fileBaseName}

You still need $fname (i.e. full path) for the input file to awk.
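
A minimal sketch of why the full path matters, using a made-up file cust_1_2.txt in a temporary tree. awk resolves a relative input name against the current working directory, so the bare base name only works when the command is run from inside the directory that holds the file.

```shell
#!/bin/sh
# Hypothetical demo tree; the file name and its content are made up.
root=$(mktemp -d)
mkdir "$root/hour_1"
echo "x y z" > "$root/hour_1/cust_1_2.txt"

fname="$root/hour_1/cust_1_2.txt"
fileBaseName=$(basename "$fname")

# Run awk from a directory that does not contain the file.
bare_out=$(cd "$root" && awk '{ print $1 }' "$fileBaseName" 2>/dev/null)
bare_status=$?                       # non-zero: awk could not open the file
full_out=$(cd "$root" && awk '{ print $1 }' "$fname")

echo "bare name: status=$bare_status output=[$bare_out]"
echo "full path: output=[$full_out]"

rm -rf "$root"
```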

amarn · October 27, 2011, 11:09am

Dear both, thank you for your help. I've tried both of your approaches; unfortunately neither works.

based on vgersh99

#!/usr/bin/sh

set -xv

find /home/tester/datasets_26_10_11/hour_1/ -name "cust_*.txt" -type f | \


while read fname
do
	fileBaseName=`basename "${fname}"`
	fileDirName=`dirname "${fname}"` 

	echo "fileBaseName: [${fileDirName}][${fileBaseName}] - fname[${fname}]"

#	awk -v outputPath="${fileDirName}" '{print $0 > outputPath "/" $2".txt"}' "${fname}"
#	awk -v outputPath="${fileDirName}" 'FNR>3 {print > outputPath "/" "fin_"FILENAME}' "${fileDirName}/n_cust*"

	echo "now working on: [${fname}] with [${fileBaseName}]"
	
	
# testing only first awk command
	awk -v outputPath="${fileDirName}" 'BEGIN{RS =" ";}{print > outputPath "/" "n_"FILENAME}' "${fileBaseName}"


	echo "fileBasename AFTER FIRST AWK : [${fileBaseName}]"

#second awk command
#	awk -v outputPath="${fileDirName}" -v inputPath="${fileDirName}" 'FNR>3 {print > outputPath "/" "fin_"FILENAME}' ${fileBaseName}

done

based on CarloM

#!/usr/bin/sh

set -xv

find /home/tester/datasets_26_10_11/hour_1/ -name "cust_*.txt" -type f | \


while read fname
do
	fileBaseName=`basename "${fname}"`
	fileDirName=`dirname "${fname}"` 

	echo "fileBaseName: [${fileDirName}][${fileBaseName}] - fname[${fname}]"

#	awk -v outputPath="${fileDirName}" '{print $0 > outputPath "/" $2".txt"}' "${fname}"
#	awk -v outputPath="${fileDirName}" 'FNR>3 {print > outputPath "/" "fin_"FILENAME}' "${fileDirName}/n_cust*"

	echo "now working on: [${fname}] with [${fileBaseName}]"
	
	
	awk -v outputPath="${fileDirName}" -v inputPath="${fileDirName}" 'BEGIN{RS =" ";}{print > outputPath "/" "n_"FILENAME}' ${fileBaseName}


	echo "fileBasename AFTER FIRST AWK : [${fileBaseName}]"

#	awk -v outputPath="${fileDirName}" -v inputPath="${fileDirName}" 'FNR>3 {print > outputPath "/" "fin_"FILENAME}' ${fileBaseName}

done

and in both cases i get the awk error again

thanks again for the quick responses

felipe.vinturin · October 27, 2011, 11:17am

Please, put a full example (for an "hour", eg hour_1), with all filenames, directory structure and file contents.

After this we will be able to help you. =o)

CarloM · October 27, 2011, 11:28am

I think you misunderstood - giving awk a variable name 'inputPath' doesn't do anything by itself.

Change

awk -v outputPath="${fileDirName}" -v inputPath="${fileDirName}" 'BEGIN{RS =" ";}{print > outputPath "/" "n_"FILENAME}' ${fileBaseName}

to:

awk -v outputPath="${fileDirName}" 'BEGIN{RS =" ";}{print > outputPath "/" "n_"FILENAME}' ${fname}
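
To make the point about -v concrete, here is a small sketch. A variable passed with -v only has an effect if the awk program refers to it; it never changes which input files awk opens. The names inputPath and prefix are just illustrative.

```shell
#!/bin/sh
# inputPath is set but never used by the program, so it changes nothing:
# awk still reads from stdin here.
unused=$(echo "a b" | awk -v inputPath="/no/such/dir" '{ print $2 }')

# prefix is referenced inside the program, so its value shows up in the output.
used=$(echo "a b" | awk -v prefix="n_" '{ print prefix $1 }')

echo "$unused"   # b
echo "$used"     # n_a
```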

amarn · October 27, 2011, 11:40am

Hi CarloM,

I've tried that as well (with my shell script outside the hour_1/ directory) and I get the following error for every cust_*.txt file (which is why I wrote cust_xxxx.yyyy.txt):

awk: cmd. line1 (FILENAME=/home/tester/dataset/hour_1/cust_xxxx.yyyy.txt FNR=1) cannot redirect to `/home/tester/dataset/hour_1/n_/home/tester/datasets/hour_1/cust_xxxx.yyyy.txt

---------- Post updated at 10:40 AM ---------- Previous update was at 10:36 AM ----------

ok (this for felipe.)

i have a directory /home/datasets/ which contains 720 directories of hours

e.g. : hour_1/ hour_2/ ....... up to hour_720/

(Example for the hour_1/ which applies to all the hour_*/ i mentioned above)

For the first hour of an experiment I have a folder named hour_1/.
In this folder there is a file called hour1.txt, which was broken down record by record into many cust_xxx_yyy.txt files (for this particular hour there are 1160 cust_* files).

an example of a cust_xxxx_yyyy

(after the $3 field, which in this case is the number 12, every cust file has a varying number of fields)

what i want to do is:

Set the record separator to RS = " " for every cust_xxx_yyy.txt in order to turn it into a single-column file, and then remove the first 3 records from every file.

By hand I can apply the following two awk commands, one to set the RS and one to remove the first 3 records.

Setting the record separator:

awk '{print >  "n_"FILENAME}' RS=" " cust_*

Removing the first three records

awk 'FNR>3 {print > "fin_"FILENAME}' n_cust*

and i want to apply this for all the hour_*/ directories

however, i'm now working only in hour_1/ and i get the errors i mentioned in my previous posts.

thanks again

CarloM · October 27, 2011, 11:53am

I'm afraid that didn't clarify things at all.

What is the code you're currently running, exactly what's in the directory you're running it in, and exactly what output do you get?

EDIT: Actually, just try this:

find /home/tester/datasets/ -name "hour_*.txt" -type f | \
while read fname
do
	fileBaseName=`basename "${fname}"`
	fileDirName=`dirname "${fname}"`

	awk -v outputPath="${fileDirName}" '{print $0 > outputPath "/" $2 ".txt"}' "${fname}"
	awk -v outputPath="${fileDirName}" '{print > outputPath "/" "n_" FILENAME}' RS=" " "${fileDirName}"/cust_*.txt
	awk -v outputPath="${fileDirName}" 'FNR>3 {print > outputPath "/" "fin_" FILENAME}' "${fileDirName}"/n_cust*.txt
done
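
The change that matters in the last two awk lines above is that the wildcard now sits outside the double quotes. A minimal sketch of the difference, on a temporary directory with two made-up files: inside quotes the * is passed on literally, so awk is asked to open a file that is really called cust_*.txt.

```shell
#!/bin/sh
# Hypothetical demo directory with two made-up files.
root=$(mktemp -d)
touch "$root/cust_1.txt" "$root/cust_2.txt"

# Quoted: no pathname expansion, the pattern stays one literal argument.
set -- "$root/cust_*.txt"
quoted=$#

# Directory quoted, wildcard unquoted: the shell expands it to the matching files.
set -- "$root"/cust_*.txt
unquoted=$#

echo "quoted glob   -> $quoted argument(s)"
echo "unquoted glob -> $unquoted argument(s)"

rm -rf "$root"
```

Keeping the directory part quoted still protects against spaces in the path while letting the shell expand the pattern.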

amarn · October 27, 2011, 12:14pm

the latest code im running after your suggestion is (execAWK.sh) :

this script is placed in /home/tester/datasets/

where the /home/tester/datasets_26_10_11/ contains all the hour_*/ directories i explained in my previous post

#!/usr/bin/sh

set -xv

find /home/tester/datasets_26_10_11/ -name "cust_*.txt" -type f | \


while read fname
do
	fileBaseName=`basename "${fname}"`
	fileDirName=`dirname "${fname}"` 

	echo "fileBaseName: [${fileDirName}][${fileBaseName}] - fname[${fname}]"


	echo "now working on: [${fname}] with [${fileBaseName}]"
	

#first awk command	
	awk -v outputPath="${fileDirName}" 'BEGIN{RS =" ";}{print > outputPath "/" "n_"FILENAME}' "${fname}"


	echo "fileBasename AFTER FIRST AWK : [${fileBaseName}]"

#second awk command - not included for now
#	awk -v outputPath="${fileDirName}" -v inputPath="${fileDirName}" 'FNR>3 {print > outputPath "/" "fin_"FILENAME}' ${fileBaseName}

done

the output i'm getting is this (i'm just copying part of it since it is extremely big):

now working on: [/home/tester/datasets_26_10_11/hour1/cust_1064_219239.txt] with [cust_1064_219239.txt]
+ awk -v outputPath=/home/tester/datasets_26_10_11/hour1 'BEGIN{RS =" ";}{print > outputPath "/" "n_"FILENAME}' /home/tester/datasets_26_10_11/hour1/cust_1064_219239.txt
+ clip
awk: cmd. line:1: (FILENAME=/home/tester/datasets_26_10_11/hour1/cust_1064_219239.txt FNR=1) fatal: can't redirect to `/home/tester/datasets_26_10_11/hour1/n_/home/tester/datasets_26_10_11/hour1/cust_1064_219239.txt' (No such file or directory)
+ read fname
basename "${fname}"
++ basename /home/tester/datasets_26_10_11/hour1/cust_1072_220262.txt
+ fileBaseName=cust_1072_220262.txt
dirname "${fname}"
++ dirname /home/tester/datasets_26_10_11/hour1/cust_1072_220262.txt
+ fileDirName=/home/tester/datasets_26_10_11/hour1
+ echo 'fileBaseName: [/home/tester/datasets_26_10_11/hour1][cust_1072_220262.txt] - fname[/home/tester/datasets_26_10_11/hour1/cust_1072_220262.txt]'
fileBaseName: [/home/tester/datasets_26_10_11/hour1][cust_1072_220262.txt] - fname[/home/tester/datasets_26_10_11/hour1/cust_1072_220262.txt]
+ echo 'now working on: [/home/tester/datasets_26_10_11/hour1/cust_1072_220262.txt] with [cust_1072_220262.txt]'
now working on: [/home/tester/datasets_26_10_11/hour1/cust_1072_220262.txt] with [cust_1072_220262.txt]
+ awk -v outputPath=/home/tester/datasets_26_10_11/hour1 'BEGIN{RS =" ";}{print > outputPath "/" "n_"FILENAME}' /home/tester/datasets_26_10_11/hour1/cust_1072_220262.txt
+ clip
awk: cmd. line:1: (FILENAME=/home/tester/datasets_26_10_11/hour1/cust_1072_220262.txt FNR=1) fatal: can't redirect to `/home/tester/datasets_26_10_11/hour1/n_/home/tester/datasets_26_10_11/hour1/cust_1072_220262.txt' (No such file or directory)
+ read fname
basename "${fname}"
++ basename /home/tester/datasets_26_10_11/hour1/cust_1077_222034.txt
+ fileBaseName=cust_1077_222034.txt
dirname "${fname}"
++ dirname /home/tester/datasets_26_10_11/hour1/cust_1077_222034.txt
+ fileDirName=/home/tester/datasets_26_10_11/hour1
+ echo 'fileBaseName: [/home/tester/datasets_26_10_11/hour1][cust_1077_222034.txt] - fname[/home/tester/datasets_26_10_11/hour1/cust_1077_222034.txt]'
fileBaseName: [/home/tester/datasets_26_10_11/hour1][cust_1077_222034.txt] - fname[/home/tester/datasets_26_10_11/hour1/cust_1077_222034.txt]
+ echo 'now working on: [/home/tester/datasets_26_10_11/hour1/cust_1077_222034.txt] with [cust_1077_222034.txt]'
now working on: [/home/tester/datasets_26_10_11/hour1/cust_1077_222034.txt] with [cust_1077_222034.txt]
+ awk -v outputPath=/home/tester/datasets_26_10_11/hour1 'BEGIN{RS =" ";}{print > outputPath "/" "n_"FILENAME}' /home/tester/datasets_26_10_11/hour1/cust_1077_222034.txt
+ clip
awk: cmd. line:1: (FILENAME=/home/tester/datasets_26_10_11/hour1/cust_1077_222034.txt FNR=1) fatal: can't redirect to `/home/tester/datasets_26_10_11/hour1/n_/home/tester/datasets_26_10_11/hour1/cust_1077_222034.txt' (No such file or directory)
+ read fname
basename "${fname}"
++ basename /home/tester/datasets_26_10_11/hour1/cust_1080_222291.txt
+ fileBaseName=cust_1080_222291.txt
dirname "${fname}"
++ dirname /home/tester/datasets_26_10_11/hour1/cust_1080_222291.txt
+ fileDirName=/home/tester/datasets_26_10_11/hour1
+ echo 'fileBaseName: [/home/tester/datasets_26_10_11/hour1][cust_1080_222291.txt] - fname[/home/tester/datasets_26_10_11/hour1/cust_1080_222291.txt]'
fileBaseName: [/home/tester/datasets_26_10_11/hour1][cust_1080_222291.txt] - fname[/home/tester/datasets_26_10_11/hour1/cust_1080_222291.txt]
+ echo 'now working on: [/home/tester/datasets_26_10_11/hour1/cust_1080_222291.txt] with [cust_1080_222291.txt]'
now working on: [/home/tester/datasets_26_10_11/hour1/cust_1080_222291.txt] with [cust_1080_222291.txt]

i hope this helps, thank you again

felipe.vinturin · October 27, 2011, 12:23pm

The problem is that awk's FILENAME holds the file name exactly as it was passed on the command line, path included, so "n_" FILENAME becomes n_/home/tester/... instead of n_cust_....txt!

Try replacing FILENAME with its last path component:

# E.g. awk '{ns=split(FILENAME, arr, "/"); print arr[ns]}' <infile>
# Wherever FILENAME is used in an output name, replace it with arr[ns], and don't forget the ns=split(FILENAME, arr, "/") before it

Code:

awk -v outputPath="${fileDirName}" 'BEGIN{RS =" ";}{ns=split(FILENAME, arr, "/"); print > outputPath "/" "n_" arr[ns]}' "${fname}"
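
The command above can be tried on a throwaway file as follows. This is only a sketch: the file cust_1_2.txt and its content are made up. The output expression is wrapped in parentheses, which is not in the original command but avoids an ambiguity some awk implementations have with unparenthesized concatenation after ">".

```shell
#!/bin/sh
# Hypothetical demo tree with one made-up record.
root=$(mktemp -d)
mkdir "$root/hour_1"
echo "r1 cust_1_2 12 a b" > "$root/hour_1/cust_1_2.txt"

fname="$root/hour_1/cust_1_2.txt"
fileDirName=$(dirname "$fname")

awk -v outputPath="$fileDirName" '
    BEGIN { RS = " " }
    {
        # split() returns the number of pieces, so arr[ns] is the last
        # path component, i.e. the bare file name.
        ns = split(FILENAME, arr, "/")
        print > (outputPath "/n_" arr[ns])
    }' "$fname"

created=$(ls "$fileDirName")
first=$(head -n 1 "$fileDirName/n_cust_1_2.txt")
echo "$created"

rm -rf "$root"
```

The new file lands next to the input file as n_cust_1_2.txt, rather than awk trying to create a path that starts with n_/.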

CarloM · October 27, 2011, 12:27pm

EDIT: What felipe said :).

So this should work:

find /home/tester/datasets/ -name "hour_*.txt" -type f | \
while read fname
do
	fileBaseName=`basename "${fname}"`
	fileDirName=`dirname "${fname}"`

	awk -v outputPath="${fileDirName}" '{print $0 > outputPath "/" $2 ".txt"}' "${fname}"
	awk -v outputPath="${fileDirName}" '{ns=split(FILENAME, arr, "/"); print > outputPath "/" "n_" arr[ns]}' RS=" " "${fileDirName}"/cust_*.txt
	awk -v outputPath="${fileDirName}" 'FNR>3 {ns=split(FILENAME, arr, "/"); print > outputPath "/" "fin_" arr[ns]}' "${fileDirName}"/n_cust*.txt
done
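
As an alternative sketch (not the approach used in the script above), the three original commands can be run unchanged by changing into each hour directory inside a subshell. FILENAME is then a bare file name, so no path handling is needed in awk. The demo tree and its records are made up, and the sketch assumes fields are separated by single spaces.

```shell
#!/bin/sh
# Hypothetical demo tree standing in for /home/tester/datasets/.
root=$(mktemp -d)
for n in 1 2
do
    mkdir "$root/hour_$n"
    # Made-up records: field 2 is the identifier, the rest is payload.
    printf '%s\n' "r1 cust_${n}_100 12 a b c" "r2 cust_${n}_200 12 d e" \
        > "$root/hour_$n/hour_$n.txt"
done

for dir in "$root"/hour_*/
do
    (
        cd "$dir" || exit 1      # leaves only this subshell on failure
        # (1) one file per record, named after field 2
        awk '{ print $0 > ($2 ".txt") }' hour_*.txt
        # (2) one field per line; RS=" " must be a single argument
        awk '{ print > ("n_" FILENAME) }' RS=" " cust_*.txt
        # (3) drop the first three lines of each n_ file
        awk 'FNR > 3 { print > ("fin_" FILENAME) }' n_cust_*.txt
    )
done

result=$(cat "$root/hour_1/fin_n_cust_1_100.txt")
echo "$result"

rm -rf "$root"
```

Because the cd happens inside parentheses, the script's own working directory never changes. With well over a thousand output files per directory, some awk implementations may run into a limit on simultaneously open files; calling close() on each output file once it is finished is the usual remedy.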