nawk newbie

hi,

I've got a number of csv files (e.g. 1.csv, 2.csv ...) in a directory to process.

I need to read them quickly and, using nawk, carry out regular expression matching; I would then write the results into respective output files (e.g. 1.csv.out, 2.csv.out).

In my code:

BEGIN {
    regex = "abx"
    while ((getline < "/workspace/folder/1.csv") > 0) {
        if ($6 ~ regex)
            print $1, $3, $4 >> "/workspace/folderout/1.csv.out"
    }
}

It works when the file name is hard coded, but a for loop doesn't seem to work.
I had:

"for FILE in /workspace/folder*.csv" before the BEGIN

but it gives me a syntax error at the for.

pls help :o

regex="abx"
cd /workspace/folderout

for file in *.csv
do
   nawk -v str="$regex" '{if ($6 ~ str) print $1,$3,$4 > FILENAME ".out"}' "$file"
done

Please post an example of input and expected output; then people will be able to provide more accurate help.


A double quote is missing: ".out"


hi, thanks,

cd /workspace/folderout

didn't work properly, as the input file gets overwritten; I'd like to keep the original file too.

FILENAME ".out"

There's a syntax error for this; I think FILENAME can't be manipulated inside the single quotes.

regex="abx"; for file in *.cvs; do awk -v str=$regex '{if ($6==str) print $1,$3,$4}' $file >$file.out; done

FILENAME is a built-in variable in awk; no quotes are needed around it.
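
Some awk versions are also picky about an unparenthesized concatenation after the > redirection, which may be the syntax error you hit. A minimal sketch, assuming the csv files are in the current directory: wrapping the target in parentheses makes the parse unambiguous:

nawk -v str="$regex" '$6 ~ str { print $1, $3, $4 > (FILENAME ".out") }' *.csv

With very many input files you may also need to close() finished .out files to avoid running out of file descriptors.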

This is how the script looks now:

regex="abx"; for file in *.csv; do nawk -v str=$regex '{if ($6 ~ str) print $1,$3,$4}' $file >$file.out; done

I need to do regular expression matching using '~'.

As I've got many csv files, is there a way to make the above run faster, other than running the script multiple times simultaneously?

Maybe something like:

egrep -e "$regex" *.csv >/tmp/regex_csv.out

With a little more parsing, you could get the info as well.
This would only generate one *.out file, but the info would still be in there.
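
For example, a rough sketch of that extra parsing, splitting the matches back into one .out file per source csv. It assumes the file names contain no ':' (grep uses ':' to separate the file name prefix from the matched line, and only adds the prefix when given more than one file), and note that egrep matches the regex anywhere in the line, not just in field 6:

egrep "$regex" /workspace/folder/*.csv |
awk '{
    i = index($0, ":")               # each line looks like "filename:matched line"
    f = substr($0, 1, i - 1) ".out"  # one .out per source file
    $0 = substr($0, i + 1)           # drop the prefix; awk re-splits the fields
    print $1, $3, $4 > f
}'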

If i/o (hard drive access, for example) is the bottleneck, and if all the files are on the same device, then your simultaneous instances of the script will be contending for the same limited i/o resource, which will likely result in a drop in performance.

Regardless of the bottleneck, using find with -exec ... + will involve much less process creation -- which can be expensive if your files are many, and a substantial portion of the work if they are also small -- thereby speeding things up (perhaps substantially).

regex=abx
find . -type f -name '*.csv' -exec awk -v re="$regex" '
    FNR == 1 { close(f); f = FILENAME ".out" }   # new input file: close the previous output, derive the new name
    $6 ~ re  { print $1, $3, $4 > f }            # write matching fields to the per-file .out
' {} +

Regards,
Alister