nawk newbie

hi,

I've got a number of csv files (e.g. 1.csv, 2.csv ...) in a directory to process.

I need to read them quickly and, using nawk, carry out regular expression matching; I would then write the results into respective output files (e.g. 1.csv.out, 2.csv.out).

In my code:

BEGIN {
    regex = "abx"
    while ((getline < "/workspace/folder/1.csv") > 0) {
        if ($6 ~ regex)
            print $1, $3, $4 >> "/workspace/folderout/1.csv.out"
    }
}

It works when the file name is hard coded, but a for loop doesn't seem to work.
I had:

"for FILE in /workspace/folder*.csv" before the BEGIN

but it gives me a syntax error at the for.

pls help :o

regex="abx"
cd /workspace/folderout

for file in *.csv
do
   nawk -v str="$regex" '{if ($6 ~ str) print $1,$3,$4 > FILENAME ".out"}' "$file"
done

Please post an example of input and expected output; then people will be able to provide more accurate help.


A double quote is missing: ".out"


hi, thanks,

cd /workspace/folderout

didn't work properly, as the input file gets overwritten; I'd like to keep the original file too.

FILENAME ".out"

There's a syntax error for this; I think FILENAME can't be manipulated inside the single quotes.

regex="abx"; for file in *.cvs; do awk -v str=$regex '{if ($6==str) print $1,$3,$4}' $file >$file.out; done

FILENAME is a built-in variable in awk; no quotes are needed around it.
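
Some awk versions are also picky about an unparenthesized concatenation after the > redirection, which may be the syntax error you hit. A minimal sketch, assuming the csv files are in the current directory: wrapping the target in parentheses makes the parse unambiguous:

nawk -v str="$regex" '$6 ~ str { print $1, $3, $4 > (FILENAME ".out") }' *.csv

With very many input files you may also need to close() finished .out files to avoid running out of file descriptors.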

This is how the script looks now:

regex="abx"; for file in *.csv; do nawk -v str=$regex '{if ($6 ~ str) print $1,$3,$4}' $file >$file.out; done

I need to do regular expression matching using '~'.

As I've got many csv files, is there a way to make the above run faster, other than running the script multiple times simultaneously?

Maybe something like:

egrep -e "$regex" *.csv >/tmp/regex_csv.out

With a little more parsing, you could get the info as well.
This would only generate one *.out file, but the info would still be in there.
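
For example, a rough sketch of that extra parsing, splitting the matches back into one .out file per source csv. It assumes the file names contain no ':' (grep uses ':' to separate the file name prefix from the matched line, and only adds the prefix when given more than one file), and note that egrep matches the regex anywhere in the line, not just in field 6:

egrep "$regex" /workspace/folder/*.csv |
awk '{
    i = index($0, ":")               # each line looks like "filename:matched line"
    f = substr($0, 1, i - 1) ".out"  # one .out per source file
    $0 = substr($0, i + 1)           # drop the prefix; awk re-splits the fields
    print $1, $3, $4 > f
}'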

If i/o (hard drive access, for example) is the bottleneck, and if all the files are on the same device, then your simultaneous instances of the script will be contending for the same limited i/o resource, which will likely result in a drop in performance.

Regardless of the bottleneck, using find with -exec ... + will involve much less process creation -- which can be expensive if your files are many, and a substantial portion of the work if they are also small -- thereby speeding things up (perhaps substantially).

regex=abx
find . -type f -name '*.csv' -exec awk -v re="$regex" '
    FNR == 1 { close(f); f = FILENAME ".out" }   # new input file: close the previous output, derive the new name
    $6 ~ re  { print $1, $3, $4 > f }            # write matching fields to the per-file .out
' {} +

Regards,
Alister