How to make an awk command faster?

I have the command below, which reads a large file and takes 3 hours to run. Can something be done to make it faster?

 awk -F ',' '{OFS=","}{ if ($13 == "9999") print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12 }' ${NLAP_TEMP}/hist1.out|sort -T ${NLAP_TEMP} |uniq> ${NLAP_TEMP}/hist2.final

The hist1.out file looks like this:

 rp01_2017,1002302_43,1,103,0074,0,0,0,0,0,0,18,9994
rp01_2017,1002302_43,1,103,0077,0,0,0,0,0,0,18,9999
rp01_2018,1002302_43,1,103,0074,0,0,0,0,0,0,9,9994
rp01_2018,1002302_43,1,103,0077,0,0,0,0,0,0,9,9999
rp10_2017,1002302_43,1,103,0074,0,0,0,0,0,0,16,9994
rp10_2017,1002302_43,1,103,0077,0,0,0,0,0,0,16,9999
rp10_2018,1002302_43,1,103,0074,0,0,0,0,0,0,4,9994
rp10_2018,1002302_43,1,103,0077,0,0,0,0,0,0,4,9999
rp18_2017,1002302_43,1,103,0074,0,0,0,0,0,0,14,9994
rp18_2017,1002302_43,1,103,0077,0,0,0,0,0,0,14,9999

Would using sed be any quicker?

sed -n '/,9999$/ s///p' ${NLAP_TEMP}/hist1.out|sort -u -T ${NLAP_TEMP}> ${NLAP_TEMP}/hist2.final

Basically, if the last field matches ,9999, the substitution deletes that field and sed prints the modified line. Also, by using sort -u rather than sort | uniq you reduce the number of processes in the pipeline by one.
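A tiny illustration, on made-up input, of how the empty regex in s///p reuses the address pattern that just matched:

printf 'a,1,9999\nb,2,9994\n' | sed -n '/,9999$/ s///p'
a,1

Only the line ending in ,9999 is printed, with that trailing field stripped.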

Andrew

A BEGIN section is only executed once (at the beginning).
And maybe sort can be skipped by just eliminating duplicates?

awk 'BEGIN { FS=OFS="," } ($13 == "9999" && !($0 in s)) { s[$0]; print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12 }' ${NLAP_TEMP}/hist1.out > ${NLAP_TEMP}/hist2.final

The following variant works like the previous sed solution:

awk '(sub(/,9999$/,"") && !($0 in s)) { s[$0]; print }' ${NLAP_TEMP}/hist1.out > ${NLAP_TEMP}/hist2.final

Hoping that sort has advanced algorithms: sort on the last field in reverse order first, so that all the ,9999 lines come to the top; awk then strips that field and stops at the first line that no longer matches:

sort -t, -k13r file | awk -F, 'sub (/,9999$/, _) {print; next} {exit}'
rp01_2017,1002302_43,1,103,0077,0,0,0,0,0,0,18
rp01_2018,1002302_43,1,103,0077,0,0,0,0,0,0,9
rp10_2017,1002302_43,1,103,0077,0,0,0,0,0,0,16
rp10_2018,1002302_43,1,103,0077,0,0,0,0,0,0,4
rp18_2017,1002302_43,1,103,0077,0,0,0,0,0,0,14

Thank you all for your responses.

The command below has really helped in reducing the time. Would this also sort the rows, or do we need to run sort after it?

awk 'BEGIN { FS=OFS="," } ($13 == "9999" && !($0 in s)) { s[$0]; print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12 }' ${NLAP_TEMP}/hist1.out > ${NLAP_TEMP}/hist2.final

That script does not sort the rows, as it was not asked to.

It would be interesting to see a comparison between the different approaches. Could you time each and post the results?
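For the record, timing each variant is just a matter of prefixing it with time; writing to /dev/null keeps the cost of the final output file out of the comparison (adjust paths as needed):

time awk 'BEGIN { FS=OFS="," } ($13 == "9999" && !($0 in s)) { s[$0]; print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12 }' ${NLAP_TEMP}/hist1.out > /dev/null
time sed -n '/,9999$/ s///p' ${NLAP_TEMP}/hist1.out | sort -u -T ${NLAP_TEMP} > /dev/null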

It reduced the time by not sorting at all. Sorting is liable to be what took the lion's share of the time.

If you need it sorted, and need it sorted faster, point sort to a different disk for temporary space with -T /path/to/folder. Using a different disk for temp space will increase the speed at which your data can be read.

GNU sort also has a --parallel option, but this is not much help unless you have extraordinarily fast disks.
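If it is GNU sort, more memory and cores can also be thrown at it; the buffer size, core count and temp path below are only placeholders to adapt to your machine:

awk 'BEGIN { FS=OFS="," } $13 == "9999" { print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12 }' ${NLAP_TEMP}/hist1.out |
    sort -u -S 2G --parallel=4 -T /other/disk/tmp > ${NLAP_TEMP}/hist2.final

A larger -S buffer means fewer temporary merge files, which often helps more than --parallel when the disk is the bottleneck.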

Sure, I will post the time for each of the commands.

Just wanted to check: would the sort below be faster if I give it the temp folder path, or should I change the path to some other folder?

sort -T ${NLAP_TEMP} -u ${NLAP_TEMP}/hist1.out > ${NLAP_TEMP}/hist2.final; VerifyExit

See, every single disk can do only one thing at a time: reading a byte somewhere means it can't read (or write) a byte somewhere else at that time.

Temporary files are (at least) written once and (at least) read once, your input file is (at least) read once and your output file is written once. For all these tasks you want to involve different disks, so that, while one file is being read or written, another might also be read or written at the same time.

This should answer your question: ideally you want a separate disk for each of the three files involved. Perhaps the fastest disk should be assigned to the temporary files, because they are probably read and written the most often.
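As a concrete sketch (the mount points are invented; substitute whatever separate filesystems you actually have):

# input on one disk, temporary space on a second, output on a third -- all paths hypothetical
sort -u -T /disk2/tmp /disk1/data/hist1.out > /disk3/out/hist2.final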

I hope this helps.

bakunin

I could see that NLAP_TEMP is the fastest directory, so I added it to the sort command, but it does not seem to help. The awk command takes just 7 minutes; the issue is only with sort, which is taking a long time.

sort -T ${NLAP_TEMP} -u ${NLAP_TEMP}/aplymeas5d.dyn.out.tmp1 > ${NLAP_HOME}/backup/aplymeas5d.dyn.final1


Please let me know how I can make sort faster. The file size is 4 GB and the sorting is taking 3 hours. We have only one disk for the TEMP folder and 50 GB of space.

You have received several hints in this thread on how to accelerate the sort process. What were the results of each? Did you consult man sort for additional options?

Did you check that the directories are on different physical disks? By that I mean you need to check that they are separate filesystems and what those filesystems are built from, not just that the directory names are different (there is a quick check sketched after the list below). What you may think of as a single update to a file will cause multiple updates on the disk. There is at least:-

  • the actual disk block for the data
  • the file's inode update with the last modified time
  • the directory (for a new file or rename) and its inode
  • the filesystem superblock (usually plural) when you get a new disk block from the free list by creating or extending the file
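A quick way to check where the directories really live is to look up which filesystem, and hence which device, each one sits on, for example:

df -P ${NLAP_TEMP} ${NLAP_HOME}/backup    # same filesystem listed twice means same device
lsblk                                     # on Linux: how the filesystems map to physical disks

Note that two different filesystems can still be partitions of the same physical disk, which is what lsblk (or your platform's equivalent) will reveal.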

You also have to consider contention from other processing and if this is using NFS mounted filesystems, then you have the overhead of network traffic to bring into it.

I don't know how you have your disks provisioned. Can you explain it? If it is SAN, then that might be more difficult to speed up and depends on the disk at the back-end, the fibre capacity etc. At the other extreme, a PC with a single disk is just going to have contention even if you have a large disk cache.

Overall, if you have lots of data it is just going to take a while. I doubt I will be able to better the suggestions from my fellow learned members. How big is your input file anyway (in bytes and records)? If you try to do too much processing in one chunk, then you may also exhaust memory and cause your server to page/swap. Keeping this to discrete steps may alleviate that bottleneck, but may cost more in disk IO. It is difficult to tell.
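To get both of those figures in one go:

wc -lc ${NLAP_TEMP}/hist1.out    # prints the record (line) count and the byte count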

If you don't mind the 13th field still being there (given that they will all be 9999), you might be able to save a little by stripping the command right back and doing this:-

grep -E ",9999$" hist1.out | sort -uT ${NLAP_TEMP} > hist2.final

That -u flag on the sort saves a process, and therefore the memory it would have used (with its risk of paging/swapping) and the cost of passing the data between processes, so that might help.

I hope that this is useful, but there will always be a limit we will hit.

Robin

I tried all the options, but sort is not getting any faster.

Since we have only one CPU, should we go for splitting the file, sorting the individual pieces, and then merging them into a single file?
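For reference, the split-then-merge idea looks roughly like this with standard tools (chunk size and file names are arbitrary):

cd ${NLAP_TEMP}
split -l 10000000 hist1.out chunk_               # cut the input into pieces
for f in chunk_*; do sort -u "$f" > "$f.srt"; done
sort -m -u chunk_*.srt > hist2.final             # merge the already-sorted pieces
rm -f chunk_*

On a single CPU and a single disk this largely re-implements what sort -T already does internally (it, too, sorts chunks in memory and merges temporary files), so the gain may well be small.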

What about all the questions people asked you?

What physical disks are your various folders on?

If you don't know, trying random folders is unlikely to help.