Bash script search, improve performance with large files

Hello,

For several of our scripts we are using awk to search for patterns from one file in the data of another file. This works almost perfectly, except that it takes ages to run on larger files. I am wondering if there is a way to speed this process up, or if there is something else that is quicker at searching.

The part that I use is as follows:

awk -F";" '
NR==FNR         {id[$0]
                 next
                }
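# each line of the second file is then tested against every stored pattern in turn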
                {for (SP in id) if (tolower($0) ~ SP)    {print > "'"$PAD/removed_woord.csv"'"
                                                 next
                                                }
                }
                {print > "'"$PAD/filtered_winnaar_2.csv"'"
                }
' $PAD/prijslijst_filter.csv $PAD/lowercase_winnaar.csv

I got this piece of code from this forum as well, but I added the tolower part myself since it did not always seem to catch all results from the main file. One important requirement is that the filtered-out results need to be saved in a separate file. The filtered file then only contains the lines that were not found, of course.

Your above awk script is so minimalistic that it's hard to dream up a dramatic improvement.
Did you try

grep -f  $PAD/prijslijst_filter.csv $PAD/lowercase_winnaar.csv

for a performance comparison?

You might want to build an "alternation regex" (provided there are not too many keywords) and modify the matching slightly. Compare the performance of

awk '
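# first file: build one big alternation regex, pattern1|pattern2|..., so each data line needs only a single match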
NR==FNR                 {SRCH=SRCH DL $0
                         DL = "|"
                         next
                        }
tolower($0) ~ SRCH      {print > "'"$PAD/removed_woord.csv"'"
                         next
                        }

                        {print > "'"$PAD/filtered_winnaar_2.csv"'"
                        }
' file3 file4 

real    0m2,328s
user    0m2,318s
sys    0m0,005s

to this

time awk '
NR==FNR         {id[$0]
                 next
                }
                {for (SP in id) if (tolower($0) ~ SP)    {print > "'"$PAD/removed_woord.csv"'"
                                                 next
                                                }
                }
                {print > "'"$PAD/filtered_winnaar_2.csv"'"
                }
' file3 file4
real    0m17,038s
user    0m16,995s
sys    0m0,025s

That seems to make for a factor of roughly 7 (a single regex match per line instead of a loop over all the patterns). The output seems to be identical. Please try and report back.

I know it is a really short script, and if I am not mistaken you even wrote it ;).

I just timed both of them and the results are as follows. It looks like there is a lot of improvement using the grep line, except that I don't know if it removes the filtered lines from the file like the awk solution does.

awk -F";"  prijslijst_filter.csv lowercase_winnaar.csv  260,73s user 0,50s system 99% cpu 4:21,84 total
grep --color=auto -f prijslijst_filter.csv lowercase_winnaar.csv  45,13s user 0,52s system 99% cpu 45,679 total

(sorry for the code part)

I just tested your last bit and it makes a huge difference.

awk  prijslijst_filter.csv lowercase_winnaar.csv  9,51s user 0,13s system 99% cpu 9,647 total

I will have to check the files, but this would help enormously with all the scripts.

Just one additional note:

grep -F (grep for fixed strings, i.e. no regex patterns) is a lot faster than regular grep, so you may try:

grep -F -f prijslijst_filter.csv lowercase_winnaar.csv

You'd need to run two greps: one for positive matches, one (with the -v option) for non-matches.

I just tested this and it is even faster

grep --color=auto -F -f prijslijst_filter.csv lowercase_winnaar.csv  0,19s user 0,11s system 50% cpu 0,594 total

I believe you mean like this:

grep -v -F -f prijslijst_filter.csv lowercase_winnaar.csv > unfiltered_stuff.csv

and

grep -F -f prijslijst_filter.csv lowercase_winnaar.csv > filtered_stuff.csv

One last question about this, though. How well does this behave with capitals and such? The first awk script did not like capitals, so I had to lowercase everything. It would be best if it just ignored casing completely.

grep -F -i ignores case.
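
So, combining the suggestions above, the case-insensitive version of the two grep calls might look something like this (untested sketch, reusing the file names from earlier in the thread):

grep -i -F -f "$PAD/prijslijst_filter.csv" "$PAD/lowercase_winnaar.csv" > "$PAD/removed_woord.csv"
grep -i -v -F -f "$PAD/prijslijst_filter.csv" "$PAD/lowercase_winnaar.csv" > "$PAD/filtered_winnaar_2.csv"

The first call collects the matching (filtered-out) lines; the second one, with -v, collects the remaining lines.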

Would you mind also timing the proposal in post #3?

I actually did, but I edited it into the post afterwards. :slight_smile:

awk  prijslijst_filter.csv lowercase_winnaar.csv  9,51s user 0,13s system 99% cpu 9,647 total

Since the difference between the grep and this newer awk is only a few seconds, I am not sure which one I am going to use. The awk one is preferred as it is a drop-in replacement for the current one, but the grep one is still quite a lot faster.

grep also has the advantage that it handles the ignore-case part better. I never seem to get this working properly in the awk version, even with the forced lowercase on both files.

I just tried your awk solution again, RudiC, and it seems something is wrong with it. I did not check the first time because I had to leave right after I tested it (the files got overwritten afterwards).

It seems the part you gave does not produce any files for the rest of the script to continue with.

awk '
NR==FNR                 {SRCH=SRCH DL $0
                         DL = "|"
                         next
                        }
tolower($0) ~ SRCH      {print > "'"$PAD/removed_woord_blaat33.csv"'"
                         next
                        }

                        {print > "'"$PAD/filtered_winnaar_blaat33.csv"'"
                        }
' prijslijst_filter.csv lowercase_winnaar.csv 

I tried with and without time to see if that caused the issue, but it did not change the outcome. Neither of the new files is created.

When processing extremely large files you might consider using split first.
Then, in multicore environments, spawn several awks or greps from the shell script to process the chunks in parallel.
There are also GNU tools which offer parallelism without the shell logic.

It is a bit tougher to program, but processing time will be reduced significantly if you have the cores and the disks are fast enough to keep up.

Memory also comes into play: since split reads the files, the operating system will cache them in memory if enough is available, making those awk or grep processes much faster on read operations.

Of course, the limit is the free memory on the system and the file system caching configuration in general.
In default configurations, file system caching can use a large portion of the free memory on most Linux / UNIX systems I've seen.
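
A rough, untested sketch of the idea (the chunk size and temporary file names are just placeholders; GNU parallel is presumably the kind of GNU tool meant above and could replace the loop-and-wait part):

# split the big data file into chunks of e.g. 50000 lines each
split -l 50000 "$PAD/lowercase_winnaar.csv" "$PAD/chunk_"

# start one grep per chunk in the background, each writing its own partial output
for chunk in "$PAD"/chunk_*; do
    grep -i -F -f "$PAD/prijslijst_filter.csv" "$chunk" > "${chunk}.removed" &
done

# wait for all background jobs, then merge the partial results and clean up
wait
cat "$PAD"/chunk_*.removed > "$PAD/removed_woord.csv"
rm "$PAD"/chunk_*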

Hope that helps
Regards
Peasant.

This sounds very interesting, but there are 2 issues.

  1. I have to split the files into smaller files (around 5k, I guess), which isn't a big deal but is a little bit annoying.
  2. Since this is running in a script, I have no idea how to best call multiple instances of awk at the same time. Everything I know says that a script handles each part after the other, not at the same time. If the background jobs you sketched are the way to accomplish that, please let me know, since it does sound interesting/promising.

CPU and memory aren't the issue, as they are sufficient. The only thing that can stall the script is the other scripts that are also running. I tried spreading them out as much as possible, but some just take quite long to run, and that's why I want to slim them down so they don't run at the same time.

I just tried this one again and got it working. I noticed the -F";" was missing, so I added it and it worked flawlessly. The complete script now runs in about 20 seconds, where it took more than 7 minutes before.

Congrats, that would be a performance gain of roughly a factor of 21!

I'd be surprised if the script needed the -F";", as it doesn't operate on single fields but just on the entire line, $0.

I have no idea how this is possible. When I tried it without, it did not work.

Anyhow, it's a major improvement over the old version.

I am not sure what you mean by a large file, but I have been running this for several weeks now with files of around 150k or so. The complete script contains this construct twice (once to remove products that are in the filter and once to re-add products when they get filtered, since I use parts of names to filter) plus several MySQL queries, and it takes less than a minute to complete. That leaves at most about 30 seconds for these 2 combined, as one of the SQL queries alone takes 30 seconds to run.

All in all, I would not say that it is slow; it is super fast for my needs.