Making use of multiple cores for running sed and awk scripts

Hi All,

After reading that the sort command on Linux can be made to use several processor cores just by using a simple script I found on the internet, I was wondering: can I use similar techniques for programs like awk and sed?

#!/bin/bash
# Usage: psort filename <chunksize> <threads>
# In this example the file largefile.txt is split into chunks of 20 MB.
# The parts are sorted in 4 simultaneous threads before being merged.
#
# psort largefile.txt 20m 4
#
# by h.p.
split -b "$2" "$1" "$1.part"
suffix=sorttemp.$(date +%s)
nthreads=$3
i=0
for fname in "$1".part*
do
    let i++
    sort "$fname" > "$fname.$suffix" &
    mres=$(($i % $nthreads))
    test "$mres" -eq 0 && wait
done
wait
sort -m *."$suffix"
rm "$1".part*

Previously I used sort without this script, and it took several minutes to sort a very large file; by default the sort command uses only one processor core.
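Side note: newer GNU coreutils builds of sort can already use several cores on their own; a minimal untested sketch, assuming coreutils 8.6 or later, with a placeholder file name and buffer size:

# Let GNU sort spread the work over 16 threads with an 8 GB in-memory buffer.
sort --parallel=16 -S 8G -o largefile.sorted.txt largefile.txt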

My school has just purchased a 16-core Linux server with 96 GB of RAM, so I am currently fiddling with it. :smiley:

Now, a thought comes to my mind: can sed and awk be used in the same way, so that they make use of all 16 cores of the processor?

I ask this because once I tried to fiddle with a huge Wikipedia file dump which I downloaded from the internet. The XML file is 30 GB in size and contains some 3.5 million articles.

I then ran this script in order to parse the individual articles and store them in separate files:

awk '/<page>/{c++}{print > (c ".dat")}' wikipedia_dump.xml

To my horror, it took about 10-12 days to complete the task, and that was on the new Linux server mentioned above. So I am wondering: is it possible to use awk in such a way that it uses all the cores of the processor and runs in a multi-threaded fashion?
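A side note on why the single-threaded run is so slow: with print > file, awk keeps every output file open, so with millions of articles much of the time likely goes into juggling open files. A minimal, untested variant that closes each article's file as soon as the next <page> begins (lines before the first <page> are skipped here):

awk '/<page>/ { if (out) close(out); out = ++c ".dat" }
     out      { print > out }' wikipedia_dump.xml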

You can try GNU parallel. This command should help:

cat wikipedia_dump.xml | parallel --pipe --recstart '<page>' awk '...'

===

No... I'm afraid it will use a fresh counter for each chunk, and the new files would overwrite the older ones. You should first split your file into 16 pieces somehow and then do something like this:

cat filelist | parallel -q awk '/<page>/{c++} {print > (FILENAME "-" c ".dat")}'
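One possible way to do the splitting step with GNU parallel itself (an untested sketch; the block size and the chunk_ file names are only placeholders):

# Cut the dump on <page> boundaries into numbered chunk files,
# then run the per-file awk on every chunk at once.
parallel --pipe --recstart '<page>' --block 500M 'cat > chunk_{#}.xml' < wikipedia_dump.xml
parallel -q awk '/<page>/{c++} {print > (FILENAME "-" c ".dat")}' ::: chunk_*.xml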


And there are almost one million articles; it would be hard to work with them as separate files. It would be better to use a streaming XML parser and put the information into a database. But that is another story, of course.


Hi Shoaib,

Have you developed any tool to parse the XML Wikipedia dump?

Regards

Satheesh

Hi,

Not a tool as such (I think sed and awk are the best tools for parsing the Wikipedia XML dump); I just used a simple regular-expression technique to parse and extract the articles from the one huge file available for download. But the problem was that it took days to parse the entire dump, so I thought: why not parallelize the whole thing so that it can be done faster?
Even after parsing, a lot of preprocessing still needs to be done, which I feel is easy enough using certain heuristics and then running sed or awk based on them.

But if you are looking for tools to parse the XML Wikipedia dump, you may look here:
Experiments on the English Wikipedia - gensim

Wikipedia Preprocessor (WikiPrep)

Hope this helps. :slight_smile:

Thank you Shoaib. Please let me know if you succeed in running your code with multiple threads.

Thanks
Satheesh

Useless use of cat. Yes, there are times when it's useful, and this ain't one of them. If you want to keep the same left-to-right order, you can do < filename command to get the same effect.
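For instance, the earlier pipeline could just as well be written as (awk program elided as before):

< wikipedia_dump.xml parallel --pipe --recstart '<page>' awk '...'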


That comes from the GNU parallel man page.

You can always send them a patch or report it to them as a bug.

So what? It's still a poor programming practice for a variety of reasons you're already aware of, and shouldn't be taught to others as an example. These threads are supposed to be references.