After reading that the sort command in Linux can be made to use many processor cores with a simple script I found on the internet, I was wondering whether similar techniques can be used for programs like awk and sed.
#!/bin/bash
# Usage: psort filename <chunksize> <threads>
# In this example the file largefile.txt is split into chunks of 20 MB.
# The parts are sorted in 4 simultaneous threads before being merged.
#
# psort largefile.txt 20m 4
#
# by h.p.
split -b "$2" "$1" "$1.part"
suffix=sorttemp.$(date +%s)
nthreads=$3
i=0
for fname in "$1".part*   # glob directly instead of parsing ls output
do
let i++
sort "$fname" > "$fname.$suffix" &
mres=$((i % nthreads))
test "$mres" -eq 0 && wait
done
wait
sort -m *."$suffix"
rm "$1".part*
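For what it's worth, recent GNU coreutils may make the script above unnecessary: sort itself has grown a --parallel flag (and -S to cap its memory buffer). A minimal sketch, assuming a GNU sort new enough to have the flag; the file names and the demo input are just placeholders:

```shell
# Generate a shuffled demo file standing in for the real large file.
seq 100000 | shuf > largefile.txt

# Let sort itself fan out over several cores; -S caps the in-memory buffer.
sort --parallel=4 -S 512M largefile.txt > largefile.sorted.txt
```

On a 16-core box you would pass --parallel=16; sort still falls back to temporary files on disk if the input exceeds the -S budget.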
Previously I used sort without the above script, and it took several minutes to sort a very large file. By default, the sort command uses only one core of the processor.
My school has just purchased a 16 core server with Linux and 96 GB RAM, so I am currently fiddling with it.
Now, a thought comes to my mind: Can sed and awk be used in the same way, so that they make use of all 16 cores of the processor?
I ask this because once I tried to fiddle with a huge Wikipedia file dump which I downloaded from the internet. The XML file is 30 GB in size and contains some 3.5 million articles.
I then ran this script in order to parse the individual articles and store them in separate files:
awk '/<page>/{c++}{print > c ".dat"}' wikipedia_dump.xml
To my horror, it took about 10-12 days to complete the task. I am wondering whether it is possible to use awk in such a way that it uses all the cores of the processor and runs in a multi-threaded fashion. I ran the above awk script on the same new server running Linux.
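Before reaching for parallelism, it may be worth checking how many output files awk is juggling: the one-liner above never close()s the per-article files, so gawk ends up managing millions of simultaneously open outputs, which alone can account for a lot of the slowdown. A hedged single-core sketch (tiny stand-in input; file names are hypothetical) that closes each finished article before opening the next:

```shell
# Tiny stand-in for the real dump, just to show the shape of the data.
printf '<page>\nA1\n</page>\n<page>\nA2\n</page>\n' > wikipedia_dump.xml

# Same one-file-per-article split, but close() the previous article's file
# so awk holds at most one output descriptor open at a time.
awk '/<page>/ { if (c) close(c ".dat"); c++ } { print > (c ".dat") }' wikipedia_dump.xml
```

This keeps the original numbering (1.dat, 2.dat, ...) while avoiding the descriptor churn.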
No... I'm afraid it will use a fresh counter on each chunk, and the new files would overwrite the older ones... You should somehow split your file into 16 pieces and then run something like this on each of them.
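A minimal sketch of that suggestion (chunk size and file names are hypothetical): give every chunk its own output prefix so the per-chunk counters cannot collide. Note that a plain split -l can cut an article in half unless the chunk boundary happens to fall between articles, so a real run would need a smarter splitter:

```shell
# Tiny stand-in dump; for the real 30 GB file you would use a much larger -l.
printf '<page>\nA1\n</page>\n<page>\nA2\n</page>\n' > wikipedia_dump.xml

split -l 3 wikipedia_dump.xml chunk.
for f in chunk.*
do
  # Prefix each output with the chunk name: chunk.aa.1.dat, chunk.ab.1.dat, ...
  awk -v pfx="$f" '/<page>/{c++}{print > (pfx "." c ".dat")}' "$f" &
done
wait
```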
---------- Post updated at 10:56 PM ---------- Previous update was at 10:04 PM ----------
And there are almost one million articles; it would be hard to work with them as separate files. It would be better to use a streaming XML parser and put the information into a database. But that is another story, of course.
Not just any tool, though (I think sed and awk are the best tools for parsing the Wikipedia XML dump); I just used a simple regular-expression technique to parse and extract the Wikipedia articles from the one huge file available for download. But the problem was that it took days to parse the entire dump, so I thought: why not parallelize the whole thing so it could be done faster?
Even after parsing, a lot of preprocessing still needs to be done, which I feel is easy using certain heuristics and then running sed or awk based on them.
Useless use of cat. Yes, there are times when it's useful, and this ain't it. If you're pining for the same order, you can do < filename command to get the same effect.
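For the record, the two forms side by side (demo file name is hypothetical); the redirection gives the same input, in the same order, without the extra cat process:

```shell
printf 'x\ny\n' > demo.txt   # sample input

cat demo.txt | wc -l         # useless use of cat: one extra process
< demo.txt wc -l             # same result, shell feeds the file directly
```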
So what? It's still a poor programming practice for a variety of reasons you're already aware of, and shouldn't be taught to others as an example. These threads are supposed to be references.