Multi-threaded awk command for faster performance

Hi,

I have a script below for extracting xml from a file.

for i in *.txt
do
  echo "$i"
  awk '/<.*/ , /.*<\/.*>/' "$i" | tr -d '\n'
  echo -ne '\n'
done


I read about using multi-threading to speed up the script.
I do not know much about it, but I read about it on this forum.

Is it a possible option here? Otherwise, please guide me on making this script perform faster, as I have around 100 MB of XML data to extract from the files.

Thanks,
Chetan.C

You can split the files that you need to process into a number of batches and run your code concurrently on each of them.
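
For example, something along these lines (just a rough sketch; the batch size and the batch1.out / batch2.out output names are placeholders):

#!/bin/bash
# Rough sketch: split the *.txt files into two halves and run the same
# awk/tr pipeline on each half in a background shell, then wait for both.
files=( *.txt )
half=$(( ${#files[@]} / 2 ))

process() {
  for f in "$@"; do
    echo "$f"
    awk '/<.*/ , /.*<\/.*>/' "$f" | tr -d '\n'
    echo
  done
}

process "${files[@]:0:$half}" > batch1.out &    # first half
process "${files[@]:$half}"   > batch2.out &    # second half
wait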

Hi Bartus,

Is it like, if there are 100 files, take 50 files each and call the function concurrently on each batch?

Like this?

 
function first {
  #to_process=$1
  cd "$1"
  for i in *.txt
  do
    echo "$i"
    awk '/<.*/ , /.*<\/.*>/' "$i" | tr -d '\n'
    echo -ne '\n'
  done
}
first /home/Folder1 &
first /home/Folder2 &
wait

Please let me know if I'm wrong. Also, is multi-threading an option here?

Thanks,
Chetan

Yes, this is what I meant.

Thanks Bartus.

Can you tell me how I can loop this for a dynamic number of folders?

The number of folders may change, and the function has to be called only for the folders that are actually present.

Thanks,
Chetan.C

Try:

for dir in `find /home -type d -name "Folder*"`; do
  first $dir &
done

Hi.

Here is a sample use of GNU parallel that counts file contents with wc:

#!/usr/bin/env bash

# @(#) s1	Demonstrate multiple processes simultaneously, with "GNU parallel".

pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C parallel

pl " Structure of directories:"
tree d1 d2

pl " Results of parallel processes:"
ls d1/* d2/* |
grep txt |
parallel --ungroup 'echo -n job {#}, process $$", wc = "; wc {}' |
align

exit 0

producing:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
parallel GNU parallel 20111122

-----
 Structure of directories:
d1
|-- a.txt
|-- b.txt
|-- binary-1.exe
`-- c.txt
d2
|-- frog-town.jpg
|-- x.txt
`-- y.txt

0 directories, 7 files

-----
 Results of parallel processes:
job 1, process 27495, wc =  4  16   70 d1/a.txt
job 2, process 27515, wc = 16  16  123 d1/b.txt
job 3, process 27535, wc = 26 265 1464 d1/c.txt
job 4, process 27555, wc =  4  16   70 d2/x.txt
job 5, process 27575, wc = 16  16  123 d2/y.txt

Each one of the tasks was run as a separate process. The calling sequence for parallel is complex, so some experimentation might be useful. I have not tried it, but I think parallel claims to be able to utilize different computers for tasks.
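
If it can, the relevant option appears to be -S / --sshlogin; an untested sketch (host1 and host2 are placeholders, ":" stands for the local machine):

# Untested sketch: spread the wc jobs over two ssh-reachable hosts plus
# the local machine; --transfer copies each input file to the remote first.
parallel -S host1,host2,: --transfer 'wc {}' ::: d1/*.txt d2/*.txt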

The code for the (perl) parallel script is at GNU Parallel - GNU Project - Free Software Foundation

Best wishes ... cheers, drl


sem (part of parallel) can help here. You may want to limit the number of jobs running at the same time, to avoid eating the whole system when other users want resources:

for dir in `find /home -type d -name "Folder*"`; do
   sem -j 10 first $dir
done
sem --wait

For CPU-intensive operations, sem (part of parallel) can control the number of CPU cores being used, or limit access to a resource like a semaphore, hence the name. This example limits sem to the number of available cores on the system:

for dir in `find /home -type d -name "Folder*"`
do
  sem -j+0 first $dir
done
sem --wait

sem --wait on the last line waits for all the other sem invocations to complete.

parallel is written in Perl, so it runs on any system with Perl 5.8 or higher.

http://ftp.gnu.org/gnu/parallel/


Your program already does use multiple processes -- the tr happens simultaneously -- but it's a bit of a waste really, since doing it in that fashion isn't any faster than doing it inside awk.

awk is hardly a one-trick pony; you can run it once here to replace everything you've been doing by running awk, tr, and echo 10,000 times apiece. Since starting small programs over and over carries a large cost, this will speed up performance a lot.

Perhaps something like this:

awk -v OFS="" 'FNR==1 { printf("\n") }; /<.*/ , /.*<\/.*>/' *.txt

I think that with thousands of files, globbing *.txt runs into the ARG_MAX limit on lots of UNIX platforms. Correct me if I missed something. I thought that was why the OP used find to start with.
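
If that ever bites, one common workaround is to let find batch the arguments itself, e.g. (a sketch based on the pipeline above):

# Sketch: -exec ... {} + hands awk the file names in groups that fit
# within ARG_MAX, instead of one huge *.txt expansion on the command line.
find /home/Folder1 -name '*.txt' -exec awk '/<.*/ , /.*<\/.*>/' {} +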

Actually, the OP didn't use find. I did, to get the list of directories containing the files to be processed :smiley:

Hi.

The purpose of my post was to:

1) clarify the difference between the terms thread and process,

2) demonstrate that running simultaneous processes can be easy (but may require some tinkering with options, checking times for completion of a set of tasks, etc.).

I didn't know specifically about sem -- thanks Jim. Apparently version 20111122 didn't have it (at least not in my install), but I see that ... GNU sem is an alias for GNU parallel --semaphore ... Other options can set the number of jobs relative to the number of CPUs or cores -- potentially very useful.

The man page for parallel contains lots of examples as well as comparisons between parallel and other utilities of the same kind, e.g. xargs, paexec, etc.

Best wishes ... cheers, drl

UPDATE:

I finally found sem in the parallel install directory, along with niceload, etc., including man pages.

( Edit 2: add sem discovery )
( Edit 1: correct minor typo )

Thanks drl.

It helped me understand the concept.


Hi Corona,

Actually, I wanted the XML from each file to be on one single line, so I was using the script that way.

Can you please let me know how to get the XML onto one single line per file using the code above?

awk '/<.*/ , /.*<\/.*>/' "$i" | tr -d '\n'
echo -ne '\n' 

Thanks,
Chetan.C

I know what you're trying to do, and thought my script did that, but it had an error:

awk -v ORS="" 'FNR==1 { printf("\n") }; /<.*/ , /.*<\/.*>/' *.txt

Try it again, please.

It ought to do everything you were trying to do with 5 lines, 3 external programs, and a parallel wrapper, in one line with one program, and faster...

Hi Corona,
Thanks a lot!! :b:
Actually, the filename was not getting printed.
I had a similar problem earlier, for which I had found the solution on this forum itself.
The code looks like this now:

awk -v ORS="" 'FNR==1 { printf(FILENAME"\n") }; /<.*/ , /.*<\/.*>/' *.txt

Big thanks to all you guys here.

Thanks,
Chetan.C

Oh, I didn't realize you wanted the filename either, sorry. :o Your original didn't have that. This is why I prefer people tell me what they actually want, rather than "how do I make this piece of code run faster" -- I'm liable to make bad guesses about their requirements.

Glad you got it working!

Hi.

+1 ... cheers, drl

Thanks Corona.:slight_smile:

Yes, I will make sure I post it right next time.

Hi,

The script I'm running is executed simultaneously for multiple folders.
So is there a chance of the data overlapping because of these multiple processes?

I'm seeing overlapping data here.
Am I doing something wrong?

Thanks,
Chetan.C


This is the code:

 
#!/bin/bash

function fast {
  cd "$1"
  awk -v ORS="" 'FNR==1 { printf("\n"FILENAME"\n") }; /<.*/ , /.*<\/.*>/' *.txt
}

for dir in `find /opt/app/idss/data01/cc002h/computes/test_scripts/Test_files/ -type d -name "TLT*"`; do
  fast "$dir" &
done
wait

It's overlapping because they're running literally at the same time -- i.e., what you asked for. Save their output to separate files and combine them later. That will possibly negate any benefit of parallelizing them, though, since you'll be doing two to three times as much disk access for the same amount of work!
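
For instance, a sketch of the separate-files idea (the /tmp/out.* and combined.out names are just examples):

#!/bin/bash
# Sketch: give each background job its own output file, then concatenate
# the pieces only after every job has finished.
function fast {
  cd "$1" || return
  awk -v ORS="" 'FNR==1 { printf("\n"FILENAME"\n") }; /<.*/ , /.*<\/.*>/' *.txt
}
for dir in `find /opt/app/idss/data01/cc002h/computes/test_scripts/Test_files/ -type d -name "TLT*"`; do
  fast "$dir" > "/tmp/out.$(basename "$dir")" &
done
wait
cat /tmp/out.TLT* > combined.out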

I continue to doubt there are incredibly large benefits to parallelizing this. Having 9 programs instead of 1 won't let those 9 programs read from your disk 9 times faster. Measure what throughput you have already, first.
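
A quick, rough check is to time both approaches on the same data (parallel_version.sh here just stands for whatever parallel script you end up with):

# Sketch: compare wall-clock times before assuming the parallel run wins.
time awk -v ORS="" 'FNR==1 { printf("\n"FILENAME"\n") }; /<.*/ , /.*<\/.*>/' /home/Folder1/*.txt > /dev/null
time ./parallel_version.sh > /dev/null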