Is it like: if there are 100 files, split them into two sets of 50 and call the function concurrently on each set?
Like this?
function first {
    # process every .txt file in the directory passed as $1
    cd "$1" || return
    for i in *.txt
    do
        echo "$i"
        # print the tag-delimited block, stripping newlines so it lands on one line
        awk '/<.*/ , /.*<\/.*>/' "$i" | tr -d '\n'
        echo   # terminate the joined line
    done
}
first /home/Folder1 &
first /home/Folder2 &
wait
Please let me know if I'm wrong. Also, is multithreading an option here?
% ./s1
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution : Debian GNU/Linux 5.0.8 (lenny)
bash GNU bash 3.2.39
parallel GNU parallel 20111122
-----
Structure of directories:
d1
|-- a.txt
|-- b.txt
|-- binary-1.exe
`-- c.txt
d2
|-- frog-town.jpg
|-- x.txt
`-- y.txt
0 directories, 7 files
-----
Results of parallel processes:
job 1, process 27495, wc = 4 16 70 d1/a.txt
job 2, process 27515, wc = 16 16 123 d1/b.txt
job 3, process 27535, wc = 26 265 1464 d1/c.txt
job 4, process 27555, wc = 4 16 70 d2/x.txt
job 5, process 27575, wc = 16 16 123 d2/y.txt
Each one of the tasks was run as a separate process. The calling sequence for parallel is complex, so some experimentation might be useful. I have not tried it, but I think parallel claims to be able to utilize different computers for tasks.
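For reference, a minimal sketch of the sort of call that could produce runs like the ones above (the exact options are an assumption; check your version's man page):
# one wc job per .txt file, run concurrently by GNU parallel
find d1 d2 -name '*.txt' | parallel wc {}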
(sem is part of parallel.) You may want to limit the number of jobs running at the same time, to avoid overloading the system when other users want resources:
export -f first   # export the function so sem's subshell can see it
for dir in $(find /home -type d -name "Folder*"); do
    sem -j 10 first "$dir"
done
sem --wait
For CPU-intensive operations, sem (part of parallel) can control the number of CPU cores being used, or limit access to a resource like a semaphore -- hence the name. This example limits sem to the number of available cores on the system:
for dir in $(find /home -type d -name "Folder*")
do
    sem -j+0 first "$dir"
done
sem --wait
sem --wait on the last line waits for all the other sem invocations to complete.
parallel is written in Perl, so it runs on systems with Perl 5.8 or higher.
Your program already does use multiple processes -- the tr runs simultaneously with the awk -- but it's a bit of a waste really, since doing it in that fashion isn't faster than doing it inside awk.
awk is hardly a one-trick pony; you can run it once here to replace everything you've been doing by running awk, tr, and echo 10,000 times apiece. Since there is a large cost to launching small programs over and over, this will speed up performance a lot.
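Here is a minimal sketch of that single invocation, assuming the same range pattern as your script (adapt as needed):
# One awk pass over all the files: print each filename, then its matched
# lines without their newlines, with a newline at each file boundary --
# replacing the per-file awk | tr | echo pipeline entirely.
awk 'FNR==1 { if (NR>1) printf "\n"; print FILENAME }
     /<.*/ , /.*<\/.*>/ { printf "%s", $0 }
     END { printf "\n" }' *.txt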
I think with thousands of files, globbing *.txt runs into ARG_MAX limits on lots of UNIX platforms. Correct me if I missed something. I thought that was why the OP used find to start with.
The points of this exercise were to:
1) clarify the difference between the terms thread and process,
2) demonstrate that running simultaneous processes can be easy (but may require some tinkering with options, checking times for completion of a set of tasks, etc.).
I didn't know specifically about sem -- thanks, Jim. Apparently version 20111122 didn't have it (at least not on my install), but I see that ... GNU sem is an alias for GNU parallel --semaphore ... Other options can set the number of jobs relative to the number of CPUs or cores -- potentially very useful.
The man page for parallel contains lots of examples as well as comparisons between parallel and other utilities of the same kind, e.g. xargs, paexec, etc.
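As a point of comparison, GNU xargs can do something similar with its -P option (a sketch; the job count of 4 is an arbitrary assumption):
# run up to 4 wc processes at once, one file per invocation
find d1 d2 -name '*.txt' -print0 | xargs -0 -n 1 -P 4 wc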
Best wishes ... cheers, drl
UPDATE:
I finally found sem in the parallel install directory, along with niceload, etc., including man pages.
( Edit 2: add sem discovery )
( Edit 1: correct minor typo )
Hi Corona,
Thanks a lot!!
Actually, the filename was not getting printed.
I had a similar problem, for which I had found the solution on this forum itself.
The code looks like this now.
Oh, I didn't realize you wanted the filename either, sorry. :o Your original didn't have that. This is why I prefer people tell me what they actually want, rather than "how do I make this piece of code run faster" -- I'm liable to make bad guesses about their requirements.
The script I'm running is executed simultaneously for multiple folders.
So is there a chance of the data overlapping because of these multiple processes?
I'm seeing overlapping data here.
Am I doing something wrong?
Thanks,
Chetan.C
---------- Post updated at 06:04 AM ---------- Previous update was at 04:56 AM ----------
This is the code:
#!/bin/bash
function fast {
    # print each filename, then its tag-delimited content joined onto one line
    cd "$1" || return
    awk -v ORS="" 'FNR==1 { printf("\n"FILENAME"\n") }; /<.*/ , /.*<\/.*>/' *.txt
}
for dir in $(find /opt/app/idss/data01/cc002h/computes/test_scripts/Test_files/ -type d -name "TLT*"); do
    fast "$dir" &
done
wait
It's overlapping because they're running literally at the same time -- i.e., exactly what you asked for. Save their output to separate files and combine them later. That will possibly negate any benefit of parallelizing them, though, since you'll be doing two to three times as much disk access for the same amount of work!
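A sketch of that separate-files approach (the /tmp output names are my assumption):
for dir in $(find /opt/app/idss/data01/cc002h/computes/test_scripts/Test_files/ -type d -name "TLT*"); do
    # each job writes to its own file, so their streams can't interleave
    fast "$dir" > "/tmp/fast.$(basename "$dir").out" &
done
wait
# stitch the per-folder results together afterwards
cat /tmp/fast.*.out > combined.out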
I continue to not believe there are incredibly large benefits to parallelizing this. Having 9 programs instead of 1 won't let those 9 programs read from your disk 9 times faster. Measure what throughput you already have, first.
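For instance, a rough baseline might look like this (a sketch; bear in mind the OS cache will make a second run over the same files look much faster than the disk really is):
# time a plain, sequential read of one folder's files to see what
# throughput a single process already achieves
time cat /home/Folder1/*.txt > /dev/null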