piping problem with xargs

I'm trying to pipe the output of one command into another using xargs, but I'm not getting what I want. Running this command:

find . -name '33_cr*.rod' | xargs -n1 -t -i cut -f5 {} | sort -k1.3n | uniq | wc -l
gives the following output:
cut -f5 ./33_cr22.rod
cut -f5 ./33_cr22.rod
...
9224236

What I want to see is the unique count for each file, rather than the total count across all files.

Any help is appreciated. Thank you.

The xargs format is wrong.

find . -name '33_cr*.rod' | xargs -n1 -t -i cut -f5
There should be no "{}" there; that placeholder is used with find's -exec.
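To illustrate the difference between the two forms (a sketch on hypothetical sample files; the filenames and data are made up for this demo):

```shell
# Set up two hypothetical tab-separated sample files
mkdir -p /tmp/xargs_demo && cd /tmp/xargs_demo
printf 'a\tb\tc\td\tx1\n' > 33_cr21.rod
printf 'a\tb\tc\td\tx2\n' > 33_cr22.rod

# With find's -exec, {} is the placeholder for each filename:
find . -name '33_cr*.rod' -exec cut -f5 {} \;

# With plain xargs, filenames read from stdin are appended to the end
# of the command, so no placeholder is needed:
find . -name '33_cr*.rod' | xargs -n1 cut -f5
```

Both invocations print the fifth column of each file; the difference is only in how the filename reaches cut.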

Refer to this article: "Linux,C,C++,Shell,perl: xargs vs exec in find command"

it may give you a better idea.


Try the following

$ find . -name '33_cr*.rod' | xargs -l  cut -f5  | sort -k1.3n | uniq | wc -l

The OP seems to want per-file counts. Beginning with the xargs stdout, that pipeline obliterates file boundaries and generates an aggregate result.

Regards,
Alister
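
One way to keep the per-file boundaries is to run the whole counting pipeline once per file, for example with find -exec and an inline shell (a sketch; the sample files below are hypothetical stand-ins for the OP's data):

```shell
# Hypothetical sample data: two tab-separated files with a 5th column
mkdir -p /tmp/perfile_demo && cd /tmp/perfile_demo
printf 'a\tb\tc\td\tv1\na\tb\tc\td\tv1\n' > 33_cr21.rod
printf 'a\tb\tc\td\tv2\n' > 33_cr22.rod

# Run the count pipeline once per file, so each file keeps its own result
find . -name '33_cr*.rod' -exec sh -c '
  for f do
    printf "%s: %s\n" "$f" "$(cut -f5 "$f" | sort | uniq | wc -l)"
  done' sh {} +
```

Unlike the original pipeline, the sort | uniq | wc -l stage here sees one file at a time, so each line of output is that file's own unique count.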

Consider using GNU Parallel instead:

 find . -name '33_cr*.rod' | parallel "echo -n {}': '; cut -f5 {} | sort -k1.3n | uniq | wc -l"

Watch the intro video to learn more: youtube.com/watch?v=OpaiGYxkSuQ

I solved the problem with this:

for f in *.rod; do echo "$f"; cut -f5 "$f" | sort -k1.3n | uniq | wc -l; done


Hi Tange,

I find GNU Parallel interesting. I'm not exactly a computing person; can you clarify the following for me:

  1. What is the difference between GNU Parallel and a multi-threaded process? For example, if my program supports multi-threading and I run 4 threads on my quad-core computer, am I right to assume that GNU Parallel is not going to make any difference?
  2. If I'm already running some other process and still specify -j+0, what is going to happen?

Thank you.

GNU Parallel often makes it easy to generate command lines:

parallel "echo {}; cut -f5 {} | sort -k1.3n | uniq | wc -l" ::: *.rod

It will also run the jobs in parallel. If your jobs are multithreaded and you see 100% utilization of your cores all the time, then you will not see a speed-up from GNU Parallel.

If, however, your jobs do I/O and therefore sometimes wait for data, then it can be faster to run more processes, so that some of them can use the CPU while others are waiting.
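As a rough illustration of running several processes at once (this uses xargs -P rather than GNU Parallel, purely because it needs no extra install; the sleep is a stand-in for an I/O wait):

```shell
# Run 4 placeholder "jobs", up to 4 processes at a time; each job
# sleeps briefly to mimic waiting on I/O, then reports its number.
# Completion order is not guaranteed when jobs run concurrently.
seq 1 4 | xargs -n1 -P4 -I{} sh -c 'sleep 0.1; echo "job {} done"'
```

With -P4 the four sleeps overlap, so the batch finishes in roughly the time of one job instead of four.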

GNU Parallel can be told to watch the load average and not start more jobs while it is above a certain limit. The default (-j+0) is to start one job per CPU core, regardless of load average.

If you want multiple instances of GNU Parallel to communicate, and not have more than N jobs in total - no matter how many times you start parallel, you should look at 'sem': http://www.gnu.org/software/parallel/sem.html


Hi Tange,

Thanks for the clarification. I've been playing around with it for the last few days, and I'm just glad to see the improved speed.