piping problem with xargs

I'm trying to pipe the output of one command into another using xargs, but I'm not getting what I want. Running this command:

find . -name '33_cr*.rod' | xargs -n1 -t -i cut -f5 {} | sort -k1.3n | uniq | wc -l
gives the following output:
cut -f5 ./33_cr22.rod
cut -f5 ./33_cr22.rod
...
9224236

What I want to see is the unique count for each file, rather than the total count across all files.

Any help is appreciated. Thank you.

The xargs format is wrong.

find . -name '33_cr*.rod' | xargs -n1 -t -i cut -f5
There should be no "{}" there; that placeholder is used with find's -exec.
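To illustrate the difference between the two forms (a sketch on hypothetical sample files; the filenames and data are made up for this demo):

```shell
# Set up two hypothetical tab-separated sample files
mkdir -p /tmp/xargs_demo && cd /tmp/xargs_demo
printf 'a\tb\tc\td\tx1\n' > 33_cr21.rod
printf 'a\tb\tc\td\tx2\n' > 33_cr22.rod

# With find's -exec, {} is the placeholder for each filename:
find . -name '33_cr*.rod' -exec cut -f5 {} \;

# With plain xargs, filenames read from stdin are appended to the end
# of the command, so no placeholder is needed:
find . -name '33_cr*.rod' | xargs -n1 cut -f5
```

Both invocations print the fifth column of each file; the difference is only in how the filename reaches cut.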

Refer to this article: "Linux,C,C++,Shell,perl: xargs vs exec in find command"

it may give you a better idea.


Try the following

$ find . -name '33_cr*.rod' | xargs -l  cut -f5  | sort -k1.3n | uniq | wc -l

The OP seems to want per-file counts. Beginning with the xargs stdout, that pipeline obliterates file boundaries and generates an aggregate result.

Regards,
Alister
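
One way to keep the per-file boundaries is to run the whole counting pipeline once per file, for example with find -exec and an inline shell (a sketch; the sample files below are hypothetical stand-ins for the OP's data):

```shell
# Hypothetical sample data: two tab-separated files with a 5th column
mkdir -p /tmp/perfile_demo && cd /tmp/perfile_demo
printf 'a\tb\tc\td\tv1\na\tb\tc\td\tv1\n' > 33_cr21.rod
printf 'a\tb\tc\td\tv2\n' > 33_cr22.rod

# Run the count pipeline once per file, so each file keeps its own result
find . -name '33_cr*.rod' -exec sh -c '
  for f do
    printf "%s: %s\n" "$f" "$(cut -f5 "$f" | sort | uniq | wc -l)"
  done' sh {} +
```

Unlike the original pipeline, the sort | uniq | wc -l stage here sees one file at a time, so each line of output is that file's own unique count.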

Consider using GNU Parallel instead:

 find . -name '33_cr*.rod' | parallel "echo -n {}': '; cut -f5 {} | sort -k1.3n | uniq | wc -l"

Watch the intro video to learn more: youtube.com/watch?v=OpaiGYxkSuQ

I solved the problem with this:

for f in *.rod; do echo "$f"; cut -f5 "$f" | sort -k1.3n | uniq | wc -l; done


Hi Tange,

I find GNU Parallel interesting. I'm not exactly a computing person; can you clarify the following for me:

  1. What is the difference between GNU Parallel and a multi-threaded process? For example, if my program supports multi-threading and I run 4 threads on my quad-core computer, am I right to assume that GNU Parallel is not going to make any difference?
  2. If I'm already running some other process and still specify -j+0, what is going to happen?

Thank you.

GNU Parallel often makes it easy to generate command lines:

parallel "echo {}; cut -f5 {} | sort -k1.3n | uniq | wc -l" ::: *.rod

It will also run the jobs in parallel. If your jobs are multithreaded and you see 100% utilization of your cores all the time, then you will not see a speed-up from GNU Parallel.

If, however, your jobs do I/O and therefore sometimes wait for data, then it can be faster to run more processes, so that some of them can use the CPU while others are waiting.
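As a rough illustration of running several processes at once (this uses xargs -P rather than GNU Parallel, purely because it needs no extra install; the sleep is a stand-in for an I/O wait):

```shell
# Run 4 placeholder "jobs", up to 4 processes at a time; each job
# sleeps briefly to mimic waiting on I/O, then reports its number.
# Completion order is not guaranteed when jobs run concurrently.
seq 1 4 | xargs -n1 -P4 -I{} sh -c 'sleep 0.1; echo "job {} done"'
```

With -P4 the four sleeps overlap, so the batch finishes in roughly the time of one job instead of four.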

GNU Parallel can be told to watch the load average and not start more jobs while it is above a certain limit. The default (-j+0) is to start one job per CPU core, regardless of load average.

If you want multiple instances of GNU Parallel to communicate, and not have more than N jobs in total - no matter how many times you start parallel, you should look at 'sem': http://www.gnu.org/software/parallel/sem.html


Hi Tange,

Thanks for the clarification. I've been playing around with it for the last few days, and I'm just glad to see the improved speed.