Get unique first column values

Hi All,

I have a directory and sub-directories containing 'n' number of .log files, nearly 1 GB in total.
The files are comma-separated. I need to recursively extract only the unique values of the first column.
I did it in Perl, but I would like to know other command-line utilities and to compare how long the grep/uniq approaches take.

Sample contents of the *.log files:

value1,100,99,98
value1,99,97,98
value2,50,51,52
value3,10,11,12
value2,60,61,62
value3,70,71

Expected output

value1
value2
value3
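For reference, a Perl approach would be something like this (a rough sketch, not my exact script):

find . -name '*.log' -type f -exec cat {} + | perl -F, -lane 'print $F[0] unless $seen{$F[0]}++'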

Hi, try:

find . -name '*.log' -type f -exec cat {} + | awk -F, '!A[$1]++{print $1}'
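(The !A[$1]++ test is true only the first time a given first-column value is seen, since the array entry is 0 before the post-increment, so each value is printed exactly once.)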

Could this be more efficient?

find . -name '*.log' -type f -exec awk -F, '!A[$1]++{print $1}' {} +

.... or if multiple input files to awk would confuse things, try:-

find . -name '*.log' -type f -exec awk -F, '!A[$1]++{print $1}' {} \;

An alternate (which may be horribly slow, I don't know) could be:-

cut -f1 -d, *.log | sort -u

.... although this will fail for an excessive number of input files because the command line grows too long. I suppose you could also wrap it in a find like this:-

find . -type f -name "*.log" -exec cut -f1 -d, {} + | sort -u

It will be one of those things where you have to try the variations to see which one works best for your data.
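A rough way to compare them is to wrap each pipeline in the shell's time keyword (bash/ksh), discarding the output so printing to the terminal doesn't dominate the measurement, for example:

time find . -name '*.log' -type f -exec cat {} + | awk -F, '!A[$1]++{print $1}' > /dev/null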

Robin

Hi Robin, it depends on how the OP's question should be interpreted. I interpreted it as asking for the unique values across all of the files in the directory and its subdirectories. In that case my solution would be the most efficient and it would provide the right answer.

If the idea is to list the unique values per file, then your second option should be used, although I think for that to be useful the filename should be printed as well.

Your first option cannot be used in either case: it might happen to provide the right answer if the total number of files is such that awk is only called once for all of them, but if awk is called multiple times the answer will be incorrect, because duplicates are only removed within each awk invocation.
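If you want to keep the more efficient {} + form and still get a correct global answer, piping the combined output through sort -u removes the duplicates that can remain across awk invocations, for example:

find . -name '*.log' -type f -exec awk -F, '!A[$1]++{print $1}' {} + | sort -u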


I wonder

grep -rhoE '^\w+' *.log | sort -u
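One caveat: with -r and the shell glob *.log, only names matching *.log in the current directory are searched (or recursed into if they happen to be directories), so .log files further down the tree would be missed. With GNU grep, something like this might be closer to the intent, and ^[^,]+ also copes with first columns containing non-word characters:

grep -rhoE '^[^,]+' --include='*.log' . | sort -u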