uniq -c in the pipeline

Hello gurus - I must be missing something, or there is a better way - pls enlighten me

I'm on a Solaris 10 VM running the following pipeline to reduce some Apache logs (actually lynx dumps of /server-status/ when threads are above a threshold) to a set of offending DDoS IP addresses.

awk '{print $NF}' * | egrep '\.' | egrep -i -v HTTP | egrep -v '/' | uniq -c | sort -nr | more

From the uniq manpage -

     -c              Precedes each output line with  a  count  of
                     the number of times the line occurred in the
                     input.

So why am I getting duplicate lines in the uniq output?

  19 127.0.0.1
  14 127.0.0.1
  11 127.0.0.1
   7 120.168.0.166
   6 127.0.0.1
   5 65.52.109.73
   3 65.52.109.73
   3 65.52.109.73
   3 65.52.109.73
   3 65.52.109.73
   3 65.52.109.73
   3 65.52.109.73
   3 65.52.109.73
   3 65.52.109.73
   3 65.52.109.73

thanks

I'm attaching the IP address list file to reduce the problem to a minimal case.
FWIW, the Solaris VM and my Mac OS box show the same output.

Why is uniq not unique? Is it due to invisible differences in the input?

uniq -c ips.txt | sort -nr | more

  19 127.0.0.1
  14 127.0.0.1
  11 127.0.0.1  <----all 127.0.0.1 lines should be counted into ONE uniq line!?
   7 120.168.0.166
   6 127.0.0.1
   5 207.46.13.206
   5 207.46.13.206
   5 207.46.13.206
   5 207.46.13.206
   5 207.46.13.206
   5 207.46.13.206
   5 207.46.13.206
   5 207.46.13.206
   4 157.55.17.199
   4 157.55.17.199
   4 157.55.17.199
   4 157.55.17.199
   4 157.55.17.199
   4 157.55.17.199
   4 157.55.17.199

It's because you should be sorting before uniq, not after.
From man uniq:

     Note:  'uniq'  does  not detect repeated lines unless they are adjacent.   You may want to sort the input first, or use `sort -u' without `uniq'.

Thanks - I'd still like an explanation of why uniq -c needs a prerequisite sort, but I am happy this new sort, uniq, sort pipeline delivers the results.

thanks!

PS: MSN bingbot does not seem to play by our robots.txt rules!

sort -nr ips.txt | uniq -c | sort -nr | more
 679 207.46.13.206
 658 207.46.13.48
 631 207.46.13.147
 516 157.55.17.199
 215 171.65.64.189
 153 172.25.65.220
 143 199.21.99.124
 139 127.0.0.1
 135 171.65.64.28
  80 107.22.107.114
  71 123.126.68.19
  47 219.137.183.138
  47 193.47.80.36
  33 207.46.92.19
  28 220.181.108.90
  28 172.25.104.106
  28 121.166.124.193
  26 218.65.102.178
  25 180.76.5.195
  23 180.76.5.154
  23 122.174.113.6
  22 210.0.229.224
  21 180.76.5.143
  19 180.76.5.94

uniq remembers only 1 line -- the last one seen, therefore a file

one
two
two
one

will be

one
two
one 

after running through uniq. It forgot that it saw 'one' already.
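The behavior described above can be checked directly in the shell (a toy demonstration of my own, not from the thread):

```shell
# uniq compares each line only with the one immediately before it:
# the adjacent 'two' lines collapse, but the final 'one' survives
printf 'one\ntwo\ntwo\none\n' | uniq
```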
Before anyone starts complaining about how dumb uniq is, let me assure you that this is a feature -- it enables you to run the filter on a huge file that wouldn't fit in memory. Since sorting can be done fast (~O(n log n)), it is usually run in a chain sort | uniq. If sorting messes you up and you absolutely have to preserve order, you could use awk to count the occurrences, e.g.

awk '{print ++cnt[$0],$0}'
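A slightly fuller variant (my own sketch, not mirni's one-liner): tally every line in a single pass and print the totals at the END, so no sort is needed at all -- at the cost of holding every distinct line in awk's hash table, and with unspecified output order:

```shell
# Count occurrences of each distinct line in one pass, no sort required.
# Trade-off: all distinct lines must fit in memory, and POSIX awk does
# not guarantee any particular order for the for-in loop over the array.
awk '{cnt[$0]++} END {for (line in cnt) print cnt[line], line}' ips.txt
```

Pipe the result through sort -nr afterwards if you want the counts ordered.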

The man page and usage are therefore misleading and incorrect.

How do we correct this for the users who expect documentation to reflect implementation?

(and yes I very much appreciate the big O reference - that's why I prefer egrep)

thanks

I don't know what your man page says, but uniq from GNU coreutils says:

     Note:  'uniq'  does  not detect repeated lines unless they are adjacent.   You may want to sort the input first, or use `sort -u' without `uniq'.

So it is indeed correct.

Mine is when the intended outcome or meaning diverges significantly enough from the actual outcome or meaning for me to seek out the unix.com forums, become a member, post a problem, read a WORKAROUND solution, and then be told the WORKAROUND is intended -- that I somehow subtly misinterpreted the man page (missed that the lines must be ADJACENT) for uniq to do its -c switch as documented in the manpage:

 -c      Precede each output line with the count of the number of times the line occurred in the input, fol-
         lowed by a single space.

Imagine if I implemented "sort" and said "applies only to letters I R O N and Y" (but buried that subtly with one word in a man page)

At least the uniq man page should clarify this with a note, and offer a switch to presort (at a performance penalty) to deliver the EXPECTED -c results?

I would like to see the uniq source code - is there a reference?

thanks

You are making a really big issue out of this trivial matter and trying to blame the tool, instead of making it a learning experience.

This has nothing to do with the -c switch. -c just adds a number. This is the default behavior of uniq -- it filters only adjacent (consecutive) lines.

What are you trying to say with this comment? The fact that it operates on consecutive lines makes it more general and useful, not less.
So how would you write uniq, if you took the effort? How would you deal with the repeated lines? Would you rather slurp the whole file into memory and make this completely useless for large files? Or do you have a better solution? I'd be very interested to hear it.

But it does! Didn't you read my post? :

Note:  'uniq'  does  not detect repeated lines unless they are adjacent.   You may want to sort the input first, or use `sort -u' without `uniq'.
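As an aside, a toy demonstration of my own of the two options the quoted note mentions: `sort -u` collapses duplicates without uniq at all, but unlike uniq -c it cannot give you counts.

```shell
# sort -u: one copy of each distinct line, no counts
printf 'b\na\nb\na\n' | sort -u

# for counts: sort first so duplicates become adjacent, then uniq -c
printf 'b\na\nb\na\n' | sort | uniq -c
```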

Which uniq do you have installed? What does your man page say?

Of course, help yourself:
GNU Project Archives
Again, I do not know whether it's GNU coreutils that you are using.

No - no, I thank you for the solution - and I think you miss the irony - to use the uniq command for unique results, one is required to presort the input.
Why any self-respecting computer scientist would ever make such a half-assed implementation without full disclosure of the Big O tradeoff / duplicate results, and without offering a switch for the slower yet accurate version of uniq -c, is beyond me.

Everyone:

if you have duplicates in uniq -c - this is a feature, not a bug, since the lines must be ADJACENT to be considered.

If you want your expected results, first sort, then uniq, then sort again.
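Sketched on toy data (hypothetical addresses), the recipe is:

```shell
# sort:     make duplicate lines adjacent
# uniq -c:  collapse adjacent duplicates and prepend counts
# sort -nr: order the result by count, highest first
printf '1.1.1.1\n2.2.2.2\n1.1.1.1\n' | sort | uniq -c | sort -nr
```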

May the google duplicate uniq sort fix solution find you

thanks to mirni - I owe you a vBeer.

One nitpick: it's the first sort that doesn't need the -nr flags -- uniq only needs equal lines to be adjacent, so a plain sort is enough there. The last sort -nr is still needed, since it's what orders the output by count.

O(n log n) is so close to O(n) that sorting does not make much difference at all. And everything is fully disclosed in the documentation; you just have to read carefully -- every word can have significant meaning.

It is not half-assed at all, again you are missing an important point -- this is so that you can filter huge outputs without worrying about memory limitation. It is cleverly designed to be as useful as possible.

Glad I could help. (And I don't drink, but thanks! :wink: )
