Hello gurus - I must be missing something, or there must be a better way - please enlighten me.
I'm on a Solaris 10 VM running a pipeline to reduce some Apache logs (actually lynx dumps of /server-status/ taken when threads are above a threshold) to a set of offending DDoS IP addresses.
uniq remembers only one line -- the last one seen. Therefore, a file
one
two
two
one
will be
one
two
one
after running through uniq. It has forgotten that it already saw 'one'.
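A quick way to see this at a shell prompt (any POSIX uniq behaves the same):

```shell
# Only ADJACENT duplicates collapse, so the trailing 'one' survives:
printf 'one\ntwo\ntwo\none\n' | uniq
# prints: one, two, one (one per line)

# Sort first so all duplicates become adjacent:
printf 'one\ntwo\ntwo\none\n' | sort | uniq
# prints: one, two
```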
Before anyone starts complaining about how dumb uniq is, let me assure you that this is a feature -- it lets you run the filter on a huge file that wouldn't fit in memory. Since sorting can be done fast (~ O(n log n)), it is usually run in a chain: sort | uniq. If sorting messes you up and you absolutely have to preserve order, you could use awk to count the occurrences, e.g.
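A sketch of that awk approach -- counting occurrences while preserving first-seen order. Note the count array lives in memory, so this gives up the streaming advantage that uniq has:

```shell
# Count every line; remember the order of first appearance,
# then print counts in that order (mimicking uniq -c's layout).
printf 'one\ntwo\ntwo\none\n' |
awk '{ if (!($0 in count)) order[++n] = $0
       count[$0]++ }
     END { for (i = 1; i <= n; i++)
             printf "%7d %s\n", count[order[i]], order[i] }'
# prints a count of 2 for "one", then 2 for "two"
```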
Mine is when the intended outcome or meaning diverges far enough from the actual outcome or meaning that I seek out the unix.com forums, become a member, post a problem, read a WORKAROUND solution, and am then told the WORKAROUND stems from a subtle misreading of the man page (I missed that the lines must be ADJACENT) for uniq's -c switch, as documented in the man page:
-c      Precede each output line with the count of the number of times the line occurred in the input, followed by a single space.
Imagine if I implemented "sort" and said "applies only to the letters I R O N and Y" (but buried that subtly in one word of a man page).
At least the uniq man page should clarify this with a note, and offer a switch to presort (at a performance penalty) to deliver the EXPECTED -c results?
I would like to see the uniq source code - is there a reference?
You are making a really big issue out of this trivial matter and trying to blame the tool, instead of making it a learning experience.
This has nothing to do with the -c switch. -c just adds a number. This is the default behavior of uniq -- it filters only adjacent (consecutive) lines.
What are you trying to say with this comment? The fact that it operates on consecutive lines makes it more general and useful, not less.
So how would you write uniq, if you took the effort? How would you deal with the repeated lines? Would you rather slurp the whole file into memory and make this completely useless for large files? Or do you have a better solution? I'd be very interested to hear it.
But it does! Didn't you read my post? :
Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use `sort -u' without `uniq'.
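For plain de-duplication, the two alternatives in that note are interchangeable (you still need uniq itself if you want the -c counts):

```shell
# De-duplicate via sort piped into uniq:
printf 'one\ntwo\ntwo\none\n' | sort | uniq
# De-duplicate in one step; same output:
printf 'one\ntwo\ntwo\none\n' | sort -u
```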
Which uniq do you have installed? What does your man page say?
Of course, help yourself: GNU Project Archives
Again, I do not know whether it's GNU coreutils that you are using.
No - no, I thank you for the solution - and I think you miss the irony: to use the uniq command for unique results, one is required to presort the input.
Why any self-respecting computer scientist would ever ship such a half-assed implementation without fully disclosing the Big-O tradeoff and the duplicate results, and without offering a switch for the slower yet accurate version of uniq -c, is beyond me.
Everyone:
if you get duplicates in your uniq -c output - this is a feature, not a bug, since lines must be ADJACENT to be considered duplicates.
If you want your expected results, first sort, then uniq, then sort again.
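Applied to the log-reduction problem in this thread, that recipe might look like the following (hypothetical input; this assumes one client IP per line has already been extracted from the logs):

```shell
# Group duplicates, count them, then rank offenders by count:
printf '10.0.0.1\n10.0.0.2\n10.0.0.1\n10.0.0.1\n' |
sort | uniq -c | sort -rn
# prints 10.0.0.1 with count 3 first, then 10.0.0.2 with count 1
```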
May the google duplicate uniq sort fix solution find you
O(n log n) is so close to O(n) that sorting does not make much difference at all. And everything is fully disclosed in the documentation; you just have to read carefully -- every word can have significant meaning.
It is not half-assed at all; again, you are missing an important point -- it works this way so that you can filter huge outputs without worrying about memory limitations. It is cleverly designed to be as useful as possible.
Glad I could help. (And I don't drink, but thanks! )