My man page for uniq (GNU coreutils 8.28) explicitly states: "Filter adjacent matching lines from INPUT." Your man page should have that restriction somewhere in the page.
The input must be sorted first (generally through a pipe).
uniq therefore only needs to consider a small section of the file at any time.
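The adjacency requirement is easy to demonstrate with a non-adjacent duplicate:

```shell
# 'b' appears twice, but not on adjacent lines, so plain uniq keeps both copies
printf 'b\na\nb\n' | uniq          # -> b a b
# sorting first makes the duplicates adjacent, so uniq can drop one
printf 'b\na\nb\n' | sort | uniq   # -> a b
```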
The perl and awk solutions both build an in-memory table of lines that have already been seen. That approach does not rely on the input being sorted, but it does assume you have enough memory to hold (in the worst case) every distinct line of the input. sort generally requires disc workfile space of about double the input file size, but much less RAM.
@cokedude, which OS/version and uniq version? A single line (as posted) is way short of what any man page would have as documentation.
Most (if not all) will state that only adjacent duplicated lines are detected (hence 'removed'), and that the data should generally be sorted before being given to uniq.
sort -u looks good, but the result is always a sorted file. If for some reason we need to keep the original order of the lines and just remove the duplicates, I prefer the awk version, even though it uses more RAM.
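A small sketch of the trade-off:

```shell
printf 'b\na\nb\n' | sort -u             # -> a b  (duplicates gone, order lost)
printf 'b\na\nb\n' | awk '!seen[$0]++'   # -> b a  (duplicates gone, order kept)
```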
sort will use as much memory as it is permitted, but it won't (shouldn't) do bad things if you have a huge file. It should switch to temporary files long before it needs to swap.
My man page for sort (GNU coreutils) 8.28 is not good. It permits an option like --buffer-size=2M or --buffer-size=30%, but it does not accept 2.5M and does not document the default size.
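As an illustration (assuming GNU sort): --buffer-size (or -S) caps the in-RAM buffer, and -T chooses where the spill files go. With input this small nothing spills, but the flags are accepted:

```shell
# cap the sort buffer at 1 MiB; any temporary spill files would go in /tmp
printf '3\n1\n2\n' | sort -S 1M -T /tmp -n   # -> 1 2 3
```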
I'm not sure I understand how this awk version works.
I compared the output of: seq 5 ; seq 5 | awk '!seen[$0]++'
and seq 5 ; seq 5 | awk '!($0 in seen) {seen[$0]; print}'
and they are different: the second one does not remove the duplicates, and I don't understand how this code would do it. My awk version is GNU Awk 5.1.0, API: 3.0.
Yes, the brackets make the difference: with them, awk receives a single stream of 10 lines and removes the duplicates, instead of the output being two sets of 5 (the first seq printed directly, and only the second one filtered by awk). Thanks!
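The difference the parentheses make, sketched:

```shell
# without parentheses, only the second seq is piped into awk:
# the first seq prints 1..5 directly, then awk prints its own 1..5 -> 10 lines
seq 5 ; seq 5 | awk '!seen[$0]++'
# with parentheses, both streams feed awk, which drops the repeats -> 5 lines
( seq 5 ; seq 5 ) | awk '!seen[$0]++'
```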
With some experiments I figured out how it works now, even though I'm not sure whether awk keeps only the hashed keys or also some default value associated with each key, because the output of: ( seq 5 ; seq 5 ) | awk '!($0 in seen) {seen[$0]; print} END{for (k in seen) {print seen[k]} }'
is the same as: ( seq 5 ; seq 5 ) | awk '!($0 in seen) {seen[$0]=""; print} END{for (k in seen) {print seen[k]} }'
Yes, now it is clear: the first one is unassigned and the second one is of type string. Even though both are printed the same way to the output, the unassigned one is expected to use less memory. Thanks!
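The distinction can be seen directly: the `in` test does not create an array element, while a bare reference does (creating the key with an uninitialized, empty-string value). A small sketch:

```shell
# 'in' only tests membership; afterwards the array is still empty
awk 'BEGIN { if ("a" in t) print "found"; n=0; for (k in t) n++; print n }'
# -> 0
# a bare reference t["a"] creates the key with an uninitialized value
awk 'BEGIN { t["a"]; n=0; for (k in t) n++; print n }'
# -> 1
```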
Hmm, not sure 'Backspace in old school vi' is relevant to your request regarding uniq, or to my request that you post more than a single line of 'documentation'. Anyway, this question seems to have run its course and been answered numerous times within the thread, with decent alternatives to uniq (along with [ir]relevant side chats/opinions).
In Solaris, use nawk (new awk) or /usr/xpg4/bin/awk (POSIX awk).
The Solaris /usr/bin/awk is linked to oawk (old awk, as shipped with Unix SysV 4.0).
Just to add my 2c: if memory were a problem, I'd consider creating a hashmap from line content to line number, skipping or overwriting duplicate hashes during construction. It could be done in a shell script, but it would probably be easier in Python or similar. A binary representation of key:value would be more memory-efficient than strings.
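A rough shell sketch of that idea (hypothetical `dedupe` function; assumes bash 4+ for associative arrays and the coreutils md5sum): instead of keeping every distinct line in memory, keep only a fixed-size digest per line. Caveats: spawning one md5sum process per line makes this slow (an in-process hash, as in Python, would be far faster), and a digest collision would wrongly drop a distinct line (astronomically unlikely with a 128-bit digest, but not impossible).

```shell
# Print each line the first time its digest is seen, preserving input order.
dedupe() {
    declare -A seen              # digest -> 1 (the line text itself is not stored)
    local line h
    while IFS= read -r line; do
        h=$(printf '%s' "$line" | md5sum)
        if [[ -z ${seen[$h]} ]]; then
            seen[$h]=1
            printf '%s\n' "$line"
        fi
    done
}

printf 'b\na\nb\na\n' | dedupe   # -> b a
```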