Uniq command

My man page for uniq says:

NAME
       uniq - report or filter out repeated lines in a file

So that makes me think one of these methods should filter out the duplicate bob. Is that not the case?

: cat junk
bob
joe
bob
: uniq junk
bob
joe
bob
: uniq -u junk
bob
joe
bob

I saw there was a way to do it with awk or perl. I'm just surprised uniq won't do it.

: awk '!seen[$0]++' junk
bob
joe
: perl -ne 'print unless $dup{$_}++;' junk
bob
joe

My man page for uniq (GNU coreutils 8.28) explicitly states: "Filter adjacent matching lines from INPUT." Your man page should state that restriction somewhere.

The input must be sorted first (generally through a pipe).

uniq therefore only needs to consider a small section of the file at any time.
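
For example, with the junk file from the first post, sorting before piping into uniq makes the duplicate bob disappear as expected:

: sort junk | uniq
bob
joe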

The perl and awk solutions both build an in-memory table of lines already seen, which does not depend on input order, but does assume you have enough memory to hold (in the worst case) the complete input file. sort generally requires disc workfile space of roughly twice the input file size, but far less RAM.


@cokedude, which OS/version and uniq version? A single line (as posted) is far short of what any man page would have as documentation.

Most (if not all) will state that only adjacent duplicated lines are detected (and hence removed), and that data should generally be sorted before being given to uniq.

sort -u is a preferable alternative.

sort -u looks good, but the result is always a sorted file. If for some reason we need to keep the original order of the lines and just remove the duplicates, I prefer the awk version even though it uses more RAM.

nl junk | sort -u -k2 | sort | cut -f2
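
A stage-by-stage sketch of that same pipeline (comments after each pipe are legal shell):

: nl junk |        # prefix each line with its line number (tab-separated)
    sort -u -k2 |  # de-duplicate, comparing from field 2 (the content) on
    sort |         # re-sort on the numeric prefix to restore original order
    cut -f2        # strip the line numbers again

On the sample junk file this prints bob then joe. One hedge: which of the duplicate lines survives sort -u depends on how the implementation orders equal keys; GNU sort keeps the first occurrence in practice.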

I just realized that cat -n is not POSIX (though it exists in almost all *nixes);
nl should be used instead.

I think that sort has nearly the same memory footprint, because the initial quicksort passes happen in RAM.
awk lets you avoid storing any value in the array:

awk '!($0 in seen) {seen[$0]; print}' junk

Whether this saves memory depends on the awk implementation.


sort will use as much memory as it is permitted, but it won't (shouldn't) misbehave if you have a huge file: it should switch to temporary files long before it needs to swap.

My man page for sort (GNU coreutils 8.28) is not good: it documents options like --buffer-size=2M or --buffer-size=30%, but the option does not accept 2.5M, and the page does not state the default size.
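
For reference, both spellings in action (a sketch; sizes must be whole numbers, and a % suffix means a percentage of physical memory):

: sort --buffer-size=2M junk | uniq
bob
joe
: sort -S 30% junk | uniq
bob
joe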

I'm not sure I understand how this awk version works.
I compared the output of:
seq 5 ; seq 5 | awk '!seen[$0]++'
and
seq 5 ; seq 5 | awk '!($0 in seen) {seen[$0]; print}'
and they are different; the second one does not remove the duplicates, and I don't understand how this code is supposed to do it. My awk version is GNU Awk 5.1.0, API: 3.0.

It does work.
Perhaps you missed the brackets around your input commands?
(seq 5 ; seq 5) | awk ...
{ seq 5 ; seq 5 ; } | awk ...
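
With the grouping, both forms feed one stream of ten lines to awk, and the duplicates go away:

: ( seq 5 ; seq 5 ) | awk '!seen[$0]++'
1
2
3
4
5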


Yes, the brackets make the difference: one set of 10 lines instead of two sets of 5, each with no duplicates. Thanks!
With some experimenting I've figured out how it works, though I'm not sure whether awk keeps only the hashed keys or some default value associated with each key, because the output of:
( seq 5 ; seq 5 ) | awk '!($0 in seen) {seen[$0]; print} END{for (k in seen) {print seen[k]} }'
is the same as:
( seq 5 ; seq 5 ) | awk '!($0 in seen) {seen[$0]=""; print} END{for (k in seen) {print seen[k]} }'

You can see a difference in GNU awk when doing
END {for (k in seen) {print typeof(seen[k])} }
But not all awk versions have typeof.
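
A complete invocation, assuming GNU awk 4.2 or later for typeof():

: ( seq 5 ; seq 5 ) | gawk '!($0 in seen) {seen[$0]; print} END {for (k in seen) print typeof(seen[k])}'

After the five numbers, this reports each element as unassigned (or untyped, depending on the gawk version); with seen[$0]="" they would be reported as string.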

Yes, now it is clear that the first one is unassigned and the second one is of type string. Even though both are printed the same way in the output, the unassigned one is expected to use less memory. Thanks!

I am using an ancient SunOS.

: uname -a
SunOS ah5719006ub002 5.10 Generic_150400-64 sun4u sparc SUNW,SPARC-Enterprise

Hmm, not sure 'Backspace in old school vi' is relevant to your question about uniq, or to my request that you post more than a single line of 'documentation'. Anyway, this question seems to have run its course and been answered numerous times within the thread, with decent alternatives to uniq (along with [ir]relevant side chats/opinions :smiley: ).

rgds

In Solaris use nawk (new awk) or /usr/xpg4/bin/awk (POSIX awk).
The Solaris /usr/bin/awk is linked to oawk (old awk, as shipped with Unix SysV 4.0).

Unixes other than Solaris have switched to nawk.
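
So on Solaris 10 the one-liner from earlier in the thread would run as, for example:

: /usr/bin/nawk '!seen[$0]++' junk
bob
joe

/usr/xpg4/bin/awk accepts the same script.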


Just to add my 2c: if memory were a problem, I'd consider creating a hashmap from line content to line number, skipping or overwriting duplicate hashes during construction. It could be done in a shell script, but would probably be easier in Python or similar. A binary key:value representation would be more memory-efficient than strings.
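
A minimal POSIX-shell sketch of the idea (hypothetical: it keys on cksum CRCs rather than full line content, so a CRC collision between two distinct lines would wrongly drop one, and forking cksum per line is slow; a real implementation would use a proper hash table):

: seen=
: while IFS= read -r line; do
      h=$(printf '%s\n' "$line" | cksum | cut -d' ' -f1)   # CRC of this line
      case " $seen " in
          *" $h "*) ;;                  # CRC already recorded: skip duplicate
          *) seen="$seen $h"            # remember the CRC, not the whole line
             printf '%s\n' "$line" ;;
      esac
  done < junk
bob
joe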
