sort | uniq question

palex · February 2, 2011, 9:47am

Hello,
I have a large data file:

1234 8888 bbb
2745 8888 bbb
9489 8888 bbb
1234 8888 aaa
4838 8888 aaa
3977 8888 aaa

I need to remove duplicate lines (where the first column is the duplicate). I have been using:

sort file.txt | uniq -w4 > newfile.txt

However, it seems to keep the first of the duplicate pair alphabetically. So in the example above,

1234 8888 aaa    would be kept, and
1234 8888 bbb    would be excluded

I need to modify the command so that the line that comes first *chronologically*, i.e. earlier in the file, is kept (in this case, 1234 8888 bbb).
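For reference, a minimal reproduction of the behavior described above, using the six sample lines. The temporary file and the variable names are placeholders invented for this sketch, and LC_ALL=C is set only to make the sort order predictable.

```shell
#!/bin/sh
# Reproduce the problem with the six sample lines from the post.
# "tmp" and "kept" are names made up for this sketch.
tmp=$(mktemp)
printf '%s\n' \
  '1234 8888 bbb' \
  '2745 8888 bbb' \
  '9489 8888 bbb' \
  '1234 8888 aaa' \
  '4838 8888 aaa' \
  '3977 8888 aaa' > "$tmp"

# A plain sort compares whole lines, so "1234 8888 aaa" lands before
# "1234 8888 bbb"; uniq -w4 then keeps the first line of each run.
kept=$(LC_ALL=C sort "$tmp" | uniq -w4 | grep '^1234')
printf '%s\n' "$kept"

rm -f "$tmp"
```

So the line that survives is decided by the sort order, not by the position in the original file.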

Thanks so much!

drl · February 2, 2011, 10:22am

Hi.

This seems to work, assuming you have an appropriate sort command:

#!/usr/bin/env bash

# @(#) s1	Demonstrate sort and uniq.

# Section 1, setup, pre-solution.
# Infrastructure details, environment, commands for forum posts. 
# Uncomment export command to test script as external user.
# export PATH="/usr/local/bin:/usr/bin:/bin"
set +o nounset
pe() { for i;do printf "%s" "$i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
C=$HOME/bin/context && [ -f $C ] && . $C specimen sort uniq
set -o nounset
pe

FILE=${1-data1}

# Section 2, display input file.
# Display sample of data file, with head & tail as a last resort.
pe " || start [ first:middle:last ]"
specimen "$FILE" \
|| { pe "(head/tail)"; head -n 5 "$FILE"; pe " ||"; tail -n 5 "$FILE"; }
pe " || end"

# Section 3, solution.
pl " Results, sort | uniq:"
sort -k1,1 "$FILE" | uniq -w4

pl " Results, sort -u:"
sort -k1,1 -u "$FILE"

exit 0

producing:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.7 (lenny) 
GNU bash 3.2.39
specimen (local) 1.17
sort (GNU coreutils) 6.10
uniq (GNU coreutils) 6.10

 || start [ first:middle:last ]
Whole: 5:0:5 of 6 lines in file "data1"
1234 8888 bbb
2745 8888 bbb
9489 8888 bbb
1234 8888 aaa
4838 8888 aaa
3977 8888 aaa
 || end

-----
 Results, sort | uniq:
1234 8888 aaa
2745 8888 bbb
3977 8888 aaa
4838 8888 aaa
9489 8888 bbb

-----
 Results, sort -u:
1234 8888 bbb
2745 8888 bbb
3977 8888 aaa
4838 8888 aaa
9489 8888 bbb

See man sort for details.

Best wishes ... cheers, drl

birei · February 2, 2011, 10:45am

Hi,

Another solution, using 'perl':

$ perl -lane 'print if not $no{ $F[0] }++' infile

Regards,
Birei

methyl · February 2, 2011, 12:10pm

Try the "-u" switch to uniq (only print unique lines).

sort file.txt | uniq -w4 -u > newfile.txt
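A quick check of what -u does to the sample data from the first post may help here. With GNU uniq, -u prints only the lines that are not repeated at all, so both 1234 lines are dropped rather than one of them being kept. The file and variable names below are placeholders for this sketch.

```shell
#!/bin/sh
# Sample data from the first post; "tmp", "result" and "count" are
# names invented for this sketch.
tmp=$(mktemp)
printf '%s\n' \
  '1234 8888 bbb' \
  '2745 8888 bbb' \
  '9489 8888 bbb' \
  '1234 8888 aaa' \
  '4838 8888 aaa' \
  '3977 8888 aaa' > "$tmp"

# uniq -u discards every line that belongs to a repeated run,
# compared here on the first 4 characters only.
result=$(LC_ALL=C sort "$tmp" | uniq -w4 -u)
printf '%s\n' "$result"

# Count how many 1234 lines are left (grep -c prints 0 when none match).
count=$(printf '%s\n' "$result" | grep -c '^1234' || true)
printf '1234 lines remaining: %s\n' "$count"

rm -f "$tmp"
```

If one line per key has to survive, -u on uniq is probably not the switch you want.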

yinyuemi · February 2, 2011, 2:45pm

awk '!a[$1$2]++'
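For readers new to this idiom: a[key]++ returns the count before incrementing it, so !a[key]++ is true only the first time a key is seen, and awk prints the line by default. Below is a sketch keyed on the first column alone, which is what the original question asks for. The array name seen, the temporary file and the variable names are placeholders.

```shell
#!/bin/sh
# Sample data from the first post, written to a throwaway file.
tmp=$(mktemp)
printf '%s\n' \
  '1234 8888 bbb' \
  '2745 8888 bbb' \
  '9489 8888 bbb' \
  '1234 8888 aaa' \
  '4838 8888 aaa' \
  '3977 8888 aaa' > "$tmp"

# Print a line only the first time its first field is seen.
# No sorting is involved, so the output stays in file order.
result=$(awk '!seen[$1]++' "$tmp")
printf '%s\n' "$result"

rm -f "$tmp"
```

If two columns form the key, a[$1,$2] keeps the fields apart, whereas $1$2 would treat "12 34" and "123 4" as the same key.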

dba_frog · February 2, 2011, 2:50pm

i tried this

awk '!a[$1$2]++' filename

on this

01/Feb/2011   -- User Count : 27
  31/Jan/2011   --  User Count : 21
  02/Feb/2011   -- User Count : 24
  30/Jan/2011   --  User Count : 4

and it didn't sort by mo & day. But, I assumed that is because I didn't specify the correct columns.

yinyuemi · February 2, 2011, 2:54pm

try:

awk -v FS="/" -v OFS="/" '!a[$1$2]++'

Note: make sure the first duplicate is the one you want to keep.
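One thing to watch for when setting separators: each awk variable needs its own -v, because -v takes a single var=value pair. A small sketch of the difference, using a made-up input line in the date format from the post above (the variable names combined and fields are placeholders):

```shell
#!/bin/sh
# Combined form: everything after the first "=" becomes the value of FS,
# so FS ends up as the literal string "OFS=/" and OFS is left alone.
combined=$(awk -v FS=OFS="/" 'BEGIN { print FS }')
printf '%s\n' "$combined"

# Separate assignments: FS and OFS are both "/".
fields=$(printf '01/Feb/2011 -- User Count : 27\n' |
  awk -v FS="/" -v OFS="/" '{ print $1, $2 }')
printf '%s\n' "$fields"
```

With the separate form, $1 is the day and $2 is the month, so a[$1$2] is keyed on day plus month.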

dba_frog · February 2, 2011, 3:45pm

I changed it to this to try and get the months to sort:

awk -v FS="/" -v OFS="/" '!a[$2$1]++' filename

, but I don't think I'm getting the usage of

a[$1$2]

right for splitting the columns

drl · February 2, 2011, 4:07pm

Hi, dba_frog.

dba_frog:

i tried this
awk '!a[$1$2]++' filename
on this
01/Feb/2011   -- User Count : 27
  31/Jan/2011   --  User Count : 21
  02/Feb/2011   -- User Count : 24
  30/Jan/2011   --  User Count : 4
and it didn't sort by mo & day. But, I assumed that is because I didn't specify the correct columns.

The main purpose of this thread is to choose the correct line among lines that have the same value for a field.

Although sorting may be involved in some solutions, the purpose of most of the awk code is to remove duplicates.

If you are interested in sorting your data, I suggest that you start a new thread.

Best wishes ... cheers, drl

Scrutinizer · February 2, 2011, 6:07pm

Have a look at stable sort. For example, in GNU sort this is the -s option:

$ sort -us -k1,1n infile
1234 8888 bbb
2745 8888 bbb
3977 8888 aaa
4838 8888 aaa
9489 8888 bbb
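The same option also helps when sort feeds uniq. A sketch, assuming GNU sort and uniq and the sample data from the first post (file and variable names are placeholders): with -s, lines whose keys compare equal stay in input order, so uniq -w4 sees the earlier line first.

```shell
#!/bin/sh
# Sample data from the first post, written to a throwaway file.
tmp=$(mktemp)
printf '%s\n' \
  '1234 8888 bbb' \
  '2745 8888 bbb' \
  '9489 8888 bbb' \
  '1234 8888 aaa' \
  '4838 8888 aaa' \
  '3977 8888 aaa' > "$tmp"

# Stable numeric sort on column 1 only; equal keys keep their file order,
# then uniq keeps the first line of each 4-character run.
result=$(sort -s -k1,1n "$tmp" | uniq -w4)
printf '%s\n' "$result"

# The surviving 1234 line is the one that came first in the file.
first=$(printf '%s\n' "$result" | head -n 1)

rm -f "$tmp"
```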

palex · February 2, 2011, 10:51pm

Thanks everyone...
Thanks, drl... that worked perfectly.

PA

drl · February 3, 2011, 4:57am

Hi, palex.

You are welcome.

It's possible that with the small sample we have, we were just lucky. You may need, as Scrutinizer wrote, to use the "-s" option in addition to "-u" on the sort (not on the uniq):

Finally, as a last resort when all keys compare
equal, `sort' compares entire lines as if no ordering options other
than `--reverse' (`-r') were specified.  The `--stable' (`-s') option
disables this "last-resort comparison" so that lines in which all
fields compare equal are left in their original relative order. 

-- excerpt from info coreutils sort
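One way to check that the result on the full file is not luck is to compare it against an order-based reference. The sketch below builds that reference with awk (keep the first line seen for each first-column value), sorts it by the same key, and compares it with the sort -s -u output. The file names are placeholders, and the sample data stands in for the real file.

```shell
#!/bin/sh
# Throwaway files: input, reference result, candidate result.
tmp=$(mktemp); ref=$(mktemp); out=$(mktemp)
printf '%s\n' \
  '1234 8888 bbb' \
  '2745 8888 bbb' \
  '9489 8888 bbb' \
  '1234 8888 aaa' \
  '4838 8888 aaa' \
  '3977 8888 aaa' > "$tmp"

# Reference: first occurrence of each key in file order, then sorted by key.
awk '!seen[$1]++' "$tmp" | LC_ALL=C sort -s -k1,1 > "$ref"

# Candidate: stable, unique sort on the first column.
LC_ALL=C sort -s -u -k1,1 "$tmp" > "$out"

# cmp -s is silent and only reports through its exit status.
if cmp -s "$ref" "$out"; then
  status=same
else
  status=different
fi
printf '%s\n' "$status"

rm -f "$tmp" "$ref" "$out"
```

If the two files differ on the real data, the sort in use is not keeping the earliest line for each key.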

Good luck ... cheers, drl