palex
February 2, 2011, 9:47am
1
Hello,
I have a large data file:
1234 8888 bbb
2745 8888 bbb
9489 8888 bbb
1234 8888 aaa
4838 8888 aaa
3977 8888 aaa
I need to remove duplicate lines (where the first column is the duplicate). I have been using:
sort file.txt | uniq -w4 > newfile.txt
However, it seems to keep the first of the duplicate pair alphabetically. So in the example above,
1234 8888 aaa would be kept, and
1234 8888 bbb would be excluded
I need to modify the command so that the first of the two lines *chronologically* (that is, the one that appears first in the file) is kept. In this case, that is 1234 8888 bbb.
Thanks so much!
drl
February 2, 2011, 10:22am
2
Hi.
This seems to work, assuming you have an appropriate sort command:
#!/usr/bin/env bash
# @(#) s1 Demonstrate sort and uniq.
# Section 1, setup, pre-solution.
# Infrastructure details, environment, commands for forum posts.
# Uncomment export command to test script as external user.
# export PATH="/usr/local/bin:/usr/bin:/bin"
set +o nounset
pe() { for i;do printf "%s" "$i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
C=$HOME/bin/context && [ -f $C ] && . $C specimen sort uniq
set -o nounset
pe
FILE=${1-data1}
# Section 2, display input file.
# Display sample of data file, with head & tail as a last resort.
pe " || start [ first:middle:last ]"
specimen $FILE \
|| { pe "(head/tail)"; head -n 5 $FILE; pe " ||"; tail -n 5 $FILE; }
pe " || end"
# Section 3, solution.
pl " Results, sort | uniq:"
sort -k1,1 $FILE | uniq -w4
pl " Results, sort -u:"
sort -k1,1 -u $FILE
exit 0
producing:
% ./s1
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution : Debian GNU/Linux 5.0.7 (lenny)
GNU bash 3.2.39
specimen (local) 1.17
sort (GNU coreutils) 6.10
uniq (GNU coreutils) 6.10
|| start [ first:middle:last ]
Whole: 5:0:5 of 6 lines in file "data1"
1234 8888 bbb
2745 8888 bbb
9489 8888 bbb
1234 8888 aaa
4838 8888 aaa
3977 8888 aaa
|| end
-----
Results, sort | uniq:
1234 8888 aaa
2745 8888 bbb
3977 8888 aaa
4838 8888 aaa
9489 8888 bbb
-----
Results, sort -u:
1234 8888 bbb
2745 8888 bbb
3977 8888 aaa
4838 8888 aaa
9489 8888 bbb
See man sort for details.
Best wishes ... cheers, drl
birei
February 2, 2011, 10:45am
3
Hi,
Another solution, using perl:
$ perl -lane 'print if not $no{ $F[0] }++' infile
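The same idea can be written in awk if perl is not available. This is only a minimal sketch run against the six sample lines from the first post, not against the real large file; the temporary file and the array name seen are made up for illustration. It prints a line only the first time its first field appears, so no sort is needed and the input order is preserved.

```shell
# Sample data from the first post, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
1234 8888 bbb
2745 8888 bbb
9489 8888 bbb
1234 8888 aaa
4838 8888 aaa
3977 8888 aaa
EOF

# seen[$1]++ evaluates to 0 the first time a first-column value appears,
# so !seen[$1]++ is true only for the first line carrying that value.
result=$(awk '!seen[$1]++' "$tmp")
printf '%s\n' "$result"

rm -f "$tmp"
```

On the sample this keeps 1234 8888 bbb, drops 1234 8888 aaa, and leaves the other lines in their original order.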
Regards,
Birei
methyl
February 2, 2011, 12:10pm
4
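Try the "-u" switch to uniq (print only lines that are not repeated). Be aware that this drops every line whose first four characters are duplicated, so neither of the 1234 lines survives:
sort file.txt | uniq -w4 -u > newfile.txt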
I tried this
awk '!a[$1$2]++' filename
on this
01/Feb/2011 -- User Count : 27
31/Jan/2011 -- User Count : 21
02/Feb/2011 -- User Count : 24
30/Jan/2011 -- User Count : 4
and it didn't sort by month and day. I assume that is because I didn't specify the correct columns.
Try:
awk -v FS="/" -v OFS="/" '!a[$1$2]++' filename
Note: make sure the first of each set of duplicates is the one you want to keep.
I changed it to this to try to get the months to sort:
awk -v FS="/" -v OFS="/" '!a[$2$1]++' filename
but I don't think I'm using a[$1$2] correctly for splitting the columns.
drl
February 2, 2011, 4:07pm
9
Hi, dba_frog.
dba_frog:
I tried this
awk '!a[$1$2]++' filename
on this
01/Feb/2011 -- User Count : 27
31/Jan/2011 -- User Count : 21
02/Feb/2011 -- User Count : 24
30/Jan/2011 -- User Count : 4
and it didn't sort by month and day. I assume that is because I didn't specify the correct columns.
The main purpose of this thread is to choose the correct line among lines that have the same value for a field.
Although sorting may be involved in some solutions, the purpose of most of the awk codes is to remove duplicates.
If you are interested in sorting your data, I suggest that you start a new thread.
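To make the distinction concrete, here is a minimal sketch using the four lines quoted above. It assumes "/" is the intended field separator (set with -F/), and the temporary file is made up for illustration. The expression a[$1$2] is just an array lookup keyed on the first two fields joined together; the !a[...]++ idiom prints a line the first time its key is seen and never reorders anything.

```shell
# The four sample lines quoted above, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
01/Feb/2011 -- User Count : 27
31/Jan/2011 -- User Count : 21
02/Feb/2011 -- User Count : 24
30/Jan/2011 -- User Count : 4
EOF

# With "/" as the separator, $1 is the day and $2 is the month,
# so the key $1$2 is a string such as "01Feb".
keys=$(awk -F/ '{ print $1 $2 }' "$tmp")
printf '%s\n' "$keys"

# Every key is distinct here, so the filter passes all four lines
# through unchanged and in their original order.
deduped=$(awk -F/ '!a[$1$2]++' "$tmp")
printf '%s\n' "$deduped"

rm -f "$tmp"
```

Since no key repeats in this sample, the output is identical to the input, which is why nothing appeared to be sorted.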
Best wishes ... cheers, drl
Have a look at stable sort. For example, in GNU sort this is the -s option:
$ sort -us -k1,1n infile
1234 8888 bbb
2745 8888 bbb
3977 8888 aaa
4838 8888 aaa
9489 8888 bbb
palex
February 2, 2011, 10:51pm
11
Thanks everyone...
Thanks, drl... that worked perfectly.
PA
drl
February 3, 2011, 4:57am
12
Hi, palex.
You are welcome.
It's possible that with the small sample we have, we were just lucky. You may need -- as Scrutinizer wrote -- to use the "-s" option in addition to "-u" on the sort (not on the uniq):
Finally, as a last resort when all keys compare
equal, `sort' compares entire lines as if no ordering options other
than `--reverse' (`-r') were specified. The `--stable' (`-s') option
disables this "last-resort comparison" so that lines in which all
fields compare equal are left in their original relative order.
-- excerpt from info coreutils sort
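The effect of that last-resort comparison is easy to see on the sample data. The following is a minimal sketch, assuming a sort that supports -s (GNU sort does); the temporary file and variable names are made up for illustration.

```shell
# Sample data from the first post, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
1234 8888 bbb
2745 8888 bbb
9489 8888 bbb
1234 8888 aaa
4838 8888 aaa
3977 8888 aaa
EOF

# Without -s, the two 1234 lines tie on the key, so sort falls back to
# comparing whole lines and "aaa" comes out ahead of "bbb".
plain=$(LC_ALL=C sort -k1,1 "$tmp" | head -n 2)

# With -s, tied lines stay in their original relative order.
stable=$(LC_ALL=C sort -s -k1,1 "$tmp" | head -n 2)

# Adding -u then keeps one line per key.
kept=$(LC_ALL=C sort -s -u -k1,1 "$tmp")

printf 'plain:\n%s\nstable:\n%s\nkept:\n%s\n' "$plain" "$stable" "$kept"

rm -f "$tmp"
```

On this sample the plain sort lists 1234 8888 aaa first, the stable sort lists 1234 8888 bbb first, and the -s -u combination keeps 1234 8888 bbb.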
Good luck ... cheers, drl