finding duplicates in columns and removing lines

I am trying to figure out how to scan a file like so:

1 ralphs office","555-555-5555","ralph@mail.com","www.ralph.com
2 margies office","555-555-5555","ralph@mail.com","www.ralph.com
3 kims office","555-555-5555","kims@mail.com","www.ralph.com
4 tims office","555-555-5555","tims@mail.com","www.ralph.com

and end up with this:

1 ralphs office","555-555-5555","ralph@mail.com","www.ralph.com
3 kims office","555-555-5555","kims@mail.com","www.ralph.com
4 tims office","555-555-5555","tims@mail.com","www.ralph.com

Specifically, I need to look for duplicates in column 3 of a CSV file and, if a duplicate is found, remove the lines that duplicate column 3. In the example above, line two is removed or filtered.

Does anyone know if the unix uniq command can be used for this, or perl? uniq doesn't seem to have a delimiter flag, only options to skip by character or field count.

Thanks!
Totus:confused:

awk -F, '! mail[$3]++' inputfile

Jean-Pierre.

how does that work? I'm vaguely familiar with awk.

awk has associative arrays - the key for the mail array is field #3 ($3).
The first time a given $3 shows up, the value of mail[$3] is zero; mail[$3]++ then increments that array element to one. The next time the same $3 is found, mail[$3] already has a value of 1, so the line does not print.

!mail[$3] evaluates as true only when mail[$3] == 0, so when it is 1, 2, 3 ... it evaluates as false.
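
Written out as a longer script, the one-liner is equivalent to something like this (just an expanded sketch of the same logic):

awk -F, '
{
    if (mail[$3] == 0)    # first time this e-mail address is seen
        print             # keep the line
    mail[$3]++            # remember that it has been seen
}' inputfile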

With the 'uniq' command:

uniq -1 [inputfile]

Hope this helps.

Jean-Pierre,

This seemed to work, but I noticed that there seem to be a few duplicates left behind. How does the array know what the delimiter is? $3 is the field, but I'm not clear on the delimiter. Would the same work with tabs as the delimiter?

Cheers!:confused:

Hi Totus,

From aigles' solution... the delimiter is ",".
So, if you have tabs/spaces, I think you can use it as
awk -F " " '!mail[$4]++' inputfile

(The logic is that you have to specify the column correctly; I hope you noticed that I am using $4.)

-ilan

Thanks ilan, I think I got it. In order to use tabs as the delimiter in awk, it's awk -F"\t".
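
So the tab-delimited version of the same one-liner would presumably be something like this (untested):

awk -F"\t" '!mail[$3]++' inputfile

with $3 changed to whichever column needs to be unique.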

Thanks everyone for your help, it was greatly appreciated.:slight_smile:

awk 'BEGIN { FS = "\"" }          # split on the double quote, so the e-mail address lands in $5
{
    a[$5]++                       # count occurrences of this e-mail address
    if (a[$5] <= 1)               # only the first occurrence gets printed
        print
}' file

Hi,

I have an idea to resolve this issue: take the unique values of the third column with the help of awk, sort and uniq, then grep for each value in the original file in a for loop and use head -1 to keep only the first entry among the duplicate entries.

Here, I took the entries in the file named duplicate.txt

$ cat duplicate.txt
1 ralphs office","555-555-5555","ralph@mail.com","www.ralph.com
2 margies office","555-555-5555","ralph@mail.com","www.ralph.com
3 kims office","555-555-5555","kims@mail.com","www.ralph.com
4 tims office","555-555-5555","tims@mail.com","www.ralph.com

I followed the procedure below at the command prompt:

$ for id in $(awk -F, '{print $3}' duplicate.txt | sort | uniq)
do
grep "$id" duplicate.txt | head -1
done

The output I got was as follows:

3 kims office","555-555-5555","kims@mail.com","www.ralph.com
1 ralphs office","555-555-5555","ralph@mail.com","www.ralph.com
4 tims office","555-555-5555","tims@mail.com","www.ralph.com
$

Hope this works...

Thanks,
Aketi.

I have data like this:
It's sorted by the 2nd field (TID).
envoy,90000000000000634600010001,04/11/2008,23:19:27,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/12/2008,04:23:45,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/12/2008,23:14:25,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,04:23:39,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,22:41:58,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,22:42:44,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,22:49:43,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,22:50:45,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,22:53:23,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/14/2008,12:38:40,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/14/2008,12:52:22,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000693200010001,04/17/2008,09:07:09,RB00060,0009,ENVOY,ERROR,26
envoy,90000000000000693200010001,04/18/2008,10:27:13,RB00083,0009,ENVOY,ERROR,26
envoy,90000000000000693200010001,04/18/2008,11:36:27,RB00084,0009,ENVOY,ERROR,26
envoy,90000000000001034800010001,04/01/2008,23:59:15,RB00294,0030,ENVOY,ERROR,57
envoy,90000000000001034800010001,04/02/2008,23:59:12,RB00295,0030,ENVOY,ERROR,57
envoy,90000000000001034800010001,04/03/2008,23:59:11,RB00296,0030,ENVOY,ERROR,57
envoy,90000000000001034800010001,04/04/2008,23:59:08,RB00297,0030,ENVOY,ERROR,57
envoy,90000000000001034800010001,04/05/2008,23:59:04,RB00297,0030,ENVOY,ERROR,57
envoy,90000000000001034800010001,04/06/2008,22:59:06,RB00297,0030,ENVOY,ERROR,57

I want to do the following:
Check the second field to see if the TID is the same as the previous line. If the TID has been seen before then check the 7th field to see if that is the same as the previous line. If both are the same, I want to remove the line and increment a counter.

My ideal output would look something like this.
11,envoy,90000000000000634600010001,04/11/2008,23:19:27,RB00266,0015,DETAIL,ERROR,
3,envoy,90000000000000693200010001,04/17/2008,09:07:09,RB00060,0009,ENVOY,ERROR,26

etc.

I figure I actually need to do an awk script rather than a one-liner. The other option is to treat the last three fields as one field, compare by the TID field and the error field, and then split them back into three on output.
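
Something along these lines is roughly what I'm picturing, though it's only an untested sketch (it assumes comma-separated input that stays sorted by TID, as above):

awk -F, '
{
    key = $2 FS $7                            # TID plus the 7th field
    if (key == prev)
        count++                               # same group: just count the duplicate
    else {
        if (prev != "") print count "," line  # emit the previous group with its count
        prev = key
        line = $0                             # remember the first line of the new group
        count = 1
    }
}
END { if (prev != "") print count "," line }' file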

Any thoughts? I've looked at other stuff removing dups with awk and it's mostly one liners.

I'd love to get some explanation of WHY it works so that I can mod it if need be.

kinksville,

Please don't hijack someone else's thread; start a new thread for your problem.

Thanks.

Sorry about that, I'm happy to start a new thread. I hadn't wanted to post something that was already being answered.

Hi Guys...

Could you please help me with the following?

aaaa bbbb cccc sdsd
aaaa bbbb cccc qwer

As you can see, the two lines match in three fields.
How can I delete this duplicate? I mean, how can I delete the second line if three fields match?

Thanks
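
For the whitespace-separated sample above, one approach in the spirit of the earlier one-liners is to key the array on the first three fields, something like this (an untested sketch):

awk '!seen[$1,$2,$3]++' file

The first line with a given combination of the first three fields is kept; any later line with the same combination is dropped.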

Duplicate: http://www.unix.com/shell-programming-scripting/65497-help.html#post302196110

Dears,

I need to make field number 7 ($7) unique for the input below:

BSC38_E709 3025-Faiaz-43 43 2-0 SWRF9139V CTU2 X79T7H05ET B-U 2008-11-27
BSC38_E709 3025-_Faiaz-43 43 2-1 SWRF9139V CTU2 X79T7H05ET B-U 2008-11-14
BSC38_E709 3026-Rafgah-5 5 1-0 SWRF9139V CTU2 X79T7H06U3 B-U 2008-11-27
BSC38_E709 3026-Rafgah-5 5 1-1 SWRF9139V CTU2 X79T7H06U3 B-U 2008-11-14
BSC38_E709 3026-Rafgah-5 5 2-0 SWRF9139V CTU2 X79T7H06SM B-U 2008-11-27
BSC38_E709 3026-Rafgah-5 5 2-1 SWRF9139V CTU2 X79T7H06SM B-U 2008-11-14

and the output should be as below:

BSC38_E709 3025-Faiaz-43 43 2-0 SWRF9139V CTU2 X79T7H05ET B-U 2008-11-27
BSC38_E709 3026-Rafgah-5 5 1-0 SWRF9139V CTU2 X79T7H06U3 B-U 2008-11-27
BSC38_E709 3026-Rafgah-5 5 2-0 SWRF9139V CTU2 X79T7H06SM B-U 2008-11-27

Note: uniqueness is on column number 7, and the entire line should be printed, keeping the original order.

Your feedback is highly appreciated, thanks.

Did you search the forum first?

awk '! _[$7]++' file

http://www.unix.com/shell-programming-scripting/62574-finding-duplicates-columns-removing-lines.html#post302189002

Yes, I did, and it didn't work.

I used the one below, but it takes too long:

touch D22
for id in $(awk '/BSC/{print $13}' D3 | uniq)
do
grep "$id" D3 | head -1 >> D22
wait
done

Note: D22 is the output file and D3 is the input file.

Is there any other suggestion? Thanks.
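
One faster alternative along the same lines would be to replace the grep-per-id loop with a single awk pass, keyed on the same field the loop extracts (a sketch, assuming $13 is the column that must be unique in the real file, as in the loop above; in the sample pasted here it would be $7):

awk '/BSC/ && !seen[$13]++' D3 > D22

Each /BSC/ line is printed only the first time its field-13 value is seen, so the file is read just once. That difference may also explain why the earlier one-liner didn't work: it keyed on $7, while the loop pulls $13.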