finding duplicates in columns and removing lines

I am trying to figure out how to scan a file like so:

1 ralphs office","555-555-5555","ralph@mail.com","www.ralph.com
2 margies office","555-555-5555","ralph@mail.com","www.ralph.com
3 kims office","555-555-5555","kims@mail.com","www.ralph.com
4 tims office","555-555-5555","tims@mail.com","www.ralph.com

and end up with this:

1 ralphs office","555-555-5555","ralph@mail.com","www.ralph.com
3 kims office","555-555-5555","kims@mail.com","www.ralph.com
4 tims office","555-555-5555","tims@mail.com","www.ralph.com

Specifically, I need to look for duplicates in column 3 of a CSV file and, if a duplicate is found, remove the lines that duplicate column 3. In the example above, line two is removed or filtered.

Does anyone know if the unix uniq command can be used for this, or perl? uniq doesn't seem to have a delimiter flag, only options to skip by character or field count.

Thanks!
Totus:confused:

awk -F, '! mail[$3]++' inputfile

Jean-Pierre.

how does that work? I'm vaguely familiar with awk.

awk has associative arrays - the key for the mail array is field #3 ($3).
The first time a given $3 shows up, the value of mail[$3] is zero; mail[$3]++ then increments that array element to one. The next time the same $3 is found, mail[$3] already has a value of 1, so the line does not print.

!mail[$3] evaluates as true only when mail[$3] == 0, so when it is 1, 2, 3 ... it evaluates as false.
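
Written out as a longer script, the one-liner is equivalent to something like this (just an expanded sketch of the same logic):

awk -F, '
{
    if (mail[$3] == 0)    # first time this e-mail address is seen
        print             # keep the line
    mail[$3]++            # remember that it has been seen
}' inputfile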

With the 'uniq' command:

uniq -1 [inputfile]

Hope this helps.

Jean-Pierre,

This seemed to work, but I noticed that there seem to be a few duplicates left behind. How does the array know what the delimiter is? $3 is the field, but I'm not clear on the delimiter. Would the same work with tabs as the delimiter?

Cheers!:confused:

Hi Totus,

From aigles' solution... the delimiter is ",".
So, if you have tabs/spaces, I think you can use it as
awk -F " " '!mail[$4]++' inputfile

(The logic is that you have to specify the column correctly; I hope you noticed that I am using $4.)

-ilan

Thanks ilan, I think I got it. In order to use tabs as the delimiter in awk, it's awk -F"\t".
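
So the tab-delimited version of the same one-liner would presumably be something like this (untested):

awk -F"\t" '!mail[$3]++' inputfile

with $3 changed to whichever column needs to be unique.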

Thanks everyone for your help, it was greatly appreciated.:slight_smile:

awk 'BEGIN { FS = "\"" }          # split on the double quote, so the e-mail address lands in $5
{
    a[$5]++                       # count occurrences of this e-mail address
    if (a[$5] <= 1)               # only the first occurrence gets printed
        print
}' file

Hi,

I have an idea to resolve this issue: take the unique values of the third column with the help of awk, sort and uniq, then grep for each value in the original file in a for loop and use head -1 to keep only the first entry among the duplicate entries.

Here, I took the entries in the file named duplicate.txt

$ cat duplicate.txt
1 ralphs office","555-555-5555","ralph@mail.com","www.ralph.com
2 margies office","555-555-5555","ralph@mail.com","www.ralph.com
3 kims office","555-555-5555","kims@mail.com","www.ralph.com
4 tims office","555-555-5555","tims@mail.com","www.ralph.com

I followed the procedure below at the command prompt:

$ for id in $(awk -F, '{print $3}' duplicate.txt | sort | uniq)
do
grep "$id" duplicate.txt | head -1
done

The output I got was as follows:

3 kims office","555-555-5555","kims@mail.com","www.ralph.com
1 ralphs office","555-555-5555","ralph@mail.com","www.ralph.com
4 tims office","555-555-5555","tims@mail.com","www.ralph.com
$

Hope this works...

Thanks,
Aketi.

I have data like this:
It's sorted by the 2nd field (TID).
envoy,90000000000000634600010001,04/11/2008,23:19:27,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/12/2008,04:23:45,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/12/2008,23:14:25,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,04:23:39,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,22:41:58,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,22:42:44,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,22:49:43,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,22:50:45,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/13/2008,22:53:23,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/14/2008,12:38:40,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000634600010001,04/14/2008,12:52:22,RB00266,0015,DETAIL,ERROR,
envoy,90000000000000693200010001,04/17/2008,09:07:09,RB00060,0009,ENVOY,ERROR,26
envoy,90000000000000693200010001,04/18/2008,10:27:13,RB00083,0009,ENVOY,ERROR,26
envoy,90000000000000693200010001,04/18/2008,11:36:27,RB00084,0009,ENVOY,ERROR,26
envoy,90000000000001034800010001,04/01/2008,23:59:15,RB00294,0030,ENVOY,ERROR,57
envoy,90000000000001034800010001,04/02/2008,23:59:12,RB00295,0030,ENVOY,ERROR,57
envoy,90000000000001034800010001,04/03/2008,23:59:11,RB00296,0030,ENVOY,ERROR,57
envoy,90000000000001034800010001,04/04/2008,23:59:08,RB00297,0030,ENVOY,ERROR,57
envoy,90000000000001034800010001,04/05/2008,23:59:04,RB00297,0030,ENVOY,ERROR,57
envoy,90000000000001034800010001,04/06/2008,22:59:06,RB00297,0030,ENVOY,ERROR,57

I want to do the following:
Check the second field to see if the TID is the same as the previous line. If the TID has been seen before then check the 7th field to see if that is the same as the previous line. If both are the same, I want to remove the line and increment a counter.

My ideal output would look something like this.
11,envoy,90000000000000634600010001,04/11/2008,23:19:27,RB00266,0015,DETAIL,ERROR,
3,envoy,90000000000000693200010001,04/17/2008,09:07:09,RB00060,0009,ENVOY,ERROR,26

etc.

I figure I actually need to do an awk script rather than a one-liner. The other option is to treat the last three fields as one field, compare by the TID field and the error field, and then split them back into three on output.
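
Something along these lines is roughly what I'm picturing, though it's only an untested sketch (it assumes comma-separated input that stays sorted by TID, as above):

awk -F, '
{
    key = $2 FS $7                            # TID plus the 7th field
    if (key == prev)
        count++                               # same group: just count the duplicate
    else {
        if (prev != "") print count "," line  # emit the previous group with its count
        prev = key
        line = $0                             # remember the first line of the new group
        count = 1
    }
}
END { if (prev != "") print count "," line }' file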

Any thoughts? I've looked at other stuff removing dups with awk and it's mostly one liners.

I'd love to get some explanation of WHY it works so that I can mod it if need be.

kinksville,

Please don't hijack someone else's thread; start a new thread for your problem.

Thanks.

Sorry about that, I'm happy to start a new thread. I hadn't wanted to post something that was already being answered.

Hi Guys...

Could you please help me with the following?

aaaa bbbb cccc sdsd
aaaa bbbb cccc qwer

As you can see, the two lines match in three fields.
How can I delete this duplicate? I mean, how can I delete the second line if three fields match?

Thanks
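
For the whitespace-separated sample above, one approach in the spirit of the earlier one-liners is to key the array on the first three fields, something like this (an untested sketch):

awk '!seen[$1,$2,$3]++' file

The first line with a given combination of the first three fields is kept; any later line with the same combination is dropped.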

Duplicate: http://www.unix.com/shell-programming-scripting/65497-help.html#post302196110

Dears,

I need to make field number 7 ($7) unique for the input below:

BSC38_E709 3025-Faiaz-43 43 2-0 SWRF9139V CTU2 X79T7H05ET B-U 2008-11-27
BSC38_E709 3025-_Faiaz-43 43 2-1 SWRF9139V CTU2 X79T7H05ET B-U 2008-11-14
BSC38_E709 3026-Rafgah-5 5 1-0 SWRF9139V CTU2 X79T7H06U3 B-U 2008-11-27
BSC38_E709 3026-Rafgah-5 5 1-1 SWRF9139V CTU2 X79T7H06U3 B-U 2008-11-14
BSC38_E709 3026-Rafgah-5 5 2-0 SWRF9139V CTU2 X79T7H06SM B-U 2008-11-27
BSC38_E709 3026-Rafgah-5 5 2-1 SWRF9139V CTU2 X79T7H06SM B-U 2008-11-14

and the output should be as below:

BSC38_E709 3025-Faiaz-43 43 2-0 SWRF9139V CTU2 X79T7H05ET B-U 2008-11-27
BSC38_E709 3026-Rafgah-5 5 1-0 SWRF9139V CTU2 X79T7H06U3 B-U 2008-11-27
BSC38_E709 3026-Rafgah-5 5 2-0 SWRF9139V CTU2 X79T7H06SM B-U 2008-11-27

Note: uniqueness is on column number 7, and the entire line should be printed, keeping the original order.

Your feedback is highly appreciated, thanks.

Did you search the forum first?

awk '! _[$7]++' file

http://www.unix.com/shell-programming-scripting/62574-finding-duplicates-columns-removing-lines.html#post302189002

Yes, I did, and it didn't work.

I used the one below, but it takes too long:

touch D22
for id in $(awk '/BSC/{print $13}' D3 | uniq)
do
grep "$id" D3 | head -1 >> D22
wait
done

Note: D22 is the output file and D3 is the input file.

Is there any other suggestion? Thanks.
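
One faster alternative along the same lines would be to replace the grep-per-id loop with a single awk pass, keyed on the same field the loop extracts (a sketch, assuming $13 is the column that must be unique in the real file, as in the loop above; in the sample pasted here it would be $7):

awk '/BSC/ && !seen[$13]++' D3 > D22

Each /BSC/ line is printed only the first time its field-13 value is seen, so the file is read just once. That difference may also explain why the earlier one-liner didn't work: it keyed on $7, while the loop pulls $13.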