Numbering duplicates

kylle345 · August 2, 2009, 2:06am

Hi,

I have a large file that sometimes contains duplicates, and I want to find them and work out how many there are.

The file has multiple columns, and the duplicates are in the last column (column 9).

e.g.

yan
tar
tar
man
ban
tan
tub
tub
tub

Basically, I want to label non-duplicates as "0", duplicates as "0" and "1", and triplicates as "0", "1" and "2".

So the output file will look like this:

yan 0
tar 0
tar 1
man 0
ban 0
tan 0
tub 0
tub 1
tub 2

thanks

Kylle

kshji · August 2, 2009, 3:52am

awk '
# Count how many times each value in column 9 occurs.
NF >= 9 { word[$9]++ }
END {
    for (w in word)
        print w, word[w]
}
' inputfile

malcomex999 · August 2, 2009, 4:09am

You didn't say which delimiter you have, but try this:

awk '{print $9,word[$9]++}' yourfile

kshji · August 2, 2009, 4:21am

malcomex999 is the better reader :) Use that version.

kylle345 · August 2, 2009, 12:53pm

Hi, it's tab-delimited.

Thanks, but I'm not sure that does what I want. It counted how many are unique and how many are replicates. Basically, what I want it to do is this:

Before...

yan
tar
tar
man
ban
tan
tub
tub
tub

After...

yan unique
tar unique
tar duplicate
man unique
ban unique
tan unique
tub unique
tub duplicate
tub triplicate

thanks

Franklin52 · August 2, 2009, 1:11pm

Assuming you're using the 9th column:

awk '{print $9, (a[$9]++ ? "duplicate" : "unique")}' file
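Note that the one-liner above labels every repeat as "duplicate", so a third "tub" would not come out as "triplicate". A minimal sketch that maps the running count to a label might look like the following. It assumes tab-delimited input with the key in column 9; the sample data, the variable names (`sample`, `labelled`, `seen`, `label`) and the "copy N" fallback for a fourth or later occurrence are made up for illustration.

```shell
# Build a 9-column, tab-delimited sample; columns 1-8 are filler.
sample=$(printf 'a\tb\tc\td\te\tf\tg\th\t%s\n' yan tar tar man ban tan tub tub tub)

labelled=$(printf '%s\n' "$sample" | awk -F '\t' '
BEGIN { label[0] = "unique"; label[1] = "duplicate"; label[2] = "triplicate" }
{
    n = seen[$9]++    # number of earlier lines with the same value in column 9
    # Fall back to a generic label (an assumption) beyond the third occurrence.
    print $9, ((n in label) ? label[n] : "copy " (n + 1))
}')

printf '%s\n' "$labelled"
```

On the sample from the first post this should print "unique", "duplicate" and "triplicate" in the order requested. To run it on a real file, replace the `printf ... |` part with the file name as the last argument to awk.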

kshji · August 2, 2009, 1:20pm

Did you try malcomex999's version? It gives this result:

yan 0
tar 0
tar 1
man 0
ban 0
tan 0
tub 0
tub 1
tub 2

That is exactly what you asked for in your first post. Your field delimiter is tab, which is one of the default delimiters. If your data also contains spaces, you need to set FS explicitly:

awk -F "\t" '{print $9,word[$9]++}' yourfile
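To illustrate why the explicit separator matters, here is a small sketch using a made-up row whose second column contains a space. With default splitting, awk breaks fields on any run of spaces or tabs, so the space shifts every later field by one. The row contents and variable names are hypothetical.

```shell
# One tab-delimited row with 9 columns; column 2 ("b c") contains a space.
row=$(printf 'a\tb c\td\te\tf\tg\th\ti\ttub\n')

# Default splitting treats the space as a separator, so $9 is the 8th real column.
default_split=$(printf '%s\n' "$row" | awk '{ print $9 }')

# Splitting on tabs only keeps "b c" together, so $9 is the real last column.
tab_split=$(printf '%s\n' "$row" | awk -F '\t' '{ print $9 }')

printf 'default: %s\ntab: %s\n' "$default_split" "$tab_split"
```

Here the default split picks up "i" while the tab split picks up "tub", which is the value that should be counted.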