Count unique column

nans · May 16, 2017, 10:12am

Hello,
I am trying to count unique rows in my file based on 4 columns (2-5) and to output its frequency in a sixth column. My file is tab delimited
My input file looks like this:

Colum1 Colum2 Colum3 Colum4 Coulmn5
1.1 100 100 a b
1.1 100 100 a c
1.2 200 205 a d
1.3 300 301 a y
1.3 300 301 a y
1.4 400 410 a b
1.5 500 510 a c
1.5 500 500 a d
1.5 500 500 a y
1.5 500 500 a y

and the desired output is

Colum1 Colum2 Colum3 Colum4 Column5 Column6
1.1 100 100 a b 1
1.1 100 100 a c 1
1.2 200 205 a d 1
1.3 300 301 a y 2
1.4 400 410 a b 1
1.5 500 510 a c 1
1.5 500 500 a d 1
1.5 500 500 a y 2

So far I have tried this

sort inputfile.csv | uniq -ci | awk '{print $0}' > freq.txt

This gives a frequency of 1 for all the rows and ends up sorting the output file. I want the output to be in its original form. Any suggestions ? Thank you.

RavinderSingh13 · May 16, 2017, 10:25am

Hello nans,

Could you please try following and let me know if this helps you.

awk 'FNR>1 && FNR==NR{A[$2,$3,$4,$5]++;next} (($2,$3,$4,$5) in A){print $0,A[$2,$3,$4,$5];delete A[$2,$3,$4,$5];next}'   Input_file  Input_file

Also if you need to have the headers then you could mention them in the BEGIN section of it too.

Thanks,
R. Singh

nans · May 16, 2017, 11:19am

Thanks RavinderSingh.
I tried the suggested command

awk 'FNR>1 && FNR==NR{A[$2,$3,$4,$5]++;next} (($2,$3,$4,$5) in A){print $0,A[$2,$3,$4,$5];delete A[$2,$3,$4,$5];next}' Input_file > Ouput_file

This ends up in generating a blank output file.

RudiC · May 16, 2017, 11:26am

Yes of course. You didn't copy RavinderSingh13's entire proposal.

RavinderSingh13 · May 16, 2017, 11:56am

nans:

Thanks RavinderSingh.
I tried the suggested command
awk 'FNR>1 && FNR==NR{A[$2,$3,$4,$5]++;next} (($2,$3,$4,$5) in A){print $0,A[$2,$3,$4,$5];delete A[$2,$3,$4,$5];next}' Input_file > Ouput_file
This ends up in generating a blank output file.

Hello nans,

You should mention Input_file 2 times in my above code as Rudi mentioned and it should fly then.

Thanks,
R. Singh

nans · May 17, 2017, 3:56am

Ah yes, thank you. Though the output looks

Colum1 Colum2 Colum3 Colum4 Column5 Column6
1.1 100 100 a b^M 1
1.1 100 100 a c^M 1
1.2 200 205 a d^M 1
1.3 300 301 a y^M 2
1.4 400 410 a b^M 1
1.5 500 510 a c^M  1
1.5 500 500 a d^M  1
1.5 500 500 a y^M  2

But that should be okay, I can always use sed to remove the ^M characters. Thank you.

Don_Cragun · May 17, 2017, 5:40am

nans:

Ah yes, thank you. Though the output looks
Colum1 Colum2 Colum3 Colum4 Column5 Column6
1.1 100 100 a b^M 1
1.1 100 100 a c^M 1
1.2 200 205 a d^M 1
1.3 300 301 a y^M 2
1.4 400 410 a b^M 1
1.5 500 510 a c^M  1
1.5 500 500 a d^M  1
1.5 500 500 a y^M  2                      
But that should be okay, I can always use sed to remove the ^M characters. Thank you.

I don't see how this code prints out the heading line, but you can get rid of the carriage return characters in the awk script without needing to also invoke sed :

awk '{gsub(/\r/,"")}FNR>1 && FNR==NR{A[$2,$3,$4,$5]++;next} (($2,$3,$4,$5) in A){print $0,A[$2,$3,$4,$5];delete A[$2,$3,$4,$5];next}'   Input_file  Input_file

If you want the augmented header line, you might try the following (in a formatI find it a little bit easier to read):

awk '
{	gsub(/\r/, "")
}
NR==1 {	print $0, "Column6"
	next
}
FNR>1 && FNR==NR {
	A[$2, $3, $4, $5]++
	next
}
(($2, $3, $4, $5) in A) {
	print $0, A[$2, $3, $4, $5]
	delete A[$2, $3, $4, $5]
}'   OFS='\t' Input_file  Input_file

Note that the sample input and output you provided used <space> as a field delimiter but you said your files were <tab> delimited. I specified <tab> as the output field separator here assuming that your real data is <tab> delimited.