Count Repetitive Number in a column and renumbering using awk

Unable to get the desired output. I need only the rows which have repeated values in column 5.

Input File <tab separated file>

chr1    3773797 3773797 CEP10   1
chr1    3773797 3773797 CEP104  2
chr1    3689350 3689350 SMIM1   2
chr1    3773797 3773797 CEP4    3
chr1    3773797 3773797 EP104   4
chr1    43283059    43283059    CCDC23  5
chr1    43282775    43282775    ERMAP   5

An extra column 6 should be added. Rows with repeated values in column 5 should be extracted, and the new column 6 should renumber those groups from 1 to n.

Output File <tab separated file>

chr1    3773797 3773797 CEP104  2   1
chr1    3689350 3689350 SMIM1   2   1
chr1    43283059    43283059    CCDC23  5   2
chr1    43282775    43282775    ERMAP   5   2

where exactly are you stuck?


I am not able to increase the column 6 value. Every time it prints 2 in column 6.

awk 'NR == FNR {CNT[$NF]++; next} CNT[$NF] > 1 {print $0, CNT[$NF]}' file1 file1

Output I am getting.

chr1    3773797 3773797 CEP104  2 2
chr1    3689350 3689350 SMIM1   2 2
chr1    43283059    43283059    CCDC23  5 2
chr1    43282775    43282775    ERMAP   5 2
 awk 'NR == FNR {CNT[$NF]++;if (!($NF in order) && CNT[$NF]>1) order[$NF]=++cnt;next} $NF in order {print $0, order[$NF]}' file file
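For anyone following along, here is a quick way to check that one-liner against the sample from post #1 (saved as file, tab separated — the filename is an assumption). Note that, as posted, it appends the new column with awk's default output separator (a single space); adding an OFS='\t' operand before the file names keeps the appended column tab separated. A sketch:

```shell
# Build the tab-separated sample input from post #1 (filename "file" assumed).
printf '%b\n' \
    'chr1\t3773797\t3773797\tCEP10\t1' \
    'chr1\t3773797\t3773797\tCEP104\t2' \
    'chr1\t3689350\t3689350\tSMIM1\t2' \
    'chr1\t3773797\t3773797\tCEP4\t3' \
    'chr1\t3773797\t3773797\tEP104\t4' \
    'chr1\t43283059\t43283059\tCCDC23\t5' \
    'chr1\t43282775\t43282775\tERMAP\t5' > file

# Pass 1: count last-field values; assign the next group number the moment a
# value is seen for the second time.  Pass 2: print rows whose value repeats,
# appending that group number.
awk 'NR == FNR {CNT[$NF]++; if (!($NF in order) && CNT[$NF]>1) order[$NF]=++cnt; next}
     $NF in order {print $0, order[$NF]}' OFS='\t' file file
```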

@Vgersh99: Thank you very much. The code is working perfectly fine.

You haven't described the format of your input file. Just showing us a short sample input file lets us make lots of possibly erroneous assumptions that might or might not be valid in your actual input files.

The code you showed us in post #3 in this thread prints each row with a last field (not necessarily 5th field) that appears more than once with a new final field (not necessarily field 6) that specifies the number of times the previous final field appeared in your input file. Your code assumes that each input line contains 5 input fields, but that is not stated as a known requirement of any input file you will be processing.

The code vgersh99 suggested looks like it should do what you requested as long as each input line contains exactly 5 input fields and all lines within a given input file with the same value in the last field are adjacent in the input file. Neither of these assumptions are stated as requirements for your input files, but both assumptions hold true in your sample input file.

Note that both the code you showed us and the code vgersh99 suggested read the input file twice. For large files, this can be inefficient. If you want to process data being read from a pipe, it just won't work.

If we can assume that all records within a given input file with the same value in the 5th field are adjacent in the input file (including that all lines with fewer than 5 fields and all lines with an empty 5th field are adjacent), the following seems to do what you requested no matter how many fields appear on each input line, and it only needs to read an input file once. Again, this assumption is not stated as a requirement for your input files, but it holds true in your sample input file.

awk '
BEGIN {	FS = OFS = "\t"
}
{	if(++c[$5] == 1) {
		$5 = $5 OFS groups+1	# tentatively append the next group number
		$0 = $0			# re-split; $5 is the original value again
		last = $0		# save the line in case its value repeats
	}
	if(c[$5] == 2) {
		print last
		$5 = $5 OFS (++groups)
		$0 = $0
		print
	}
	if(c[$5] > 2) {
		$5 = $5 OFS groups
		$0 = $0
		print
	}
}' file

Note that the parentheses in the statement:

		$5 = $5 OFS (++groups)

should not be needed. But, at least with the BSD-based awk version 20070501 distributed with macOS High Sierra version 10.13.6, this script gets a memory fault in awk when processing the first line that has a field #5 that matches a field #5 from a previous line if those parentheses are omitted.
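If you would rather not rely on that workaround at all, one alternative (a sketch only, not verified against that BSD awk build; filename "file" again assumed) is to move the increment onto its own statement, so the concatenation contains no embedded side effect:

```shell
# Build the tab-separated sample from post #1 (filename "file" assumed).
printf '%b\n' \
    'chr1\t3773797\t3773797\tCEP10\t1' \
    'chr1\t3773797\t3773797\tCEP104\t2' \
    'chr1\t3689350\t3689350\tSMIM1\t2' \
    'chr1\t3773797\t3773797\tCEP4\t3' \
    'chr1\t3773797\t3773797\tEP104\t4' \
    'chr1\t43283059\t43283059\tCCDC23\t5' \
    'chr1\t43282775\t43282775\tERMAP\t5' > file

# Same single-pass approach, with ++groups moved out of the concatenation.
awk '
BEGIN {	FS = OFS = "\t" }
{
	if (++c[$5] == 1) {
		$5 = $5 OFS groups+1	# tentative group number
		$0 = $0			# re-split; $5 is the original value again
		last = $0
	}
	if (c[$5] == 2) {
		groups++		# increment in its own statement
		print last
		$5 = $5 OFS groups
		$0 = $0
		print
	}
	if (c[$5] > 2) {
		$5 = $5 OFS groups
		$0 = $0
		print
	}
}' file
```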

You haven't said what operating system you're using. If you want to try this code on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .

With your sample input, the above code produces the output you requested in post #1. With the following in file:

f1
f1	f2
f1	f2	f3
f1	f2	f3	f4
f1	f2	f3	f4	f5
chr1	3773797	3773797	CEP10	1
chr1	3773797	3773797	CEP104	2
chr1	3689350	3689350	SMIM1	2
chr1	3773797	3773797	CEP4	3
chr1	3773797	3773797	EP104	4
chr1	43283059	43283059	CCDC23	5
chr1	43282775	43282775	ERMAP	5
add1	43282775	43282775	ERMAP	6	field 6	field 7.1
add2	43282775	43282775	ERMAP	6	f62	field 7.2
add3	43282775	43282775	ERMAP	6	f63	field 7.3
add4	43282775	43282775	ERMAP	6	f64	field 7.4
add5	43282775	43282775	ERMAP	6	f65	field 7.5
add6	43282775	43282775	ERMAP	6	f66	field 7.6

the output produced by the above script is:

f1					1
f1	f2				1
f1	f2	f3			1
f1	f2	f3	f4		1
chr1	3773797	3773797	CEP104	2	2
chr1	3689350	3689350	SMIM1	2	2
chr1	43283059	43283059	CCDC23	5	3
chr1	43282775	43282775	ERMAP	5	3
add1	43282775	43282775	ERMAP	6	4	field 6	field 7.1
add2	43282775	43282775	ERMAP	6	4	f62	field 7.2
add3	43282775	43282775	ERMAP	6	4	f63	field 7.3
add4	43282775	43282775	ERMAP	6	4	f64	field 7.4
add5	43282775	43282775	ERMAP	6	4	f65	field 7.5
add6	43282775	43282775	ERMAP	6	4	f66	field 7.6

Note that the output field counting the number of groups of common field #5 values is always stored in field #6 no matter how many fields were in the input line. If there were 6 or more fields in an input line, all fields after field #5 are shifted to the right in the output and the group counter is inserted as field #6. If there were fewer than 5 fields in an input line, empty fields are inserted before output field #6.

awk -F"\t" '
NR==FNR {if (a[$5]++) b[$5]=$5; next}
length(b[$5]) {if (!c[$NF]++) d++; print $0, d}
' file OFS="\t" file
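For the record, a sketch of checking this two-pass variant against the post #1 sample (filename "file" assumed). Note the OFS="\t" operand sits between the two file operands, so it takes effect only during the second pass, which is when the printing happens:

```shell
# Build the tab-separated sample from post #1, then run the two-pass script:
# pass 1 records which column-5 values repeat (b); pass 2 numbers the groups
# (d) in order of first appearance and appends the group number.
printf '%b\n' \
    'chr1\t3773797\t3773797\tCEP10\t1' \
    'chr1\t3773797\t3773797\tCEP104\t2' \
    'chr1\t3689350\t3689350\tSMIM1\t2' \
    'chr1\t3773797\t3773797\tCEP4\t3' \
    'chr1\t3773797\t3773797\tEP104\t4' \
    'chr1\t43283059\t43283059\tCCDC23\t5' \
    'chr1\t43282775\t43282775\tERMAP\t5' > file

awk -F"\t" '
NR==FNR {if (a[$5]++) b[$5]=$5; next}
length(b[$5]) {if (!c[$NF]++) d++; print $0, d}
' file OFS="\t" file
```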

Dear Mr. Don, thank you very much for helping me in this regard.
Regards
Himanshu