You haven't described the format of your input file. Just showing us a 10 line sample input file lets us makes lots of possibly erroneous assumptions that might or might not be valid in your actual input files.
The code you showed us in post #3 in this thread prints each row with a last field (not necessarily 5th field) that appears more than once with a new final field (not necessarily field 6) that specifies the number of times the previous final field appeared in your input file. Your code assumes that each input line contains 5 input fields, but that is not stated as a known requirement of any input file you will be processing.
The code vgersh99 suggested looks like it should do what you requested as long as each input line contains exactly 5 input fields and all lines within a given input file with the same value in the last field are adjacent in the input file. Neither of these assumptions are stated as requirements for your input files, but both assumptions hold true in your sample input file.
Note that both the code you showed us and the code vgersh99 suggested read the input file twice. For large files, this can be inefficient. If you want to process data being read from a pipe, it just won't work.
If we can assume that all records within a given input file with the same value in the 5th field are adjacent in the input file (including that all lines with less than 5 fields and all lines with an empty 5th field are adjacent), the following seems to do what you requested no matter how many fields appear on each input line and only needing to read an input file once. Again, this assumption is not stated as a requirement for your input files but holds true in your sample input file.
awk '
BEGIN { FS = OFS = "\t"
}
{ if(++c[$5] == 1) {
$5 = $5 OFS groups+1
$0 = $0
last = $0
}
if(c[$5] == 2) {
print last
$5 = $5 OFS (++groups)
$0 = $0
print
}
if(c[$5] > 2) {
$5 = $5 OFS groups
$0 = $0
print
}
}' file
Note that the parentheses in the statement:
$5 = $5 OFS (++groups)
should not be needed. But, at least with the BSD-based awk
version 20070501 distributed with macOS High Sierra version 10.13.6, this script gets a memory fault in awk
when processing the first line that has a field #5 that matches a field #5 from a previous line if those parentheses are omitted.
You haven't said what operating system you're using. If you want to try this code on a Solaris/SunOS system, change awk
to /usr/xpg4/bin/awk
or nawk
.
With your sample input, the above code produces the output you requested in post #1. With the following in file
:
f1
f1 f2
f1 f2 f3
f1 f2 f3 f4
f1 f2 f3 f4 f5
chr1 3773797 3773797 CEP10 1
chr1 3773797 3773797 CEP104 2
chr1 3689350 3689350 SMIM1 2
chr1 3773797 3773797 CEP4 3
chr1 3773797 3773797 EP104 4
chr1 43283059 43283059 CCDC23 5
chr1 43282775 43282775 ERMAP 5
add1 43282775 43282775 ERMAP 6 field 6 field 7.1
add2 43282775 43282775 ERMAP 6 f62 field 7.2
add3 43282775 43282775 ERMAP 6 f63 field 7.3
add4 43282775 43282775 ERMAP 6 f64 field 7.4
add5 43282775 43282775 ERMAP 6 f65 field 7.5
add6 43282775 43282775 ERMAP 6 f66 field 7.6
the output produced by the above script is:
f1 1
f1 f2 1
f1 f2 f3 1
f1 f2 f3 f4 1
chr1 3773797 3773797 CEP104 2 2
chr1 3689350 3689350 SMIM1 2 2
chr1 43283059 43283059 CCDC23 5 3
chr1 43282775 43282775 ERMAP 5 3
add1 43282775 43282775 ERMAP 6 4 field 6 field 7.1
add2 43282775 43282775 ERMAP 6 4 f62 field 7.2
add3 43282775 43282775 ERMAP 6 4 f63 field 7.3
add4 43282775 43282775 ERMAP 6 4 f64 field 7.4
add5 43282775 43282775 ERMAP 6 4 f65 field 7.5
add6 43282775 43282775 ERMAP 6 4 f66 field 7.6
Note that the output field counting the number of groups of common field #5 values is always stored in field #6 no matter how many fields were in the input file. If there were 6 or more fields in an input line, all fields after field #5 are shifted to the right in the output and the group counter is inserted in field #6. If were less than 5 fields in an input line, empty fields are inserted before output field #6.