awk to combine matching lines in file

cmccabe · September 8, 2016, 5:58pm

I am trying to combine all matching lines in a tab-delimited file using awk. The command below runs but produces no output. Thank you :).

input

chrX    110925349    110925532    ALG13
chrX    110925349    110925532    ALG13
chrX    110925349    110925532    ALG13
chrX    47433390    47433999    SYN1
chrX    47433390    47433999    SYN1
chr18    53298518    53298629    TCF4
chr18    53298518    53298629    TCF4
chr18    53298640    53298695    TCF4
chr18    53298640    53298695    TCF4

desired output

chrX    110925349    110925532    ALG13
chrX    47433390    47433999    SYN1
chr18    53298518    53298629    TCF4
chr18    53298640    53298695    TCF4

awk '!(NR){print$0p}{p=$0}' input

Yoda · September 8, 2016, 6:23pm

awk '!A[$0]++' file

Don_Cragun · September 8, 2016, 6:55pm

Hi cmccabe,
The code you were using:

awk '!(NR){print$0p}{p=$0}' input

only tries to print anything when the condition !(NR) evaluates to a non-zero value. But since the awk NR variable is set to 1 when awk reads the first record from your input files and is incremented by 1 every time another input record is read, !NR ALWAYS evaluates to zero. Therefore, the above script is logically equivalent to:

awk '{p=$0}' input

which, as you said, produces no output.
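
If you want to see this for yourself, here is a small sketch that prints NR and !NR for every record. The three-line input and the variable name nr_demo are made up for illustration, and it assumes a POSIX shell with printf available:

```shell
# Feed three records to awk and print NR next to !NR for each one.
# NR is 1, 2, 3, so !NR is 0 every time and a !(NR) pattern never matches.
nr_demo=$(printf 'a\nb\nc\n' | awk '{print NR, (!NR)}')
printf '%s\n' "$nr_demo"
```

Every line of the output ends in 0, which is why the print action in your script is never run.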

If you are just trying to remove duplicated adjacent lines in a file (and the first line in your file is never an empty line), you could try:

awk '$0 != p {print;p = $0}' input

If you could have an empty line as the first line in your file (and you want to keep that empty line in the output), you would need to make it a little more complicated:

awk '$0 != p || NR == 1 {print;p = $0}' input
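
To illustrate the difference between the two commands above, here is a sketch run against a made-up four-line input whose first line is empty. The variable names without_nr and with_nr are only for this example, and a POSIX shell with printf is assumed:

```shell
# Input: an empty first line, then x, x, y.
# Without the NR test, the empty first line compares equal to the unset p
# and is dropped.
without_nr=$(printf '\nx\nx\ny\n' | awk '$0 != p {print; p = $0}')

# With NR == 1 in the pattern, the first record is always printed.
with_nr=$(printf '\nx\nx\ny\n' | awk '$0 != p || NR == 1 {print; p = $0}')

printf 'without NR test: [%s]\n' "$without_nr"
printf 'with NR test:    [%s]\n' "$with_nr"
```

The second result starts with the empty line and the first does not. For input like your sample, where the first line is never empty, both commands give the same output.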

The code Yoda suggested removes duplicated lines even if they are not adjacent. If you just need to worry about adjacent lines, Yoda's code does that as well, but it takes more time and memory to get the job done because it keeps every distinct line in the array A. For a small file like your sample, it doesn't matter. For a file with a huge number of lines with different contents, the code above should run considerably faster.
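
The two approaches only give different results when a duplicate is not adjacent to the line it repeats. In Yoda's script, A[$0]++ returns the number of times the line has been seen so far, so !A[$0]++ is true only the first time a line appears. A rough sketch with a made-up input (a, a, b, a) and example variable names, assuming a POSIX shell with printf:

```shell
# Array version: remembers every line seen, so the last "a" is dropped
# even though it is not next to the earlier ones.
seen_out=$(printf 'a\na\nb\na\n' | awk '!A[$0]++')

# Previous-line version: only compares with the line just before,
# so the last "a" is kept.
adjacent_out=$(printf 'a\na\nb\na\n' | awk '$0 != p {print; p = $0}')

printf 'array version:\n%s\n' "$seen_out"
printf 'previous-line version:\n%s\n' "$adjacent_out"
```

The array version prints a and b; the previous-line version prints a, b, a. With your sample input the duplicates are always adjacent, so both give the desired output.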

Hope this helps.

cmccabe · September 10, 2016, 8:47am

Thank you both very much.