awk and seen to report duplicates

Xterra · November 15, 2018, 12:13pm

I have this file:

@Muestra-1
agctgcgagctgcgacccgggttatataggaagagacacacacaccccc
+
!@$#%^&*()@^#&HH!&*(@&#*(FT^%$&*()*&^%@
@Muestra-2
agctgcgagctgcgacccgggttatataggaagagacacacacaccccc
+
!@$#%^&*()@^#&HH!&*(@&#*(FT^%$&*()*&^%@
@Muestra-3
agctgcgagctgcgacccgggttatataggaagagacacacacaccccc
+
!@$#%^&*()@^#&HH!&*(@&#*(FT^%$&*()*&^%@
@Muestra-4
agctgcgagctgcgacccgggttatataggaagagacacacacaccccc
+
!@$#%^&*()@^#&HH!&*(@&#*(FT^%$&*()*&^%@
@Muestra-5
agctgcgagctgcgacccgggttatataggaagagacacacacaccccc
+
!@$#%^&*()@^#&HH!&*(@&#*(FT^%$&*()*&^%@

And I would like to use this code to output the entries with identical nucleotide sequences (the second line of every group of four). I thought this would do the job, but obviously it didn't:

awk 'BEGIN{ RS="^@"; FS="\n" } !x[$2]++ '

I am not quite sure what I am doing wrong here.

vgersh99 · November 15, 2018, 12:36pm

And what's the desired output based on your sample input?
The explanation is kind of hard to follow; maybe the output would help.

RudiC · November 15, 2018, 12:55pm

Always four lines per record? Try

awk '{for (i=1; i<=3; i++) {getline X; $0 = $0 "\n" X}} !x[$2]++' file
@Muestra-1
agctgcgagctgcgacccgggttatataggaagagacacacacaccccc
+
!@$#%^&*()@^#&HH!&*(@&#*(FT^%$&*()*&^%@

Or adapt your code like

awk 'BEGIN{ RS="\n@" } !x[$2]++ ' file
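A note on the RS="\n@" variant: the "@" is consumed as part of the separator, so any record after the first would print without its leading "@", and POSIX leaves the behaviour of a multi-character RS unspecified. As a hedged alternative, here is a minimal sketch that avoids RS altogether and counts lines instead. It assumes strict four-line records; the function name dedup_fastq and the two-record sample are made up for illustration.

```shell
# dedup_fastq: hypothetical helper; assumes every record is exactly 4 lines
dedup_fastq() {
    awk '
        NR % 4 == 1 { hdr = $0 }       # @header line
        NR % 4 == 2 { nt  = $0 }       # nucleotide sequence
        NR % 4 == 3 { sep = $0 }       # "+" separator line
        NR % 4 == 0 {                  # quality line completes the record
            if (!seen[nt]++)           # true only for the first occurrence
                print hdr "\n" nt "\n" sep "\n" $0
        }
    ' "$@"
}

# Made-up sample: two records sharing one sequence; only the first is printed
printf '@r1\nacgt\n+\n!!!!\n@r2\nacgt\n+\n####\n' | dedup_fastq
```

Changing `!seen[nt]++` to `seen[nt]++` would instead print the second and later copies, if reporting the duplicates themselves is the goal.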

Xterra · November 15, 2018, 1:00pm

Rudy
Thanks a TON! Quick question: why would BEGIN{ RS="^@"; FS="\n" } not work?

RudiC · November 15, 2018, 1:11pm

Can't tell. Although man awk says

, it might be that it doesn't like the caret for begin-of-line. When I modified your code snippet as in post #3, it seemed to work.
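A likely explanation, going by how awk regular expressions are documented rather than anything confirmed in this thread: in awk, "^" anchors to the start of the string, not to the start of each line. Used in RS, "^@" could therefore match at most at the very beginning of the input, so the file is never split at the later "@" lines. The sketch below only demonstrates the anchor behaviour with match() on a made-up string; the name anchor_demo is hypothetical.

```shell
# anchor_demo: hypothetical helper showing how "^" behaves in awk regexps
anchor_demo() {
    awk 'BEGIN {
        s = "r1\nacgt\n@r2\nacgt"   # made-up text; the "@" sits right after a newline
        print match(s, /^@/)         # 0: "^" means start of string, not start of line
        print match(s, /\n@/)        # 8: a newline followed by "@" is found
    }'
}

anchor_demo
```

That would be consistent with RS="\n@" working where RS="^@" did not.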