I have a file that looks like this:
DIP-27772N DIP-18408N refseq:NP_523941
DIP-23436N|refseq:NP_536784 DIP-23130N|refseq:NP_652017
DIP-22958N|refseq:NP_651195 DIP-20072N|refseq:NP_724597
DIP-22928N|refseq:NP_569972 DIP-22042N|refseq:NP_536744|uniprotkb:P54622
DIP-20065N|refseq:NP_731331 DIP-17103N
I want to remove the lines that do not contain "refseq:NP" in both columns (the first and last lines in the given example).
Required output:
DIP-23436N|refseq:NP_536784 DIP-23130N|refseq:NP_652017
DIP-22958N|refseq:NP_651195 DIP-20072N|refseq:NP_724597
DIP-22928N|refseq:NP_569972 DIP-22042N|refseq:NP_536744|uniprotkb:P54622
How can I do it using grep? Any help would be highly appreciated.
Hello Syeda,
Could you please try the following and let me know if this helps?
awk '{count=gsub(/refseq:NP/,"refseq:NP",$0);if(count==NF){print}}' Input_file
Output will be as follows.
DIP-23436N|refseq:NP_536784 DIP-23130N|refseq:NP_652017
DIP-22958N|refseq:NP_651195 DIP-20072N|refseq:NP_724597
DIP-22928N|refseq:NP_569972 DIP-22042N|refseq:NP_536744|uniprotkb:P54622
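The one-liner above compares the total number of matches on the line with the number of fields. As a hedged alternative, here is a minimal sketch that tests each field separately; the temporary file and the variable name `per_field` are only illustrative, and the data is copied from the question.

```shell
# Sample data copied from the question; the temporary file stands in for Input_file.
input=$(mktemp)
cat > "$input" <<'EOF'
DIP-27772N DIP-18408N refseq:NP_523941
DIP-23436N|refseq:NP_536784 DIP-23130N|refseq:NP_652017
DIP-22958N|refseq:NP_651195 DIP-20072N|refseq:NP_724597
DIP-22928N|refseq:NP_569972 DIP-22042N|refseq:NP_536744|uniprotkb:P54622
DIP-20065N|refseq:NP_731331 DIP-17103N
EOF

# Skip the line as soon as one field lacks the tag; print it otherwise.
per_field=$(awk '{ for (i = 1; i <= NF; i++) if ($i !~ /refseq:NP/) next; print }' "$input")
printf '%s\n' "$per_field"
rm -f "$input"
```

The two approaches differ only when one column holds the tag twice and another column not at all: the counting version would still print such a line, while the per-field loop would not.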
Thanks,
R. Singh
RudiC
Try also
awk '2==gsub(/refseq:NP/,"&")' file
DIP-23436N|refseq:NP_536784 DIP-23130N|refseq:NP_652017
DIP-22958N|refseq:NP_651195 DIP-20072N|refseq:NP_724597
DIP-22928N|refseq:NP_569972 DIP-22042N|refseq:NP_536744|uniprotkb:P54622
---------- Post updated at 12:45 ---------- Previous update was at 12:43 ----------
If there are more than two columns, use NF== in place of 2==, as RavinderSingh13 does.
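A small sketch of the NF== variant on three columns. The two input lines are hypothetical, invented only to illustrate the idea, and `three_col` is an arbitrary variable name.

```shell
# Hypothetical three-column lines, invented for illustration only.
three_col=$(printf '%s\n' \
  'A|refseq:NP_1 B|refseq:NP_2 C|refseq:NP_3' \
  'A|refseq:NP_1 B C|refseq:NP_3' |
  awk 'NF==gsub(/refseq:NP/,"&")')

# Only the first line has as many matches as it has fields.
printf '%s\n' "$three_col"
```

Replacing each match with "&" leaves the line unchanged, so gsub is used here purely for its return value, the number of matches.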
With grep:
grep -v 'refseq:NP.*refseq:NP' file
Not asked here, but I want to mention that sed can delete the nth occurrence, here the 2nd:
sed 's/|refseq:NP[_0-9]*//2' file
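To see what the numeric flag does, here is a short sketch that runs the same sed command on two lines taken from the question: one with two occurrences and one with a single occurrence. The variable name `trimmed` is arbitrary.

```shell
# Two lines from the question: the first has two tags, the second only one.
trimmed=$(printf '%s\n' \
  'DIP-23436N|refseq:NP_536784 DIP-23130N|refseq:NP_652017' \
  'DIP-20065N|refseq:NP_731331 DIP-17103N' |
  sed 's/|refseq:NP[_0-9]*//2')

# The 2 flag removes only the second match on a line;
# a line with fewer than two matches passes through unchanged.
printf '%s\n' "$trimmed"
```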
---------- Post updated at 09:22 AM ---------- Previous update was at 08:55 AM ----------
Thanks to RavinderSingh, I see that you want the opposite (keep the lines where both columns match); then it's
grep 'refseq:NP.*refseq:NP' file
BTW, you can use a back-reference to avoid repeating the pattern:
grep '\(refseq:NP\).*\1' file
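Both grep patterns accept two occurrences anywhere on the line, even within the same column. If that matters, a stricter pattern might look like the sketch below. It assumes exactly two columns separated by spaces; the third sample line and its IDs are hypothetical, invented to show the difference.

```shell
# First two lines come from the question; the third is a hypothetical line
# with both tags in the first column, invented to show the difference.
sample=$(mktemp)
cat > "$sample" <<'EOF'
DIP-23436N|refseq:NP_536784 DIP-23130N|refseq:NP_652017
DIP-20065N|refseq:NP_731331 DIP-17103N
DIP-00001N|refseq:NP_000001|refseq:NP_000002 DIP-00002N
EOF

# Two occurrences anywhere on the line: the first and third lines match.
loose=$(grep 'refseq:NP.*refseq:NP' "$sample")

# One occurrence in each space-separated column: only the first line matches.
strict=$(grep -E '^[^ ]*refseq:NP[^ ]* +[^ ]*refseq:NP[^ ]*$' "$sample")

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
rm -f "$sample"
```

For the data shown in the question the two patterns select the same lines, so the simpler one is sufficient there.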