Search for duplicates of a specific pattern and delete all but the first one

redse171 · July 27, 2013, 8:22am

Hi all,

I have been trying to delete duplicates based on a certain pattern but failed to make it work. More than one pattern is duplicated in the file, but I want to remove the duplicates of one pattern only and keep the rest. I cannot use awk '!x[$0]++' inputfile.txt, sed '/pattern/d', or the uniq and sort commands, as they would also remove lines that I want to keep. A sample follows:

inputfile.txt

;;  
;;
ID    701
NAME    701
FUNC    Null
FUNC    Null
FUNC    Null
CC    27749
PRO    A
NO    NO:3676
NO    NO:3677
NO    NO:3723
NO    NO:3964
COMMENT    Nothing is impossible
@@
ID    702
NAME    702
FUNC    Null
FUNC    Null
FUNC    Null
FUNC    Null
PRO    A
NO    NO:3676
NO    NO:3677
COMMENT    Need to change
@@
ID    706
NAME    706
FUNC    Null
PRO    A
NO    NO:6301
NO    NO:6310
NO    NO:6450
NO    NO:6647
NO    NO:6812
@@

I want to remove the duplicates for pattern "FUNC" only, where the output should look like this:

output.txt

;;  
;;
ID    701
NAME    701
FUNC    Null
CC    27749
PRO    A
NO    NO:3676
NO    NO:3677
NO    NO:3723
NO    NO:3964
COMMENT    Nothing is impossible
@@
ID    702
NAME    702
FUNC    Null
PRO    A
NO    NO:3676
NO    NO:3677
COMMENT    Need to change
@@
ID    706
NAME    706
FUNC    Null
PRO    A
NO    NO:6301
NO    NO:6310
NO    NO:6450
NO    NO:6647
NO    NO:6812
@@

I have thousands of records like this and I need to delete the duplicates of a different pattern each time. I tried specifying the column number too, but that affects other duplicated values which I don't want changed. I'd appreciate your help on this. Thanks

MadeInGermany · July 27, 2013, 9:03am

awk '$1!="FUNC" || $2!="Null" || $0!=prev {print} {prev=$0}' inputfile.txt
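Since the question mentions needing this for a different pattern each time, the field name could also be passed in as an awk variable. This is only a sketch: the function name dedup_key and the variable key are made up for illustration, and unlike the one-liner above it checks only the first field, not the "Null" value.

```shell
# Hypothetical helper: drop consecutive duplicate lines, but only for lines
# whose first field equals the given key (e.g. FUNC or NO).
dedup_key() {
  # "$1" here is the shell argument; inside the quotes $1 is awk's first field.
  awk -v key="$1" '$1 != key || $0 != prev { print } { prev = $0 }'
}

# Made-up sample lines: only the repeated NO line is removed.
printf '%s\n' 'FUNC    Null' 'FUNC    Null' 'NO    NO:1' 'NO    NO:1' | dedup_key NO
```

With the real file this would be called as dedup_key FUNC < inputfile.txt.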

redse171 · July 27, 2013, 9:36am

Hi MadeInGermany,

Thanks so much!! It works perfectly... :). By the way, could you please explain the code to me? Especially:

$0!=prev {print} {prev=$0}'

ripat · July 27, 2013, 10:44am

Another way:

awk 'l==$0&&/FUNC/{next}{l=$0}1' file

redse171 · July 27, 2013, 11:01am

Hi ripat,

Yeah, I tried yours and it worked great too! But if you don't mind, could you please help explain the code? Thanks

ripat · July 27, 2013, 11:14am

The idea is to store every line in a buffer variable: {l=$0}

Every line that is equal to the previous line stored in the buffer (l==$0) and that contains FUNC (&&/FUNC/) is skipped ({next}), and awk starts over with the next line.

If the line is not skipped, it is caught by the 1 at the end, which is shorthand for print. Same as: l==$0&&/FUNC/{next}{l=$0;print}
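Spread over several lines with comments, the same logic might look like the sketch below. The function name dedup_func and the sample lines are made up for illustration.

```shell
# Hypothetical wrapper around the one-liner, reading from standard input.
dedup_func() {
  awk '
    # Identical to the previous line AND contains FUNC: skip it.
    l == $0 && /FUNC/ { next }
    # Otherwise remember the line for the next comparison and print it.
    { l = $0; print }
  '
}

# Made-up sample: the repeated FUNC line is dropped, the repeated NO line stays.
printf '%s\n' 'FUNC    Null' 'FUNC    Null' 'NO    NO:1' 'NO    NO:1' | dedup_func
```

Note that /FUNC/ matches FUNC anywhere in the line, not just in the first field.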

redse171 · July 27, 2013, 11:41am

got that.. thanks!

MadeInGermany · July 27, 2013, 1:09pm

The expression before a {code in braces} is an implicit if.
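As a sketch, the one-liner from earlier in the thread can be written with that if spelled out. The function name dedup_func_null and the sample lines are made up for illustration.

```shell
# Hypothetical wrapper, reading from standard input.
dedup_func_null() {
  awk '{
    # Print unless this is a "FUNC Null" line identical to the previous line.
    if ($1 != "FUNC" || $2 != "Null" || $0 != prev) print
    # Always remember the current line, printed or not.
    prev = $0
  }'
}

# Made-up sample: only the repeated "FUNC Null" line is removed.
printf '%s\n' 'FUNC    Null' 'FUNC    Null' 'NO    NO:1' 'NO    NO:1' | dedup_func_null
```

A block with no expression in front of it, like {prev=$0}, runs for every line.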