Removing a block of duplicate lines from a file

Hi all,

I have a file with the following data:

1  abc
2  123
3  ;
4  rao
5  bell
6  ;
7  call
8  abc
9  123
10 ;
11 rao
12 bell
13 ;
 

I want to remove lines 8 to 13, which are a repetition of lines 1 to 6.

When I use the command below, one of the semicolons ( ; ) in the first 6 lines also gets deleted, which it shouldn't.

sort myfile | uniq

But the output I need is:

 
abc
2 123
3 ;
4 rao
5 bell
6 ;
7 call
 

Please help me with this.

Thanks a lot!!

Regards,
Sreenivas

I don't understand the requirements. Do you want to remove all lines from a file that duplicate the contents of field 2 of the first six lines of your input file? Why do we need to remove field 1 of the first line in the file, but keep field 1 in the other six lines that are kept? Since field 2 in the third line in the file is a duplicate of field 2 in the sixth, tenth, and thirteenth lines of the input file, why shouldn't they all be removed?

Hi Don,

Sorry for the confusion.

Field 1 refers to the line numbers; I only kept those so I could refer to the lines. Please treat the actual file as containing only field 2.

I want to remove the block of lines 8 to 13, which is a duplicate of lines 1 to 6.

Within lines 1 to 6, ";" appears twice. Both of those should be kept as they are.

Finally my output should look like:

1 abc
2 123
3 ;
4 rao
5 bell
6 ;
7 call

Is there any special condition for ";"..? Please explain..

awk '!X[$2]++ || $2 ~ ";"' file
1  abc
2  123
3  ;
4  rao
5  bell
6  ;
7  call
10 ;
13 ;
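
In case the !X[$2]++ idiom is unfamiliar: X[$2]++ is 0 the first time a given field-2 value is seen and non-zero on every later occurrence, so on its own it keeps only the first line for each distinct value. That is exactly why the || $2 ~ ";" part is needed; without it the second ";" in the first block (line 6) would also be dropped. A quick illustration of my own, not part of the suggestion above:

awk '!X[$2]++' file
1  abc
2  123
3  ;
4  rao
5  bell
7  call

The price of the ";" exception is that the later semicolons (lines 10 and 13) are kept as well, hence the variant below.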

And if you want to keep only the first two occurrences of ";" then use:

awk '!X[$2]++ || ( $2 ~ ";" && ++a < 2 )' file

1  abc
2  123
3  ;
4  rao
5  bell
6  ;
7  call

If you just want to remove lines 8 to 13, then you can try this:

sed '8,13d' filename
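
Just as an aside (my addition, not part of the reply above): the same fixed-range delete can also be written in awk if that is handier:

awk 'NR < 8 || NR > 13' filename

Either way this only helps when the duplicate block always sits at exactly lines 8 to 13.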

The requirements still aren't clear.

Are you always only concerned about matching the first six lines, or are you trying to find the first set of n lines that are duplicated later in the file?

If the chosen lines at the start of the file are duplicated multiple times, do you only want to remove the first set of duplicated lines or do you want to remove every set of duplicated lines?

try:

awk '{sub("^[0-9]* *",""); b=b $0 " "}   # strip any leading line number, append the line to block b
/;/ {for (i in a) sub(i,"",b)            # remove every previously seen block from b (the block text is used as a regex)
     a[b]                                # remember what is left as a seen block (referencing the element creates it)
     if (b) { n=split(b,o); for (j=1; j<=n; j++) print ++c, o[j] }
}' infile
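
With the sample file from the first post, my reading of that one-liner is that it should print exactly the requested seven lines (this is a hand trace, not tested output, so please verify):

1 abc
2 123
3 ;
4 rao
5 bell
6 ;
7 call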

Just for fun:

$ cat infile
abc
123
;
rao
bell
;
call
abc
123
;
rao
bell
;

Output:

$ sed '1h;1!H;$!d;$g;s/\(.*\)\(.*\)\n\1/\1\2/' infile
abc
123
;
rao
bell
;
call

Thank you all!! I got what I needed!!


clap clap!

The following variants run the RE only on the last line (and are faster?):

sed '1h;1!H;$!d;${g;s/\(.*\)\(.*\)\n\1/\1\2/;}'
sed -n '1h;1!H;${g;s/\(.*\)\(.*\)\n\1/\1\2/;p;}'

Hi MadeInGermany, thanks ;). My suggestion also applies the RE only on the last line, because of the $!d: it deletes the pattern space and starts the next cycle for every line except the last, so the commands after it are only reached on the last line. The $ before the g is therefore not necessary:

sed '1h;1!H;$!d;g;s/\(.*\)\(.*\)\n\1/\1\2/' file
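
For anyone following along, here is that same script spelled out with full-line comments (the annotations are mine, not the posters' wording; as in the one-liners above, the \n in the s command relies on sed treating it as a newline inside the pattern space, which GNU sed does):

sed '
# line 1: copy the line into the hold space
1h
# every following line: append it to the hold space
1!H
# all lines except the last: delete the pattern space and start the next cycle,
# so the commands below only ever run on the last line
$!d
# last line: replace the pattern space with the whole file collected above
g
# greedy backreference: if the text ends with a newline plus a copy of a
# leading block \1, keep \1\2 and drop that trailing copy
s/\(.*\)\(.*\)\n\1/\1\2/
' file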