Removing a block of duplicate lines from a file

Hi all,

I have a file with the following data:

1  abc
2  123
3  ;
4  rao
5  bell
6  ;
7  call
8  abc
9  123
10 ;
11 rao
12 bell
13 ;
 

I want to remove lines 8 to 13, which are a repetition of lines 1 to 6.

When I use the command below, one of the semicolons ( ; ) in the first 6 lines also gets deleted, which it shouldn't.

sort myfile | uniq

But the output I need is:

 
abc
2 123
3 ;
4 rao
5 bell
6 ;
7 call
 

Please help me with this.

Thanks a lot!!

Regards,
Sreenivas

I don't understand the requirements. Do you want to remove all lines from a file that duplicate the contents of field 2 of the first six lines of your input file? Why do we need to remove field 1 of the first line in the file, but keep field 1 in the other six lines that are kept? Since field 2 in the third line in the file is a duplicate of field 2 in the sixth, tenth, and thirteenth lines of the input file, why shouldn't they all be removed?

Hi Don,

Sorry for the confusion.

Field 1 refers to the line numbers; I only kept those so I could refer to the lines. Please treat the actual file as containing only field 2.

I want to remove the block of lines 8 to 13, which is a duplicate of lines 1 to 6.

Within lines 1 to 6, ";" appears twice. Both of those should be kept as they are.

Finally my output should look like:

1 abc
2 123
3 ;
4 rao
5 bell
6 ;
7 call

Is there any special condition for ";"..? Please explain..

awk '!X[$2]++ || $2 ~ ";"' file
1  abc
2  123
3  ;
4  rao
5  bell
6  ;
7  call
10 ;
13 ;
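
In case the !X[$2]++ idiom is unfamiliar: X[$2]++ is 0 the first time a given field-2 value is seen and non-zero on every later occurrence, so on its own it keeps only the first line for each distinct value. That is exactly why the || $2 ~ ";" part is needed; without it the second ";" in the first block (line 6) would also be dropped. A quick illustration of my own, not part of the suggestion above:

awk '!X[$2]++' file
1  abc
2  123
3  ;
4  rao
5  bell
7  call

The price of the ";" exception is that the later semicolons (lines 10 and 13) are kept as well, hence the variant below.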

And if you want to keep only the first two occurrences of ";" then use:

awk '!X[$2]++ || ( $2 ~ ";" && ++a < 2 )' file

1  abc
2  123
3  ;
4  rao
5  bell
6  ;
7  call

If you just want to remove lines 8 to 13, then you can try this:

sed '8,13d' filename
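
Just as an aside (my addition, not part of the reply above): the same fixed-range delete can also be written in awk if that is handier:

awk 'NR < 8 || NR > 13' filename

Either way this only helps when the duplicate block always sits at exactly lines 8 to 13.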

The requirements still aren't clear.

Are you always only concerned about matching the first six lines, or are you trying to find the first set of n lines that are duplicated later in the file?

If the chosen lines at the start of the file are duplicated multiple times, do you only want to remove the first set of duplicated lines or do you want to remove every set of duplicated lines?

try:

awk '{sub("^[0-9]* *",""); b=b $0 " "}   # strip any leading line number, append the line to block b
/;/ {for (i in a) sub(i,"",b)            # remove every previously seen block from b (the block text is used as a regex)
     a[b]                                # remember what is left as a seen block (referencing the element creates it)
     if (b) { n=split(b,o); for (j=1; j<=n; j++) print ++c, o[j] }
}' infile
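
With the sample file from the first post, my reading of that one-liner is that it should print exactly the requested seven lines (this is a hand trace, not tested output, so please verify):

1 abc
2 123
3 ;
4 rao
5 bell
6 ;
7 call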

Just for fun:

$ cat infile
abc
123
;
rao
bell
;
call
abc
123
;
rao
bell
;

Output:

$ sed '1h;1!H;$!d;$g;s/\(.*\)\(.*\)\n\1/\1\2/' infile
abc
123
;
rao
bell
;
call

Thank you all!! I got what I needed!!


clap clap!

The following variants run the RE only on the last line (and are faster?):

sed '1h;1!H;$!d;${g;s/\(.*\)\(.*\)\n\1/\1\2/;}'
sed -n '1h;1!H;${g;s/\(.*\)\(.*\)\n\1/\1\2/;p;}'

Hi MadeInGermany, thanks ;). My suggestion also applies the RE only on the last line, because of the $!d: it deletes the pattern space and starts the next cycle for every line except the last, so the commands after it are only reached on the last line. The $ before the g is therefore not necessary:

sed '1h;1!H;$!d;g;s/\(.*\)\(.*\)\n\1/\1\2/' file
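
For anyone following along, here is that same script spelled out with full-line comments (the annotations are mine, not the posters' wording; as in the one-liners above, the \n in the s command relies on sed treating it as a newline inside the pattern space, which GNU sed does):

sed '
# line 1: copy the line into the hold space
1h
# every following line: append it to the hold space
1!H
# all lines except the last: delete the pattern space and start the next cycle,
# so the commands below only ever run on the last line
$!d
# last line: replace the pattern space with the whole file collected above
g
# greedy backreference: if the text ends with a newline plus a copy of a
# leading block \1, keep \1\2 and drop that trailing copy
s/\(.*\)\(.*\)\n\1/\1\2/
' file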