Delete lines and the first pattern between 2 matched patterns

redse171 · July 31, 2013, 9:00am

Hi,

I need help deleting all the lines between two matched patterns; the line with the first pattern must be deleted too. A sample follows:

inputfile.txt

>kump_1
...........................
...........................
>start_0124
dgfhghgfh
fgfdgfh
fdgfdh
>kump_2
..........................
..........................
..........................
....................
>start_0012
sdfdsagf
gfhghg
>kump_3
...........................
>start_3254
sdafdsfg.......
fsdf....adfdf
fdsaf...

I want the output to look like this:

>start_0124
dgfhghgfh
fgfdgfh
fdgfdh
>start_0012
sdfdsagf
gfhghg
>start_3254
sdafdsfg.......
fsdf....adfdf
fdsaf...

I tried using

sed '/>kump/,/>/{ />/p; d }'

but it still shows ">kump_1" as the first line of the output. Any help is much appreciated. Thanks.

Scott · July 31, 2013, 9:15am

$ awk '/^>/ {P = 0} /^>start/ {P = 1} P' file
>start_0124
dgfhghgfh
fgfdgfh
fdgfdh
>start_0012
sdfdsagf
gfhghg
>start_3254
sdafdsfg.......
fsdf....adfdf
fdsaf...

redse171 · July 31, 2013, 10:35am

Hi Scott,

Thanks so much for the code. It works great! I did make a very minor change, though, because of another issue that I didn't show in the sample: not all lines use ">start_xxx". So I changed the pattern to ">kump" and added "!" in front of P in your code, as follows:

 $ awk '/^>/ {P = 0} /^>kump/ {P = 1} !P' file

Thanks again for your kind help

MadeInGermany · August 1, 2013, 2:17am

With positive logic, P stands for "print":

awk '/^>/ {P = 1} /^>kump/ {P = 0} P' file
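
For anyone new to the flag idiom, here is the same one-liner spelled out with comments. It is only a sketch: the variable name keep and the shortened sample data are made up for illustration.

```shell
# Shortened, hypothetical version of the sample from the first post.
sample='>kump_1
...........
>start_0124
dgfhghgfh
>kump_2
.......
>start_0012
sdfdsagf'

kept=$(printf '%s\n' "$sample" | awk '
    /^>/     { keep = 1 }   # any ">" header opens a block to keep...
    /^>kump/ { keep = 0 }   # ...unless it is a ">kump" header
    keep                    # bare pattern: print the line while keep is set
')
printf '%s\n' "$kept"
```

Both rules run on every header line, so a ">kump" header first sets the flag and then clears it again, and nothing is printed until the next non-kump header.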

RavinderSingh13 · August 1, 2013, 3:25am

Hello,

Just another way to get the desired output.
Let's say we have a file named check_test, as follows.

 cat check_test
>kump_1
...........................
...........................
>start_0124
dgfhghgfh
fgfdgfh
fdgfdh
>kump_2
..........................
..........................
..........................
....................
>start_0012
sdfdsagf
gfhghg
>kump_3
...........................
>start_3254
sdafdsfg.......
fsdf....adfdf
fdsaf...

The command to get the desired output is as follows.

 
$ sed 's/^\>kump_[0-9]//g; s/\.//g; /^ *$/d' check_test

Output is as follows.

>start_0124
dgfhghgfh
fgfdgfh
fdgfdh
>start_0012
sdfdsagf
gfhghg
>start_3254
sdafdsfg
fsdfadfdf
fdsaf
$

Thanks,
R. Singh

MadeInGermany · August 1, 2013, 7:16am

@R.Singh
It must be > and not \>. The latter has the special meaning "right word boundary" in many sed versions.
Here is another sed solution (IMHO ugly compared to the awk solution):

sed -n -e '${x;p;x;p;}' -e '/^>kump/{g;1!p;}' -e 'H;/^>/h' file

alister · August 1, 2013, 1:14pm

R. Singh's sed command is wrong since it modifies lines outside of the designated range by indiscriminately deleting dots.

If, as I suspect, the dots in the data are placeholders for irrelevant data, then this solution, which depends on literal dots, is wrong in a second way.
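
Both problems can be seen on a small made-up input. This is only a sketch: the data is hypothetical, and the command is the one quoted from the earlier post with \> already changed to a plain >.

```shell
# Hypothetical data: real text inside the >kump block, dots inside a kept line.
data='>kump_1
junk that should go
>start_3254
fsdf....adfdf'

# The earlier command, with \> corrected to >.
result=$(printf '%s\n' "$data" | sed 's/^>kump_[0-9]//g; s/\.//g; /^ *$/d')
printf '%s\n' "$result"
```

The text from the >kump block survives because it contains no dots, and the kept line comes out as fsdfadfdf, with its dots gone.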

Regards,
Alister

---------- Post updated at 01:14 PM ---------- Previous update was at 12:27 PM ----------

I'm curious. Besides GNU (and perhaps Busybox, which emulates GNU whenever its minimalist mission allows), which sed implementations support \>?

That sed script is not equivalent to the awk solution. If the data ends mid-range, it will print (accumulated) lines that the awk alternative would not.

Not a one-liner, but it's the most straightforward approach which behaves analogously:

#n

/^>kump/ {
        :top
        n
        /^>/! b top
}
p
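
One way to try the script above is to save it to a file and run it with sed -f. This is a sketch only: the temporary file and the four-line test input are made up. Note that #n has to be the very first line of the script file for it to act like -n.

```shell
# Write the script to a temporary file; "#n" must be its first line.
script=$(mktemp)
cat > "$script" <<'EOF'
#n
/^>kump/ {
        :top
        n
        /^>/! b top
}
p
EOF

# Hypothetical four-line input: one >kump block followed by one >start block.
out=$(printf '%s\n' '>kump_1' '....' '>start_0124' 'dgfhghgfh' | sed -f "$script")
rm -f "$script"
printf '%s\n' "$out"
```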

Regards,
Alister

MadeInGermany · August 2, 2013, 4:27am

Yes, that's the safer strategy. As a one-liner:

sed -n -e '/^>kump/{:top' -e 'n; /^>/!b top' -e '}' -e p file
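
A quick check of the case discussed above, where the data ends in the middle of a >kump block. This is a sketch on made-up input, comparing the awk solution with this one-liner.

```shell
# Hypothetical data that stops in the middle of a >kump block.
truncated='>start_0124
dgfhghgfh
>kump_2
..........'

awk_out=$(printf '%s\n' "$truncated" | awk '/^>/ {P = 1} /^>kump/ {P = 0} P')
sed_out=$(printf '%s\n' "$truncated" |
    sed -n -e '/^>kump/{:top' -e 'n; /^>/!b top' -e '}' -e p)

printf '%s\n' "$sed_out"
[ "$awk_out" = "$sed_out" ] && echo "awk and sed agree"
```

With -n in effect, the n command simply ends the script when it runs out of input, so nothing from the unfinished block is printed.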