Selectively deleting newlines with sed

Xterra · November 10, 2018, 3:38pm

I have a file that looks like this:

>Muestra-1
agctgcgagctgcgaccc
gggttatata
ggaagagacacacacaccccc
>Muestra-2
agctgcg
agctgcgacccgggttatataggaagagac
acacacaccccc
>Muestra-3
agctgcgagctgcgaccc
gggttatata
ggaagagacacacacaccccc

I use the following sed script to remove the newlines from lines not starting with >:

sed ':a /^>/!N;s/\r\?\n\([^>]\)/\1/;ta'

I was trying to use b instead of t. So, this is what I did:

sed '/^>/!{:a;N;$!ba};s/\r\?\n//g'

but I didn't get the desired result. Is there any way to use b in the second script to eliminate the newlines while skipping the lines that start with > ?

bakunin · November 10, 2018, 7:09pm

xterra:

So, this is what I did:
sed '/^>/!{:a;N;$!ba};s/\r\?\n//g'
but I didn't get the desired result. Is there any way to use b in the second script to eliminate the newlines while skipping the lines that start with > ?

The problem has nothing to do with "t" or "b", but with how sed actually works. Let's say you have a sed script like this:

sed 'command1
     command2
     /regexp/ {
           command3
           command4
     }' /some/file

What happens is this: sed reads the first line of the input file into a buffer called the "pattern space", then applies the first command of its script ("command1") to it, then the next one, and so on until it reaches the end of the script. If anything is still in the pattern space at that point, it is printed to stdout. Then the next line of input is read, replacing the contents of the pattern space, the first command is applied again, and so on. So, step by step:

read line1 of input
apply "command1" to it
apply "command2" to the result of previous line
if /regexp/ matches
     apply "command3" to the result of previous line
     apply "command4" to the result of previous line
endif
read next line of input
apply "command1" to it
...
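
The cycle can be watched with a small experiment. This is only a sketch: the two-line input and the two substitutions are made up for illustration. Every command works on whatever the previous command left in the pattern space, and the script starts over for each input line:

```shell
# Line 1 is "a": s/a/b/ turns it into "b", then s/b/c/ turns that into "c".
# Line 2 is "b": s/a/b/ does nothing, then s/b/c/ turns it into "c".
cycle_out=$(printf 'a\nb\n' | sed 's/a/b/; s/b/c/')
printf '%s\n' "$cycle_out"
```

Both output lines come out as "c", because the second substitution sees the result of the first one, not the original input line.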

Now, what does your code do:

/^>/!            # do the following for every pattern space not starting with a ">"
     {:a                 # define a label, the jump target for any "t" or "b" command
     N                   # append the next input line immediately, without returning to the beginning of the script
     $! ba               # if this is not the last line jump to a
     }
s/\r\?\n//g              # finally, delete every newline in the pattern space

Do you spot it? Once you are inside the condition, it is never checked again. You only loop inside it, always adding more text to the pattern space but never doing anything with it, until you hit the last line. Also notice that "/^>/" is tested against the start of the whole pattern space, so it is true for ANY pattern space content starting with ">". That means, for this:

> bla foo

but also for this, after adding a line:

> bla foo
more text

And the same goes the other way: the negation, "/^>/!", is true for this:

foo bar

but also for this:

foo bar
> a line starting with ">"
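
Both cases are easy to check from the shell. In this sketch (the two-line inputs are made up for illustration) N pulls the second line into the pattern space before the address is tested:

```shell
# The pattern space is "foo bar\n>header". It does not START with ">",
# so the negated address matches and both lines are printed.
starts_plain=$(printf 'foo bar\n>header\n' | sed -n 'N; /^>/!p')

# The pattern space is ">header\nfoo bar". It starts with ">",
# so the plain address matches and both lines are printed.
starts_header=$(printf '>header\nfoo bar\n' | sed -n 'N; /^>/p')

printf '%s\n' "$starts_plain"
printf '%s\n' "$starts_header"
```

In both runs the ">" on the second line of the pattern space plays no role in the decision.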

This means your logic is wrong, regardless of using "t" or "b". The difference between the two is that "t" branches only if an "s" command has actually replaced something since the current input line was read (or since the last "t"), whereas "b" always branches. Say, this is the input file:

xxx
yyy
xxx

and this is your sed-script working on the file:

sed 's/xxx/XXX/
b end
s/yyy/YYY/
:end'

Then the substitution of "yyy" to "YYY" will never take place, because it is unconditionally skipped over. If you change the "b" to a "t", it will be executed: in the lines with no "xxx" the first substitution does nothing, and therefore the "t" does not branch to end.
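
Putting these points together, here is one possible way to write the joining loop with "b" only. Take it as a sketch, not as the one right answer: it assumes Unix line endings (the "\r\?" from the original script would have to be put back for CRLF input), it was written with GNU sed in mind, and the variable names are invented for the demonstration. The idea is to test for ">" right after every N, inside the loop, instead of once before it:

```shell
# Sample input: the first two records from the original post.
fasta='>Muestra-1
agctgcgagctgcgaccc
gggttatata
ggaagagacacacacaccccc
>Muestra-2
agctgcg
agctgcgacccgggttatataggaagagac
acacacaccccc'

joined=$(printf '%s\n' "$fasta" | sed '
# A header line is printed unchanged: b without a label jumps to the end of the script.
/^>/b
:a
$!{
  # Append the next input line to the pattern space.
  N
  # If the appended line is not a header, join it and loop again.
  /\n>/!{
    s/\n//
    ba
  }
}
# Print the joined sequence (everything up to the first newline, if any) ...
P
# ... then delete it and rerun the script on the header that was read ahead.
D
')
printf '%s\n' "$joined"
```

For the sample above this prints each header followed by its sequence on a single line. P and D are what keep the header that N has already read ahead from being lost or glued onto the sequence.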

I hope this helps.

bakunin

Xterra · November 11, 2018, 2:23pm

got it ! Thanks