Help appending continuation lines to the preceding "/note" line based on specific rules

Input file format:

/tag="ABL"
/note="abl homolog
2
/tag="ABLIM1"
/note="actin binding LIM 1
/tag="ABP1"
/note="amiloride binding protein 1 (amine oxidase (copper-
containing))
/tag="ABR"
/note="active BCR-related
/tag="AC003042.1"
/note="SDR family member 11
precursor
.
.
.

Desired output file:

/tag="ABL"
/note="abl homolog 2
/tag="ABLIM1"
/note="actin binding LIM 1
/tag="ABP1"
/note="amiloride binding protein 1 (amine oxidase (copper-containing))
/tag="ABR"
/note="active BCR-related
/tag="AC003042.1"
/note="SDR family member 11 precursor
.
.
.

If a line does not start with "/tag" or "/note", I would like its content appended to the end of the preceding "/note" line, based on the following rules:

  1. If the "/note" line ends with "-", the continuation line (a line that does not start with "/tag" or "/note") should be appended directly to it.
    e.g.
Input:
/note="amiloride binding protein 1 (amine oxidase (copper-
containing))

Desired output:
/note="amiloride binding protein 1 (amine oxidase (copper-containing))

  2. If the "/note" line does not end with "-", a space " " should be added before the continuation line (a line that does not start with "/tag" or "/note") is appended to it.
    e.g.
Input:
/note="SDR family member 11
precursor

Output:
/note="SDR family member 11 precursor

A solution in any language (awk, sed, perl, etc.) would be appreciated.
Thanks in advance for any advice :)

Lazy way ...

tr '\n' '#' <infile | sed 's/#\([^/]\)/\1/g' | tr '#' '\n'

---------- Post updated at 09:33 AM ---------- Previous update was at 09:29 AM ----------

Handling the space: join directly when the line ends with '-', otherwise insert a space:

tr '\n' '#' <tst | sed 's/-#\([^/]\)/-\1/g;s/#\([^/]\)/ \1/g' | tr '#' '\n'

---------- Post updated at 09:34 AM ---------- Previous update was at 09:33 AM ----------

$ cat tst
/tag="ABL"
/note="abl homolog
2
/tag="ABLIM1"
/note="actin binding LIM 1
/tag="ABP1"
/note="amiloride binding protein 1 (amine oxidase (copper-
containing))
/tag="ABR"
/note="active BCR-related
/tag="AC003042.1"
/note="SDR family member 11
precursor
$ tr '\n' '#' <tst | sed 's/-#\([^/]\)/-\1/g;s/#\([^/]\)/ \1/g' | tr '#' '\n'
/tag="ABL"
/note="abl homolog 2
/tag="ABLIM1"
/note="actin binding LIM 1
/tag="ABP1"
/note="amiloride binding protein 1 (amine oxidase (copper-containing))
/tag="ABR"
/note="active BCR-related
/tag="AC003042.1"
/note="SDR family member 11 precursor

$

---------- Post updated at 09:37 AM ---------- Previous update was at 09:34 AM ----------

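This can perhaps be shortened a bit, as long as no line ending in '-' is directly followed by a "/tag" or "/note" line (it would be joined too):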

tr '\n' '#' <inputfile | sed 's/-#/-/g;s/#\([^/]\)/ \1/g' | tr '#' '\n'
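
For comparison, here is a minimal awk sketch of the two rules from the first post. The tr/sed pipelines above depend on '#' never occurring in the data, because the final tr turns every '#' back into a newline; this version avoids the placeholder and should also cope with a note wrapped over more than two lines. The function name join_notes is made up for illustration, and the sketch assumes that every line starting with "/" begins a new record and every other line is a continuation.

```shell
# Hypothetical helper: join continuation lines onto the preceding "/" line.
# Reads the named files, or stdin when no file is given.
join_notes() {
  awk '
    /^\// {                        # a new "/tag" or "/note" line starts
      if (have) print buf          # flush the previous, completed line
      buf = $0; have = 1
      next
    }
    {                              # any other line is a continuation
      if (!have) { buf = $0; have = 1; next }
      if (buf ~ /-$/) buf = buf $0         # rule 1: ends with "-", join directly
      else            buf = buf " " $0     # rule 2: otherwise join with a space
    }
    END { if (have) print buf }    # flush the last line
  ' "$@"
}
```

Usage would be something like `join_notes infile > outfile`.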

Hi ctsgnb,

Thanks for your reply.
Your "lazy way" works, but it doesn't follow rule 2 :(
It gives the following output:

cat infile:
/note="SDR family member 11
precursor

tr '\n' '#' < infile | sed 's/#\([^/]\)/\1/g' | tr '#' '\n'
/note="SDR family member 11precursor

My desired output is:

/note="SDR family member 11 precursor

Thanks again :)

I have updated my previous post in the meantime. Did you try the last suggestion?

---------- Post updated at 10:49 AM ---------- Previous update was at 10:40 AM ----------

Also try:

sed -e ':a' -e 'N;/^\/.*\n\/.*/{P;D;};s/\(.*\)-\n/\1-/;/^\/.*\n[^/].*/s/\n\([^/]\)/ \1/;p;d' -e 'ta' infile
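
As far as I can tell, the sed command above works on pairs of lines, so a note wrapped over three or more lines would only be partly joined. A looping variant could look like the sketch below. It is only a sketch: the function name join_notes_sed is made up for illustration, it assumes every record line starts with "/", and I have only reasoned it through against the sample data in this thread.

```shell
# Hypothetical helper: same two rules, written as a sed loop.
# Reads the named files, or stdin when no file is given.
join_notes_sed() {
  sed -e ':a' \
      -e '$!N' \
      -e 's/-\n\([^/]\)/-\1/;ta' \
      -e 's/\n\([^/]\)/ \1/;ta' \
      -e 'P;D' "$@"
}
# :a      loop label
# $!N     append the next input line, unless already on the last line
# 1st s   line ends with "-" and the next line is a continuation: join directly
# 2nd s   otherwise, if the next line is a continuation: join with a space
# ta      after a successful join, loop to pick up further continuation lines
# P;D     next line starts with "/": print the finished line, keep the rest
```

Usage would be something like `join_notes_sed infile`.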

Hi ctsgnb,

Thank you very much.
It worked :)