Addition of new line

sa1 · February 16, 2014, 11:00pm

Hi

I have a file whose contents are as follows:

sorce1       LEN   assumption   695     3570    0.770047        -       .       ID=f000001.1;source_id=A.off_LEN_10008424;
sorce1       LEN   descriptive     3334    3570    .       -       0       Parent=f000001.1;

sorce1       LEN   assumption    8859    11328   0.628724        +       .       ID=f000002.1;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive     8859    9032    .       +       0       Parent=f000002.1;

sorce1       LEN   assumption    354569    361011   0.628724        +       .       ID=f000012.1;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive        354600    360111    .       +       0       Parent=f000012.1;

sorce1       LEN   assumption    350567    354686    0.628724        +       .       ID=f000012.2;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive     350567    353321    .       +       0                       Parent=f000012.2;

I wanted it to look like this

sorce1       LEN   predictive    695     3570    0.770047        -       .       ID=f000001;source_id=A.off_LEN_10008424;
sorce1       LEN   assumption   695     3570    0.770047        -       .       ID=f000001.1;source_id=A.off_LEN_10008424;
sorce1       LEN   descriptive     3334    3570    .       -       0       Parent=f000001.1;

sorce1       LEN   predictive    8859    11328   0.628724        +       .       ID=f000002;source_id=A.off_LEN_10008425;
sorce1       LEN   assumption    8859    11328   0.628724        +       .       ID=f000002.1;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive     8859    9032    .       +       0       Parent=f000002.1;

sorce1       LEN   predictive    350567    361011    0.628724        +       .       ID=f000012;source_id=A.off_LEN_10008425;
sorce1       LEN   assumption    354569    361011   0.628724        +       .       ID=f000012.1;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive        354600    360111    .       +       0       Parent=f000012.1;

sorce1       LEN   assumption    350567    354686    0.628724        +       .       ID=f000012.2;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive     350567    353321    .       +       0                       Parent=f000012.2;

Basically, I want to add a line whose third column is predictive and whose ID holds only the id name, without anything after the dot.
So for every assumption line, I need to add a predictive line.

So I used this code:

sed 's/\(.*\)assumption\(.*\)\(ID=[^.]*\)[^;]*\(;.*\)/\1predictive\2\3\4\n&/' file

However, my file has some cases where the id name has variants. For example, one variant of an id is f000012.1 and the other is f000012.2.
The code above works perfectly for ids with no variants. But for ids with variants, I get multiple predictive lines for the same id.

result of the code

sorce1       LEN   predictive    695     3570    0.770047        -       .       ID=f000001;source_id=A.off_LEN_10008424;
sorce1       LEN   assumption   695     3570    0.770047        -       .       ID=f000001.1;source_id=A.off_LEN_10008424;
sorce1       LEN   descriptive     3334    3570    .       -       0       Parent=f000001.1;

sorce1       LEN   predictive    8859    11328   0.628724        +       .       ID=f000002;source_id=A.off_LEN_10008425;
sorce1       LEN   assumption    8859    11328   0.628724        +       .       ID=f000002.1;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive     8859    9032    .       +       0       Parent=f000002.1;

sorce1       LEN   predictive   354569    361011   0.628724        +       .       ID=f000012.1;source_id=A.off_LEN_10008425;
sorce1       LEN   assumption    354569    361011   0.628724        +       .       ID=f000012.1;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive        354600    360111    .       +       0       Parent=f000012.1;

sorce1       LEN  predictive     350567    354686    0.628724        +       .       ID=f000012.2;source_id=A.off_LEN_10008425;
sorce1       LEN   assumption    350567    354686    0.628724        +       .       ID=f000012.2;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive     350567    353321    .       +       0                       Parent=f000012.2;

whereas what I need should look like this:

sorce1       LEN   predictive    350567    361011    0.628724        +       .       ID=f000012;source_id=A.off_LEN_10008425;

Is there a way to add only a single predictive line per id, using the earliest start point and the farthest end point across its variants? The ID name in that line shouldn't carry a variant suffix.

Yoda · February 17, 2014, 11:32am

Using awk:

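awk '
        /assumption/ {
                r = $0                          # work on a copy of the line
                sub ( "assumption", "predictive", r )
                sub ( /\.[0-9]*;/, ";", r )     # drop the variant suffix from the ID
                print r                         # predictive line goes first
        }
        1                                       # then print every input line unchanged
' file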

sa1 · February 18, 2014, 12:20am

Thanks for the reply.
The script you gave didn't remove the duplicate predictive lines.

Basically, I am looking for a command that inserts a predictive line for every assumption line and also checks for ids that have variants. When it finds variants, it should write a single predictive line whose start is the earliest start point (350567 in this example) and whose end is the farthest end point (361011).

Original input:

sorce1       LEN   assumption   695     3570    0.770047        -       .       ID=f000001.1;source_id=A.off_LEN_10008424;
sorce1       LEN   descriptive     3334    3570    .       -       0       Parent=f000001.1;

sorce1       LEN   assumption    8859    11328   0.628724        +       .       ID=f000002.1;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive     8859    9032    .       +       0       Parent=f000002.1;

sorce1       LEN   assumption    354569    361011   0.628724        +       .       ID=f000012.1;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive        354600    360111    .       +       0       Parent=f000012.1;

sorce1       LEN   assumption    350567    354686    0.628724        +       .       ID=f000012.2;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive     350567    353321    .       +       0                       Parent=f000012.2;

Desired result:

sorce1       LEN   predictive    695     3570    0.770047        -       .       ID=f000001;source_id=A.off_LEN_10008424;
sorce1       LEN   assumption   695     3570    0.770047        -       .       ID=f000001.1;source_id=A.off_LEN_10008424;
sorce1       LEN   descriptive     3334    3570    .       -       0       Parent=f000001.1;

sorce1       LEN   predictive    8859    11328   0.628724        +       .       ID=f000002;source_id=A.off_LEN_10008425;
sorce1       LEN   assumption    8859    11328   0.628724        +       .       ID=f000002.1;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive     8859    9032    .       +       0       Parent=f000002.1;

sorce1       LEN   predictive    350567    361011    0.628724        +       .       ID=f000012;source_id=A.off_LEN_10008425;
sorce1       LEN   assumption    354569    361011   0.628724        +       .       ID=f000012.1;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive        354600    360111    .       +       0       Parent=f000012.1;

sorce1       LEN   assumption    350567    354686    0.628724        +       .       ID=f000012.2;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive     350567    353321    .       +       0                       Parent=f000012.2;

ahamed101 · February 18, 2014, 12:48am

Try this

awk '/assumption/{
    st=$0; split($NF,arr,"[=;.]")
    if(arr[2] in key){ print; next }
    key[arr[2]]; $3="predictive"
    sub ( /\.[0-9]*\;/, ";")
    print; print st; next
  }1
' infile

--ahamed

sa1 · February 18, 2014, 10:18am

Thanks, but can it be done using sed?

With awk, the $3="predictive" assignment changes the formatting of the file.

Can we do it without referring to columns?

ahamed101 · February 18, 2014, 1:20pm

Try this...

awk '/assumption/{
        st=$0; split($NF,arr,"[=;.]")
        if(arr[2] in key){ print; next }
        key[arr[2]];
        sub(/assumption/, "predictive")
        sub ( /\.[0-9]*\;/, ";")
        print; print st; next
}1
' infile

--ahamed
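
Neither script above widens the coordinates: the predictive line is copied from whichever variant appears first, so f000012 would get 354569 to 361011 rather than 350567 to 361011. Below is a minimal two-pass awk sketch of the earliest-start, farthest-end requirement. It is an illustration only, not something tested against the real file. It assumes the type is in column 3, start and end are in columns 4 and 5, and the ID=...; attribute is the last field and contains no whitespace. The names base_id, lo, hi and done, and the temporary sample file, are made up for this example. Lines are edited with sub() and substr() instead of assigning to $3, so the original spacing is kept.

```shell
# Sample input: the two f000012 variants from the thread.
sample=$(mktemp)
cat > "$sample" <<'EOF'
sorce1       LEN   assumption    354569    361011   0.628724        +       .       ID=f000012.1;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive        354600    360111    .       +       0       Parent=f000012.1;

sorce1       LEN   assumption    350567    354686    0.628724        +       .       ID=f000012.2;source_id=A.off_LEN_10008425;
sorce1       LEN   descriptive     350567    353321    .       +       0                       Parent=f000012.2;
EOF

# The file is named twice on purpose: pass 1 collects ranges, pass 2 prints.
result=$(awk '
# Base ID: the text between "ID=" and the first "." or ";" in the last field.
function base_id(attr,    a) { split(attr, a, "[=;.]"); return a[2] }

# Pass 1: remember the smallest start and largest end per base ID.
NR == FNR {
    if ($3 == "assumption") {
        id = base_id($NF)
        if (!(id in lo) || $4 + 0 < lo[id]) lo[id] = $4 + 0
        if (!(id in hi) || $5 + 0 > hi[id]) hi[id] = $5 + 0
    }
    next
}

# Pass 2: before the first variant of each base ID, print one predictive line.
$3 == "assumption" {
    id = base_id($NF)
    if (!(id in done)) {
        done[id] = 1
        r = $0
        sub(/assumption/, "predictive", r)   # rename the type column
        sub(/\.[0-9]+;/, ";", r)             # ID=f000012.1; becomes ID=f000012;
        # Split off columns 1-3 so the next number in "rest" is the start.
        match(r, /^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]+/)
        head = substr(r, 1, RLENGTH); rest = substr(r, RLENGTH + 1)
        sub(/^[0-9]+/, lo[id], rest)
        # Split off the start column so the next number in "tail" is the end.
        match(rest, /^[0-9]+[ \t]+/)
        mid = substr(rest, 1, RLENGTH); tail = substr(rest, RLENGTH + 1)
        sub(/^[0-9]+/, hi[id], tail)
        print head mid tail
    }
}
{ print }                                    # every input line, unchanged
' "$sample" "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

On the sample this prints a single predictive line for f000012 with start 350567 and end 361011, followed by the five input lines as they were. For a real file, replace both "$sample" arguments with the file name. Reading the file twice keeps the script short, at the cost of a second pass over the input.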

sa1 · February 20, 2014, 11:01am

I tried using this code

awk '/assumption/ {
  line = $0
..
    print line
  }
  _++
}
1
' infile > outfile.txt

But this gave me a bash error

 
-bash: outfile.txt: Permission denied

vbe · February 20, 2014, 11:22am

In other words you have no write permission in that directory...

Yoda · February 20, 2014, 11:26am

Or the file outfile.txt already exists and you don't have permission to write to it.
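
Both possibilities can be told apart from the shell. The sketch below is only an illustration: check_target is a hypothetical helper name, not a standard command, and it relies on nothing beyond the POSIX test operators -e and -w plus dirname.

```shell
# check_target is a hypothetical helper: it reports why "> target" might fail.
check_target() {
    target=$1
    dir=$(dirname "$target")          # directory that holds (or would hold) the file
    if [ -e "$target" ]; then
        # An existing file is overwritten in place, so the file itself must be writable.
        if [ -w "$target" ]; then
            echo "exists, writable"
        else
            echo "exists, not writable"
        fi
    elif [ -w "$dir" ]; then
        # A missing file has to be created, which needs write access to the directory.
        echo "missing, directory writable"
    else
        echo "missing, directory not writable"
    fi
}

check_target outfile.txt
```

Running ls -ld on the directory and on outfile.txt shows the owner and permission bits behind whichever answer comes back.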