Help with modifying files

Hello everyone,

I have some data files with mixed header formats. Here is a sample:

>ABCD76567.x1 
AGTCGATCGTAGTCGTAGCTGT
>ABCD76567.y1
AGTCGATCGTAGTCGTAGCTGT
>ABCD76568.x1 pair_info:898989
AGTCGATCGTAGTCGTAGCTGT
>ABCD76568.y1 pair_info:893489
AGTCGATCGTAGTCGTAGCTGT
>ABCD76569.x1 pair_info:892189
AGTCGATCGTAGTCGTAGCTGT
>ABCD76569.y1 pair_info:2098308
AGTCGATCGTAGTCGTAGCTGT
>ABCD76570.x01 pair_info:8787321
AGTCGATCGTAGTCGTAGCTGT
>ABCD76570.x1 pair_info:898989
AGTCGATCGTAGTCGTAGCTGT
>ABCD76570.y1 pairs_info:898989,87574
AGTCGATCGTAGTCGTAGCTGT
 >ABCD76571.x1 pair_info:1626762
 AGTCGATCGTAGTCGTAGCTGT
>ABCD76572.x1 pairs_info:898989,34374
AGTCGATCGTAGTCGTAGCTGT
>ABCD76572.y01 pair_info:898989
AGTCGATCGTAGTCGTAGCTGT
>ABCD76572.y1 pair_info:898989
AGTCGATCGTAGTCGTAGCTGT
>ABCD76573.y1 pair_info:113242
 AGTCGATCGTAGTCGTAGCTGT
...
....
..
..

I only need to look at the first field in the header line, and there are 3 things I need to achieve:

  1. Headers that do not have a "pair_info" field are to be put in one file, like this:
>ABCD76567.x1 
AGTCGATCGTAGTCGTAGCTGT
>ABCD76567.y1
 AGTCGATCGTAGTCGTAGCTGT
...
....
...
  2. Headers with "pair_info" or "pairs_info" are to be put in another file, so that the output looks like this:
>ABCD76568.x1 pair_info:898989
 AGTCGATCGTAGTCGTAGCTGT
 >ABCD76568.y1 pair_info:893489
 AGTCGATCGTAGTCGTAGCTGT
>ABCD76569.x1 pair_info:892189
 AGTCGATCGTAGTCGTAGCTGT
 >ABCD76569.y1 pair_info:2098308
 AGTCGATCGTAGTCGTAGCTGT
>ABCD76570.x1 pair_info:898989
 AGTCGATCGTAGTCGTAGCTGT
 >ABCD76570.y1 pairs_info:898989,87574
 AGTCGATCGTAGTCGTAGCTGT
>ABCD76572.x1 pairs_info:898989,34374
 AGTCGATCGTAGTCGTAGCTGT
>ABCD76572.y1 pair_info:898989
 AGTCGATCGTAGTCGTAGCTGT

In this second file, I do not need headers that have no mate, such as
>ABCD76573.y1 (no corresponding *.x1 mate) and >ABCD76571.x1 (no corresponding *.y1 mate).

Thanks!

Hi

For Req 1:

# sed  '/pairs*_info/{$!N;d}' file
>ABCD76567.x1
AGTCGATCGTAGTCGTAGCTGT
>ABCD76567.y1
AGTCGATCGTAGTCGTAGCTGT
#

For Req 2:

# sed -n '/pairs*_info/{$!N;p}' file
>ABCD76568.x1 pair_info:898989
AGTCGATCGTAGTCGTAGCTGT
>ABCD76568.y1 pair_info:893489
AGTCGATCGTAGTCGTAGCTGT
>ABCD76569.x1 pair_info:892189
AGTCGATCGTAGTCGTAGCTGT
>ABCD76569.y1 pair_info:2098308
AGTCGATCGTAGTCGTAGCTGT
>ABCD76570.x01 pair_info:8787321
AGTCGATCGTAGTCGTAGCTGT
>ABCD76570.x1 pair_info:898989
AGTCGATCGTAGTCGTAGCTGT
>ABCD76570.y1 pairs_info:898989,87574
AGTCGATCGTAGTCGTAGCTGT
 >ABCD76571.x1 pair_info:1626762
 AGTCGATCGTAGTCGTAGCTGT
>ABCD76572.x1 pairs_info:898989,34374
AGTCGATCGTAGTCGTAGCTGT
#

You can redirect the above output to any file of your choice.
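
As a rough illustration of the redirection, here is a small sketch. The file names sample.fa, no_pair_info.fa and with_pair_info.fa are made up for the example, and it assumes every record is one header line followed by exactly one sequence line (the N command only pulls in a single following line). The extra ";" before "}" is only there so the same commands also work with non-GNU sed.

```shell
#!/bin/sh
# Hypothetical two-record sample; real data would come from your own file.
printf '%s\n' \
  '>ABCD76567.x1' \
  'AGTCGATCGTAGTCGTAGCTGT' \
  '>ABCD76568.x1 pair_info:898989' \
  'AGTCGATCGTAGTCGTAGCTGT' > sample.fa

# Req 1: delete every header that carries pair(s)_info, plus its sequence line;
# whatever is left is written to the first file.
sed '/pairs*_info/{$!N;d;}' sample.fa > no_pair_info.fa

# Req 2: print only headers that carry pair(s)_info, plus their sequence line.
sed -n '/pairs*_info/{$!N;p;}' sample.fa > with_pair_info.fa
```

Note that these two commands only split on the presence of the field; they do not check whether an .x1 header has a matching .y1.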

Guru.

Thanks for your reply.

But for Req 2, I need an extra condition, so that only complete pairs appear in the output, like this:

>ABCD76568.x1 pair_info:898989
 AGTCGATCGTAGTCGTAGCTGT
 >ABCD76568.y1 pair_info:893489
 AGTCGATCGTAGTCGTAGCTGT
>ABCD76569.x1 pair_info:892189
 AGTCGATCGTAGTCGTAGCTGT
 >ABCD76569.y1 pair_info:2098308
 AGTCGATCGTAGTCGTAGCTGT
>ABCD76570.x1 pair_info:898989
 AGTCGATCGTAGTCGTAGCTGT
 >ABCD76570.y1 pairs_info:898989,87574
 AGTCGATCGTAGTCGTAGCTGT
>ABCD76572.x1 pairs_info:898989,34374
 AGTCGATCGTAGTCGTAGCTGT
>ABCD76572.y1 pair_info:898989
 AGTCGATCGTAGTCGTAGCTGT

Also, I need to pull out those sequences that have a "pair(s)_info" field but no corresponding mate (x1 but no y1, or vice versa), like the last sequence in my example:

>ABCD76573.y1 pair_info:113242
 AGTCGATCGTAGTCGTAGCTGT

These should go in the first file.

Thanks!

---------- Post updated 08-12-10 at 09:12 AM ---------- Previous update was 08-11-10 at 11:34 AM ----------

Any more thoughts about this?
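
One possible way to handle the pairing condition is a two-pass awk script: the first pass records which read IDs have an .x1 and a .y1 header carrying pair(s)_info, and the second pass routes each record to one of two files. This is only a sketch under some assumptions that the thread does not settle: the file names reads.fa, paired.fa and unpaired.fa are made up; a read counts as paired only if both the .x1 and the .y1 header carry a pair(s)_info field; and headers with other suffixes (such as .x01 or .y01) are sent to the first file along with the unpaired ones. Lines are written out unchanged, including any leading spaces.

```shell
#!/bin/sh
# Sample input copied from the first post (hypothetical file name reads.fa).
cat > reads.fa <<'EOF'
>ABCD76567.x1
AGTCGATCGTAGTCGTAGCTGT
>ABCD76567.y1
AGTCGATCGTAGTCGTAGCTGT
>ABCD76568.x1 pair_info:898989
AGTCGATCGTAGTCGTAGCTGT
>ABCD76568.y1 pair_info:893489
AGTCGATCGTAGTCGTAGCTGT
>ABCD76569.x1 pair_info:892189
AGTCGATCGTAGTCGTAGCTGT
>ABCD76569.y1 pair_info:2098308
AGTCGATCGTAGTCGTAGCTGT
>ABCD76570.x01 pair_info:8787321
AGTCGATCGTAGTCGTAGCTGT
>ABCD76570.x1 pair_info:898989
AGTCGATCGTAGTCGTAGCTGT
>ABCD76570.y1 pairs_info:898989,87574
AGTCGATCGTAGTCGTAGCTGT
 >ABCD76571.x1 pair_info:1626762
 AGTCGATCGTAGTCGTAGCTGT
>ABCD76572.x1 pairs_info:898989,34374
AGTCGATCGTAGTCGTAGCTGT
>ABCD76572.y01 pair_info:898989
AGTCGATCGTAGTCGTAGCTGT
>ABCD76572.y1 pair_info:898989
AGTCGATCGTAGTCGTAGCTGT
>ABCD76573.y1 pair_info:113242
 AGTCGATCGTAGTCGTAGCTGT
EOF

awk -v paired=paired.fa -v single=unpaired.fa '
# Split a header field such as ">ABCD76568.x1" into the read ID (id)
# and the suffix after the last dot (suf); both are global variables.
function parse(hdr) {
    id = hdr;  sub(/\.[^.]*$/, "", id)
    suf = hdr; sub(/^.*\./, "", suf)
}
# True when the line carries a pair_info or pairs_info field.
function has_info(line) { return line ~ /pairs?_info:/ }

BEGIN { out = single }   # lines before the first header go to the first file

# Pass 1 (first read of the file): note which IDs have .x1 / .y1 with pair info.
NR == FNR {
    if ($1 ~ /^>/ && has_info($0)) {
        parse($1)
        if (suf == "x1") has_x[id] = 1
        if (suf == "y1") has_y[id] = 1
    }
    next
}
# Pass 2 (second read): pick the output file at each header line.
$1 ~ /^>/ {
    parse($1)
    ok = has_info($0) && (suf == "x1" || suf == "y1") && (id in has_x) && (id in has_y)
    out = ok ? paired : single
}
# Header and sequence lines both follow the most recent decision.
{ print > out }
' reads.fa reads.fa
```

The input file is named twice on purpose so that awk reads it once per pass. Because sequence lines simply follow the last header's destination, sequences wrapped over several lines should be handled as well.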