Parsing a file based on next line

sammy777 · October 29, 2014, 3:34pm

I have a file, file1, that looks like this:

ID   E2AK1_HUMAN             Reviewed;         630 AA.
CC   -!- SUBCELLULAR LOCATION: Host nucleus {ECO:0000305}.
ID   E1A_ADEM1               Reviewed;         200 AA.
ID   E1A_ADES7               Reviewed;         266 AA.
CC   -!- SUBCELLULAR LOCATION: Host nucleus {ECO:0000305}.
ID   E1B55_ADE02             Reviewed;         495 AA.
CC   -!- SUBCELLULAR LOCATION: Membrane {ECO:0000269|PubMed:10211970}.
ID   E1B9_ADE07              Reviewed;          88 AA.
ID   E1BL_ADE05              Reviewed;         496 AA.
ID   E1BL_ADET1              Reviewed;         391 AA.
ID   E1BS_ADE02              Reviewed;         175 AA.
CC   -!- SUBCELLULAR LOCATION: Cytoplasm {ECO:0000250}. Host
ID   E1BS_ADE04              Reviewed;         142 AA.
CC   -!- SUBCELLULAR LOCATION: Host cell membrane {ECO:0000250}. Host
ID   E2204_ARATH             Reviewed;         329 AA.
ID   E2AB_ECOLX              Reviewed;         123 AA.
CC   -!- SUBCELLULAR LOCATION: Cytoplasm {ECO:0000250}.
ID   E2AK1_MACFA             Reviewed;         631 AA.

I want to create a file, file2, that looks like this:

ID   E2AK1_HUMAN             Reviewed;         630 AA. CC   -!- SUBCELLULAR LOCATION: Host nucleus {ECO:0000305}.
ID   E1A_ADES7               Reviewed;         266 AA. CC   -!- SUBCELLULAR LOCATION: Host nucleus {ECO:0000305}.
ID   E1B55_ADE02             Reviewed;         495 AA. CC   -!- SUBCELLULAR LOCATION: Membrane {ECO:0000269|PubMed:10211970}.
ID   E1BS_ADE02              Reviewed;         175 AA. CC   -!- SUBCELLULAR LOCATION: Cytoplasm {ECO:0000250}. Host
ID   E1BS_ADE04              Reviewed;         142 AA. CC   -!- SUBCELLULAR LOCATION: Host cell membrane {ECO:0000250}. Host
ID   E2AB_ECOLX              Reviewed;         123 AA. CC   -!- SUBCELLULAR LOCATION: Cytoplasm {ECO:0000250}.

A line starting with ID should be kept only if the next line starts with CC. So if a line starting with ID is followed by another line starting with ID, the first of the two should be deleted (for example, line 3 in file1 is deleted because the next line starts with ID). And since line 1 in file1 starts with ID and the next line starts with CC, the two are joined into a single line, like line 1 in file2.

Corona688 · October 29, 2014, 3:54pm

Use code tags, not icode please.

```text
stuff
```

or the button.

Aia · October 29, 2014, 4:22pm

With minimal validation

perl -lne '/^CC\s+/ && $previous && print "$previous $_"; $previous = $_' file1 > file2

With specific validation

perl -lne '/^CC\s+/ && $previous =~ /^ID\s+/ && print "$previous $_"; $previous = $_' file1 > file2

ghostdog74 · October 29, 2014, 9:41pm

Whenever you need to solve something, try to design your approach before writing code.

Pseudocode:

while read each line from file
do
   if line starts with ID then
       save the line to a variable var
   fi
   if line starts with CC and var is not empty then
       print var followed by the current line
       clear var
   fi
done
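
The pseudocode above could be sketched as a plain POSIX shell loop. This is only a minimal sketch, not the poster's code: the function name join_id_cc and the variable prev are made up for illustration, and it assumes a saved ID line should be dropped as soon as any other line (including another ID) comes between it and a CC line.

```shell
# join_id_cc (hypothetical helper): read lines on stdin and print each ID
# line joined with the CC line that immediately follows it.
join_id_cc() {
    prev=
    while IFS= read -r line
    do
        case $line in
            ID*) prev=$line ;;                  # remember the latest ID line
            CC*) if [ -n "$prev" ]; then
                     printf '%s %s\n' "$prev" "$line"
                 fi
                 prev= ;;                       # an ID line is used at most once
            *)   prev= ;;                       # any other line breaks the pairing
        esac
    done
}
```

It would be called as join_id_cc < file1 > file2. A shell loop like this is much slower than awk, perl, or sed on large files, so it is mainly useful for checking the logic.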

Don_Cragun · October 29, 2014, 11:28pm

And if you prefer awk to perl, you could try:

awk '
$1 == "ID" { id = $0; next }
$1 == "CC" { print id, $0 }
' file1 > file2

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.

RavinderSingh13 · October 30, 2014, 1:05am

Hello Sammy777,

Here is another awk approach that may help you too.

awk '($1 == "ID"){S=$0;++i;{if(i>1){i=1}}} ($1 == "CC"){if(i==1){print S OFS $0;S="";i=""}}'  Input_file > Output_file

Thanks,
R. Singh

Don_Cragun · October 30, 2014, 2:06am

RavinderSingh13 wrote:

awk '($1 == "ID"){S=$0;++i;{if(i>1){i=1}}} ($1 == "CC"){if(i==1){print S OFS $0;S="";i=""}}'  Input_file > Output_file

If you're going to take this step to verify that you only print a "CC" line that appears after an "ID" line that hasn't already been printed, why use:

++i;{if(i>1){i=1}}

instead of the much simpler i=1? And why use:

S="";i=""

instead of just i=0?

Or, simpler still:

awk '$1=="ID"{S=$0} $1=="CC" && S!=""{print S,$0;S=""}' Input_file > Output_file
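
To see what the guard on S buys, here is a small sketch that runs an unguarded awk program and a guarded one on a made-up input. The input starts with a CC line that has no ID before it and then has two CC lines in a row; file1 in this thread contains neither case. The function names sample, unguarded, and guarded are invented for this illustration.

```shell
# Made-up input: a leading CC with no ID, then one ID followed by two CC lines.
sample() {
    printf '%s\n' 'CC   stray' 'ID   A' 'CC   first' 'CC   second'
}

# Unguarded: every CC line is printed with whatever ID was last seen.
unguarded() {
    awk '$1 == "ID" { id = $0; next } $1 == "CC" { print id, $0 }'
}

# Guarded: a CC line is printed only while an unused ID line is pending.
guarded() {
    awk '$1 == "ID" { S = $0 } $1 == "CC" && S != "" { print S, $0; S = "" }'
}

sample | unguarded   # three lines; the first has an empty ID part
sample | guarded     # one line: the ID joined with the first CC only
```

On input shaped like file1 both programs give the same result, so the guard only matters if the data can contain stray or repeated CC lines.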

Akshay_Hegde · October 30, 2014, 3:37am

Try

$ awk '$1=="CC" && p{print s OFS $0}p=$1=="ID"{s=$0}' infile

Scrutinizer · October 30, 2014, 4:48am

sed version:

sed -n '/^CC/{x;G;/^ID/s/\n/ /p;};h'  file
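
For readers less used to the hold space, the same sed commands can be spread over several lines with comments. This is just an annotated sketch of the one-liner above, wrapped in a function named join_sed (a made-up name) so it reads from standard input. It assumes a sed that accepts comment lines inside a { } block, as GNU sed does.

```shell
# join_sed (hypothetical wrapper): the one-liner above, one command per line.
join_sed() {
    sed -n '
    /^CC/{
        # swap: pattern space = previous line, hold space = this CC line
        x
        # append the CC line after a newline: "previous\nCC"
        G
        # if the previous line was an ID line, join the two and print
        /^ID/s/\n/ /p
    }
    # save the current pattern space as the "previous line" for the next cycle
    h
    '
}
```

It would be called as join_sed < file1 > file2.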