Parsing a file based on next line

I have a file1 like

ID   E2AK1_HUMAN             Reviewed;         630 AA.
CC   -!- SUBCELLULAR LOCATION: Host nucleus {ECO:0000305}.
ID   E1A_ADEM1               Reviewed;         200 AA.
ID   E1A_ADES7               Reviewed;         266 AA.
CC   -!- SUBCELLULAR LOCATION: Host nucleus {ECO:0000305}.
ID   E1B55_ADE02             Reviewed;         495 AA.
CC   -!- SUBCELLULAR LOCATION: Membrane {ECO:0000269|PubMed:10211970}.
ID   E1B9_ADE07              Reviewed;          88 AA.
ID   E1BL_ADE05              Reviewed;         496 AA.
ID   E1BL_ADET1              Reviewed;         391 AA.
ID   E1BS_ADE02              Reviewed;         175 AA.
CC   -!- SUBCELLULAR LOCATION: Cytoplasm {ECO:0000250}. Host
ID   E1BS_ADE04              Reviewed;         142 AA.
CC   -!- SUBCELLULAR LOCATION: Host cell membrane {ECO:0000250}. Host
ID   E2204_ARATH             Reviewed;         329 AA.
ID   E2AB_ECOLX              Reviewed;         123 AA.
CC   -!- SUBCELLULAR LOCATION: Cytoplasm {ECO:0000250}.
ID   E2AK1_MACFA             Reviewed;         631 AA. 

I want to create a file2 like

ID   E2AK1_HUMAN             Reviewed;         630 AA. CC   -!- SUBCELLULAR LOCATION: Host nucleus {ECO:0000305}.
ID   E1A_ADES7               Reviewed;         266 AA. CC   -!- SUBCELLULAR LOCATION: Host nucleus {ECO:0000305}.
ID   E1B55_ADE02             Reviewed;         495 AA. CC   -!- SUBCELLULAR LOCATION: Membrane {ECO:0000269|PubMed:10211970}.
ID   E1BS_ADE02              Reviewed;         175 AA. CC   -!- SUBCELLULAR LOCATION: Cytoplasm {ECO:0000250}. Host
ID   E1BS_ADE04              Reviewed;         142 AA. CC   -!- SUBCELLULAR LOCATION: Host cell membrane {ECO:0000250}. Host
ID   E2AB_ECOLX              Reviewed;         123 AA. CC   -!- SUBCELLULAR LOCATION: Cytoplasm {ECO:0000250}.

A line starting with ID should be kept only if the next line starts with CC. If a line starting with ID is followed by another line starting with ID, the first of the two should be deleted (for example, line 3 in file1 is deleted because the next line starts with ID). Conversely, line 1 in file1 starts with ID and the next line starts with CC, so the two are joined to give line 1 in file2.

Use code tags, not icode please.

```text
stuff
```

or the button.

With minimal validation

perl -lne '/^CC\s+/ && $previous && print "$previous $_"; $previous = $_' file1 > file2

With specific validation

perl -lne '/^CC\s+/ && $previous =~ /^ID\s+/ && print "$previous $_"; $previous = $_' file1 > file2

Whenever you need to solve a problem, try to design your approach before writing code.

Pseudocode:

while read each line from file
do
   if line starts with ID then
       save the line in var
   fi
   if line starts with CC and var is not empty then
       print var followed by the current line
       clear var
   fi
done
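
The loop above could be sketched in plain POSIX shell roughly as follows. The function name `join_id_cc` is made up for illustration, and the sketch assumes that the CC line is appended to the saved ID line and that each ID line is used at most once:

```shell
# join_id_cc: hypothetical helper; reads the records on stdin.
join_id_cc() {
    id=""
    while IFS= read -r line; do
        case $line in
            ID*) id=$line ;;                  # remember the most recent ID line
            CC*) if [ -n "$id" ]; then        # only pair a CC line with a pending ID line
                     printf '%s %s\n' "$id" "$line"
                     id=""                    # each ID line is used at most once
                 fi ;;
        esac
    done
}
```

It could then be called as `join_id_cc < file1 > file2`. For large files the perl or awk versions will likely be much faster than a shell read loop.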

And, if you prefer awk instead of perl, you could try:

awk '
$1 == "ID" { id = $0; next }
$1 == "CC" { print id, $0 }
' file1 > file2

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.

Hello Sammy777,

Here is another awk approach that may help you too.

awk '($1 == "ID"){S=$0;++i;{if(i>1){i=1}}} ($1 == "CC"){if(i==1){print S OFS $0;S="";i=""}}'  Input_file > Output_file

Thanks,
R. Singh

If you're going to take this step to verify that you only print a "CC" line that appears after an "ID" line that hasn't already been printed, why use:

++i;{if(i>1){i=1}}

instead of the much simpler i=1? And why use:

S="";i=""

instead of just i=0?
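
With both of those simplifications applied, the flag-based version might look something like this sketch (the wrapper name `join_flag` is made up so it can be tried on any input):

```shell
# join_flag: hypothetical wrapper; reads the records on stdin.
join_flag() {
    awk '
    # remember the ID line and mark it as pending
    $1 == "ID" { S = $0; i = 1 }
    # join a CC line with a pending ID line once, then clear the flag
    $1 == "CC" && i == 1 { print S OFS $0; i = 0 }
    '
}
```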

Or, simpler still:

awk '$1=="ID"{S=$0} $1=="CC" && S!=""{print S,$0;S=""}' Input_file > Output_file

Try

$ awk '$1=="CC" && p{print s OFS $0}p=$1=="ID"{s=$0}' infile

sed version:

sed -n '/^CC/{x;G;/^ID/s/\n/ /p;};h'  file
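
For readers less used to sed's hold space, the same script can be spread over several lines with comments. This is only an annotated sketch of the one-liner above; the wrapper name `join_sed` is made up:

```shell
# join_sed: hypothetical wrapper; reads the records on stdin.
join_sed() {
    sed -n '
    /^CC/{
        # swap: pattern space gets the previous line, hold space gets this CC line
        x
        # append the CC line to the previous line, separated by a newline
        G
        # if the previous line was an ID line, join the pair with a space and print it
        /^ID/s/\n/ /p
    }
    # remember the current pattern space for the next cycle
    h
    '
}
```

Note that after a successful join the final `h` stores the joined line, so a second CC line directly after the first would be joined onto it again. The sample file has no such case.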