Help with a data rearrangement problem

cpp_beginner · December 13, 2011, 4:31am

Input file:

<symbol>Q9Y8G1</symbol>
<name>Q9Y8G1_EMENI</name>

<symbol>Q6V953</symbol>
<symbol>Q5B8K1</symbol>
<name>Q6V953_EMENI</name>

<symbol>G1A416</symbol>
<name>G1A416_9FUNG</name>

<symbol>D4N894</symbol>
<name>D4N894_PLEER</name>

<symbol>B0FFU4</symbol>
<symbol>B0LF02</symbol>
<symbol>B0LF04</symbol>
<symbol>B0LF05</symbol>
<symbol>B0LF07</symbol>
<symbol>B0LF08</symbol>
<name>B0FFU4_9HYPO</name>
.
.

Desired output file:

Q9Y8G1    Q9Y8G1_EMENI
Q6V953/Q5B8K1    Q6V953_EMENI
G1A416    G1A416_9FUNG
D4N894    D4N894_PLEER
B0FFU4/B0LF02/B0LF04/B0LF05/B0LF07/B0LF08    B0FFU4_9HYPO
.
.

Condition to generate desired output file:

The content between "<symbol>" and "</symbol>" should go in column 1 of the desired output file;
If more than one symbol shares one "<name>", join the symbols with "/" to show that they share it;
The content between "<name>" and "</name>" should go in column 2 of the desired output file;

"\n" (a blank line) can be treated as the separator between groups of data.

Thanks for any advice.

ahamed101 · December 13, 2011, 6:35am

Try this...

awk -F"<symbol>|</symbol>|<name>|</name>" '/symbol/{x=x?x"/"$2:$2}/name/{print x"\t"$2;x=""}' input_file

--ahamed

vivek_d_r · December 13, 2011, 6:41am

here is your code dude...

cat share.txt
<symbol>Q9Y8G1</symbol>
<name>Q9Y8G1_EMENI</name>

<symbol>Q6V953</symbol>
<symbol>Q5B8K1</symbol>
<name>Q6V953_EMENI</name>

<symbol>G1A416</symbol>
<name>G1A416_9FUNG</name>

<symbol>D4N894</symbol>
<name>D4N894_PLEER</name>

<symbol>B0FFU4</symbol>
<symbol>B0LF02</symbol>
<symbol>B0LF04</symbol>
<symbol>B0LF05</symbol>
<symbol>B0LF07</symbol>
<symbol>B0LF08</symbol>
<name>B0FFU4_9HYPO</name>

#!/bin/sh
# Join the symbols of each group with "/" and print them next to the name.

symbols=""
while read -r line
do
        case "$line" in
        *"<symbol>"*)
                value=${line#*>}        # drop the opening tag
                value=${value%%<*}      # drop the closing tag
                symbols=${symbols:+$symbols/}$value   # "/" only between symbols
                ;;
        *"<name>"*)
                value=${line#*>}
                value=${value%%<*}
                printf '%s\t%s\n' "$symbols" "$value"
                symbols=""
                ;;
        esac
done < share.txt

balajesuri · December 13, 2011, 6:42am

perl -ne 'if(/<symbol>/../<\/name>/){chomp;if(/<symbol>/){print "/" if $cnt;s/<\/?symbol>//g;print;$cnt++}if(/<name>/){s/<\/?name>//g;print "\t$_\n";$cnt=0}}' inputfile

vivek_d_r · December 13, 2011, 6:43am

@ahamed..: you are good at this.. :-)... yours is better again...

michaelrozar17 · December 13, 2011, 6:59am

Alternate awk solution..

awk -F'[><]' '/symbol/{x=x"/"$3;++n;next}n{print substr(x,2),$3;x=n=""}' inputfile
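
Since the original post says the groups are separated by blank lines, another option might be awk's paragraph mode (RS=""), where each group becomes one record and each line one field. This is only a rough sketch: the function name by_paragraph and the sample input are made up for illustration, and it assumes each group has one <name> line.

```shell
#!/bin/sh
# by_paragraph is a hypothetical helper name, used only for this example.
by_paragraph() {
    awk 'BEGIN { RS = ""; FS = "\n" }   # blank-line separated records, one line per field
    {
        syms = ""; name = ""
        for (i = 1; i <= NF; i++) {
            val = $i
            gsub(/<[^>]*>/, "", val)    # strip the tags, keep the text
            if ($i ~ /<symbol>/)
                syms = (syms == "" ? val : syms "/" val)
            else if ($i ~ /<name>/)
                name = val
        }
        print syms "\t" name
    }' "$@"
}

# Two sample groups taken from the input shown in the first post.
printf '%s\n' '<symbol>Q9Y8G1</symbol>' '<name>Q9Y8G1_EMENI</name>' '' \
    '<symbol>Q6V953</symbol>' '<symbol>Q5B8K1</symbol>' '<name>Q6V953_EMENI</name>' |
    by_paragraph
```

Because the whole group is read before anything is printed, the same sketch should not care whether the <name> line comes before or after the <symbol> lines.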

cpp_beginner · December 13, 2011, 10:17pm

Thanks ahamed.
Would you mind explaining your code a little more?
I'm quite confused by the following part:

'/symbol/{x=x?x"/"$2:$2}/name/{print x"\t"$2;x=""}'

Apart from that, what if my input file looks like this:

<name>Q9Y8G1_EMENI</name>
<symbol>Q9Y8G1</symbol>

<name>Q6V953_EMENI</name>
<symbol>Q6V953</symbol>
<symbol>Q5B8K1</symbol>

<name>G1A416_9FUNG</name>
<symbol>G1A416</symbol>

<name>D4N894_PLEER</name>
<symbol>D4N894</symbol>

<name>B0FFU4_9HYPO</name>
<symbol>B0FFU4</symbol>
<symbol>B0LF02</symbol>
<symbol>B0LF04</symbol>
<symbol>B0LF05</symbol>
<symbol>B0LF07</symbol>
<symbol>B0LF08</symbol>

How should I edit your code to generate the following output?

Q9Y8G1_EMENI   Q9Y8G1
Q6V953_EMENI   Q6V953/Q5B8K1
G1A416_9FUNG   G1A416
D4N894_PLEER   D4N894
B0FFU4_9HYPO   B0FFU4/B0LF02/B0LF04/B0LF05/B0LF07/B0LF08

Thanks for the advice.

ahamed101 · December 13, 2011, 10:34pm

Try this...

awk -F"[><]" '/name/{if(x){sub("/$","",x);print x;x=""}x=$3"\t"}/symbol/{x=x $3"/"}END{sub("/$","",x);print x}' input_file

--ahamed

---------- Post updated at 07:34 PM ---------- Previous update was at 07:30 PM ----------

The ternary operator is there to avoid adding "/" at the beginning: "/" is added only if x is not empty!

Something like this...

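if (x != "")      # x already holds at least one symbol
{
  x = x "/" $2    # append the separator, then the new symbol
}
else
{
  x = $2          # first symbol of the group, so no leading "/"
}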

HTH
--ahamed
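
For anyone reading along, here is the first one-liner spread over several lines with comments. Treat it as a sketch of the logic only: the function name join_symbols and the three sample lines are just for illustration, and it matches on the full tags <symbol> and <name> instead of the bare words.

```shell
#!/bin/sh
# join_symbols is a made-up name for this example; it wraps the one-liner.
join_symbols() {
    awk -F'<symbol>|</symbol>|<name>|</name>' '
        # Symbol line: the tags are the field separators, so $2 is the text between them.
        /<symbol>/ {
            x = (x != "") ? x "/" $2 : $2
        }
        # Name line: print the joined symbols, a tab, then the name, and reset x.
        /<name>/ {
            print x "\t" $2
            x = ""
        }
    ' "$@"
}

# Sample group from the first post.
printf '%s\n' '<symbol>Q6V953</symbol>' '<symbol>Q5B8K1</symbol>' '<name>Q6V953_EMENI</name>' |
    join_symbols
```

Resetting x after each name line is what keeps the symbols of one group from leaking into the next.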

cpp_beginner · December 14, 2011, 5:13am

hi ahamed,

I just found a bug when using your awk code on input like this:

<name>A2ASPNC</name>
<symbol>Remark:_alternate_name_YHR211W</symbol>

<name>9STRA</name>
<symbol>Unnamed_product</symbol>

With your awk code, I get the following result:

A2ASPNC	
Remark:_alternate_name_YHR211W	Remark:_alternate_name_YHR211W
9STRA	
Unnamed_product	Unnamed_product

Ideally, it should show the result like this:

A2ASPNC	  Remark:_alternate_name_YHR211W	
9STRA   Unnamed_product

Do you have any idea how to fix this bug?
I found that the error happens because the text "name" appears in the content between "<symbol>" and "</symbol>".
Many thanks.

ahamed101 · December 14, 2011, 5:20am

Try this...

awk -F"[><]" '/<name>/{if(x){sub("/$","",x);print x;x=""}x=$3"\t"}
/<symbol>/{x=x $3"/"}END{sub("/$","",x);print x}' input_file

--ahamed
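
Why this helps: the bare pattern /name/ matches any line that contains the letters "name", and that includes the symbol lines Remark:_alternate_name_YHR211W and Unnamed_product, so they were treated as new names. Matching on the full tag avoids that. Below is a small sketch to check it on the two problem records; the function name name_first is made up for the example, and the END block skips printing when there was no input at all.

```shell
#!/bin/sh
# name_first is a hypothetical wrapper name, used only for this check.
name_first() {
    awk -F'[><]' '
        # New group: flush the previous one, then start with the name and a tab.
        /<name>/ {
            if (x != "") { sub("/$", "", x); print x }
            x = $3 "\t"
        }
        # Append each symbol followed by "/"; the last "/" is trimmed on output.
        /<symbol>/ { x = x $3 "/" }
        END { if (x != "") { sub("/$", "", x); print x } }
    ' "$@"
}

printf '%s\n' '<name>A2ASPNC</name>' '<symbol>Remark:_alternate_name_YHR211W</symbol>' '' \
    '<name>9STRA</name>' '<symbol>Unnamed_product</symbol>' |
    name_first
```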

cpp_beginner · December 15, 2011, 5:02am

Thanks again ahamed, your awk code worked fine.
I'm facing another problem when my data looks like this:

__<tmp>SAST</tmp>
______<Reference_id="92320298"_key="4"_type="PAPER"/>
______<Reference_id="1621096"_key="5"_type="TEDT"/>
____</citation>
____<scope>SEQUENCE</scope>
__</reference>
__<Reference_id="Q9UWM9"_key="6"_type="ModelPortal"/>
__<Reference_id="AO:0005525"_key="7"_type="Go">
____<property_type="term"_value="F:GTP_binding"/>
____<property_type="evidence"_value="IEA:InterPro"/>
__</Reference>

__<tmp>G3FH</tmp>
____<scope>Sample</scope>
__</reference>
__<Reference_evidence="2"_id="JF460418"_key="3"_type="EMBL">
____<property_type="protein_sequence_ID"_value="AEN93129.1"/>
____<property_type="molecule_type"_value="Genomic_DNA"/>
__</Reference>

__<tmp>STAAD</tmp>
______<Reference_id="92320298"_key="4"_type="PAPER"/>
______<Reference_id="1621096"_key="5"_type="TEDT"/>
____</citation>
____<scope>SEQUENCE</scope>
__</reference>
__<Reference_id="AO:0005525"_key="7"_type="Go">
____<property_type="term"_value="F:TMP_binding"/>
__</Reference>

Desired output:

SAST AO:0005525 F:GTP_binding
G3FH - -
STAAD AO:0005525 F:TMP_binding

I just want to extract the text between "__<tmp>" and "</tmp>" as the first column of the output file;
Column 2 of the output file is the content of "__<Reference_id="" when "AO:XXXXXXX" is detected. If not, just use "-" to represent it;
Column 3 of the output file is taken from the line right after the "AO:XXXXXX" line, extracting only the text between "term"_value=" and "">"

Many thanks for any advice.

ahamed101 · December 15, 2011, 6:16am

What about __<Reference_id="Q9UWM9"_key="6"_type="ModelPortal"/>? That also has __<Reference_id="

--ahamed

cpp_beginner · December 15, 2011, 8:12pm

Hi ahamed,

I only want to extract the lines that have "__<Reference_id=" and whose content also has "AO:XXXXX".
So "__<Reference_id="Q9UWM9"_key="6"_type="ModelPortal"/>" is not included, because it doesn't have "AO:XXXXXX".
Thanks.

ahamed101 · December 16, 2011, 4:04am

Try this...

awk -F'[><"]' '/<tmp>/{if(id!="")print id,ao,term;id=$3;ao=term="-"}
/<Reference_id="AO:/{ao=$3;if((getline)>0&&/"term"/)term=$5}
END{if(id!="")print id,ao,term}' input_file

--ahamed