Extracting content from an XML file

Hello All,

Hope you are doing well!

I have a small snippet in the format below in an XML file:

<UML:ModelElement.taggedValue>
				<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0001X
HLD_DOORS_002X"/>
				<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0231X
HLD_DOORS_003X;HLD_DOORS_0021"/>
				<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0232X
HLD_DOORS_003X;HLD_DOORS_ijkl"/>
				<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0345X
HLD_DOORS_05762X;HLD_DOORS_aasja"/>
				<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0001X
HLD_DOORS_002X"/>
				<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0001X
HLD_DOORS_002X"/>
				<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0001X
HLD_DOORS_002X"/>
				<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0001X
HLD_DOORS_002X"/>
				<UML:TaggedValue tag="version" value="1.0"/>
				<UML:TaggedValue tag="author" value="suvendu.rath"/>
				<UML:TaggedValue tag="created_date" value="2013-05-02 10:40:30"/>
				<UML:TaggedValue tag="modified_date" value="2013-05-02 16:59:06"/>
				<UML:TaggedValue tag="package" value="EAPK_6C9E48AC_4D1E_4953_A547_C222079BD1DD"/>
				<UML:TaggedValue tag="type" value="Sequence"/>
				<UML:TaggedValue tag="swimlanes" value="locked=false;orientation=0;width=0;inbar=false;names=false;color=0;bold=false;fcol=0;;cls=0;"/>
				<UML:TaggedValue tag="matrixitems" value="locked=false;matrixactive=false;swimlanesactive=true;width=1;"/>
				<UML:TaggedValue tag="ea_localid" value="63"/>
				<UML:TaggedValue tag="EAStyle" value="ShowPrivate=1;ShowProtected=1;ShowPublic=1;HideRelationships=0;Locked=0;Border=1;HighlightForeign=1;PackageContents=1;SequenceNotes=0;ScalePrintImage=0;PPgs.cx=2;PPgs.cy=1;DocSize.cx=850;DocSize.cy=1098;ShowDetails=0;Orientation=P;Zoom=100;ShowTags=0;OpParams=1;VisibleAttributeDetail=0;ShowOpRetType=1;ShowIcons=1;CollabNums=0;HideProps=0;ShowReqs=0;ShowCons=0;PaperSize=1;HideParents=0;UseAlias=0;HideAtts=0;HideOps=0;HideStereo=0;HideElemStereo=0;ShowTests=0;ShowMaint=0;ConnectorNotation=UML 2.1;ExplicitNavigability=0;AdvancedElementProps=1;AdvancedFeatureProps=1;AdvancedConnectorProps=1;ShowNotes=0;SuppressBrackets=0;SuppConnectorLabels=0;PrintPageHeadFoot=0;ShowAsList=0;"/>
				<UML:TaggedValue tag="styleex" value="ExcludeRTF=0;DocAll=0;HideQuals=0;AttPkg=1;ShowTests=0;ShowMaint=0;SuppressFOC=0;INT_ARGS=;INT_RET=;INT_ATT=;SeqTopMargin=50;MatrixActive=0;SwimlanesActive=1;MatrixLineWidth=1;MatrixLocked=0;TConnectorNotation=UML 2.1;TExplicitNavigability=0;AdvancedElementProps=1;AdvancedFeatureProps=1;AdvancedConnectorProps=1;ProfileData=;MDGDgm=;STBLDgm=;ShowNotes=0;VisibleAttributeDetail=0;ShowOpRetType=1;SuppressBrackets=0;SuppConnectorLabels=0;PrintPageHeadFoot=0;ShowAsList=0;"/>
</UML:ModelElement.taggedValue>

I want to export the tags that start with HLD_EA and HLD_DOORS.
These tags appear only in these lines:

<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0001X
HLD_DOORS_002X"/>
				<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0231X
HLD_DOORS_003X;HLD_DOORS_0021"/>
				<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0232X
HLD_DOORS_003X;HLD_DOORS_ijkl"/>
				<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0345X
HLD_DOORS_05762X;HLD_DOORS_aasja"/>
				<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0001X
HLD_DOORS_002X"/>
				<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0001X
HLD_DOORS_002X"/>
				<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0001X
HLD_DOORS_002X"/>
				<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0001X
HLD_DOORS_002X"/>

Now I want to write these tags to an Excel sheet (or any other file type).

My file should look like this:

HLD_EA_0001X HLD_DOORS_002X
HLD_EA_0231X HLD_DOORS_003X HLD_DOORS_0021
HLD_EA_0232X HLD_DOORS_003X HLD_DOORS_ijkl

Can you please help me write this script?

Thanks,
Suvendu

Please show us your attempts at the solution.

I am not an expert in shell scripting, but this is my approach:

sed -n '/<UML:TaggedValue tag="documentation" value="This sequence

"[HLD]/,/<\/Variable>/{
s/.*=\("[^"]*"\).*/\1/
t prnt
b
:prnt
p
}' file
grep "documentation" file | grep -o -E "HLD_[0-9a-zA-Z_]+"
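Note that grep -o prints every match on its own line, so the grouping per documentation entry is lost. A minimal sketch that keeps the tags of one entry together on one line could look like the following. It assumes each TaggedValue element sits on a single physical line; the function name group_tags and the sample data are only for illustration.

```shell
#!/bin/sh
# Sketch: print the HLD_ tags of each "documentation" entry on one line.
# Assumption: every <UML:TaggedValue .../> element is on a single line.

# Hypothetical sample input, modelled on the data in the question.
sample='<UML:TaggedValue tag="documentation" value="This sequence HLD_EA_0001X HLD_DOORS_002X"/>
<UML:TaggedValue tag="documentation" value="This sequence HLD_EA_0231X HLD_DOORS_003X;HLD_DOORS_0021"/>
<UML:TaggedValue tag="version" value="1.0"/>'

# group_tags [FILE...]: reads the named files, or stdin when none are given.
group_tags() {
    grep 'tag="documentation"' "$@" |
    while IFS= read -r line; do
        # grep -o emits one tag per line; paste -s joins them with spaces.
        printf '%s\n' "$line" |
            grep -o -E 'HLD_[0-9A-Za-z_]+' |
            paste -s -d ' ' -
    done
}

printf '%s\n' "$sample" | group_tags
```

With a real file it would be called as group_tags file > tags.txt.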

Thanks a lot for your reply.

The script works fine.

<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0001X
HLD_DOORS_002X"/>
				<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0231X
HLD_DOORS_003X;HLD_DOORS_0021"/>
				<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0232X
HLD_DOORS_003X;HLD_DOORS_ijkl"/>
				<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0345X
HLD_DOORS_05762X;HLD_DOORS_aasja"/>
				<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0001X
HLD_DOORS_002X"/>
				<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0001X
HLD_DOORS_002X"/>
				<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0001X
HLD_DOORS_002X"/>
				<UML:TaggedValue tag="documentation" value="This sequence

HLD_EA_0001X
HLD_DOORS_002X"/>

From the lines above, I want to extract the content into an Excel file.
For example, my Excel file should look like this:

1st column in Excel    2nd column in Excel
HLD_EA_0001X           HLD_DOORS_002X
HLD_EA_0231X           HLD_DOORS_003X
                       HLD_DOORS_0021
HLD_EA_0232X           HLD_DOORS_003X
                       HLD_DOORS_ijkl

I hope you have understood my question.

I do not know whether this is possible in a shell script, but as far as I know it is possible in Perl. Please share your ideas and suggestions.

---------- Post updated at 08:48 AM ---------- Previous update was at 02:40 AM ----------

Any suggestions, guys?

Can I use the Spreadsheet::WriteExcel module in Perl to do this?

Or is a simpler solution possible?

perl -ne 'if (/documentation/){while(/(HLD_\w+)/g){print "$1 "};print "\n"}' file

And yes, you may use the Spreadsheet::WriteExcel module to write to an xls file.

Hello Balajesuri,

Thanks for your reply. It works fine.

I am now trying to do the same using Spreadsheet::WriteExcel.

HLD_DOORS_XXX needs to be in the first column of the xls file.
The corresponding HLD_EA_XXX needs to be in the second column.

Any ideas or suggestions are always welcome.

Give it a try, suvendu4urs. Take a look at the sample program for WriteExcel on CPAN. Let us know what you have cooked and where you're stuck :)

Hello Balajesuri,

Things have started working as required, but I am still missing a few ideas. Here is my approach:

- Extracted the content from the XML file using your Perl script:

perl -ne 'if (/documentation/){while(/(HLD_\w+)/g){print "$1 "};print "\n"}' file

- Wrote the script below, which takes two files as command-line arguments:
the first is the output of the step above,
the second is the name of the new .xls file.

#!/usr/bin/perl -w

    use strict;
    use Spreadsheet::WriteExcel;

    # Check for valid number of arguments
    if (@ARGV != 2) {
        die("Usage: tab2xls tabfile.txt newfile.xls\n");
    }

    # Open the whitespace-delimited file produced by the previous step
    open(my $tabfile, '<', $ARGV[0]) or die "$ARGV[0]: $!";

    # Create a new Excel workbook (new() returns undef on failure)
    my $workbook  = Spreadsheet::WriteExcel->new($ARGV[1])
        or die "$ARGV[1]: $!";
    my $worksheet = $workbook->add_worksheet();

    # Row and column are zero indexed
    my $row = 0;

    while (<$tabfile>) {
       chomp;
       # split(' ') splits on any run of whitespace and skips leading whitespace
       my @Fld = split(' ', $_);

       my $col = 0;
       foreach my $token (@Fld) {
           $worksheet->write($row, $col, $token);
           $col++;
       }
       $row++;
    }

    # Close the input file and flush the workbook to disk
    close($tabfile);
    $workbook->close();

I split each line wherever there is whitespace. The data is now written to the Excel sheet,
but multiple columns are created.

Maybe I will figure out the solution for this myself. Your ideas and guidance are always welcome.

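In Perl, split(' ', $_) is a special case: it already splits on any run of whitespace (spaces or tabs) and ignores leading whitespace. Splitting explicitly on /\s+/ makes that intent visible in the code; the only difference is that leading whitespace then yields an empty first field.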

# Split on white spaces
my @Fld = split(/\s+/, $_);

Thanks for your input.

Now I have all the required details, but in different columns. Here is the outcome:

1st column      2nd column        3rd column        4th column
HLD_EA_0001X    HLD_DOORS_002X
HLD_EA_003X     HLD_DOORS_003X    HLD_DOORS_0021    HLD_DOORS_XXX
HLD_EA_0232X    HLD_DOORS_003X    HLD_DOORS_ijkl    HLD_DOORS_CDKL

But I do not want the 3rd and 4th columns.
All the HLD_EA tags need to be in the 1st column.
All the corresponding HLD_DOORS tags need to be in the 2nd column, so the data from the 3rd and 4th columns must move into the 2nd column.

I hope you have understood my question.

---------- Post updated at 11:12 AM ---------- Previous update was at 11:10 AM ----------

The output should look like this:

1st column      2nd column
HLD_EA_0001X    HLD_DOORS_002X
HLD_EA_003X     HLD_DOORS_003X
                HLD_DOORS_0021
                HLD_DOORS_XXX
HLD_EA_0232X    HLD_DOORS_003X
                HLD_DOORS_ijkl
                HLD_DOORS_CDKL
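
One way to get this layout, sketched here under the assumption that each input line holds one HLD_EA tag followed by its HLD_DOORS tags, is to emit one row per HLD_DOORS tag and leave the first column empty on the continuation rows. The sketch below writes CSV, which Excel opens directly; the function name to_two_columns is made up for illustration. In the Spreadsheet::WriteExcel script the same idea would mean writing the first token to column 0 and every remaining token to column 1, advancing the row for each one.

```shell
#!/bin/sh
# Sketch: turn "EA DOORS1 DOORS2 ..." lines into a two-column CSV.
# Assumption: the first field on every line is the HLD_EA tag.

# to_two_columns [FILE...]: reads the named files, or stdin when none are given.
to_two_columns() {
    awk '{
        # An entry without any HLD_DOORS tag still gets one row.
        if (NF == 1) { print $1 ","; next }
        for (i = 2; i <= NF; i++) {
            # Only the first row of an entry carries the HLD_EA tag.
            first = (i == 2) ? $1 : ""
            print first "," $i
        }
    }' "$@"
}

# Example with two lines taken from the outcome shown above.
printf '%s\n' \
    'HLD_EA_0001X HLD_DOORS_002X' \
    'HLD_EA_003X HLD_DOORS_003X HLD_DOORS_0021 HLD_DOORS_XXX' |
    to_two_columns
```

Redirecting the output to a file with a .csv extension gives something a spreadsheet program can load with the two columns already separated.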

Hello guys,
Anyway, I have got what I needed.

Now I am trying something different with the pattern extraction.

Below is my XML code:

<UML:TaggedValue tag="documentation" value="This sequence HLD_EA_0001X SRS_DOORS_002X"/>
<UML:TaggedValue tag="documentation" value="This sequence HLD_EA_0231X SRS_DOORS_003X;SRS_DOORS_0021"/>
<UML:TaggedValue tag="documentation" value="This sequence HLD_EA_0232X SRS_DOORS_003X;SRS_DOORS_ijkl"/>
<UML:TaggedValue tag="documentation" value="This sequence HLD_EA_0345X SRS_DOORS_05762X;SRS_DOORS_aasja"/>
<UML:TaggedValue tag="documentation" value="This sequence HLD_EA_0001X SRS_DOORS_002X"/>
<UML:TaggedValue tag="documentation" value="This sequence HLD_EA_0001X SRS_DOORS_002X"/>
<UML:TaggedValue tag="documentation" value="This sequence HLD_EA_0001X SRS_DOORS_002X"/>
<UML:TaggedValue tag="documentation" value="This sequence HLD_EA_0001X SRS_DOORS_002X"/>

I know that the code below extracts the HLD tags:

perl -ne 'if (/documentation/){while(/(HLD_\w+)/g){print "$1 "};print "\n"}' file

I want to change the code above so that each HLD tag and its corresponding SRS tags are extracted.

Some change in the while loop is required to achieve this, but I cannot work out exactly what.
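
The usual way to match either prefix is an alternation in the pattern; in the Perl one-liner that would be (?:HLD|SRS)_\w+ inside the capture group. As a rough shell sketch of the same idea, assuming one TaggedValue element per line as in the sample above, awk can collect every HLD_ or SRS_ tag of a documentation line. The function name extract_hld_srs is only illustrative.

```shell
#!/bin/sh
# Sketch: print the HLD_ and SRS_ tags of each "documentation" line together.
# Assumption: every <UML:TaggedValue .../> element is on a single line.

# extract_hld_srs [FILE...]: reads the named files, or stdin when none are given.
extract_hld_srs() {
    awk '/tag="documentation"/ {
        out = ""
        rest = $0
        # Repeatedly take the leftmost match, then continue after it.
        while (match(rest, /(HLD|SRS)_[0-9A-Za-z_]+/)) {
            tag = substr(rest, RSTART, RLENGTH)
            out = (out == "") ? tag : out " " tag
            rest = substr(rest, RSTART + RLENGTH)
        }
        if (out != "") print out
    }' "$@"
}

# Example with one line from the XML above.
printf '%s\n' \
    '<UML:TaggedValue tag="documentation" value="This sequence HLD_EA_0231X SRS_DOORS_003X;SRS_DOORS_0021"/>' |
    extract_hld_srs
```

With a real file it would be called as extract_hld_srs file.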