Change XML file structure script

cgkmal · October 5, 2011, 4:22am

Hi to all,

Maybe someone could help me. I want to transform the structure of a xml file.

I have this input.xml:

<?xml version="1.0" encoding="utf-8"?>
<votings>
  <file name="Reference 19762">
    <case id="No. 3 Div. 870">
      <j v="1">Peter</j>
      <j v="1">Ely</j>
      <j v="9">Mark</j>
    </case>
    <case id="No. 3 Div. 887">
      <j v="1">Mary</j>
      <j v="9">Peter</j>
      <j v="1">Ely</j>
      <j v="1">Perry</j>
      <j v="1">Mark</j>
    </case>
  </file>
</votings>

and the required output should be:

<Table ss:ExpandedColumnCount="3" ss:ExpandedRowCount="6" x:FullColumns="1"
   x:FullRows="1" ss:DefaultColumnWidth="60" ss:DefaultRowHeight="15">
   <Column ss:Width="30.75"/>
   <Column ss:Width="67.5" ss:Span="1"/>
   <Row ss:AutoFitHeight="0">
    <Cell ss:Index="2"><Data ss:Type="String">No. 3 Div. 870</Data></Cell>
    <Cell><Data ss:Type="String">No. 3 Div. 887</Data></Cell>
   </Row>
   <Row ss:AutoFitHeight="0">
    <Cell><Data ss:Type="String">Ely</Data></Cell>
    <Cell><Data ss:Type="Number">1</Data></Cell>
    <Cell><Data ss:Type="Number">1</Data></Cell>
   </Row>
   <Row ss:AutoFitHeight="0">
    <Cell><Data ss:Type="String">Mark</Data></Cell>
    <Cell><Data ss:Type="Number">9</Data></Cell>
    <Cell><Data ss:Type="Number">1</Data></Cell>
   </Row>
   <Row ss:AutoFitHeight="0">
    <Cell><Data ss:Type="String">Mary</Data></Cell>
    <Cell><Data ss:Type="Number">0</Data></Cell>
    <Cell><Data ss:Type="Number">1</Data></Cell>
   </Row>
   <Row ss:AutoFitHeight="0">
    <Cell><Data ss:Type="String">Perry</Data></Cell>
    <Cell><Data ss:Type="Number">0</Data></Cell>
    <Cell><Data ss:Type="Number">1</Data></Cell>
   </Row>
   <Row ss:AutoFitHeight="0">
    <Cell><Data ss:Type="String">Peter</Data></Cell>
    <Cell><Data ss:Type="Number">1</Data></Cell>
    <Cell><Data ss:Type="Number">9</Data></Cell>
   </Row>
  </Table>

As you can see, in input.xml there are 2 blocks of "cases" with names (it could be more than 2 "cases" blocks. e.g 5, 7, 8 etc.).
In both "cases" blocks appear some names repeated (Peter, Ely and Mark appear in both blocks)

Then, in the output the "Row" blocks should be obtained as follow:

Block 1:

<Table ss:ExpandedColumnCount="3" ss:ExpandedRowCount="6" x:FullColumns="1"

Variables here are in red:
ExpandedColumnCount = Number of "Cases" blocks + 1 = 2 +1 = 3
ExpandedRowCount = Number of unique names + 1 = 5 + 1 = 6

Block 2 (the first "Row" block):

   <Row ss:AutoFitHeight="0">
    <Cell ss:Index="2"><Data ss:Type="String">No. 3 Div. 870</Data></Cell>
    <Cell><Data ss:Type="String">No. 3 Div. 887</Data></Cell>
   </Row>

The values in red should be taken from "case id" in "cases" blocks.

Blocks 3,4,5...N ("Row" blocks for each unique name):

   <Row ss:AutoFitHeight="0">
    <Cell><Data ss:Type="String">Ely</Data></Cell>
    <Cell><Data ss:Type="Number">1</Data></Cell>
    <Cell><Data ss:Type="Number">1</Data></Cell>
   </Row>

The values in red are taken from "cases" blocks, but needs to look up unique names and show a unique "Row" block for each name and
all related values for each name within the same "Row" block as shown above.

I really hope you could help me with this. I'm a kind of beginner in this type of scripts.

Thanks in advance.

Regards.

Shell_Life · October 5, 2011, 10:45am

I read your requirements three times and I still could not figure it out.

You have a large project.

Why do you have to write a solution using unix shell script?

Try using a compiled language.

cgkmal · October 5, 2011, 11:54am

Hi, thanks for answer.

Sorry for my explanation. In input file there are repeated names and in the output only appear once with their corresponding values. Maybe you could understand better only seeing how is the input and how is the output.

I'm not sure, I've seen several questions of more complex xml transformations here, I hope this is not too complex. Maybe awk, sed or combination of those with bash.

This because is what I could understand and what I thought could be the solution. I really dont know a compiled language to try doing something like this.

Many thanks for any help.

Regards.

ahamed101 · October 5, 2011, 12:09pm

Like Shell_Life said, it is complex!
And yes it is possible through scripts, not that easy... But to what extend have you implemented this?...

btw, I also read your requirement 2 - 3 times, well couldn't catch all of them!

--ahamed

cgkmal · October 6, 2011, 5:08am

Hi ahamed,

I intend to parse the input in that way because the output would be open in MS Excel, but showing in different layout, as I said before, without repeat the names, only show them once and locate within the same block every associated value to each name. One block for each name.

Sorry for my explanation, I hope somebody could get my english explanation.

Thanks in advance.

Regards.

---------- Post updated 10-06-11 at 03:08 AM ---------- Previous update was 10-05-11 at 11:51 AM ----------

Hi again,

Answering my own question to show if anybody is interested in the future.

I had to separate step by step the conversion needed, it's not pure bash. I helped me with individual awk commands to get
each section of the script.

Probably the same script could be obtained in a unique awk program, I'll like to see how to join this code in a single awk program.

Well, the code I could work is:

#########################################################################################################################
### Begin of script of XML conversion ###################################################################################
#########################################################################################################################

Voting_Info="input.xml"

## (1) - Get cases id's between double quotes e.g. "<case id="No. 3 Div. 870">" and store them in varible array ###
oldIFS=$IFS
IFS=$'\n'
Cases=($(awk -F "[\"]" '/id=/{print $2}' "$Voting_Info"))
IFS=$oldIFS

## (2) - Count "cases" blocks
CasesNumber=($(awk -F "[\"]" '/id=/{print $2}' "$Voting_Info" | wc -l))
let "CasesNumber=$CasesNumber+1"  # Add "1" to set the value in "ExpandedColumnCount"

## (3) - Get unique names between ">" and "</j>", e.g. ">Mary</j>" and store them in varible array ###
UniqNames=($(awk -F "[><]" '/v=/{a[$3];} END{for (i in a) print i;}' "$Voting_Info" | sort))

## (4) Print first lines of output
echo "  <Table ss:ExpandedColumnCount=\""$CasesNumber"\" ss:ExpandedRowCount=\"6\" x:FullColumns=\"1\""
echo "   x:FullRows=\"1\" ss:DefaultColumnWidth=\"60\" ss:DefaultRowHeight=\"15\">"
echo "   <Column ss:Width=\"30.75\"/>"
echo "   <Column ss:Width=\"67.5\" ss:Span=\"1\"/>"

## (5) - Print first block, that is the "Cases" names block.
    echo "   <Row ss:AutoFitHeight=\"0\">"
    echo "    <Cell ss:Index=\"2\"><Data ss:Type=\"String\">""${Cases[0]}""</Data></Cell>"
#for i in "${Cases[@]}"
for ((i=1;i<${#Cases
[*]};i++))
   do
    echo "    <Cell><Data ss:Type=\"String\">""${Cases[$i]}""</Data></Cell>"
done

## (6) - Loop to get values of each name within all cases blocks
for j in "${UniqNames[@]}"
   do
    echo "   </Row>"
    echo "    <Row ss:AutoFitHeight=\"0\">"
    echo "    <Cell><Data ss:Type=\"String\">"$j"</Data></Cell>"
    awk -v Z=$j -F"[\"><]+" '/case id/{v=0}/\/case/{print "    <Cell><Data ss:Type=\"Number\">" v "</Data></Cell>"}$0 ~ Z{v=$3}' "$Voting_Info"
done
## (7) - Print last lines to complete output
echo "   </Row>"
echo "  </Table>"

Hope this helps.

Thanks as always for your help and time.

Regards

clx · October 6, 2011, 7:34am

If the rules are always same.. and the criteria for pattern matching are unchanged, which means the actual file is same as sample you have posted except the variables.

This is what I have..
Please note I have just tried to achieve your requirement i.e I hadn't thought about tuning and efficiency.

datafile=xmlfile
ExpandedColumnCount=$(( $(grep -c '<case id' ${datafile}) + 1 ))
ExpandedRowCount=$(( $(awk -F '[<>]' ' /<j/ {print $3| "sort -u|wc -l"}' ${datafile}) + 1))


cat <<-ENDCAT1
<Table ss:ExpandedColumnCount="${ExpandedColumnCount}" ss:ExpandedRowCount="${ExpandedRowCount}" x:FullColumns="1"
    x:FullRows="1" ss:DefaultColumnWidth="60" ss:DefaultRowHeight="15">
    <Column ss:Width="30.75"/>
    <Column ss:Width="67.5" ss:Span="1"/>
    <Row ss:AutoFitHeight="0">
$(awk -F '"' ' /case id=/ {print $2}' ${datafile} | sed 's|^|      <Cell ss:Index="2"><Data ss:Type="String">|g' | sed 's|$|</Data></Cell>|g')
    </Row>
ENDCAT1

awk -F '[<>]' ' /<j/ {print $3| "sort -u"}' ${datafile} | while read name
do
cat <<-ENDCAT2
    <Row ss:AutoFitHeight="0">
      <Cell><Data ss:Type="String">${name}</Data></Cell>"
$(awk -F '["<>]' -v n=$name '/<j/ && $5 == n {print $3}' ${datafile} | sed 's|^|      <Cell><Data ss:Type="Number">|g' | sed 's|$|</Data></Cell>|g')
    </Row>
ENDCAT2
done

echo '</Table>'

O/P

<Table ss:ExpandedColumnCount="3" ss:ExpandedRowCount="6" x:FullColumns="1"
    x:FullRows="1" ss:DefaultColumnWidth="60" ss:DefaultRowHeight="15">
    <Column ss:Width="30.75"/>
    <Column ss:Width="67.5" ss:Span="1"/>
    <Row ss:AutoFitHeight="0">
      <Cell ss:Index="2"><Data ss:Type="String">No. 3 Div. 870</Data></Cell>
      <Cell ss:Index="2"><Data ss:Type="String">No. 3 Div. 887</Data></Cell>
    </Row>
    <Row ss:AutoFitHeight="0">
      <Cell><Data ss:Type="String">Ely</Data></Cell>"
      <Cell><Data ss:Type="Number">1</Data></Cell>
      <Cell><Data ss:Type="Number">1</Data></Cell>
    </Row>
    <Row ss:AutoFitHeight="0">
      <Cell><Data ss:Type="String">Mark</Data></Cell>"
      <Cell><Data ss:Type="Number">9</Data></Cell>
      <Cell><Data ss:Type="Number">1</Data></Cell>
    </Row>
    <Row ss:AutoFitHeight="0">
      <Cell><Data ss:Type="String">Mary</Data></Cell>"
      <Cell><Data ss:Type="Number">1</Data></Cell>
    </Row>
    <Row ss:AutoFitHeight="0">
      <Cell><Data ss:Type="String">Perry</Data></Cell>"
      <Cell><Data ss:Type="Number">1</Data></Cell>
    </Row>
    <Row ss:AutoFitHeight="0">
      <Cell><Data ss:Type="String">Peter</Data></Cell>"
      <Cell><Data ss:Type="Number">1</Data></Cell>
      <Cell><Data ss:Type="Number">9</Data></Cell>
    </Row>
</Table>

Please note, this doesn't include one thing..
Putting entry with zero if the name doesn't appear in any row block. but i think you can try that yourself.

cgkmal · October 6, 2011, 5:13pm

Hello anchal_khare,

Many thanks for reply and give some of your time to share your knowledge to help. Your code works beautifully!

With your code I've learned more than one thing new.
1-) I didnt know that "cat" can be used in that way, it has a name use cat in that way?
2-) The use of "while read" in combination of awk commands to avoid several steps, processing, resources and memory using array variables.
3-) I had forgotten the great features of "sed" to replace directly at the begin or at the end of a string, great.

I only modified the following in your code (the added code in red).

I changed this:

$(awk -F '"' ' /case id=/ {print $2}' ${datafile} | sed 's|^|      <Cell ss:Index="2"><Data ss:Type="String">|g' | sed 's|$|</Data></Cell>|g')

for this:

$(awk -F '"' ' /case id=/ {print $2}' ${datafile} | sed q | sed 's|^|    <Cell ss:Index="2"><Data ss:Type="String">|g' | sed 's|$|</Data></Cell>|g')
$(awk -F '"' ' /case id=/ {print $2}' ${datafile} | sed 1d | sed 's|^|    <Cell><Data ss:Type="String">|g' | sed 's|$|</Data></Cell>|g')

and this:

$(awk -F '["<>]' -v n=$name '/<j/ && $5 == n {print $3}' ${datafile} | sed 's|^|      <Cell><Data ss:Type="Number">|g' | sed 's|$|</Data></Cell>|g')

for this (to get zeros when there is not a match)

$(awk -v Z=$name -F"[\"><]+" '/case id/{v=0}/\/case/{print "    <Cell><Data ss:Type=\"Number\">" v "</Data></Cell>"}$0 ~ Z{v=$3}' ${datafile})

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
I think I'll be able to emulate your code to extend it and process other file that I need to, after this input.xml and input1.xml have
been changed in similar way they will be part of an final output.xml like show below:

Line1..............
Line2..............
..
.
Line30............

Output of processing input1.xml # Output from other input.xml

Output of processing input2.xml # The output your script already does

Line31............
Line31............
..
.
.
Line50............

In order to get that, may you help me with suggestions regarding this (assume your code is rutine2 and rutine1 is a code I have
to add to process input1.xml and I have both rutines working already):

A suggestion or idea in how to surround your script with a loop to process all somename_ci.xml in folder with Rutine1 and proccess
somename_vi.xml in folder with Rutine2.

The final code should look like this:

# there are couples of files with the same name, only different at the end with "_ci" and "_vi". 
E.g. December_ci.xml and December_vi.xml, June_ci.xml and June_vi.xml,... etc.

For i=1 to CountOfAllFilesInFolder/2 # divided by 2 since there are many couples of files to process in one output each couple
 
file1=somename_ci.xml # somename is the string that will vary
file2=somename_vi.xml # somename is the string that will vary
  do
      Code to add first 30 fixed lines #I'll think of emulate the way you use "cat" to add this fixed lines
      Code to add lines after processing somename_ci.xml #Rutine1, I'll add it later.
      Code to add lines after processing somename_vi.xml #Rutine2, Your code....
      Code to add last 20 fixed lines  #I'll think of emulate the way you use "cat" to add this fixed lines
done

I hope you can give and idea of how to get this last part of my code.

Thanks in advance.

Regards.

clx · October 7, 2011, 3:19am

I didn't understand the "divided by 2" reason as your statement is not clear.
Probably you want to split the final output file for being too large?

From the comments, If I understood you correctly, This is what you can do..

1) create two file which contains all the "vi files" and "ci files". The the order must be sequential. i.e "June.ci" and "June.vi" must be at the same line in both the files.
2) If you are not sure about the files i.e "June.ci" is there but "june.vi" is missing, You need to handle that condition to ignore that file. (its easy)

$ ls *vi.xml
December_vi.xml  July_vi.xml      June_vi.xml
$ 
$ ls *ci.xml
December_ci.xml  July_ci.xml      June_ci.xml
$ 
$ 
$ 
$ cat proc 
for i in *_vi.xml
do
        echo $i >> vi_file_list
        echo ${i/_vi/_ci} >> ci_file_list
done
 
 
exec 6<vi_file_list
while read -r ci
do
        read -r vi <&6
        echo Code to add first 30 fixed lines
        echo Code to add lines after processing $ci
        echo Code to add lines after processing $vi
        echo Code to add last 20 fixed lines
done <ci_file_list
exec >&6-
$ 
$ 
$ 
$ ./proc 
Code to add first 30 fixed lines
Code to add lines after processing December_ci.xml
Code to add lines after processing December_vi.xml
Code to add last 20 fixed lines
Code to add first 30 fixed lines
Code to add lines after processing July_ci.xml
Code to add lines after processing July_vi.xml
Code to add last 20 fixed lines
Code to add first 30 fixed lines
Code to add lines after processing June_ci.xml
Code to add lines after processing June_vi.xml
Code to add last 20 fixed lines
$ 
$

cgkmal · October 9, 2011, 7:14pm

Hi anchal_khare,

I can see it works, many thanks, only I don't understand some of the syntax in your code, may you explain me please the parts in red please?

1-)

 echo ${i/_vi/_ci} # What does it mean?

2-)

exec 6<vi_file_list # What does the exec command do and 6 why?

3-)

while read -r ci # Why -r and not only r?

4-)

read -r vi <&6 # What does it mean?

5-)

exec >&6- # This is like closing the exec 6 block?, why you use "-"?

6)

./proc # I think I understand this a little bit, this is like in your first code, closign "cat" block, 
only that is like executing a script because of the "./", right?

7-) Regarding the "cat proc" block, do you know how many lines can store or how is the capacity in memory before print them?

Many thanks for your great help.

Regard.

clx · October 10, 2011, 5:07am

Most of the part you will understand yourself if you think about it.
I will try to help you explaining and/or the source where you can get the help. (man pages are the easiest and best way offcourse).

replacing "_vi" with "_ci" in the variable "$i". ( output to STDOUT, hence redirected to file). Check for string manipulation in bash/ksh.

opening the file for reading with file descriptor 6. ( you can use any number form 3 to 9. since 0,1 and 2 are the standard descriptor used for STDIN,STDOIUT and STDERR respectively.) See IO Redirection for more details

Its your choice. "-r" invokes the "restricted shell". man man bash ("bash") and search for RESTRICTED SHELL.

Reading input line by line from the file (which is FD 6 now ) each line is in the variable "vi".

Yes. syntax for closing FD.

"proc" is the name of the file. executing with "./". No relation with "cat".

"cat proc" will show the contents of the file "proc" to the screen (STDOUT).

Hope it helps.

cgkmal · October 11, 2011, 9:18pm

Many thanks anchal_khare, for sure it will help me a lot, I've read the links you suggested me and I understand many more things of your script now.

Great support and help from you.

Much appreciated.

Regards