CSV to XML

Iam pretty new to UNIX and would like to convert a CSV to an XML file using AWK scripts. Can anybody suggest a solution? My CSV file looks something like this :

Serial No Growth% Annual % Commission % Unemployed %
1 35% 29% 59% 42%
2 61% 37% 34% 86%
3 68% 47% 62% 18%
4 78% 83% 91% 78%
5 53% 44% 67% 56%

You mean like this?

"35%","29%","59%","42%"
"61%","37%","34%","86%"
"68%","47%","62%","18%"
"78%","83%","91%","78%"
"53%","44%","67%","56%"

Otherwise it would not be CSV file.

Yes, that's what I mean.

Hi photon,
Can you help me with that problem I had - converting my CSV file to XML using AWK Scripts? I need it really urgently and I need your help. Since Iam new to Unix, it is making it harder to deduce and try to understand quickly. Please help.

Thanks.

given data file - pj.txt

SerialNo Growth% Annual% Commissio% Unemployed%
1 35% 29% 59% 42%
2 61% 37% 34% 86%
3 68% 47% 62% 18%
4 78% 83% 91% 78%
5 53% 44% 67% 56%

and script - pj.awk

BEGIN {OFS = "\n"}
NR == 1 {for (i = 1; i <=NF; i++)
            tag=$i
         print "<" node "XML>"}
NR != 1 {print "   <" node ">"
         for (i = 1; i <= NF; i++)
            print "      <" tag ">" $i "</" tag ">"
         print "   </" node ">"}
END {print "</" node "XML>"}

nawk -v node=whatever -f pj.awk pj.txt

produces:

<whateverXML>
   <whatever>
      <SerialNo>1</SerialNo>
      <Growth%>35%</Growth%>
      <Annual%>29%</Annual%>
      <Commissio%>59%</Commissio%>
      <Unemployed%>42%</Unemployed%>
   </whatever>
   <whatever>
      <SerialNo>2</SerialNo>
      <Growth%>61%</Growth%>
      <Annual%>37%</Annual%>
      <Commissio%>34%</Commissio%>
      <Unemployed%>86%</Unemployed%>
   </whatever>
   <whatever>
      <SerialNo>3</SerialNo>
      <Growth%>68%</Growth%>
      <Annual%>47%</Annual%>
      <Commissio%>62%</Commissio%>
      <Unemployed%>18%</Unemployed%>
   </whatever>
   <whatever>
      <SerialNo>4</SerialNo>
      <Growth%>78%</Growth%>
      <Annual%>83%</Annual%>
      <Commissio%>91%</Commissio%>
      <Unemployed%>78%</Unemployed%>
   </whatever>
   <whatever>
      <SerialNo>5</SerialNo>
      <Growth%>53%</Growth%>
      <Annual%>44%</Annual%>
      <Commissio%>67%</Commissio%>
      <Unemployed%>56%</Unemployed%>
   </whatever>
</whateverXML>

vgersh99,
thanks for your help. Can you explain it step by step, so I can understand better. When I open my unix box, and say openssh servername
and enter user name and password. after that how do I execute this script?
Do let me know as Iam entirely new to this project. I would be very greatly indebted to you for your timely help.

Thanks,
pjanakir

Hi vgersh99,
Also the file that Iam sending should only be a CSV file, and not a text file. You said pj.txt, it should be pj.csv. Just to let you know about it.

Thanks,
pjanakir

vgersh99,
Since you gave the input as a text file, the output that you got was correct. Instead of using the text file, I used the csv file as under :

pj.csv file
---------
"Store Number", "Quarter 1 Sales", "Quarter 2 Sales", "Quarter 3 Sales", "Quarter 4 Sales"

"1",                "35%",              "29%",              "59%",              "42%"
"2",                "61%",              "37%",              "34%",              "86%"
"3",                "68%",              "47%",              "62%",              "18%"
"4",                "78%",              "83%",              "91%",              "78%"
"5",                "53%",              "44%",              "67%",              "56%"

pj.awk file
----------
BEGIN {OFS = "\n"}
NR == 1 {for (i = 1; i <=NF; i++)
tag[i]=$i
print "<" node "XML>"}
NR != 1 {print " <" node ">"
for (i = 1; i <= NF; i++)
print " <" tag [i]">" $i "</" tag [i]">"
print " </" node ">"}
END {print "</" node "XML>"}

Script Execution Command
-------------------------

nawk -v node=whatever -f pj.awk pj.csv

Output :
--------

<whateverXML>
<whatever>
</whatever>
<whatever>
<"Store>"1",</"Store>
<Number",>"35%",</Number",>
<"Quarter>"29%",</"Quarter>
<1>"59%",</1>
<Sales",>"42%"</Sales",>
</whatever>
<whatever>
<"Store>"2",</"Store>
<Number",>"61%",</Number",>
<"Quarter>"37%",</"Quarter>
<1>"34%",</1>
<Sales",>"86%"</Sales",>
</whatever>
<whatever>
<"Store>"3",</"Store>
<Number",>"68%",</Number",>
<"Quarter>"47%",</"Quarter>
<1>"62%",</1>
<Sales",>"18%"</Sales",>
</whatever>
<whatever>
<"Store>"4",</"Store>
<Number",>"78%",</Number",>
<"Quarter>"83%",</"Quarter>
<1>"91%",</1>
<Sales",>"78%"</Sales",>
</whatever>
<whatever>
<"Store>"5",</"Store>
<Number",>"53%",</Number",>
<"Quarter>"44%",</"Quarter>
<1>"67%",</1>
<Sales",>"56%"</Sales",>
</whatever>
</whateverXML>

Also, please could you explain me the steps, Sir, so I can understand better.

Thanks once again for your help.

pjanakir

Sir,
I need the Input file to be a .CSV file so that I could get the proper output. Could you please help me?

Thanks,
pjanakir

assuming you don't have any embedded commas within the quoted fields.....

pj.awk

BEGIN {FS=",";OFS = "\n"}
NR == 1 {for (i = 1; i <=NF; i++)
            tag=$i
         print "<" node "XML>"}
NR != 1 {print "   <" node ">"
         for (i = 1; i <= NF; i++)
            print "      <" tag ">" $i "</" tag ">"
         print "   </" node ">"}
END {print "</" node "XML>"}

Sir,
I don't have words to thank you. But I want to understand it, so I can explain it to my supervisor. Can you go ahead and explain the .awk script and the execution command too? I would appreciate if you could do so? I thought awk was just a command used to search for patterns, but never knew that we could also use for file conversion. I really did not understand, what you meant by FS, OFS, NR, NF, node etc. Can you please go thru in detail step by step?

I really thank you once again.

Thanks,
pjanakir

BEGIN {
      # input Field Separator set to ','
   FS=",";
      # output Field Separator set to '\n' [new line]
   OFS = "\n"}

NR == 1 {
            # for the FIRST record/line in a file - iterate though the fields 
            # from '1' to 'NF' [NF stands for 'Number_of_Fields'
         for (i = 1; i <=NF; i++)
                # populate an array 'tag' indexed by the field NUMBER with the
                # VALUE of the field - here're we're reading the HEADERS
            tag=$i

             # print the BEGINNING of the XML document tagged with the value
             # with the value of the variable 'node' passed into the script
         print "<" node "XML>"
}
   # for the other records/lines - other than the FIRST line.....
NR != 1 {
             # this is the BEGINNING of the of the XML record
         print "   <" node ">"
             # iterate though the fields........ "i" is the iterator
         for (i = 1; i <= NF; i++)
              # outputing the header for this field "tag" and the VALUE of this
              # of this field "$i"
            print "      <" tag ">" $i "</" tag ">"

            # this is the END of the XML record
         print "   </" node ">"}
    # at the end of the file processing CLOSE the entire XML document
END {
   print "</" node "XML>"
}

do 'man nawk' for the meaning of the builtin variables [FS, OFS, NF etc...] and the syntax of the language.

Thanks a lot.

Hi vgersh99,
Assuming that we don't know what the name of the jar file, we are searching for is, do we have any way, of finding a string or a file in it. Is there any Unix command for it?

thanks.

pls don't piggy-back threads - start a new one for the unrelated questions.

A fantastic article - thanks

One slight question - the fields (Header and Results) are enclosed in double quotes in my CSV sources

"Col1","Col2","Col3"
"AA","222","2009-04-01"

The pj.awk file converts the results brilliantly -

<whateverXML>
<whatever>
<"Col1">"AA"</"Col1">
<"Col2">"222"</"Col2">
<"Col3">"2009-04-01"</"Col3">
</whatever>
</whateverXML>

However is there a switch that could be applied remove the quotes from the XML elements - I could throw the file at a simple edit command (sed) but that would add time to the overall execution process

Many thanks