splitting a file (xml) into multiple files

sasi_u · July 24, 2010, 10:31am

To split the files

Hi,
I have an XML file that contains multiple XML headers, so I want to split it into multiple files.

Test.xml
---------

<?xml version="UTF_8">
<emp: ....>
 <name>a</name>
 <age>10</age>
</emp>
<?xml version="UTF_8">
<emp: ....>
 <name>b</name>
 <age>10</age>
</emp>
<?xml version="UTF_8">
<emp: ....>
 <name>c</name>
 <age>10</age>
</emp>

I want to split Test.xml into 3 files (one per XML document), like below:

test1.xml
---------

<?xml version="UTF_8">
<emp: ....>
 <name>a</name>
 <age>10</age>
</emp>

test2.xml
---------

<?xml version="UTF_8">
<emp: ....>
 <name>b</name>
 <age>10</age>
</emp>

test3.xml
---------

<?xml version="UTF_8">
<emp: ....>
 <name>c</name>
 <age>10</age>
</emp>

I tried with awk but couldn't get it to work.
Please help with this.

Thanks,

vgersh99 · July 24, 2010, 10:42am

What exactly have you tried?

sasi_u · July 24, 2010, 11:01am

awk '/<?xml version="UTF_8">/{n++}{print > f n}' f=test test.xml

I then modified the XML to include BEGINXML and ENDXML markers around each document and tried the command below:

awk '/BEGINXML/{f="doc."++d} f{print > f} /ENDXML/{close f; f=""}' test.xml

It's not working.

Please suggest a solution.

Christoph_Spohr · July 24, 2010, 11:38am

Hi,

try:

awk '/xml/{c++}{print > ("file" c ".xml")}' file

HTH

Chris
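
A likely reason the first attempt above misbehaves: an unparenthesized concatenation after `>` (as in `print > f n`) is parsed differently by different awk implementations, and `close` is normally written as a function call, `close(f)`. The sketch below is only an illustration of the portable form, not a fix verified against every awk; the `part` prefix and the sample input file are made up for the demo.

```shell
# Build a small sample input (hypothetical file name) with two records.
cat > demo_input.xml <<'EOF'
<?xml version="UTF_8">
<emp: ....>
 <name>a</name>
</emp>
<?xml version="UTF_8">
<emp: ....>
 <name>b</name>
</emp>
EOF

# Start a new output file at each header line. Building the name in a
# variable keeps the redirection target unambiguous, and close()
# releases the previous file before the next one is opened.
awk '/<\?xml /{ if (out) close(out); out = "part" (++n) ".xml" }
     out { print > out }' demo_input.xml
```

This should leave part1.xml and part2.xml in the current directory, one record each.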

aigles · July 24, 2010, 11:50am

Try adapting the following awk script:

awk '
FNR==1 {
   path = namex = FILENAME;
   sub(/^.*\//,   "", namex);
   sub(namex "$", "", path );
   name = ext  = namex;
   sub(/\.[^.]*$/, "", name);
   sub("^" name,   "", ext );
}
/<\?xml / {
   if (out) close(out);
   out = path name (++file) ext ;
   print "Spliting to " out " ...";
}
/<\?xml /,/<\/emp>/ {
   print $0 > out
}
' sasi.xml

Input file (sasi.xml)

$ cat sasi.xml
<?xml version="UTF_8">
<emp: ....>
 <name>a</name>
 <age>10</age>
</emp>
<?xml version="UTF_8">
<emp: ....>
 <name>b</name>
 <age>10</age>
</emp>
<?xml version="UTF_8">
<emp: ....>
 <name>c</name>
 <age>10</age>
</emp>
$ ./sasi.sh
Spliting to sasi1.xml ...
Spliting to sasi2.xml ...
Spliting to sasi3.xml ...
$ more -999 sasi[0-9].xml
::::::::::::::
sasi1.xml
::::::::::::::
<?xml version="UTF_8">
<emp: ....>
 <name>a</name>
 <age>10</age>
</emp>
::::::::::::::
sasi2.xml
::::::::::::::
<?xml version="UTF_8">
<emp: ....>
 <name>b</name>
 <age>10</age>
</emp>
::::::::::::::
sasi3.xml
::::::::::::::
<?xml version="UTF_8">
<emp: ....>
 <name>c</name>
 <age>10</age>
</emp>
$

Jean-Pierre.

fpmurphy · July 24, 2010, 1:51pm

By the way, your document declaration is invalid:

<?xml version="UTF_8">

The version should be either 1.0 or 1.1. There is no valid XML version called "UTF_8". UTF-8 is a character encoding scheme.

The following is probably what you want:

<?xml version="1.0" encoding="UTF-8"?>
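
If the input really carries the malformed declaration, it could be rewritten before or after splitting. A minimal sketch with sed, assuming every header line is exactly `<?xml version="UTF_8">` and that the data is in fact UTF-8 encoded (the file names are hypothetical):

```shell
# Hypothetical three-line sample with the malformed declaration.
printf '%s\n' '<?xml version="UTF_8">' '<emp>' '</emp>' > decl_demo.xml

# Replace the malformed header with a well-formed XML declaration.
# In a basic regular expression the "?" is an ordinary character,
# so it needs no escaping here.
sed 's|^<?xml version="UTF_8">$|<?xml version="1.0" encoding="UTF-8"?>|' \
    decl_demo.xml > decl_fixed.xml
```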

ygemici · July 24, 2010, 5:56pm

# cat Test.xml
<?xml version="UTF_8">
<emp: ....>
 <name>a</name>
 <age>10</age>
</emp>
<?xml version="UTF_8">
<emp: ....>
 <name>b</name>
 <age>10</age>
</emp>
<?xml version="UTF_8">
<emp: ....>
 <name>c</name>
 <age>10</age>
</emp>

# ./justdoit Test.xml
 
1. Splitted File Name  -> "test1.xml"
<?xml version="UTF_8">
<emp: ....>
 <name>a</name>
 <age>10</age>
</emp>
 
2. Splitted File Name  -> "test2.xml"
<?xml version="UTF_8">
<emp: ....>
 <name>b</name>
 <age>10</age>
</emp>
 
3. Splitted File Name  -> "test3.xml"
<?xml version="UTF_8">
<emp: ....>
 <name>c</name>
 <age>10</age>
</emp>

 
# cat justdoit
 
#!/bin/bash
# Assumes every record has the same number of lines as the first one.
totalcnt=$(sed -n '/<?xml/,/emp>/p' "$1" | sed -n '$=')
mycnt=$(sed -n '1,/emp>/p' "$1" | sed -n '$=')
count=`expr $totalcnt / $mycnt `
first=1;endof=$mycnt;in=1
 
  while [ $(( count -=1 )) -gt -1 ]
   do
       # Write each slice to the same name that is reported below
       sed -n "${first},${endof}p" "$1" > "test$in.xml"
       echo -e "\n$in. Splitted File Name  -> \"test$in.xml\"" ; cat "test$in.xml"
       first=`expr $first + $mycnt `
       endof=`expr $endof + $mycnt `
       in=`expr $in + 1 `
   done

Regards
ygemici

kurumi · July 24, 2010, 10:27pm

linux$ csplit file '/^<emp/4' "{*}"

aigles · July 25, 2010, 6:21am

Thanks kurumi, I learned something today.

Another csplit approach:

$ cat sasi.xml
<?xml version="UTF_8">
<emp: ....>
 <name>a</name>
 <age>10</age>
</emp>
<?xml version="UTF_8">
<emp: ....>
 <name>b</name>
 <age>10</age>
</emp>
<?xml version="UTF_8">
<emp: ....>
 <name>c</name>
 <age>10</age>
</emp>
$ csplit -f sasi -b _%d.xml -z sasi.xml '/<\/emp>/1' '{*}'
73
73
73
$ ls -l sasi_*.xml
-rw-r--r-- 1 Jean-Pierre Aucun 73 2010-07-25 12:18 sasi_0.xml
-rw-r--r-- 1 Jean-Pierre Aucun 73 2010-07-25 12:18 sasi_1.xml
-rw-r--r-- 1 Jean-Pierre Aucun 73 2010-07-25 12:18 sasi_2.xml
$ head sasi*_*.xml
==> sasi_0.xml <==
<?xml version="UTF_8">
<emp: ....>
 <name>a</name>
 <age>10</age>
</emp>

==> sasi_1.xml <==
<?xml version="UTF_8">
<emp: ....>
 <name>b</name>
 <age>10</age>
</emp>

==> sasi_2.xml <==
<?xml version="UTF_8">
<emp: ....>
 <name>c</name>
 <age>10</age>
</emp>
$

rdcwayx · July 25, 2010, 8:01am

awk 'BEGIN{a=0;i=0}
      /<\?xml/ {a=1}
      /<\/emp>/ {out="File" i ".xml"; print > out; close(out); i++; a=0}
      {if (a==1) print > ("File" i ".xml")}' urfile

sasi_u · July 26, 2010, 12:09am

Hi All,

Thanks a lot for helping me with this.

aigles - thanks very much.

rdcwayx · July 26, 2010, 12:51am

The csplit command has many limitations in this case. For example, if there are other lines between </emp> and <?xml version="UTF_8">, you will get the wrong output.
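
To illustrate the point: when stray lines sit between </emp> and the next header, a splitter keyed only on </emp> attaches them to the following piece. Below is a sketch of an awk variant that writes only the lines from each header through its </emp> and drops anything in between. The file names and the stray line are invented for the example.

```shell
# Hypothetical input with a stray line between two records.
cat > stray_demo.xml <<'EOF'
<?xml version="UTF_8">
<emp: ....>
 <name>a</name>
</emp>
-- stray line between records --
<?xml version="UTF_8">
<emp: ....>
 <name>b</name>
</emp>
EOF

# inrec is 1 only between a header and its closing </emp>;
# lines seen while inrec is 0 are skipped.
awk '/<\?xml /  { inrec = 1; out = "rec" (++n) ".xml" }
     inrec      { print > out }
     /<\/emp>/  { if (inrec) close(out); inrec = 0 }' stray_demo.xml
```

The rule order matters here: the closing line is printed by the second rule before the third rule resets the flag.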