awk file split

Hi all,

First of all, I'd like to mention that I'm pretty new to Unix scripting.

I'm trying to split a large XML file with awk and name each output file based on the values of two elements.
Example XML:

<RECORD>
<element1>11</element1>
<element2>22</element2>
<element3>33</element3>
<element4>a</element4>
<element5>b</element5>
<element6>c</element6>
</RECORD>
<RECORD>
<element1>44</element1>
<element2>55</element2>
<element3>66</element3>
<element4>a</element4>
<element5>b</element5>
<element6>c</element6>
</RECORD>

The desired output would be :

a file named 1122.xml
<RECORD>
<element1>11</element1>
<element2>22</element2>
<element3>33</element3>
</RECORD>

a file named 4455.xml
<RECORD>
<element1>44</element1>
<element2>55</element2>
<element3>66</element3>
</RECORD>

Up to now I have been able to split the file using this command:

awk '/<RECORD/{close("row"count".xml");count++}count{f="row"count".xml";print $0 > f}' file.xml

but I couldn't figure out how to put the values of elements 1 and 2 into the filename. Ideally I would also like to include the XML tags, as this is intended to split XML files and produce valid XML output.

Any ideas on how to do it with awk?
Thank you in advance.

PS. I know this has been addressed in a couple of posts, such as shell-210529-xml-split-extract-string-between-chars.html.
I tried the proposed solution, but it always produces only one file, containing the final record.

Again, thanks.

Hello,

Could you please try the following code?

awk '!/<RECORD>/ && !/<\/RECORD>/ {f = (f == "" ? $0 : f "\n" $0)} /<\/RECORD>/ {out = "Record" (++i) ".txt"; print f > out; close(out); f = ""}' check_records

It will create 2 files named Record1.txt and Record2.txt.

NOTE: check_records is the input file name.

Thanks,
R. Singh


Hello RavinderSingh13,

Thank you for your quick reply.
Unfortunately, this produces pretty much the same output as the code I posted earlier :(, namely n files, each with a counter in its filename.
However, I am trying to get the values of two tags into each filename.

Thanks again

Give this a try:

awk '/<\/RECORD/{
                 print raw $0 > (name ".xml")
                 close(name ".xml")
                 name=""
                 raw=""
                 next}
    /element[1-2]/{name=sprintf("%s%s",name,$3)}
    {raw=sprintf("%s%s\n",raw,$0)}' FS="(>)|(<)" file.xml

Hello,

Could you please try the following?

awk '!/<RECORD>/ && !/<\/RECORD>/ && !/<element4>/ && !/<element5>/ && !/<element6>/ {f = (f == "" ? $0 : f "\n" $0)} /<\/RECORD>/ {out = "Record" (++i) ".xml"; print f > out; close(out); f = ""}' check_records

It will create 2 output files, Record1.xml and Record2.xml.

Output files will be as follows.

$ cat Record1.xml
<element1>11</element1>
<element2>22</element2>
<element3>33</element3>
$
 
 
$ cat Record2.xml
<element1>44</element1>
<element2>55</element2>
<element3>66</element3>
$

Thanks,
R. Singh


Hi Klashxx,

Unfortunately, the file is too large to be handled with sprintf:
Error: awk: program limit exceeded: sprintf buffer size=2040
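
For reference, one way around this limit might be to drop sprintf and build the strings with plain concatenation. The sketch below assumes the one-element-per-line layout of the sample in the first post and that element1 and element2 supply the filename parts; sample.xml is a made-up file name used only for this demo. Like the sprintf version, it keeps every element of the record.

```shell
# Demo input: the two-record sample from the first post (sample.xml is a hypothetical name).
cat > sample.xml <<'EOF'
<RECORD>
<element1>11</element1>
<element2>22</element2>
<element3>33</element3>
<element4>a</element4>
<element5>b</element5>
<element6>c</element6>
</RECORD>
<RECORD>
<element1>44</element1>
<element2>55</element2>
<element3>66</element3>
<element4>a</element4>
<element5>b</element5>
<element6>c</element6>
</RECORD>
EOF

awk -F'[<>]' '
    # Closing tag: write the buffered record plus this line, then reset.
    /<\/RECORD>/ {
        out = name ".xml"
        print raw $0 > out
        close(out)
        name = raw = ""
        next
    }
    # $2 is the tag name and $3 the text between the tags.
    $2 == "element1" || $2 == "element2" { name = name $3 }
    # Buffer every other line of the record, including the opening tag.
    { raw = raw $0 "\n" }
' sample.xml
```

With the sample above this should leave 1122.xml and 4455.xml in the current directory. The record is still held in memory until its closing tag, so very large records may hit other limits.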

---------- Post updated at 12:05 PM ---------- Previous update was at 11:57 AM ----------

Hi RavinderSingh13,

Yes, this creates the output you describe. However, I would like to have the following output:

$ cat 1122.xml
<element1>11</element1>
<element2>22</element2>
<element3>33</element3>
$
 
 
$ cat 4455.xml
<element1>44</element1>
<element2>55</element2>
<element3>66</element3>
$

without the string Record or a counter. Thanks again for your help.

Could you please try the following and let me know?

awk -v i=1122 '!/<RECORD>/ && !/<\/RECORD>/ && !/<element4>/ && !/<element5>/ && !/<element6>/ {f = (f == "" ? $0 : f "\n" $0)} /<\/RECORD>/ {out = i ".xml"; print f > out; close(out); f = ""; i = 4455}' check_records

Thanks,
R. Singh


Would this help you?

$ cat file
<RECORD>
<element1>11</element1>
<element2>22</element2>
<element3>33</element3>
<element4>a</element4>
<element5>b</element5>
<element6>c</element6>
</RECORD>
<RECORD>
<element1>44</element1>
<element2>55</element2>
<element3>66</element3>
<element4>a</element4>
<element5>b</element5>
<element6>c</element6>
</RECORD>
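$ awk '
    # With FS splitting on < and >, a bare tag line such as <RECORD> has NF == 3.
    /<.*>/ && NF == 3 {
        s = 1; p = $0                 # start of a record: remember the opening tag
        next
    }
    s {                               # first element line of the record
        i = 0
        p = p RS $0
        f = $3                        # $3 is the text between the tags
        while ((getline) > 0) {       # stop at end of input instead of looping forever
            if (++i < tags)           # keep only the first "tags" elements
                p = p RS $0
            if (i < fname)            # build the name from the first "fname" elements
                f = f $3
            if (/<\/.*>/ && NF == 3)  # closing tag of the record
                break
        }
        s = 0
        f = f ".xml"
        print p RS $0 > f
        close(f)
    }
' tags="3" fname="2" FS='[>|<]' file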
$ cat 1122.xml 
<RECORD>
<element1>11</element1>
<element2>22</element2>
<element3>33</element3>
</RECORD>

$ cat 4455.xml 
<RECORD>
<element1>44</element1>
<element2>55</element2>
<element3>66</element3>
</RECORD>

You can change these variables, e.g. tags="4" fname="3".

Example: with tags="4" and fname="3", the output becomes:

$ cat 112233.xml 
<RECORD>
<element1>11</element1>
<element2>22</element2>
<element3>33</element3>
<element4>a</element4>
</RECORD>

$ cat 445566.xml 
<RECORD>
<element1>44</element1>
<element2>55</element2>
<element3>66</element3>
<element4>a</element4>
</RECORD>

RavinderSingh13, that works for the example but not in general. As stated at the beginning, there will be very large XML files where the values of element1 and element2 are unknown and their combination is unique.
Therefore, I am trying to read these values in order to put them in the filename of each file created.

Have you tried post #8?

Hello Akshay,

Thanks for the great code. Could you please explain it?

Thanks,
R. Singh

Try this if you don't want to print the main tag:

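awk '
    /<.*>/ && NF == 3 {               # opening tag: start a record but do not store the tag
        s = 1
        next
    }
    s {                               # first element line of the record
        i = 0
        p = $0
        f = $3                        # $3 is the text between the tags
        while ((getline) > 0) {       # stop at end of input instead of looping forever
            if (++i < tags)           # keep only the first "tags" elements
                p = p RS $0
            if (i < fname)            # build the name from the first "fname" elements
                f = f $3
            if (/<\/.*>/ && NF == 3)  # closing tag of the record
                break
        }
        s = 0
        f = f ".xml"
        print p > f                   # the closing tag is not printed either
        close(f)
    }
' tags="3" fname="2" FS='[>|<]' file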

Hello Akshay Hegde, first of all thanks for the help, and sorry for the late reply.
I've tried and tried to figure it out, but the output produced is not the one I would like.

I think I expressed myself clearly

The dimensions of the XML are not defined; the only certainties are:
i) the split nodes:

<XML>
<RECORD>
<ISTITUTO>01 </ISTITUTO>
<CODE>01</CODE>
.
.
.
</RECORD>
<RECORD>
<ISTITUTO>02 </ISTITUTO>
<CODE>02</CODE>
.
.
.
</RECORD>
..........
.........
</XML>

The final desired output is:

0101.xml

<XML>
<RECORD>
<ISTITUTO>01 </ISTITUTO>
<CODE>01</CODE>
.
.
.
</RECORD>
</XML>


0102.xml

<XML>
<RECORD>
<ISTITUTO>01 </ISTITUTO>
<CODE>02</CODE>
.
.
.
</RECORD>
..........
.........
</XML>

Sorry for the inconvenience.

Thank you so much.

Please check what you posted in post #1.

Regards,
Akshay Hegde

This was posted by in post #1

Far from it.

Nothing in your first post states that the element names are dummy placeholders.

Nothing in your first post specifies which elements to print. Knowing nothing about your real data, after looking at your original data sample, it is reasonable to assume that only the first three elements are relevant.

Even after your subsequent elaborations, the situation remains unclear. I have no idea if you want to print a fixed number of leading, numerically-valued elements. Or, if you want to print a variable number of leading, numerically-valued elements until the occurrence of a non-numerically valued element. Or perhaps you want to print all numerically-valued elements, ignoring any interleaved non-numerically valued elements. Or is it something else?

With those questions clearly answered, we would still not know what exactly a numerically-valued element is. From your original sample data, a reasonable method might test for the presence of a non-digit, e.g. [^0-9] or [^[:digit:]]. However, that reasonable method would fail with the data that you provided in post #13, because the value "01 " contains a blank character (none are present in the original post's element values).
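
To make that concrete, here is a small sketch of such a test. It assumes that "numerically-valued" means digits only once surrounding blanks are trimmed, which is just one of the possible readings listed above; classify is a hypothetical helper name and the input values are made up for the demo.

```shell
# classify: read one value per line and report two possible "numeric" tests.
classify() {
    awk '{
        v = $0
        strict = (v ~ /[^0-9]/) ? "no" : "yes"       # any non-digit disqualifies the raw value
        gsub(/^[ \t]+|[ \t]+$/, "", v)                # trim leading and trailing blanks
        trimmed = (v != "" && v !~ /[^0-9]/) ? "yes" : "no"
        printf "[%s] strict=%s trimmed=%s\n", $0, strict, trimmed
    }'
}

printf '%s\n' '11' '01 ' 'a' '4x5' | classify
# [11] strict=yes trimmed=yes
# [01 ] strict=no trimmed=yes
# [a] strict=no trimmed=no
# [4x5] strict=no trimmed=no
```

The second line is the case from post #13: the strict test rejects "01 " because of the trailing blank, while the trimmed test accepts it.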

I have no doubt that I could have coded and tested a solution in less time than it took me to explain the ambiguities in your problem statement. Being specific, explicit, and providing actual data whenever possible is the best way to not waste anyone's time (including your own).

Regarding the file splitting problem itself, the simplest approach would be to not accumulate data in memory (as I believe all the suggestions in this thread do). Simply print relevant elements as they're read to a temp file. When the end of the record is reached, the permanent filename will have been constructed and mv can rename the temp file.
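
A minimal sketch of that approach, assuming the one-element-per-line layout of the original sample and that element1 and element2 supply the filename. The names sample.xml and record.tmp are made up for the demo, and this version keeps every line of the record; a filter on the elements to print could be added to the print rule.

```shell
# Demo input: the two-record sample from the first post (sample.xml is a hypothetical name).
cat > sample.xml <<'EOF'
<RECORD>
<element1>11</element1>
<element2>22</element2>
<element3>33</element3>
<element4>a</element4>
<element5>b</element5>
<element6>c</element6>
</RECORD>
<RECORD>
<element1>44</element1>
<element2>55</element2>
<element3>66</element3>
<element4>a</element4>
<element5>b</element5>
<element6>c</element6>
</RECORD>
EOF

awk -F'[<>]' -v tmp="record.tmp" '
    /<RECORD>/ { name = "" }                                  # new record: reset the filename
    $2 == "element1" || $2 == "element2" { name = name $3 }  # collect the filename parts
    { print > tmp }                                           # stream each line straight to the temp file
    /<\/RECORD>/ {
        close(tmp)                                            # flush and close before renaming
        # Assumes name contains no spaces or shell metacharacters.
        system("mv " tmp " " name ".xml")
    }
' sample.xml
```

Only one line is held in memory at a time. Because the temp file is closed at the end of each record, the next print reopens and truncates it, so records do not leak into each other.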

Regards,
Alister

You are right. After reviewing all the posts, I see that I did not express myself clearly, and I sincerely apologize for the inconvenience.


Good you realized.

It's not a problem, f0usk4s, we are here to help. But if you are clear in your query, we can all save valuable time.

Thanks,
R. Singh

Just to close this topic: I went with a solution that greps for the tags I wanted and uses temp files, and I got the result I needed.

Thank you all for the help.
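
The final script was not posted. For later readers, here is a rough awk sketch (not the poster's actual grep-based solution) for the layout described in post #13. It assumes one tag per line, that ISTITUTO and CODE supply the filename with blanks removed, and that each record should be wrapped in its own <XML> root. The names istituto.xml, split.tmp and the OTHER element are invented for the demo.

```shell
# Demo input modelled on post #13; OTHER stands in for the elided elements.
cat > istituto.xml <<'EOF'
<XML>
<RECORD>
<ISTITUTO>01 </ISTITUTO>
<CODE>01</CODE>
<OTHER>x</OTHER>
</RECORD>
<RECORD>
<ISTITUTO>02 </ISTITUTO>
<CODE>02</CODE>
<OTHER>y</OTHER>
</RECORD>
</XML>
EOF

awk -F'[<>]' -v tmp="split.tmp" '
    # Opening tag: start a fresh temp file with its own root element.
    /<RECORD>/ { name = ""; inrec = 1; print "<XML>" > tmp }
    inrec && ($2 == "ISTITUTO" || $2 == "CODE") {
        v = $3
        gsub(/[ \t]/, "", v)          # drop blanks such as the one in "01 "
        name = name v
    }
    inrec { print > tmp }             # stream the record line by line
    /<\/RECORD>/ {
        print "</XML>" > tmp
        close(tmp)
        # Assumes name contains no spaces or shell metacharacters.
        system("mv " tmp " " name ".xml")
        inrec = 0
    }
' istituto.xml
```

With the demo input this should produce 0101.xml and 0202.xml, each holding one record inside its own <XML> root.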