a file named 1122.xml
<RECORD>
<element1>11</element1>
<element2>22</element2>
<element3>33</element3>
</RECORD>
a file named 4455.xml
<RECORD>
<element1>44</element1>
<element2>55</element2>
<element3>66</element3>
</RECORD>
Up till now I have been able to split the file using this command
but couldn't figure out how to pass the values of elements 1 and 2 into the filename. Ideally I would like to add the XML tags as well, since this is intended to split XML files and produce valid XML output.
Any ideas on how to do it with awk?
Thank you in advance.
PS. I know that this has been addressed in a couple of posts, such as shell-210529-xml-split-extract-string-between-chars.html
Although I could use the proposed solution, it always produces only one file, containing the final record.
Thank you for your quick reply.
Unfortunately, this produces pretty much the same output as the code I posted earlier :(, that is, n files, each with a counter in its filename.
However, I was trying to get the values of two tags into each filename.
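awk '
# A bare tag line such as <RECORD> splits into 3 fields: start a record.
/<.*>/ && NF==3 {
    s = 1; p = $0
    next
}
# First element line after the opening tag.
s {
    i = 0
    p = p RS $0
    f = $3                       # value of the first element starts the filename
    while ((getline) > 0) {      # stop at end of input instead of looping forever
        if (++i < tags)
            p = p RS $0          # keep the first "tags" elements
        if (i < fname)
            f = f $3             # append the values of the first "fname" elements
        if (/<\/.*>/ && NF==3)   # bare closing tag: end of record
            break
    }
    s = 0
    f = f ".xml"
    print p RS $0 > f            # record including its opening and closing tags
    close(f)
}
' tags="3" fname="2" FS='[<>]' file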
RavinderSingh13, that works for the example but not in general. As stated at the beginning, there will be very large XML files where the values of element1 and element2 are not known in advance and their combination is unique.
Therefore, I am trying to read these values in order to put them in the filename of each file created.
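awk '
# A bare tag line such as <RECORD> splits into 3 fields: start a record.
/<.*>/ && NF==3 {
    s = 1
    next
}
# First element line after the opening tag.
s {
    i = 0
    p = $0
    f = $3                       # value of the first element starts the filename
    while ((getline) > 0) {      # stop at end of input instead of looping forever
        if (++i < tags)
            p = p RS $0          # keep the first "tags" elements
        if (i < fname)
            f = f $3             # append the values of the first "fname" elements
        if (/<\/.*>/ && NF==3)   # bare closing tag: end of record
            break
    }
    s = 0
    f = f ".xml"
    print p > f                  # elements only, without the enclosing tags
    close(f)
}
' tags="3" fname="2" FS='[<>]' file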
Hello Akshay Hegde, first of all, thanks for the help and sorry for the late reply.
I've tried and tried to figure it out, but the output produced is not what I need.
I think I expressed myself clearly
The dimensions of the xml are not defined, the only certainties are:
i)split nodes:
Nothing in your first post states that the element names are dummy placeholders.
Nothing in your first post specifies which elements to print. Knowing nothing about your real data, after looking at your original data sample, it is reasonable to assume that only the first three elements are relevant.
Even after your subsequent elaborations, the situation remains unclear. I have no idea if you want to print a fixed number of leading, numerically-valued elements. Or, if you want to print a variable number of leading, numerically-valued elements until the occurrence of a non-numerically valued element. Or perhaps you want to print all numerically-valued elements, ignoring any interleaved non-numerically valued elements. Or is it something else?
With those questions clearly answered, we would still not know what exactly is a numerically-valued element. From your original sample data, a reasonable method might test for the presence of a non-digit, e.g. [^0-9] or [^[:digit:]] . However, that reasonable method would fail with the data that you provided in post #13, due to the presence of at least one blank character (of which none are present in the original post's element values):
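To illustrate the difference a single blank makes, here is a small sketch using made-up values (not the data from post #13). The "strict" test is the [^0-9] check described above; the "lenient" test, which tolerates surrounding blanks, is only one possible reading of the requirement and is my assumption.

```shell
#!/bin/sh
# Hypothetical values: "11" is all digits, " 22" has a leading blank, "3a" is mixed.
result=$(printf '%s\n' '11' ' 22' '3a' |
awk '{
    strict  = ($0 != "" && $0 !~ /[^0-9]/)     # digits only, nothing else
    lenient = ($0 ~ /^[ \t]*[0-9]+[ \t]*$/)    # digits with optional blanks around them
    printf "[%s] strict=%d lenient=%d\n", $0, strict, lenient
}')
printf '%s\n' "$result"
```

The value with the leading blank is rejected by the strict test and accepted by the lenient one, which is why the intended rule needs to be stated explicitly.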
I have no doubt that I could have coded and tested a solution in less time than it took me to explain the ambiguities in your problem statement. Being specific, explicit, and providing actual data whenever possible is the best way to not waste anyone's time (including your own).
Regarding the file splitting problem itself, the simplest approach would be to not accumulate data in memory (as I believe all the suggestions in this thread do). Simply print relevant elements as they're read to a temp file. When the end of the record is reached, the permanent filename will have been constructed and mv can rename the temp file.
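A minimal sketch of that temp-file approach follows. It assumes one tag per line, records delimited by <RECORD> and </RECORD>, and a filename built from the values of element1 and element2, as in the sample from the first post. The names sample.xml and record.tmp are hypothetical, and the character filter applied to the filename is my own precaution, since the name is handed to the shell.

```shell
#!/bin/sh
# Made-up sample input matching the layout in the first post.
cat > sample.xml <<'EOF'
<RECORD>
<element1>11</element1>
<element2>22</element2>
<element3>33</element3>
</RECORD>
<RECORD>
<element1>44</element1>
<element2>55</element2>
<element3>66</element3>
</RECORD>
EOF

awk -F'[<>]' '
$2 == "RECORD" {                  # opening tag: start a fresh temp file
    inrec = 1; name = ""
    print > tmp
    next
}
inrec && $2 == "/RECORD" {        # closing tag: flush, then rename
    print > tmp
    close(tmp)
    gsub(/[^A-Za-z0-9_.-]/, "", name)    # assumption: keep the name shell-safe
    if (name == "") name = "record" NR   # fallback if no naming tags were seen
    system("mv " tmp " " name ".xml")
    inrec = 0
    next
}
inrec {                           # body line: write it out immediately
    print > tmp
    if ($2 == "element1" || $2 == "element2") name = name $3
}
' tmp="record.tmp" sample.xml
```

Only the filename is held in memory; each line goes to disk as soon as it is read, so record size does not matter. The close() call before mv matters because it flushes the temp file and lets the next record reopen (and truncate) the same temp name.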
Just to close this topic: I went with grepping for the tag I wanted, combined with temp files, and got the result I needed.