Splitting an XML file into multiple files on the basis of line numbers

ajju · May 7, 2014, 8:00am

Hi All,

I have an XML file with more than half a million lines. I want to split it into four files in such a way that the top 7 lines are present at the top of each file and the bottom line is present at the bottom of each file.

The actual records start at the 8th line, and each record contains 15 lines, so lines 8 through 22 are the first record of the file.
The total number of actual records varies each time.

I want to divide the actual records into four chunks and move each chunk into one of four new files, below the top 7 lines and above the bottom line.

For example:

cat ABCD.xml |wc -l
 690728
Actual record lines = 690728 - 7 (top) - 1 (bottom) = 690720
Actual records = 690720 / 15 = 46048

Then 46048 / 4 = 11512 (if it is not exactly divisible, the remainder records should go to the last file).

So the first 11512 records will move to ABCD_part1.xml,
the second 11512 will move to ABCD_part2.xml,
the third 11512 will move to ABCD_part3.xml,
and the fourth 11512 (plus any remaining records) will move to ABCD_part4.xml.

Please help with this.
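
The arithmetic above can be sketched in shell roughly as follows. This is only an illustration: split_counts is a made-up helper name, and the defaults (7 header lines, 1 trailer line, 15 lines per record, 4 parts) are taken from the description in this post.

```shell
# Sketch only: split_counts is a hypothetical helper, not code from the thread.
# Usage: split_counts TOTAL_LINES [HEADER_LINES] [TRAILER_LINES] [LINES_PER_RECORD] [PARTS]
split_counts() {
    total=$1 hdr=${2:-7} trl=${3:-1} lpr=${4:-15} parts=${5:-4}
    records=$(( (total - hdr - trl) / lpr ))   # number of actual records
    base=$(( records / parts ))                # records per part
    rem=$(( records % parts ))                 # records left over
    i=1
    while [ "$i" -le "$parts" ]; do
        n=$base
        # The last part also takes the remainder, as requested in this post.
        [ "$i" -eq "$parts" ] && n=$(( base + rem ))
        echo "part$i $n"
        i=$(( i + 1 ))
    done
}
```

Calling split_counts 690728 prints one line per part, each with 11512 records for the example file.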

clx · May 7, 2014, 8:07am

You have already given the math.
You can use:

sed -n 'x,yp' file

where x is the starting line number and y is the ending.
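
As a rough sketch of how that could be combined with the header and trailer, assuming the 7 header lines and single trailer line described in the question (print_part is a hypothetical helper name):

```shell
# Sketch only: print_part is a hypothetical helper; the 7-line header and
# 1-line trailer are assumptions taken from the question.
# Usage: print_part FILE FIRST_RECORD_LINE LAST_RECORD_LINE > OUTPUT
print_part() {
    file=$1 x=$2 y=$3
    sed -n '1,7p' "$file"          # header lines
    sed -n "${x},${y}p" "$file"    # record lines for this part
    sed -n '$p' "$file"            # trailer (last line)
}
```

For the first part of the example file, x would be 8 and y would be 8 + 11512 * 15 - 1 = 172687. Note that this reads the input three times per part, so it is simple rather than fast.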

ajju · May 8, 2014, 2:35am

CLX,
Thanks. I want to make it general: the math I gave is only an example, and the number of records may vary for each new file.

ajju · May 15, 2014, 3:33am

Any hints on a shell script to handle this request?

SriniShoo · May 15, 2014, 4:33am

For any number of records, the code below will create 4 parts, each with the 7 header lines and the trailer.
Make sure to provide the file name twice, as in the code, because the file is read in two passes.

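awk '
BEGIN {
  n = 1      # line number of the last line (the trailer), set in pass 1
  prt = 4    # number of output files
}
# Pass 1: collect the 7 header lines and remember the last line as the trailer.
NR == FNR {
  if (FNR <= 7) {
    hd = (hd == "") ? $0 : (hd "\n" $0)
  } else {
    tr = $0
    n = FNR
  }
  next
}
# Pass 2, line 1: compute the number of record lines per output file.
FNR == 1 {
  fc = (n - 8) / prt
  c = 0
  next
}
# Record line that still belongs to the current part.
FNR > 7 && FNR <= (fc * c + 7) && FNR < n {
  print $0 > ("ABCD_part" c ".xml")
  next
}
# First record line of the next part: finish the previous part with the
# trailer, then start the new part with the header.
FNR > (fc * c + 7) && FNR < n {
  if (FNR != 8)
    print tr > ("ABCD_part" c ".xml")
  c++
  print hd > ("ABCD_part" c ".xml")
  print $0 > ("ABCD_part" c ".xml")
}
END { print tr > ("ABCD_part" c ".xml") }' ABCD.xml ABCD.xml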

ajju · May 16, 2014, 7:33am

Thanks a lot! It works as expected.

The only thing is that the original file has a tag on the seventh line like this:

<TotalRecord>46048</TotalRecord>

Each output file repeats the same value even though it holds only a quarter of the original records. In each file the tag should instead contain 46048 / 4 = 11512:

part1

<TotalRecord>11512</TotalRecord>

part2

<TotalRecord>11512</TotalRecord>

... and so on.

SriniShoo · May 16, 2014, 11:53am

Do you have a blank line at the beginning of the file or anywhere in the first 7 lines?

ajju · May 17, 2014, 12:55am

No, there is no blank line in the file.

SriniShoo · May 17, 2014, 1:45am

Can you paste the first 10 lines of the file?

Don_Cragun · May 17, 2014, 4:39pm

This is the second thread you have started on this topic. The other thread, "Splitting a file into 4 files containing the same name pattern", was closed because you changed the requirements after you had been given solutions to your original problem, refused to answer basic questions about what the input looked like, and refused to show us that you had made any attempt to solve the problem on your own.

This thread seems to be following a similar path. And, trying to piece together a description of your input file based on tidbits from both threads only makes it clear that the descriptions we have seen in this thread do NOT match the sample data provided in the other thread.

Before we spend any more time on this thread, please:

Show us what the file headers look like.
Show us what the file trailer looks like.
Show us ALL of the changes that need to be made to the file headers for the four files that you want your script to create.
Show us a sample of a few of the (15 line) records from your input (with private data, if there is any, scrubbed). Or much better, provide us with a complete sample input file we can upload containing 10 (doesn't have to be 10, but it needs to be at least 5 and not an even multiple of 4) records and provide us with four output files we can upload that your script should produce from that input file!
Show us what code you have written to try to solve your problem and show us what it does correctly and what it doesn't do correctly.

ajju · May 18, 2014, 10:12am

Sorry, Don, for the confusion.

The header looks like this:

<?xml version="1.0" encoding="UTF-8"?>
 <ns0:AbacusFile xmlns:ns0="urn:CPW:OTHERS:WGRS:rushandabacus">
 <AbacusFileHeader>
 <RecordType>01</RecordType>
 <Date>20140405</Date>
 <TotalRecord>46048</TotalRecord>
 </AbacusFileHeader>

The trailer looks like this:

</ns0:AbacusFile>

The only change in the header of each file is the <TotalRecord> tag, which depends on the number of records split into that file.
For example:

<TotalRecord>11971</TotalRecord>
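
One way to rewrite that tag for a single part, as a rough sketch using sed. The helper name set_total_record is made up, and restricting the substitution to the first 7 lines is an assumption based on the header length described earlier in the thread.

```shell
# Sketch only: set_total_record is a hypothetical helper. It reads a file on
# standard input and rewrites the TotalRecord value within the first 7 lines.
# Usage: set_total_record COUNT < INPUT > OUTPUT
set_total_record() {
    sed "1,7s|<TotalRecord>[0-9]*</TotalRecord>|<TotalRecord>$1</TotalRecord>|"
}
```

Lines that do not contain the tag pass through unchanged, so the same filter can be applied to a whole part file.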

SriniShoo · May 19, 2014, 2:35am

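awk '
BEGIN {
  n = 1      # line number of the last line (the trailer), set in pass 1
  prt = 4    # number of output files
}
# Pass 1: collect the 7 header lines and remember the last line as the trailer.
NR == FNR {
  if (FNR <= 7) {
    hd = (hd == "") ? $0 : (hd "\n" $0)
  } else {
    tr = $0
    n = FNR
  }
  next
}
# Pass 2, line 1: put the per-part record count into the header and compute
# the number of record lines per output file.
FNR == 1 {
  sub("<TotalRecord>[0-9]*<", "<TotalRecord>" ((n - 8) / (15 * prt)) "<", hd)
  fc = (n - 8) / prt
  c = 0
  next
}
# Record line that still belongs to the current part.
FNR > 7 && FNR <= (fc * c + 7) && FNR < n {
  print $0 > ("ABCD_part" c ".xml")
  next
}
# First record line of the next part: finish the previous part with the
# trailer, then start the new part with the header.
FNR > (fc * c + 7) && FNR < n {
  if (FNR != 8)
    print tr > ("ABCD_part" c ".xml")
  c++
  print hd > ("ABCD_part" c ".xml")
  print $0 > ("ABCD_part" c ".xml")
}
END { print tr > ("ABCD_part" c ".xml") }' ABCD.xml ABCD.xml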

Don_Cragun · May 19, 2014, 3:05pm

It looks like SriniShoo has given you code that will work fine as long as:

you always want to store the output in files named ABCD_part?.xml (no matter what the input file name is),
your input file always has a number of records that is a positive integral multiple of the number of output files you want to create,
you only have one input file to process,
and you want to read your input files twice.

If you want code that:

produces output files based on the input file name,
handles input files with zero or more records,
can process multiple input files,
only reads your input files once,
verifies that each input file has the number of input lines indicated by the TotalRecord tag,
prints status information for each input file processed, and
returns a non-zero exit status if one or more of the input files is malformed,

you could try something like:

#!/bin/ksh
awk '
function eofcheck(	e, i) {
	# Close output files for previous input file.
	for(i = 1; i <= nf; i++)
		close(of[i])
	# Perform end-of-file error checks...
	if(tlp == ntl) return
	e = 0
	for(i = 1; i <= nf; i++)
		if(c[i] > 0) {
			printf("\t*** Missing %d+%d records for part %d.\n",
				int(c[i] / lpr), (c[i] % lpr) > 0, i)
			e = 1
		}
	if(e) ec = 2
	else {	printf("\t*** Expected %d trailer line%s; found %d.\n", ntl,
			ntl == 1 ? "" : "s", tlp)
		ec = 3
	}
}
BEGIN {	if(lpr == 0) lpr = 15	# lines per record (default 15)
	if(nf == 0) nf = 4	# # of output files (default 4)
	if(nhl == 0) nhl = 7	# # of header lines (default 7)
	if(ntl == 0) ntl = 1	# # of trailer lines (default 1)
	ec = 0			# final exit code
}
FNR == 1 {
	# If this is not the first input file, perform EOF checks on last file.
	if(NR > 1) eofcheck()
	# Generate output filenames...
	for(i = 1; i <= nf; i++)
		of[i] = substr(FILENAME, 1, length(FILENAME) - 4) "_part" i \
			substr(FILENAME, length(FILENAME) - 3)
	# Set temporary value for ftl (it will be recalculated when we process
	# the TotalRecord tag).
	ftl = 1
	# Clear number of trailer lines printed for current file.
	tlp = 0
}
FNR <= nhl || FNR >= ftl {
	# Look for input record count.
	if(split($0, rc, /<\/*TotalRecord>/) != 3 || rc[2] !~ /^[0-9]+$/) {
		# Copy other header lines and the trailer to all output files...
		for(i = 1; i <= nf; i++)
			print > of[i]
		# Count number of trailer lines printed.
		if(FNR >= ftl) tlp++
		next
	}
	# We have the header line that defines the number of records present.
	irc = rc[2]		# input record count
	rpf = int(irc / nf)	# base output records / file
	rem = irc % nf		# records left over after even split among files
	printf("Found TotalRecord header in %s, %d input records.\n", FILENAME,
		irc)
	for(i = 1; i <= nf; i++) {
		# Calculate # of records for each output file.
		c[i] = rpf + (rem >= i)
		# Print TotalRecord tag header lines.
		printf("%s<TotalRecord>%d</TotalRecord>%s\n", rc[1], c[i],
			rc[3]) > of[i]
		printf("\tPreparing to write %d records to %s\n", c[i], of[i])
		# Convert count for each file from records to lines.
		c[i] *= lpr
	}
	# Calculate First Trailer Line number and initialize output file number.
	ftl = nhl + 1 + lpr * irc	# line # of 1st trailer line
	ofn = 1			# output file number
	tlp = 0			# # of trailer lines printed
	next
}
ftl == 1 {
	# TotalRecord tag not found.
	printf("TotalRecord tag not found in %s headers; aborting.\n", FILENAME)
	exit 99
}
{	# Copy data lines to appropriate output file.
	while(c[ofn]-- <= 0)
		if(++ofn > nf) {
			printf("Internal error: FNR=%d, ftl=%d, ofn=%d\n",
				FNR, ftl, ofn)
			exit 98
	}
	print > of[ofn]
}
END {	eofcheck()
	exit ec
}' "$@"

As always, if you want to run this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.

But, of course, this doesn't meet the conflicting requirements you have posted in this thread: You said that the TotalRecord tags are on line 7 in your headers, but your sample header has it on line 6. This code looks for a TotalRecord tag on any line in the headers. It will give you an error if no tag is found. It will produce multiple TotalRecord tags if more than one appears and use data from the last one found. If more than one set of TotalRecord tags appears on a single line, all of the tags on that line will be silently ignored. (Producing an error in these cases is left as an exercise for the reader.)

You said you wanted the same number of records in the first three files and any additional records added to the last file. This code spreads any extra records out such that if there is one extra record, it will go into the first output file; if there are two extra records, one will go into each of the first two output files; and if there are three extra records, one will go into each of the first three output files. (This made error checking simpler in cases where there are fewer records in the input file than there are output files. And, I think it makes more sense to do it this way. If you disagree, feel free to modify the code to partition output records the way you want it.)
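
To see the distribution described above without running the whole script, here is a small sketch that repeats only the rpf + (rem >= i) calculation. The wrapper name spread_records is hypothetical, and it is not part of the splitter script.

```shell
# Sketch only: spread_records is a hypothetical helper that mirrors the
# rpf + (rem >= i) calculation used in the script.
# Usage: spread_records RECORDS [PARTS]
spread_records() {
    awk -v irc="$1" -v nf="${2:-4}" 'BEGIN {
        rpf = int(irc / nf)   # base records per file
        rem = irc % nf        # records left over after an even split
        for (i = 1; i <= nf; i++)
            printf("part%d %d\n", i, rpf + (rem >= i))
    }'
}
```

With 10 records and 4 files this gives 3, 3, 2 and 2 records. With 3 records and 4 files the last file gets 0 records, which is the case the error checking has to cope with.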

The awk script is fully parameterized to accept any positive number of header lines, any positive number of trailer lines, any positive number of lines per record, and any positive number of output files/input file (up to your system's awk's limit on the number of open files), but adding getopts code to parse options to this script to override the defaults is left as an exercise for the reader.
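
A possible starting point for that exercise, as a hedged sketch. The option letters (-l, -f, -H, -t) and the function name parse_opts are invented here; only the variable names lpr, nf, nhl and ntl come from the script.

```shell
# Sketch only: the option letters and the name parse_opts are hypothetical.
# It sets lpr, nf, nhl and ntl and prints the matching awk -v assignments.
parse_opts() {
    lpr=15 nf=4 nhl=7 ntl=1      # same defaults as the awk BEGIN block
    OPTIND=1
    while getopts 'l:f:H:t:' opt "$@"; do
        case $opt in
        l) lpr=$OPTARG ;;        # lines per record
        f) nf=$OPTARG ;;         # number of output files
        H) nhl=$OPTARG ;;        # number of header lines
        t) ntl=$OPTARG ;;        # number of trailer lines
        *) echo "Usage: splitter [-l lines] [-f files] [-H hdr] [-t trl] file..." >&2
           return 1 ;;
        esac
    done
    echo "-v lpr=$lpr -v nf=$nf -v nhl=$nhl -v ntl=$ntl"
}
```

In the real script you would follow the option loop with shift $((OPTIND - 1)) and pass the four values to awk with -v before the program text, leaving the remaining operands as the input files.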

If you save the above code in a file named splitter and make it executable (chmod +x splitter), you can invoke it as:

./splitter ABCD.xml

to split ABCD.xml into four files named ABCD_part1.xml through ABCD_part4.xml. If you give it additional file operands it will split all of the given files.

This code assumes that it is working on XML files, but doesn't enforce any naming convention. Note, however, that this code assumes that the input file pathnames end with a period followed by a three character filename extension (such as .xml or .XML). If an input pathname contains less than four characters, the results are unspecified. Adding checks for this situation is left as an exercise for the reader.
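
One way such a check might look in the shell wrapper, as a sketch only. The name part_name is hypothetical, and rejecting names with nothing before the period is an extra assumption rather than something the awk code requires.

```shell
# Sketch only: part_name is a hypothetical helper.
# Usage: part_name PATHNAME PART_NUMBER
part_name() {
    case $1 in
    ?*.[!./][!./][!./]) ;;   # something, a period, then a three character extension
    *) echo "part_name: $1: expected a three character filename extension" >&2
       return 1 ;;
    esac
    ext=${1##*.}                   # the three character extension
    echo "${1%.*}_part$2.$ext"     # same name the awk code would build
}
```

For pathnames that pass the check, this produces the same output names as the substr() calls in the awk script, for example ABCD_part1.xml for ABCD.xml.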

ajju · May 22, 2014, 3:14am

Thank you very much, SriniShoo and Don. Both scripts work as expected.

This can be closed as a RESOLVED THREAD.

Akshay_Hegde · May 22, 2014, 4:57am

@ajju: You can use the Thanks button (bottom right) on the right answers. Don't just copy the code; think of the effort behind each response and the valuable time spent.