Split files with formatted numbers

bobbygsk · June 20, 2014, 12:48pm

How can I split a file so that the output file names get a zero-padded numeric suffix?

I tried the following code:

awk '{filename="split."int((NR-1)/2)".txt"; print >> filename}' split.txt

Current Result

Expected Result

blackrageous · June 20, 2014, 1:07pm

You should use a printf statement with the %d conversion, like this:

{printf("split.%03d.txt\n","1")}

bobbygsk · June 20, 2014, 1:53pm

How do I embed the printf in awk? I also need the content of the original file distributed across these split files, based on the number of lines per file that I choose.

MadeInGermany · June 20, 2014, 2:00pm

sprintf prints to a string.

filename=sprintf("split.%03d.txt",(NR-1)/2)
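For anyone following along, here is a minimal sketch of how that sprintf call could be dropped into the original one-liner. The five-line sample input and the explicit int() are assumptions added for illustration; it is meant to be run in a scratch directory, since it removes earlier split.NNN.txt files before appending.

```shell
# Sample input (hypothetical data): five one-letter lines.
printf '%s\n' a b c d e > split.txt

# Remove output from an earlier run, because ">>" appends.
rm -f split.[0-9][0-9][0-9].txt

awk '{
    # Two lines per output file: lines 1-2 -> 000, 3-4 -> 001, 5 -> 002.
    filename = sprintf("split.%03d.txt", int((NR - 1) / 2))
    print >> filename
}' split.txt

ls split.[0-9][0-9][0-9].txt
```

This keeps every output file open until awk exits, which is fine for a handful of files but not for hundreds.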

Don_Cragun · June 20, 2014, 4:41pm

Since you're specifying 3 digit file sequence numbers, I assume you expect that you'll be producing more than a hundred files with this script. There is a good chance that awk will run out of file descriptors if you keep all of them open. You might want to consider something like:

awk '
NR%2 {	# Odd lines:
	fn = sprintf("split.%03d.txt", (NR - 1) / 2)
}
{	# all lines:
	print >> fn
}
(NR%2) == 0 {
	# Even lines:
	close(fn)
}' split.txt

If there are existing split.xxx.txt files when you start this script do you really want to append data to them, or do you want to remove any data that was there before and just keep what you find in the current input file?

If you want to append to existing files, the script above should work.

If you want to replace data instead of appending data, change:

	print >> fn

to:

	print > fn

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.

bobbygsk · June 26, 2014, 9:22am

I receive files with 2 to 10 million records and have to split them into files of 100,000 records each. Most of the time I get about 3 million records, so I have to split the file into 30 files. What is the limit on the number of file descriptors that awk can handle?

Besides, how do I add a header (n records) and a trailer with the file number or some other content in it?

Don_Cragun · June 26, 2014, 1:01pm

Simple, you slightly modify the code I gave you to put 100000 lines per output file instead of 2 lines per output file. The code I gave you already closes files when it is done with them so it only keeps one output file open at a time.
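As a rough sketch of that modification (not necessarily the exact change intended), the chunk size can be passed in as a variable. The name lpf, the sample data, and the tiny value 3 are assumptions for illustration; with real data lpf would be 100000.

```shell
# Sample input (hypothetical data): seven records.
awk 'BEGIN { for (i = 1; i <= 7; i++) print "record " i }' > split.txt

awk -v lpf=3 '
(NR - 1) % lpf == 0 {
    # First line of a chunk: close the previous file, pick the next name.
    if (fn != "") close(fn)
    fn = sprintf("split.%03d.txt", int((NR - 1) / lpf))
}
{   # All lines: ">" truncates on the first write to fn, then keeps writing.
    print > fn
}' split.txt

wc -l split.[0-9][0-9][0-9].txt
```

Only one output file is open at any time, so the number of chunks is not limited by the file descriptor limit.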

You're going to have to give us a lot more than "get header (n records) and trailer with file number or some content in it" to guess at what you want to put as headers and trailers in your files. Show us sample input and show us sample output! How is your script supposed to identify which lines are headers, which lines are trailers, and what data you want added to or removed from those headers as you copy parts of the input file to your hundreds of output files?

bobbygsk · June 27, 2014, 2:44pm

I guess another print command has to be placed before and after

for header and trailer. But how?

Expected result.

split.001.txt
=============
001 of n files
record 1
record 2
....
record 100,000
date

split.002.txt
=============
002 of n files
record 1
record 2
....
record 100,000
date

CarloM · June 27, 2014, 3:28pm

That section executes for every line (as the comment says) - you would probably want the header in a (new) NR%100000==1 section and the trailer in the NR%100000==0 section before the close. You'd also need an END section to handle the trailer for the last file (since it's unlikely to end on exactly 100000 lines, I assume).

If you have GNU awk you can use strftime() to get the date.

Getting the total number of files is an issue though, since awk won't know that until it's processed the entire file. It might be easiest to work that out in a shell script wrapper and just pass it in as a variable.

Don_Cragun · June 27, 2014, 4:52pm

If you don't have GNU awk (or if you want code that should work on any system), you could try something like:

#!/bin/ksh
lc=$(wc -l < split.txt)
awk -v lc="$lc" '
BEGIN {	lpf = 100000	# Lines per output file.
}
FNR == 1 {
	# This is not in the BEGIN section to allow the default value of lpf
	# to be overridden by an assignment before the filename operand.
	nf = int((lc + lpf - 1) / lpf)	# Total number of files to be created.
}
NR % lpf == 1 {
	# 1st line of output file:
	fn=sprintf("split.%03d.txt", ++ofc)
	printf("%03d of %03d files\n", ofc, nf) > fn
}
{	# all lines:
	print > fn
}
NR % lpf == 0 {
	# Last line of output file:
	trailer()
}
END {	if(NR % lpf)
		trailer()
}
function trailer() {
	close(fn)
	cmd = sprintf("date >> \"%s\"\n", fn)
	system(cmd)
}' "$@" split.txt

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.

Although I normally use the Korn shell (and this was tested using ksh), it will work with any shell that supports basic POSIX shell standard syntax (such as bash and ksh).
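The comment about overriding lpf may be easier to see in a stripped-down sketch. It keeps only the BEGIN default and the FNR == 1 calculation from the script above; the file names demo.txt and demo.summary and the value 3 are assumptions for illustration.

```shell
# Sample input (hypothetical data): seven records.
awk 'BEGIN { for (i = 1; i <= 7; i++) print "record " i }' > demo.txt
lc=$(wc -l < demo.txt)

# "lpf=3" is an assignment operand: awk applies it after BEGIN has run
# and before it opens demo.txt, so it replaces the default of 100000.
awk -v lc="$lc" '
BEGIN    { lpf = 100000 }                    # default lines per file
FNR == 1 { nf = int((lc + lpf - 1) / lpf)    # ceiling of lc / lpf
           print "lpf=" lpf, "files=" nf }
' lpf=3 demo.txt > demo.summary

cat demo.summary
```

With the full script, the same operand would be supplied through "$@", for example by running the script with lpf=50000 as its argument.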

bobbygsk · June 30, 2014, 10:11am

I have SunOS, and awk did not work; it threw an error.

However

/usr/xpg4/bin/awk

worked.

bobbygsk · July 1, 2014, 11:59am

Again.. Issue.
I have another Server - AIX.
awk, nawk and /usr/xpg4/bin/awk are not working.
They throw the following error:

syntax error The source line is 3.
 The error context is
                   BEGIN >>>
 <<<
 awk: Quitting
 The source line is 3.

CarloM · July 1, 2014, 12:17pm

Post your AIX version of the script.

bobbygsk · July 2, 2014, 9:16am

Fixed by putting the opening curly brace on the same line as "BEGIN" and "END" instead of on the next line.

Besides, I have another issue. Initially I had a requirement to write the date, and now the date is not needed. But at the time I was trying to format the date. From Don Cragun's code

cmd = sprintf("date >> \"%s\"\n", fn)

I changed it to
cmd = sprintf("date \+\'\%Y\%m\%d\'>> \"%s\"\n", fn) and
cmd = sprintf("date \+\"\%Y\%m\%d\">> \"%s\"\n", fn) .
Both attempts threw errors.
With double quotes I get the following error:

awk: There are not enough parameters in printf statement date +"%Y%m%d" >> "%s"
.

 The input line number is 2. The file is tstSplit.
 The source line number is 28.

With single quotes, I get the following error:

        cmd = sprintf("date \+\'\%Y\%m\%d\' >> \"%s\"\n", fn)
        system(cmd)
} ' "$@" tstSplit
split.ksh[3]: syntax error at line 32 : `"' unmatched

However, if I replace date with echo and some text in double quotes, it works.

cmd = sprintf("echo \"Trailer\" >> \"%s\"\n", fn)

I am more interested in the reason than in a correction to the code (of course I also want to know how the date can be formatted :))

Don_Cragun · July 2, 2014, 4:24pm

When you use the *printf() family of functions and you want to print a percent sign (%) rather than have it act as a format field introducing character, you need to use %% as in:

cmd = sprintf("date +%%Y%%m%%d >> \"%s\"\n", fn)

If you need to include characters in the date format operand that have to be escaped from the shell (such as if you wanted the date output to include spaces between fields), it would be something like:

cmd = sprintf("date \"+%%Y %%m %%d\" >> \"%s\"\n", fn)
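A quick way to check what sprintf actually builds is to print the command string instead of passing it to system(). This is only a sketch; the file name split.001.txt and the scratch file cmd.txt are assumptions for illustration.

```shell
awk 'BEGIN {
    fn  = "split.001.txt"    # hypothetical output file name
    # Each %% becomes one literal %, and %s is replaced by fn.
    cmd = sprintf("date +%%Y%%m%%d >> \"%s\"", fn)
    print cmd
}' > cmd.txt

cat cmd.txt    # date +%Y%m%d >> "split.001.txt"
```

As for the reason behind the two earlier errors: a backslash does not protect % inside a format string, so awk still treated %Y, %m and %d as conversions and ran out of arguments. In the single-quote attempt, the whole awk program is itself wrapped in shell single quotes, so the first ' inside it ends the program text as far as the shell is concerned, which would explain the unmatched quote message.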

bobbygsk · July 2, 2014, 7:30pm

Thanks Don Cragun. So % works like an escape character within the *printf() functions.

bobbygsk · July 10, 2014, 9:05am

Don Cragun's code works fine. But since I rarely use Unix, I'm not an expert in awk.
My requirement changed, and the header now needs to show the number of records in each file.
Though we are splitting into 100,000 records per file, the last file might have fewer than 100,000.

So to display the number of records in each split file, I guess I have to use FNR (the record number in the current file). But how do I print it? FNR is known only at the end of the file, and we write the header and all the records (lines) first.

So my split files header should look like the following

HD~<total records in this split file>~Total number of files

~ being the delimiter

Don_Cragun · July 10, 2014, 2:23pm

Even though you're not an expert in awk, which line in the code I supplied do you think needs to be changed? Did you make any attempt at changing that line to meet your new requirements? What part of what you tried is not working?

Do you want the line count in the header of each file to include the header and trailer in that file in the count, or just the number of lines in that file from the file that is being split?

Do you still want a 3 digit number (with leading zeros) for the "Total number of files" field at the end of the header line?

bobbygsk · July 10, 2014, 4:32pm

I tried the following before NR % lpf == 1

}
{       # count no. of lines
        ++cntRec
}
NR % lpf == 1 {
        # 1st line of output file:
        fn=sprintf("split.%03d.txt", ++ofc)
        # Header format HD~A~B (A:File no.;  B: Total Files)
        printf("HD~%03d~%03d\n", cntRec, nf) > fn
}
{       # all lines:
        print > fn
}

I do not know where to increment it.
In the header of each split file, I need the number of records (lines) it contains, excluding the header and footer.
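One possible approach, sketched here under assumptions rather than taken from the thread: since the wrapper already passes the total line count lc to awk, the record count of each output file can be worked out before any of its lines are written. Every file holds lpf records except the last, which holds whatever remains. The sample data, the value lpf=3, and the unpadded %d for the count are assumptions.

```shell
# Sample input (hypothetical data): seven records.
awk 'BEGIN { for (i = 1; i <= 7; i++) print "record " i }' > input.txt
lc=$(wc -l < input.txt)

awk -v lc="$lc" -v lpf=3 '
BEGIN { nf = int((lc + lpf - 1) / lpf) }      # total number of files
(NR - 1) % lpf == 0 {
    # First line of an output file: close the previous one, write header.
    if (fn != "") close(fn)
    fn  = sprintf("split.%03d.txt", ++ofc)
    cnt = (ofc < nf) ? lpf : lc - (nf - 1) * lpf   # last file gets the rest
    printf("HD~%d~%03d\n", cnt, nf) > fn
}
{ print > fn }                                 # all lines
' input.txt

head -n 1 split.001.txt split.003.txt
```

With seven records and three per file, the first two headers read HD~3~003 and the last reads HD~1~003. The count excludes the header itself, and no trailer is written in this sketch.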

jim_mcnamara · July 10, 2014, 4:59pm

I guess I missed something - generally I think it is better to use a command that does what you want than to write a script, in this case

csplit

is a possible choice. It is educational to write a script but a better idea to use known good commands for production work.

csplit  -f splitz -k  -n 3  csprap01.logscan 10000 {5}

Explanation: split csprap01.logscan into numbered files named splitz000, splitz001, and so on (six files in the run shown here, the last one short).

-f splitz: prefix for the numbered file names (splitz000 .. splitz999)

-n: number of decimal digits in the number; -n 3 means use zero-filled numbers with 3 digits for output filenames

10000 means start from where you are in the file (usually the beginning) and stop at line 10000, so lines 1-9999 are in the first split and lines 10000-19999 are in the second.

{5} repeat five times - {*} (Linux csplit) means keep on repeating. This last option will cause you to overwrite the splitz000 file (and others) if you create more than 999 files as splits.

The "line number out of range" line in the output means the last file came up short of lines. With -k you lose no lines in the splits in case of error.

csplit  -f splitz -k  -n 3  csprap01.logscan 10000 {5}
1293851
1305465
1306543
2458441
1785104
/usr/local/bin/csplit: `10000': line number out of range on repetition 5
258231
jmcnama>
jmcnama > ls -lrt splitz*
-rw-r--r--   1 jmcnama  other    1293851 Jul 10 14:39 splitz000
-rw-r--r--   1 jmcnama  other    1305465 Jul 10 14:39 splitz001
-rw-r--r--   1 jmcnama  other    1306543 Jul 10 14:39 splitz002
-rw-r--r--   1 jmcnama  other    2458441 Jul 10 14:39 splitz003
-rw-r--r--   1 jmcnama  other    1785104 Jul 10 14:39 splitz004
-rw-r--r--   1 jmcnama  other     258231 Jul 10 14:39 splitz005

 jmcnama > wc -l splitz*
    9999 splitz000
   10000 splitz001
   10000 splitz002
   10000 splitz003
   10000 splitz004
    2093 splitz005
   52092 total
jmcnama >  wc -l csprap01.logscan
   52092 csprap01.logscan
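A small-scale sketch of the same csplit call may make the line arithmetic easier to follow. The file name sample.log, the 25-line input, and the values 10 and {1} are assumptions for illustration; -s only suppresses the byte counts that csplit normally prints.

```shell
# Sample input (hypothetical data): 25 lines.
awk 'BEGIN { for (i = 1; i <= 25; i++) print "line " i }' > sample.log

# Split before line 10, then once more 10 lines later (before line 20).
csplit -s -k -f splitz -n 3 sample.log 10 '{1}'

wc -l splitz00*
```

This should leave splitz000 with lines 1-9, splitz001 with lines 10-19, and splitz002 with the remaining lines 20-25, which mirrors the 9999/10000 pattern in the wc output above: the first piece is one line shorter because the split happens before the named line.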