Hello All,
I have a large file, more than 50,000 lines, and I want to split it into even 5000-record chunks, which I can do using
sed '1d;$d;' <filename> | awk 'NR%5000==1{x="F"++i;}{print > x}'
Now I need to add one more condition: do not break the file at the 5000th record if that record starts with "3", i.e. I need to keep reading past the split point until I encounter a record that does not start with 3.
Thanks 'Chubler',
When I put in my actual file name and try to run it, I get the error below.
sed '1d;$d;' fiscal13 | awk 'NR%5==1{N++} N&&!/^\s*3/{if(x) close(x);x="Fa"++i;N=0}{print > x}'
awk: 0602-576 A print or getline function must have a file name.
The input line number is 1.
The source line number is 1.
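For what it's worth, that 0602-576 diagnostic from AIX awk generally means print was handed an empty file name: if the very first record matches the starts-with-3 test, the block that sets x never fires before { print > x } runs. AIX awk also doesn't recognize the \s escape (it's a GNU awk extension, not POSIX awk syntax). A sketch of a variant that addresses both points (untested on AIX, checking only leading spaces with ^ *3, and keeping the 5-line test chunks):
sed '1d;$d;' fiscal13 | awk '
NR % 5 == 1 { N++ }            # flag: a chunk boundary has been reached
(N && ! /^ *3/) || NR == 1 {   # first non-"3" line after a boundary (or line 1)
        if (x) close(x)        # close the previous output file, if any
        x = "Fa" ++i           # switch to the next output file
        N = 0                  # clear the pending-switch flag
}
{ print > x }'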
1. It sets x at the beginning.
2. When NR becomes 6, the remainder will be 1 and N will be incremented.
3. If N is set and the line doesn't start with 3, it checks whether x is set; if x is set, it closes x, increments i so that x becomes the new file name, and resets N.
4. Last, it writes the line to file x.
Hi Akshay,
Yes, you set x in the beginning; that isn't the problem. The problem is that if line (5000 * x) + 1 starts with a 3, you won't attempt to switch files until you have added another 5000 lines to the file. The request is to print 5000 lines per file, but to add single lines to a file such that the 1st line in an output file will never start with a 3 (with the possible exception of the first file).
Another (more complicated, but more efficient) way to do this is:
#!/bin/ksh
awk '
function nf() {
    x = sprintf("F%02d", ++ofc)
    cnt = 0
}
BEGIN {
    nf()          # Set 1st output file name.
    lpf = 5000    # Set # of lines to be included in each output file.
}
NR == 1 {
    # Skip 1st input line.
    next
}
NR > 2 {
    # Print previous line.
    print last > x
    cnt++
}
{
    # Save current line. Do not print it yet so we can skip the last line.
    # When we hit EOF, last will contain the last line read, but we will
    # not have printed it.
    last = $0
}
cnt >= lpf && ! /^ *3/ {
    # If we have a full file and current line does not start with a 3,
    # close current output file and switch to a new output file name.
    close(x)
    nf()
}' "$@"
I use the Korn shell, but any shell that recognizes basic Bourne shell syntax will also work for this script.
This script is more efficient because it only reads the input file once. Rather than using sed to delete the 1st and last line and awk to split the remaining lines, this script just uses awk to skip the 1st and last lines and split the other lines.
It also uses Fxx as the output file name format in case the input is a little more than 50,000 lines, which would produce F1, F2, ..., F10 with a single-digit format. Using two digits means that the output file names will sort in sequence, instead of needing special handling for the F1, F10, F2, F3, ..., F9 ordering.
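As a quick illustration of the sort-order issue (hypothetical names, any POSIX shell):
printf '%s\n' F1 F10 F2 | sort     # prints F1, F10, F2 - not the intended order
printf '%s\n' F01 F02 F10 | sort   # prints F01, F02, F10 - sorts in sequence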
If you name this script tester, make it executable, and invoke it as follows:
./tester fiscal13
it should split the submitter's real input file into approximately 5000-line chunks.
If the test input file is named file and you invoke the script as follows:
./tester lpf=3 file
(The lpf=3 operand overrides the default 5000 lines-per-file setting from the BEGIN clause.) Note that F02 contains 5 lines instead of 3: files are not split in the middle of a multi-line record (assuming that a line starting with a 3 is some kind of continuation line in a multi-line record), and 5 is not a multiple of 3.
Hello All,
I do appreciate all of your inputs, but I have another, slightly more complicated, thing to add.
My code started with something like the following: sed '1d;$d;' XXXXXX
I was deleting the header and trailer from the huge file, but now I need to add that header and trailer to each output file.
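A simple, if not maximally efficient, post-processing sketch, assuming the chunks produced by the earlier scripts are named F* and nothing else in the directory matches that pattern:
head -1 fiscal13 > header        # first line of the original file
tail -1 fiscal13 > trailer       # last line of the original file
for f in F*
do      cat header "$f" trailer > "$f.new" && mv "$f.new" "$f"
done
rm -f header trailer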
If that's a huge file, head -1 might not be the most efficient approach. Why not something like
BEGIN {lpf=5000} NR==1 {head = $0; nf()}
For the tail, I'm not sure whether it's more efficient to open the huge file "from the end" like tac does, or to reopen all of the output files produced so far and append the trailer (e.g. with echo >>).
The head and tail utilities should be pretty efficient at extracting the 1st and last lines, respectively, of your input file (neither needs to read the entire file), but if I had been given this set of requirements to start with, I would have done something more like:
#!/bin/ksh
IAm=${0##*/}
if [ $# -gt 2 ] || [ $# -lt 1 ]
then    printf "Usage: %s file [lines_per_file]\n" "$IAm" >&2
        exit 1
fi
file=${1}
lpf=${2:-5000} # Set lines to be included in each output file.
awk -v lpf="${lpf}" '
function nf(fn) {
    x = sprintf("F%02d", fn)
}
NR == 1 {
    # Save header from 1st line.
    h = $0
    next
}
NR > 2 {
    if(cnt == 0) {
        # Get next output file and add the header line.
        nf(++ofc)
        print h > x
        cnt = 2    # Reserve space for trailer line to be added later.
    }
    # Add previous input line to current output file.
    print last > x
    cnt++
}
{
    # Save current line. Do not print it yet so we can skip the last line.
    # When we hit EOF, last will contain the trailer to be added to all of
    # the output files.
    last = $0
}
cnt >= lpf && ! /^ *3/ {
    # If we have a full file and current line does not start with a 3,
    # close current output file and clear output line count.
    close(x)
    cnt = 0
}
END {
    # Add trailer line to all output files.
    while(ofc) {
        nf(ofc--)
        print last >> x
        close(x)
    }
}' "$file"
Depending on which version of awk you're using, you could comment out the first close(x) statement to avoid having to reopen the files as long as you don't run out of file descriptors. If you try it and you get a diagnostic about too many open files, put that close statement back in.
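If you're unsure how many descriptors you have to work with, the per-process limit can be checked from the shell (the value shown is only an example; it varies by system):
$ ulimit -n
256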