Split a large file in n records and skip a particular record

ibmtech · November 27, 2013, 4:19pm

Hello All,
I have a large file, more than 50,000 lines, and I want to split it into pieces of 5000 records each, which I can do using

sed '1d;$d;' <filename> | awk 'NR%5000==1{x="F"++i;}{print > x}'

Now I need to add one more condition: do not break the file at the 5000th record if that record starts with "3". In that case I need to keep reading until I encounter a record that does not start with 3, and split there.

Below is the sample of the huge file.

  100000035900015300007538   172359500000000000AA000000000Y000000000Y00
  100000035900015300007538   172359500000000000AA000000000Y000000000Y00
  100000035900015300007538   1166231200000000000AA000000000Y000000000Y00
  200000035900015300007538   11029684830A   000000000Y000000000Y01YA 
  200000035900015300007538   0127862850000000000000Y000000000Y00YY 
  200000035900015300007538   01282938700000000000AA000000000Y000000000Y00    
  300000035900015300007538   01282938701025828658A   000000000Y000000000Y01   
  300000035900015300007538   1282938700000000000AA000000000Y000000000Y00
  300000035900015300007538   1282938703028860515A   000000000Y000000000Y03   
  100000035900015300007538   172359500000000000AA000000000Y000000000Y00Y     
  100000035900015300007538   172359500000000000AA000000000Y000000000Y00Y        
  200000035900015300007538   1166231201029684830A   000000000Y000000000Y01YA 
  200000035900015300007538   01278628500000000000AA000000000Y000000000Y00YY

Any help is much appreciated.

Chubler_XL · November 27, 2013, 7:47pm

Try this. Note that I've added a close statement, because with larger files you may run out of open file handles (depending on your OS and awk version):

sed '1d;$d;' <filename> | awk 'NR%5000==1{N++} N&&!/^\s*3/{if(x) close(x);x="F"++i;N=0}{print > x}'

ibmtech · November 29, 2013, 10:43am

Thanks 'Chubler',
When I put my actual file name and try to run, I am getting the below.

sed '1d;$d;' fiscal13 | awk 'NR%5==1{N++} N&&!/^\s*3/{if(x) close(x);x="Fa"++i;N=0}{print > x}'
awk: 0602-576 A print or getline function must have a file name.
 The input line number is 1.
 The source line number is 1.

Any help is much appreciated!

Corona688 · November 29, 2013, 11:09am

Try nawk.

ibmtech · November 29, 2013, 11:13am

Is nawk for AIX? I think it's for Solaris.

FYI, I am using AIX. (7.1).

Thanks,

Corona688 · November 29, 2013, 11:20am

nawk is available on many systems, but I think I've spotted the error now:

awk 'BEGIN{x="F"++i } NR%5==1{N++} N&&!/^\s*3/{if(x) close(x);x="Fa"++i;N=0}{print > x}'

Akshay_Hegde · November 29, 2013, 11:52am

Try :

$ awk 'NR==1 || NR % 5000 == 1 && !/^\s*3/{close(f);f="File_"++i".tmp"}{print >f}' file

Let me know if I missed something.

ibmtech · November 29, 2013, 12:08pm

Thanks a lot Corona688, that works like a charm, that is what I was searching for.
I really appreciate it.
Thanks Bud.

@Akshay
Yours works, but it's not giving the desired results: some files have 25000 lines, some 15000, and some 5000.
Any way thanks for helping out.

Akshay_Hegde · November 29, 2013, 12:14pm

@ibmtech Thank you.

@Corona688 can you please tell me what's wrong in my code? If possible please explain, and I will correct it.

Corona688 · November 29, 2013, 12:21pm

For starters you need to set the original value of 'f' so you won't be printing into a blank filename.

% can cause some precedence problems in C, I know, so I would bracket your expressions more carefully.

Akshay_Hegde · November 29, 2013, 12:29pm

But NR == 1 || ........ will set the filename, right? I still don't get what might be wrong.

Chubler_XL · November 30, 2013, 1:20pm

The problem is that all the conditions must be true at the same time to change the filename, so if record 5001 starts with "3" the test is not repeated on record 5002; the next attempt only happens 5000 records later.

My solution failed because record 1 had "3" so no starting filename was set.
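
To make both failure modes concrete, here is a minimal sketch of the "pending split" idea on a toy input, with a chunk size of 3 instead of 5000. The names in.txt, P1, P2, ... and the variable pending are made up for the illustration. The pattern ^[ \t]*3 is used in place of \s, which is a GNU awk extension and may not be understood by other awks (I have not checked the AIX one).

```shell
# Work in a scratch directory so the demo files do not clutter anything.
cd "$(mktemp -d)" || exit 1

# Toy input: the first field is the record type (1, 2 or 3).
printf '%s\n' '1 a' '2 b' '2 c' '3 d' '1 e' '2 f' '2 g' > in.txt

awk -v n=3 '
BEGIN { x = "P" (++i) }                  # always start with a valid file name
NR > 1 && NR % n == 1 { pending = 1 }    # a split is due from this record on
pending && !/^[ \t]*3/ {                 # ...but only at a record that is not a "3"
        close(x); x = "P" (++i); pending = 0
}
{ print > x }' in.txt

cat P1
```

With this input P1 should end up with four records, because the split due at record 4 is postponed past the "3" record. P2 then gets two records and P3 one, since the split points stay tied to NR rather than to the size of the current piece.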

Akshay_Hegde · November 30, 2013, 2:12pm

My understanding
Corona's code

awk 'BEGIN{x="F"++i } NR%5==1{N++} N&&!/^\s*3/{if(x) close(x);x="Fa"++i;N=0}{print > x}'

1. It sets x in the beginning.
2. When NR becomes 6 the remainder will be 1 and N will be incremented.
3. If N is set and the line doesn't start with 3, it checks whether x is set; if so, it closes x. It then increments i, makes x the new file name, and resets N.
4. Last, it writes the line to file x.

My code

awk 'NR==1 || NR % 5000 == 1 && !/^\s*3/{close(f);f="File_"++i".tmp"}{print >f}' file

When NR == 1, close(f) is called; since f is not set yet, it has no effect. i is incremented to 1 and f becomes the name of the first file. I used NR==1 instead of a BEGIN block.
When NR becomes 5001 the remainder will be 1, so it checks whether the line starts with the digit 3; if not, it closes f, increments i, and the file name changes.
Finally, it writes the line to file f.

let me know if my understanding is wrong.

Don_Cragun · November 30, 2013, 4:01pm

Hi Akshay,
Yes, you set the file name on the first line; that isn't the problem. The problem is that if line (5000 * k) + 1, for some integer k, starts with a 3, you won't attempt to switch files until you have added another 5000 lines to the file. The request is to print 5000 lines per file, but to keep adding single lines to a file so that the 1st line in an output file will never start with a 3 (with the possible exception of the first file).

Another (more complicated, but more efficient) way to do this is:

#!/bin/ksh
awk '
function nf() {
        x = sprintf("F%02d", ++ofc)
        cnt = 0
}
BEGIN { nf()            # Set 1st output file name.
        lpf = 5000      # Set # of lines to be included in each output file
}
NR == 1 {
        # Skip 1st input line.
        next
}
NR > 2 {# Print previous line.
        print last > x
        cnt++
}
{       # Save current line.  Do not print it yet so we can skip the last line.
        # When we hit EOF, last will contain the last line read, but we will
        # not have printed it.
        last = $0
}
cnt >= lpf && ! /^ *3/ {
        # If we have a full file and current line does not start with a 3,
        # close current output file and switch to a new output file name.
        close(x)
        nf()
}' "$@"

I use the Korn shell, but any shell that recognizes basic Bourne shell syntax will also work for this script.

This script is more efficient because it only reads the input file once. Rather than using sed to delete the 1st and last line and awk to split the remaining lines, this script just uses awk to skip the 1st and last lines and split the other lines.

It also uses Fxx as the output file name format in case the input is a little more than 50000 lines which would produce F1, F2, ... F10. Using two digits means that the output file names will sort in sequence instead of having to worry about special handling for F1, F10, F2, F3, ... F9.
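
The sorting difference is easy to check from the shell. This is only an illustration with generated names, not part of the script above:

```shell
# Unpadded names: F10 sorts before F2 in a plain lexical sort.
awk 'BEGIN { for (i = 1; i <= 10; i++) print "F" i }' | LC_ALL=C sort | head -n 3

# Zero-padded names, built with the same format nf() uses, keep numeric order.
awk 'BEGIN { for (i = 1; i <= 10; i++) printf "F%02d\n", i }' | LC_ALL=C sort | head -n 3
```

The first pipeline should list F1, F10, F2 and the second F01, F02, F03.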

If you name this script tester, make it executable, and invoke it as follows:

./tester fiscal13

it should split the submitter's real input file into approximately 5000 line chunks.

If the test input file is named file and you invoke the script as follows:

./tester lpf=3 file

it will produce 3 files; F01 containing:

  100000035900015300007538   172359500000000000AA000000000Y000000000Y00
  100000035900015300007538   1166231200000000000AA000000000Y000000000Y00
  200000035900015300007538   11029684830A   000000000Y000000000Y01YA

F02 containing:

  200000035900015300007538   0127862850000000000000Y000000000Y00YY 
  200000035900015300007538   01282938700000000000AA000000000Y000000000Y00    
  300000035900015300007538   01282938701025828658A   000000000Y000000000Y01   
  300000035900015300007538   1282938700000000000AA000000000Y000000000Y00
  300000035900015300007538   1282938703028860515A   000000000Y000000000Y03

and F03 containing:

  100000035900015300007538   172359500000000000AA000000000Y000000000Y00Y     
  100000035900015300007538   172359500000000000AA000000000Y000000000Y00Y        
  200000035900015300007538   1166231201029684830A   000000000Y000000000Y01YA

(The lpf=3 operand overrides the default 5000 lines per file setting set in the BEGIN clause.) Note that F02 contains 5 lines instead of 3 to avoid splitting files in the middle of a multi-line record (assuming that a line starting with a 3 is some kind of continuation line in a multi-line record) but 5 is not a multiple of 3.
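
The override works because awk treats an operand of the form var=value as an assignment, and performs it when it reaches that operand in the argument list: after BEGIN has run, but before the following file is read. A small check, using a throwaway temporary file as a stand-in for the real input:

```shell
# One-line input file, just so the main rule runs once.
tmp=$(mktemp) && echo 'one line' > "$tmp"

# No operand assignment: the value set in BEGIN is still in effect.
awk 'BEGIN { lpf = 5000 } { print lpf }' "$tmp"

# lpf=3 is processed after BEGIN and before "$tmp" is opened, so it wins.
awk 'BEGIN { lpf = 5000 } { print lpf }' lpf=3 "$tmp"

rm -f "$tmp"
```

The first command should print 5000 and the second 3. A value passed with -v would instead be set before BEGIN and then overwritten by it.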

Akshay_Hegde · December 1, 2013, 1:10am

Thank you Don Cragun.

ibmtech · December 2, 2013, 11:19am

Hello All,
I do appreciate all of your input, but I have one more slightly complicated thing to add.

My code was starting with something as below,
sed '1d;$d;' XXXXXX
I was deleting the header and tail from the huge file. But I need to add that header and tail to each file.

Any help is appreciated.

Akshay_Hegde · December 2, 2013, 11:42am

In Don's solution add this

awk -v head="$(head -1 file)" '
function nf() {
        x = sprintf("F%02d", ++ofc)
        print head >x
        cnt = 0
}
.................
.................
.................
' file

RudiC · December 2, 2013, 12:28pm

If that's a huge file, head -1 might not be the most efficient ansatz. Why not something like

BEGIN {lpf=5000} NR==1 {head = $0; nf()}

For the tail, I'm not sure if it's more efficient to open the huge file "from the end" like tac does or to reopen all output files produced so far and append the tail (e.g. with echo >> ).
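
The second option might look like the sketch below. It assumes the pieces have already been written and are named F01, F02, ... as in Don's script, that the input is called big.txt (a made-up name), and it uses tail with printf rather than echo to fetch and append the trailer:

```shell
cd "$(mktemp -d)" || exit 1

# Stand-ins for the real data: a tiny "huge" file and two pieces cut from it.
printf '%s\n' HEADER 'rec 1' 'rec 2' TRAILER > big.txt
printf '%s\n' HEADER 'rec 1' > F01
printf '%s\n' HEADER 'rec 2' > F02

trailer=$(tail -n 1 big.txt)               # last line of the original file
for f in F[0-9][0-9]; do
        printf '%s\n' "$trailer" >> "$f"   # append it to every piece
done
```

Whether this is cheaper than handling the trailer inside awk depends on how many pieces there are; I have not measured it.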

ibmtech · December 2, 2013, 4:45pm

Thanks for the input guys.

Don_Cragun · December 2, 2013, 5:08pm

The head and tail utilities should be pretty efficient (and not read the entire file) to extract the 1st and last lines, respectively, of your input file, but if I had been given this set of requirements to start with, I would have done something more like:

#!/bin/ksh
IAm=${0##*/}
if [ $# -gt 2 ] || [ $# -lt 1 ]
then    printf "Usage: %s file [lines_per_file]\n" "$IAm" >&2
        exit 1
fi
file=${1}
lpf=${2:-5000}  # Set lines to be included in each output file.
awk -v lpf="${lpf}" '
function nf(fn) {
        x = sprintf("F%02d", fn)
}
NR == 1 {
        # Save header from 1st line.
        h = $0
        next
}
NR > 2 {if(cnt == 0) {
                # Get next output file and add the header line.
                nf(++ofc)
                print h > x
                cnt = 2 # Reserve space for trailer line to be added later.
        }
        # Add previous input line to current output file.
        print last > x
        cnt++
}
{       # Save current line.  Do not print it yet so we can skip the last line.
        # When we hit EOF, last will contain the trailer to be added to all of
        # the output files.
        last = $0
}
cnt >= lpf && ! /^ *3/ {
        # If we have a full file and current line does not start with a 3,
        # close current output file and clear output line count.
        close(x)
        cnt = 0
}
END {   # Add trailer line to all output files.
        while(ofc) {
                nf(ofc--)
                print last >> x
                close(x)
        }
}' "$file"

Depending on which version of awk you're using, you could comment out the first close(x) statement to avoid having to reopen the files as long as you don't run out of file descriptors. If you try it and you get a diagnostic about too many open files, put that close statement back in.
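
The END block uses >> rather than > because, once a file has been closed, a later print > x reopens it from scratch and discards what was there. A tiny demonstration, with made-up file names a.out and b.out:

```shell
cd "$(mktemp -d)" || exit 1

awk 'BEGIN {
        print "body" > "a.out";  close("a.out")
        print "trailer" > "a.out"        # reopened with >: "body" is lost
        print "body" > "b.out";  close("b.out")
        print "trailer" >> "b.out"       # reopened with >>: line is appended
}'

cat a.out b.out
```

a.out should contain only the trailer line, while b.out keeps the body line followed by the trailer.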