Sort and Split file with header and custom name

Saanvi1 · October 6, 2015, 1:11pm

Hi,

I am using SUN SOLARIS (SunOS sun4v sparc SUNW, T5240).
I have a huge data file with header and trailer. This file gets used into an ETL process. ETL skips the header record (which is the first record of the file) and loads the rest of the record. The file can be delimited (comma, tab, pipe) or fixed width.

I am trying to write a script that:

Sorts the records by first field. The value of first field is �Q� or �E�. So all �Q�record should sort first then �E� records.
After the sort above is complete, I want to split the file on every 30,000 records with rules below:

A. The first record in every split file is a header record or blank record- starting with "H"
B. The split file needs to have same name as master file with __(double underscore) and the number.

For e.g:
  Master File name is: MasterFile-HP-import-20151006.txt
  Split file name needs to be :  MasterFile-HP-import-20151006__1.txt  ; MasterFile-HP-import-20151006__2.txt and so on.

I tried the two things below for file split:

split -dl 30000 MasterFile-HP-import-20151006.txt MasterFile-HP-import-20151006__

But getting error "split: illegal option � d error"

Also, tried the solution that I got from internet:

awk '{filename = " MasterFile-HP-import-20151006__" int((NR-1)/30000) ".txt"; print >> filename}' MasterFile-HP-import-20151006.txt

But getting an error: awk: too many output files 10 . It generates the first 10 files and gives error on 11th file.

I would appreciate any help.

Thanks

RudiC · October 6, 2015, 2:27pm

Without sort ing, try

awk '
NR==1           {HD=$0
                 OF=FILENAME
                 next
                }
!((NR-2)%lines) {if (OF) close (OF)
                 sub (/(__[0-9])*\./, "__" ++CNT ".", OF)
                 print HD > OF
                }
                {print $0 > OF
                }
' lines=30000 MasterFile-HP-import-20151006.txt

---------- Post updated at 20:27 ---------- Previous update was at 20:01 ----------

Or, try

{
        { readlink /proc/$$/fd/0; line; sort; } | awk '
        NR==1           {OF=$0
                         next
                        }    
        NR==2           {HD=$0
                         next
                        }
        !((NR-3)%lines) {if (OF) close (OF)
                         sub (/(__[0-9])*\./, "__" ++CNT ".", OF)
                         print HD > OF
                        }
                        {print $0 > OF
                        }
        ' lines=30000
} < MasterFile-HP-import-20151006.txt

Saanvi1 · October 6, 2015, 4:32pm

Thank you for the response. I tried the code below:

{
        { readlink /proc/$$/fd/0; line; sort; } | awk '
        NR==1           {OF=$0
                         next
                        }    
        NR==2           {HD=$0
                         next
                        }
        !((NR-3)%lines) {if (OF) close (OF)
                         sub (/(__[0-9])*\./, "__" ++CNT ".", OF)
                         print HD > OF
                        }
                        {print $0 > OF
                        }
        ' lines=30000
} < ${NEW_FILE}/MasterFile-HP-import-20151006.txt

But getting the error below:

readlink:  not found
awk: syntax error near line 8
awk: bailing out near line 8

Also tried this:

awk '
NR==1           {HD=$0
                 OF=FILENAME
                 next
                }
!((NR-2)%lines) {if (OF) close (OF)
                 sub (/(__[0-9])*\./, "__" ++CNT ".", OF)
                 print HD > OF
                }
                {print $0 > OF
                }
' lines=30000 ${NEW_FILE}/MasterFile-HP-import-20151006.txt

Getting error:

awk: syntax error near line 6
awk: bailing out near line 6

Don_Cragun · October 6, 2015, 6:08pm

As always, on Solaris/SunOS systems change awk to /usr/xpg4/bin/awk or nawk .

Saanvi1 · October 6, 2015, 11:58pm

Thank you Don and RudiC for your help/advice. Appreciate your time.
I was able to run both the code suggested by RudiC using nawk and /usr/xpg4/bin/awk .

The split is working fine. But getting one small issue.The files are getting generated fine from 1 to 10. But after 10 the number starts appending to itself. Please see below:

MasterFile-HP-import-20151006__1.txt
MasterFile-HP-import-20151006__2.txt

�� and so on upto 10

MasterFile-HP-import-20151006__10.txt

After 10. The number starts appending to 10.

MasterFile-HP-import-20151006__10_11.txt
MasterFile-HP-import-20151006__10_11_12.txt
MasterFile-HP-import-20151006__10_11_12_13.txt
MasterFile-HP-import-20151006__10_11_12_13_14.txt

Would it be possible to get the 11th file as:

MasterFile-HP-import-20151006__11.txt
MasterFile-HP-import-20151006__12.txt

and so on....

Thanks again

Don_Cragun · October 7, 2015, 12:19am

Try changing:

                 sub (/(__[0-9])*\./, "__" ++CNT ".", OF)

to:

                 sub (/(__[0-9]+)*\./, "__" ++CNT ".", OF)