Splitting based on occurrence of a character at a fixed position

Neelkanth · July 21, 2013, 9:36am

I have a requirement where I need to split a file based on the occurrence of a character at a fixed position. The details are below:

The file will have more than 1 lakh (100,000) records.
Each line will have a fixed length of 987 characters.
At position 28 in each line either 'C' or 'D' will be present.
I need to split the file after every line in which 'D' occurs.
Also, the names of the split files should share a common prefix, something like <Original File Name>_aa, <Original File Name>_ab, <Original File Name>_ac and so on.
Please find below an example of the file:

666617000338    INR        C           1800.0
655517000338    INR        C           1000.0
644417000338    INR        C           1800.0
655517000338    INR        C           1500.0
666617000338    INR        C           1200.0
699917000338    INR        C           1100.0
688817000338    INR        C           1500.0
644417000338    INR        D          10000.0
655517000338    INR        C           1800.0
677717000338    INR        C           1800.0
699917000338    INR        C           1800.0
622217000338    INR        D           3600.0

So the split files should look like this:
First file:

666617000338    INR        C           1800.0
655517000338    INR        C           1000.0
644417000338    INR        C           1800.0
655517000338    INR        C           1500.0
666617000338    INR        C           1200.0
699917000338    INR        C           1100.0
688817000338    INR        C           1500.0
644417000338    INR        D          10000.0

and the second file should look like:

655517000338    INR        C           1800.0
677717000338    INR        C           1800.0
699917000338    INR        C           1800.0
622217000338    INR        D           3600.0

and so on.

bartus11 · July 21, 2013, 9:59am

Will the "C" or "D" character always be in the third column of the file?

Neelkanth · July 21, 2013, 10:08am

No, the column is not fixed; only the position is fixed.

bartus11 · July 21, 2013, 10:10am

So why, in your example, is the "C"/"D" at position 18 and not 28?

cfajohnson · July 21, 2013, 10:38am

position=18
char=D

awk -v p="$position" -v c="$char" '
BEGIN { basefile = "txt"; filename = basefile "" ++x }
{print > filename}
(substr($0,p,1) == c) { filename = basefile "" ++x }
' "$file"

Neelkanth · July 21, 2013, 11:22am

Hi Bartus,

The position shows up as 18 because the multiple spaces after 338 were collapsed when I posted on the forum.

bartus11 · July 21, 2013, 11:24am

Use code tags to keep the original spacing.

Don_Cragun · July 21, 2013, 1:02pm

Hi Neelkanth,
I have added CODE tags to your original post in this thread. The fact that you omitted CODE tags explains why the responders to this thread saw the C and D in column 18 instead of 28. There is no indication in your posting that any line has any trailing spaces (or other data) following the last digit shown on each line. I hope that the video clip included in the infraction notice you received recently will help you understand how to use CODE tags so confusion like we've seen in this thread will not be a problem in future threads that you start.
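
For anyone wanting to confirm where the flag really sits in the actual file, one option is to tabulate the character found at a given position. This is only a minimal sketch, assuming a hypothetical two-line sample.txt; the names sample.txt, flag_counts.txt, and count are illustrative and not from the original post:

```shell
# Hypothetical sample: two fixed-width lines with the flag at position 28.
printf '%s\n' \
  '666617000338    INR        C           1800.0' \
  '644417000338    INR        D          10000.0' > sample.txt

# Count how often each character occurs at position p.
awk -v p=28 '
    { count[substr($0, p, 1)]++ }
END { for (ch in count) print ch, count[ch] }
' sample.txt | sort > flag_counts.txt

cat flag_counts.txt
```

If anything other than C and D shows up in the counts, the chosen position is probably off, or the spacing was altered somewhere along the way.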

RudiC · July 21, 2013, 3:18pm

cfajohnson's proposal is fine but does not create the filenames as specified. Try this adaption of his code:

awk -v p="$position" -v c="$char" -v EXT="aa" '
                                {print > FILENAME "_" EXT}
         substr($0,p,1) == c    {if (++x > 25) {y++; x=0} 
                                 EXT = sprintf ("%c%c", y + 97, x + 97)}
        ' file

Don_Cragun · July 21, 2013, 4:56pm

rudic:

cfajohnson's proposal is fine but does not create the filenames as specified. Try this adaption of his code:
awk -v p="$position" -v c="$char" -v EXT="aa" '
                                {print > FILENAME "_" EXT}
         substr($0,p,1) == c    {if (++x > 25) {y++; x=0}
                                 EXT = sprintf ("%c%c", y + 97, x + 97)}
        ' file

There may still be a couple of problems here. The standards don't clearly specify the precedence for the command:

print > FILENAME "_" EXT

so it can be evaluated as:

(print > FILENAME) "_" EXT

(as it is on Mac OS X) or as:

print > (FILENAME "_" EXT)

(as I think it is on some other systems) so to be sure you get what was intended, you need to add the parentheses as shown in the last form above.
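
Another way to sidestep the ambiguity is to build the output name in a variable first, so the redirection target is a single operand. A minimal sketch, assuming a hypothetical input file named sample.txt; the variable name out is introduced here purely for illustration:

```shell
# Hypothetical sample data with the flag at position 28.
printf '%s\n' \
  '666617000338    INR        C           1800.0' \
  '644417000338    INR        D          10000.0' \
  '655517000338    INR        C           1800.0' > sample.txt

awk -v EXT="aa" '
{
    out = FILENAME "_" EXT   # the concatenation is done here, outside the redirection
    print > out              # the target is now one variable, so grouping cannot differ
}' sample.txt
```

With this sample every line lands in sample.txt_aa, because the sketch only illustrates the redirection and does not do any splitting.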

Since there is no indication of the number of expected output files (other than that it could be inferred to be somewhere between 27 and 676 since the suffix string is two lowercase alphabetic characters), awk can run out of file descriptors if files aren't closed once they will no longer be used for output.

So for the real data (instead of the tiny sample file), the following might work better:

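awk -v p="$position" -v c="$char" -v EXT="aa" '
                                # Write every line to the current output file.
                                {print > (FILENAME "_" EXT)}
                                # After a line with c at position p, close that file and advance the suffix.
         substr($0,p,1) == c    {if (++x > 25) {y++; x=0}
                                 close(FILENAME "_" EXT)
                                 EXT = sprintf ("%c%c", y + 97, x + 97)}
        ' file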

As always, if you want to try this on a Solaris/SunOS system, use /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk instead of /usr/bin/awk or /bin/awk .

Note also that this script won't work as specified on a system that uses EBCDIC or some other non-ASCII based codeset where 97 is not the encoding for "a" or the lowercase alphabetic characters are not all in consecutive numeric sequence.
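
If the codeset dependency is a concern, one possible variation is to take the suffix letters from a literal string instead of computing them from character codes. This is a sketch under stated assumptions, not a tested replacement for the script above: sample.txt is a hypothetical input, and letters, suffix(), out, and n are names made up for this example.

```shell
# Hypothetical sample data: flag character at position 28 of every line.
printf '%s\n' \
  '666617000338    INR        C           1800.0' \
  '644417000338    INR        D          10000.0' \
  '655517000338    INR        C           1800.0' \
  '622217000338    INR        D           3600.0' > sample.txt

awk -v p=28 -v c=D '
BEGIN { letters = "abcdefghijklmnopqrstuvwxyz" }

# Map a zero-based chunk number k (0..675) to a two-letter suffix:
# 0 -> aa, 1 -> ab, 26 -> ba.
function suffix(k) {
    return substr(letters, int(k / 26) + 1, 1) substr(letters, k % 26 + 1, 1)
}

# Write every line to the output file for the current chunk.
{
    out = FILENAME "_" suffix(n)
    print > out
}

# After writing a line flagged with c, close this chunk and move to the next.
substr($0, p, 1) == c {
    close(out)
    n++
}' sample.txt
```

With the sample above this leaves two lines in sample.txt_aa and two in sample.txt_ab. The sketch makes no attempt to handle more than 676 chunks; past that point the generated names would no longer be two letters long and could eventually repeat.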