Splitting based on occurrence of a character at a fixed position

I have a requirement where I need to split a file based on the occurrence of a character that is present at a fixed position. The description is as below:

  1. The file will have more than 1 lakh (100,000) records.
  2. Each line will have a fixed length of 987 characters.
  3. At position 28 in each line, either 'C' or 'D' will be present.
  4. I need to split the file after every line that has a 'D' at that position.
  5. Also, the names of the split files should share a common prefix, something like <Original File Name>_aa, <Original File Name>_ab, <Original File Name>_ac and so on.
    Please find below an example of the file:
666617000338    INR        C           1800.0
655517000338    INR        C           1000.0
644417000338    INR        C           1800.0
655517000338    INR        C           1500.0
666617000338    INR        C           1200.0
699917000338    INR        C           1100.0
688817000338    INR        C           1500.0
644417000338    INR        D          10000.0
655517000338    INR        C           1800.0
677717000338    INR        C           1800.0
699917000338    INR        C           1800.0
622217000338    INR        D           3600.0

So the split files should be like:
First file:

666617000338    INR        C           1800.0
655517000338    INR        C           1000.0
644417000338    INR        C           1800.0
655517000338    INR        C           1500.0
666617000338    INR        C           1200.0
699917000338    INR        C           1100.0
688817000338    INR        C           1500.0
644417000338    INR        D          10000.0

and second file should be like:

655517000338    INR        C           1800.0
677717000338    INR        C           1800.0
699917000338    INR        C           1800.0
622217000338    INR        D           3600.0

and so on.

Will the "C" or "D" character always be in the third column of the file?

No, the column is not fixed; only the position is fixed.

So why, in your example, is the "C"/"D" at position 18 and not 28?

position=18
char=D

awk -v p="$position" -v c="$char" '
BEGIN { basefile = "txt"; filename = basefile "" ++x }
{print > filename}
(substr($0,p,1) == c) { filename = basefile "" ++x }
' "$file"

Hi Bartus,

The position is showing up as 18 because the multiple spaces after 338 were collapsed when posting on the forum.
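
A quick way to settle which position really holds the flag is to tally the characters found at that offset in the actual file. The following is only a sketch: the two sample lines are rebuilt with printf padding (an assumption standing in for the real 987-character records), and the temporary directory and the variable name flags are made up for illustration.

```shell
# Hypothetical sample: pad the fields so the flag lands at position 28
# (16 columns for the account, 11 for the currency, then the flag).
tmp=$(mktemp -d)
printf '%-16s%-11s%s%17s\n' \
    666617000338 INR C 1800.0 \
    644417000338 INR D 10000.0 > "$tmp/sample.txt"

# Tally every distinct character that appears at position 28.
flags=$(awk '{ n[substr($0, 28, 1)]++ }
             END { for (k in n) print k, n[k] }' "$tmp/sample.txt" | sort)
printf '%s\n' "$flags"
```

If anything other than C and D shows up in the tally for the real file, the position passed to the split script is probably wrong.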

Use code tags to keep the original spacing.

Hi Neelkanth,
I have added CODE tags to your original post in this thread. The fact that you omitted CODE tags explains why the responders to this thread saw the C and D in column 18 instead of 28. There is no indication in your posting that any line has any trailing spaces (or other data) following the last digit shown on each line. I hope that the video clip included in the infraction notice you received recently will help you understand how to use CODE tags so confusion like we've seen in this thread will not be a problem in future threads that you start.

cfajohnson's proposal is fine but does not create the filenames as specified. Try this adaptation of his code:

awk -v p="$position" -v c="$char" -v EXT="aa" '
                                {print > FILENAME "_" EXT}
         substr($0,p,1) == c    {if (++x > 25) {y++; x=0} 
                                 EXT = sprintf ("%c%c", y + 97, x + 97)}
        ' file

There may still be a couple of problems here. The standards don't clearly specify the precedence for the command:

print > FILENAME "_" EXT

so it can be evaluated as:

(print > FILENAME) "_" EXT

(as it is on Mac OS X) or as:

print > (FILENAME "_" EXT)

(as I think it is on some other systems). To be sure you get what was intended, you need to add the parentheses as shown in the last form above.
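
To see the parenthesized form in isolation, here is a minimal sketch. The input file name input.txt, its two lines, and the temporary directory are assumptions made up for the demonstration; only the redirection itself comes from the discussion above.

```shell
# Hypothetical two-line input file in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' one two > "$tmp/input.txt"

# With the parentheses, the target name is always FILENAME followed by "_aa".
awk -v EXT="aa" '{ print > (FILENAME "_" EXT) }' "$tmp/input.txt"

# Read the copy back to confirm where the lines went.
copied=$(cat "$tmp/input.txt_aa")
printf '%s\n' "$copied"
```

Because the whole concatenation is grouped, every awk should agree on the output file name, whatever precedence it gives to an unparenthesized redirection.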

Since there is no indication of the number of expected output files (other than that it could be inferred to be somewhere between 27 and 676, since the suffix string is two lowercase alphabetic characters), awk will run out of file descriptors if files aren't closed when they will no longer be used for output.

So for the real data (instead of the tiny sample file), the following might work better:

awk -v p="$position" -v c="$char" -v EXT="aa" '
                                {print > (FILENAME "_" EXT)}
         substr($0,p,1) == c    {if (++x > 25) {y++; x=0}
                                 close(FILENAME "_" EXT)
                                 EXT = sprintf ("%c%c", y + 97, x + 97)}
        ' file
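
For anyone who wants to try the script above before pointing it at the real data, here is a rough test run. The 12-line sample is generated with printf padding so that the flag sits at position 28 (the account number and amount are repeated placeholders, not the real records), and the scratch directory and the names aa_lines and ab_lines are assumptions for illustration.

```shell
# Hypothetical 12-line sample shaped like the posted data: flag at position 28.
tmp=$(mktemp -d)
for flag in C C C C C C C D C C C D; do
    printf '%-16s%-11s%s%17s\n' 666617000338 INR "$flag" 1800.0
done > "$tmp/file"

position=28
char=D

# Same logic as the script above, run against the sample.
awk -v p="$position" -v c="$char" -v EXT="aa" '
    { print > (FILENAME "_" EXT) }
    substr($0, p, 1) == c {
        if (++x > 25) { y++; x = 0 }
        close(FILENAME "_" EXT)
        EXT = sprintf("%c%c", y + 97, x + 97)
    }
' "$tmp/file"

# Count the lines in each piece.
aa_lines=$(awk 'END { print NR }' "$tmp/file_aa")
ab_lines=$(awk 'END { print NR }' "$tmp/file_ab")
echo "file_aa: $aa_lines lines, file_ab: $ab_lines lines"
```

With the D flags on lines 8 and 12, the first piece should hold 8 lines and the second 4. No third file is created, because nothing follows the last D.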

As always, if you want to try this on a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of /usr/bin/awk or /bin/awk.

Note also that this script won't work as specified on a system that uses EBCDIC or some other non-ASCII based codeset where 97 is not the encoding for "a" or the lowercase alphabetic characters are not all in consecutive numeric sequence.
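
If the codeset is a concern, one possible workaround is to look the letters up in a literal alphabet string instead of computing them from character codes. This is only a sketch of that idea, not part of the scripts posted above: the function name suffix and the variable suffixes are made up here, and it needs an awk with user-defined functions (so not the old /usr/bin/awk on Solaris).

```shell
# Map a zero-based file counter to a two-letter suffix without relying on
# ASCII codes: index into a literal alphabet instead of using sprintf("%c").
suffixes=$(awk '
function suffix(n,    letters) {
    letters = "abcdefghijklmnopqrstuvwxyz"
    return substr(letters, int(n / 26) + 1, 1) substr(letters, n % 26 + 1, 1)
}
BEGIN { print suffix(0), suffix(1), suffix(25), suffix(26), suffix(675) }
')
printf '%s\n' "$suffixes"
```

In the split script, the counter would simply be incremented after each 'D' line, and EXT set to suffix(counter) in place of the sprintf call.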