EBCDIC File Split Based On Record Key

I was wondering if anyone could explain to me how to split a variable length EBCDIC file into separate files based on the record key. I have the COBOL layout, and I need to split the file into 13 different EBCDIC files so that I can run each one through a C++ converter I have and get the corresponding csv output file to put into a database. The records are:

Record Key    Segment Name
01            GRROOT
02            GRCYCLE
...           ...
13            GRR3RMKS

If it helps at all, the PDF that comes with the EBCDIC file showing the COBOL layout states that the record length of the file is 422, the blocking factor is 77 and the blocksize is 32,494. There is additional information such as GRROOT length is 150 bytes, GRCYCLE is 72 bytes, etc.

Thanks for the help

There is an awful lot that is unspecified here:

  • What operating system are you using?
  • Is there any binary data in the COBOL files you're processing, or is it all text?
  • Is the record key the entire 422 byte record? If not, what part of the record constitutes the key?
  • Why do you say the input is variable length and then say that the record length is 422 bytes per record and 77 records per block? What is variable other than the number of records in the file?

If you're trying to process EBCDIC files on an ASCII-based system, the dd utility will probably be at the base of your processing. Look at the dd man page on your system and see if something like the following would be a good start to getting a file you can then split with awk or grep:

dd if=YourEBCDICInputFileName of=YourASCIIOutputFileName ibs=422x77 cbs=422 conv=ascii,unblock,sync

and, after splitting YourASCIIOutputFileName into the files you want based on your keys, you could convert them back into fixed-length, blocked, EBCDIC files using dd again with obs=422x77 instead of ibs=422x77 and conv=ebcdic,block... and appropriate if= and of= parameters.
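For example, a rough sketch of that reverse conversion might look like the following (the output filename is a placeholder, and the exact operands may need adjusting for your dd implementation):

dd if=YourASCIIOutputFileName of=YourNewEBCDICFileName obs=422x77 cbs=422 conv=ebcdic,block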


Thanks for the reply Don. I am doing this for a little side work project. Here are some of the specifics:

1) Any OS. I have a machine that runs Ubuntu, and my work computer is Windows 10. It sounds like Ubuntu would be my preference here.
2) There is binary data being processed: packed decimal fields, if that sounds right.
3) Reading through the COBOL, the record key is the first two bytes (1,2) of each record.
4) The 422 and 77 were numbers that appear in the front of the PDF, but later it says that each record is of variable length and gives the length of each record. The total number of records would change each month, since this is a monthly dataset.

As I am typing this, it sounds like I would need to use the dd command and be able to change the number of bytes that is read each time based on what the record key is. So let's say I use dd and I want to read the first two bytes. If the ASCII conversion of those bytes = 01, then I know that the record length is 150 bytes, so I want to read those 150 bytes and write them to a new EBCDIC file, which will later be sent through a program that unpacks the fields and converts to a csv. Then I would want to skip 150 bytes and read the next two bytes. Let's say those = 02, so I know that the record is 72 bytes. So on and so forth.

You can convert the entire EBCDIC file to ASCII just using:

dd if=EBCDICfile of=ASCIIfile conv=ascii

but, with packed decimal fields in the COBOL input file, you may end up with null bytes in the output file. And, you can't have null bytes in a text file. If you aren't working on text files, many of the standard utilities produce undefined results.
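As a quick sanity check (just a suggestion, using the standard wc and od utilities), you can confirm the translation didn't add or drop any bytes and peek at the translated data to see whether <nul> bytes are present:

wc -c EBCDICfile ASCIIfile	# the two byte counts should match
od -c ASCIIfile | head	# \0 in the output marks a <nul> byte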

But, you can use cut and paste even if the files being processed are not text files. So, after converting your file to ASCII, you could walk through the file in a loop, starting with offset=1 (a rough sketch follows the list):

  • grabbing two bytes (with cut) to determine the record type,
  • based on the type, grabbing x more bytes (again with cut) to complete the record you started reading,
  • writing the complete x+2 byte record to the appropriate output file (again, based on record type), and
  • incrementing offset by x+2.
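A minimal sketch of that loop, assuming the 150- and 72-byte lengths mentioned earlier for record types 01 and 02 (the remaining types would need to be filled in); note that cut -b addresses bytes within each line, so this only behaves as intended while the translated data contains no <newline> bytes:

offset=1
fsize=$(wc -c < data.ascii)
while [ "$offset" -le "$fsize" ]
do	type=$(cut -b "$offset-$((offset + 1))" data.ascii)
	case "$type" in
	(01)	len=150;;
	(02)	len=72;;
	(*)	echo "unknown record type $type at offset $offset" >&2
		break;;
	esac
	cut -b "$offset-$((offset + 1 + len))" data.ascii >> "Type$type.ascii"
	offset=$((offset + 2 + len))
done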

As long as you don't modify the parts of the record that contain packed decimal data, converting back from ASCII to EBCDIC should still have correct packed decimal data in the resulting EBCDIC file.

Why not install GnuCOBOL and write a COBOL program?

Alright Don, so I have spent the last few days playing with this and have run into a couple of quirks. First, some things about the file. We will call the original EBCDIC file with all of the data data.ebc. I go ahead and do the simple conversion using dd to get a new file, data.ascii. Running the wc command gives me

0 lines in data.ebc, with 64454170 bytes
5948 lines in data.ascii with 64454170 bytes

Then I use the tr command to get rid of newlines so that I have one line in my new.ascii file. Then I go through new.ascii and cut the first two bytes, get 01, and write that to a file, increment and repeat. This works perfectly until I get to byte 16880, at which point the program gets thrown off. Interestingly, in data.ascii there are 16507 bytes in the first line. So somehow I need either a file that has only one line (since using tr to delete '\n' seems to be causing issues) or a file that has 422 bytes on each line, so that the first two bytes of each line correspond to 01, 02, 03, ..., 12, 13.

OK. This is good! You translated EBCDIC bytes to the corresponding ASCII bytes and no bytes were added or lost. But, even though this is an ASCII file, it is not a text file; the <newline> characters are just binary data in your file; not line terminators.

Ouch. No! Don't remove ANY bytes from data.ascii . Those <newline> characters you're seeing in that file are probably the ASCII byte values corresponding to some of the binary packed decimal data bytes in your input.

The data in data.ascii is just a stream of bytes containing the records in your data; there are no record separators in data.ebc nor in data.ascii . In addition to <newline> characters, there are probably also <nul> (all bits 0) bytes that should not appear in a text file. But, we aren't going to treat data.ascii (or data.ebc ) as a text file.

Can you show us a table where the 1st column gives us the 1st two characters of your records (the two bytes that specify the record type), the 2nd column gives us the length in bytes of records of that type (either with or without the two bytes specifying the record type, but tell us whether or not the record size given includes those bytes), and the 3rd column gives us the name of the file to which records of this type should be appended? (Are these output files supposed to be ASCII or EBCDIC? On first read of your requirements, I thought you wanted to feed ASCII data to your C++ converter and then take the output from your C++ converter and translate that back to EBCDIC. Reading your first post again, it isn't clear to me whether the C++ converter wants EBCDIC input or ASCII input.)

After reading through the COBOL, it appears that the meaningful data in a record is of variable length, but then each record has a filler at the end of it so that the total byte count comes to 420, plus the 2 bytes for the record key, for a total of 422 bytes. So when I cut -b 1-2, I get 01. Then when I cut -b 423-424, I get 13.

A small example
Key = 01
Length = 150
Filler = 270

Key = 02
Length = 170
Filler = 250

It seems that the main problem right now is how to get bytes 1-2 or 423-424, since using cut -b 1-2 gives me the first two bytes of each row of the file, which I definitely do not want.
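Maybe dd could do this instead? Just guessing, but reading with 1-byte blocks seems like it would grab an arbitrary byte range without caring about lines, e.g.:

dd if=data.ascii bs=1 skip=422 count=2 2>/dev/null	# bytes 423-424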

Ultimately, the C++ program takes an EBCDIC file to convert it to a csv, based on the COBOL structure of that record. This is the reason I need to break it out into the 13 different files, since there are 13 different record types, each with a different COBOL structuring, requiring a different decode algorithm.

OK. So we probably won't need the entire ASCII file.

Does your C++ converter want 422 byte records (with the type, the data, and the filler), or (using type "01" as an example) 152 byte records (with the type, the data, and no filler), or 150 byte records (just the data; no type and no filler)?

And what file do you want to contain the extracted records for each record type? Is Typexx.ebc where xx is the record type OK?

PS... And, of course, we still need the other record types and lengths.

Now that I am thinking about it a little more, maybe I can alter the C++ program to read specific bytes at a time, and then depending on the record, read xx bytes and convert that record, then move on. That would avoid having to split anything, as I could just feed the main.ebc file through the program.

To answer a couple of your questions, Typexx.ebc is exactly what I am looking for. So the C++ converter would take Typexx.ebc as the input, do the conversion it needs to based on that record, and then spit out a csv file Typexx.csv that I can then use to load into the database.
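So the eventual workflow would look something like this (my converter's actual command line will differ; this is just a placeholder):

for f in Type??.ebc
do	./my_converter "$f" > "${f%.ebc}.csv"	# hypothetical converter invocation
done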

Modifying your C++ program would be a LOT faster, but the following seems to work for a small sample data file I created:

#!/bin/ksh
# Usage: splittype [ EBCDICfile.ebc ]
# The splittype utility shall extract records from the given input file
# (default data.ebc if no operand is given) into files named "Type"xx".ebc"
# (where xx is the record type identified by the 1st two bytes of each
# 422-byte EBCDIC-encoded record in the input file).  The input file pathname
# is assumed to end with the extension ".ebc".  If it doesn't, the results are
# unspecified.  Records found in the input file will be appended to the
# corresponding output files in the current directory.

# If an ASCII version of the input file does not exist in the current directory
# with the same basename as the input file and with the same size as the input
# file with the extension ".ascii", it will be created (or, if it exists with a
# different size, overwritten) before processing starts.
IAm=${0##*/}
ec=0			# Final exit code.
ifEBCDICname=${1:-data.ebc}
ifASCIIname=${ifEBCDICname##*/}
ifASCIIname="${ifASCIIname%.ebc}.ascii"
ofname_prefix="Type"	# Output filenames will be...
ofname_suffix=".ebc"	#	"$ofname_prefix$type$ofname_suffix"

fixlen=422		# Fixed length of records in input file.
spot=0			# # of bytes processed so far from input file.

# Verify that the input file exists...
if [ -f "$ifEBCDICname" ]
then	read junk junk junk junk fsize junk <<-EOF
		$(ls -l "$ifEBCDICname")
	EOF
else	printf '%s: ERROR: File "%s" not found.\n' "$IAm" "$ifEBCDICname" >&2
	exit 1
fi
printf '%s: NOTE: Processing input file "%s" (%d bytes)\n' "$IAm" \
    "$ifEBCDICname" "$fsize" >&2

# Look for ASCII version of input file and create it if needed...
if [ ! -f "$ifASCIIname" ] ||
   [ "$(ls -l "$ifASCIIname" | (read x x x x Afsize x; echo "$Afsize"))" -ne \
	$fsize ]
then	printf '%s: NOTE: Creating ASCII version of "%s"\n' "$IAm" \
	    "$ifEBCDICname" >&2
	if ! dd if="$ifEBCDICname" of="$ifASCIIname" conv=ascii
	then	printf '%s: ERROR: Could not create ASCII file "%s".\n' "$IAm" \
		    "$ifASCIIname" >&2
		exit 2
	fi
fi
while [ $spot -lt $fsize ]
do	type="$(dd if="$ifASCIIname" bs=1 skip=$spot count=2 2>/dev/null)"
	case "$type" in
	(01)	typelen=152;;
	(02)	typelen=172;;
	(*)	printf '%s: Unknown file type ("%s") found at offset %d\n' \
		    "$IAm" "$type" $spot >&2
		spot=$((spot + fixlen))
		ec=3
		continue;;
	esac
	dd if="$ifEBCDICname" bs=1 skip=$spot count=$typelen >> \
	    "$ofname_prefix$type$ofname_suffix" 2>/dev/null
	spot=$((spot + fixlen))
done
exit $ec

Since it invokes dd twice for each record found in your input file, it will be SLOW, but it seems to get the job done. (Of course, you'll have to add the missing record types and assign the correct lengths for the other record types; you've only given us the sizes for record types 01 and 02.) The code above assumes that you want to include the record type and the data (but not the padding from the end of the input records) in the output files.
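If you save the script above as splittype and make it executable, a run might look like this (filenames assumed):

chmod +x splittype
./splittype data.ebc
ls -l Type*.ebc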

Hoping this helps...

This is great Don. I'm going to look through this and make sure I understand each part.

I'll follow up on this after I have had a chance to look through.