Linux shell script to insert new lines based on delimiter count

The input file is a .dat file delimited by null characters (^@ in Linux). On a Windows PC it looks something like this (numbers are masked with 1).

The entire file is in one row, but it has multiple records: each record contains 80 fields, i.e. there are 81 counts of the delimiter (null, or ^@). After all of the records there is a trailer of 20 fields delimited by ^@.

Please suggest an sh script to split the file into multiple rows, after every 81st occurrence of the delimiter ^@.

The output should be like

record 1, comprising 81 counts of ^@
\n record 2, comprising 81 counts of ^@
\n ...
\n trailer record (we need not count the trailer; after the last record's 81 counts it will remain as is)
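
For concreteness, here is a miniature analogue of the layout with three NUL-terminated fields per record (the real records have many more; the file name and field values are made up):

# two records plus a trailer, three NUL-terminated fields each
printf 'a1\0a2\0a3\0b1\0b2\0b3\0t1\0t2\0t3\0' > mini.dat
# desired output, one record per line:
#   a1^@a2^@a3^@
#   b1^@b2^@b3^@
#   t1^@t2^@t3^@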

split -C might not be the perfect fit, but it is something to look into. Or, try:

sed 's/./&\n/480;s/./&\n/400;s/./&\n/320;s/./&\n/240;s/./&\n/160;s/./&\n/80;' file
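
For reference, the numeric flag on sed's s command replaces only the Nth match of the pattern, and here the pattern is . (any single character), so each of those six commands splits at a fixed character position rather than at a delimiter. The offsets run from largest to smallest so that each inserted \n does not shift the positions still to be split. A quick illustration with GNU sed:

printf 'abcdef\n' | sed 's/./&\n/3'
# prints:
#   abc
#   def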

Does that code start splitting from the 480th position, i.e. the 6th record, and work backwards? Apologies for the stupid question, but where is it matching the delimiter?

The file can be of any length, i.e. have any number of records. The records in the file are also of variable length, so the only way to identify the end of a record is that it ends after each 81st occurrence of the "^@" delimiter.

I.e., from the start of the file to the 81st delimiter is one record, from the 82nd to the 162nd is the second record, and so on.

The records are currently all in one line and they need to be on multiple lines, i.e. separated by \n.

Is there a way to do the 81st-delimiter check till the end of the file?
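
One way to sanity-check that count before splitting anything is to count the NUL bytes directly (file.dat stands in for the real file name):

# delete every byte that is not a NUL, then count what is left
tr -cd '\0' < file.dat | wc -c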

Looks like I've misread/misinterpreted your specification. Sorry for that.
So, in your picture, we're seeing many empty fields (e.g. 15 after the first), 81 fields make up a record of unpredictable length, and there's no <NL> (\n, ^J, 0x0A) char in it.
Are other non-printable control characters possible, like <TAB>s? Or are all field contents printable alphanumeric characters?
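
One rough way to check that yourself (assuming the file otherwise matches the description) is to turn the NULs into newlines and look for any remaining control characters:

# any output lines flag fields holding TABs or other control chars
tr '\0' '\n' < file.dat | LC_ALL=C grep -n '[[:cntrl:]]'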

Please note that it is far better to post or attach a sample input file, be it abbreviated, to work (and test) upon, than to show a picture.


Apologies, my bad. I should've uploaded the file. Attached is a masked .dat file renamed as .txt for uploading.

On opening it with Notepad++ on Windows, the null characters show up as boxes; in vi on Linux the nulls show as ^@.

All the records in this file are in one row. This particular file has 2 records followed by the trailer record.

  • First record = starts at the beginning of the file with 00000230 (this field gives the length of the record in bytes)
  • Second record = starts at the next 00000230 (it is a coincidence that both records here have the same length)
  • Trailer record = starts at 0000096 (the trailer is 96 bytes long and it also has 80 ^@ (null) delimiters; ignore my earlier post saying the trailer has 20 delimiters, it actually has 80)

As the field lengths are variable, we cannot define a record in terms of the total length of its fields or total bytes. This is why we are defining a record as effectively spanning 80 ^@ delimiters.

I require the 1st record in one row, the 2nd record in the next row, and so on till the end of the file, with the trailer in the last row. If there is a way of adding a newline after every 80th ^@ from the beginning till the eof, then perhaps it will work?

The only unprintable character is the null ^@; there are no TABs or other control characters, and all other characters are alphanumeric.

Please let me know if you have any questions. Thanks for the help.

The *nix text utilities are NOT the best tools to deal with binary files like the one that you posted. Would this come close to what you need?

# GNU sed: \o000 matches a NUL byte; insert a newline after the
# 81st and the 162nd occurrence
sed 's/\o000/&\n/162;s/\o000/&\n/81' /tmp/MED_BIL_accmasked.DAT.txt
0000023000353123456789272050123456789100UNKNOWN00353123456789101710511-05-2016 01:01:03ABC MEDIATION50100353830000002===EABCTwE2E2k+GFLBBE35383000000200000230
00353123456789272050123456789100UNKNOWN00353123456789101710711-05-2016 01:01:10ABC MEDIATION20100353830000002===AAAAAAAIeQmVGFLBBA35383000000200000096
30000502

Use e.g. od -bc to verify the result.


Apologies for the delayed response. While the code here would work, that is only because we know this file has 2 records, at the 80th and 160th delimiter positions; actual files would contain thousands or tens of thousands of records. So is there a way to cut or grep each record (from the beginning of the file, counting 80 ^@ delimiters), move it to a temp file, add a \n, then add the second record, and so on, till the end of the file? The end result should be a file with each record in its own row.

Appreciate all your assistance. Thanks.
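
For what it's worth, the record-at-a-time loop described above can be written directly in bash (not plain sh) using read -d '' to read NUL-terminated fields. This is only a sketch: it assumes every field, including the last, ends with a NUL, and it will be far slower than a streaming tool for thousands of records (see the awk pipeline in the next reply):

#!/bin/bash
# re-emit each NUL-terminated field, adding \n after every 80th one
in=MED_BIL_accmasked.DAT
out=MED_BIL_withNL.DAT
n=0
: > "$out"                          # start with an empty output file
while IFS= read -r -d '' field; do
    printf '%s\0' "$field" >> "$out"
    n=$((n + 1))
    if (( n % 80 == 0 )); then
        printf '\n' >> "$out"
    fi
done < "$in"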

As long as the NUL bytes are field terminators (not field separators) and there are 80 fields per record, the following will add a <newline> character at the end of each input record in the file MED_BIL_accmasked.DAT and store the results in a file named MED_BIL_withNL.DAT:

tr '\0' '\r' < MED_BIL_accmasked.DAT |	# map NULs to CRs so awk can handle them
awk '					# (assumes the data itself contains no CRs)
BEGIN {	RS = ORS = "\r"			# treat each NUL-terminated field as one awk record
}
1					# copy every field (plus its terminator) to the output
!(NR % 80) {	printf("\n")		# add a newline after every 80th field
}' |
tr '\r' '\0' > MED_BIL_withNL.DAT	# map the CRs back to NULs

I have tested the above code on a macOS system (using a BSD version of awk) and it does what I understand your requirements to be. (But, since you didn't post your desired output for your sample input, I can't be sure.)

If you want to try the above code on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.

If you are using gawk, it might let you get by with just:

gawk '
BEGIN {	RS = ORS = "\0"			# read and write NUL as the record separator directly
}
1					# copy every field to the output
!(NR % 80) {	printf("\n")		# add a newline after every 80th field
}' MED_BIL_accmasked.DAT > MED_BIL_withNL.DAT

but I haven't tested this with gawk and do not know whether it will accept a NUL character as an input or output record separator.
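
A quick way to probe that before running it on real data is a three-field test record; it should print 3 if gawk honors a NUL record separator:

printf 'a\0b\0c\0' | gawk 'BEGIN { RS = "\0" } END { print NR }'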


Hi Don,

Thanks, and sorry for not responding earlier. I've tried this out on our test Linux box (running Red Hat) and it's outputting a blank file.

I'll try and give you a properly formatted expected output file tonight.

What do you mean by field separators vs. field terminators, please? In our case each record has 80 fields separated by 80 NULL (^@) characters.

In a CSV formatted file, you have a character that is a field separator and a record is terminated by a <newline> character. In a CSV record with 80 field separators, there are 81 fields. If your file has field separators (instead of field terminators), there is no way to know where the last field of one record ends and the 1st field of the next record starts without counting bytes based on the number that you say is in the 1st field of each record. If you have field terminators (instead of field separators), we just have to add a <newline> after the last field terminator and don't ever have to look at the first field and count bytes to find the end of a record.
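
To make the distinction concrete, here are two tiny made-up examples:

# field TERMINATORS: every field, including the last, ends with a NUL
printf 'a\0b\0c\0'    # three fields, three NULs
# field SEPARATORS: NULs appear only between fields
printf 'a\0b\0c'      # three fields, two NULs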

From the sample you uploaded before, it appears that the NUL characters are field terminators, that there are 80 NUL-terminated fields per record, and that the sample contained three records. The script I suggested in post #8 inserts a <newline> character into the output after finding 80 field terminators.
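
One way to verify the result without depending on any particular awk's NUL handling is to map the NULs to a printable placeholder first (assuming # never occurs in the data):

# every data line should then report exactly 80 delimiters
tr '\0' '#' < MED_BIL_withNL.DAT | awk -F '#' '{ print NR ": " NF - 1 }'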