Parsing large files in Solaris 11

I have a 1.2G file that contains no newline characters. This is essentially a log file with each entry being exactly 78 bits long. The basic format is /DATE/USER/MISC/. The single uniform thing about the file is that the 8th character is always ":".

I worked with smaller files of the same data before using the following command

 ggrep -E -o ".{0,8}\:.{0,67}" LOG.txt

but the problem with this particular file is its size. At 1.2G, ggrep runs out of memory:

ggrep: memory exhausted

I am looking for a way to break up the file or get around the memory limits.
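
One hedged sketch of the break-it-up route, assuming the records turn out to be a fixed 78 bytes each (see the bits-vs-bytes discussion below) and that your split supports -b: split the file on a multiple of the record size so no entry is cut in half, then run the same ggrep over each piece.

 # 780000 = 10000 x 78, so every chunk begins on a record boundary.
 # LOG.txt and the chunk_ prefix are placeholder names.
 split -b 780000 LOG.txt chunk_
 for f in chunk_*; do ggrep -E -o ".{0,8}\:.{0,67}" "$f"; done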

Having an entry that is 78 bits long yet contains characters is very strange. Most files are a stream of 8-bit bytes. So, to split your entries (each of which would be 9.75 bytes) into 11-byte lines (your 9.75 bytes per entry, plus 2 bits of padding to reach a whole byte, plus a newline so the output is a text file), you're probably going to find writing a C program to read bytes and rotate bits into the proper positions easier than doing it in a shell script.

What two bits should be added to your entries to produce 10 characters (assuming ASCII or EBCDIC) from your input entries?

If your entries are all 78 bits long, why is your grep looking for a varying number of characters before and after the colon, and why does the string it matches vary from 1 to 76 characters (not bits or bytes), inclusive, instead of the 78 bits you specified?

Please show us the first 200 bytes of your input file piped through the command:

od -bcx
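
For example (assuming the file is named LOG.txt), one way to grab just the first 200 bytes on Solaris would be:

 # Read a single 200-byte block and dump it; dd's statistics go to stderr.
 dd if=LOG.txt bs=200 count=1 2>/dev/null | od -bcx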

Looking at the example, I think the OP meant 78 bytes. :)

If fold exists on your system, would

fold -w78 file

work?
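
If it does, a rough sketch of how it could feed the original search (assuming 78-byte records of plain single-byte characters; the pattern here is just a placeholder):

 # Wrap the stream into 78-character lines, one record per line,
 # then filter as usual now that the input is ordinary text lines.
 fold -w78 LOG.txt | ggrep 'pattern'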

If the records are all of a fixed size, dd can be used to insert a newline after each one. An example with 4-byte fixed-size records:

# bs is the record size minus 1, cbs is the record size.
$ printf "AAA:BBB:CCC:DDD:" | dd bs=3 cbs=4 conv=unblock

AAA:
BBB:
CCC:
DDD:

$

dd is unaffected by line length limitations. You could chain this before an awk or grep or what have you.

dd if=filename ... | grep whatever
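
For the file in question, that might look something like this (assuming the records really are 78 bytes each; the pattern is a placeholder):

 # cbs=78 makes dd append a newline after every 78-byte record.
 # Note that conv=unblock also strips trailing spaces from each record.
 dd if=LOG.txt cbs=78 conv=unblock 2>/dev/null | ggrep 'pattern'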

I assume you meant bs=4 instead of bs=3, but when processing a 1.2GB file, dd will run noticeably faster with its default block size (512 bytes) or a larger size like bs=1024000. The dd bs=n parameter specifies how many bytes dd will read at a time from its input file and how many bytes at a time it will write to its output file.

With conv=unblock, it is just the conversion buffer size (specified by cbs=n) that determines the output line length produced by the dd utility.
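
A quick way to see that, reusing the earlier toy example: the output lines are identical whatever bs is set to; only the number of reads and writes changes.

 # Same 4-byte records, different read/write sizes, same output.
 printf "AAA:BBB:CCC:DDD:" | dd bs=512 cbs=4 conv=unblock 2>/dev/null
 printf "AAA:BBB:CCC:DDD:" | dd bs=3 cbs=4 conv=unblock 2>/dev/null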

No, I meant bs=3. That is what it seemed to require from empirical testing.

You are correct. It appeared to require it but that was my mistake (probably from still using the sync option at the time).

You might even do bs=4M.

Although I presume the system cache and the filesystem block size will soften the blow, in the sense that this should usually translate into I/O sizes equal to the filesystem block size or multiples thereof, depending on how smart the filesystem is...

He's got a point though. The system can only soften the blow of 128 system calls vs. 1 so much. Try running dd with a bs of 1; it's slow. I only suggested a tiny block size since it seemed necessary, which was my mistake too, from leaving sync in the conv options.
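
A rough way to see that overhead for yourself (the file name and counts are just illustrative; results vary by system):

 # Same megabyte of data: ~1,000,000 one-byte reads vs. a single read.
 time dd if=LOG.txt of=/dev/null bs=1 count=1000000
 time dd if=LOG.txt of=/dev/null bs=1000000 count=1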