Conversion from ASCII to binary for physical simulation code in C/C++

Cybertib · March 23, 2011, 5:59pm

Good evening, everybody

A good math friend told me that it would be possible to shrink the size of the numerical data I produce with a physical simulation code I programmed for my PhD.

It usually writes at least 100 GB to complete the simulation, which seems to be too much. There are quotas to respect, and I have been told that it would be possible to use "binary" data instead of ASCII data.

Here are the inputs of the problem.

I calculate and produce the data using a C/C++ simulation code.
I do the post-processing using bash and awk.
I make plots using GNUplot 4.4 [splot works great with pm3d now. Have a look at the "Not so frequently asked question" website to be aware of GNUplot possibilities]

I found how to use binary mode in Gawk and GNUplot, but the main point is missing: which conversion should I make to decrease the volume of the data in a lossless way? They are all numerical data, so I would convert double-precision numbers coded in ASCII (visible with the "more" command) into "binary" (which I assume to be an abuse of language, because converting a data file into binary using the "od" command just multiplies the size of the initial file...).

Example. Each line of my generated main data file is:
3000 -3.9e-13 -4.24661e-05 0 299.964 300 1.50018e+16 1.50005e+16 0 00 1 0 0

What conversion do you recommend to minimize the space needed?

While writing this post, I realized that I should convert double-precision numbers coded as characters in an ASCII file into double-precision data coded as numbers only. How can I do that? Is the "od" command sufficient?

Glad of any help,

Cheers from France,
Thibault

Corona688 · March 23, 2011, 6:23pm

You could compress the data? You can get 4:1 compression on text easily, and don't have to store the decompressed data on disk to use it. This will cause some more CPU usage though.

$ program_that_spews_gigs_of_data | gzip > data.gz
# Tell the gnuplot script to process "/dev/stdin" or "/proc/self/fd/0" instead
# of a filename
$ gunzip < data.gz | gnuplot file.script

Binary data could be smaller yet, but telling gnuplot how to use it, while possible, may be difficult.

$ gnuplot
> help plot
...
Subtopics available for plot:
    acsplines         axes              bezier            binary
    csplines          cumulative        datafile          errorbars
    errorlines        every             example           frequency
    index             iteration         kdensity          matrix
    parametric        ranges            sbezier           smooth
    special-filenames style             thru              title
    unique            using             with

Subtopic of plot: binary

 The `binary` keyword allows a data file to be binary as opposed to ASCII.
 There are two formats for binary--matrix binary and general binary.  Matrix
 binary is a fixed format in which data appears in a 2D array with an extra
 row and column for coordinate values.  General binary is a flexible format
 for which details about the file must be given at the command line.

 See `binary matrix` or `binary general` for more details.

Subtopics available for plot binary:
    general           matrix

Subtopic of plot binary: matrix

 Gnuplot can read matrix binary files by use of the option `binary` appearing
 without keyword qualifications unique to general binary, i.e., `array`,
 `record`, `format`, or `filetype`.  Other general binary keywords for
 translation should also apply to matrix binary.  (See `binary general` for
 more details.)

 In previous versions, `gnuplot` dynamically detected binary data files.  It
 is now necessary to specify the keyword `binary` directly after the filename.

 Single precision floats are stored in a binary file as follows:

       <N+1>  <y0>   <y1>   <y2>  ...  <yN>
        <x0> <z0,0> <z0,1> <z0,2> ... <z0,N>
        <x1> <z1,0> <z1,1> <z1,2> ... <z1,N>
         :      :      :      :   ...    :
...

As for how to write a binary value in C? Easy as pie. You just write it.

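{
        double close_enough=3.14;
        // "wb": binary mode, so no newline translation is applied
        FILE *fout=fopen("filename", "wb");
        // fwrite copies the variable's raw bytes; nothing is converted to text
        fwrite(&close_enough, sizeof(close_enough), 1, fout);
        fclose(fout);
}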

Cybertib · March 23, 2011, 6:37pm

Thank you for your answer.

        double close_enough=3.14;
        FILE *fout=fopen("filename", "wb");
        fwrite(&close_enough, sizeof(close_enough), 1, fout);
        fclose(fout);

This is almost what I've used to produce the 4 times larger file.
I used fwrite and a buffer instead (to collect all the data to put in a line) as follows

sprintf(buffer, "%lg\t%lg\t%lg\t%lg\t%lg\t%lg\t%lg\t%lg\t%lg\t%lg\t%lg\t%lg\t%lg\t%lg\n",
        fluence, t, x, z, Te(i,j),
        Ti(i,j), eDensity(i,j), hDensity(i,j), Intensity(i), intensity(i,j),
        H(i,j), phase(i,j), angleIncidence, angleRefracted(i,j));
fwrite(buffer, 1, sizeof(buffer), MainOutput);

Do you think it is the same as what you proposed?

Corona688 · March 23, 2011, 6:56pm

No, they're completely different... The variables in a C program start as binary, and one of sprintf's jobs is to convert binary into ASCII. You're converting binary numbers into an ASCII string then writing the ASCII string to file.

My example doesn't convert -- it writes the variable directly, as binary. You could read them in as ASCII with fgets and sscanf, then just write them back out raw as binary.
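To make the difference concrete, here is a minimal sketch (the helper names roundtrip_text and roundtrip_binary are made up for illustration). It sends a double through text and through its raw bytes and returns what comes back. It assumes IEEE 754 doubles, for which 17 significant digits are enough to recover the exact value; "%lg" with no precision prints only 6.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print `value` with `digits` significant digits, then parse it back.
   digits = 6 matches what a bare "%lg" produces. */
double roundtrip_text(double value, int digits)
{
    char buffer[64];

    assert(digits > 0 && digits <= 40);
    snprintf(buffer, sizeof buffer, "%.*g", digits, value);
    return strtod(buffer, NULL);
}

/* Copy the raw bytes out and back in, which is all fwrite/fread do. */
double roundtrip_binary(double value)
{
    unsigned char bytes[sizeof(double)];
    double back;

    memcpy(bytes, &value, sizeof value);
    memcpy(&back, bytes, sizeof back);
    return back;
}
```

With six digits, 1.0/3.0 comes back as 0.333333, which is a different double. With 17 digits, or with the raw bytes, the original value is returned unchanged. So binary output is both smaller (8 bytes per value) and lossless, while short text output is neither.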

Cybertib · March 23, 2011, 7:17pm

Ok then, that was a dummy question, and not an expert one...
Thank you for everything!

---------- Post updated at 12:17 AM ---------- Previous update was at 12:14 AM ----------

What about converting the old files into binary?

The "od" command seems good, but after several tests, the conversion doesn't compress the data as expected. Your gzip idea is interesting for staying under quotas, but the data needs to be extracted into ASCII form before being processed with gnuplot. Using binary might be much faster, according to the gnuplot documentation.

Idea?

Thank you again

Thibault

Corona688 · March 23, 2011, 8:17pm

Like I said: read them in with fgets and sscanf, write them back out as binary with fwrite(). To make an example that works I'll need to see what your data looks like.
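In the meantime, a rough sketch of that conversion, under stated assumptions: every line holds at least `columns` whitespace-separated numbers, no line is longer than 4095 characters, and strtod is swapped in for sscanf so the column count can be a parameter. The function name text_to_binary and the MAX_COLUMNS limit are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_COLUMNS 64   /* assumed upper bound on values per line */

/* Convert whitespace-separated text records from `in` into raw doubles
   on `out`. Lines holding fewer than `columns` numbers are skipped;
   extra values on a line are ignored.
   Returns the number of records written, or -1 on a write error. */
long text_to_binary(FILE *in, FILE *out, int columns)
{
    char line[4096];
    double row[MAX_COLUMNS];
    long records = 0;

    assert(columns > 0 && columns <= MAX_COLUMNS);

    while (fgets(line, sizeof line, in) != NULL) {
        char *cursor = line;
        int count = 0;

        while (count < columns) {
            char *end;
            double value = strtod(cursor, &end);
            if (end == cursor)      /* nothing numeric left on this line */
                break;
            row[count++] = value;
            cursor = end;
        }
        if (count != columns)
            continue;               /* short or malformed line */

        if (fwrite(row, sizeof(double), (size_t)columns, out)
                != (size_t)columns)
            return -1;
        records++;
    }
    return records;
}
```

Wrapped in a main() that passes stdin and stdout, this could sit in a pipeline over the old files. Each record then takes columns * 8 bytes regardless of how many digits the text had.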

No doubt od didn't shrink anything: it does the precise opposite, dumping binary files in a variety of ASCII forms.

Binary could well be faster. It also means that if you make a mistake in your C program, you've !$@^ed up 100 gigs of data faster than you ever could before.

It's just occurred to me that doing it in double-precision is pointless anyway; you've already processed it with single-precision awk before this point.

The gnuplot "matrix" format is out. It stores everything as single-precision floating point numbers, even the number of rows, which means you get about 16 million rows max (a 32-bit float holds whole numbers exactly only up to 2^24) before the count can no longer be stored exactly and ends up slightly too small or too large.
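To check where a float stops counting exactly, here is a small sketch. It assumes IEEE 754 single precision (24-bit significand), for which whole numbers survive a round trip through float up to 2^24 = 16,777,216. The helper name float_holds_exactly is made up for illustration.

```c
#include <assert.h>

/* Returns 1 if the whole number n is unchanged after being stored
   in a 32-bit float, 0 otherwise. */
int float_holds_exactly(long n)
{
    volatile float f;   /* volatile forces a real single-precision store */

    /* Keep n small enough that converting back to long cannot overflow. */
    assert(n >= 0 && n <= 1000000000L);

    f = (float)n;
    return (long)f == n;
}
```

Under that assumption, 16777216 is held exactly and 16777217 is the first count that is not; it gets rounded to a neighbouring value, which is the "slightly too few or too many" effect.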

Their documentation in this area seems especially impenetrable. I'm working on something.

---------- Post updated at 06:17 PM ---------- Previous update was at 06:07 PM ----------

// Generate a raw binary file for gnuplot to work with.
// compile with -lm

#include <stdio.h>
#include <math.h>

int main(void)
{
        int n, points=100;
        FILE *fp=fopen("sin.bin", "wb");
        if(fp == NULL) { perror("sin.bin"); return(1); }

        for(n=0; n<points; n++)
        {
                float v[3]= { (2*3.14159*n)/(points-1) };
                v[1]=sin(v[0]);         v[2]=cos(v[0]);
                // writes three floats in a row.  x, sin(x), cos(x)
                // arguments: pointer, element size, element count, stream
                fwrite(v, sizeof(float), 3, fp);
        }
        fclose(fp);
        return(0);
}

$ gcc graph.c -lm
$ ./a.out
$ ls -l sin.bin
-rw-r--r-- 1 monttyle monttyle 1200 Mar 23 18:15 sin.bin
$ gnuplot
> plot "sin.bin" binary format='%f%f%f' using 1:2;
(pops up a picture of a sine wave)
>

How to cram that into a surface plot or whatever I'm not sure but it's something to work from.

For doubles, use %lf.

Cybertib · March 24, 2011, 2:32pm

Hello again!
Thanks for your sin.bin example. It helped me a lot and saved me time on this part!

Then I succeeded in using GNUplot with the binary data I generate with my simulation code using

double buffer[]={x, y, Z(i,j)}; 
fwrite(buffer, sizeof(double), sizeof(buffer), Mesh);

which successfully:

produces smaller, binary data
and makes it possible to plot with gnuplot using

gnuplot> plot "Mesh0.dat" binary format='%3lf' using 2:3 w d

But now comes the last point: how do I select my data using awk when the data is binary?
With ASCII data, I had to write this line in gnuplot

gnuplot> plot "< awk '{if($1==0) print }' Mesh0.dat" using 2:3 with dots

But what about filtering binary data with awk?

After a few hours (!!) of wandering around the gnu.org awk guide, I found some tricks to make conversions

invoke awk --use-non-decimal-numbers
use the following script: The GNU Awk User’s Guide#Ordinal-Functions or #Bitwise-Functions [I have not posted 5 posts yet, so I can't send the full link, sorry]

Quite hard! However, everybody uses binary data in good simulation codes... How do they manage data filtering?

Once this works, I think I will turn to perl to avoid such a mess.
PS: we can run any program from gnuplot using the "<" operator in the plot command, so if nothing is possible with binary in awk, we can add a step with "od" or perl. This stuff is new for me!

Thanks a lot,
Thibault

---------- Post updated at 07:32 PM ---------- Previous update was at 07:27 PM ----------

Other methods here

how to read binary data file? in Awk
How do you use binary conversion in python/bash/awk

Corona688 · March 25, 2011, 7:45pm

I don't think you can. You could write a C utility to do that:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// compile with -lm
int main(void)
{
        size_t columns=6;
        float *f=malloc(sizeof(float)*columns);
        float comp=0.0f;

        if(f == NULL) return(1);

        // Read binary data piped into stdin
        while(fread(f, sizeof(float), columns, stdin) == columns)
        {
                // Never compare floating point with == unless you know
                // you're really going to get the exact value right
                // down to the last bit.  Choose a value sufficiently close
                // to equal instead.
                // fabsf, not abs: abs() takes an int and would truncate
                // the difference before the comparison.
                if(fabsf(f[0]-comp) < 0.000001f)
                        // If the row matches, write it back to standard out.
                        fwrite(f, sizeof(float), columns, stdout);
        }
        free(f);
        return(0);
}

For that matter, cutting awk out entirely could let you do everything with full, 64-bit double-precision variables.
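If keeping awk in the pipeline is preferable, the "add a step" idea from the PS above could be sketched as a small filter that prints each binary record as one text line. This is only an illustration: the name binary_to_text and the MAX_COLUMNS limit are made up, and it assumes records of raw doubles written on the same machine that reads them.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_COLUMNS 64   /* assumed upper bound on values per record */

/* Print each record of `columns` raw doubles from `in` as one
   space-separated text line on `out`. A trailing partial record is
   dropped. Returns the number of records printed. */
long binary_to_text(FILE *in, FILE *out, int columns)
{
    double row[MAX_COLUMNS];
    long records = 0;

    assert(columns > 0 && columns <= MAX_COLUMNS);

    while (fread(row, sizeof(double), (size_t)columns, in)
               == (size_t)columns) {
        for (int i = 0; i < columns; i++)
            /* 17 significant digits keep the text form lossless */
            fprintf(out, i ? " %.17g" : "%.17g", row[i]);
        fputc('\n', out);
        records++;
    }
    return records;
}
```

Built into a program (say a hypothetical ./bin2txt whose main() calls binary_to_text(stdin, stdout, 3)), it could feed the existing awk filter: plot "< ./bin2txt < Mesh0.dat | awk '{if($1==0) print }'" using 2:3 with dots. The data stays binary on disk and only becomes text inside the pipe.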

---------- Post updated 03-25-11 at 05:45 PM ---------- Previous update was 03-24-11 at 10:07 PM ----------

I just realized something here:

double buffer[]={x, y, Z(i,j)}; 
fwrite(buffer, sizeof(double), sizeof(buffer), Mesh);

I think you're writing too much data. sizeof(buffer) doesn't give you the number of elements, it gives you the size in bytes! That should be

fwrite(buffer, sizeof(double), sizeof(buffer)/sizeof(double), Mesh);

Or just "3".
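One way to avoid repeating that mistake is to let a macro work out the element count. A minimal sketch, where ARRAY_LEN and write_point are names invented here and the macro only works on true arrays, not on pointers:

```c
#include <assert.h>
#include <stdio.h>

/* Number of elements in a true array (not a pointer). */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Write one (x, y, z) record as three raw doubles.
   Returns 1 on success, 0 on a short write. */
int write_point(FILE *out, double x, double y, double z)
{
    double buffer[] = { x, y, z };

    /* sizeof(buffer) is the size in bytes (typically 24), not 3 */
    assert(ARRAY_LEN(buffer) == 3);

    return fwrite(buffer, sizeof buffer[0], ARRAY_LEN(buffer), out)
           == ARRAY_LEN(buffer);
}
```

Each call then adds exactly 3 * sizeof(double) bytes to the file, which is what a gnuplot format of three doubles per record expects.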