Reading a particular line from a .txt file

Hi,
I have a .txt file which contains the x, y and z coordinates of particles that I am trying to cast for a particular compound. The number of particles is of the order of 2 billion, so the text file is a few gigabytes in size. The particles have been cast layer-wise: if there are 15000 layers, each layer holds approx. 2 billion/15000 particles, and all particles in a layer have the same Y coordinate. Now I need to read the particles at a given value of Y (say y = 10). I wrote a small program that used fin.seekg(). However, I realized that seekg() positions the stream by character (byte) offset, not by line. Could someone please tell me how I could start reading from a particular line in the file using a simple C++ program?

When the lines have a fixed record length, you can call seekg() in C++ or fseek() in C to get to a known line position.
This is C:

void seek_to_line(FILE *in, const long recl, const long lineno)
{
     fseek(in, lineno * recl, SEEK_SET);
}

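This places the file pointer at the beginning of line lineno (counting from 0), assuming recl is fixed and includes the newline. On platforms where long is 32 bits, offsets past 2 GB need fseeko() with an off_t offset instead.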
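
Since the question asked for C++, here is a minimal sketch of the same idea with seekg(). The function name read_fixed_line and its parameters are made up for illustration; it assumes every line occupies exactly record_len bytes including the newline, and it counts lines from 0.

```cpp
#include <cassert>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this on in-memory data
#include <string>

// Position `in` at the start of line `line_index` (0-based) and read that line.
// ASSUMPTION: every line occupies exactly `record_len` bytes, newline included.
// Returns false if the seek or the read fails (e.g. the line is past the end).
bool read_fixed_line(std::istream& in, std::streamoff record_len,
                     std::streamoff line_index, std::string& out)
{
    assert(record_len > 0 && line_index >= 0);
    in.clear();  // drop any earlier EOF/fail state so seekg() can work
    in.seekg(line_index * record_len, std::ios::beg);
    return static_cast<bool>(std::getline(in, out));
}
```

For a real file you would pass a std::ifstream, preferably opened with std::ios::binary so that byte offsets are not affected by newline translation.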

Otherwise you can try to optimize I/O (see Stevens' Advanced Programming in the UNIX Environment) by increasing the buffer size:

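FILE *in = fopen("somefile", "r");
int lineno = 250002;
char tmp[256] = {0x0};
char buf[16384] = {0x0};

/* argument order is (stream, buffer, mode, size) */
setvbuf(in, buf, _IOFBF, sizeof(buf));
/* skip lineno - 1 lines so that the next read returns line lineno */
while (--lineno)
   fgets(tmp, sizeof(tmp), in);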

You have to call setvbuf() BEFORE any I/O on the stream.

That's only going to work if you can guarantee each line fits into your buffer (tmp above); a longer line takes more than one fgets() call and throws off the count.

If you can do that, great. If not, you pretty much have to count newline characters. The mmap() code below does that, and it should be fairly fast, since you'd be relying on the OS to page in the data. If it's a really big file and you know you're only going through it once, it may be faster to use open() and read() with direct I/O set so you bypass the page cache (if you're only looking once at each byte of several gigabytes of data, any caching is wasted cycles). The mmap() version:

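#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>

struct stat sb;
int fd = open( filename, O_RDONLY );
fstat( fd, &sb );
// map the whole file read-only; the cast is required in C++
char *ptr = (char *) mmap( NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );
uint64_t offset;
uint64_t line_count = 0;
uint64_t desired_line = 123455;  // number of newlines to skip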

for ( offset = 0; offset < sb.st_size; offset++ )
{
    if ( '\n' == ptr[ offset ] )
    {
        line_count++;

        if ( line_count == desired_line )
        {
            break;
        }
    }
}

// if offset is less than file size, the desired line was found
if ( offset < sb.st_size )
{
    // line starts at offset + 1
    offset++;
        .
        .
        .
}

Note that this has no error checking.
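
For comparison, here is a rough portable C++ sketch of the same newline counting, reading the file in large chunks instead of mapping it. It goes through an ordinary std::istream, so it does not bypass the page cache the way direct I/O would. The name find_line_offset, the 1 MiB default chunk size, and the 0-based line numbering are assumptions made for this example.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this on in-memory data
#include <vector>

// Return the byte offset at which line `line_index` (0-based) starts,
// or -1 if the stream contains fewer than `line_index` newlines.
// If the data ends with '\n', the offset for the line after the last
// one equals the stream size (an empty line start at end of file).
std::streamoff find_line_offset(std::istream& in, std::streamoff line_index,
                                std::size_t chunk_size = 1 << 20)
{
    if (line_index == 0) return 0;
    std::vector<char> chunk(chunk_size);
    std::streamoff newlines = 0;
    std::streamoff base = 0;  // stream offset of chunk[0]
    in.clear();
    in.seekg(0, std::ios::beg);
    while (true) {
        in.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
        const std::streamsize got = in.gcount();  // bytes actually read
        if (got <= 0) break;
        for (std::streamsize i = 0; i < got; ++i) {
            if (chunk[static_cast<std::size_t>(i)] == '\n' &&
                ++newlines == line_index) {
                return base + i + 1;  // the line starts right after the newline
            }
        }
        base += got;
    }
    return -1;
}
```

After a successful call, clear() the stream and seekg() to the returned offset to start reading at that line.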

Assuming your file is laid out like this:

X Y Z
-5.55 4.44 6.5
10.66 44.5 85.99
.....
......
.....
The values are separated by whitespace(s).

A simple awk one-liner will get rid of your messy C++ code.

 awk -v y=  '$2==y{print "x="$1,"z="$3}' file.txt

Substitute the value of y, e.g. awk -v y=4.44 '.........'
Output from the above: x=-5.55 z=6.5
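
If the filtering does have to stay in C++, the same selection can be sketched roughly as follows. The function name select_layer and the tolerance parameter are made up for this example, and it assumes the whitespace-separated three-column layout shown above.

```cpp
#include <cmath>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Collect (x, z) for every line whose second column equals `y_wanted`
// to within `tol`. Lines that do not parse as three numbers (such as an
// "X Y Z" header) are skipped.
std::vector<std::pair<double, double>>
select_layer(std::istream& in, double y_wanted, double tol = 1e-9)
{
    std::vector<std::pair<double, double>> hits;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        double x, y, z;
        if (!(fields >> x >> y >> z)) continue;  // header or malformed line
        if (std::fabs(y - y_wanted) <= tol) hits.emplace_back(x, z);
    }
    return hits;
}
```

Like the awk version, this scans the whole file. Since the file is written layer by layer, one could stop reading once the run of matching lines ends, but this sketch does not do that.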

Hope this helps,
Regards,
Gaurav

Thanks Guys!
Problem solved :)

How did you solve it? Please share it with us; we would like to know.

Regards,
Gaurav.