File reading - performance improvement

Hi All,
I am reading a huge file, at least 2 GB in size. I read each line, cut certain columns out of it, and write them to another file.

Here is the logic.

#include <iostream>
#include <fstream>
#include <string>
#include <cstdlib>

using namespace std;

int main()
{
    string u_line;
    string Char_List;
    string u_file;
    string temp_form_u_file;
    string::size_type line_pos;

    // environment variables name the input and output files
    const char *env_in  = getenv("u_file");
    const char *env_out = getenv("DATA_DIR");
    if (env_in == NULL || env_out == NULL)
    {
        cout << "u_file or DATA_DIR not set" << endl;
        exit(2);
    }
    u_file = env_in;
    temp_form_u_file = env_out;

    ofstream temp_u_file;
    temp_u_file.open(temp_form_u_file.c_str(), ios::app);
    if (temp_u_file.fail())
    {
        cout << "Unable to open file " << temp_form_u_file << " for writing" << endl;
        exit(1);
    }

    ifstream U_File;
    U_File.open(u_file.c_str());
    if (U_File.fail())
    {
        cout << "File " << u_file << " unable to open for reading\n";
        cout << "dart_report job failed\n";
        exit(3);
    }

    // testing eof() before reading processes the last line twice;
    // test the result of getline() instead
    while (getline(U_File, u_line))
    {
        if (u_line.empty())
            continue;

        // records are 41 characters wide, starting at column 72
        line_pos = 72;
        while (line_pos < u_line.length())
        {
            if (u_line.substr(line_pos, 2) != "  ")
            {
                Char_List = u_line.substr(line_pos, 41);
                Char_List.append(u_line.substr(16, 4));
                Char_List.append("\n");
                temp_u_file << Char_List;
            }
            line_pos = line_pos + 41;
        }
    }
}

When I run this program it takes 2.5 to 3 hours to read the 2 GB file. I am trying to reduce the time taken for reading. Is there any way I can reduce the processing time of the program?

Kindly let me know. If I can use a shell script, that is also okay, but I feel C will be faster than shell scripting.

Please give me your suggestions.

Regards
Dhana

I believe a shell script should be faster. With C/C++ there is a lot of copying of data to and from the kernel, which makes C/C++ programs slow. To make C/C++ programs faster, you may also use multithreading.

  • Dheeraj

Hi,
I would suggest using fread, that is, reading data in bulk (say thousands of bytes at a time) and then manipulating it. You will surely get a performance improvement.
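
A minimal sketch of what I mean (the file name and the 64 KB chunk size are just examples to experiment with):

#include <stdio.h>

/* sketch: pull a whole chunk into memory in one call instead of
   one stream operation per line */
int main(void)
{
    static char chunk[65536];          /* 64 KB at a time */
    size_t n;
    FILE *in = fopen("input.dat", "rb");

    if (in == NULL)
    {
        perror("fopen");
        return 1;
    }
    while ((n = fread(chunk, 1, sizeof(chunk), in)) > 0)
    {
        /* ... scan the n bytes in chunk[] here ... */
    }
    fclose(in);
    return 0;
}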

Definitely C/C++ is faster than a shell script.
Can you explain how fread would be faster? I am going to read line by line only.

Regards
Kuttalaraj

I think read and write are the lowest-level system calls. All the other functions like fread and fwrite again use some low-level function to do their work.
I think using read to read a chunk of data can improve the performance, since there is not much overhead involved.
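
For example, a rough chunked-read loop (the helper name and the 4096 chunk size are my own choices):

#include <fcntl.h>
#include <unistd.h>

/* sketch: read the file in 4 KB chunks with the low-level call */
void drain_file(const char *path)
{
    char buf[4096];
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd == -1)
        return;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
    {
        /* ... process the n bytes in buf[] ... */
    }
    close(fd);
}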

Regards,
Aamir

Hi,
read(fd, buffer, n_to_read)
I am trying to use the above call, but I will not be able to read an entire line, as I will not know the length of the line beforehand.

This part is a little tricky to handle.
If you have any idea, please let me know.

Regards
Dhana

Hello!
What you can try out is: have a big circular buffer, for example around 6144 bytes (6 KB); you can experiment with the size!
What I mean by circular is: have two pointers, start_ptr and processed_ptr.

#define MAX_BUFFER_SIZE 6144                 /* 6 KB; experiment with it */
#define HALF_SIZE       (MAX_BUFFER_SIZE / 2)

char   buffer[MAX_BUFFER_SIZE];
size_t offset        = 0;    /* where the next read() lands  */
size_t start_ptr     = 0;    /* end of the valid data        */
size_t processed_ptr = 0;    /* first byte not yet processed */
ssize_t nread;
size_t  bytes_read;
int     ret_val;

/* fill one half of the buffer; the other half may still hold the
   tail of an unfinished record from the previous read */
nread = read(fd, &buffer[offset], HALF_SIZE);

start_ptr = offset + nread;                  /* end of valid data              */
offset    = (offset == 0) ? HALF_SIZE : 0;   /* next read fills the other half */

/* number of unprocessed bytes, allowing for wrap-around */
bytes_read = (start_ptr + MAX_BUFFER_SIZE - processed_ptr) % MAX_BUFFER_SIZE;

/* start processing it */
while (bytes_read >= minimum_size_of_record)
{
    ret_val = check_for_complete_record(&buffer[processed_ptr]);
    if (ret_val == -1)
    {
        /* incomplete record: leave processed_ptr alone, the next
           read will complete it */
        break;
    }
    else
    {
        /* check_for_complete_record returns the size of the record */
        bytes_read    = bytes_read - ret_val;
        processed_ptr = processed_ptr + ret_val;
    }
}

1) Keep offset, start_ptr and processed_ptr as globals (or pass them along)
2) You must take care of the rollover of processed_ptr after every read:

     if (processed_ptr >= MAX_BUFFER_SIZE)   /* in this case 6 KB */
             processed_ptr -= MAX_BUFFER_SIZE;

Regards,
Aamir

Hi,
This helps.
But a concern here is that I need to put a while loop in place that scans through the bulk of characters until I come across a "\n" character, as my aim is to get the file line by line.

Thanks for the idea.

Regards
Dhana

There are two sets of functions for reading data,

open/read/write/close

that operate on file 'handles', and

fopen/fread/fwrite/fgets/fclose

that operate on FILE * 'streams'.

The big advantage of using the streams is that they are buffered, whereas the file handles are not. What this means is that with the non-buffered functions, every time you call read() it makes a system call, and possibly a trip to the physical disk, to get the data.

With the buffered functions, the library allocates a block of memory internally (I believe 8 KB, but I'm not sure) and when you call fread() or fgets() it only hits the disk if there isn't enough data already in the buffer. This is much faster.

By the way, you can increase the buffer size with setvbuf(), and you can use fgets() to get the next line (up to the next occurrence of \n) rather than a fixed number of characters.

To get the fastest possible speed, as mentioned above, you would have to use a big buffer, read a large chunk of the file at once, and then go through it looking for line ends. This avoids extra copying of the data, i.e. being copied from disk into one buffer and then out again into another.
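
A rough sketch of that approach (the buffer size is arbitrary, and lines longer than the buffer or a final unterminated line would need extra care):

#include <stdio.h>
#include <string.h>

/* sketch: read big chunks and split them on '\n' ourselves; the
   partial line after the last newline is moved to the front of
   the buffer and completed by the next fread() */
void process_lines(FILE *in)
{
    static char buf[65536];
    size_t have = 0, n;

    while ((n = fread(buf + have, 1, sizeof(buf) - have, in)) > 0)
    {
        char *p = buf, *nl;
        have += n;
        while ((nl = memchr(p, '\n', have - (p - buf))) != NULL)
        {
            *nl = '\0';
            /* ... p now points at one complete line ... */
            p = nl + 1;
        }
        have -= (size_t)(p - buf);     /* bytes of the partial last line */
        memmove(buf, p, have);         /* keep them for the next round   */
    }
}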

But I'd try just using fgets() first as it probably is fast enough.
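
For instance, the reading loop could look something like this (the 8 KB line limit is an assumption about the data):

#include <stdio.h>

/* sketch: line-by-line reading with the buffered stdio call;
   assumes no line is longer than 8191 characters */
void copy_columns(FILE *in, FILE *out)
{
    char line[8192];

    while (fgets(line, sizeof(line), in) != NULL)
    {
        /* ... cut the wanted columns out of line[] and
           write them to out ... */
    }
}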

Or, whatever the mode in which you open the file, set it to buffered using setvbuf(); that should turn on buffering.
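
Something along these lines (the 64 KB size is just an example):

#include <stdio.h>

/* sketch: open a stream with a large, fully buffered I/O buffer;
   setvbuf() must be called before the first read on the stream */
FILE *open_buffered(const char *path)
{
    static char iobuf[65536];          /* static: one stream at a time */
    FILE *fp = fopen(path, "r");

    if (fp != NULL)
        setvbuf(fp, iobuf, _IOFBF, sizeof(iobuf));
    return fp;
}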

Stevens' book on advanced UNIX programming has a table showing read performance on files.

Since you are reading lines, somewhere down inside the C++ I/O library something like fgets is being called, and that in turn calls read() to fill a buffer. Stevens' table shows that buffer sizes around 4096 are probably close to optimum. Other examples show that using
struct statvfs.f_frsize - the block size of the filesystem in question - will also help.
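
For example, a sketch of querying that block size (falling back to 4096 if the call fails):

#include <sys/statvfs.h>

/* sketch: use the filesystem's fundamental block size as the
   read chunk size */
size_t read_chunk_size(const char *path)
{
    struct statvfs vfs;

    if (statvfs(path, &vfs) == 0 && vfs.f_frsize > 0)
        return (size_t)vfs.f_frsize;
    return 4096;
}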

See man setvbuf.

The other components of speed are the I/O queue request length, on-board disk caching,
and how "far above" the native read call your code operates. The first two are system related. If you call the low-level read routine directly and parse out your own lines, it will probably speed things up - use 4096 or f_frsize as the number of bytes to read:

This is taken from M. Rochkind's book - example:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Read up to nbyte bytes, restarting after signal interruptions.
   NB: buf must be at least nbyte+1 bytes long; the extra byte keeps
   the result NUL-terminated. */
ssize_t readall(int fd, void *buf, size_t nbyte)
{
    ssize_t nread = 0;
    ssize_t n = 0;

    memset(buf, 0x0, nbyte + 1);
    do
    {
        if ((n = read(fd, &((char *)buf)[nread], nbyte - nread)) == -1)
        {
            if (errno == EINTR)
                continue;              /* interrupted by a signal: retry */
            else
                return -1;
        }
        if (n == 0)
            return nread;              /* end of file */
        nread += n;
    } while ((size_t)nread < nbyte);
    return nread;
}

void foo(void)
{
    ssize_t result = 0;
    char buf[4200] = {0x0};            /* 4096 bytes of data plus slack */
    FILE *fp = fopen("somefile", "r");

    if (fp != NULL)
    {
        result = readall(fileno(fp), buf, 4096);
        if (result > 0)
        {
            printf("%s", buf);
        }
        if (result == -1)
        {
            perror("file I/O error");
            exit(1);
        }
        fclose(fp);                    /* close the stream when done */
    }
    else
    {
        perror("file open error");
        exit(1);
    }
}