Issue with Unix cat command

Hi Experts,

I am finding that the performance of the cat command is very weird; it is taking a long time to merge files into a single file. We have a situation where we need to merge more than 100 files into a single file, but with the cat command it runs slowly. I tried paste, join and cat, and cat works faster than either of the others. Has anyone faced a similar issue? Also, what about using sed or awk to merge bigger files?

Thanks,
Rajiv

Your statements contradict each other.

Could you please clarify? :slight_smile:

Matrixmadan,

We are currently using the cat command to combine files, basically appending them together into a single file.

cat file1 file2 file3 ... file100 > final_file

I also tried creating a file list and running cat through xargs:

cat filelist | xargs cat >> final_file

There is no performance gain. The files file1, file2 ... file100 are of varying sizes, and the final_file is very big, somewhere around 15 GB. I want to know if there is a better way to combine multiple files into one.
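(In case it matters, the same file list could also be fed to xargs directly, without the extra cat in the pipeline; I assume the I/O cost is the same either way.)

xargs cat < filelist >> final_file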

Thanks,
Rajiv

#! /opt/third-party/bin/perl

use strict;
use warnings;

my @contents;
my @files_arr;

my $DIR1 = "/source/directory/";
my $big  = "thebigfile.txt";

# Collect the names of all entries in the source directory.
opendir(DIR, $DIR1) || die "Unable to open dir $DIR1 <$!>\n";
  @files_arr = readdir(DIR);
closedir(DIR);

open(BIG, ">", $big) or die "Unable to open file $big <$!>\n";

foreach (@files_arr) {
  # Skip dot files, including the "." and ".." entries.
  if ( ! /^\./ ) {
    print "$_\n";
    $_ = $DIR1 . $_;
    open(FILE, "<", $_) or die "Unable to open file $_ <$!>\n";
      @contents = <FILE>;    # slurp the whole file into memory
    close(FILE);
    print BIG @contents;     # write it verbatim to the big file
  }
}

close(BIG);

exit 0;
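If Perl is awkward to get at on that box, a rough shell equivalent (assuming the source directory is flat, with no subdirectories, and a POSIX find) would be along these lines:

# run this from outside /source/directory so the output file is not picked up by find
find /source/directory -type f ! -name '.*' -exec cat {} + > thebigfile.txt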

MatrixMadan,

I tried running your script, but Perl is not available on that machine; I get the error "Perl not installed".

Thanks,
Rajiv

cat is pretty efficient.

The operation is going to take a combination of CPU and I/O. With the volumes you are talking about, the I/O will have more overhead than the CPU component.
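A quick way to see that split on your box is the shell's time keyword (the file names below are only placeholders): if the real time is much larger than user plus sys, the job is I/O-bound rather than CPU-bound.

time cat file1 file2 file3 > final_file
# real much larger than user + sys means most of the time is spent waiting on disk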

Porter,

I do agree with you. But how far can we tune this to speed the process up? Because of this file appending, the whole load time is impacted. Let me know if you have any other thoughts on this.

Thanks,
Rajiv

There is only one open() being called for each file, and one creat() for the final_file, then reads/writes for the data. How can you reduce that?
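If you want to see that for yourself on a Linux box, strace can count the system calls cat actually makes (truss -c is the rough equivalent on Solaris); the file names here are placeholders:

strace -c cat file1 file2 > /dev/null
# -c prints a per-syscall count summary (startup opens of shared libraries will show up too)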

Exploit the CPU power:

run it as multiple processes.

The following is just a sample; modify the glob pattern to accommodate files 1 to n.

#! /bin/zsh

# Launch one background cat per group of input files (ab100-ab199, ab200-ab299, ...).
# Adjust the glob pattern to match your own file names.
i=1
while [ $i -le 9 ]
do
  cat ab"$i"[0-9][0-9] >> final$i &
  i=$(($i + 1))
done

# Wait for every background cat to finish, then merge the partial results.
wait

cat final* > big
rm final*

exit 0
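If this is saved as, say, parallel_merge.zsh (the name is only for illustration), the two approaches can be timed side by side:

time cat ab* > big_single        # single-process baseline
time zsh parallel_merge.zsh      # parallel version above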

I would think this would take longer because you have twice the I/O: once to write out the final$i files and once to read them back.

No.

Though the I/O is higher than with the straightforward approach, it is executed as multiple slices, so the additional overhead is small compared with the efficiency gained from running the slices in parallel.

I did a sample run for 1000 files with 5600 lines each.

with the approach cat f* > final
it took around 1 m 18 s

and with this approach it came down to
14 s

which I feel is much better! :slight_smile:

How could it be twice the I/O?

The overhead depends purely on the number of cat instances we choose to run.

It could be either
cat f1 .. f10 per instance [ the overhead is comparatively less because there are fewer instances ]
or cat f1 .. f5 per instance [ it is more because there are more instances ]

and it is not twice the I/O compared to the direct approach.

It is twice the IO because if the files contain a total of X bytes, you are reading and writing X bytes once to put them into the final$i files, and then again reading and writing X bytes to put them into the big file.

I'm surprised that you found it faster. A possible explanation is that, because of the I/O buffer cache, your second read is coming from cache rather than from disk. But for large files, the cache would probably not be a significant factor.
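If you want to rule the cache in or out, the page cache can be flushed between runs on Linux (root required; other Unixes need different mechanisms), so that the second read really comes from disk:

sync
echo 3 > /proc/sys/vm/drop_caches    # Linux only, as root: drop clean page-cache data before re-timing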

It is twice the IO because 

You are (exactly) right!

Basically there is 1 open and 1 close for each and every file ( 1 .. n ),
plus 1 open and 1 close for the final file;

that should be exploited the other way, hence I had to make use of the internal I/O buffer cache.

My approach is flawed when it comes to huge files with a small I/O buffer cache.

The OP would have to give a rough count of the number of bytes in the files.

Thanks for pointing that out! :slight_smile: