I am finding the performance of the cat command very weird; it is taking a long time to merge files into a single file. We have a situation where we merge more than 100 files into one, and cat is running slow. I tried paste, join, and cat, and cat was still faster than either of the others. If anyone has faced a similar issue, please guide me. Would sed or awk be any better for merging large files?
We are currently using the cat command to combine files, basically appending them together into a single file:
cat file1 file2 file3 ... file100 > final_file
I tried creating a file list and running cat this way:
cat filelist | xargs cat >> final_file
There is no performance gain. The files file1 ... file100 are of varying size, and the final file is very big, around 15 GB. Is there a better way to combine multiple files into one?
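One small thing worth trying: the `cat filelist | xargs cat >> final_file` pipeline spawns an extra cat process just to feed the list, and `>>` reopens the output for append. A sketch of a leaner variant, assuming `filelist` holds one filename per line (the names `filelist` and `final_file` are just the ones from the post):

```shell
# Read the list with a redirect instead of piping it through cat,
# and open final_file once with ">".  xargs batches the names into
# as few cat invocations as ARG_MAX allows.
xargs cat < filelist > final_file
```

This will not change the raw disk I/O, which dominates at 15 GB, but it removes a process and a pipe from the path.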
#! /opt/third-party/bin/perl
use strict;

my $DIR1 = "/source/directory/";
my $big  = "thebigfile.txt";

opendir(DIR, $DIR1) || die "Unable to open dir $DIR1 <$!>\n";
my @files_arr = readdir(DIR);
closedir(DIR);

open(BIG, ">", $big) or die "Unable to open file $big <$!>\n";
foreach (@files_arr) {
    next if /^\./;    # skips ".", "..", and other dotfiles in one test
    print "$_\n";
    my $path = $DIR1 . $_;
    open(FILE, "<", $path) or die "Unable to open file $path <$!>\n";
    # Slurp the whole file into memory.  Fine for small files, but for
    # large ones a read/write loop would use far less RAM.
    my @contents = <FILE>;
    close(FILE);
    # Note: no quotes around @contents -- "@contents" would interpolate
    # the array and insert a space between every pair of lines.
    print BIG @contents;
}
close(BIG);
exit 0;
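For comparison, the same merge can be done directly from the shell; a sketch, reusing the script's names (`/source/directory`, `thebigfile.txt`), assuming the output file does not live inside the source directory:

```shell
# cat streams each file through a fixed-size buffer instead of slurping
# whole files into memory the way the Perl loop's @contents = <FILE> does.
cat /source/directory/* > thebigfile.txt
```

The glob also skips dotfiles by default, matching the script's `/^\./` filter.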
The operation is going to take a combination of CPU and I/O. With the volumes you are talking about, the I/O will have more overhead than the CPU component.
I do agree with you. But how far can we tune this to speed up the process? Because of this file appending, the whole load time is impacted. Let me know if you have any other thoughts on this.
Though the I/O is quite a bit higher than in the straightforward approach, it is executed as multiple slices, so the additional overhead is small compared with the efficiency gained from running those slices.
I did a sample run with 1000 files of 5600 lines each.
with the approach cat f* > final
it took around 1m 18s
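A rough harness for reproducing that kind of measurement might look like the sketch below (scaled down to 100 files of 560 lines; the directory name `cat_test` is just illustrative):

```shell
# Generate sample files, then time the merge.  Run in a fresh directory:
# on a second run the output name "final" would itself match the glob f*.
mkdir -p cat_test && cd cat_test
for i in $(seq 1 100); do
    seq 1 560 > "f$i"
done
time cat f* > final
wc -l final    # 100 files x 560 lines = 56000 lines
```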
It is twice the I/O because, if the files contain a total of X bytes, you are reading and writing X bytes once to put them into final$i, and then reading and writing X bytes again to put them into the big file.
I'm surprised that you found it faster. A possible explanation is that, because of the I/O buffer cache, your second read is coming from cache rather than from disk. But for files this large, the cache would probably not be a significant factor.