-T, --temporary-directory=DIR
use DIR for temporaries, not $TMPDIR or /tmp; multiple options specify multiple directories
-S, --buffer-size=SIZE
use SIZE for main memory buffer
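A sketch of those two GNU sort options in use; the temp path and buffer size here are hypothetical examples, not recommendations:

```shell
# -T points sort's temporary files at a roomy disk, -S sets the in-core
# buffer; -u drops duplicate lines. Path and size are placeholders.
printf 'beta\nalpha\nbeta\n' > sample.txt
sort -u -S 1M -T /tmp sample.txt
# prints:
# alpha
# beta
```

Note that sort -u removes duplicates but also reorders the lines, which is why a hash-based script is still needed when the original line order must be preserved.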
It looks like SAH even supports Base-64, which will keep the associative array table to a minimum size (which is what I suspect was breaking my initial AWK routine).
You did not say how much RAM you have, which is a definite factor.
The MD5 result is impressive. Is MD5 as cheap as a good hash? Perhaps CPUs have gotten so much faster than disk that it is not a factor!
In Perl/C/C++/Java you can mmap the input file, both for reading and so that your hash map can hold a 64-bit char* into it for exact verification, reducing copying and allocation overhead.
Exact verification seems to expand the VM footprint a lot, but in the most likely case the MD5 (or other hash) is new, so the exact compare is never done; with only a small minority of duplicates this greatly reduces both processing and the VM footprint. If there were a lot of duplicates, the extra exact verifications would hurt the VM footprint, but conversely there would be less final data in the map.
Yes, I had considered storing the file position against each MD5 sum; when a potential collision occurs, one could then fseek back and re-read the original data to verify a duplicate. This only doubles the memory required, but it does need a random-access file, so no stream processing.
As long as the frequency of duplicates is low, I wouldn't expect a significant change in speed.
This updated version checks for different lines that have the same MD5 and fails with an exit code, so at least you can be confident nothing is wrong. It needs a filename parameter because it requires random access, which streams don't support.
#!/usr/bin/perl -w
use strict;
use Digest::MD5 qw(md5);

open(my $fh, "<", $ARGV[0]) || die "Can't open $ARGV[0]: $!";

my %seen;  # MD5 digest => offset of first occurrence + 1 (so offset 0 still tests true)
while (1) {
    my $pos  = tell $fh;
    my $line = <$fh>;
    last unless defined $line;

    my $crc = md5($line);
    if (!$seen{$crc}) {
        $seen{$crc} = $pos + 1;
        print $line;
    } else {
        # Same MD5 seen before: seek back and re-read the first occurrence
        # to verify this is a true duplicate, not just a hash collision.
        $pos = tell $fh;
        seek $fh, $seen{$crc} - 1, 0;
        $! = 255;  # exit status for die
        die "MD5 collision: different lines share a digest" unless <$fh> eq $line;
        seek $fh, $pos, 0;
    }
}
Takes 110 seconds on the database that already had its duplicate lines removed. Size and wc -l match.
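For what it's worth, there is an independent way to confirm the output really is duplicate-free: uniq -d on sorted input prints only the lines that occur more than once, so empty output means success. The filename below is a placeholder:

```shell
# out.txt stands in for the deduplicated database; here it is given one
# remaining duplicate so uniq -d has something to report.
printf 'alpha\nbeta\nalpha\n' > out.txt
sort out.txt | uniq -d
# prints:
# alpha
```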
I will have to build the raw database again (~70 min) to see if there are collisions. I am leaving work in a few minutes and will let it run while I am gone.
And they did not happen :D Just curious (as I don't know any Perl): does the script complete the output file and then give the error message, or does it terminate at the error?
The database is now up to about 4GB. Chubler_XL's duplicate-line-removal script is still working great. I now build my database in segments and concatenate them, and I accidentally had more than one header in there, which tested how the script handles a CRC collision (it terminated, and the output file was empty except for the header).
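The segment build can be sketched like this; the filenames are hypothetical, and the awk one-liner is only a stand-in for the MD5 script (it keeps the first occurrence of each line in order, but stores every whole line in RAM, which is exactly what the MD5 approach avoids):

```shell
# Hypothetical segment files; concatenate to one file first, since the
# dedup script needs a seekable file rather than a pipe.
printf 'header\nrec1\n' > part1.txt
printf 'header\nrec2\n' > part2.txt
cat part1.txt part2.txt > full.txt
# Classic full-line-hash dedup as a stand-in for the MD5 script:
awk '!seen[$0]++' full.txt
# prints:
# header
# rec1
# rec2
```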
This got me thinking: instead of terminating, could the script fall back to comparing the full text of the lines in the (incredibly rare) event of a CRC collision?