Honey, I broke awk! (duplicate line removal in 30M line 3.7GB csv file)

What about this ?

sort -S 1500M -T $HOME/path -u input.txt > sortedUniq.txt
-T, --temporary-directory=DIR
              use DIR for temporaries, not $TMPDIR or /tmp; multiple options specify multiple directories
-S, --buffer-size=SIZE
              use SIZE for main memory buffer

I am aware of the collision issues. There is a fair amount of entropy in the data (timestamps), so patterns in the input probably aren't making the odds any worse than the theoretical figure.
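For what it's worth, if MD5 behaves like a random 128-bit function, the birthday bound for ~30M lines works out to roughly n^2 / 2^129 = (3x10^7)^2 / 6.8x10^38, i.e. a collision chance on the order of 10^-24.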

Does perl support SHA-1 or SHA-2? That would take the odds from very very unlikely to (even more) astronomical.

Mike

Perl supports everything and anything, but I'm not sure it'd come with it by default. Try it and see. Digest::SHA2 - search.cpan.org
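
The core Digest::SHA module is also worth a look; it has been bundled with Perl itself for a while now, so it may already be on your system. A quick untested check of what it gives you (the sample line below is just made up):

#!/usr/bin/perl -w
# Quick check that Digest::SHA is available and what its digests look like.
use Digest::SHA qw(sha256 sha256_base64);

my $line = "example,2013-07-01 12:00:00,42\n";   # made-up sample record
printf "raw: %d bytes, base64: %d chars (%s)\n",
       length(sha256($line)), length(sha256_base64($line)), sha256_base64($line);

In an md5($line)-based script the change would just be swapping in sha256($line), or sha256_base64($line) if you want printable keys; the raw digest keeps the associative array keys as small as SHA-256 allows, while the Base-64 form is a bit longer but printable.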

Looks like I should try regular SHA

It looks like SHA even supports Base-64 output, which should keep the associative array table to a minimum size (the table growth is what I suspect was breaking my initial AWK routine).

Mike

You did not say how much RAM you have, which is a definite factor.

The MD5 result is impressive. Is MD5 as cheap as a good hash? Perhaps CPUs have gotten so much faster than disk that it is not a factor!

In perl/C/C++/Java you can mmap the input file, both for reading it and so that your hash map can hold a 64-bit pointer (or byte offset) into the mapped data for exact verification, reducing copying and space-allocation overhead.

Exact verification sounds like it would expand the VM footprint a lot, but in the most likely case the MD5 (or other hash) is new, so the exact compare is never done; with only a small minority of duplicates that keeps both the processing and the VM footprint down. If there were a lot of duplicates, the extra exact verifications would hurt the VM footprint, but conversely there would be less final data in the map.
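
A rough Perl sketch of that idea, assuming the CPAN File::Map module is installed (untested, and the line scanning is kept deliberately simple):

#!/usr/bin/perl -w
# Sketch: mmap the input, keep only {md5 digest => byte offset} in memory,
# and go back to the mapped data for an exact compare when a digest repeats.
use strict;
use File::Map qw(map_file);       # CPAN module, assumed installed
use Digest::MD5 qw(md5);

map_file my $map, $ARGV[0], '<';  # whole file visible as one read-only scalar

my %seen;                         # digest => offset of first line with that digest
my $start = 0;
while ($start < length $map) {
    my $nl   = index $map, "\n", $start;
    my $end  = $nl < 0 ? length $map : $nl + 1;   # one past the line's last byte
    my $line = substr $map, $start, $end - $start;
    my $key  = md5($line);
    if (!exists $seen{$key}) {
        $seen{$key} = $start;                     # first sighting: keep the line
        print $line;
    }
    elsif (substr($map, $seen{$key}, $end - $start) ne $line) {
        die "MD5 collision between offsets $seen{$key} and $start\n";
    }                                             # else a true duplicate: drop it
    $start = $end;
}

The hash only holds a 16-byte digest and an integer offset per unique line, and the exact compare reads straight from the mapped file instead of keeping line text around.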

I have 16 GB of ram but my "disks" are also solid state. It's a fast system. The md5 solution is working well for me.

Mike

Yes, I had considered a method of storing the file position against each MD5 sum; when a potential collision occurs you can then seek back and re-read the original data to verify whether it really is a duplicate. This would only double the memory required, but it does need a random-access file, so no stream processing.

As long as the frequency of duplicates is low I wouldn't expect a significant impact on speed.

Duplicates appear to be ~3%.

Mike

This updated version checks for different lines that have the same MD5 and fails with an exit code, so at least you can be confident nothing went wrong. It needs a filename parameter because it requires random access, which streams don't support.

#!/usr/bin/perl -w
use Digest::MD5 qw(md5);
open(DATA, "<", $ARGV[0]) || die "Can't open $ARGV[0]: $!";

my %seen;   # MD5 digest => (byte offset of first line with that digest) + 1
while (1) {
    my $pos  = tell DATA;            # offset of the line we are about to read
    my $line = <DATA>;
    last unless defined $line;
    my $crc = md5($line);
    if (!$seen{$crc}) {
        $seen{$crc} = $pos + 1;      # store offset+1 so offset 0 still tests true
        print $line;
    } else {
        # Digest seen before: re-read the original line and make sure the
        # text really matches; if it doesn't, we have hit an MD5 collision.
        $pos = tell DATA;            # remember where to resume reading
        seek DATA, $seen{$crc} - 1, 0;
        $! = 255;                    # so die exits with status 255
        die "Duplicate CRC found" unless (<DATA> eq $line);
        seek DATA, $pos, 0;          # true duplicate: skip it and carry on
    }
}
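
For reference, it takes the input file as its only argument and writes the de-duplicated lines to stdout, e.g. perl removeDups.pl input.csv > noDups.csv (substitute whatever name you saved the script under).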

Takes 110 seconds on the database that already had its duplicate lines removed. Size and wc -l both match.

I will have to build the raw database again (~70 min) to see if there are collisions. I am leaving work in a few min and will let it run when I am gone.

Mike

It's really just for your peace of mind, as the chance of two different lines having the same MD5 is so low it should never happen.

And they did not happen :D Just curious (as I don't know any Perl): does the script complete the output file and then give the error message, or does it terminate at the error?

Mike

You'll have to be more specific. It's not black and white.

It terminates, setting the exit status ($?) to 255. You will get the partial output file up to the point of the error.

The database is now up to about 4GB. Chubler_XL's duplicate line removal script is still working great. I now build my database in segments and concatenate them, and I accidentally had more than one header in there, which tested how the script handles a CRC collision (it terminated, and the output file was empty except for the header).

This got me thinking: instead of terminating, could the script fall back to comparing the entire line's text in the (incredibly rare) event of a CRC collision, and only treat it as a duplicate when the text actually matches?
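
Something like this is roughly what I mean, as a change to the else branch of the script above (I'm only guessing at the Perl, so treat it as a sketch of the idea):

    } else {
        my $here = tell DATA;               # remember where to resume reading
        seek DATA, $seen{$crc} - 1, 0;
        my $earlier = <DATA>;               # re-read the first line with this digest
        seek DATA, $here, 0;
        print $line if ($earlier ne $line); # same MD5, different text: keep it
        # if the text matches it really is a duplicate, so drop it as before
    }

One wrinkle I can see: a colliding line kept this way never gets its own entry in %seen, so an exact copy of it later in the file would also slip through. Given how unlikely a collision is in the first place, maybe that doesn't matter.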

Mike