Honey, I broke awk! (duplicate line removal in 30M line 3.7GB csv file)

I have a script that builds a database: ~30 million lines, ~3.7 GB .csv file. After multiple optimizations it takes about 62 minutes to bring in and parse all the files, and it used to take about 10 minutes to remove duplicates, until I was asked to add another column. I am using the highly optimized awk code:

awk '!($0 in a) { a[$0]; print }'

In this case a[$0] merely creates the array element without assigning it a value, and ($0 in a) apparently checks the array's index table without creating the element itself.
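To make that distinction concrete, here is a small illustrative awk program (not from my script, just a demonstration of the two behaviors):

awk 'BEGIN {
   if ("foo" in a) print "never printed"         # the in-test does not create a["foo"]
   x = a["bar"]                                  # merely referencing a["bar"] creates it
   if ("bar" in a) print "a[\"bar\"] now exists"
}'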

The traditional solution took many hours even with test chunks of the data.

I previously tried to debug some files with corruption by adding a filename@path column, and that exceeded the maximum pipe size. I was asked to add a relatively short column; it does not break any pipes, but the run does not complete even after 12 hours.

I've removed several pipes in my script and replaced them with temp files to hold data between commands, but the addition of that short column has caused the duplicate line removal to go from ~10 minutes to God knows how long.

Any other options to do the duplicate line removal more efficiently (in segments, sorting first, etc.)?

Do I have any other options? My environment is:
GNU Awk 4.0.2
GNU bash, version 4.1.10(4)-release
CYGWIN_NT-6.1-WOW64 1.7.17(0.262/5/3) 2012-10-19 14:39 i686 Cygwin
Windows 7 Enterprise Ver 6.1 Build 7601 Service Pack 1

If I recall correctly, Cygwin and non-64-bit Windows 7 are 32-bit, so 'a' may grow too large for the address space of the awk process. Depending on RAM size, it might eventually start thrashing as well. The awk solution, a hash map, does not support parallelism.

The classic, robust solution is 'sort -u <file_set>', but it tends to be slower. You can parallelize the sort with a command of the form:

sort -mu <( sort -u <file_set_1> )  <( sort -u <file_set_2> ) <( sort -u <file_set_3> ) . . . .

where ksh or bash (the nicer shells) turns each '<(...)' into a named pipe whose sort runs concurrently. I like to use twice as many <(sort)'s as there are cores, assuming roughly 50% I/O delay. The final pass of each <(sort) feeds the parent 'sort -m' merge.

ETL programs like Ab Initio know how to tell parallel processes to split up big files and process each part separately, even when the files are linefeed-delimited (each worker searches up or down for the dividing linefeed closest to its N-byte offset into the file). Does anyone know of a utility that can split a file this way (without reading it sequentially)? GNU parallel?
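For what it's worth, GNU coreutils split can do something similar: 'split -n l/N' divides a file into N pieces without breaking any lines (each cut is adjusted to a linefeed), and 'split -n l/K/N' writes only the K-th such piece to stdout. I'm not certain it avoids a fully sequential read, but it does respect line boundaries. A rough, untested sketch combining it with the merge idea above (file names are placeholders):

split -n l/4 big.csv part_
sort -mu <( sort -u part_aa ) <( sort -u part_ab ) \
         <( sort -u part_ac ) <( sort -u part_ad ) > deduped.csv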

I highly recommend using Linux rather than Cygwin for better performance. You might try running a virtual Linux image (e.g., VMware). I also sometimes create a RAM disk (on Linux) to speed things up when I am processing a lot of lines in a file.
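If you go the Linux route, a RAM disk is just a tmpfs mount; a minimal sketch (the mount point and size here are arbitrary examples, not values from this thread):

sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=8G tmpfs /mnt/ramdisk
# then point your temp/intermediate files at /mnt/ramdisk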

I agree that sort is an efficient way to do this. The sort utility in UnxUtils seems to work well if you control where it puts its temporary files.
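With GNU sort (the Cygwin build should accept the same options) you can control both the temp-file location and the memory buffer explicitly; the path and size below are examples only:

sort -u -T /cygdrive/d/tmp -S 1G big.csv > deduped.csv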

Do you really need to compare the whole line for uniqueness?

You should get a big reduction in memory usage by only checking the first 100 chars with:

awk '!(substr($0,1,100) in a) { a[substr($0,1,100)]; print }'

Thanks for all the suggestions. My workaround so far has been to skip the duplicate line removal and add it to the queries I run against the database instead. This has allowed me to proceed with my analysis.

I will go back and fix the database at some point, however, once the current deadline passes.

I will have to check closely which fields alone can indicate a duplicate record. Since the addition of a column (and therefore the length of $0) is what broke it, taking something out may help. I'm not sure, as I am already taking just the important columns from two types of database files, and the "housekeeping columns" are not included.
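If a few fields do turn out to identify a record, the array key can be built from just those fields instead of $0. A hypothetical sketch, assuming comma-separated data and that columns 1, 3 and 6 (placeholders, not the real layout) define a duplicate:

awk -F, '!(($1,$3,$6) in a) { a[$1,$3,$6]; print }' big.csv > deduped.csv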

Is there a way to do a checksum or a fairly robust hash in awk? That might be the best way to shorten the array subscripts, which appear to be what is killing awk.

Mike

PS. I would love to be using real Linux instead of Cygwin (as I do at home). Unfortunately that is a boundary condition.

Mike, you could look at "algorithm - Calculating CRC in awk" on Stack Overflow, where a nice POSIX-cksum-compatible crc32() function is posted.

But I'm not sure how much overhead this will add to your code.


Here is a quick adaptation for your task:

awk 'BEGIN {
 # Initialize CRC32 table
    T[0]=0x00000000;
    T[1]=0x04c11db7;  T[2]=0x09823b6e;  T[3]=0x0d4326d9;  T[4]=0x130476dc;  T[5]=0x17c56b6b;
    T[6]=0x1a864db2;  T[7]=0x1e475005;  T[8]=0x2608edb8;  T[9]=0x22c9f00f; T[10]=0x2f8ad6d6;
   T[11]=0x2b4bcb61; T[12]=0x350c9b64; T[13]=0x31cd86d3; T[14]=0x3c8ea00a; T[15]=0x384fbdbd;
   T[16]=0x4c11db70; T[17]=0x48d0c6c7; T[18]=0x4593e01e; T[19]=0x4152fda9; T[20]=0x5f15adac;
   T[21]=0x5bd4b01b; T[22]=0x569796c2; T[23]=0x52568b75; T[24]=0x6a1936c8; T[25]=0x6ed82b7f;
   T[26]=0x639b0da6; T[27]=0x675a1011; T[28]=0x791d4014; T[29]=0x7ddc5da3; T[30]=0x709f7b7a;
   T[31]=0x745e66cd; T[32]=0x9823b6e0; T[33]=0x9ce2ab57; T[34]=0x91a18d8e; T[35]=0x95609039;
   T[36]=0x8b27c03c; T[37]=0x8fe6dd8b; T[38]=0x82a5fb52; T[39]=0x8664e6e5; T[40]=0xbe2b5b58;
   T[41]=0xbaea46ef; T[42]=0xb7a96036; T[43]=0xb3687d81; T[44]=0xad2f2d84; T[45]=0xa9ee3033;
   T[46]=0xa4ad16ea; T[47]=0xa06c0b5d; T[48]=0xd4326d90; T[49]=0xd0f37027; T[50]=0xddb056fe;
   T[51]=0xd9714b49; T[52]=0xc7361b4c; T[53]=0xc3f706fb; T[54]=0xceb42022; T[55]=0xca753d95;
   T[56]=0xf23a8028; T[57]=0xf6fb9d9f; T[58]=0xfbb8bb46; T[59]=0xff79a6f1; T[60]=0xe13ef6f4;
   T[61]=0xe5ffeb43; T[62]=0xe8bccd9a; T[63]=0xec7dd02d; T[64]=0x34867077; T[65]=0x30476dc0;
   T[66]=0x3d044b19; T[67]=0x39c556ae; T[68]=0x278206ab; T[69]=0x23431b1c; T[70]=0x2e003dc5;
   T[71]=0x2ac12072; T[72]=0x128e9dcf; T[73]=0x164f8078; T[74]=0x1b0ca6a1; T[75]=0x1fcdbb16;
   T[76]=0x018aeb13; T[77]=0x054bf6a4; T[78]=0x0808d07d; T[79]=0x0cc9cdca; T[80]=0x7897ab07;
   T[81]=0x7c56b6b0; T[82]=0x71159069; T[83]=0x75d48dde; T[84]=0x6b93dddb; T[85]=0x6f52c06c;
   T[86]=0x6211e6b5; T[87]=0x66d0fb02; T[88]=0x5e9f46bf; T[89]=0x5a5e5b08; T[90]=0x571d7dd1;
   T[91]=0x53dc6066; T[92]=0x4d9b3063; T[93]=0x495a2dd4; T[94]=0x44190b0d; T[95]=0x40d816ba;
   T[96]=0xaca5c697; T[97]=0xa864db20; T[98]=0xa527fdf9; T[99]=0xa1e6e04e;T[100]=0xbfa1b04b;
  T[101]=0xbb60adfc;T[102]=0xb6238b25;T[103]=0xb2e29692;T[104]=0x8aad2b2f;T[105]=0x8e6c3698;
  T[106]=0x832f1041;T[107]=0x87ee0df6;T[108]=0x99a95df3;T[109]=0x9d684044;T[110]=0x902b669d;
  T[111]=0x94ea7b2a;T[112]=0xe0b41de7;T[113]=0xe4750050;T[114]=0xe9362689;T[115]=0xedf73b3e;
  T[116]=0xf3b06b3b;T[117]=0xf771768c;T[118]=0xfa325055;T[119]=0xfef34de2;T[120]=0xc6bcf05f;
  T[121]=0xc27dede8;T[122]=0xcf3ecb31;T[123]=0xcbffd686;T[124]=0xd5b88683;T[125]=0xd1799b34;
  T[126]=0xdc3abded;T[127]=0xd8fba05a;T[128]=0x690ce0ee;T[129]=0x6dcdfd59;T[130]=0x608edb80;
  T[131]=0x644fc637;T[132]=0x7a089632;T[133]=0x7ec98b85;T[134]=0x738aad5c;T[135]=0x774bb0eb;
  T[136]=0x4f040d56;T[137]=0x4bc510e1;T[138]=0x46863638;T[139]=0x42472b8f;T[140]=0x5c007b8a;
  T[141]=0x58c1663d;T[142]=0x558240e4;T[143]=0x51435d53;T[144]=0x251d3b9e;T[145]=0x21dc2629;
  T[146]=0x2c9f00f0;T[147]=0x285e1d47;T[148]=0x36194d42;T[149]=0x32d850f5;T[150]=0x3f9b762c;
  T[151]=0x3b5a6b9b;T[152]=0x0315d626;T[153]=0x07d4cb91;T[154]=0x0a97ed48;T[155]=0x0e56f0ff;
  T[156]=0x1011a0fa;T[157]=0x14d0bd4d;T[158]=0x19939b94;T[159]=0x1d528623;T[160]=0xf12f560e;
  T[161]=0xf5ee4bb9;T[162]=0xf8ad6d60;T[163]=0xfc6c70d7;T[164]=0xe22b20d2;T[165]=0xe6ea3d65;
  T[166]=0xeba91bbc;T[167]=0xef68060b;T[168]=0xd727bbb6;T[169]=0xd3e6a601;T[170]=0xdea580d8;
  T[171]=0xda649d6f;T[172]=0xc423cd6a;T[173]=0xc0e2d0dd;T[174]=0xcda1f604;T[175]=0xc960ebb3;
  T[176]=0xbd3e8d7e;T[177]=0xb9ff90c9;T[178]=0xb4bcb610;T[179]=0xb07daba7;T[180]=0xae3afba2;
  T[181]=0xaafbe615;T[182]=0xa7b8c0cc;T[183]=0xa379dd7b;T[184]=0x9b3660c6;T[185]=0x9ff77d71;
  T[186]=0x92b45ba8;T[187]=0x9675461f;T[188]=0x8832161a;T[189]=0x8cf30bad;T[190]=0x81b02d74;
  T[191]=0x857130c3;T[192]=0x5d8a9099;T[193]=0x594b8d2e;T[194]=0x5408abf7;T[195]=0x50c9b640;
  T[196]=0x4e8ee645;T[197]=0x4a4ffbf2;T[198]=0x470cdd2b;T[199]=0x43cdc09c;T[200]=0x7b827d21;
  T[201]=0x7f436096;T[202]=0x7200464f;T[203]=0x76c15bf8;T[204]=0x68860bfd;T[205]=0x6c47164a;
  T[206]=0x61043093;T[207]=0x65c52d24;T[208]=0x119b4be9;T[209]=0x155a565e;T[210]=0x18197087;
  T[211]=0x1cd86d30;T[212]=0x029f3d35;T[213]=0x065e2082;T[214]=0x0b1d065b;T[215]=0x0fdc1bec;
  T[216]=0x3793a651;T[217]=0x3352bbe6;T[218]=0x3e119d3f;T[219]=0x3ad08088;T[220]=0x2497d08d;
  T[221]=0x2056cd3a;T[222]=0x2d15ebe3;T[223]=0x29d4f654;T[224]=0xc5a92679;T[225]=0xc1683bce;
  T[226]=0xcc2b1d17;T[227]=0xc8ea00a0;T[228]=0xd6ad50a5;T[229]=0xd26c4d12;T[230]=0xdf2f6bcb;
  T[231]=0xdbee767c;T[232]=0xe3a1cbc1;T[233]=0xe760d676;T[234]=0xea23f0af;T[235]=0xeee2ed18;
  T[236]=0xf0a5bd1d;T[237]=0xf464a0aa;T[238]=0xf9278673;T[239]=0xfde69bc4;T[240]=0x89b8fd09;
  T[241]=0x8d79e0be;T[242]=0x803ac667;T[243]=0x84fbdbd0;T[244]=0x9abc8bd5;T[245]=0x9e7d9662;
  T[246]=0x933eb0bb;T[247]=0x97ffad0c;T[248]=0xafb010b1;T[249]=0xab710d06;T[250]=0xa6322bdf;
  T[251]=0xa2f33668;T[252]=0xbcb4666d;T[253]=0xb8757bda;T[254]=0xb5365d03;T[255]=0xb1f740b4;

# Init raw data to int lookup table
  for(i=0;i<=255;i++)X[sprintf("%c",i)]=i;
}
function u32(v) { 
   return and(v,0xffffffff)
}
function crc32(str,crc,i,A,len) {
   crc=u32(0)
   A[0]=split(str,A,"")
   len=A[0]
   for(i=1;i<=A[0];i++)
      crc=u32(xor(u32(lshift(crc,8)),T[u32(and(xor(rshift(crc,24),X[A[i]]),0xFF))]));

  # Finish the CRC32: fold the length of the data into the CRC, as POSIX cksum does
  while(len){crc=u32(xor(u32(lshift(crc,8)),T[u32(and(xor(rshift(crc,24),and(len,0xFF)),0xFF))]));len=rshift(len,8);}
  return u32(compl(crc));
}
{ v1=crc32($0); v2=crc32("XXX" $0); if ((v1,v2) in a) next; a[v1,v2] } 1'

Edit: Changed to calculate two checksums (v1 and v2) to try and avoid collisions, as you have such a large number of records

You could also try perl:

#!/usr/bin/perl -w
use Digest::MD5 qw(md5);
my %seen;
while (<>) {
  print $_ unless ($seen{md5($_)}++);
}

My original code spits out about 2.5 GB (which is less than the full dataset) after about 70 minutes and then never completes.

I tried Chubler_X's code above and terminated it after 16 hours with no output.

One way I can definitely split without duplicates spanning split files is to split on date. It only takes about one minute to split the data in two with the following code:

awk -F, -v startDT="$startDate" -v endDT="$endDate" '
            BEGIN { s=mktime(startDT); e=mktime(endDT)}
            NR == 1 {print}
            NR > 1 { t = $6; gsub(/\-|:/, " ", t); t = mktime(t); if ( s <= t && t <= e) {print} }'
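For reference, this assumes startDate/endDate are already in mktime()'s "YYYY MM DD HH MM SS" form and that field 6 holds a timestamp like YYYY-MM-DD HH:MM:SS (which is what the gsub converts). A hypothetical invocation, with made-up dates and file names:

startDate="2014 01 01 00 00 00"
endDate="2014 06 30 23 59 59"
awk -F, -v startDT="$startDate" -v endDT="$endDate" '...script above...' big.csv > first_half.csv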

Mike

I take it sorting is absolutely out of the question...? It would safely handle files of arbitrary size.

Perl would be better for comparing via hashes. Doing an MD5 or the like in awk would mean calling an external md5 utility 30 million times, whereas with Perl you can at least get a built-in module.

You would definitely want a built-in, especially in a Windows environment, where creating new threads is cheap but new processes come with a lot of extra overhead.

Sorting is not out of the question, unless it turns out to have serious performance problems just as the duplicate removal does.

Mike

Did you try my perl code? I have fairly high hopes it could do the job.

Still no reply -- is this 32-bit Windows running Cygwin running whatever? If so, the 4 GB address space can make hash-based tools fail, often not gracefully, and often well before the nominal 4 GB limit (more like 1.7 GB), stumbling over some signed 32-bit int along the way, plus the address space consumed by code and other data. It sounds like you need a 64-bit CPU and OS.

Once you run past the RAM, the sequential reading and writing of sort may outperform the random access of a hash. Also, not all hashes are written for dynamic expansion of the bucket count, so the amount of linear searching inside each bucket may increase. In RogueWave, for instance, you should set the bucket count according to the size of the set at the start. See: Extendible hashing - Wikipedia, the free encyclopedia

Sorry, it did not appear to me that you were asking me a question in your response.
I am running 64-bit Windows with dual i7-2860QM CPUs.
It looks like Cygwin 1.7.17 is 32-bit; 64-bit support started with 1.7.22 in July of last year. I will try it.

Mike

Agreed, it would be a very tight squeeze to solve in-memory in a 32-bit environment: with 30M records we would only get about 50 bytes per record to play with. This is why large datasets are usually stored in databases.

I'd love to, but I don't know any Perl. How do I handle the file I/O in Perl?

Mike

Save this as uniq.pl:

#!/usr/bin/perl -w
use Digest::MD5 qw(md5);
my %seen;
while (<>) {
  print $_ unless ($seen{md5($_)}++);
}

Then run it like this:

$ ./uniq.pl < big_ass_file_with_dups.csv > slightly_smaller_no_dups.csv

Goodness! I need to learn some perl!

Executed in 92 seconds! Output looks correct.

Mike

Edit: Under 64-bit Cygwin it took about 74 seconds.

Should an MD5 collision occur, it will suppress a record that should have been printed. Be aware that it is possible (even if not probable) that your results may incorrectly omit data.

Regards,
Alister


1 in 3x10^38? I like those odds!

However remote the possibility, it merits mentioning in case it's a critical task.

Regards,
Alister