Performance problem with removing duplicates in a huge file (50+ GB)

I'm trying to remove duplicate records from an unsorted input file that is larger than 50 GB and write the unique records to a new file.

I have already tried a variety of options posted in similar threads/forums, but no luck so far.

Any suggestions please ?

Thanks !!

The problems are:

Using associative arrays in awk can consume a lot of memory for a file that large, so it may not work well. If the key(s) are small you might get it to work, and awk is otherwise very well suited to this kind of job.
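For reference, the usual awk idiom is the one-liner below (a minimal sketch; infile and outfile are placeholders). Its memory use grows with the number of unique records, which is exactly the concern for a 50 GB input:

awk '!seen[$0]++' infile > outfile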

Please post a few records from the big file and carefully define the columns or keys you use to detect duplicates.

1 Like

The sort -u has less scaling difficulty than associative arrays in a shell tool. If the order of lines is important, number them first, then sort out the duplicates, then sort back to the original order and remove the numbering; see the sketch below. The sort will keep the first of any duplicates.
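A sketch of that number/sort/renumber sequence, assuming GNU sort, a shell that understands $'\t', and no literal tab characters in the data (file names are placeholders):

# number each line, drop duplicates of the record text (GNU sort -u keeps the
# first of each equal-keyed run), restore the original order, strip the numbers
awk -v OFS='\t' '{print NR, $0}' infile |
    sort -t $'\t' -k2 -u |
    sort -t $'\t' -k1,1n |
    cut -f2- > outfile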

Java and C++ APIs like RogueWave H++ have the same associative-array functionality, more accurately called a hash-map container. The file could be mmap64() mapped into the program so the hash map holds only hash keys and offsets. The clean file could be recreated, or even updated in place and truncated with fcntl(). The shell has noticeably more overhead; big jobs often need sharper tools.

1 Like

Sample records from file:

14480020180,A20180,A020180,143245765381,A00062,17284171796 
14480020180,A20180,A020180,143245765381,A00062,17284171796 
14480000127,A00127,A000127,143245730649,A00127, 
14480020180,A20180,A020180,143245765381,A00062,17284171796 
14480000127,A00127,A000127,143245730649,A00127, 
14480020180,A20180,A020180,143245765381,A00062,17284171796 
14480042302,A42302,A000127,143245800913,A00127, 
14480020180,A20180,A020180,143245765381,A00062,17284171796 
14480041999,A41999,A000127,143245801337,A00127, 
14480020180,A20180,A020180,143245765381,A00062,17284171796 
14480000163,A00163,A000163,143245730774,A00163,4133403 
14480042302,A42302,A000127,143245800913,A00127, 

Desired Output:-

14480020180,A20180,A020180,143245765381,A00062,17284171796 
14480000127,A00127,A000127,143245730649,A00127, 
14480000163,A00163,A000163,143245730774,A00163,4133403 
14480041999,A41999,A000127,143245801337,A00127, 
14480042302,A42302,A000127,143245800913,A00127, 

I should also add that this file contains 40-50% (20-25 GB) duplicate records.
And unfortunately, all columns need to be considered as part of the key to determine duplicates.

The order of the data (sorted/unsorted) in the resultant file doesn't matter. Only the removal of duplicates is essential.

As is, the data is going to be very difficult to manage.

My suggestion would be to try transforming the data into something more suitable for sort -u, then transforming it back after.

Any DBAs around with some spare disk space?

Use a DB server. Create a single column table with a unique index on that column. Insert each line as a row into the table, ignoring duplicate entry failures. Export the data.

Might not be super fast, but it'll be faster than any script. And it's easy.
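A rough sketch of that recipe using the sqlite3 command-line shell instead of a full DB server (database, table, and file names are made up; it assumes no tab characters in the data, a reasonably recent sqlite3, and roughly twice the input size in free disk space):

sqlite3 dedup.db <<'EOF'
CREATE TABLE staging(line TEXT);
CREATE TABLE uniq(line TEXT PRIMARY KEY);
.separator "\t"
.import bigfile.txt staging
INSERT OR IGNORE INTO uniq SELECT line FROM staging;
.once unique_records.txt
SELECT line FROM uniq;
EOF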

2 Likes

I don't think there is a super-fast way to handle 50G of data.

Taking advantage of a database index sounds as good a way as any; a proper DB is designed to duplicate-check data larger than memory.

What about a regular sort and then awk:

sort infile | awk '$0!=p; {p=$0}'

--
Never mind, it doesn't make much of a difference...

Using mmap64() means you only need the components of a hash map of offsets on the heap. If a record's hash is new, the record is unique so far, can be written out, and its offset is remembered under that hash. If the hash is not new, the actual keys are compared between the two mmap'd locations; if they differ (a collision), the offset is added as a second entry in that hash bucket. The empty hash map is an array of N null hash-bucket pointers, where keys are hashed modulo N. If you write your own container, you can make N a power of 2 so the modulus operation is just a mask of the low bits. For a 50 GB file of 50-byte records (about a billion records) on a 4 GHz CPU, every 4 cycles spent per record adds roughly a second of run time.

Alternatively, some sorts can use parallel resources. Professional tools like Ab Initio divide files up for parallel processing. While record boundaries are not predictable, each thread knows it is responsible only for lines that begin after its starting offset, possibly reading past its end offset to finish the last record. Eventually, merging the sorted pieces tends to become the bottleneck, unless the input threads send records to N key-partitioned output sort threads, in which case the outputs are simply concatenated. SyncSort makes an app out of it.
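Given the follow-up above that the output order doesn't matter, one hedged sketch of using parallel resources is simply GNU sort's own knobs (the thread count, buffer size, and temp directory below are illustrative and need to match the machine; the temp filesystem needs roughly the input size in free space):

sort -u --parallel=8 -S 16G -T /big/tmp infile > outfile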

1 Like

Probably an iteration?
Assuming awk can comfortably hold 10000 lines in memory.

awk '{if (NR<10000) {if (!s[$0]++) print >> "uniq"} else {if (!($0 in s)) print}}' < file1 > file2

Then repeat with

awk '...' < file2 > file1

...

Repeat until the intermediate file is empty; the unique records accumulate in the file uniq.
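Spelled out as a loop (a sketch; bigfile, file1, and file2 are placeholders, and the 10000-line limit per pass would have to be raised a great deal for a 50 GB input to finish in a sensible number of passes):

> uniq        # start with an empty output file
cp bigfile file1
while [ -s file1 ]
do
    awk '{if (NR<10000) {if (!s[$0]++) print >> "uniq"} else {if (!($0 in s)) print}}' < file1 > file2
    mv file2 file1
done
# all unique records are now in the file uniq

Each pass memorizes the next chunk of lines, appends their unique ones to uniq, and passes only the not-yet-seen remainder on to the next pass, so the intermediate file shrinks until it is empty.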