Unique values from a Terabyte File

Hi,

I have been dealing with files of only a few gigs until now and was able to get by using the sort utility. But now I have a terabyte file from which I want to filter out the unique values.

I have a server with 8 processors, 16 GB of RAM, and a 5 TB HDD. Is it worthwhile to try sort again for this type of problem, or is there a better solution? Any help is much appreciated.

Not really.

Running a plain sort again on a terabyte-sized problem won't scale properly, and it isn't needed either.

Problems like this, where the computational cost grows with the number of records to be processed, can be handled with a map-reduce approach: split the file into 'n' chunks, process each chunk independently, and then combine the processed chunks.
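Roughly like this, as a minimal sketch with plain Unix tools (the file name, chunk size, and output names are made up, and it assumes one value per record in a plain-text file):

# 1. Split the terabyte file into manageable pieces (here ~100M lines each)
split -l 100000000 values.txt chunk_

# 2. "Map" step: remove duplicates within each chunk
for f in chunk_*; do
    sort -u "$f" > "$f.sorted" && rm "$f"
done

# 3. "Reduce" step: merge the already-sorted chunks, dropping duplicates across chunks (GNU sort)
sort -m -u chunk_*.sorted > unique_values.txt

Each chunk can be sorted in parallel (one per processor) by backgrounding the loop body, and the final sort -m only merges already-sorted inputs, so it is much cheaper than sorting the whole terabyte in one pass.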

So, if I have just a single server with 8 processors, would I be able to execute such an algorithm? I am a little new to these things, so I apologize if the question is silly. I was just wondering whether there is an algorithm that just splits up the original file and then processes it bit by bit...

Also, what is the main problem with just creating a hashmap? I mean, if there are only a few unique values, where would the problem come from in the first place?

If I may ask, what type of file is this? On a single-instance, rather urgent job, I was able to take a plain-text file and use the split command. It bothered me a bit, since the HDD was hit pretty hard, but the job got done. Would your file work with something that primitive?

Oh, this is a text file too, with a bunch of numbers from a network simulation experiment... I was thinking of actually splitting the file and getting the job done, but was just curious whether there are better ways of doing things, like matrixmadhan suggested...

(I keep forgetting about this, sorry, bad memory :confused:)

You could probably try what I had posted in the post below for your other question.

It handles these kinds of huge-dataset problems fairly well. Running sort over such a big file would be really taxing; the best approach is to split it and achieve the same result.

A hashmap, or an associative array (another name for the same thing), is probably best.

You might even try awk, if your version handles large files. Assume your map key is characters 1-10 of the record.

awk '!arr[substr($0,1,10)]++' myTBfile
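If the whole record is the key rather than just its first 10 characters, the same idea becomes awk '!seen[$0]++' myTBfile (seen is just an illustrative array name). The main thing to watch is memory: the associative array grows with the number of distinct keys, so this works well only when the number of unique values is small relative to RAM, which sounds like your situation.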