Delete duplicate lines in a huge file, leaving only unique lines

Hi All,

I have a huge file (4GB) that contains duplicate lines. I want to delete the duplicates, leaving only unique lines. sort, uniq, and awk '!x[$0]++' are not working, as they run out of buffer space.

I don't know if this would work: I want to read each line of the file in a for loop and delete all the matching lines, leaving one. That way, I think, it would not use any buffer space.
PS: The idea is not to use any second file.
Suggestions, please.

input data:

adsf123
asdlfkjlasdfj
adsf123
asdfasdf12341234
asdfasdf12341234
asdlfkjlasdfj343
asdlfkjlasdfj56

output data:

adsf123
asdlfkjlasdfj
asdfasdf12341234
asdlfkjlasdfj343
asdlfkjlasdfj56

Thanks,
Krish

What's the exact error message returned by the awk command?

awk '!x[$0]++' infile
awk: cmd. line:1: (FILENAME=result FNR=6094197) fatal: assoc_lookup: bucket->ahname_str: can't allocate 423 bytes of memory (Cannot allocate memory)


The command I tried:

awk '!x[$0]++' result > result_new

Try with Perl:

perl -ne 'print unless $_{$_}++' infile

You can split the file (with the "split" command), "sort -u" the chunks separately, and then merge them with "sort -m". (Whether you need this depends, of course, on the memory size of your system.)


Sorry, I forgot to mention: thanks for your prompt replies.

Now I am running:

perl -ne'print unless $_{$_}++' result > result_new2

I am getting this error message: Segmentation fault

Try the solution suggested by yazu:

split -l 1000000 infile          # split into 1,000,000-line chunks named xaa, xab, ...

for f in x*; do
  sort -u "$f" > "$f"_sorted     # sort each chunk, dropping duplicates within it
done

sort -u x*_sorted > final.out    # merge the sorted chunks, dropping duplicates across them

I believe the final sort should be with -u (not -m).

Excuse me, but the last one should be "sort -m". That will require much less memory.

Oops, yes. With -m it's possible for duplicates to remain. But if the last sort won't work (for lack of memory), then "sort -m | uniq" is possible.
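
A minimal sketch of that fallback, assuming the chunk files are named as in the script above:

sort -m x*_sorted | uniq > final.out    # streaming merge of pre-sorted chunks, then drop adjacent duplicates

Since every chunk is already sorted, the merged output is sorted too, so uniq only has to compare neighboring lines and never holds more than a line or two in memory.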

Consider the following:

% cat infile 
adsf123
asdlfkjlasdfj
adsf123
asdfasdf12341234
asdfasdf12341234
asdlfkjlasdfj343
asdlfkjlasdfj56
adsf123
asdlfkjlasdfj
adsf123
asdfasdf12341234
asdfasdf12341234
asdlfkjlasdfj343
asdlfkjlasdfj56
% split -l 5 infile 
% for f in x*; do sort -u "$f" > "$f"_sorted; done
% sort -m x*_sorted
adsf123
adsf123
asdfasdf12341234
asdfasdf12341234
asdlfkjlasdfj
asdlfkjlasdfj
asdlfkjlasdfj343
asdlfkjlasdfj343
asdlfkjlasdfj56
asdlfkjlasdfj56

Or you were suggesting something different?


Yes. You are very right. I've corrected my previous post.

I am trying the split-and-sort approach; I will let you know once it is done. Meanwhile, I have a question: why can't we implement something like the loop below, so that it will not take much space?
for line in `cat infile`
do
    # delete all lines in infile matching $line, leaving one
done
exit 0

Because this is an O(n^2) algorithm: for each of the n lines you rescan the whole file. For a 4GB file it would run for a really long time. (Days? Months? Who knows... :-) )

I got this during the last sort execution:

sort: write failed: /tmp/sortmO2esr: No space left on device

'man sort' yields:

       -T Directory
            Places all temporary files that are created into the directory specified by the
            Directory parameter.

Specify -T with a directory that has enough space for the temp files.
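
For example (the directory name here is just a placeholder for any filesystem with room for the temporaries):

sort -u -T /path/with/space x*_sorted > final.out    # temp files go under /path/with/space instead of /tmp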

Also because `for line in \`cat infile\`` is a frequent shell mistake that attempts to load the entire file into memory at once. Whatever the limit for shell variables on your system is, it's probably less than 4 gigabytes!
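
For the record, a loop that reads one line at a time without slurping the whole file would look like this sketch (though a delete-per-line scheme built on top of it would still be quadratic):

while IFS= read -r line
do
    printf '%s\n' "$line"    # handle one line; the file is never held in memory all at once
done < infile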

You probably won't have to split anything manually. Many (if not most) sort implementations (GNU, *BSD, Solaris, HP-UX, to name a few) will do this for you automatically. They compare the size of the file to be sorted against the system's available memory and make a conservative guess. Intermediate files are then created in $TMPDIR.

As vgersh99 pointed out, often there'll be a -T option to override the environment variable, although if this option is missing, you can simply override the environment default when invoking sort (TMPDIR=/lots/of/space sort ...).
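
For instance (keeping /lots/of/space as a stand-in for any roomy filesystem):

TMPDIR=/lots/of/space sort -u x*_sorted > final.out    # intermediate files land in /lots/of/space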

Regards,
Alister


Thank you everyone.

I am using awk '!($0 in a) {a[$0]; print}', which I found to be the most efficient of all the options.
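
For reference, the full invocation as it would apply to the file names used earlier in the thread (the output name is just an example):

awk '!($0 in a) {a[$0]; print}' result > result_uniq    # print each line only the first time it is seen

Presumably this edges out '!x[$0]++' because it stores empty values instead of counters, though it still keeps one array entry per unique line.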