Delete duplicate lines in a huge file, leaving only unique lines

Hi All,

I have a huge file (4GB) that contains duplicate lines. I want to delete the duplicates, leaving only unique lines. sort, uniq, and awk '!x[$0]++' are not working, as they run out of buffer space.

I don't know if this would work: I want to read each line of the file in a for loop and delete all the matching lines, leaving one. That way, I think, it would not use any buffer space.
PS: The idea is not to use any second file.
Suggestions, please.

input data:

adsf123
asdlfkjlasdfj
adsf123
asdfasdf12341234
asdfasdf12341234
asdlfkjlasdfj343
asdlfkjlasdfj56

output data:

adsf123
asdlfkjlasdfj
asdfasdf12341234
asdlfkjlasdfj343
asdlfkjlasdfj56

Thanks,
Krish

What's the exact error message returned by the awk command?

awk '!x[$0]++' infile
awk: cmd. line:1: (FILENAME=result FNR=6094197) fatal: assoc_lookup: bucket->ahname_str: can't allocate 423 bytes of memory (Cannot allocate memory)


The command I tried:

awk '!x[$0]++' result > result_new

Try with Perl:

perl -ne 'print unless $_{$_}++' infile

You can split the file (with the "split" command), "sort -u" the chunks separately, and then merge them with "sort -m". (Whether you need this depends, of course, on the memory size of your system.)


Sorry, I forgot to mention: thanks for your prompt replies.

Now I am running:

perl -ne'print unless $_{$_}++' result > result_new2

I am getting this error message: Segmentation fault

Try the solution suggested by yazu:

split -l 1000000 infile          # split into 1,000,000-line chunks named xaa, xab, ...

for f in x*; do
  sort -u "$f" > "$f"_sorted     # sort each chunk, dropping duplicates within it
done

sort -u x*_sorted > final.out    # merge the sorted chunks, dropping duplicates across them

I believe the final sort should be with -u (not -m).

Excuse me, but the last one should be "sort -m". That will require much less memory.

Oops, yes. With -m it's possible for duplicates to remain. But if the last sort won't work (for lack of memory), then "sort -m | uniq" is possible.
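
A minimal sketch of that fallback, assuming the chunk files are named as in the script above:

sort -m x*_sorted | uniq > final.out    # streaming merge of pre-sorted chunks, then drop adjacent duplicates

Since every chunk is already sorted, the merged output is sorted too, so uniq only has to compare neighboring lines and never holds more than a line or two in memory.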

Consider the following:

% cat infile 
adsf123
asdlfkjlasdfj
adsf123
asdfasdf12341234
asdfasdf12341234
asdlfkjlasdfj343
asdlfkjlasdfj56
adsf123
asdlfkjlasdfj
adsf123
asdfasdf12341234
asdfasdf12341234
asdlfkjlasdfj343
asdlfkjlasdfj56
% split -l 5 infile 
% for f in x*; do sort -u "$f" > "$f"_sorted; done
% sort -m x*_sorted
adsf123
adsf123
asdfasdf12341234
asdfasdf12341234
asdlfkjlasdfj
asdlfkjlasdfj
asdlfkjlasdfj343
asdlfkjlasdfj343
asdlfkjlasdfj56
asdlfkjlasdfj56

Or you were suggesting something different?


Yes. You are very right. I've corrected my previous post.

I am trying the split-and-sort approach; I will let you know once it is done. Meanwhile, I have a question: why can't we implement something like the loop below, so that it will not take much space?
for line in `cat infile`
do
    # delete all lines in infile matching $line, leaving one
done
exit 0

Because this is an O(n^2) algorithm: for each of the n lines you rescan the whole file. For a 4GB file it would run for a really long time. (Days? Months? Who knows... :-) )

I got this during the last sort execution:

sort: write failed: /tmp/sortmO2esr: No space left on device

'man sort' yields:

       -T Directory
            Places all temporary files that are created into the directory specified by the
            Directory parameter.

Specify -T with a directory that has enough space for the temp files.
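
For example (the directory name here is just a placeholder for any filesystem with room for the temporaries):

sort -u -T /path/with/space x*_sorted > final.out    # temp files go under /path/with/space instead of /tmp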

Also because `for line in \`cat infile\`` is a frequent shell mistake that attempts to load the entire file into memory at once. Whatever the limit for shell variables on your system is, it's probably less than 4 gigabytes!
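
For the record, a loop that reads one line at a time without slurping the whole file would look like this sketch (though a delete-per-line scheme built on top of it would still be quadratic):

while IFS= read -r line
do
    printf '%s\n' "$line"    # handle one line; the file is never held in memory all at once
done < infile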

You probably won't have to split anything manually. Many (if not most) sort implementations (GNU, *BSD, Solaris, HP-UX, to name a few) will do this for you automatically. They compare the size of the file to be sorted against the system's available memory and make a conservative guess. Intermediate files are then created in $TMPDIR.

As vgersh99 pointed out, often there'll be a -T option to override the environment variable, although if this option is missing, you can simply override the environment default when invoking sort (TMPDIR=/lots/of/space sort ...).
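
For instance (keeping /lots/of/space as a stand-in for any roomy filesystem):

TMPDIR=/lots/of/space sort -u x*_sorted > final.out    # intermediate files land in /lots/of/space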

Regards,
Alister


Thank you everyone.

I am using awk '!($0 in a) {a[$0]; print}', which I found to be the most efficient of all the options.
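
For reference, the full invocation as it would apply to the file names used earlier in the thread (the output name is just an example):

awk '!($0 in a) {a[$0]; print}' result > result_uniq    # print each line only the first time it is seen

Presumably this edges out '!x[$0]++' because it stores empty values instead of counters, though it still keeps one array entry per unique line.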