I have a very large file (4 GB) which has duplicate lines. I want to delete the duplicate lines, leaving only unique lines. sort, uniq, and awk '!x[$0]++' are not working, as they run out of buffer space.
I don't know if this works: I want to read each line of the file in a for loop and delete all the matching lines, leaving one line. This way I think it will not use any buffer space.
PS: The idea is not to use any second file.
Suggestions please.
You can split the file (with the "split" command), then "sort -u" the chunks separately, and then merge them with "sort -m". (Of course, whether you need to do this at all depends on how much memory your system has.)
Excuse me, but regarding "sort -m": that will require much less memory than a full final sort, won't it?
Oops. Yes. But with -m it's possible that duplicates stay, since the merge only interleaves the already-sorted chunks. So if a final full sort won't work (because of lack of memory), then "sort -m | uniq" is possible.
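Putting the pieces together, a rough sketch of the whole procedure could look like this, assuming GNU coreutils; the chunk size, the chunk_ prefix and the output file name are only illustrative:

split -l 10000000 infile chunk_          # split into chunks of ~10 million lines each
for f in chunk_*
do
    sort -u "$f" -o "$f"                 # sort each chunk, dropping duplicates within it
done
sort -m chunk_* | uniq > infile.dedup    # merge sorted chunks; uniq drops duplicates across chunks
rm chunk_*

Note that this does create temporary files, so it doesn't satisfy the "no second file" wish, but the memory use of each step stays small.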
I am trying the split and sort; I will let you know once it is done. Meanwhile, I have a doubt: why can't we implement something like the loop below, so that it does not take much space?
for line in `cat infile`
do
    # delete all lines in infile matching $line, leaving one $line
    :
done
exit 0
Also because `for line in `cat infile`` is a frequent shell mistake: the command substitution expands the entire file into the loop's word list, so it attempts to load the whole thing into memory at once. Whatever the limit for shell variables on your system is, it's probably less than 4 gigabytes!
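To illustrate the difference, here is a sketch only; the while-read form reads one line at a time, but it still doesn't solve the deduplication problem, since deleting the matches for every line would mean rescanning and rewriting the file over and over:

# Problematic: the backquoted cat is expanded first, so the shell tries to
# hold (and word-split) the entire 4 GB file before the loop even starts.
for line in `cat infile`
do
    echo "$line"
done

# Reads one line at a time instead, so memory use stays constant.
while IFS= read -r line
do
    echo "$line"
done < infile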
You probably won't have to split anything manually. Many (if not most) sort implementations (GNU, *BSD, Solaris, HP-UX, to name a few) will do this for you automatically. They compare the size of the file to be sorted against the system's available memory and make a conservative guess. Intermediate files are then created in $TMPDIR.
As vgersh99 pointed out, there will often be a -T option to override the environment variable, although if that option is missing you can simply override the environment default when invoking sort (TMPDIR=/lots/of/space sort ...).
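For example, something along these lines lets a single sort invocation handle the whole file, assuming GNU sort; /lots/of/space and the output file name are only placeholders for a filesystem with enough free room:

# Let sort do the chunking itself, keeping its temporary files on a big filesystem.
sort -u -T /lots/of/space -o infile.dedup infile

# If the implementation lacks -T, point TMPDIR at the big filesystem instead.
TMPDIR=/lots/of/space sort -u -o infile.dedup infile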