Remove dupes in a large file

I have a large file (1.5 GB) and want to sort it.
I used the following AWK script to do the job:

!x[$0]++

The script works but it is very slow and takes over an hour to do the job. I suspect this is because the file is not sorted.
Any solution to speed up the AWK script, or a Perl script, would be appreciated.
I work in a Windows environment, so Unix tools don't work.
Many thanks.
P.S. I have checked an earlier solution available in the repository, but it is just as slow, if not slower.

Hi,

I presume you mean you want to dedupe the file (because that is what your script does and that is in the title), not necessarily sort it.

You can try the difference between

awk '!X[$0]++' file > file.dedup

and

sort -u file > file.deduped_sort

The awk version is typically a lot faster because the file does not have to be sorted. Whether the file is sorted or not should make no difference for the awk command.
They both dedupe, but the second one sorts as well.
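
If you want to compare the two on your own data, a rough sketch (this assumes a Unix-style shell with a time builtin, e.g. Cygwin or MSYS2 on Windows; the output file names are just examples):

time awk '!X[$0]++' file > file.dedup
time sort -u file > file.deduped_sort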

------- Edit ---------

I just did a test with a 1.6 GiB file and it took under 3 minutes to dedupe it, so I would examine what exactly you are doing.

  • Are you deduping and then sorting? (If so, see the sketch below.)
  • Are you running out of memory and is your system paging/swapping?

Otherwise can you post the exact script/command that you are using?
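
If you are deduping and then sorting, and the file has many duplicate lines, it is usually faster to dedupe first and sort only the smaller deduped output instead of sorting the full 1.5 GB file. A rough sketch (file names are just examples):

awk '!X[$0]++' file > file.dedup
sort file.dedup > file.dedup_sorted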


In case there is a RAM shortage, the following variant helps (saves some bytes per line).

awk '!($0 in X) { print; X[$0] }' file > file.dedup

Hi MadeInGermany,

would you mind explaining that approach? Is it because X[$0]++ becomes a number and consumes a "float"'s worth of space, whereas X[$0] just has an index but points to nowhere?

Exactly: X[$0]++ holds a number value, i.e. each new line consumes a number's worth of space.
