I have a large file (1.5 GB) and want to sort it. I used the following AWK script to do the job:
!x[$0]++
The script works, but it is very slow and takes over an hour to finish. I suspect this is because the file is not sorted.
Any solution to speed up the AWK script, or a Perl script, would be appreciated.
I work in a Windows environment, so Unix tools are not available.
Many thanks
P.S. I have checked an earlier solution available in the repository, but it is just as slow, if not slower.
I presume you mean you want to dedupe the file (because that is what your script does and that is in the title), not necessarily sort it.
You can compare
awk '!X[$0]++' file > file.dedup
and
sort -u file > file.deduped_sort
The awk version is typically a lot faster because the file does not have to be sorted; whether the input is sorted or not makes no difference to the awk command.
Both commands dedupe, but the second one sorts as well.
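To make the comparison concrete, here is a minimal sketch on a small sample file (file names are placeholders):

```shell
# Create a small sample file with duplicate lines.
printf 'banana\napple\nbanana\ncherry\napple\n' > sample.txt

# 1) awk: keeps the FIRST occurrence of each line and preserves input order.
#    X[$0]++ is 0 (false) the first time a line is seen, so !X[$0]++ is
#    true and the line is printed; on repeats it is non-zero and suppressed.
awk '!X[$0]++' sample.txt > sample.dedup
# sample.dedup: banana, apple, cherry (original order)

# 2) sort -u: dedupes AND sorts, so the whole file must be sorted first.
sort -u sample.txt > sample.deduped_sort
# sample.deduped_sort: apple, banana, cherry (sorted order)
```

The extra sorting work is exactly why `sort -u` tends to lose on a file this size when you only need deduplication.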
------- Edit ---------
I just did a test with a 1.6 GiB file and it took under 3 minutes to dedupe it, so I would examine exactly what you are doing.
Are you deduping and then sorting?
Are you running out of memory and is your system paging/swapping?
Otherwise can you post the exact script/command that you are using?
Mind explaining that approach? Is it because X[$0]++ becomes a number and consumes a float's worth of space, whereas X[$0] is just an index that points to nothing?
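My reading (an illustrative sketch, not a reply from the answerer): the trick is not about float storage. `X[$0]++` is a post-increment, so awk tests the current value of `X[$0]`, which is empty/0 (false) the first time a given line is seen, and then increments it. The `!` makes the pattern true exactly once per distinct line, and a true pattern with no action block prints the current line. Spelled out in long form:

```shell
# Equivalent long form of '!X[$0]++'. Each distinct line becomes a key
# in an in-memory associative array, so memory grows with the number of
# UNIQUE lines, not with the file size.
printf 'a\nb\na\nb\nc\n' | awk '{
    if (!seen[$0]) print $0   # value is empty/0 the first time: print
    seen[$0]++                # mark this line as seen
}'
# prints: a b c (one per line)
```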