I'm trying to sort a file using the options -fd and it takes a very long time! Is that normal, or can I speed it up in some way?
I don't want to split the file, since this one has already been split.
What Operating System and version are you using?
How many records?
How long does the process take?
Do you have spare memory and disc space which you could give to this sort process?
If your sort has a "-o filename" parameter, use this to specify the output file rather than a shell redirect (> filename). It will be much, much faster.
If set, what is the value of $TMPDIR? Can it be set to point to a fast filesystem with free space of at least twice the size of the unsorted file?
You can get a dramatic improvement in the performance of unix sort by tuning the "-y kmem" parameter. It is very important that you start with enough memory allocated to do some serious sorting on the first pass.
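For illustration, the output-file and workspace suggestions might look something like this on a tiny sample. The file names and the scratch directory are made up; only the -o and -T options are used here, because the memory option differs between sort implementations and should be checked in your own man page.

```shell
#!/bin/sh
# Hypothetical scratch area; in real use point this at a fast filesystem
# with at least twice the size of the unsorted file free.
work=$(mktemp -d)
mkdir "$work/tmp"

# Tiny stand-in for the real input file.
printf 'pear\nbanana\napple\n' > "$work/bigfile"

# -o names the output file, -T names the directory for sort's workfiles.
sort -fd -o "$work/bigfile.sor" -T "$work/tmp" "$work/bigfile"

sorted=$(cat "$work/bigfile.sor")
echo "$sorted"
rm -rf "$work"
```

Setting TMPDIR in the environment should have the same effect as -T on most implementations, but -T makes the choice visible on the command line.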
Off topic: If you have a database engine it is often quicker to load a large file into a database table with suitable keys, then write the file out in the required order.
I'm using Linux. In fact it has been running for more than 20 hours and it has not finished yet! I've allocated enough memory on tmpdir! I have no memory problem, since it does not run out of memory.
I have about 10 million lines. I haven't set the -y kmem option and I have no idea how to use it. I need a quick improvement! I have no hard disk limitation and I can have a lot of RAM as well.
Here are my timings for sorting a 600 MB file after giving "sort" one gigabyte of memory and a very large workspace. It used about 800 MB of disc workspace and didn't make a dent in the memory. The unsorted file is in random order, but I also reverse sorted it to be sure that the test is representative.
This test server is nothing special: a 10-year-old HP 9000 with HP-UX 11i and slowish 36 GB 10k rpm discs.
Ordinary sort:
date;sort -o bigfile.sor -T /workspace -y 1048576 bigfile;date
Tue Feb 8 12:25:47 GMT 2011
Tue Feb 8 12:28:10 GMT 2011
Dictionary sort:
date;sort -fd -o bigfile.sor -T /workspace -y 1048576 bigfile;date
Tue Feb 8 12:31:17 GMT 2011
Tue Feb 8 12:36:19 GMT 2011
Reverse sorting the output from the previous sort:
date;sort -r -fd -o bigfile.rev -T /workspace -y 1048576 bigfile.sor;date
Tue Feb 8 12:44:26 GMT 2011
Tue Feb 8 12:49:07 GMT 2011
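If your date command supports the +%s format (GNU date does; treat it as an assumption elsewhere), the elapsed time can be computed directly instead of being read off two timestamps. A small sketch with a made-up sample file standing in for bigfile:

```shell
#!/bin/sh
work=$(mktemp -d)
# Small stand-in for bigfile.
printf 'delta\nalpha\ncharlie\nbravo\n' > "$work/bigfile"

# Seconds since the epoch before and after the sort.
start=$(date +%s)
sort -fd -o "$work/bigfile.sor" -T "$work" "$work/bigfile"
end=$(date +%s)

elapsed=$((end - start))
echo "dictionary sort took $elapsed seconds"
rm -rf "$work"
```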
I don't think we identified your O/S beyond it being a Linux. There is much variation.
It is possible that your "sort" command does not have a "-y" switch or other switch to pre-allocate memory. Have you checked "man sort" or perhaps "info sort"?
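On GNU coreutils sort, which is what most Linux distributions ship, there is no "-y" switch; the corresponding option is -S (--buffer-size). A minimal sketch, assuming GNU sort and a made-up sample file; the 64M figure is only a placeholder, not a recommendation:

```shell
#!/bin/sh
# Show which sort is installed; GNU sort reports "sort (GNU coreutils) ...".
sort --version 2>/dev/null | head -n 1

work=$(mktemp -d)
printf 'pear\nbanana\napple\n' > "$work/bigfile"

# -S sets the size of the main-memory buffer (GNU sort).
sort -S 64M -o "$work/bigfile.sor" -T "$work" "$work/bigfile"

result=$(cat "$work/bigfile.sor")
echo "$result"
rm -rf "$work"
```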
Maybe you are a non-root user and have a memory quota which is too low to do this large sort?
Perhaps you have a basic kernel and the sort is trying to open more files than is allowed?
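The limits that apply to the current shell can be inspected with the ulimit builtin. A quick sketch, assuming a shell such as bash or dash where the -n and -v options are available:

```shell
#!/bin/sh
# Maximum number of open file descriptors for this shell and its children.
open_files=$(ulimit -n)
# Maximum virtual memory in kilobytes, or "unlimited".
virt_mem=$(ulimit -v)

echo "open files limit:     $open_files"
echo "virtual memory limit: $virt_mem"
```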
Have you checked the directory where you expect to find the sort workfiles? Are they there? Is there enough disc space in that filesystem?
Is the running sort using CPU? I'm starting to wonder if your "sort" program is faulty.
Afterthought. We assume that this is a unix standard format text file with each line terminated with a line-feed character (only) and that it has not come from a Microsoft platform.
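One way to check for Microsoft-style line endings is to count the lines that contain a carriage return. The sketch below builds a made-up two-line sample with CR-LF endings so that it is self-contained; on the real file only the grep line is needed.

```shell
#!/bin/sh
work=$(mktemp -d)
# Sample with CR-LF line endings, as a Microsoft platform would write it.
printf 'apple\r\nbanana\r\n' > "$work/sample.txt"

# Count the lines that contain a carriage return; 0 means plain unix format.
cr=$(printf '\r')
crlf_lines=$(grep -c "$cr" "$work/sample.txt" || true)
echo "lines with a carriage return: $crlf_lines"

# tr can strip the carriage returns into a clean copy.
tr -d '\r' < "$work/sample.txt" > "$work/sample.unix"
clean_lines=$(grep -c "$cr" "$work/sample.unix" || true)
echo "after cleaning: $clean_lines"
rm -rf "$work"
```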
Googling your version of "sort" and the symptoms uncovered a can of worms.
For example: If your locale is anything other than "C" the performance of sort can be atrocious. There are other variants on this theme including the program ignoring the buffer parameter.
What is the output from the "locale" command?
Suggest you take up the issue with your software supplier in case a fixed version is available.
UTF-8 is cited as the worst possible character set to sort or grep because it can't be sorted as a simple binary key.
If your data is actually US ASCII then I'd try the sort with the locale set to "C".
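The locale can be set for a single command without changing the rest of the session. A minimal sketch with a made-up sample file, using LC_ALL because it overrides both LANG and the individual LC_* variables:

```shell
#!/bin/sh
work=$(mktemp -d)
printf 'cherry\nBanana\napple\n' > "$work/bigfile"

# Sort in the "C" locale for this one command only.
LC_ALL=C sort -fd -o "$work/bigfile.sor" "$work/bigfile"

result=$(cat "$work/bigfile.sor")
echo "$result"
rm -rf "$work"
```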
If you must sort in UTF-8 order, some recommend using the "sort" function in "perl". The "sort" in "perl" is three times slower than unix "sort" for LANG=C, but ten times faster for LANG=en_US.UTF-8.
Rant: The UTF-8 issue arises from the latest POSIX standards. It has given me grief with mixed platform XML too.
Hi. If you don't want to create your own perl utility, you may be interested in:
msort - utility for sorting records in complex ways
...
msort fully supports Unicode. The text to be sorted, and all
specifications, should be in UTF-8 Unicode. (If you have plain ASCII
text, this is not a problem as ASCII is a subset of Unicode.) Full
Unicode case-folding is available, in Turkic and non-Turkic variants.
Unicode normalization is performed before sorting.
-- excerpt from man msort, q.v.
http://billposer.org/Software/msort.html