sort takes a long time

voolek · February 8, 2011, 4:18am

Dear experts

I have a 200 MB text file in this format:

text \tab number

I am trying to sort it with the options -fd and it takes a very long time! Is that normal, or can I speed it up somehow?
I don't want to split the file, since it has already been split.

I use this command: sort -fd file > sorted-file

Thanks for any help and comments.

rdcwayx · February 8, 2011, 6:08am

Seems -f and -d are not useful in your case.

       -d, --dictionary-order
              consider only blanks and alphanumeric characters

       -f, --ignore-case
              fold lower case to upper case characters

So just sort it directly.

sort file > sorted-file

Or show some sample data and the expected output.

voolek · February 8, 2011, 6:19am

They are useful, since the first column is text (alphabetic).

methyl · February 8, 2011, 6:31am

What Operating System and version are you using?
How many records?
How long does the process take?
Do you have spare memory and disc space which you could give to this sort process?

If your sort has a "-o filename" parameter, use this to specify the output file rather than a shell redirect (> filename). It can be much faster.

If set, what is the value of $TMPDIR? Can it be set to point to a fast filesystem with at least twice as much free space as the size of the unsorted file?

You can get a dramatic improvement in the performance of unix sort by tuning the "-y kmem" parameter. It is very important that you start with enough memory allocated to do some serious sorting on the first pass.
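As a rough illustration of the three suggestions above, here is a minimal sketch assuming GNU sort, where the memory buffer option is -S rather than -y. The working directory, the 64M buffer size and the tiny sample file are made up for the example; on a real run you would pick a buffer and a temporary directory that suit your machine.

```shell
# Hypothetical scratch area and sample data, only so the commands can be tried safely.
workdir=$(mktemp -d)
mkdir "$workdir/tmp"
printf 'pleasant\t2\nfestive\t2\nperiod\t2\n' > "$workdir/file"

# -S sets the in-memory buffer size (GNU sort's counterpart of "-y kmem"),
# -T names the directory for temporary work files (overrides $TMPDIR),
# -o names the output file instead of using a shell redirect.
sort -S 64M -T "$workdir/tmp" -o "$workdir/sorted-file" "$workdir/file"

cat "$workdir/sorted-file"
```

Whether these options exist, and under which letters, depends on the sort implementation, so "man sort" is worth checking first.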

Off topic: If you have a database engine it is often quicker to load a large file into a database table with suitable keys, then write the file out in the required order.

voolek · February 8, 2011, 6:41am

I'm using Linux. In fact it has been running for more than 20 hours and has still not finished! I've allocated enough space on tmpdir, and there is no memory problem, since it does not run out of memory.

I have about 10 million lines. I haven't set the -y kmem option and I have no idea how to use it. I need a quick improvement! Hard disk space is not a limitation, and I can use a large amount of RAM as well.

methyl · February 8, 2011, 7:54am

"Linux" is a bit vague.

Here are my timings for sorting a 600 MB file after giving "sort" one gigabyte of memory and a very large workspace. It used about 800 MB of disc workspace and didn't make a dent in the memory. The unsorted file is in random order, but I also reverse sorted it to be sure that the test is representative.
This test server is nothing special: a 10 year old HP 9000 with HP-UX 11i and slowish 36 GB 10k rpm discs.

Ordinary sort (2 min 23 s):
date;sort -o bigfile.sor -T /workspace -y 1048576 bigfile;date
Tue Feb  8 12:25:47 GMT 2011
Tue Feb  8 12:28:10 GMT 2011

Dictionary sort (5 min 2 s):
date;sort -fd -o bigfile.sor -T /workspace -y 1048576 bigfile;date
Tue Feb  8 12:31:17 GMT 2011
Tue Feb  8 12:36:19 GMT 2011

Reverse sorting the output from the previous sort (4 min 41 s):
date;sort -r -fd -o bigfile.rev -T /workspace -y 1048576 bigfile.sor;date
Tue Feb  8 12:44:26 GMT 2011
Tue Feb  8 12:49:07 GMT 2011

Are you sure that your file is only 200 MB?

voolek · February 8, 2011, 9:38am

Very strange! My data looks like this:

pleasant 2
festive 2
period 2
i declare 2
declare resumed 2
resumed the 2
the session 2
session of 2
of the 2
the european 2

and sorting it takes much longer! I just tried this:
sort -o sorted -d -y 1048576 file

and after 10 minutes still nothing had happened! I wonder how you managed to do it so fast. My file is 150 MB with about 10 million lines.

methyl · February 8, 2011, 10:35am

I don't think we identified your O/S beyond it being a Linux. There is much variation.

It is possible that your "sort" command does not have a "-y" switch or other switch to pre-allocate memory. Have you checked "man sort" or perhaps "info sort"?

Maybe you are a non-root user and have a memory quota which is too low to do this large sort?

Perhaps you have a basic kernel and the sort is trying to open more files than is allowed?

Have you checked the directory where you expect to find the sort workfiles? Are they there? Is there enough disc space in that filesystem?

Is the running sort using CPU? I'm starting to wonder if your "sort" program is faulty.

Afterthought. We assume that this is a unix standard format text file with each line terminated with a line-feed character (only) and that it has not come from a Microsoft platform.
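One way to check the line-ending assumption is to count lines that end in a carriage return. This is only a sketch: the scratch directory and the two-line sample file are invented for the example, and on the real file you would point grep and tr at your own filename.

```shell
# Hypothetical sample: the first line has a DOS-style CR+LF ending, the second a plain LF.
workdir=$(mktemp -d)
printf 'pleasant\t2\r\nfestive\t2\n' > "$workdir/sample.txt"

# Hold a literal carriage return in a variable so the grep pattern stays portable.
cr=$(printf '\r')

# Count lines whose last character is a carriage return.
# grep exits non-zero when the count is 0, hence the "|| true".
crlf_lines=$(grep -c "${cr}\$" "$workdir/sample.txt" || true)
echo "lines ending in CR: $crlf_lines"

# If the count is not zero, strip the carriage returns into a new file before sorting.
tr -d '\r' < "$workdir/sample.txt" > "$workdir/sample.unix.txt"
```

A non-zero count suggests the file came from a Microsoft platform at some point.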

voolek · February 8, 2011, 10:46am

linux version:
Linux version 2.6.18-164.6.1.el5 (mockbuild@ls20-bc2-14.build.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46))

SORT:
sort (GNU coreutils) 5.97

It does not have -y, but it has -S (buffer size), and that does not help either!

Also, my sort program is fine, and it does use CPU!

methyl · February 8, 2011, 12:46pm

Googling your version of "sort" and the symptoms uncovered a can of worms.

For example: If your locale is anything other than "C" the performance of sort can be atrocious. There are other variants on this theme including the program ignoring the buffer parameter.

What is the output from the "locale" command?

Suggest you take up the issue with your software supplier in case a fixed version is available.

voolek · February 9, 2011, 3:06am

locale:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

So do you think it's because of my version?

methyl · February 9, 2011, 5:23am

I think we have found the problem.

A UTF-8 locale such as en_US.UTF-8 is often cited as the worst case for sort or grep, because lines are compared using the locale's collation rules rather than as simple binary keys.
If your data is actually US ASCII then I'd try the sort with the locale set to "C".

On my system (yours has more values):

LANG=
LC_CTYPE="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_MESSAGES="C"
LC_ALL=
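To try the "C" locale for a single run without changing your login settings, the variable can be set on the command line in front of sort. This is a minimal sketch: the scratch directory and three-line sample are made up, and the -f and -d options are kept only because the original command used them.

```shell
# Hypothetical sample data in the same "text <TAB> number" layout as the real file.
workdir=$(mktemp -d)
printf 'pleasant\t2\nfestive\t2\nperiod\t2\n' > "$workdir/file"

# LC_ALL=C applies to this one command only and overrides LANG and every LC_* value,
# so sort compares bytes instead of using the en_US.UTF-8 collation rules.
LC_ALL=C sort -f -d -o "$workdir/sorted-file" "$workdir/file"

cat "$workdir/sorted-file"
```

For plain ASCII text the result is ordinary alphabetical order. Any non-ASCII characters would be ordered by byte value, which may not be the order you expect.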

If you must sort in UTF-8 order, some recommend using the sort command in "perl". Reportedly, the "sort" command in "perl" is three times slower than unix "sort" for LANG=C, but ten times faster for LANG=en_US.UTF-8.

Rant: The UTF-8 issue arises from the latest POSIX standards. It has given me grief with mixed platform XML too.

drl · February 9, 2011, 7:30am

Hi. If you don't want to create your own perl utility, you may be interested in:

msort - utility for sorting records in complex ways

...

       msort fully supports Unicode. The text to be sorted, and all
       specifications, should be in UTF-8 Unicode. (If you have plain ASCII
       text, this is not a problem as ASCII is a subset of Unicode.) Full
       Unicode case-folding is available, in Turkic and non-Turkic variants.
       Unicode normalization is performed before sorting.

-- excerpt from man msort, q.v

http://billposer.org/Software/msort.html

Good luck ... cheers, drl