Sort a big data file

Hello,
I have a big data file (160 MB) full of records whose fields are pipe (|) delimited. I'm sorting the file on the first field.
I'm trying to sort with the "sort" command and it takes about 6 minutes.
I have tried some transformation methods in Perl, but they fail with "Out of memory". Is there any way (Perl or a Unix shell script) to perform the fastest possible sort of a big data file?
Thanks,
bye.

sort is usually pretty good, but it depends on file I/O speed, especially in its temp (-T) directory. If the input were split into many files, they could be sorted separately in parallel and merged with sort -m, possibly directly through named pipes. The named pipes can be managed by ksh on UNIX systems with the /dev/fd/0-# pseudo file descriptor devices:

sort -m <( sort file1 ) <( sort file2 )

or you can create a named pipe with mkfifo named_pipe_path (on older systems, /usr/sbin/mknod named_pipe_path p). Any additional options of your sort go on all the sorts. This way, there is no delay writing intermediate files. It can also work to assign a different line-number range to each sort; since the sort sub-scripts read the input in parallel, the cost of selecting line ranges is reduced:

sed '
   1,20000d
   40000q
  ' | sort

You can estimate the line count to be divided up, shooting high, by dividing the file byte size by a low estimate of the average line length.
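Putting those pieces together, a sketch (assuming ksh or bash process substitution; your_file and the -t'|' -k1,1 key options are placeholders for your actual file and sort arguments) might look like:

```shell
# Sketch: split the line range in two, sort each half in parallel via
# process substitution, and merge. The same key options must appear on
# every sort, including the final merge.
total=$(wc -l < your_file)      # exact count; a high estimate works too
half=$(( (total + 1) / 2 ))
sort -m -t'|' -k1,1 \
    <( sed -n "1,${half}p" your_file | sort -t'|' -k1,1 ) \
    <( sed "1,${half}d"    your_file | sort -t'|' -k1,1 ) \
    > your_file.sorted
```

With more slices, each sub-sort gets a smaller workload, at the cost of more passes over the input by sed.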

There are some exotic options to sort, but they are not usually recommended.

---------- Post updated at 09:50 AM ---------- Previous update was at 09:28 AM ----------

Another way to divide the data evenly among N sorts is my tool xdemux. It calloc()'s an array of $1 FILE*, does a popen() of $2 for writing to fill all $1 cells of that array, then reads stdin byte by byte (no line-length concerns or extra copying), sending the lines down the pipes in rotation; at EOF it does fclose() on the pipes so it does not wait for child status. In your case this would be

mkfifo /tmp/p.$$
xdemux 5 "sort your_args -o /tmp/p.$$" <your_file &
sort your_args -m /tmp/p.$$ /tmp/p.$$ /tmp/p.$$ /tmp/p.$$ /tmp/p.$$
rm -f /tmp/p.$$
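The single-FIFO trick above depends on how open() blocks on a named pipe; the behavior can be seen in miniature with one writer and one reader (using a hypothetical /tmp/demo.$$ path):

```shell
mkfifo /tmp/demo.$$
echo hello > /tmp/demo.$$ &   # writer blocks in open() until a reader arrives
cat /tmp/demo.$$              # this open() for read releases the waiting writer
rm -f /tmp/demo.$$
```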

A named pipe connects the next open() for write to a process waiting, blocked in open() for read, not vice versa, so one named pipe can serve them all. Here is xdemux.c; I am not sure it is the latest version as described above, but it is definitely close:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>	/* strcmp() */

static void usage(){

	fputs(
"\n"
"Usage: xdemux <ct> <cmd> [ -l <line_ct> ]\n"
"\n"
"Runs <ct> copies of <cmd> and sends <line_ct> (default 1) lines to each\n"
"in rotation.\n"
"\n",
		stderr );
	exit( 1 );
 }

int main( int argc, char ** argv ){

	FILE **fp = NULL ;
	int i, x, c, l, lct = 1 ;

	if ( argc < 3
	  || 2 > ( x = atoi( argv[1] ))){
		usage();
	 }

	if ( argc > 3
	  && ( argc != 5
            || strcmp( argv[3], "-l" )
	    || 1 > ( lct = atoi( argv[4] )))){
		usage();
	 }
		
	if ( !( fp = (FILE **)calloc( x, sizeof (FILE *)))){
		perror( "calloc()" );
		exit( 2 );
		}

	for ( i = 0 ; i < x ; i++ ){
		if ( !( fp[i] = popen( argv[2], "w" ))){
			perror( "popen( $2 )" );
			exit( 3 );
			}
		}

	i = l = 0 ;

	while ( EOF != ( c = getchar())){
		if ( EOF == putc( c, fp[i] )){
			perror( "putc( popen( $2 ))" );
			exit( 4 );
			}
		if ( c == '\n'
		  && ++l == lct ){
			l = 0 ;
			if ( ++i == x ){
				i = 0 ;
			 }
		 }
	 }

	if ( ferror( stdin )){
		perror( "stdin" );
		exit( 5 );
		}

	for ( i = 0 ; i < x ; i++ ){
		if ( 0 > fclose( fp[i] )){
			perror( "fclose( popen( $2 ))" );
			}
		}

	exit( 0 );
}

You can enable a variety of power user excesses with xdemux!


It always helps to know what operating system you have and to see the exact command you typed.
In this case we'd also need to know how much memory you can devote to this "sort".

The biggest single improvement to the Unix "sort" command is usually to give it more memory at the outset with the "-y kmem" parameter (GNU sort spells this "-S size") and to put its temporary files (-T parameter) on a fast disc with at least twice as much free space as the size of the original file.
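For example, with GNU sort a run over your file might look like the sketch below. The buffer size, scratch path, and file name (bigfile, /fast_scratch) are placeholders, not known values from your setup:

```shell
# Give sort a large in-core buffer (-S) and a fast scratch directory (-T).
# On System V sorts, use -y kmem instead of -S.
sort -t'|' -k1,1 -S 512M -T /fast_scratch -o bigfile.sorted bigfile
```

A larger buffer means fewer intermediate merge passes, which is usually where most of the 6 minutes goes.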