Alternative to wc -l

Hi techies ..

This is my first post here.

I am facing a serious performance problem counting the number of lines in a file. The input files I get are around 10 to 15 GB, sometimes more, and I load them into a database.

I use wc -l to confirm whether the loader has loaded all of the data, but the counting operation alone takes 45 to 50 minutes. Could someone suggest another way to get the line count of these files? :confused:

Please note that I can't use sed due to the coding standards followed here, so please excuse that.

Any swifter workaround would be really helpful and appreciated. :(

awk 'END{print NR}' yourfile

Out of interest, I tried this on a 5 million record text file and the result came out in exponential notation. The result from "wc -l" was correct.
Anybody know how to get awk to count a large number of records?

@rajesh_2383
How many records in a typical file?
Are they fixed length records? If so, we could calculate the number of records from the file size.
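
For example (just a sketch; REC_LEN here is a hypothetical 80-byte record, newline included):

REC_LEN=80
SIZE=$(ls -l yourfile | awk '{print $5}')
echo $((SIZE / REC_LEN))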

What database engine is this? It may be quicker to write a count program in a high level language.

@Methyl
Maybe use printf instead?

I think wc is just optimized for this task. Anyway here is a little C program you can compile with your favourite C compiler, for example:

gcc -Wall -o wcc wcc.c
# and then issue
./wcc yourfile

and try it out.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>


#define MAX 2048        /* size of the read buffer for fgets() */

int main(int argc, char** argv)
{
        char zbuf[MAX];
        long int z=0;
        FILE *fp;

        if( argc < 2 )
        {
                fprintf(stderr, "Usage: %s file\n", argv[0]);
                exit (EXIT_FAILURE);
        }

        fp=fopen(argv[1],"r");
        if( !fp )
        {
                fprintf(stderr, "Error: File %s could not be opened.\n", argv[1]);
                exit (EXIT_FAILURE);
        }

        /* a line longer than MAX-1 characters arrives in several fgets() chunks;
           only the chunk containing the newline is counted, so the count matches wc -l */
        while ( fgets(zbuf, MAX, fp) )
        {
                if ( strchr(zbuf, '\n') )
                        z++;
        }

        fclose(fp);
        printf("Line count: %li\n", z);
        return EXIT_SUCCESS;
}

The read buffer is 2048 bytes; a line longer than that is read in chunks but still counted only once (you could increase MAX to cut down on the number of reads for very long lines). Maybe worth a try. I am no C programmer, so maybe someone has an idea to improve it.

It could also be the case that your hardware/OS is the bottleneck - just a guess.

Have you tried this:

awk 'END{printf ("%d\n", NR)}' yourfile

Brilliant. Took 37 seconds for 5 million 80-character records.
Let's see how the O/P gets on.

On my system:

# echo "10000000000000000000" | awk '{printf ("%d\n", $0)}'
10000000000000000000
# echo "100000000000000000000" | awk '{printf ("%d\n", $0)}'
1e+20

Maybe there is another way that is also fast!

Lol ok, forget my program :smiley:

Line breaks can occur at any byte, so brute force it is, but a hybrid approach helps (sketched below):

  1. Capture the nominal file size using ls -l or the like.
  2. Use head -c size <file | wc -l to find the line count up to that byte count, even if the file has grown since the ls -l.
  3. Report the total.
  4. Save both.
  5. Next time, capture the nominal file size using ls -l or the like.
  6. Calculate the size delta.
  7. Use tail -c +old_size <file | head -c delta | wc -l to count just the new lines up to that byte count, even if the file has grown since the ls -l.
  8. Add the new lines to the past lines.
  9. Report the total.
  10. Save the new size and total line count for the next time.

Seeks by byte count are fast.
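
A rough sketch of that incremental approach in shell (file and variable names are illustrative):

f=yourfile
state=yourfile.count            # remembers "old_size old_count" between runs

size=$(ls -l "$f" | awk '{print $5}')

if [ -f "$state" ]; then
        read old_size old_count < "$state"
else
        old_size=0; old_count=0
fi

# count only the newly appended bytes; +1 because tail -c +N starts *at* byte N
new=$(tail -c +$((old_size + 1)) "$f" | head -c $((size - old_size)) | wc -l)

total=$((old_count + new))
echo "$total"
echo "$size $total" > "$state"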

Yes, you can usually count on a purpose-built utility being faster than something bodged in a shell or string language! :smiley:

The only compromise in "wc -l" is the probable FILE* i/o, which could be rewritten to do a raw read() or, even faster and more dangerous to system throughput, mmap64() using a 64-bit compiler (it can flush everything else in RAM out to backing store). It is still brute force.

That is why I mentioned the 'storage of earlier byte counts' method. You could have it run all day, periodically updating the byte total report.

If the app logged as it wrote, every N lines or N seconds, whichever came first, then you could just tail the log.

If it is not your code, you could even write a pass-through logger and use a (possibly named) pipe to get access to the output stream, provided the app can be configured to write its output to one.
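
A rough sketch of the named-pipe idea (the paths here are purely illustrative):

mkfifo /tmp/feed.fifo

# copy the stream on to the real output file while printing a running line count
tee /path/to/realfile < /tmp/feed.fifo |
awk 'NR % 100000 == 0 {print NR} END {print NR, "total"}' &

# then configure the producing app to write to /tmp/feed.fifo instead of the real file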

---------- Post updated at 05:02 PM ---------- Previous update was at 03:42 PM ----------

Not the weepy face! Try this fast wc -l-on-stdin C bit using read() and a quarter-meg buffer:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>

int main(){

        char buf[262144];       /* quarter-megabyte read buffer */
        long long ct = 0LL ;
        char *cp ;
        int ret ;


        /* read until EOF (0) or a hard error (-1); retry on EAGAIN/EINTR */
        while ( 0 < ( ret = read( 0, buf, sizeof(buf)))
         || ( ret == -1
           && ( errno == EAGAIN
             || errno == EINTR ))){

                if ( ret < 0 )
                        continue ;

                /* count the newlines in this block */
                for ( cp = buf + ret - 1 ; cp >= buf ; cp-- ){
                        if ( *cp == '\n' ){
                                ct++ ;
                        }
                }
        }

        if ( ret ){
                perror( "fwcl: stdin" );
                exit( 1 );
        }

        printf( "%lld\n", ct );

        exit( 0 );
}

"mysrc/fwcl.c" line 37 of 37 --100%-- 

$ wc -l <.profile
70
$ fwcl <.profile
70
$ 

Hi.

If one can be satisfied with an estimate, then a code that samples the file can be very fast.

As DGPickett said, seeks are fast. This demo code, esmele, reads the first 100 lines of the file (almost a GB), and skips to 6000 characters before the EOF, reading again (88 lines in this situation). The mean lengths are calculated and then the estimate is made based on another quickly-accessible characteristic, the length of the file via stat. The accuracy compared to wc is within 2%. The time is (essentially) constant, although if one were to choose to read percentages of the file, say 3% at the beginning, middle, and end, one could be more accurate, at the expense of taking more time.
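
The idea, roughly, in shell (a sketch only; this is not the actual esmele source):

f=/tmp/test-one-gb

size=$(ls -l "$f" | awk '{print $5}')

# mean length of the first 100 lines (+1 for each newline)
head_avg=$(head -n 100 "$f" | awk '{b += length($0) + 1} END {print b / NR}')

# mean length of the whole lines in the last 6000 bytes (skip the first, likely partial, line)
tail_avg=$(tail -c 6000 "$f" | awk 'NR > 1 {b += length($0) + 1; n++} END {print b / n}')

# estimate: file size divided by the average of the two sampled mean line lengths
awk -v s="$size" -v a="$head_avg" -v b="$tail_avg" 'BEGIN {printf "%d\n", s / ((a + b) / 2)}'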

% ./compare-esmele-wc 

-----
 File characteristics:
-rw-r--r-- 1 955M Oct 16 05:54 /tmp/test-one-gb

-----
 Time and result of esmele on /tmp/test-one-gb:

real	0m0.011s
user	0m0.008s
sys	0m0.004s
14958698

-----
 Time and result of wc on /tmp/test-one-gb:

real	0m2.739s
user	0m1.212s
sys	0m0.480s
14754910

-----
 Ratio of wc / es counts:
0.986377

Best wishes ... cheers, drl

My next play would be a tool that tailed the file (stdin) and wrote periodic line counts to a log (stdout), so you could start it and just check the log occasionally. It could be scripted as I described above, or written in Perl or C. You could even have it use CR as the line separator and just watch a dedicated xterm where the counts overwrite each other periodically.
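
Something like this, as a sketch (the interval and names are illustrative):

f=yourfile
old_size=0
count=0

while :
do
        size=$(ls -l "$f" | awk '{print $5}')
        new=$(tail -c +$((old_size + 1)) "$f" | head -c $((size - old_size)) | wc -l)
        count=$((count + new))
        old_size=$size
        printf '%d lines so far\r' "$count"     # CR so the count overwrites in place in the xterm
        sleep 60
done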

PS: the right buffer size for fwcl varies by system, so it might be nice to try sizes from 8 K up and see how the timing varies. You want each read to empty any disk cache or controller block, but not exceed it. Since many files are written sequentially to media, big blocks mean fewer seeks, and the other advantages of sequential access get mined.

I suppose you could partition the file and do separate processes or threads to count each segment. Probably, the advantage dies after 2 threads, as the disk i/o is saturated. However, as the disk gets less sequential, this might help by queuing a lot of requests, driving a good disk queue manager to sweep the carriage in and out satisfying block requests in cylinder order, and keeping the queue on every SCSI spindle from going empty.
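
A sketch of the two-segment version (the temporary file names are illustrative):

f=yourfile
size=$(ls -l "$f" | awk '{print $5}')
half=$((size / 2))

head -c "$half" "$f" | wc -l > /tmp/count.a &
tail -c +$((half + 1)) "$f" | wc -l > /tmp/count.b &
wait

# a line straddling the split is counted once, since its newline falls in exactly one half
echo $(( $(cat /tmp/count.a) + $(cat /tmp/count.b) ))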

---------- Post updated at 02:00 PM ---------- Previous update was at 01:52 PM ----------

You can get the size cheaply with ls -l, and there is very likely an average line length, but if you just have to know the exact line count, estimates will not satisfy that daemon, which is not logic but psychology.

---------- Post updated at 02:05 PM ---------- Previous update was at 02:00 PM ----------

Once I wrote a tool that took file names from stdin, mmap64()'d each file, did a string search in the map, and munmap()'d it. With a long file list, it was amazingly good at stopping every other process dead -- rolled out. So mmap() is fastest, but this task is not the foremost priority of this system.