Dear all,
I want to count the lines of a flat (text) file using awk. I have tried with {print NR},
but it takes a lot of time on a big file, around 2 GB in size.
I want better efficiency, so can anybody please help me with some other, better awk code?
Panknil,
The only way to know how many lines a file has is to count them
line by line, which is what "wc -l" does.
Anything else will use the same logic.
If it is taking too long, there is nothing else you can do.
Well... as Shell_Life said, 'there ain't nothing you can do'.
A better question is: WHY do you need to know the number of lines? Maybe you don't need to know the number of lines at all, if you/we know the initial objective of all of this!
I created an actual ~2 GB file. Here are my results:
# time awk 'END{print NR}' twogigfile.txt
39810582
real 1m8.784s
user 0m10.297s
sys 0m3.056s
# time awk '{x++}END{print x}' twogigfile.txt
39810582
real 1m6.738s
user 0m15.365s
sys 0m3.044s
# time wc -l twogigfile.txt
39810582 twogigfile.txt
real 1m5.920s
user 0m2.716s
sys 0m2.952s
# time sed -n '$=' twogigfile.txt
39810582
real 5m33.276s
user 5m12.508s
sys 0m3.368s
wc is comparable to the awk methods; the sed and cat methods pale in comparison.
number of lines : 587810152 ( 587 million )
file size : 2.737 GB
time wc -l d1
587810152
wc -l d1 10.62s user 2.65s system 21% cpu 1:02.15 total
time sed -n '$=' d1
587810152
sed -n '$=' d1 134.83s user 2.74s system 81% cpu 2:47.91 total
time awk '{x++}END{ print x}' d1
587810152
awk '{x++}END{ print x}' d1 487.43s user 3.02s system 95% cpu 8:31.66 total
time perl -e ' while (<>) { } print "$.\n" ' d1
587810152
perl -e ' while (<>) { } print "$.\n" ' d1 212.17s user 2.87s system 93% cpu 3:50.42 total
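Since most of the time in these runs goes into per-line record handling, reading the file in large blocks and counting newline bytes can be faster on some systems. This is only a sketch, not something benchmarked above; `bigfile` is a placeholder name:

```shell
# Count newline bytes instead of splitting records line by line.
# 'bigfile' is a placeholder; timings will vary by platform.
tr -dc '\n' < bigfile | wc -c

# A Perl variant of the same idea: read 1 MiB blocks and count "\n" with tr///,
# instead of letting the <> operator split the input into lines.
perl -e '$n += tr/\n// while sysread(STDIN, $_, 1<<20); print "$n\n"' < bigfile
```

Both report the number of newline characters, so like wc -l they will not count a final line that lacks a trailing newline.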
$ uname -a
SunOS xxxxx 5.9 Generic_118558-36 sun4u sparc SUNW,Sun-Fire-V490
$ echo $BASH_VERSION
2.05.0(1)-release
$ du -k FILE1
124652 FILE1
$ type wc
wc is hashed (/usr/bin/wc)
$ time wc -l FILE1
1475071 FILE1
real 0m5.519s
user 0m1.300s
sys 0m0.490s
$ # grep -c ^ can give you a different number (one too high),
$ # depending on whether the last line is incomplete or not.
$ time grep -c ^ FILE1
1475071
real 0m1.411s
user 0m1.050s
sys 0m0.360s
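That off-by-one caveat is easy to reproduce with a tiny file; `demo.txt` here is a made-up name for illustration:

```shell
# A file whose last line has no trailing newline (demo.txt is a placeholder).
printf 'a\nb\nc' > demo.txt
wc -l demo.txt      # counts newline characters only -> 2
grep -c ^ demo.txt  # counts every line, complete or not -> 3
rm demo.txt
```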
$ /usr/dt/bin/dtksh
$ print ${.sh.version}
Version M-12/28/93d
$ type wc
wc is a shell builtin version of /bin/wc
$ time wc -l FILE1
1475071 FILE1
real 0m1.36s
user 0m0.95s
sys 0m0.40s
$ # but ...
$ du -k FILE2
3314730 FILE2
$ wc -l FILE2
/usr/dt/bin/dtksh: wc: FILE2: cannot open [Value too large for defined data type]
My test file was the result of concatenating all the files in a directory.
The directory contains text and binary files, so my test file also contains binary data. I think awk doesn't like binary data.
I made a test to confirm this:
$ tar cvf test.tar test.dat
a test.dat 4009 blocks.
$ wc -l test.dat
60636 test.dat
$ sed -n '$=' test.tar
$ awk 'END {print NR}' test.tar
60842
$
As you can see, the three commands give different results on such input.
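The divergence is easy to demonstrate even without a tar file. The usual culprits on binary data are NUL bytes and a final record with no trailing newline: wc -l counts newline bytes only, while awk counts records, and how an awk treats NUL bytes is implementation-dependent. A small sketch, with `mixed.bin` as a made-up filename:

```shell
# Sketch: why line counts diverge on binary-ish data (mixed.bin is a placeholder).
printf 'one\ntwo\0three\nfour' > mixed.bin
wc -l mixed.bin                  # counts newline bytes: 2
awk 'END{print NR}' mixed.bin    # counts records, including the final one
                                 # without a newline; NUL handling varies by awk
rm mixed.bin
```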