Count lines of a file

Dear all,
I want to count the lines of a flat (text) file using awk. I have tried {print NR},
but it is taking a lot of time on a big file, around 2GB in size.
I want better efficiency, so can anybody please help me with some other, better awk code?

Regards,
Pankaj

why 'awk'?
wouldn't 'wc -l' be sufficient?

Did you try something similar to this,

awk '{x++}END{ print x}' filename

As pointed out by vgersh,
wc -l < filename should do!
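
As an aside, the redirected form prints the count alone, without the file name appended (file name here is just an example):

$ wc -l bigfile
1234 bigfile
$ wc -l < bigfile
1234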

Dear vgersh99,

thanks for the reply, but I have already tried the `wc -l` command.
It is also taking a lot of time on the 2GB file.

Regards,
Pankaj

Panknil,
The only way to know how many lines a file has is to count them
line by line, which is what "wc -l" does.
Anything else will use the same logic.
If it is taking too long, there is nothing else you can do.

how fast do you want it to be?

well.... as Shell_Life said... 'there ain't nothing you can do'.
a better question is 'WHY do you need to know the number of lines?'. Maybe you don't need to know the number of lines if you/we know the initial objective of all of this!

Hi.

Are you concerned about wall-clock time or CPU time? ... cheers, drl

Some tests:

$ time sed -n '$=' big_file
2502607

réel    0m2,31s
util    0m1,79s
sys     0m0,50s
$ time wc -l <big_file
 2502607

réel    0m2,51s
util    0m2,33s
sys     0m0,18s
$ time awk '{x++}END{ print x}' big_file
2503305

réel    0m4,73s
util    0m4,52s
sys     0m0,20s
$ time awk 'END{print NR}' big_file
2503305

réel    0m3,21s
util    0m3,02s
sys     0m0,19s
$ time cat -n big_file | tail -1
2502607 

réel    0m7,86s
util    0m7,46s
sys     0m0,40s
$

Size of big_file: 72 MB

I don't understand the result given by awk!

Jean-Pierre.

I have an actual ~2GB file created. Here are my results:

# time awk 'END{print NR}' twogigfile.txt
39810582

real    1m8.784s
user    0m10.297s
sys     0m3.056s
# time awk '{x++}END{ print x}' twogigfile.txt
39810582

real    1m6.738s
user    0m15.365s
sys     0m3.044s

# time wc -l twogigfile.txt
39810582 twogigfile.txt

real    1m5.920s
user    0m2.716s
sys     0m2.952s

# time sed -n '$=' twogigfile.txt
39810582

real    5m33.276s
user    5m12.508s
sys     0m3.368s

wc is comparable to the awk methods; the sed and cat methods pale. Note that the awk and wc runs all show a real time around 66 seconds while wc's user time is only 2.7s, so disk I/O dominates those runs.

This surprises me! :confused: :confused: :confused:

I tried with a file of about 600 million lines; though the times taken to compute the number of lines differed, the results were identical for each of the commands.

number of lines: 587810152 (587 million)
file size: 2.737 GB

time wc -l d1
587810152
wc -l d1  10.62s user 2.65s system 21% cpu 1:02.15 total


time sed -n '$=' d1
587810152
sed -n '$=' d1  134.83s user 2.74s system 81% cpu 2:47.91 total

time awk '{x++}END{ print x}' d1
587810152
awk '{x++}END{ print x}' d1  487.43s user 3.02s system 95% cpu 8:31.66 total

time perl -e ' while (<>) { } print "$.\n" '  d1
587810152
perl -e ' while (<>) { } print "$.\n" ' d1  212.17s user 2.87s system 93% cpu 3:50.42 total
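
If the per-line loops are too slow, a block-read count can cut the CPU overhead. A minimal sketch (not benchmarked here, run against the same d1 file): read the file in 1 MB chunks and count newlines with tr///, avoiding perl's per-line record splitting:

time perl -e '
    my ($buf, $n) = ("", 0);
    # read 1 MB blocks; tr/// in scalar context returns
    # the number of newlines found in each block
    while (sysread(STDIN, $buf, 1 << 20)) {
        $n += ($buf =~ tr/\n//);
    }
    print "$n\n";
' < d1

Every byte still has to be read, so on an I/O-bound disk this will not beat wc -l, but it should beat the per-line awk/perl loops on user time.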

Based on a comp.unix.shell post.

$ uname -a
SunOS xxxxx 5.9 Generic_118558-36 sun4u sparc SUNW,Sun-Fire-V490
$ echo $BASH_VERSION
2.05.0(1)-release
$ du -k FILE1
124652  FILE1
$ type wc
wc is hashed (/usr/bin/wc)
$ time wc -l FILE1
 1475071 FILE1

real    0m5.519s
user    0m1.300s
sys     0m0.490s
$ # grep -c ^ will give you wrong numbers (one too high)
$ # depending on whether the last line is incomplete or not.
$ time grep -c ^ FILE1 
1475071

real    0m1.411s
user    0m1.050s
sys     0m0.360s
$ /usr/dt/bin/dtksh
$ print ${.sh.version}
Version M-12/28/93d
$ type wc
wc is a shell builtin version of /bin/wc
$ time wc -l FILE1
 1475071 FILE1

real    0m1.36s
user    0m0.95s
sys     0m0.40s
$ # but ...
$ du -k FILE2
3314730 FILE2
$ wc -l FILE2
/usr/dt/bin/dtksh: wc: FILE2: cannot open [Value too large for defined data type]

My test file was the result of concatenating all the files in a directory.
The directory contains both text and binary files, so my test file also contains binary data. I think that awk doesn't like binary data.

I ran a test to confirm this:

$ tar cvf test.tar test.dat
a test.dat 4009 blocks.
$ wc -l test.dat
   60636 test.dat
$ sed -n '$=' test.tar
$ awk 'END {print NR}' test.tar
60842
$

As you can see, the result is different for the three commands.

Jean-Pierre.

I suppose wc -l uses '\n' as the line delimiter and counts the number of lines based on the occurrences of '\n' in the file.

I have seen files for which '\n' is not the delimiter; in such cases even wc will not give the expected result.

The same should be the case with awk and binary files.
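
That matches the grep -c ^ caveat in the dtksh post above: wc -l counts newline characters, so a final line without a trailing newline is not counted. A quick demonstration (t is just a throwaway test file):

$ printf 'a\nb\nc' > t
$ wc -l < t
2
$ grep -c ^ t
3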