Dear all,
I want to count the lines of a flat (text) file using awk. I have tried with {print NR},
but it takes a lot of time on a big file, around 2 GB in size.
I want better efficiency, so can anybody please help me with some other, better awk code?
Panknil,
The only way to know how many lines a file has is to count them
line by line, which is what "wc -l" does.
Anything else will use the same logic.
If it is taking too long, there is nothing else you can do.
Well... as Shell_Life said, 'there ain't nothing you can do'.
A better question is: WHY do you need to know the number of lines? Maybe you don't need to know the number of lines at all, if you/we know the initial objective of all of this!
I created an actual ~2 GB file. Here are my results:
# time awk 'END{print NR}' twogigfile.txt
39810582
real 1m8.784s
user 0m10.297s
sys 0m3.056s
# time awk '{x++}END{print x}' twogigfile.txt
39810582
real 1m6.738s
user 0m15.365s
sys 0m3.044s
# time wc -l twogigfile.txt
39810582 twogigfile.txt
real 1m5.920s
user 0m2.716s
sys 0m2.952s
# time sed -n '$=' twogigfile.txt
39810582
real 5m33.276s
user 5m12.508s
sys 0m3.368s
wc is comparable to the awk methods; the sed and cat methods pale in comparison.
number of lines : 587810152 ( 587 million )
file size : 2.737 GB
time wc -l d1
587810152
wc -l d1 10.62s user 2.65s system 21% cpu 1:02.15 total
time sed -n '$=' d1
587810152
sed -n '$=' d1 134.83s user 2.74s system 81% cpu 2:47.91 total
time awk '{x++}END{ print x}' d1
587810152
awk '{x++}END{ print x}' d1 487.43s user 3.02s system 95% cpu 8:31.66 total
time perl -e ' while (<>) { } print "$.\n" ' d1
587810152
perl -e ' while (<>) { } print "$.\n" ' d1 212.17s user 2.87s system 93% cpu 3:50.42 total
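Since most of the time in these runs goes into per-line record handling, reading the file in large blocks and counting newline bytes can be faster on some systems. This is only a sketch, not something benchmarked above; `bigfile` is a placeholder name:

```shell
# Count newline bytes instead of splitting records line by line.
# 'bigfile' is a placeholder; timings will vary by platform.
tr -dc '\n' < bigfile | wc -c

# A Perl variant of the same idea: read 1 MiB blocks and count "\n" with tr///,
# instead of letting the <> operator split the input into lines.
perl -e '$n += tr/\n// while sysread(STDIN, $_, 1<<20); print "$n\n"' < bigfile
```

Both report the number of newline characters, so like wc -l they will not count a final line that lacks a trailing newline.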
$ uname -a
SunOS xxxxx 5.9 Generic_118558-36 sun4u sparc SUNW,Sun-Fire-V490
$ echo $BASH_VERSION
2.05.0(1)-release
$ du -k FILE1
124652 FILE1
$ type wc
wc is hashed (/usr/bin/wc)
$ time wc -l FILE1
1475071 FILE1
real 0m5.519s
user 0m1.300s
sys 0m0.490s
$ # grep -c ^ can give you a different number (one too high),
$ # depending on whether the last line is incomplete or not.
$ time grep -c ^ FILE1
1475071
real 0m1.411s
user 0m1.050s
sys 0m0.360s
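That off-by-one caveat is easy to reproduce with a tiny file; `demo.txt` here is a made-up name for illustration:

```shell
# A file whose last line has no trailing newline (demo.txt is a placeholder).
printf 'a\nb\nc' > demo.txt
wc -l demo.txt      # counts newline characters only -> 2
grep -c ^ demo.txt  # counts every line, complete or not -> 3
rm demo.txt
```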
$ /usr/dt/bin/dtksh
$ print ${.sh.version}
Version M-12/28/93d
$ type wc
wc is a shell builtin version of /bin/wc
$ time wc -l FILE1
1475071 FILE1
real 0m1.36s
user 0m0.95s
sys 0m0.40s
$ # but ...
$ du -k FILE2
3314730 FILE2
$ wc -l FILE2
/usr/dt/bin/dtksh: wc: FILE2: cannot open [Value too large for defined data type]
My test file was the result of concatenating all the files in a directory.
The directory contains text and binary files, so my test file also contains binary data. I think awk doesn't like binary data.
I made a test to confirm this:
$ tar cvf test.tar test.dat
a test.dat 4009 blocks.
$ wc -l test.dat
60636 test.dat
$ sed -n '$=' test.tar
$ awk 'END {print NR}' test.tar
60842
$
As you can see, the three commands give different results on such input.
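The divergence is easy to demonstrate even without a tar file. The usual culprits on binary data are NUL bytes and a final record with no trailing newline: wc -l counts newline bytes only, while awk counts records, and how an awk treats NUL bytes is implementation-dependent. A small sketch, with `mixed.bin` as a made-up filename:

```shell
# Sketch: why line counts diverge on binary-ish data (mixed.bin is a placeholder).
printf 'one\ntwo\0three\nfour' > mixed.bin
wc -l mixed.bin                  # counts newline bytes: 2
awk 'END{print NR}' mixed.bin    # counts records, including the final one
                                 # without a newline; NUL handling varies by awk
rm mixed.bin
```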