quickest way to get the total number of lines in a file

I have a file that's about 2 GB, and I have to get the total number of lines in it every 10 minutes.

The interval is not an issue; I just need the proper, most efficient way to do this.

Any ideas?

I got the following from another thread on this site:

awk 'int(100*rand())%5<1' file

But this randomly pulls out 20% of the lines in a file. I'm thinking this code can be slightly modified to get what I want?

wc -l filename

I'd rather not use "wc -l"; it is not efficient at all on a file of 2 GB. I'm hoping there's a better way to get the total line count of a file that big.

awk 'END {print NR}' filename

You do realize, don't you, that in order to count the number of <newline> characters in a file you have to read the entire file?

If lines are being added to the file constantly, but existing data in the file isn't changing, you could save the line count and file size at the time of your last check. Then when the next check occurs, you could read just the new data, count the added lines, and add the result to your previous count. If you're really concerned about efficiency, you probably want to write this in C or C++ rather than a shell script.
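Even in shell, the idea is simple enough to sketch. A minimal version, assuming an append-only file and GNU stat (the -c %s size format is Linux-specific; the state-file name linecount.state is made up for illustration):

#!/bin/sh
# Incrementally count lines in an append-only file.
file=$1
state=linecount.state               # holds "previous_size previous_count"

size=$(stat -c %s "$file")          # current size in bytes (GNU stat)

if [ -f "$state" ]; then
    read -r old_size old_count < "$state"
else
    old_size=0 old_count=0
fi

# tail -c +N starts output at byte N (1-based), so this reads only the
# newly appended bytes. Ignores the small race where the file grows
# between the stat above and this read.
new_lines=$(tail -c +$((old_size + 1)) "$file" | wc -l)

echo "$size $((old_count + new_lines))" > "$state"
echo "$((old_count + new_lines))"

If the file can be truncated or rotated, you would also want to fall back to a full wc -l whenever the current size is smaller than the saved one.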


Sky Smart - can you cite a verified reference as to why reading a file from record 1 to EOF is NOT the most efficient approach for carriage-control files? rdrtx1's sample code does that, and so does wc -l.

I'll answer:
No such valid reference exists. You have to count the number of \n characters to get a line count. The only other possibility is a fixed-length-record file. In that case you call stat, ls, or some code of your own to get the number of bytes (reading the file metadata: the st_size field of struct stat) and then do integer division: bytes/recsz.
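For the fixed-length-record case, that arithmetic is trivial even from the shell. A sketch, again assuming GNU stat, with 80 bytes as a made-up example record size:

recsz=80                        # fixed record length in bytes (example value)
bytes=$(stat -c %s datafile)    # size from metadata; the data itself is never read
echo $((bytes / recsz))         # record count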

One other 'reliable' way is to seek to the end of the file and call x=ftell() once you know the file has finished being written, then divide x/recsz - again, only for fixed-record-length files. This is an even less efficient way of doing what stat does.

NOTHING else exists. In other words: how else could you possibly know how many \n characters exist in the file?


When you run "wc -l" on a file that big, it takes a while to get the total line count. I understand that the entire file must be read in order to count the lines.

I'm also aware that in UNIX there is more than one way to get something done. On some Linux systems a "grep -P" will get you what you want a lot faster than any other utility can; on others, the -P option is not available.

Overall, I'm mostly concerned about speed and what the quickest way is to get the total line count on a file that big.
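One more candidate worth benchmarking alongside awk and sed is grep with an empty pattern, which matches (and therefore counts) every line; whether it beats wc -l will depend on the grep implementation:

grep -c '' input_file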

sed -n '$=' input_file
 

I think wc -l file is an efficient way of counting lines compared to the other methods discussed above.


If you start testing methods on a file, be aware of the effect of file caching by the OS and disk controllers. You will get completely bogus results if you are not aware of this. I/O wait time is the biggest time consumer; disks are at the very best 10 times slower than memory unless you have an SSD.

Pretend you try sed and get this answer:

time sed -n '$=' input_file
real    0m2.098s
user    0m0.516s
sys     0m0.338s

Great - that took 2.098 seconds of wall time.
Let's try wc -l

time wc -l input_file
real    0m0.778s
user    0m0.416s
sys     0m0.338s

Wow. wc -l was faster.

No. A lot of the file data was still in cache, so there was no I/O wait. Why? Because you ran against the same file. As you read through a file, the system will attempt to cache all or parts of it, depending on available resources.

The file data in the cache slowly goes away as other users read and write the same disk. After a while the file is no longer cached; how long that takes, I cannot say. Solaris will use part of free memory as file cache, as will Linux. Add to this what the disk controller caches, and some large chunks of really huge files can be in memory.
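If you want comparable timings, on Linux you can flush the page cache between runs (this needs root; the drop_caches interface is Linux-specific and does not exist on Solaris):

sync                                          # write dirty pages out to disk first
echo 3 | sudo tee /proc/sys/vm/drop_caches    # drop page cache, dentries, and inodes
time wc -l input_file                         # now a genuine cold-cache timing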

SAN storage behaves in a similar way, but it is a lot more complex. SAN is generally slower than direct disk, though some systems have the faster directio options; the fastest storage is raw disk (bypassing the filesystem and the kernel code for filesystem support). Oracle will do this for its database files if so configured.

You can also tune a filesystem.

If you have to speed up file I/O, look into an SSD for desktops.


Thank you so much for the detailed explanation. I've always wondered why I sometimes get a fast response and other times a much slower one when running the same command on a file. Now I know. Thanks a million.

Yeah, I thought so when I tested - varying times. Very good food for thought, Jim. Thanks.

Hi.

See also the post at count lines of file for some additional timings ... cheers, drl
