i'd rather not use "wc -l". it's not efficient at all on a file that's 2GB in size. i'm hoping there's a better way to get the total line count of a file that big.
You do realize, don't you, that in order to count the number of <newline> characters in a file you have to read the entire file?
If lines are being added to the file constantly, but existing data in the file isn't changing, you could save the line count and file size at the time of your last check. Then when the next size check occurs, you could just read the new data and count the added lines and add the results to your previous count. If you're really concerned about efficiency, you probably want to write this in C or C++ rather than a shell script.
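The append-only idea above could be sketched in shell before reaching for C (file names like big.log and .linecount.state are just placeholders for this example):

```shell
# Incremental line count: remember where the last count stopped,
# then only read the bytes appended since.  All names are illustrative.
f=big.log
state=.linecount.state          # holds "offset count" from the last run

rm -f "$state"                  # start fresh for this demo
printf 'a\nb\nc\n' > "$f"       # sample "log" with 3 lines

# Load the previous offset/count, defaulting to zero on the first run.
read prev_off prev_count < "$state" 2>/dev/null || { prev_off=0; prev_count=0; }

size=$(wc -c < "$f"); size=$((size))
if [ "$size" -gt "$prev_off" ]; then
    # tail -c +N starts output at byte N (1-based), skipping what was
    # already counted.
    new=$(tail -c +"$((prev_off + 1))" "$f" | wc -l)
    prev_count=$((prev_count + new))
    prev_off=$size
fi
printf '%s %s\n' "$prev_off" "$prev_count" > "$state"
echo "$prev_count"              # prints 3 for the sample file
```

A C version would do the same bookkeeping, replacing the tail/wc pipe with lseek() to the saved offset and a read loop that counts '\n' bytes.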
Sky Smart - can you cite a verified reference as to why reading a file from record 1 to EOF is NOT the most efficient approach for carriage-control files? rdrtx1's sample code does that, and so does wc -l.
I'll answer:
No such valid reference exists. You have to count the number of \n characters to get a line count. The only other possibility is a fixed-length-record file. In that case you call stat, run ls, or write some code to get the number of bytes (reading the file metadata: the st_size member of struct stat) and then do integer division: bytes/recsz.
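For the fixed-length-record case, a minimal sketch of that division (the file name records.dat and the 80-byte record size are made up for illustration, not from this thread):

```shell
# Fixed-length-record file: no newlines needed, so the record count is
# just file size / record size.  recsz and records.dat are illustrative.
recsz=80
f=records.dat

# Build a sample file of exactly 5 records.
dd if=/dev/zero of="$f" bs="$recsz" count=5 2>/dev/null

# GNU stat uses -c %s, BSD/macOS stat uses -f %z; try both.
bytes=$(stat -c %s "$f" 2>/dev/null || stat -f %z "$f")
records=$((bytes / recsz))
echo "$records"                 # prints 5
```

Note that this reads only the file metadata, not the file itself, which is why it is the one case where you escape scanning every byte.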
One other 'reliable' way is to call x = ftell() at the end of the file, once you know the file has finished being written, and divide x/recsz - again, only for fixed-record-length files. This is just a less efficient way to do what stat does.
NOTHING else exists. In other words: how else could you know how many \n characters exist in the file?
when you run a "wc -l" on a file that big, it takes a while to get the total line count. i understand that the entire file must be read in order to count the lines.
i'm also aware that in UNIX there's more than one way to get something done. on some linux systems a "grep -P" will get you what you want a lot faster than any other utility can. on others, the -P option is not available.
overall, i'm more concerned about speed and what the quickest way is to get total line count on a file that big.
If you start testing methods on a file, be aware of the effect of file caching by the OS and disk controllers. You will get completely bogus results if you are not aware of this. I/O wait time is the biggest time consumer. Disks are at best 10 times slower than memory, unless you have an SSD.
Pretend you try sed and get this answer:
time sed -n '$=' input_file
real 0m2.098s
user 0m0.516s
sys 0m0.338s
Great - that took 2.098 seconds of wall time.
Let's try wc -l
time wc -l input_file
real 0m0.778s
user 0m0.416s
sys 0m0.338s
Wow. wc -l was faster.
No. A lot of the file data was still in the cache, so there was no I/O wait. Why? Because you ran against the same file. As you read through a file, the system will attempt to cache all or part of it, depending on available resources.
The file data in the cache slowly goes away as other users read and write the same disk; after a while the file is no longer cached. How long that takes, I cannot say. Solaris will use part of free memory as a file cache, and so will Linux. Add to this what the disk controller caches, and large chunks of really huge files can end up in memory.
SAN storage behaves in a similar way, but is a lot more complex. SAN is generally slower than direct-attached disk; some systems have faster directio options, and the fastest storage is raw disk (bypassing the filesystem and the kernel code that supports it). Oracle will do this for its database files if configured to.
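One way to see the caching effect for yourself - a rough benchmark sketch, assuming Linux (the /proc/sys/vm/drop_caches knob is Linux-specific and needs root; without root the cache-drop step is silently skipped and both runs will be warm):

```shell
# Rough cache-effect demo.  bench.dat is a throwaway file created here.
f=bench.dat
yes 'a line of sample data' | head -n 200000 > "$f"

# Cold-cache run: flush dirty pages and drop the Linux page cache.
# Needs root; when not permitted this step is skipped, and both timed
# runs below will then be warm (and roughly equal).
sync
[ -w /proc/sys/vm/drop_caches ] && echo 3 > /proc/sys/vm/drop_caches

time wc -l "$f"    # first run: may pay the real disk I/O cost
time wc -l "$f"    # second run: the data is now in the page cache
```

Run as root, the gap between the two timings is the I/O wait the cache was hiding. This only clears the OS page cache; whatever the disk controller itself has cached is beyond the script's reach.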
You can also tune the filesystem.
If you have to speed up file I/O on a desktop, look into an SSD.
thank you so much for the detailed explanation. i've always wondered why sometimes i get faster response and other times i get a much slower response when running the same command on a file. now i know. thanks a million.