Verify if line number exists

SkySmart · December 15, 2012, 6:49pm

os: linux/sunos

i'm running the following:

sed -n "2065696{p;q}" /tmp/file.txt

/tmp/file.txt is a very big file, ranging from 400MB to 4GB in size. I want to know if line 2065696 exists, hence the command above. The problem is that this command is very slow. I have tried awk and grep as well, and both take a while to come back to the prompt.

Does anyone have ideas on how to speed this up?

jim_mcnamara · December 15, 2012, 7:01pm

Are all of the file records the same length (i.e., does the file have a fixed record length)?
If so:

get the file size from ls -l (Solaris) or the stat command (Linux)
divide the file size by the fixed record size.
# if the result is >= 2065696 then the record exists.

If not, you have to push the file pointer past 2065695 \n characters and not have reached EOF, i.e., read that far into the file.
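
For the fixed-length case, the size check above could be sketched in shell roughly like this. It is only an illustration: record_exists is a made-up helper name, the record length is assumed to include the trailing newline, and wc -c stands in for ls -l or stat because it is invoked the same way on Linux and Solaris. It assumes a POSIX shell (bash, ksh, or /usr/xpg4/bin/sh on Solaris), not the legacy Bourne shell.

```shell
#!/bin/sh
# record_exists FILE RECLEN TARGET   (hypothetical helper, not from the thread)
# Succeeds if record number TARGET exists in FILE, assuming every record,
# newline included, is exactly RECLEN bytes long.
record_exists() {
    file=$1 reclen=$2 target=$3
    [ "$reclen" -gt 0 ] || return 2           # avoid dividing by zero
    size=$(wc -c < "$file") || return 2       # file size in bytes
    [ $((size / reclen)) -ge "$target" ]      # whole records vs. wanted record
}

# Example call (the 80-byte record length is an assumed value):
# record_exists /tmp/file.txt 80 2065696 && echo "line exists"
```

The function returns 0 when the record exists, 1 when the file is too short, and 2 when the arguments or the file are unusable.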

What are you trying to do? Look for something on line 2065696?

SkySmart · December 15, 2012, 7:24pm

Yes, I'm looking for something on line 2065696.

And I don't understand the second part of what you said. Can you please elaborate?

Also, the lines in the file are not all the same length.

jim_mcnamara · December 15, 2012, 8:19pm

Because each line ends with \n, the newline character, you have to read from the start of the file, line by line, until you have read 2065696 lines. Put another way, you have to find that many newlines minus one (2065695) to be on line 2065696.
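
As a rough shell sketch of that idea (not a tuned solution): head reads only as far as the requested line and wc -l counts the newlines it passed along. line_exists is a made-up helper name, and a POSIX shell is assumed. Note that a final line without a trailing newline is not counted by wc -l, so this sketch treats it as missing.

```shell
#!/bin/sh
# line_exists FILE TARGET   (hypothetical helper, not from the thread)
# Succeeds if FILE contains at least TARGET newline-terminated lines.
line_exists() {
    file=$1 target=$2
    [ -r "$file" ] || return 2
    # head stops after TARGET lines; wc -l counts the newlines it let through.
    # The $(( )) wrapper strips any padding some wc implementations print.
    count=$(( $(head -n "$target" "$file" | wc -l) ))
    [ "$count" -ge "$target" ]
}

# Example call:
# line_exists /tmp/file.txt 2065696 && echo "line exists"
```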

The fastest way to do that is to use something that is compiled to do just exactly that:

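// findln.c
// usage:
//  ./findln  file_to_check
//  ./findln  < file_to_check
//  command some_file | ./findln
// compile: [g]cc findln.c -o findln

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    FILE *in=NULL;
    long line=1;              // number of the line currently being read
    int found=0;              // set once we are on the wanted line
    char tmp[4096]={0x0};

    if (argc==1)
        in=stdin;
    else
        in=fopen(argv[1], "r");

    if(in==NULL) {perror("fopen"); exit(1);}
    while(fgets(tmp, sizeof(tmp), in)!=NULL)
    {
        if(line==2065696)
        {
            printf("%s", tmp);   // a long line may arrive in several chunks
            found=1;
        }
        // only a chunk ending in \n completes a line, so lines longer
        // than the buffer are not counted more than once
        if(strchr(tmp, '\n')!=NULL)
        {
            if(found) break;     // whole line printed, stop reading
            line++;
        }
    }
    return found ? 0 : 1;  // 0: we found the line, 1: error, we did not get it
}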

One of the things you do in UNIX and Windows is write quick and dirty code for things like this. It does one thing: it just reads a file about as fast as possible, line by line.

If you want a faster solution, this is how you do it: use C, C++, or some other compiled language you know. awk is interpreted and sed is meant to do a lot of stuff, so both are going to be somewhat slower than stupid code like the above.

jmgibby · January 29, 2013, 4:02pm

How about this?

tail +2065696 /tmp/file.txt | head -1
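
If the goal is only to test whether the line exists, the same pipeline could be wrapped so that an empty line still counts as present. This is a small sketch with made-up function names (nth_line, nth_line_exists), written with tail -n +N, the POSIX spelling of the +N option used above, and assuming a POSIX shell.

```shell
#!/bin/sh
# nth_line FILE N   (hypothetical helper): print line N of FILE, if any.
nth_line() {
    tail -n +"$2" "$1" | head -n 1
}

# nth_line_exists FILE N   (hypothetical helper): succeed if line N exists.
nth_line_exists() {
    # An existing line yields at least one byte (its newline), so an
    # empty line is still detected; a missing line yields zero bytes.
    [ $(( $(nth_line "$1" "$2" | wc -c) )) -gt 0 ]
}

# Example call:
# nth_line_exists /tmp/file.txt 2065696 && nth_line /tmp/file.txt 2065696
```

Checking the byte count rather than the printed text matters because command substitution drops trailing newlines, which would make an empty line look like a missing one.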