Extract certain parts from a file

gpk_newbie · October 16, 2011, 10:23pm

I have a log file from which I need to extract certain lines based on the time, but the problem is that the time is not the same for all days.

Input file:

Mon 12:34:56 abvjingjgg
Mon 12:34:57 ofjhjgjhgh
.
.
.
Mon 22:30:00 kkfng
.
.
.
Mon 23:12:23 kjgsdafhkljf
.
.
.
Tue 01:04:54 ldkjaoper

Now I need to extract only the data written from 22:00:00 onward, through the end of the file, but there may be no entry at exactly 22:00:00.
The input file also contains data for previous days, but I need only the part that was updated last, i.e. the data I need will always be at the end of the file.

Desired output file:
Mon 22:30:00 kkfng
.
.
.
Mon 23:12:23 kjgsdafhkljf
.
.
.
Tue 01:04:54 ldkjaoper

agama · October 16, 2011, 10:44pm

As long as you want everything after the first timestamp of 22:00 or later, then this should work:

awk ' snarf || $2+0 >= 22 { snarf = 1; print; }' log-file-name

It does require that a timestamp between 22:00 and 23:59:59 be present.

gpk_newbie · October 16, 2011, 10:50pm

Thanks agama. Let me try.

---------- Post updated at 08:20 AM ---------- Previous update was at 08:17 AM ----------

Thanks a lot agama. It works. But can you briefly explain what it does?

agama · October 16, 2011, 11:21pm

The basic format of an awk programme is

condition { action }

such that action statements are executed if condition evaluates to true.

In this case, the condition is

snarf || $2+0 >= 22

Awk evaluates snarf much as C would: it is true if it is nonzero or a non-empty string. An unset variable is both zero and the empty string, so as the programme starts it evaluates to false. The second part evaluates to true when the hour of the timestamp is greater than or equal to 22. This makes use of an awk trick that converts the leading numeric portion of a string to a number by adding zero ($2+0), so that a field such as 22:30:00 becomes 22 and can be compared to the integer 22.

Once the expression evaluates to true (the timestamp is good), we set snarf to 1 so that the expression is true from then on, and all lines after the first good timestamp are printed.
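
To see the two pieces of that condition in isolation, here is a minimal sketch. The sample line is taken from the input shown earlier in the thread; the shell variable names hour and flag are only for illustration.

```shell
# "22:30:00" + 0 gives 22: awk keeps the leading numeric prefix and ignores the rest.
hour=$(echo 'Mon 22:30:00 kkfng' | awk '{ print $2 + 0 }')
echo "$hour"

# snarf has never been assigned, so it counts as false.
flag=$(echo 'x' | awk '{ print (snarf ? "true" : "false") }')
echo "$flag"
```

This should print 22 and then false.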

Some suggested reading on awk:
Awk - A Tutorial and Introduction - by Bruce Barnett

gpk_newbie · October 16, 2011, 11:54pm

Thanks a lot agama. But I still have a doubt: will this pick up only the latest update in the log file? There may be updates for previous days in the log file as well.

---------- Post updated at 09:24 AM ---------- Previous update was at 09:04 AM ----------

I tried it and it did not work when the log file also contains data from previous days. It finds the first occurrence of 22:00:00 and displays everything that follows, whereas I need only the data from the last 22:00:00 block at the end of the file through the end of the file.

agama · October 17, 2011, 12:08am

You are correct; my original post indicated that it would snarf from the first occurrence of the timestamp until the end. I didn't catch the part in your original post that indicated you only wanted the last day -- sorry about that.

This is along the same lines, but it keeps only the final block: it discards everything before the last point where the timestamps cross from earlier than 22:00:00 to 22:00:00 or later. It does assume that every line in the file has a timestamp.

awk  ' 
    BEGIN { i = 0; }
    $2+0 < 22 { roll = 1; }     # rolled to next day -- signal reset needed

    snarf || $2+0 >= 22 {
        if(  $2+0 >= 22 && roll )  # reset on first timestamp after roll
        {
            roll = 0;
            delete capture;
            i = 0;
        }

        snarf = 1; 
        capture[i++] = $0; 
    }

    END {      # after all of the file has been read, print the buffered lines: the last block that starts at 22:00 or later
        for( j = 0; j < i; j++ )
            print capture[j];
    }' input-file
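
For anyone who wants to try the reset logic, here is a small sketch on made-up data. The sample lines and the temporary file are invented for the demo. It is a slightly condensed variant of the script above, not a replacement for it: it resets only the index i instead of deleting the array, which is enough because END prints only indices below i.

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
Sun 21:10:00 old-a
Sun 22:05:00 old-b
Sun 23:40:00 old-c
Mon 00:30:00 old-d
Mon 12:34:56 new-a
Mon 22:30:00 new-b
Mon 23:12:23 new-c
Tue 01:04:54 new-d
EOF

last_block=$(awk '
    $2+0 < 22 { roll = 1 }            # earlier hour seen: the next 22:xx line starts a new block
    snarf || $2+0 >= 22 {
        if ($2+0 >= 22 && roll) {     # first 22:xx line after a roll: forget the older block
            roll = 0
            i = 0
        }
        snarf = 1
        capture[i++] = $0             # buffer the line
    }
    END { for (j = 0; j < i; j++) print capture[j] }
' "$log")

printf '%s\n' "$last_block"
rm -f "$log"
```

On this sample only the three lines from Mon 22:30:00 through Tue 01:04:54 should come out; the Sunday evening block is dropped when Monday's first 22:xx line arrives.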

gpk_newbie · October 17, 2011, 1:16am

Great, this works fine.

---------- Post updated at 10:24 AM ---------- Previous update was at 09:45 AM ----------

Hi agama, I use the command below to copy the 7 lines after the pattern from file1 to file2, but the problem is that I am not able to include the line containing the pattern itself in file2.

gawk 'c-->0;/pattern/{c=7}' file1 > file2

---------- Post updated at 10:46 AM ---------- Previous update was at 10:24 AM ----------

Sorry again, but one small doubt: if the time I am looking for is 22:15:00 instead of 22:00:00, how should I change the awk command?

agama · October 17, 2011, 9:28pm

No problem. Just a small tweak:

awk  '
    BEGIN { i = 0; }

    {
        split( $2, a, ":" );               # divide field 2 into hr min sec 
        t = (3600 * a[1]) + (60 * a[2]) + a[3];  # compute sec past midnight; 80100 is 22:15:00
        if( t < 80100 )                 # wrapped to next day; must roll
            roll = 1;

        if( snarf || t >= 80100 )       # snarfing or past the magic time
        {
            if(  t >= 80100 && roll )   # 22:15 the next day; clear first
            {
                roll = 0;
                delete capture;
                i = 0;
            }

            snarf = 1;
            capture[i++] = $0;     # buffer the record
        }
    }

    END {
        for( j = 0; j < i; j++ )
            print capture[j];
    }'  input-file
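
If the cutoff is likely to change again, one option is to pass it in rather than hard-code 80100. The following is only a sketch under a few assumptions: the variable name cutoff, the helper function secs, and the sample data are all invented here, and the index is reset instead of deleting the array.

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
Sun 22:20:00 old-a
Mon 09:00:00 old-b
Mon 22:10:00 too-early
Mon 22:15:00 new-a
Mon 23:12:23 new-b
Tue 01:04:54 new-c
EOF

# cutoff is a hypothetical variable name; pass any HH:MM:SS value.
tail_part=$(awk -v cutoff="22:15:00" '
    function secs(hms,    p) {        # HH:MM:SS to seconds past midnight
        split(hms, p, ":")
        return 3600 * p[1] + 60 * p[2] + p[3]
    }
    BEGIN { limit = secs(cutoff) }
    {
        t = secs($2)
        if (t < limit)                # earlier than the cutoff: the next late line starts a new block
            roll = 1
        if (snarf || t >= limit) {
            if (t >= limit && roll) { # first late line after a roll: forget the older block
                roll = 0
                i = 0
            }
            snarf = 1
            capture[i++] = $0         # buffer the record
        }
    }
    END { for (j = 0; j < i; j++) print capture[j] }
' "$log")

printf '%s\n' "$tail_part"
rm -f "$log"
```

With this sample the output should be the three lines starting at Mon 22:15:00; the 22:10:00 line falls before the cutoff and is dropped along with the older block.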

Thinking on your other question.

gpk_newbie · October 17, 2011, 10:20pm

Thanks a lot agama. It works fine.

And I found the solution to the other question: just one more print command to print the matching line.

gawk 'c-->0;/pattern/{print;c=7}' file1 > file2

---------- Post updated at 07:50 AM ---------- Previous update was at 07:24 AM ----------

The file size is not a concern here, as the file is only a few KB and will always be the same.

However, the pattern occurs multiple times and I need the 7 lines following the pattern for each occurrence.

Thanks for the help.
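
One caveat with the print-based command when the pattern occurs repeatedly: if a second match falls inside the lines being printed after an earlier match, that line appears to come out twice, once from the counter rule and once from the explicit print. A possible alternative, shown here only as a sketch, sets the counter before testing it so each line is printed at most once. The variable n and the sample file are invented for the demo; n is 2 to keep it short, and n=7 would correspond to the case in this thread.

```shell
file1=$(mktemp)
cat > "$file1" <<'EOF'
a
pattern 1
b
pattern 2
c
d
e
EOF

# Command from the thread, with the count in the hypothetical variable n.
dup=$(awk -v n=2 'c-- > 0; /pattern/ { print; c = n }' "$file1")

# Variant: set the counter first, then let one rule do all the printing.
nodup=$(awk -v n=2 '/pattern/ { c = n + 1 } c-- > 0' "$file1")

printf 'with print:\n%s\n\ncounter first:\n%s\n' "$dup" "$nodup"
rm -f "$file1"
```

In the first result the line "pattern 2" shows up twice; in the second each matching line and the lines after it appear once.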