Process file every minute

Hey guys,

How can I schedule this awk script so that it runs every minute? New data will keep being added to the file, and I need to capture that data and save it to an output file.

I tried using sleep 1 to allow time for the file to be processed, but it is not working.
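
For reference, I know a crontab entry could run a script every minute, something like this (the script path is just a placeholder):

 * * * * * /path/to/parse.sh >> /path/to/outputfile 2>&1

but I am worried about missing data between runs. Here is my awk script: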


 awk '
      {
          gsub( ">", "" );        # strip uneeded junk and make "foo bar" easy to capture
          gsub( " ", "~" );
          gsub( "<", " " );
   
          for( i = 1; i <= NF; i++ )          # snarf up each name=value pair
          {
              if( split( $(i), a, "=" ) == 2 )
              {
                  gsub(  "\"", "", a[2] );
                  gsub(  "~", " ", a[2] );
                  values[a[1]] = a[2];
              }
          }
   
          #gcount[values["Gender"]]++;         # collect counts
          #acount[values["Age"]]++;
          agcount[values["Gender"]","values["Age"]]++;
   
          printf( "%s %s %s %s\n", values["NAME"], values["Age"], values["D.O.B"], values["Gender"] );
      }
   
      END {
          printf( "\nSummary\n" );
          for( x in agcount )
              printf( "%s,%d\n", x, agcount[x] ) | "sort";
      }
  ' input-file


Thank you all

Instead of scheduling it to run (assuming you like what the code does now):

tail -f inputfile | 
awk '
      {
          gsub( ">", "" );        # strip uneeded junk and make "foo bar" easy to capture
          gsub( " ", "~" );
          gsub( "<", " " );
   
          for( i = 1; i <= NF; i++ )          # snarf up each name=value pair
          {
              if( split( $(i), a, "=" ) == 2 )
              {
                  gsub(  "\"", "", a[2] );
                  gsub(  "~", " ", a[2] );
                  values[a[1]] = a[2];
              }
          }
   
          #gcount[values["Gender"]]++;         # collect counts
          #acount[values["Age"]]++;
          agcount[values["Gender"]","values["Age"]]++;
   
          printf( "%s %s %s %s\n", values["NAME"], values["Age"], values["D.O.B"], values["Gender"] );
      }
   
      END {
          printf( "\nSummary\n" );
          for( x in agcount )
              printf( "%s,%d\n", x, agcount[x] ) | "sort";
      }
  '   > reportfile

Note the changes: the tail -f pipe at the top and the redirection to reportfile at the bottom. tail -f keeps sending your awk code any data added to the file, so this way you don't have to worry about missing data.

The only problem with this is that the data is only stored in the file for a couple of seconds, and I need to process it as soon as it arrives. Old data is deleted and replaced by new data.

That's an awkward thing to have to handle. Would it be possible to replace the input file with a fifo, so you could simply do

while true
do
        awk ... < /path/to/fifo
done

to have the data fed into awk directly?
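
Fleshed out a little, it might look like this (the fifo path and the awk script name parse.awk are placeholders):

[ -p /path/to/fifo ] || mkfifo /path/to/fifo    # create the named pipe if it isn't there yet

while true
do
        awk -f parse.awk < /path/to/fifo >> reportfile    # each writer open/close delivers one burst
done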

Beware that if your while loop seizes up, so does your other program...

Imho reliable processing of transient data is impossible in Shell.
This is a system design error.
Imho you need to modify the application which is writing the file and do the processing in the application.

Hmm. Sounds like a process to capture a rapidly changing temporary file?

Yes, the data in the file is temporary and I need to capture it as soon as it arrives.

All I require is to check the current time and capture all the data for that minute.

I am new to AWK scripting so I don't know that much about it, but is this achievable using AWK and ksh?

If not, can I please get advice on what I can use to get the required result?

Thank you all

Have you considered my suggestion of using a fifo?

Hi Corona688,

Yes, I am looking at the First In, First Out approach, but I'm not sure how to add it to my script.

Do I need to rewrite the script again to add the fifo to it?

I'm not sure what you're asking...

The idea is to remove the output file and replace it with a fifo:

# create a backup with sane permissions in case this doesn't work
tar -cpf /path/to/backup.tar /path/to/outputfile

rm /path/to/outputfile
mkfifo /path/to/outputfile

Then, whenever something tries to open and replace that file, you'll be able to read from it.

while true
do
        cat /path/to/outputfile > /tmp/$$
        echo "Received data from /path/to/outputfile"
        # process /tmp/$$ as you please since it won't vanish from under you
done
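
($$ expands to the shell's process ID, so each run of this loop script gets its own scratch file under /tmp.)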

This is assuming it's truncating and replacing the file in the manner I expect.

If it isn't then, like methyl said, "reliable processing of transient data is impossible in Shell". I'd extend that to mean that processing transient data isn't something you should be doing, and if there isn't any other way to get the data, that's a severe flaw in whatever program you're using.

@James Owen
In this context when working with basic tools, trying to capture the contents of a transient temporary file is not a sensible system design option. Neither is trying to read any file which may be open for write by another application.

Btw. "fifo" is a unix technical term for a Named Pipe.

What other design options do you have available which involve using more sophisticated tools?

Hey guys, I have tried to use the fifo but no luck; it is not doing what I want.

@methyl

I'm not sure if I understand your question correctly. Would Perl be a better choice than awk?

If so, can I please get help? Here is a basic Perl script which parses the XML messages:

while(<>){                      # read from STDIN or files named on the command line
    if(/.*NAME="([^"]*).*Age="([^"]*).*D\.O\.B="([^"]*).*Gender="([^"]*).*/){
        print $1," ",$2," ",$3," ",$4,"\n";
        $age_hash{$2}++;
        $gender_hash{$4}++;
    }
}
print "\n";
foreach my $age(sort {$a<=>$b} keys %age_hash){
    print $age,": ",$age_hash{$age},"\n";
}
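
(I run it as perl parse.pl input-file, where parse.pl is just what I named the script.)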


Please help!

What exactly is it doing?

It's more like: what is it not doing?

Your fifo suggestion creates a backup file which stores the data and lets me keep it and process it, which is not what I am after. All I want is to read from one file and write to an output file.

Thanks for trying to help.

I am not sure if this will help, but maybe try using another output file (a temporary file) to store the data, then process the data from that file using the awk code you have. This is similar to the fifo approach, but this is how I usually process files.

You could also create a new file (base the file name on process_id, username, or session_id, and time of day) each time the original program does a write, then have the report program process and delete the file.
Or, "tee" the original output to two files, so that one is a log file, then keep track of the last record number processed in the log file.

That sounds like a trivial modification of what I already gave you... If you don't want to process it -- don't.

while true
do
        cat /path/to/outputfile
done > logfile

Instead of a temporary file being replaced all the time, the output will be appended to logfile.

The Unix tee command looks interesting, except one question: to be able to use this command, will I need an output file that stores all my output, and then process that file using the tee command?

Am I right, or have I got this wrong?

Also, if I have a file which collects all the data and I then process the file every minute using the awk code, is there a way to delete the data that has been processed, then do the next minute's data, and so on?

Sorry, guys, to be bothering you all with my issue and taking up a lot of your time.

Thank you all again.

tee doesn't necessarily have to write to a file. What it does depends on what you do with it.

You've had answer after answer thrown at you and keep saying "that doesn't do what I want".

You are not helping clarify what you do want, or we might have been able to give you an answer a week ago. I suspect some of them might actually do what you want already.

If my fifo example does anything -- it's still unclear whether you even tried -- I continue to think it would be a good foundation. It would allow you to capture the data without timing problems. tee cannot be directly used on a file that keeps replacing itself but could be used once it's put together with the fifo, etc, etc.

Once we finally figure out what you actually do want, changing it to fit shouldn't be hard.

After following this thread for a week, my view is unchanged: you cannot reliably capture data from transient data files using standard unix tools.

Imho 1. The system design needs attention, in particular the application which is writing these data files. You get a completely different scenario if the application writes files with unique file names and closes each file after processing.

Imho 2. This processing should be in the primary system. Too many times we find complex data processing in Shell tools which should be done in a high level database language.
Throughout this thread we have learnt absolutely nothing about the process which writes these transient data files or the reason for intercepting the files, and we have seen no sign of test code, formal testing, or the results of such testing.

Imho 3. This thread is a complete waste of technical resources.

@Corona688, you asked what I want. I want awk code which parses the XML messages and writes output to an output file every minute:

Message example

    [date+time], message=[DATA="<?xml version="1.0"?><data changeMsg><NAME="John Smith"><Age="23"><D.O.B="11-10-1988"> <Gender="Male">"
Output example (the time is the current time):
8:30,Male,23,1
8:31,Female,23,1
8:32,Female,30,4
8:33,Male,50,10
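
From what I have read, gawk's strftime() might let me fold the current minute into the counting key, something like this in place of my agcount line, but I am not sure:

agcount[strftime( "%H:%M" )","values["Gender"]","values["Age"]]++;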

I have awk code which parses the XML message, does the counts, and then writes to an output file. The only problem is that the data is in a temporary file, and I need to capture rapidly changing data from that temporary file.

AWK code

awk '
    {
        gsub( ">", "" );        # strip uneeded junk and make "foo bar" easy to capture
        gsub( " ", "~" );
        gsub( "<", " " );

        for( i = 1; i <= NF; i++ )          # snarf up each name=value pair
        {
            if( split( $(i), a, "=" ) == 2 )
            {
                gsub(  "\"", "", a[2] );
                gsub(  "~", " ", a[2] );
                values[a[1]] = a[2];
            }
        }

        #gcount[values["Gender"]]++;         # collect counts
        #acount[values["Age"]]++;
        agcount[values["Gender"]","values["Age"]]++;

        printf( "%s %s %s %s\n", values["NAME"], values["Age"], values["D.O.B"], values["Gender"] );
    }

    END {
        printf( "\nSummary\n" );
        for( x in agcount )
            printf( "%s,%d\n", x, agcount[x] ) | "sort";
    }
' input-file

If I understood the fifo approach right, then this is what I did.
Create a backup.tar

  tar -cpf /path/to/backup.tar /path/to/outputfile

Then create the outputfile as a fifo:

  mkfifo /path/to/outputfile

Then process the messages using a while loop:

  while true
  do
  awk '
    {
        gsub( ">", "" );        # strip uneeded junk and make "foo bar" easy to capture
        gsub( " ", "~" );
        gsub( "<", " " );

        for( i = 1; i <= NF; i++ )          # snarf up each name=value pair
        {
            if( split( $(i), a, "=" ) == 2 )
            {
                gsub(  "\"", "", a[2] );
                gsub(  "~", " ", a[2] );
                values[a[1]] = a[2];
            }
        }

        #gcount[values["Gender"]]++;         # collect counts
        #acount[values["Age"]]++;
        agcount[values["Gender"]","values["Age"]]++;

        printf( "%s %s %s %s\n", values["NAME"], values["Age"], values["D.O.B"], values["Gender"] );
    }

    END {
        printf( "\nSummary\n" );
        for( x in agcount )
            printf( "%s,%d\n", x, agcount[x] ) | "sort";
    }
' input-file

done

But I am not sure about this part:

while true
do
        cat /path/to/outputfile > /tmp/$$
        echo "Received data from /path/to/outputfile"
        # process /tmp/$$ as you please since it won't vanish from under you
done
  

Should I write the data from the feed to the backup file, then use the awk code to read from the backup file and write to the output file, and then sleep 10 seconds? The same process again and again?

My only problem with this is writing to different files, which means I will have all this data in different files which I don't need to keep.

If I got this process wrong, I would be grateful if you could help me with the right process.

Thank you once again and Happy New Year