Process file every minute

Hey guys,

How can I schedule this awk script so that it runs every minute? New data will keep being added to the file, and I need to capture that data and save it to an output file.

I tried using sleep 1 to allow time for the file to be processed, but it is not working.
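
For reference, I know a crontab entry could run a script every minute, something like this (the script path is just a placeholder):

 * * * * * /path/to/parse.sh >> /path/to/outputfile 2>&1

but I am worried about missing data between runs. Here is my awk script: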


 awk '
      {
          gsub( ">", "" );        # strip uneeded junk and make "foo bar" easy to capture
          gsub( " ", "~" );
          gsub( "<", " " );
   
          for( i = 1; i <= NF; i++ )          # snarf up each name=value pair
          {
              if( split( $(i), a, "=" ) == 2 )
              {
                  gsub(  "\"", "", a[2] );
                  gsub(  "~", " ", a[2] );
                  values[a[1]] = a[2];
              }
          }
   
          #gcount[values["Gender"]]++;         # collect counts
          #acount[values["Age"]]++;
          agcount[values["Gender"]","values["Age"]]++;
   
          printf( "%s %s %s %s\n", values["NAME"], values["Age"], values["D.O.B"], values["Gender"] );
      }
   
      END {
          printf( "\nSummary\n" );
          for( x in agcount )
              printf( "%s,%d\n", x, agcount[x] ) | "sort";
      }
  ' input-file


Thank you all

Instead of scheduling it to run (assuming you like what the code does now):

tail -f inputfile | 
awk '
      {
          gsub( ">", "" );        # strip uneeded junk and make "foo bar" easy to capture
          gsub( " ", "~" );
          gsub( "<", " " );
   
          for( i = 1; i <= NF; i++ )          # snarf up each name=value pair
          {
              if( split( $(i), a, "=" ) == 2 )
              {
                  gsub(  "\"", "", a[2] );
                  gsub(  "~", " ", a[2] );
                  values[a[1]] = a[2];
              }
          }
   
          #gcount[values["Gender"]]++;         # collect counts
          #acount[values["Age"]]++;
          agcount[values["Gender"]","values["Age"]]++;
   
          printf( "%s %s %s %s\n", values["NAME"], values["Age"], values["D.O.B"], values["Gender"] );
      }
   
      END {
          printf( "\nSummary\n" );
          for( x in agcount )
              printf( "%s,%d\n", x, agcount[x] ) | "sort";
      }
  '   > reportfile

Note the changes: the tail -f pipe at the top and the redirection to reportfile at the bottom. tail -f keeps sending your awk code any data added to the file, so this way you don't have to worry about missing data.

The only problem with this is that the data is only stored in the file for a couple of seconds, and I need to process it as soon as it arrives. Old data is deleted and replaced by new data.

That's an awkward thing to have to handle. Would it be possible to replace the input file with a fifo, so you could simply do

while true
do
        awk ... < /path/to/fifo
done

to have the data fed into awk directly?
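
Fleshed out a little, it might look like this (the fifo path and the awk script name parse.awk are placeholders):

[ -p /path/to/fifo ] || mkfifo /path/to/fifo    # create the named pipe if it isn't there yet

while true
do
        awk -f parse.awk < /path/to/fifo >> reportfile    # each writer open/close delivers one burst
done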

Beware that if your while loop seizes up, so does your other program...

Imho reliable processing of transient data is impossible in Shell.
This is a system design error.
Imho you need to modify the application which is writing the file and do the processing in the application.

Hmm. Sounds like a process to capture a rapidly changing temporary file?

Yes, the data in the file is temporary and I need to capture it as soon as it arrives.

All I require is to check the current time and capture all the data for that minute.

I am new to AWK scripting so I don't know that much about it, but is this achievable using AWK and ksh?

If not, can I please get advice on what I can use to get the required result?

Thank you all

Have you considered my suggestion of using a fifo?

Hi Corona688,

Yes, I am looking at the First In, First Out approach, but I'm not sure how to add it to my script.

Do I need to rewrite the script again to add the fifo to it?

I'm not sure what you're asking...

The idea is to remove the output file and replace it with a fifo:

# create a backup with sane permissions in case this doesn't work
tar -cpf /path/to/backup.tar /path/to/outputfile

rm /path/to/outputfile
mkfifo /path/to/outputfile

Then, whenever something tries to open and replace that file, you'll be able to read from it.

while true
do
        cat /path/to/outputfile > /tmp/$$
        echo "Received data from /path/to/outputfile"
        # process /tmp/$$ as you please since it won't vanish from under you
done
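
($$ expands to the shell's process ID, so each run of this loop script gets its own scratch file under /tmp.)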

This is assuming it's truncating and replacing the file in the manner I expect.

If it isn't then, like methyl said, "reliable processing of transient data is impossible in Shell". I'd extend that to mean that processing transient data isn't something you should be doing, and if there isn't any other way to get the data, that's a severe flaw in whatever program you're using.

@James Owen
In this context when working with basic tools, trying to capture the contents of a transient temporary file is not a sensible system design option. Neither is trying to read any file which may be open for write by another application.

Btw. "fifo" is a unix technical term for a Named Pipe.

What other design options do you have available which involve using more sophisticated tools?

Hey guys, I have tried to use the fifo but no luck; it is not doing what I want.

@methyl

I'm not sure if I understand your question correctly. Would Perl be a better choice than awk?

If so, can I please get help? Here is a basic Perl script which parses the XML messages:

while(<>){                      # read from STDIN or files named on the command line
    if(/.*NAME="([^"]*).*Age="([^"]*).*D\.O\.B="([^"]*).*Gender="([^"]*).*/){
        print $1," ",$2," ",$3," ",$4,"\n";
        $age_hash{$2}++;
        $gender_hash{$4}++;
    }
}
print "\n";
foreach my $age(sort {$a<=>$b} keys %age_hash){
    print $age,": ",$age_hash{$age},"\n";
}
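
(I run it as perl parse.pl input-file, where parse.pl is just what I named the script.)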


Please help!

What exactly is it doing?

It's more like: what is it not doing?

Your fifo suggestion creates a backup file which stores the data and lets me keep it and process it, which is not what I am after. All I want is to read from one file and write to an output file.

Thanks for trying to help.

I am not sure if this will help, but maybe try using another output file (a temporary file) to store the data, then process the data from that file using the awk code you have. This is similar to the fifo approach, but this is how I usually process files.

You could also create a new file (base the file name on process_id, username, or session_id, and time of day) each time the original program does a write, then have the report program process and delete the file.
Or, "tee" the original output to two files, so that one is a log file, then keep track of the last record number processed in the log file.

That sounds like a trivial modification of what I already gave you... If you don't want to process it -- don't.

while true
do
        cat /path/to/outputfile
done > logfile

Instead of a temporary file being replaced all the time, the output will be appended to logfile.

The Unix tee command looks interesting, except one question: to be able to use this command, will I need an output file that stores all my output, and then process that file using the tee command?

Am I right, or have I got this wrong?

Also, if I have a file which collects all the data and I then process the file every minute using the awk code, is there a way to delete the data that has been processed, then do the next minute's data, and so on?

Sorry, guys, to be bothering you all with my issue and taking up a lot of your time.

Thank you all again.

tee doesn't necessarily have to write to a file. What it does depends on what you do with it.

You've had answer after answer thrown at you and keep saying "that doesn't do what I want".

You are not helping clarify what you do want, or we might have been able to give you an answer a week ago. I suspect some of them might actually do what you want already.

If my fifo example does anything -- it's still unclear whether you even tried -- I continue to think it would be a good foundation. It would allow you to capture the data without timing problems. tee cannot be directly used on a file that keeps replacing itself but could be used once it's put together with the fifo, etc, etc.

Once we finally figure out what you actually do want, changing it to fit shouldn't be hard.

After following this thread for a week, my view is unchanged: you cannot reliably capture data from transient data files using standard unix tools.

Imho 1. The system design needs attention, in particular the application which is writing these data files. You get a completely different scenario if the application writes files with unique file names and closes each file after processing.

Imho 2. This processing should be in the primary system. Too many times we find complex data processing in Shell tools which should be done in a high level database language.
Throughout this thread we have learnt absolutely nothing about the process which writes these transient data files or the reason for intercepting the files, and we have seen no sign of test code, formal testing, or the results of such testing.

Imho 3. This thread is a complete waste of technical resources.

@Corona688, you asked what I want. I want awk code which parses the XML messages and writes output to an output file every minute:

Message example

    [date+time], message=[DATA="<?xml version="1.0"?><data changeMsg><NAME="John Smith"><Age="23"><D.O.B="11-10-1988"> <Gender="Male">"
Output example (the time is the current time):
8:30,Male,23,1
8:31,Female,23,1
8:32,Female,30,4
8:33,Male,50,10
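
From what I have read, gawk's strftime() might let me fold the current minute into the counting key, something like this in place of my agcount line, but I am not sure:

agcount[strftime( "%H:%M" )","values["Gender"]","values["Age"]]++;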

I have awk code which parses the XML message, does the counts, and then writes to an output file. The only problem is that the data is in a temporary file, and I need to capture rapidly changing data from that temporary file.

AWK code

awk '
    {
        gsub( ">", "" );        # strip uneeded junk and make "foo bar" easy to capture
        gsub( " ", "~" );
        gsub( "<", " " );

        for( i = 1; i <= NF; i++ )          # snarf up each name=value pair
        {
            if( split( $(i), a, "=" ) == 2 )
            {
                gsub(  "\"", "", a[2] );
                gsub(  "~", " ", a[2] );
                values[a[1]] = a[2];
            }
        }

        #gcount[values["Gender"]]++;         # collect counts
        #acount[values["Age"]]++;
        agcount[values["Gender"]","values["Age"]]++;

        printf( "%s %s %s %s\n", values["NAME"], values["Age"], values["D.O.B"], values["Gender"] );
    }

    END {
        printf( "\nSummary\n" );
        for( x in agcount )
            printf( "%s,%d\n", x, agcount[x] ) | "sort";
    }
' input-file

If I understood the fifo approach right, then this is what I did.
Create a backup.tar

  tar -cpf /path/to/backup.tar /path/to/outputfile

Then create the outputfile as a fifo:

  mkfifo /path/to/outputfile

Then process the messages using a while loop:

  while true
  do
  awk '
    {
        gsub( ">", "" );        # strip uneeded junk and make "foo bar" easy to capture
        gsub( " ", "~" );
        gsub( "<", " " );

        for( i = 1; i <= NF; i++ )          # snarf up each name=value pair
        {
            if( split( $(i), a, "=" ) == 2 )
            {
                gsub(  "\"", "", a[2] );
                gsub(  "~", " ", a[2] );
                values[a[1]] = a[2];
            }
        }

        #gcount[values["Gender"]]++;         # collect counts
        #acount[values["Age"]]++;
        agcount[values["Gender"]","values["Age"]]++;

        printf( "%s %s %s %s\n", values["NAME"], values["Age"], values["D.O.B"], values["Gender"] );
    }

    END {
        printf( "\nSummary\n" );
        for( x in agcount )
            printf( "%s,%d\n", x, agcount[x] ) | "sort";
    }
' input-file

done

But I am not sure about this part:

while true
do
        cat /path/to/outputfile > /tmp/$$
        echo "Received data from /path/to/outputfile"
        # process /tmp/$$ as you please since it won't vanish from under you
done
  

Should I write the data from the feed to the backup file, then use the awk code to read from the backup file and write to the output file, and then sleep 10 seconds? The same process again and again?

My only problem with this is writing to different files, which means I will have all this data in different files which I don't need to keep.

If I got this process wrong, I would be grateful if you could help me with the right process.

Thank you once again and Happy New Year