Compute value from more than three consecutive rows

kathy_wang · April 15, 2016, 11:41pm

Hello all, I am working on a file like below:

site Date time value1 value2
0023 2014-01-01 00:00 32.0 23.7
0023 2014-01-01 01:00 38.0 29.9
0023 2014-01-01 02:00 85.0 26.6
0023 2014-01-01 03:00 34.0 25.3
0023 2014-01-01 04:00 37.0 23.8
0023 2014-01-01 05:00 80.0 20.3
0023 2014-01-01 06:00 90.0 20.0
0023 2014-01-01 07:00 180.0 20.0
0023 2014-01-01 08:00 30.0 20.0

The first column is site, second column is date (whole year of 2014), and third represent time (from 00:00 to 23:00 for each day), fourth and fifth columns are values. I need to compare column 4 and 5 based on the condition below:

For each site (column 1), if column 4 is more than 3 times of columns 5, and this pattern last for equal or more than 3 hours continually, plus the maximum of them must be higher than 100, print all the lines that meet the standard and count how many cases exist for each site. There are totally around 150 sites and each site has hourly data each day. Here is the output I want:

0023 2014-01-01 05:00 80.0 20.3 1
0023 2014-01-01 06:00 90.0 20.0 1
0023 2014-01-01 07:00 180.0 20.0 1
0023 2014-06-30 23:00 200.0 30.3 2
0023 2014-07-01 00:00 303.0 30.3 2
0023 2014-07-01 01:00 134.0 30.3 2
0025 2014-07-01 01:00 136.0 25.3 1           
0025 2014-07-01 02:00 116.0 25.3 1
0025 2014-07-01 03:00 106.0 25.3 1

Any help is highly appreciated!

RavinderSingh13 · April 16, 2016, 12:32am

Hello Kathy wang,

Please use code tags as per forum rules for commands/Inputs/codes which you use in your posts.
Could you please try following and let me know if this helps you.

awk 'NR==1{print;next} {split($3, A,":");if($4/$NF>=3){if(site_id==$1){count++};if(!previous){previous=A[1]};if(A[1]-previous==1){P=P?P ORS $0 OFS count:$0 OFS count;Q++;previous=A[1];site_id=$1} else {previous=A[1];site_id=$1}} else {previous=A[1];P=Q=""};if(Q==3){print P;P=""};}'  Input_file

Output will be as follows.

site Date time value1 value2
0023 2014-01-01 05:00 80.0 20.3 1
0023 2014-01-01 06:00 90.0 20.0 2
0023 2014-01-01 07:00 180.0 20.0 3

I have not tested it with many scenarios, as per your Input_file I have tested, if you have more conditions and terms please mention them with sample Input_file and expected output into code tags and let me know on same.
EDIT: Also one more thing I wanted to know in case there are records where site ids are NOT same but they are fulfilling the other cases what should we do then? As my code above will not take care of it.
So if you want to remove this kind of condition then please do let us know with more details on your requirement. As there can be lots of permutations and combinations could be make out of this, so clear requirement is must here.
EDIT2: Adding a non-one liner form of solution now for same.

awk 'NR==1{
                print;
                next
          }
          {
                split($3, A,":");
                if($4/$NF>=3){
                                if(site_id==$1){
                                                count++
                                               };
                                if(!previous)  {
                                                previous=A[1]
                                               };
                                if(A[1]-previous==1){
                                                        P=P?P ORS $0 OFS count:$0 OFS count;
                                                        Q++;
                                                        previous=A[1];
                                                        site_id=$1
                                                    }
                                else           {
                                                        previous=A[1];
                                                        site_id=$1
                                               }
                             }
                else         {
                                previous=A[1];
                                P=Q=""
                             };
                if(Q==3)     {
                                print P;
                                P=""
                             };
          }
   '    Input_file

Thanks,
R. Singh

kathy_wang · April 16, 2016, 6:08pm

@RavinderSingh13, thank you so much for help. However I got error "previous: Event not found." I tried to search "awk keyword previous", but didn't get anything helpful. Would you please explain it more? Really appreciate.

RavinderSingh13 · April 16, 2016, 6:35pm

Hello Kathy wang,

Sorry, I couldn't understand the error. Could you please mention it more clear with complete information of your requirement and how you are getting error please. For explaination part of code, following may help you in same then.

awk 'NR==1{                                                                                     ##### When awk is reading very first line of Input_file, then do following actions.
                print;                                                                          ##### print the complete very first line here.
                next                                                                            ##### next is a built in awk keyword, which tells control NOT to go further and skip all next written statements for current(which is very first line) now.
          }
          {
                split($3, A,":");                                                               ##### Now this statement will be executed apart from the first line, I am using split built in function of awk so split 3rd field of the line whose delimiter is ":" colon and storing it into an array named A.
                if($4/$NF>=3){                                                                  ##### Now as per your requirement, I am checking here whenever 4th field is 3 times of $NF(which indicates value of LAST field of each LINE.) field of the line, if this condition is TRUE then do following actions.
                                if(site_id==$1){                                                ##### Here I am checking for a variable named site_id if it has the sae value as previous one or NOT, if it has same value as the previous line ones then execute following statement.
                                                count++                                         ##### Here increasing the value of variable named count one more now. 
                                               };
                                if(!previous)  {                                                ##### Here I am verfiying the value of variable named previous, previous is a variable which will hold the value of your time's(3rd field) 1st value, so that we could make sure the difference between last line(whenever it was satisfying the condition where $4/$NF>=3 is TRUE) and current line's TIME have only 1 hour or min difference.
                                                previous=A[1]                                   ##### Setting up value of array named previous to array A's 1st value here.
                                               };
                                if(A[1]-previous==1){                                           ##### Checking here time differences of the current time's value and the previous time's value, so difference should be one as per your requirement.
                                                        P=P?P ORS $0 OFS count:$0 OFS count;    ##### If above condition is TRUE then I am setting up the value of variable named P to current line's value with the site id's count. Moreover if P already has value then I am making sure P's value should be appended here successfully.
                                                        Q++;                                    ##### Increasing the value of variable named Q here to one, WHERE variable Q is meant for keeping track if 3 consecutive lines have come to satisfy all conditions then it should print the value of P.
                                                        previous=A[1];                          ##### Setting up variable named previous to the array A's 1st value of current line(time value, do do compare operation again for next line.).
                                                        site_id=$1                              ##### Setting up site_id value to $1(first field) of current line.
                                                    }
                                else           {                                                ##### In case difference condition of A[1]-previous is NOT TRUE then perform following actions please. 
                                                        previous=A[1];                          ##### I am setting value of previous variable to A's first value.
                                                        site_id=$1                              ##### Now setting up site_id's value to first field too.
                                               }
                             }
                else         {                                                                  ##### In case condition of $4/$NF>=3 is NOT TRUE then do following actions.
                                previous=A[1];                                                  ##### Setting up variable named previous's value to array A's 1st value for next line's comparisions.
                                P=Q=""                                                          ##### Nulliying the values of variabes named P and Q. Because already condition os FALSE and we need 3 consecutive lines to be satisfied with the conditions so no need of variable named P and Q any value here.
                             };
                if(Q==3)     {                                                                  ##### When variable Q's value is equal to 3 then do following actions.
                                print P;                                                        ##### printing the value of P, which actually will have those 3 consecutive lines which are satisfying all the conditions successfully.
                                P=""                                                            ##### Nullyfing the value of variable P, so that OLD values shouldn't print again while printing the new ones.
                             };
          }
   '    Input_file                                                                              ##### Mentioning the Input_file here.

Hope this helps you.

Thanks,
R. Singh

kathy_wang · April 16, 2016, 8:13pm

@RavinderSingh13, I got error message when I tried to test your script:

% awk 'NR==1{print;next} {split($3, A,":");if($4/$NF>=3){if(site_id==$1){count++};if(!previous){previous=A[1]};if(A[1]-previous==1){P=P?P ORS $0 OFS count:$0 OFS count;Q++;previous=A[1];site_id=$1} else {previous=A[1];site_id=$1}} else {previous=A[1];P=Q=""};if(Q==3){print P;P=""};}'  Inputfile
previous: Event not found.

Thank you very much.

Don_Cragun · April 16, 2016, 11:49pm

It looks like Ravinder accidentally used a single quote in a comment inside a single-quoted script in post #4 in this thread. But, the diagnostic you have shown us doesn't seem to have come from any code Ravinder suggested.

Are you using csh again and getting errors from it mistakenly trying to use its history mechanism inside a single quoted script?

RudiC · April 17, 2016, 5:47am

None of your input meets those requirements ("more than"). Assuming each line represents one hour increments, the line number (NR) is relied upon. If that is NOT correct, add some algorithms to account for the time, but take care of "crossing midnight", which makes the calculation more difficult. Making use of the NR assumption, try

awk '
function PRT()  {if (C3 && MR)  {++CNT
                                 for (i=1;m i<=LC; i++) print LN, CNT
                                }
                }

$1 != SITE      {PRT()
                 SITE = $1
                 LC = CNT = MR = C3 = 0
                }

$4/$5 > 3       {LN[++LC] = $0
                 if (!ST)               ST = NR - 1
                 if (NR - ST > 3)       C3 = 1
                 if ($4 > 100)          MR = 1
                 next
                }
                {PRT()
                 ST = LC = MR = C3 = 0
                }
END             {PRT()
                }
' file

It does not produce any output as none of your requirements are met by the input sample.

If you replace if (NR - ST > 3) by if (NR - ST >= 3) , the result

0023 2014-01-01 05:00 80.0 20.3 1
0023 2014-01-01 06:00 90.0 20.0 1
0023 2014-01-01 07:00 180.0 20.0 1
0023 2014-06-30 23:00 200.0 30.3 2
0023 2014-07-01 00:00 303.0 30.3 2
0023 2014-07-01 01:00 134.0 30.3 2
0025 2014-07-01 01:00 136.0 25.3 1
0025 2014-07-01 02:00 116.0 25.3 1
0025 2014-07-01 03:00 106.0 25.3 1

is what you requested in post#1 (with the input extended by some lines of your desired output - or add matching 4th and higher lines).

kathy_wang · April 18, 2016, 10:30am

@Mr. Cragun and Ravinder, thank you for help!

After I switch to bash , I did get some output with the limited input I gave. However, there is problem using the input below:

site Date time value1 value2
0023 2014-01-01 00:00 32.0 23.7
0023 2014-01-01 01:00 38.0 29.9
0023 2014-01-01 02:00 85.0 26.6
0023 2014-01-01 03:00 34.0 25.3
0023 2014-01-01 04:00 37.0 23.8
0023 2014-01-01 05:00 80.0 20.3
0023 2014-01-01 06:00 90.0 20.0
0023 2014-01-01 07:00 180.0 20.0
0023 2014-01-01 08:00 30.0 20.
0023 2014-01-01 09:00 180.0 20.0
0023 2014-01-01 10:00 130.0 2
0023 2014-01-01 11:00 110.0 2
0023 2014-06-30 23:00 200.0 30.3
0023 2014-07-01 00:00 303.0 30.3
0023 2014-07-01 01:00 134.0 30.3

The output only gave:

site Date time value1 value2
0023 2014-01-01 05:00 80.0 20.3 1
0023 2014-01-01 06:00 90.0 20.0 2
0023 2014-01-01 07:00 180.0 20.0 3
0023 2014-01-01 09:00 180.0 20.0  4
0023 2014-01-01 10:00 130.0 2 2 5
0023 2014-01-01 11:00 110.0 2 2  6

But the desired output should be:

0023 2014-01-01 05:00 80.0 20.3      1
0023 2014-01-01 06:00 90.0 20.0      1
0023 2014-01-01 07:00 180.0 20.0    1
0023 2014-01-01 09:00 180.0 20.0    2
0023 2014-01-01 10:00 130.0 2        2
0023 2014-01-01 11:00 110.0 2        2 
0023 2014-06-30 23:00 200.0 30.3 1
0023 2014-07-01 00:00 303.0 30.3 1
0023 2014-07-01 01:00 134.0 30.3 1

---------- Post updated at 06:30 AM ---------- Previous update was at 05:51 AM ----------

@Rudic, thank you for help.
I am not sure if I understand you correctly or not, but there are tree lines that do match the requirement from the input:

0023 2014-01-01 05:00 80.0 20.3
0023 2014-01-01 06:00 90.0 20.0
0023 2014-01-01 07:00 180.0 20.0

RudiC · April 18, 2016, 10:40am

They don't:

which is four consecutive lines. What you mean (I guess) is "more than or equal"; change the script accordingly as indicated.

kathy_wang · April 18, 2016, 3:08pm

RudiC, thank you. I had edited that as suggested. Here is the output I got when I use bash

awk: division by zero
 input record number 1, file input
 source line number 12

Don_Cragun · April 18, 2016, 4:31pm

Change the 1st line of RudiC's script from:

awk '

to:

awk '
NR == 1         {next
                }

to skip over the header line in your input file.

kathy_wang · April 19, 2016, 2:16pm

Thank you, Runic and Don Cragun, I had tried the new code as suggested, but there is a problem. If I use the new input as below:010730023 2014-06-30 07:00 78.0 21.8081

010730023 2014-06-30 08:00 80.0 21.8081
010730023 2014-06-30 09:00 80.0 21.8081
010730023 2014-06-30 10:00 78.0 21.8081
010730023 2014-06-30 11:00 92.0 21.8081
010730023 2014-06-30 12:00 89.0 21.8081
010730023 2014-06-30 13:00 83.0 21.8081
010730023 2014-06-30 14:00 80.0 21.8081
010730023 2014-06-30 15:00 82.0 21.8081
010730023 2014-06-30 16:00 88.0 21.8081
010730023 2014-06-30 17:00 89.0 21.8081
010730023 2014-06-30 18:00 106.0 21.8081

I got output as:

010730023 2014-06-30 08:00 80.0 21.8081  1

The desired output should be:

010730023 2014-06-30 08:00 80.0 21.8081 1
010730023 2014-06-30 09:00 80.0 21.8081 1
010730023 2014-06-30 10:00 78.0 21.8081 1
010730023 2014-06-30 11:00 92.0 21.8081 1
010730023 2014-06-30 12:00 89.0 21.8081 1
010730023 2014-06-30 13:00 83.0 21.8081 1
010730023 2014-06-30 14:00 80.0 21.8081 1
010730023 2014-06-30 15:00 82.0 21.8081 1
010730023 2014-06-30 16:00 88.0 21.8081 1
010730023 2014-06-30 17:00 89.0 21.8081 1
010730023 2014-06-30 18:00 106.0 21.8081 1

RavinderSingh13 · April 19, 2016, 2:41pm

Hello Kathy,

Not sure why my code is not worked for you, see as follows it works well and fine for me, could you please try this one(which is same as previous posts) and let me know how it goes then.
Following is the Input_file:

cat Input_file
site Date time value1 value2
0023 2014-01-01 00:00 32.0 23.7
0023 2014-01-01 01:00 38.0 29.9
0023 2014-01-01 02:00 85.0 26.6
0023 2014-01-01 03:00 34.0 25.3
0023 2014-01-01 04:00 37.0 23.8
0023 2014-01-01 05:00 80.0 20.3
0023 2014-01-01 06:00 90.0 20.0
0023 2014-01-01 07:00 180.0 20.0
0023 2014-01-01 08:00 30.0 20.0

Then following code I have used in bash .

awk 'NR==1{
                 print;
                 next
           }
           {
                 split($3, A,":");
                 if($4/$NF=3){
                                 if(site_id==$1){
                                                 count++
                                                };
                                 if(!previous)  {
                                                 previous=A[1]
                                                };
                                 if(A[1]-previous==1){
                                                         P=P?P ORS $0 OFS count:$0 OFS count;
                                                         Q++;
                                                         previous=A[1];
                                                         site_id=$1
                                                     }
                                 else           {
                                                         previous=A[1];
                                                         site_id=$1
                                                }
                              }
                 else         {
                                 previous=A[1];
                                 P=Q=""
                              };
                 if(Q==3)     {
                                 print P;
                                 P=""
                              };
           }
    '  Input_file

Output will be as follows.

site Date time value1 value2
0023 2014-01-01 05:00 80.0 20.3 1
0023 2014-01-01 06:00 90.0 20.0 2
0023 2014-01-01 07:00 180.0 20.0 3

Thanks,
R. Singh

RudiC · April 19, 2016, 2:48pm

I have to apologize: in function PRT 's for loop, there's an m too many.

                                 for (i=1;m i<=LC; i++) print LN, CNT
                                          ^--- this one.

Don't ask me how it came there (typo?), nor why it worked with the first sample. Correction: Remove it

                                 for (i=1; i<=LC; i++) print LN, CNT

and try again.

kathy_wang · April 19, 2016, 10:20pm

@Rudic, I really appreciate your help. Now the first code works but with another issue: I just realize the date isn't continuous for some sites. e.g.

0023 2014-01-01 21:00 90.0 20
0023 2014-01-01 22:00 80.0 20
0023 2014-01-01 23:00 130.0 20
0023 2014-01-02 16:00 130.0 20
0023 2014-01-02 17:00 200.0 30.3
0023 2014-01-02 18:00 303.0 30.3
0023 2014-01-02 19:00 136.0 25.3

The output from code is:

0023 2014-01-01 21:00 90.0 20 1
0023 2014-01-01 22:00 80.0 20 1
0023 2014-01-01 23:00 130.0 20 1
0023 2014-01-02 16:00 130.0 20 1
0023 2014-01-02 17:00 200.0 30.3 1
0023 2014-01-02 18:00 303.0 30.3 1
0023 2014-01-02 19:00 136.0 25.3 1

The desired output should be:

0023 2014-01-01 21:00 90.0 20 1
0023 2014-01-01 22:00 80.0 20 1
0023 2014-01-01 23:00 130.0 20 1
0023 2014-01-02 16:00 130.0 20 2
0023 2014-01-02 17:00 200.0 30.3 2
0023 2014-01-02 18:00 303.0 30.3 2
0023 2014-01-02 19:00 136.0 25.3 2

---------- Post updated at 06:20 PM ---------- Previous update was at 06:15 PM ----------

@RudiC, I also tested the second code by put it in the script after adding the first line
#! /bin/bash
Here is what i got:

awk: syntax error at source line 7
 context is
	                 >>>  if($4(NF)= <<< 3){
awk: illegal statement at source line 8
awk: illegal statement at source line 8

RudiC · April 20, 2016, 2:38am

Wouldn't that "last for more than 3 hours continually"

0023 2014-01-01 23:00 130.0 20
0023 2014-01-02 16:00 130.0 20

As stated before,

As for RavinderSingh13's code, you didn't copy his proposal correctly. Compare your erroneous code to his post.

kathy_wang · April 29, 2016, 12:44pm

@RavinderSingh13, sorry to reply late. I did try the same code and input as you used. And I only get:
site Date time value1 value2
No any data coming out.