The first column is the site, the second is the date (the whole of 2014), the third is the time (from 00:00 to 23:00 for each day), and the fourth and fifth columns are values. I need to compare columns 4 and 5 based on the condition below:
For each site (column 1), if column 4 is more than 3 times column 5, and this pattern lasts for 3 or more consecutive hours, and the maximum value within that run is higher than 100, print all the lines that meet the criteria and count how many such cases exist for each site. There are around 150 sites in total, and each site has hourly data for every day. Here is the output I want:
Please use code tags as per forum rules for commands/Inputs/codes which you use in your posts.
Could you please try the following and let me know if it helps?
site Date time value1 value2
0023 2014-01-01 05:00 80.0 20.3 1
0023 2014-01-01 06:00 90.0 20.0 2
0023 2014-01-01 07:00 180.0 20.0 3
I have tested it only against your sample Input_file, not against many scenarios. If you have more conditions, please post them with a sample Input_file and the expected output in code tags and let me know.
EDIT: One more thing I wanted to know: what should happen when consecutive records meet the other conditions but their site IDs are NOT the same? My code above does not take care of that case.
So if you want that case handled differently, please give us more details on your requirement. Many permutations and combinations are possible here, so a clear requirement is a must.
EDIT2: Adding a non-one-liner form of the same solution now.
@RavinderSingh13, thank you so much for the help. However, I got the error "previous: Event not found." I searched for "awk keyword previous" but didn't find anything helpful. Would you please explain it further? I really appreciate it.
Sorry, I couldn't understand the error. Could you please describe it more clearly, with complete information on your requirement and on how you are getting the error? As for an explanation of the code, the following may help.
awk 'NR==1{                 ##### Header line of Input_file.
    print;                  ##### Print the header as it is.
    next                    ##### Skip the remaining statements for this line.
}
{
    split($3, A, ":");      ##### Split the time field on ":" so that A[1] holds the hour.
    if($4/$NF>=3){          ##### Field 4 is at least 3 times the last field ($NF) of the line.
        if(site_id==$1){    ##### Same site as on the previously remembered line.
            count++         ##### Increase the counter by one.
        };
        if(!previous) {     ##### No hour remembered yet (previous is empty or 0).
            previous=A[1]   ##### Start from the hour of the current line.
        };
        if(A[1]-previous==1){   ##### Current hour is exactly one hour after the remembered hour.
            P=P?P ORS $0 OFS count:$0 OFS count;   ##### Append the current line and count to the buffer P.
            Q++;            ##### Q tracks how many consecutive lines have been buffered.
            previous=A[1];  ##### Remember the current hour for the next comparison.
            site_id=$1      ##### Remember the current site.
        }
        else {              ##### Hours are not consecutive.
            previous=A[1];  ##### Remember the current hour.
            site_id=$1      ##### Remember the current site.
        }
    }
    else {                  ##### The ratio condition $4/$NF>=3 is NOT true.
        previous=A[1];      ##### Remember the current hour for the next comparison.
        P=Q=""              ##### The run is broken, so empty the buffer P and reset Q.
    };
    if(Q==3) {              ##### Three consecutive lines have been buffered.
        print P;            ##### Print the buffered lines.
        P=""                ##### Empty P so that old lines are not printed again.
    };
}
' Input_file                ##### Name of the Input_file.
It looks like Ravinder accidentally used a single quote in a comment inside a single-quoted script in post #4 in this thread. But, the diagnostic you have shown us doesn't seem to have come from any code Ravinder suggested.
Are you using csh again and getting errors from it mistakenly trying to use its history mechanism inside a single quoted script?
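For what it is worth, both quoting problems mentioned here (an apostrophe ending a single-quoted script early, and csh or tcsh treating ! as a history reference even inside single quotes) can be sidestepped by keeping the awk program in its own file and running it with awk -f. A minimal sketch, assuming a POSIX sh; the file names demo.awk and demo.txt and the toy program, which only counts distinct sites, are made up for illustration:

```shell
# Write the awk program to a file. The quoted 'EOF' stops the shell from
# interpreting anything inside the here-document.
cat > demo.awk <<'EOF'
# An apostrophe in a comment is harmless here: it's only awk that reads it.
NR == 1 { next }        # skip the header line
!seen[$1]++ { n++ }     # the "!" never reaches an interactive shell
END { print n }
EOF

# Two data lines for two different (hypothetical) sites.
printf '%s\n' 'site Date time value1 value2' \
    '0023 2014-01-01 05:00 80.0 20.3' \
    '0024 2014-01-01 05:00 10.0 20.0' > demo.txt

result=$(awk -f demo.awk demo.txt)
echo "$result"
```

Run from any shell, only the short command line awk -f demo.awk demo.txt is typed interactively, so there is nothing for history expansion or quote matching to trip over.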
None of your input meets those requirements ("more than"). Assuming each line represents one hour increments, the line number (NR) is relied upon. If that is NOT correct, add some algorithms to account for the time, but take care of "crossing midnight", which makes the calculation more difficult. Making use of the NR assumption, try
awk '
function PRT() {if (C3 && MR) {++CNT
for (i=1; i<=LC; i++) print LN[i], CNT
}
}
$1 != SITE {PRT()
SITE = $1
LC = CNT = MR = C3 = 0
}
$4/$5 > 3 {LN[++LC] = $0
if (!ST) ST = NR - 1
if (NR - ST > 3) C3 = 1
if ($4 > 100) MR = 1
next
}
{PRT()
ST = LC = MR = C3 = 0
}
END {PRT()
}
' file
It does not produce any output as none of your requirements are met by the input sample.
If you replace if (NR - ST > 3) by if (NR - ST >= 3) , the result
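Regarding the "crossing midnight" caveat: one way to drop the NR assumption is to turn the date and time fields into an absolute hour number, so that two lines are consecutive exactly when the numbers differ by 1. The sketch below is only an illustration; the helper name abs_hour and the sample lines are made up, and it uses the standard Julian Day Number formula rather than anything posted in this thread.

```shell
result=$(printf '%s\n' \
    '0023 2014-01-01 22:00 400 100' \
    '0023 2014-01-01 23:00 400 100' \
    '0023 2014-01-02 00:00 400 100' \
    '0023 2014-01-02 05:00 400 100' |
awk '
# Hypothetical helper: hours since a fixed origin, from YYYY-MM-DD and HH:MM.
function abs_hour(date, time,   d, t, a, y, m, jdn) {
    split(date, d, "-")
    split(time, t, ":")
    a = int((14 - d[2]) / 12)
    y = d[1] + 4800 - a
    m = d[2] + 12 * a - 3
    # Julian Day Number of the calendar date (Gregorian calendar).
    jdn = d[3] + int((153 * m + 2) / 5) + 365 * y \
          + int(y / 4) - int(y / 100) + int(y / 400) - 32045
    return jdn * 24 + t[1]
}
{
    h = abs_hour($2, $3)
    # Compare every line with the one before it.
    if (NR > 1) print $2, $3, (h - prev == 1 ? "consecutive" : "gap")
    prev = h
}')
echo "$result"
```

With this sample, the 23:00 and 00:00 lines are both reported as consecutive and the 05:00 line as a gap, so a run that spans midnight is not broken.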
Thank you, Runic and Don Cragun. I tried the new code as suggested, but there is a problem if I use the new input below:
010730023 2014-06-30 07:00 78.0 21.8081
I am not sure why my code did not work for you; as shown below, it works fine for me. Could you please try this one (which is the same as in my previous posts) and let me know how it goes?
Following is the Input_file:
@RavinderSingh13, sorry for the late reply. I tried the same code and input as you used, and I only get: site Date time value1 value2
No data comes out.
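For comparison, here is a minimal sketch of the requirement as stated in the first post, not a replacement for either script above. It rests on several assumptions: "maximum" is taken to mean the largest value1 in a run, the appended number is the case number within the site, a run may continue across midnight only from 23:00 to 00:00 of a different date, and the names emit, buf, peak and the sample file runs.txt are made up.

```shell
cat > runs.txt <<'EOF'
site Date time value1 value2
0023 2014-01-01 04:00 50.0 20.0
0023 2014-01-01 05:00 80.0 20.3
0023 2014-01-01 06:00 90.0 20.0
0023 2014-01-01 07:00 180.0 20.0
0023 2014-01-01 08:00 30.0 20.0
0024 2014-01-01 05:00 70.0 20.0
0024 2014-01-01 06:00 80.0 20.0
0024 2014-01-01 07:00 90.0 20.0
EOF

result=$(awk '
# Print the buffered run if it is a case (3+ hours, peak above 100), then reset.
function emit(   i) {
    if (n >= 3 && peak > 100) {
        cases[site]++
        for (i = 1; i <= n; i++) print buf[i], cases[site]
    }
    n = 0; peak = 0
}
NR == 1 { next }                       # skip the header line
{
    split($3, t, ":"); hour = t[1] + 0
    # Same site and the next hour, allowing 23:00 -> 00:00 on a new date.
    adjacent = ($1 == site) && (($2 == day && hour == last + 1) || \
               ($2 != day && hour == 0 && last == 23))
    if (!adjacent) emit()              # a gap or a new site ends the run
    site = $1; day = $2; last = hour
    if ($5 > 0 && $4 > 3 * $5) {       # value1 is more than 3 times value2
        buf[++n] = $0
        if ($4 > peak) peak = $4
    } else emit()                      # a failing line ends the run
}
END {
    emit()
    for (s in cases) print s, cases[s], "case(s)"
}
' runs.txt)
echo "$result"
```

In this sample, site 0023 has one qualifying run (05:00 to 07:00, peak 180.0), while the three-hour run for site 0024 is dropped because its peak of 90.0 does not exceed 100.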