Text Proccessing with sort,uniq,awk

rollyah · December 21, 2010, 1:36pm

Hello,

I have a log file with the following input:

X , ID , Date, Time, Y
01,01368,2010-12-02,09:07:00,Pass
01,01368,2010-12-02,10:54:00,Pass
01,01368,2010-12-02,13:07:04,Pass
01,01368,2010-12-02,18:54:01,Pass
01,01368,2010-12-03,09:02:00,Pass
01,01368,2010-12-03,13:53:00,Pass
01,01368,2010-12-03,16:07:00,Pass

My goal is to get the number of times ID has a TIME that's after 09:00:00 each DATE.
That would give me two output. one is the number of days ID has been late, and secondly, the day and time this ID has been late .

I've started as such:

sort -t ','  -k 3,3 -k 4,4  file.log  # this will sort the file according to the DATE field as well as the Time fileld.

I'm stuck for the last 30 min to find a way to get the first line of each day (logically it'll be the earliest as i've sorted by date/time previously) once i know how to do this, i'll be able to compare time and proceed..

Can any one help ?
i looked into sort - u and uniq -f3 though i didnt get far with it..
i'm reading a few tutorials on AWK, though as i'm usual a sed person (it usually solved all my needs) i'm finding it a bit more complex..

Franklin52 · December 21, 2010, 1:55pm

Assuming you want the number if times for each id per date:

awk -F, '{gsub(":","",$4)} 
int($4)>90000{a[$2,$3]++} 
END{for(i in a)print i, a}
' file

Use nawk or /usr/xpg4/bin/awk on Solaris.

anurag.singh · December 21, 2010, 2:12pm

 
awk -F'(,|:)' '{if($4+0 >= 9) a[$1," on date ",$3]++}END{for (i in a)print "count of ID",i,"is",a}' inputFile

For input in post #1, output is:

 
count of ID 01 on date 2010-12-02 is 4
count of ID 01 on date 2010-12-03 is 3

rollyah · December 21, 2010, 2:21pm

Thanks for your help, i've gathered to build on your advice and come up with something that works for what i want to accomplish:

 wc -l test;  awk -F , '{if ($4 > "09:10:00") print $2 " was late on", $3 " by coming at ",$4}' test

Output:

       7 test
01368 was late on 2010-12-02 by coming at  10:54:00
01368 was late on 2010-12-02 by coming at  13:07:04
01368 was late on 2010-12-02 by coming at  18:54:01
01368 was late on 2010-12-03 by coming at  13:53:00
01368 was late on 2010-12-03 by coming at  16:07:00

Though the problem with the above output, is that it checks every date/ line with 2010-12-03 instead of just the first (earliest)

is there a way to take the earliest time (according to date) or take the first line sorted by the date field?

Franklin52 · December 21, 2010, 2:35pm

What should be the output of the given data of your first post?

rollyah · December 21, 2010, 2:45pm

Thanks to your help i've reached this step:

 
original data:
 
01,01368,2010-12-02,09:07:00,Pass
01,01368,2010-12-02,10:54:00,Pass
01,01368,2010-12-02,13:07:04,Pass
01,01368,2010-12-02,18:54:01,Pass
01,01368,2010-12-03,09:02:00,Pass
01,01368,2010-12-03,13:53:00,Pass
01,01368,2010-12-03,16:07:00,Pass

 
awk -F , '{if ($4 > "09:10:00") print $2 " was late on", $3 " by coming at ",$4}' test | tee  DaysLate ; wc -l DaysLate

OUTPUT:

 
01368 was late on 2010-12-02 by coming at  10:54:00
 
01368 was late on 2010-12-02 by coming at  13:07:04
 
01368 was late on 2010-12-02 by coming at  18:54:01
 
01368 was late on 2010-12-03 by coming at  13:53:00
 
01368 was late on 2010-12-03 by coming at  16:07:00
 
       5 DaysLate

the only thing missing is to find a way to just take the earliest time of each day.

in other words the above output should be:

   0 DaysLate # as on 12-02 he came in at 09:07 which is before 09:10 and on 12-03 he came in at 09:02 which is also before the set time

anurag.singh · December 21, 2010, 3:17pm

 
sort -t ','  -k 3,3 -k 4,4  file.log | awk -F, '{if($1 != "X" && !a[$3]) {a[$3]++;if($4 < "09:10:00") v="before 09:10"; else v="after 09:10"; print "on",substr($3,6,5),"here came in at",substr($4,0,5),v;}}'

Franklin52 · December 21, 2010, 3:21pm

awk -F, 'BEGIN{c=0}
s!=$3 {
  if($4 > time){
    print $2 "was late on " $3 " by coming at " $4; s=$3
    c++
  }
  else {
    s=$3
  }
}
END{print c " Days Late"}' time="09:01:00" file.log

rollyah · December 21, 2010, 3:50pm

thank you, used this :

as i didn't understand how it's exactly done, so i couldnt get rid of the "before" lines myself.

is it too much if i ask u to explain to me how it's done ?
here's the parts i didnt get:

anurag.singh:

 
sort -t ','  -k 3,3 -k 4,4  file.log | awk -F, '{if($1 != "X" && !a[$3]) {a[$3]++;if($4 < "09:10:00) v="before 09:10"; else v="after 09:10"; print "on",substr($3,6,5),"here came in at",substr($4,0,5),v;}}'

anurag.singh · December 21, 2010, 4:02pm

!a[$3] == >> $3 is date. This is to make sure that we are processing only 1st line for a given date (a[$3] will be set for other reords for same data)
a[$3]++ == >> This is just to set array with index $3.
v="before 09:10" == >> v is a variable, This is used in print statement (last argument to print)
substr == >> substr(string, start_index, lenght_of_substring)

Hope this helps !!
If you don't want before lines, just modify if condition where time value is being checked.

sort -t ',' -k 3,3 -k 4,4 file.log | awk -F, '{if($1 != "X" && !a[$3]) {a[$3]++;if($4 > "09:10:00") print "ID number " $2 " came in on",substr($3,6,5)," at",substr($4,0,5),"after 09:10";}}'

command I gave earlier, (In post #7) doing following:

Process only 1st line for a given date (after sorting the file) and ignore all other lines for same date
Compate time value with "09:10:00" and display before/after message accordingly

If you still stuck, you may post a proper input and proper output.

rollyah · December 21, 2010, 4:29pm

Thank you for your tremendous help, everything works great though i'mm still trying to get my head around this as i really would like to learn awk.
i got a tutorial for advanced scripting and i fast forwarded to AWK though they only had about 1 minute worth of relevant info and nothing this advanced.
so if you have time, if it's not too much trouble can you help me with the below?

if($1 != "X" # can you please explain the logic?
&& !a[$3] # i get that $3 is the 3d field, which is the date, but what does !a[$3] stand for? what does it represent? and why is it in the same if statement?
substr($3,6,5) # how did you manage to remove the year

PS: can you suggest a good place to start about awk ? a tutorial, or something you've managed to gain ur experience from.. (aside daily practice of course)?

anurag.singh:

!a[$3] == >> $3 is date. This is to make sure that we are processing only 1st line for a given date (a[$3] will be set for other reords for same data)

a[$3]++ == >> This is just to set array with index $3.

v="before 09:10" == >> v is a variable, This is used in print statement (last argument to print)

substr == >> substr(string, start_index, lenght_of_substring)

Hope this helps !!
If you don't want before lines, just modify if condition where time value is being checked.
sort -t ',' -k 3,3 -k 4,4 file.log | awk -F, '{if($1 != "X" && !a[$3]) {a[$3]++;if($4 > "09:10:00") print "ID number " $2 " came in on",substr($3,6,5)," at",substr($4,0,5),"after 09:10";}}'
command I gave earlier, (In post #7) doing following:

Process only 1st line for a given date (after sorting the file) and ignore all other lines for same date

Compate time value with "09:10:00" and display before/after message accordingly

If you still stuck, you may post a proper input and proper ouutput.

anurag.singh · December 22, 2010, 4:37am

$1 != "X" == >> $1 is 1st field of every record, This check ensures that we should not process 1st record in inputFIle, The Header Record (X , ID , Date, Time, Y
). But this check is not necessary(And can be removed) in post #10 command. 2nd if condition (check on $4) will not be true for Header record and so that will not be printed.
what does !a[$3] stand for? == >> This checks if a[$3] is NULL OR has any value set in it. if(!a[$3]) is equivalent to if(a[$3]==NULL)
substr($3,6,5) == >> $3 will have year value like 2010-12-03. String indices in awk starts at 1. To get 12-03, start index is 6 (12-03 starts from 6th character) and then it goes upto the end i.e. lenght is 5.
substr($3,6) also gives same result. length is not needed if substring goes upto the last character.

I learned most of stuffs here only (After building few basics) by following posts of experts here Scrutinizer, vgersh99, Franklin52, fpmurphy, scottn, radoulov etc to name a few

Few AWK articles (Many more can be found on internet):
opengroup_awk
grymoire
Utrecht_University_docs
thegeekstuff_awk_with_examples (link removed)

rollyah · February 15, 2011, 5:31am

i'd like to start by thanking you again and again.
however your help is needed once again.
i've almost understood the given awk help though i'm facing trouble with the following records:

whaty ou helped me with is the following:

now i got rid of

as i've managed to remove it from the output file.
though with the samples given above, there's no result.

it's something related to the first condition, though can't seem to pinpoint it
is it due to the way they're sorted?

keep in mind that it work's perfectly with the following initial output ;

01,01368,2010-12-02,09:07:00,Pass
01,01368,2010-12-02,10:54:00,Pass
01,01368,2010-12-02,13:07:04,Pass
01,01368,2010-12-02,18:54:01,Pass
01,01368,2010-12-03,09:02:00,Pass
01,01368,2010-12-03,13:53:00,Pass
01,01368,2010-12-03,16:07:00,Pass

anurag.singh:

$1 != "X" == >> $1 is 1st field of every record, This check ensures that we should not process 1st record in inputFIle, The Header Record (X , ID , Date, Time, Y
). But this check is not necessary(And can be removed) in post #10 command. 2nd if condition (check on $4) will not be true for Header record and so that will not be printed.

what does !a[$3] stand for? == >> This checks if a[$3] is NULL OR has any value set in it. if(!a[$3]) is equivalent to if(a[$3]==NULL)

substr($3,6,5) == >> $3 will have year value like 2010-12-03. String indices in awk starts at 1. To get 12-03, start index is 6 (12-03 starts from 6th character) and then it goes upto the end i.e. lenght is 5.
substr($3,6) also gives same result. length is not needed if substring goes upto the last character.

I learned most of stuffs here only (After building few basics) by following posts of experts here Scrutinizer, vgersh99, Franklin52, fpmurphy, scottn, radoulov etc to name a few

Few AWK articles (Many more can be found on internet):
opengroup_awk
grymoire
Utrecht_University_docs
thegeekstuff_awk_with_examples (link removed)