Completely new to scripting, question/help?

So I need to write a script that can parse our logs and give me the amount of daily activity per user on our website. Unfortunately I'm still learning the very basics, so please bear with me :eek:. Below is an example snippet from a log to give you an idea of what each entry looks like (the important parts I want extracted are just the date and username):

blahblahblahblah- 05-26@09:31:26:235 INFO (blahblahblahblah) - myorganization.api.ApiHandler-0>getID(blahblahblahblah:"","user_info":{"username":"joe@somecompany.com","orgid":"blahblahblahblah"

So when somebody is on our site and performs activity, clicking through different pages, etc. an entry like above is written to the log for each bit of activity. In each log there can be several different days depending upon the activity (logs rotate based on size).

So far I've got this:

awk -F"\"username\":\"" '{ print $2 }' logs/mycompany.log | awk -F"\"" '{ print $1 }' | sort | uniq -c

This gives me two columns: the number of instances (and hence the amount of user activity) paired with the username. Now I need to associate these with the date, so that for any given day it outputs the username, the activity count, and the date, and writes that to a .csv file. I'm open to any method really; I "think" it shouldn't be too difficult to modify what I have already, but then again I'm new to this and not really sure how to do it right.

Assuming that you want the output sorted by increasing alphanumeric username as the primary sort key and increasing date as the secondary sort key, the following seems to work (assuming your one-line sample is representative of the actual format of your data) without the need for two awk scripts and without the need for uniq:

awk '
{       split($2, d, "@")
        match($0, /.*"username":"/)
        user = substr($0, RLENGTH + 1)
        user = substr(user, 1, index(user, "\"") - 1)
        c[d[1] OFS user]++
}
END {   for(i in c)
                printf("%4d %s\n", c, i)
}' logs/mycompany.log | sort -k3,3 -k2,2

Thanks, yes I would like the output sorted by increasing alphanumeric username as the primary sort key, and increasing date & time as the secondary (I didn't find out until today that they want the time in addition to the date, but it appears right after the date in the logs).

Unfortunately the script you wrote didn't quite work on my actual logs, but that might be because of the snippet I included. I should've included a better sample from the logs, showing that the date/time lines up in the same position in each entry, which might make it easier?

I don't know if maybe there is an easier way to do it, or whether awk is the way to go, so if anybody has a better suggestion I'm certainly open to it.

qtp111659197-5776 - 05-26@09:37:34:240 INFO  (TimingInfoProxy.java:41)     - com.mycompany.api.ApiHandler-0>getUniqueDataBySource(data,{"has_values":false,"last_event_triggered":"","user_info":{"username":"joe@mycompany.com","orgid":"d96467a7-9786-47e1-9c12-bb40f9bfc65d","ip":"127.0.0.1"},"date_range":{"min_date":"","start_date":"","end_date":"","trending_start_date":"","trending_end_date":""},"terms":{"and_filtering":[]}},) 
qtp111659197-5785 - 05-26@09:37:35:100 INFO  (TimingInfoProxy.java:41)     - com.mycompany.api.ApiHandler-0>getDifferentUniqueDataBySource(differentdata,{"has_values":false,"last_event_triggered":"","user_info":{"username":"joe@mycompany.com","orgid":"d96467a7-9786-47e1-9c12-bb40f9bfc65d","ip":"127.0.0.1"},"date_range":{"min_date":"","start_date":"","end_date":"","trending_start_date":"","trending_end_date":""},"terms":{"and_filtering":[]}},) 

---------- Post updated at 11:37 AM ---------- Previous update was at 10:44 AM ----------

Also, each entry actually is four lines long when you cat the log, rather than just a single line as it appears when I paste it into here.

Try an adaptation of Don Cragun's fine proposal:

awk '
        {match($0, /username[^,]*/)
         user = substr($0, RSTART+11, RLENGTH-12)
         c[$3 OFS user]++
        }
END     {for(i in c) printf("%4d %s\n", c, i)
        }
' file
   1 05-26@09:37:35:100 joe@mycompany.com
   1 05-26@09:37:34:240 joe@mycompany.com

As there is no sample of your four-line log entries, I can't help much there; you'll need to experiment with three getlines (including error handling) to compose a $0 that you can work on with the above.
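
If it helps as a starting point, a rough, untested sketch of that idea (assuming every entry really does span exactly four physical lines and that the first of them carries the timestamp) could look like:

awk '
{       rec = $0
        for (n = 1; n <= 3; n++)                # pull in the next three physical lines
                if ((getline nxt) > 0)
                        rec = rec " " nxt
                else
                        break                   # ran out of input; keep what we have
        $0 = rec                                # re-split so $3, match() etc. see the whole entry
        # ... the matching/counting code from above goes here ...
}
' file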


Yes. When counting fields in awk, there is a huge difference between:

blahblahblahblah-

and:

blahblahblahblah -
        or
qtp111659197-5776 -
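
For example, feeding those two prefixes through awk shows where the date lands (the trailing ... just stands in for the rest of the line):

echo 'blahblahblahblah- 05-26@09:31:26:235 INFO ...' | awk '{print $2}'
05-26@09:31:26:235
echo 'qtp111659197-5776 - 05-26@09:37:34:240 INFO ...' | awk '{print $2, $3}'
- 05-26@09:37:34:240

With the first format the date is field $2; with the second, the lone - becomes $2 and the date shifts to $3.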

When you want help from a computer scientist, details about input file format are crucial!

I understood the reason for counting the number of log entries per day for a given user. But, I must be missing the point of counting the number of log entries related to a user on the same date and time. Do you really have multiple log entries for a given user being created in the same millisecond?

Are you really saying that each of the lines shown in your latest sample contains four newline characters? Or (making lots of wild assumptions) are you saying that four lines on your screen are used for each line when you cat the log because each line is somewhere between 360 and 480 characters long (assuming a 120 character line length on your screen) and your terminal is wrapping the output onto 4 screen lines? (Your two sample lines have 430 and 447 characters, respectively, including the terminating newline characters.) Show us the output from the command:

head -n8 logs/mycompany.log | od -bc

or, if that command fails saying that the -n option isn't recognized:

head -8 logs/mycompany.log | od -bc

so we have a better chance of understanding what your log file format is.


Sorry about that; after your initial response I realized that what I omitted as useless data was in fact important data for this task :slight_smile:

Sorry for the confusion on this too; the request was sent to me in an email from a person who doesn't understand how we log information and wasn't particularly clear - he's trying to gather info on our customers' usage patterns on our site. Now that I stop and think about it, activity per customer per day should be sufficient. Later on they may want timestamps to see if there are times of day when customers are more active, but I'm not worried about that for now.

You are correct, if I do a

wc -l

on a sample file containing a complete entry it is in fact a single line.

So, with the new details about your input file format (and using some of RudiC's suggestions to optimize string handling), does:

awk '
{	split($3, d, "@")
	match($0, /"username":"[^"]*"/)
	user = substr($0, RSTART + 12, RLENGTH - 13)
	c[d[1] OFS user]++
}
END {	for(i in c)
		printf("%4d %s\n", c, i)
}' logs/mycompany.log | sort -k3,3 -k2,2

do what you're trying to do?


It returns the data I need; the only other issue is that it also returns counts for entries that aren't user related, as the log files do contain entries that have nothing to do with user activity.

HUGE thanks to both of you!

Do the log entries that you want to exclude from the report contain "username":"email@address" strings? If they do, how do you identify the log entries you want to select or the log entries you want to exclude?

Basically, this is another request for a clear definition of your input file format.

The log entries that I would like excluded do not contain

"username":"email@address"

So basically I only want entries that do contain

"username":"email@address"

to be displayed in the output.

Try

awk '
match($0, /username[^,]*/)    {c[$3 OFS  substr($0, RSTART+11, RLENGTH-12)]++
                              }
END                 {for(i in c) printf("%4d %s\n", c[i], i)
                    }
' file
   2 05-26@09:37:34:240 joe@mycompany.com

---------- Post updated at 19:43 ---------- Previous update was at 19:39 ----------

or, to remove the time-of-day,

awk '
match($0, /username[^,]*/)      {c[substr($3, 1, index($3, "@")-1) OFS  substr($0, RSTART+11, RLENGTH-12)]++
                                }
END                             {for(i in c) printf("%4d %s\n", c[i], i)
                                }
' file
   2 05-26 joe@mycompany.com

Thanks RudiC, the output was mostly correct; however, some lines of output were just blank space after the date (I added the commas so later I can output to csv easily):

 3, 05-22 ,

Hmm, I'm learning a lot as we go along. I've been breaking down what DonC wrote (RudiC's will be next, probably tonight when I get home), and I think I just need to be able to tell it not to output any line that returns a 0 (no match found) when this is run:

match($0, /"username":"[^"]*"/)

I just haven't figured out how to tell it that, maybe I am thinking about this the wrong way.

You're learning quickly. I also submitted something that did it slightly differently, but RudiC posted his before I posted mine, so I deleted it. I am reposting mine here (with changes for CSV format output without the spaces around the commas), guessing that something in your log data contains the string username but does not match the pattern "username":"user@host" (which the code below treats differently than RudiC's suggestion does):

awk '
match($0, /"username":"[^"]*"/) {
	split($3, d, "@")
	user = substr($0, RSTART + 12, RLENGTH - 13)
	c[d[1] "," user]++
}
END {	for(i in c)
		printf("%d,%s\n", c, i)
}' logs/mycompany.log | sort -t, -k3,3 -k2,2

This can still produce output like you've shown above, if a log entry contains the literal string "username":"" (and maybe also if the second quoted string contains blanks). So, let us know if the above code still produces unwanted output and we can try to track down the log entries causing the problem and eliminate them from consideration.
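
One quick way to check whether such entries exist (just an example command, adjust the file name) is:

grep -c '"username":""' logs/mycompany.log

which counts the lines that contain an empty username.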

Awesome, that worked perfectly.

Many thanks again to both Don and Rudi!

---------- Post updated 05-29-15 at 11:24 AM ---------- Previous update was 05-28-15 at 08:06 PM ----------

Ok, so I'm back, the CEO has changed how he wants the output sorted/presented.

He wants it primarily sorted alphanumerically by organization, which is the second part of the username (the part after the "@").

Then the secondary sort key would be increasing date. And finally a third sort key would be alphanumeric using just the user without the org part.

So thinking it through, Don, if I were to use your last piece of code:

awk '
match($0, /"username":"[^"]*"/) {
        split($3, d, "@")
        user = substr($0, RSTART + 12, RLENGTH - 13)
        c[d[1] "," user]++
}
END {   for(i in c)
                printf("%d,%s\n", c, i)
}'      liquidgrids.log | sort -t, -k3,3 -k2,2

And here is a sample entry again:

qtp111659197-5776 - 05-26@09:37:34:240 INFO  (TimingInfoProxy.java:41)     - com.mycompany.api.ApiHandler-0>getUniqueDataBySource(data,{"has_values":false,"last_event_triggered":"","user_info":{"username":"joe@mycompany.com","orgid":"69d467a7-9786-47e1-9c12-bb40f9bfc65d","ip":"127.0.0.1"},"date_range":{"min_date":"","start_date":"","end_date":"","trending_start_date":"","trending_end_date":""},"terms":{"and_filtering":[]}},) 

We would have to split the user key into two pieces, user (in this example joe) and org (in this example mycompany without the .com)? And then change the sort order of the keys at the end of the script? Or would this require a second script?

What's driving people crazy is moving targets and creeping specs. Mayhap you can talk the CEO into valuing the time spent up front?

Hint: use sub() (or split()) to split the user and/or remove its ".com", and yes, change the sort keys. All in the above script.
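
For illustration only (a quick throwaway, not the full solution), splitting on both characters at once gives you the pieces:

awk 'BEGIN {split("joe@mycompany.com", e, "[@.]"); print e[1], e[2], e[3]}'
joe mycompany com

so e[1] is the user, e[2] the organization, and the trailing com lands in e[3] where it can simply be ignored (assuming the user part itself never contains a dot).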

Agreed. :b:

Thanks for the hint. So here's what I did below:

awk '
match($0, /"username":"[^"]*"/) {
        split($3, d, "@")
        user = substr($0, RSTART + 12, RLENGTH - 17)
        split(user, e, "@")
        c[e[2] "," d[1] "," e[1]]++
}
END {   for(i in c)
                printf("%d,%s\n", c, i)
}'      company.log | sort -t, -k2,2 -k3,3 -k4,4

This works; now I just need the first column in the output (which is the total number of entries per user) to become the fourth column of the output.

I don't see anything in the above script that gets rid of the .com. And I don't think I understand:

  • what you're trying to count,
  • the order of the output fields, nor
  • the desired output sort order (I think the sort keys are org, date, and user; but I'm not sure).

I assume that even though joe@mycompany.com and joe@yourcompany.org both have user joe, you don't really want counts for joe, do you? What about joe@abc.com, joe@abc.net, and joe@abc.org? Do you really believe that abc.com, abc.net, and abc.org are all addresses for the abc organization, or could these be three different organizations? Does your CEO really want to strip off the .com, .net, and .AnyOtherThing; or was that another knee-jerk reaction without thinking about the consequences?

If you could give us about a 10 line sample log file and show us the exact output that your CEO wants from that sample, we might be willing to make one more attempt at a working solution.

Ok sorry, let's see if I can clarify with the following log sample:

 
 qtp111659197-5776 - 05-26@09:37:34:240 INFO  (TimingInfoProxy.java:41)     - com.mycompany.api.ApiHandler-0>getUniqueDataBySource(data,{"has_values":false,"last_event_triggered":"","user_info":{"username":"alan@companyA.com","orgid":"69d467a7-9786-47e1-9c12-bb40f9bfc65d","ip":"127.0.0.1"},"date_range":{"min_date":"","start_date":"","end_date":"","trending_start_date":"","trending_end_date":""},"terms":{"and_filtering":[]}},)
 qtp111659197-5776 - 05-26@09:38:34:240 INFO  (TimingInfoProxy.java:41)     - com.mycompany.api.ApiHandler-0>getUniqueDataBySource(data,{"has_values":false,"last_event_triggered":"","user_info":{"username":"alan@companyA.com","orgid":"69d467a7-9786-47e1-9c12-bb40f9bfc65d","ip":"127.0.0.1"},"date_range":{"min_date":"","start_date":"","end_date":"","trending_start_date":"","trending_end_date":""},"terms":{"and_filtering":[]}},) 
 qtp111659197-5776 - 05-26@09:39:34:240 INFO  (TimingInfoProxy.java:41)     - com.mycompany.api.ApiHandler-0>getUniqueDataBySource(data,{"has_values":false,"last_event_triggered":"","user_info":{"username":"bryan@companyB.com","orgid":"69d467a7-9786-47e1-9c12-bb40f9bfc65d","ip":"127.0.0.1"},"date_range":{"min_date":"","start_date":"","end_date":"","trending_start_date":"","trending_end_date":""},"terms":{"and_filtering":[]}},) 
 qtp111659197-5776 - 05-26@09:40:34:240 INFO  (TimingInfoProxy.java:41)     - com.mycompany.api.ApiHandler-0>getUniqueDataBySource(data,{"has_values":false,"last_event_triggered":"","user_info":{"username":"carl@companyA.com","orgid":"69d467a7-9786-47e1-9c12-bb40f9bfc65d","ip":"127.0.0.1"},"date_range":{"min_date":"","start_date":"","end_date":"","trending_start_date":"","trending_end_date":""},"terms":{"and_filtering":[]}},) 
 qtp111659197-5776 - 05-26@09:41:34:240 INFO  (TimingInfoProxy.java:41)     - com.mycompany.api.ApiHandler-0>getUniqueDataBySource(data,{"has_values":false,"last_event_triggered":"","user_info":{"username":"dan@companyB.com","orgid":"69d467a7-9786-47e1-9c12-bb40f9bfc65d","ip":"127.0.0.1"},"date_range":{"min_date":"","start_date":"","end_date":"","trending_start_date":"","trending_end_date":""},"terms":{"and_filtering":[]}},) 
 qtp111659197-5776 - 05-27@09:37:34:240 INFO  (TimingInfoProxy.java:41)     - com.mycompany.api.ApiHandler-0>getUniqueDataBySource(data,{"has_values":false,"last_event_triggered":"","user_info":{"username":"alan@companyA.com","orgid":"69d467a7-9786-47e1-9c12-bb40f9bfc65d","ip":"127.0.0.1"},"date_range":{"min_date":"","start_date":"","end_date":"","trending_start_date":"","trending_end_date":""},"terms":{"and_filtering":[]}},) 
 qtp111659197-5776 - 05-27@09:38:34:240 INFO  (TimingInfoProxy.java:41)     - com.mycompany.api.ApiHandler-0>getUniqueDataBySource(data,{"has_values":false,"last_event_triggered":"","user_info":{"username":"bryan@companyB.com","orgid":"69d467a7-9786-47e1-9c12-bb40f9bfc65d","ip":"127.0.0.1"},"date_range":{"min_date":"","start_date":"","end_date":"","trending_start_date":"","trending_end_date":""},"terms":{"and_filtering":[]}},) 
 qtp111659197-5776 - 05-27@09:39:34:240 INFO  (TimingInfoProxy.java:41)     - com.mycompany.api.ApiHandler-0>getUniqueDataBySource(data,{"has_values":false,"last_event_triggered":"","user_info":{"username":"carl@companyA.com","orgid":"69d467a7-9786-47e1-9c12-bb40f9bfc65d","ip":"127.0.0.1"},"date_range":{"min_date":"","start_date":"","end_date":"","trending_start_date":"","trending_end_date":""},"terms":{"and_filtering":[]}},) 
 qtp111659197-5776 - 05-27@09:40:34:240 INFO  (TimingInfoProxy.java:41)     - com.mycompany.api.ApiHandler-0>getUniqueDataBySource(data,{"has_values":false,"last_event_triggered":"","user_info":{"username":"dan@companyB.com","orgid":"69d467a7-9786-47e1-9c12-bb40f9bfc65d","ip":"127.0.0.1"},"date_range":{"min_date":"","start_date":"","end_date":"","trending_start_date":"","trending_end_date":""},"terms":{"and_filtering":[]}},)
 qtp111659197-5776 - 05-27@09:41:34:240 INFO  (TimingInfoProxy.java:41)     - com.mycompany.api.ApiHandler-0>getUniqueDataBySource(data,{"has_values":false,"last_event_triggered":"","user_info":{"username":"dan@companyB.com","orgid":"69d467a7-9786-47e1-9c12-bb40f9bfc65d","ip":"127.0.0.1"},"date_range":{"min_date":"","start_date":"","end_date":"","trending_start_date":"","trending_end_date":""},"terms":{"and_filtering":[]}},) 
 

So basically I need to export the data into a spreadsheet per his requirements. He wants to be able to see the data sorted first by organization, then by date, then by user alphanumerically (within that organization), and finally the number of entries for that user on that date.

So for the first user with an entry in the sample log, alan@companyA.com, he had two log entries on 05-26 and one entry on 05-27. So output for just him would look like this:

 
 companyA,05-26,alan,2
 companyA,05-27,alan,1
 

And this is what it would look like for the entire 10 line sample log:

 
 companyA,05-26,alan,2
 companyA,05-26,carl,1
 companyA,05-27,alan,1
 companyA,05-27,carl,1
 companyB,05-26,bryan,1
 companyB,05-26,dan,1
 companyB,05-27,bryan,1
 companyB,05-27,dan,2 
 

---------- Post updated at 03:32 PM ---------- Previous update was at 02:26 PM ----------

FWIW this seems to work:

 
 awk '
match($0, /"username":"[^"]*"/) {
        split($3, d, "@")
        user = substr($0, RSTART + 12, RLENGTH - 17)
        split(user, e, "@")
        c[e[2] "," d[1] "," e[1]]++
}
END {   for(i in c)
                printf("" "%s,%d\n", i, c)
}'      liquidgrids.log | sort -t, -k1,1 -k2,2 -k3,3

 

The way I got rid of the .com from the username was changing

RLENGTH - 13

into

RLENGTH - 17

Not sure if that's a proper way to go about this or if it would cause problems that I'm unaware of.

This does what you asked for this time:

awk '
match($0, /"username":"[^"]*"/) {
	split($3, d, "@")
	email = substr($0, RSTART + 12, RLENGTH - 13)
	split(email, e, "[@.]")
	c[e[2] "," d[1] "," e[1]]++
}
END {	for(i in c)
		printf("%s,%d\n", i, c)
}' file | sort -t, -k1,3
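
And to get that into an actual .csv file with a header row, one way (just a sketch; the header names and the activity.csv file name are placeholders you can change) would be:

{ printf 'org,date,user,count\n'
  awk '
  match($0, /"username":"[^"]*"/) {
	split($3, d, "@")
	email = substr($0, RSTART + 12, RLENGTH - 13)
	split(email, e, "[@.]")
	c[e[2] "," d[1] "," e[1]]++
  }
  END {	for(i in c)
		printf("%s,%d\n", i, c[i])
  }' file | sort -t, -k1,3
} > activity.csv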

Thanks again Don! :b: