Need help with a shell script for finding a certain pattern in a log file

Hello guys,

Being a newbie, I need some help with finding a certain string in a log file of thousands of lines (around 30K lines) and writing the output to a separate file.

Below is a sample line from the file -

10.155.65.5 - - [20/Jan/2011:07:41:58 +0100] "POST /cas/login?post=true&service=http://test.domain.ca:8000/psp/ppf88prd/?cmd=login%26languageCd=CAN%26userid=VP1%26pwd=TEST123 HTTP/1.1" 200 888

What I am after is the part of each line after "&service=" - for now I just need to capture the http:// URL.

Please note that I need the output in a separate file, with each match on its own line (basically neatly arranged).
Another problem is that the log file also contains other entries such as the ones below:

19.489.50.8 - - [25/Jan/2011:00:00:11 +0100] "GET /cas/themes/testdomains/fondCas.jpg HTTP/1.1" 200 29659
17.538.23.034 - - [25/Jan/2011:00:00:12 +0100] "GET /cas/status.jsp HTTP/1.0" 200 104

Those can be skipped, since I am only looking for log lines that contain "&service=" and only want to capture the URL after that pattern, one per line, in a separate file.

I have looked at a lot of threads doing similar things, with many very helpful small one-liners using grep, sed and awk being offered. But being a novice with all of these, I find it almost impossible to tweak them to my requirement, and so I have posted here.

Would really appreciate it if someone could guide me on this.

Thanks,
Andy

Something like this?

sed -n 's!.*service=\(http://[^:]*\):.*!\1!p' logfile > newfile

Truly amazing, Frank!
You just did an awesome job and it worked for me.

Guess I got really excited since this was my first post though I have been visiting this forum for a while now.

If you don't mind, can you please tell me a bit about those regular expressions?
I assume ^ stands for the first line, though I am not sure about !, ', ], p, etc.

I am learning all of this, though it will be at least a good couple of months before I get close to this level. One last favour: I already have a host of online sites and ebooks to go through on general Linux and Unix topics, but if you can refer me to any good books for scripting beginners on both shell and Perl, that would be really great. This is not very urgent though.

Thanks once again,
Andy.

sed -n 's!.*service=\(http://[^:]*\):.*!\1!p' logfile > newfile

You can use a saved substring with \(.*\) which can be recalled with \1

\(http://[^:]*\)

This substring contains the part after "service=" and it starts with "http://".

[^:]* means any number of characters, none of which is a colon.

:.* after the substring matches the next colon and the rest of the line.
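To make that concrete, here is a minimal illustration (a sketch, piping the sample log line quoted earlier in the thread straight into the command). The -n option suppresses sed's automatic printing, ! is simply used as the delimiter of the s command so the slashes in the URL do not have to be escaped, and the trailing p prints only the lines where the substitution succeeded:

echo '10.155.65.5 - - [20/Jan/2011:07:41:58 +0100] "POST /cas/login?post=true&service=http://test.domain.ca:8000/psp/ppf88prd/?cmd=login%26languageCd=CAN%26userid=VP1%26pwd=TEST123 HTTP/1.1" 200 888' |
sed -n 's!.*service=\(http://[^:]*\):.*!\1!p'
# output: http://test.domain.ca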

Here you can find some tutorial links:

http://www.unix.com/answers-frequently-asked-questions/13774-unix-tutorials-programming-tutorials-shell-scripting-tutorials.html

An excellent book for sed and awk:

One more thing I wanted is to remove any duplicate lines from a file of hundreds of lines.

For example, if I have the following:

portal-test-domain-com
portal-test-domain-com
portal-test-domain-com

I just need one line saying portal-test-domain-com instead of 3.

Also I need to remove certain parts of each line.

For example -
abbbs://porta-test-domain.com/wps/urportal/weber/localbr&ticket=ST-385945-1cdbuEe1o57neMuMgIWa-1681078B-DFC2-7B56-E58B-AA15B18411AD&pgtUrl=https

I need to remove the leading abbbs and remove everything after domain.com.

If someone can help me, that would be really great, since I am still learning these things and it will be at least a good couple of weeks before I can do them on my own.

Thanks in advance,
Andy.

---------- Post updated at 05:05 PM ---------- Previous update was at 04:30 PM ----------

OK, I have just managed to remove the first http:// part from ALL lines of the log file, and have also removed the https part from ALL lines, so the output is much better for now.

The only part remaining for now is to find and remove all duplicate URLs as mentioned above. Will post a fix here if I find any; till then all suggestions are welcome.

Thanks,
Andy

To remove duplicates from your output you do something like:

<commands> | awk '!a[$0]++' 
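For what it's worth, the way that works (a quick sketch with made-up input): the array a counts how many times each complete line ($0) has been seen, and !a[$0]++ is true - so the line is printed - only the first time a line appears:

printf 'portal-test-domain-com\nportal-test-domain-com\nportal-test-domain-com\n' | awk '!a[$0]++'
# output: portal-test-domain-com   (printed only once)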

Sorry for bothering you - I just did that, and tried a few sed and uniq options to play around with as well.

Thanks Frank for all your help; amazing how much one can learn in less than 24 hours. This was VERY helpful for a newbie like me.

Best Regards,
Andy

No problem, you're welcome.

I need some more help, people, and will be posting here shortly.

Thanks in advance,
Andy

---------- Post updated at 12:59 PM ---------- Previous update was at 12:58 PM ----------

OK, I must admit I have another small problem to tweak.
Initially I wanted only the URL portion of a line from the log file; however, what I am after now is a slightly longer URL, as in the line below.

10.155.65.5 - - [20/Jan/2011:07:41:58 +0100] "POST /cas/login?post=true&service=http://test.domain.ca:8000/psp/ppf88prd/?cmd=login%26languageCd=CAN%26userid=VP1%26pwd=TEST123 HTTP/1.1" 200 888

From that I need http://test.domain.ca:8000/psp/ instead of just http://test.domain.ca as before.

Can someone please help with how I achieve that? I did try playing with the earlier command Frank gave me (below), though unfortunately it didn't work.

sed -n 's!.*service=\(http://[^:]*\):.*!\1!p' logfile > newfile

One more request: I need a script that does the same thing across around 20 files instead of just 1 file as it is now. I am doing it manually at the moment, though I need some help with this too.
The file names are something like below -

abc-accent.2011-01-20.log
abc-accent.2011-02-21.log
abc-accent.2011-03-22.log
abc-accent.2011-04-23.log
abc-accent.2011-05-24.log

Again, thanks once again for all help offered already!!!
Cheers,
Andy

---------- Post updated at 01:08 PM ---------- Previous update was at 12:59 PM ----------

OK, I just found another script that may require some tweaking, which I will do later; could someone please confirm whether it will suit the purpose? (Again, I will be testing it on a test box in a while to be sure.)
It basically searches and replaces strings across files matching a certain pattern in a directory we specify.

#!/bin/bash
OLD="xyz"                      # string to search for
NEW="abc"                      # replacement string
DPATH="/home/you/foo/*.txt"    # files to process (a glob, so it is left unquoted in the loop)
BPATH="/home/you/bakup/foo"    # backup directory
TFILE="/tmp/out.tmp.$$"        # temporary work file

[ ! -d "$BPATH" ] && mkdir -p "$BPATH"

for f in $DPATH
do
  if [ -f "$f" ] && [ -r "$f" ]; then
    /bin/cp -f "$f" "$BPATH"                                # keep a backup copy first
    sed "s/$OLD/$NEW/g" "$f" > "$TFILE" && mv "$TFILE" "$f" # replace in a temp file, then move it back
  else
    echo "Error: Cannot read $f"
  fi
done
/bin/rm -f "$TFILE"
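As an aside, on systems with GNU sed the same search-and-replace can be done in place, which avoids the temporary file (a sketch; the -i.bak suffix keeps a backup copy next to each file instead of in a separate backup directory):

sed -i.bak "s/$OLD/$NEW/g" /home/you/foo/*.txt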

Any suggestions, as always, are quite welcome.

Thanks,
Andy

Try this one:

sed -n 's!.*service=\(http://[^/]*/[^/]*/\).*!\1!p' file
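In this pattern [^/]* matches up to (but not including) the next slash, so http://[^/]*/[^/]*/ captures the host (and port) plus the first path segment. A quick check (a sketch, using a shortened version of the sample line from earlier in the thread):

echo 'POST /cas/login?post=true&service=http://test.domain.ca:8000/psp/ppf88prd/?cmd=login HTTP/1.1' |
sed -n 's!.*service=\(http://[^/]*/[^/]*/\).*!\1!p'
# output: http://test.domain.ca:8000/psp/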

You can use something like this to loop through the abc-accent log files:

ls abc-accent*.log |
while read file
do
  sed -n 's!.*service=\(http://[^/]*/[^/]*/\).*!\1!p' "$file"
done > URL_file
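Since sed accepts more than one file argument, the same thing can also be done without a loop (a sketch, assuming all the logs match abc-accent*.log):

sed -n 's!.*service=\(http://[^/]*/[^/]*/\).*!\1!p' abc-accent*.log > URL_file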

You have been my saviour again Frank and the first script worked for me!!

I am yet to test the loop part due to some meetings coming up. I just can't find the words when people provide so much help so quickly!!

I will post complete information on this thread shortly for all other newbies like me.

Cheers,
Andy

---------- Post updated 02-10-11 at 11:23 AM ---------- Previous update was 02-09-11 at 06:27 PM ----------

OK, the following is what worked for me, after MUCH needed help from Franklin!!

I used the same script provided by Franklin to get my URLs filtered -

sed -n 's!.*service=\(http://[^/]*/[^/]*/\).*!\1!p' file

However, I still haven't tested the script part provided by Franklin yet and I will post the output later.

Another problem I faced while filtering the output from Franklin's command was that a few URLs had BIG strings with special characters, and I had to use the following to get rid of them (to strip &, ? and ' ' basically, AND have the results sorted):

 
cat old.file | awk -F \& '{print $1}' | awk -F \? '{print $1}' |awk -F ' ' '{print $1}'| sort -u > output.txt
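For reference, the same filtering can be done in a single awk that deletes everything from the first &, ? or space on each line (a sketch):

awk '{ sub(/[&? ].*/, ""); print }' old.file | sort -u > output.txt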

Thanks guys for all your support. I have another issue cropping up, which I will put in the next post, since otherwise this one will be way too big.

Cheers,
Andy

---------- Post updated at 11:28 AM ---------- Previous update was at 11:23 AM ----------

One more problem: while trying to remove duplicate lines, I need to treat lower and upper case in the URLs carefully, as below -

For the following duplicate lines, I need to keep only one URL of each, since currently they are all being treated as UNIQUE URLs.
(Note: the different IPs don't matter, since I am only concerned with the lower and upper case letters.)

 
56.555.72.69/crm_ababcdves/
81.745.42.59/CRM_Ababcdves/
38.475.62.19/squitv3/
92.625.42.89/Squitv3/
37.288.30.12/cview/
63.598.30.89/Cview/
85.048.30.52/CView/

So final output should be -

 
56.555.72.69/crm_ababcdves/
38.475.62.19/squitv3/
37.288.30.12/cview/
 

If someone can help me with this, that would be really great, since I am still a newbie at these things, though getting better over the past few days.

Cheers,
Andy

Is this what you're looking for?

$ cat file
56.555.72.69/crm_ababcdves/&kllk?jkjk op
81.745.42.59/CRM_Ababcdves/&kllk?jkjk op
38.475.62.19/squitv3/
92.625.42.89/Squitv3/&kllk?jkjk op
37.288.30.12/cview/
63.598.30.89/Cview/&kllk?jkjk op
85.048.30.52/CView/
$
$ awk -F"/" '{
  gsub("[ &,?].*",x)
  s=toupper($2)
}
s in a {next}
{a[s]}1' file
56.555.72.69/crm_ababcdves/
38.475.62.19/squitv3/
37.288.30.12/cview/
$

Not sure how I run that?
It looks like a small .sh script to me.
So please tell me whether I should save it as a .sh file and run it against the file with all the duplicate entries to get the final output?

Regards,
Andy

awk -F"/" '{
  gsub("[ &,?].*",x)
  s=toupper($2)
}
s in a {next}
{a[s]}1' file

This is a command to get rid of the characters &, ? and ' ' and to remove the duplicates.

The example above shows the content of a file with those characters, the awk command and the final output.

You could use the command in a script or on the command line.
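If it is easier, here is the same command saved as a small script (a sketch - the name dedupe_urls.sh is just an example; pass the file with the URLs as the first argument and redirect the output wherever you like):

#!/bin/sh
# dedupe_urls.sh - keep the first occurrence of each URL, ignoring case
awk -F"/" '{
  gsub("[ &,?].*",x)     # strip everything from the first space, &, comma or ?
  s=toupper($2)          # build a case-insensitive key from the part after the IP
}
s in a {next}            # skip lines whose key has been seen before
{a[s]}1' "$1"

Run it as: sh dedupe_urls.sh file > newfile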

Thanks Franklin. I actually just added some more commands to convert all the upper case URLs to lower case and then remove the duplicates, using the following -

 
ls -1 | tr '[A-Z]' '[a-z]' | sort | uniq -c | grep -v " 1 "

I think we can also use

 
ls | sort -f | uniq -i -d

Though I am not quite sure about that.
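For what it's worth, my understanding is that uniq -i compares lines case-insensitively and -d prints only the duplicated ones, so to keep one copy of each line it would be used without -d (a sketch, reading the URLs from a file rather than from ls):

sort -f urls.txt | uniq -i > unique_urls.txt     # one copy of each line, ignoring case
sort -f urls.txt | uniq -i -d                    # show only the lines that had duplicates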

Once again, thanks for all your replies; they have indeed been VERY helpful to me.

Cheers,
Andy

Hi Franklin,

Just an update on your help with the script the other day.

Amazing job once again, and it has helped me a LOT!!

ls abc-accent*.log |
while read file
do
  sed -n 's!.*service=\(http://[^/]*/[^/]*/\).*!\1!p' "$file"
done > URL_file

Hopefully I won't bug everyone so often now.
You can mark this as "Resolved" now if there is such an option.

Cheers,
Andy