Help around sort

Hi.
I'm doing some manipulation on the httpd.log file.

Here's the command line I'm using at the moment:

cat /thewaytologs/log/httpd_access_log | awk '{print $3,$1}' | sort -k2,2 | uniq -c | sort -gr > /home/myhome/top_IP.txt
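Annotated with a tiny hypothetical three-line sample (standing in for the real log), the pipeline does this:

```shell
# Sketch of the same pipeline on made-up sample lines
# (real input would be /thewaytologs/log/httpd_access_log).
printf '%s\n' \
  '10.1.2.3 - uid=jroux [ts] "GET /a"' \
  '10.1.2.3 - uid=jroux [ts] "GET /b"' \
  '10.2.3.4 - uid=arousseau [ts] "GET /a"' |
awk '{ print $3, $1 }' |  # emit "uid ip" pairs
sort -k2,2 |              # group identical IPs together
uniq -c |                 # count each distinct "uid ip" pair
sort -gr                  # most frequent pair first
```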

As a result, I get something like this:

But I'd like to have something like this:

Help welcome.
Thanks
Best regards
D.

---------- Post updated at 10:26 AM ---------- Previous update was at 09:35 AM ----------

Of course, any other suggestion to get the stuff working (with some other sort of script) would be welcome :wink:

Could you please post a relevant part of your original log file?

Yep :

and so on (all lines on that structure)

Thanks

Assuming the identity information doesn't contain white space characters:

awk 'END {
  for (U in uid) printf "%s\n%s\n\n", U, uid[U]
  }
{ uid[$3] = uid[$3] ? uid[$3] RS $1 : $1 }
  ' logfile

Use gawk, nawk or /usr/xpg4/bin/awk on Solaris.

Wow - gonna test it right now

radoulov's code can't remove duplicate records.

$ cat logfile
10.1.2.3 - uid=jroux,ou=people,dc=univ-myuniv,dc=fr [12/Jul/2010:15:03:14 +0200] "GET /anURL" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6"
10.1.2.3 - uid=jroux,ou=people,dc=univ-myuniv,dc=fr [12/Jul/2010:15:03:14 +0200] "GET /anURL" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6"
10.2.3.4 - uid=arousseau,ou=people,dc=univ-myuniv,dc=fr [12/Jul/2010:15:34:11 +0200] "GET /anURL" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6"
10.4.3.4 - uid=arousseau,ou=people,dc=univ-myuniv,dc=fr [12/Jul/2010:15:34:11 +0200] "GET /anURL" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6"
10.4.3.4 - uid=arousseau,ou=people,dc=univ-myuniv,dc=fr [12/Jul/2010:15:34:11 +0200] "GET /anURL" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6"
10.5.3.4 - uid=arousseau,ou=people,dc=univ-myuniv,dc=fr [12/Jul/2010:15:34:11 +0200] "GET /anURL" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6"
10.4.3.4 - uid=arousseau,ou=people,dc=univ-myuniv,dc=fr [12/Jul/2010:15:34:11 +0200] "GET /anURL" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6"
$ awk 'END {
   for (U in uid) printf "%s\n%s\n\n", U, uid[U]
   }
 { uid[$3] = uid[$3] ? uid[$3] RS $1 : $1 }
   ' logfile

uid=arousseau,ou=people,dc=univ-myuniv,dc=fr
10.2.3.4
10.4.3.4
10.4.3.4
10.5.3.4
10.4.3.4

uid=jroux,ou=people,dc=univ-myuniv,dc=fr
10.1.2.3
10.1.2.3

Try mine:

 awk '!a[$1 FS $3] {b[$3]=b[$3] FS $1; a[$1 FS $3]=1}
      END {for (i in b) {print i; split(b[i],s," "); for (j in s) print s[j]}}' logfile


uid=arousseau,ou=people,dc=univ-myuniv,dc=fr
10.2.3.4
10.4.3.4
10.5.3.4
uid=jroux,ou=people,dc=univ-myuniv,dc=fr
10.1.2.3

Good point, I didn't think about that!

Another version:

awk 'END { 
  for (U in uid) {
    print U
    for (IP in ip)
      if ( (IP, U) in uip )
        print IP
    }    
  }
{  
  uip[$1, $3]; uid[$3]; ip[$1] 
  }' logfile
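The `(IP, U) in uip` test relies on awk's pseudo-multidimensional subscripts (the indices are joined with `SUBSEP`). A minimal sketch with made-up keys:

```shell
# Referencing uip[ip, uid] (as in the main block above) creates the
# entry; ((i, j) in arr) only tests membership, creating nothing.
awk 'BEGIN {
  uip["10.1.2.3", "uid=jroux"]                       # record a pair
  if (("10.1.2.3", "uid=jroux") in uip) print "seen"
  if (!(("10.9.9.9", "uid=jroux") in uip)) print "not seen"
}'
```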

Hmm, doing

cat /tologs/log/httpd_access_log | awk 'END {for (U in uid) printf "%s\n%s\n\n", U, uid[U]}{ uid[$3] = uid[$3] ? uid[$3] RS $1 : $1 }' | sort | uniq -c | sort -gr > /home/myhome/top_IP.txt

I got :

I've lost my uid in the final print :frowning:

You need only the awk script :slight_smile:
You don't need a pipeline!
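Something like this, i.e. the awk script reading the log directly (sketched here on two sample lines standing in for the real log file, with the output redirection dropped):

```shell
# awk alone groups the IPs under each uid; no cat/sort/uniq needed.
printf '%s\n' \
  '10.1.2.3 - uid=jroux,ou=people,dc=univ-myuniv,dc=fr [ts] "GET /anURL"' \
  '10.2.3.4 - uid=jroux,ou=people,dc=univ-myuniv,dc=fr [ts] "GET /anURL"' |
awk '{ uid[$3] = uid[$3] ? uid[$3] RS $1 : $1 }
     END { for (U in uid) printf "%s\n%s\n\n", U, uid[U] }'
```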

Yep, that's much much better :slight_smile:

Is there a way to sort the result so that the uid with the most IPs ends up at the top of the file (something like a sort?)

You need to paste your expected output.

well, something like

If you need to order the output, Perl will be more appropriate:

perl -lane'
    push @{ $uid{ $F[2] } }, $F[0]
      unless $uip{ $F[0], $F[2] }++;

    END {
        for ( sort { @{ $uid{$b} } <=> @{ $uid{$a} } } keys %uid ) {
            print;
            print join $/, @{ $uid{$_} };
        }
    }' logfile

---------- Post updated at 02:05 PM ---------- Previous update was at 01:26 PM ----------

Just for completeness, with awk, sort and cut it would be something like this:

 awk 'END {
  for (U in uid) {
    n = split(uid[U], t)
    print n, U, U
    for (i = 0; ++i <= n;)
      print n, U, t[i]
    }
  }
{
  uip[$1, $3]++ || (uid[$3] = uid[$3] ? \
                      uid[$3] FS $1 : $1)
  }' logfile |
      sort -rn |
        cut -d\  -f3-
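The pattern here is decorate-sort-undecorate: prefix each line with its numeric key, sort on it, then cut the key back off. In isolation, with hypothetical already-decorated lines:

```shell
# Each line carries "count uid ip"; sort -rn orders by the count,
# cut strips the count back off before the reader sees it.
printf '%s\n' \
  '1 uid=jroux 10.1.2.3' \
  '3 uid=arousseau 10.4.3.4' |
sort -rn |
cut -d' ' -f2-
```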

Finally I understood what dbourrion wants to do.

awk '{a[$3 FS $1]++} END {for (i in a) print a[i], i | "sort -k2,2 -k1,1rn"}' /thewaytologs/log/httpd_access_log | awk '!a[$2] {print $2; a[$2]=1} {print $3}'

The first awk command gives a report similar to /home/myhome/top_IP.txt, but sorted by IP repeat count within each UID:

3 uid=arousseau,ou=people,dc=univ-myuniv,dc=fr 10.4.3.4
1 uid=arousseau,ou=people,dc=univ-myuniv,dc=fr 10.2.3.4
1 uid=arousseau,ou=people,dc=univ-myuniv,dc=fr 10.5.3.4
2 uid=jroux,ou=people,dc=univ-myuniv,dc=fr 10.1.2.3

The second awk gives you the final report you expect:

uid=arousseau,ou=people,dc=univ-myuniv,dc=fr
10.4.3.4
10.2.3.4
10.5.3.4
uid=jroux,ou=people,dc=univ-myuniv,dc=fr
10.1.2.3