Unique entries in multiple files

Hello,

I have a directory with log files (many of them). Lines look like this:

Sep  1 00:05:05 server9 pop3d-ssl: LOGIN, user=abc@example.com, ip=[xxx], port=[63030]
Sep  1 00:05:05 server9 pop3d-ssl: LOGOUT, user=abc@example.com, ip=[xxx], port=[63030], top=0, retr=0, rcvd=12, sent=46, time=0
Sep  1 00:05:05 server9 imapd-ssl: couriertls: connect: Connection reset by peer
Sep  1 00:05:06 server9 pop3d: LOGIN, user=def@example.com, ip=[xxx], port=[35312]
Sep  1 00:05:06 server9 pop3d-ssl: LOGOUT, user=ghi@example.com, ip=[xxx], port=[45887], top=0, retr=0, rcvd=18, sent=238125, time=2
Sep  1 00:05:06 server9 pop3d-ssl: LOGOUT, user=jkl@example.com, ip=[xxx], port=[20521], top=0, retr=0, rcvd=12, sent=39, time=151
Sep  1 00:05:06 server9 pop3d: LOGOUT, user=def@example.com, ip=[xxx], port=[35312], top=0, retr=0, rcvd=24, sent=424, time=0
Sep  1 00:05:07 server9 pop3d-ssl: LOGIN, user=mno@example.com, ip=[xx], port=[50097]
Sep  1 00:05:07 server9 pop3d-ssl: LOGOUT, user=mno@example.com, ip=[xxx], port=[50097], top=0, retr=0, rcvd=29, sent=102, time=0

I need a script that will count unique users that log in, in these files. E.g. if user "abc@example.com" has been detected in file one, there is no need to look for him in the rest of the files. Could you help me?

Replace file.log with your log file names

$ sed -n "s/.*user=\([^,]*\),.*/\1/p" file.log | sort -u | wc -l
5
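
For comparison, the same unique count can be done in awk alone. This is only a sketch; it assumes every user= field is terminated by a comma, as in the sample, and like the sed it counts any user= line, LOGIN or LOGOUT:

awk -F 'user=' 'NF > 1 { sub(/,.*/, "", $2); users[$2] }
                END { n = 0; for (u in users) n++; print n }' file.log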

Hello ramirez987,

The following may help you with the same.

 awk '{match($0,/LOGIN, user=.*com/);A[substr($0,RSTART+12,RLENGTH-12)]} END{for(i in A){if(i){print i}}}'  Input_file
 

Output will be as follows.

mno@example.com
def@example.com
abc@example.com
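
For anyone puzzled by the match()/substr() pair, here is the same one-liner spread out with comments. Note that the pattern only matches addresses that end in a literal "com":

awk '
{
    # match() sets RSTART and RLENGTH; "LOGIN, user=" is 12 characters,
    # so substr() skips it and keeps everything up to the trailing "com"
    match($0, /LOGIN, user=.*com/)
    A[substr($0, RSTART + 12, RLENGTH - 12)]
}
END {
    # lines without a match leave an empty key behind; skip it
    for (i in A) if (i) print i
}' Input_file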
 

EDIT: To get only the count of login IDs, the following may help you.

 awk '{match($0,/LOGIN, user=.*com/);if(substr($0,RSTART+12,RLENGTH-12)){o++}} END{print o}' Input_file
 

Thanks,
R. Singh

Thank you for help.

anbu23, your solution is working, but what if I have many files in this directory and I want to do this for all of them? Let's say that I am trying to find out how many users were logged in during, e.g., the last week, so I will have to check 7 files; but if abc@example.com was logged in yesterday and two days before that, it should count as one...

RavinderSingh13, your code looks elegant, but nothing happens.

Hello ramirez,

sed -n "s/.*user=\([^,]*\),.*/\1/p" file.log | sort -u | wc -l

That code will still do the right thing thanks to sort -u. Repeated logins will be counted as one entry.

If you have file.log, file1.log, file2.log, ..., file6.log

sed -n "s/.*user=\([^,]*\),.*/\1/p" file.log file1.log file2.log file3.log file4.log file5.log file6.log | sort -u | wc -l

or you can make use of some kind of glob

sed -n "s/.*user=\([^,]*\),.*/\1/p" file.log file[1-6].log| sort -u | wc -l

Thanks, but I was thinking more about some loop:

#!/bin/bash
DIR=/usr/src/count_test
FILES=$DIR/*
for i in $FILES
    do
    sed -n "s/.*user=\([^,]*\),.*/\1/p" $FILES | sort -u $i
done

But it doesn't work

Could you explain how it doesn't work? I know why it doesn't, but I am quite sure it is not the reason you think it doesn't.

Perhaps, could you say why this, by itself, would not help?

DIR=/usr/src/count_test
FILES=$DIR/*
sed -n "s/.*user=\([^,]*\),.*/\1/p" $FILES | sort -u

It is printing data from the file instead of counting it.

Well, would this solve your request better? Try:

sed -n "s/.*LOGIN.*user=\([^,]*\),.*/\1/p" $FILES | sort | uniq -c
      2 abc@example.com
      2 def@example.com
      2 mno@example.com

Not really. If I make a copy of the log file in the directory and run:

#!/bin/bash
DIR=/usr/src/count_test
FILES=$DIR/*
for i in $FILES
    do
 sed -n "s/.*LOGIN.*user=\([^,]*\),.*/\1/p" $FILES | sort | uniq -c
done

then I will get:
2 abc@example.com
2 def@example.com
2 mno@example.com
2 abc@example.com
2 def@example.com
2 mno@example.com
Which is not correct. It should be:
4 abc@example.com
4 def@example.com
4 mno@example.com

In the loop, don't use $FILES - use $i, the loop variable.


Sorry, my answer was incomplete. If you insist on the for loop, modify it like this:

for i in $FILES
do
    sed -n "s/.*LOGIN.*user=\([^,]*\),.*/\1/p" $i
done | sort | uniq -c
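
Put back into your earlier script (quoting added so odd file names do not break the loop), that becomes:

#!/bin/bash
# Collect the LOGIN users from every file, then count each address
# across ALL files, because sort | uniq -c runs after the loop ends.
DIR=/usr/src/count_test
for i in "$DIR"/*
do
    sed -n "s/.*LOGIN.*user=\([^,]*\),.*/\1/p" "$i"
done | sort | uniq -c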

I have another problem.

I would like this script to look for "from=<" and "@example.com>" and count the logins between these two strings. E.g.:

Something somethingso from=<abc@xgx.com> methingsomething somethingfrom=<ramirez987@example.com> somethingsomething something
Something somethingso from=<abc@xxx.com> methingsomething somethingfrom=<ramirez987@example.com> somethingsomething something
Something somethingso methingsomething somefrom=<abc@xxx.com> thingfrom=<alison@example.com> somethingsomething something

result:
1 alison
2 ramirez987

And you want the second occurrence in a line, not the first one? Are there always two?


Try

sed -n "s/^.*from=<\([^@]*\)@.*/\1/p" file | sort | uniq -c
      1 alison
      2 ramirez987

Hi,

There is no fixed place in which the address appears, so it is not about the "second occurrence". I just want to list all logins between the strings "from=<" and "@example.com>".

Yes, I overlooked something. Correction:

sed -n "s/^.*from=<\([^@]*\)@example.com.*/\1/p" file

Unfortunately it is still not working.

#!/bin/bash
DIR=/usr/src/count_test3
FILES=$DIR/*
for i in $FILES
do
    sed -n "s/^.*from=<\([^@]*\)@example.com.*/\1/p" $i | sort | uniq -c
done

I've got nothing.

Does it work on the sample in post #12?

Yes, but it doesn't work on other log file.
It can look like this:

Sep 11 12:51:18 servername postfix/qmgr[18292]: 61F981C3A39: from=<ramirez987@example.com>, size=490403, nrcpt=1 (queue active)
Sep 11 12:51:18 servername mimedefang.pl[25646]: times int: 0,0,0,0,0,0 0
Sep 11 12:51:18 servername mimedefang.pl[25646]: checking by antivirus
Sep 11 12:51:18 servername postfix/smtpd[26146]: D3A771C39AA: client=emsdgdsilgw1sdg.passgel.com[xxx.xxx.xxx.xxx]
Sep 11 12:51:18 servername postfix/smtpd[26179]: disconnect from maidgssdggewwle.com[xxx.xxx.xxx.xxx]
Sep 11 12:51:18 servername postfix/virtual[23973]: 61F981C3A39: to=<john@nothing.com>, relay=virtual, delay=2.5, delays=2.4/0/0/0.14, dsn=2.0.0, status=sent (delivered to mai$
Sep 11 12:51:18 servername postfix/qmgr[18292]: 61F981C3A39: removed
Sep 11 12:51:18 servername mimedefang.pl[25646]: times: 0,0,0,0,0
Sep 11 12:51:18 servername mimedefang.pl[25646]: MDLOG,0495B1C3A2D,mail_in,,,<sdfr@newssfer.furn.com>,<lucas@example.com>,=?utf-8?q?Sprawd=C5=BA_aktualne_M$
Sep 11 12:51:18 servername mimedefang.pl[25646]: autoresponse1  sdfr@newssfer.furn.com-> lucas@example.com sdfr@newssfer.furn.com
Sep 11 12:51:18 servername mimedefang.pl[25646]: FILTER_END sender <sdfr@newssfer.furn.com> relayAddr xxx.xxx.xxx.xxx relayHostN: [xxx.xxx.xxx.xxx] helo: mail3.m3.pl rcpt$
Sep 11 12:51:19 servername postfix/qmgr[18292]: 0495B1C3A2D: from=<sdfr@newssfer.furn.com>, size=445850, nrcpt=1 (queue active)
Sep 11 12:51:21 servername postfix/qmgr[18292]: D3A771C39AA: from=<james@example.com>, size=46390, nrcpt=3 (queue active)
Sep 11 12:51:21 servername postfix/virtual[23973]: D3A771C39AA: to=<blablablah@nothing.com>, relay=virtual, delay=3.4, delays=3.4/0.01/0/0.01, dsn=2.0.0, status=sent (delivered to m$
Sep 11 12:51:21 servername postfix/virtual[23973]: D3A771C39AA: to=<abc@nothing.com>, relay=virtual, delay=3.4, delays=3.4/0.01/0/0.02, dsn=2.0.0, status=sent (delive$
Sep 11 12:51:21 servername postfix/virtual[23973]: D3A771C39AA: to=<def@nothing.com>, relay=virtual, delay=3.4, delays=3.4/0.01/0/0.02, dsn=2.0.0, status=sent (deliver$
Sep 11 12:51:21 servername postfix/qmgr[18292]: D3A771C39AA: removed

Works perfectly for me:

sed -n "s/^.*from=<\([^@]*\)@example.com.*/\1/p" file
ramirez987
james

WHAT (and HOW) "doesn't work"?

Well... I have no idea. It is working perfectly fine with the few lines above, but when I try to analyze the full log file, there is no result.
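
A quick way to narrow that down is to check whether the two markers occur in the real file at all, and what the from=< fields actually look like there (the file name below is hypothetical; grep -o needs GNU or BSD grep):

grep -c "from=<" /var/log/maillog              # how many lines have the first marker?
grep -c "@example.com>" /var/log/maillog       # ...and the second?
grep -o "from=<[^>]*>" /var/log/maillog | head # what do the fields really contain?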