Create shell script to extract unique information from one file to a new file.

Hi to all,

I got this content/pattern from file http.log.20110808.gz

[07/Aug/2011:07:37:39 +0800] mail1 httpd[14646]: Account Notice: close [192.168.10.128] igchung@abc.com 2011/8/7 7:37:36 0:00:03 0 0 1
[07/Aug/2011:07:37:44 +0800] mail1 httpd[14647]: Account Information: login [192.168.10.131:17187] sastria9@abc.com proxy sid=gFp4DLm5HnU
[07/Aug/2011:07:37:44 +0800] mail1 httpd[14648]: Account Notice: close [192.168.10.131] sastria9@abc.com 2011/8/7 7:37:44 0:00:00 0 0 1
[07/Aug/2011:07:37:45 +0800] mail1 httpd[14647]: Account Information: login [192.168.10.131:17194] sastria9@abc.com proxy sid=gSiaecABc/E
[07/Aug/2011:07:38:37 +0800] mail1 httpd[14646]: Account Information: login [192.168.10.129:2063] pntcdor1@abc.com proxy sid=ZGhAdmqmz3k
[07/Aug/2011:07:38:37 +0800] mail1 httpd[14647]: Account Notice: close [192.168.10.129] pntcdor1@abc.com 2011/8/7 7:38:37 0:00:00 0 0 1
[07/Aug/2011:07:38:38 +0800] mail1 httpd[14646]: Account Information: login [192.168.10.129:2071] pntcdor1@abc.com proxy sid=PtwbGuIk+I4
[07/Aug/2011:07:38:48 +0800] mail1 httpd[14646]: Account Information: login [192.168.10.130:14272] visnet@abc.com proxy sid=4W6xBKPXXvk
[07/Aug/2011:07:38:48 +0800] mail1 httpd[14647]: Account Notice: close [192.168.10.130] visnet@abc.com 2011/8/7 7:38:48 0:00:00 0 0 1
[07/Aug/2011:07:38:48 +0800] mail1 httpd[14646]: Account Information: login [192.168.10.130:14279] visnet@abc.com proxy sid=/qenNd/tps8
[07/Aug/2011:07:38:59 +0800] mail1 httpd[14646]: Account Notice: close [192.168.10.130] visnet@abc.com 2011/8/7 7:38:48 0:00:11 0 0 1
[07/Aug/2011:07:39:06 +0800] mail1 httpd[14647]: Account Information: login [192.168.10.130:14367] animan86@abc.com proxy sid=VdYyCOMtPsQ

How can I generate a new file with the content below from the file above?

igchung@abc.com
sastria9@abc.com
pntcdor1@abc.com
visnet@abc.com
animan86@abc.com

With grep/sort/uniq:

grep -o "[^ ]*@[^ ]*" http.log.20110808.gz | sort | uniq

With awk:

awk ' /@/ { sub("^.*] ",""); sub(" .*", ""); if(!($0 in E)) print; E[$0]} ' http.log.20110808.gz

Note: if the file is gzipped, as the extension seems to imply, you may need to pipe the output of gzip -d into these solutions.


Hi,

I am using Solaris 10 for this. Is this right?

[root] grep -o "[^ ]*@[^ ]*" http.log.20110801.gz | sort | uniq >1.out
grep: illegal option -- o
Usage: grep -hblcnsviw pattern file . . .
[root] grep -o "[^ ]*@[^ ]*" http.log.20110801.gz | sort
grep: illegal option -- o
Usage: grep -hblcnsviw pattern file . . .
[root] awk ' /@/ { sub("^.*]  ",""); sub(" .*", ""); if(!($0 in E)) print; E[$0]} '  http.log.20110801.gz
awk: syntax error near line 1
awk: illegal statement near line 1
awk: syntax error near line 1
awk: illegal statement near line 1
awk: syntax error near line 1
awk: illegal statement near line 1

Try to use nawk instead of awk.


Still can't; the result is unreadable (binary).


Where should I put the gzip -d?

Like this?

[root|reports.tm.net.my:/data2/mail1/201108] grep -o "[^ ]*@[^ ]*" http.log.20110801.gz | sort | uniq | gzip -d
grep: illegal option -- o
Usage: grep -hblcnsviw pattern file . . .

gzip: stdin: unexpected end of file


[root|reports.tm.net.my:/data2/mail1/201108] awk ' /@/ { sub("^.*] ",""); sub(" .*", ""); if(!($0 in E)) print; E[$0]} ' http.log.20110801.gz | gzip -d
awk: syntax error near line 1
awk: illegal statement near line 1
awk: syntax error near line 1
awk: illegal statement near line 1
awk: syntax error near line 1
awk: illegal statement near line 1

gzip: stdin: unexpected end of file

zcat http.log.20110801.gz | nawk...

OR

gunzip -c http.log.20110801.gz | nawk...

OR

gzip -dc http.log.20110801.gz | nawk ... 
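Putting the pieces together, here is a minimal sketch of the whole pipeline: decompress to stdout first, then filter. The file sample.log.gz and its three lines are made up for illustration (modelled on the log above). It uses plain awk; on Solaris 10 you would probably substitute nawk, as suggested earlier.

```shell
# Build a small gzipped sample log (hypothetical data modelled on the thread).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.log" <<'EOF'
[07/Aug/2011:07:37:39 +0800] mail1 httpd[14646]: Account Notice: close [192.168.10.128] igchung@abc.com 2011/8/7 7:37:36 0:00:03 0 0 1
[07/Aug/2011:07:37:44 +0800] mail1 httpd[14647]: Account Information: login [192.168.10.131:17187] sastria9@abc.com proxy sid=gFp4DLm5HnU
[07/Aug/2011:07:37:44 +0800] mail1 httpd[14648]: Account Notice: close [192.168.10.131] sastria9@abc.com 2011/8/7 7:37:44 0:00:00 0 0 1
EOF
gzip "$tmpdir/sample.log"

# gzip -dc writes the decompressed text to stdout; awk then strips everything
# up to the last "] ", cuts at the next space, and prints each address once.
result=$(gzip -dc "$tmpdir/sample.log.gz" |
  awk '/@/ { sub(/^.*\] /, ""); sub(/ .*/, ""); if (!($0 in seen)) print; seen[$0] }')
echo "$result"
rm -rf "$tmpdir"
```

The decompression has to come before the filter; putting gzip -d at the end of the pipe only feeds it text that was never compressed.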

You can use zgrep also

/user/ahamed> zgrep -o "[^ ]*@[^ ]*" http.log.20110808.gz 
igchung@abc.com
sastria9@abc.com
sastria9@abc.com
sastria9@abc.com
pntcdor1@abc.com
pntcdor1@abc.com
pntcdor1@abc.com
visnet@abc.com
visnet@abc.com
visnet@abc.com
visnet@abc.com
animan86@abc.com

Using sed

gzip -dc http.log.20110801.gz | sed 's/.*] \(.*@.*com\) .*/\1/g' | sort | uniq

regards,
Ahamed
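One thing to watch with the sed approach: a line that does not match the pattern passes through unchanged. A possible variant, sketched below, uses -n with the p flag so that only lines where the substitution succeeded are printed. The sample lines are illustrative, and the "server restarted" line is invented to show the effect.

```shell
# Hypothetical sample: two account lines plus one invented line without an address.
log='[07/Aug/2011:07:38:48 +0800] mail1 httpd[14646]: Account Information: login [192.168.10.130:14272] visnet@abc.com proxy sid=4W6xBKPXXvk
[07/Aug/2011:07:38:50 +0800] mail1 httpd[14646]: server restarted
[07/Aug/2011:07:38:59 +0800] mail1 httpd[14646]: Account Notice: close [192.168.10.130] visnet@abc.com 2011/8/7 7:38:48 0:00:11 0 0 1'

# -n suppresses the default output; the trailing p prints a line only when
# the substitution matched, so the address-free line is dropped.
emails=$(printf '%s\n' "$log" | sed -n 's/.*] \([^ ]*@[^ ]*\) .*/\1/p' | sort -u)
echo "$emails"
```

Here sort -u does the same job as sort | uniq in a single step.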


My god, you guys are pros. It works now, every one of them. Thanks, guys.

---------- Post updated at 05:18 PM ---------- Previous update was at 03:44 PM ----------

Another question: I generated this file, a2.out. How can I generate another file from it with only the unique email addresses listed?

more a2.out
116borrul@bx.com
133fird@b.com
147aedzra@.com
152najib@bx.com
154rshakir@bluehyppo.com
154zadzli@bc.com
155buddin@bx.com
Access to this service for 116borrul@bx.com
Access to this service for 133fird@b.com
Access to this service for 147aedzra@b.com
Access to this service for 152najib@bx.com
Access to this service for 154rshakir@b.com
Access to this service for 154zadzli@bc.com
Access to this service for 155buddin@bx.com

It should look like this:

more uniqueemail.out
116borrul@bx.com
133fird@b.com
147aedzra@.com
152najib@bx.com
154rshakir@bluehyppo.com
154zadzli@bc.com
155buddin@bx.com

Try this:

gzip -dc http.log.20110808.gz | nawk ' /@/ { sub("^.*] ",""); sub(" .*", ""); if(!($0 in E)) print; E[$0]} ' > uniqueemail.out
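If the starting point is a2.out itself rather than the gzipped log, something like the sketch below might also do. It assumes the address is always the last whitespace-separated field on each line, which holds for the listing shown above. The sample content is shortened from that listing.

```shell
# Shortened, illustrative copy of a2.out.
tmpdir=$(mktemp -d)
cat > "$tmpdir/a2.out" <<'EOF'
116borrul@bx.com
133fird@b.com
Access to this service for 116borrul@bx.com
Access to this service for 133fird@b.com
EOF

# $NF is the last field of each line; sort -u keeps one copy of each address.
awk '{ print $NF }' "$tmpdir/a2.out" | sort -u > "$tmpdir/uniqueemail.out"
unique=$(cat "$tmpdir/uniqueemail.out")
echo "$unique"
rm -rf "$tmpdir"
```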

My god, it works perfectly. Thank you so much.

Hi, I have another question,

How can I remove the domain part (@something.com) from a file structured like this one?

-bash-3.00# more 30days.out 
user/ris1@yiris.net/INBOX 
user/ris2@giris.net/INBOX 
user/ris3@iris.net/INBOX 
user/ris4@hiris.net/INBOX 
user/str1@eamyx.com/INBOX 
user/str2@amyx.com/INBOX 
user/tg4@titangroup.com/INBOX  

output should be like this,

-bash-3.00# more 30days.out 
user/ris1/INBOX 
user/ris2/INBOX 
user/ris3/INBOX 
user/ris4/INBOX 
user/str1/INBOX 
user/str2/INBOX 
user/tg4/INBOX

$ sed 's/@.*\//\//' test
user/ris1/INBOX 
user/ris2/INBOX 
user/ris3/INBOX 
user/ris4/INBOX 
user/str1/INBOX 
user/str2/INBOX 
user/tg4/INBOX
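A small variation, in case the paths ever have more than one component after the domain: @.* is greedy and runs to the last slash on the line, whereas @[^/]* stops at the first one. The sketch below also uses | as the delimiter to avoid escaping slashes. The INBOX/Sent path is a made-up example, not from the file above.

```shell
# Two illustrative paths; the second (INBOX/Sent) is hypothetical.
stripped=$(printf '%s\n' \
  'user/ris1@yiris.net/INBOX' \
  'user/tg4@titangroup.com/INBOX/Sent' |
  sed 's|@[^/]*/|/|')   # remove "@domain" up to the first following slash only
echo "$stripped"
```

With the original expression the second line would come out as user/tg4/Sent, because the match extends to the last slash.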

Works perfectly, thank you so much. You helped me a lot.

Hi,

I need help with another question about extracting selected information from one file to a new file, based on some conditions.

The file a.out has the pattern below and lists 3000 users' email addresses. How can I extract only the 1000 addresses with the domain @titangroup.com to a new file, b.out?

more a.out
user/admin/INBOX
user/ris1@iris.net/INBOX
user/ris2@iris.net/INBOX
user/ris3@iris.net/INBOX
user/ris4@iris.net/INBOX
user/str1@streamyx.com/INBOX
user/str2@streamyx.com/INBOX
user/str3@streamyx.com/INBOX
user/str4@streamyx.com/INBOX
user/tg1@titangroup.com/INBOX
user/tg2@titangroup.com/INBOX
user/tg3@titangroup.com/INBOX
user/tg4@titangroup.com/INBOX
user/tmnet1/INBOX
user/tmnet2/INBOX
user/tmnet3/INBOX
user/tmnet4/INBOX

The output should look like this (it should list 1000 users' email addresses), e.g.:

more b.out
user/tg1@titangroup.com/INBOX-----> e.g number 1
user/tg2@titangroup.com/INBOX
.
.
.
user/tg3@titangroup.com/INBOX
user/tg1000@titangroup.com/INBOX -------->e.g  number 1000
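One possible approach, sketched under the assumption that "the 1000 addresses" simply means every line containing @titangroup.com/, capped at 1000: filter with grep and cap with head. The sample a.out below is a shortened copy of the listing above, and the limit variable is only there for illustration.

```shell
# Shortened, illustrative copy of a.out.
tmpdir=$(mktemp -d)
cat > "$tmpdir/a.out" <<'EOF'
user/admin/INBOX
user/ris1@iris.net/INBOX
user/str1@streamyx.com/INBOX
user/tg1@titangroup.com/INBOX
user/tg2@titangroup.com/INBOX
user/tg3@titangroup.com/INBOX
user/tg4@titangroup.com/INBOX
user/tmnet1/INBOX
EOF

limit=1000   # maximum number of lines to keep
# Keep only the titangroup.com mailboxes (the dot is escaped so it matches literally),
# then stop after $limit lines.
grep '@titangroup\.com/' "$tmpdir/a.out" | head -n "$limit" > "$tmpdir/b.out"

count=$(wc -l < "$tmpdir/b.out" | tr -d ' ')
first=$(head -n 1 "$tmpdir/b.out")
echo "$count lines, first: $first"
rm -rf "$tmpdir"
```

If the file holds exactly 1000 titangroup.com entries, the head step changes nothing and can be dropped.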