Hello Everyone!
I trust you are off to a great week! I'm trying to output the name and count of each unique domain found in the files in the current directory, for part of a script I'm building.
Here's what I'm stuck on:
- I need to find the UNIQUE occurrences of domains (*@domain.com) in ALL files in a directory.
- I need to output:
  uniquedomain1.com = 1234 occurrences
  uniquedomain2.com = 12345 occurrences
  ... etc
- Every file includes ONE domain per line, and the format of the surrounding text is inconsistent and random. What WILL remain consistent is that each line has an email address somewhere in it, with the following syntax: emailaddress@domain.com
Would someone be able to help me figure out how to do this?
Thanks so much
---------- Post updated at 05:30 PM ---------- Previous update was at 04:45 PM ----------
I can run the command below to output a list of UNIQUELY occurring domains:
perl -wne'while(/@[\w\.]+/g){print "$&\n"}' filename | sort -u
Now, how do I, for all files in a directory, display the count of each unique domain per file and then a final TOTAL count, per domain, across all files?
Thanks!
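One possible way to get both the per-file counts and the final totals is sketched below. It is a rough outline, not a finished tool: the function names count_domains and report_directory are made up for illustration, and it assumes that a domain consists only of letters, digits, underscores, dots and hyphens, and that only the first address on each line matters (as in "ONE domain per line").

```shell
#!/bin/sh
# count_domains FILE...
# Print "domain = N occurrences" for the named files taken together.
count_domains() {
    awk '
        # match() sets RSTART/RLENGTH for the first "@domain" on the line;
        # a bare "@" with nothing after it does not match.
        match($0, /@[A-Za-z0-9_.-]+/) {
            count[substr($0, RSTART + 1, RLENGTH - 1)]++
        }
        END {
            for (domain in count)
                printf "%s = %d occurrences\n", domain, count[domain]
        }
    ' "$@" | sort
}

# report_directory [DIR]
# Per-file counts for every regular file in DIR, then the combined totals.
report_directory() {
    dir=${1:-.}
    set --                          # reuse "$@" to collect the file names
    for f in "$dir"/*; do
        [ -f "$f" ] || continue     # skip directories and unmatched globs
        echo "== $f"
        count_domains "$f"
        set -- "$@" "$f"
    done
    [ "$#" -gt 0 ] || return 0      # nothing to total
    echo "== TOTAL"
    count_domains "$@"
}
```

Calling report_directory with no argument reports on the current directory. The output is sorted by domain name; piping it through sort again with other keys would be needed to order by count.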
file 'test':
@www.test.com
@www.test.org
@www.test.com
@www.test.org
@www.test.com
@www.test.com
@www.test.com
@www.test.com
@
command:
perl -we 'my $domains = {};open FH, "<$ARGV[0]"; while (<FH>) {if (/\@([\w\.]+)/){$domains->{$1}+=1;}}foreach my $domain (sort keys %$domains){print "$domain"."=";print $domains->{$domain}."\n";};close FH;' test
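The same code, laid out so it is easier to read:
# Count how often each domain occurs in the file named on the command line.
my $domains = {};
open FH, "<$ARGV[0]";
while (<FH>) {
    # Capture the text after the first "@" on the line.
    if (/\@([\w\.]+)/) {
        $domains->{$1} += 1;
    }
}
# Print "domain=count", sorted by domain name.
foreach my $domain (sort keys %$domains) {
    print "$domain" . "=";
    print $domains->{$domain} . "\n";
}
close FH;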
result:
www.test.com=6
www.test.org=2
Here's what I'm seeing:
Use of uninitialized value $ARGV[0] in concatenation (.) or string at -e line 1, <> line 19273.
readline() on closed filehandle FH at -e line 1.
... once for every line.
Anyway, I made some headway on my own, so please take a look at my code below.
#!/bin/sh
for file in *
do
    if [ -f "$file" ]
    then
        # FOR EACH FILE, OUTPUT THE FILENAME + LINE COUNT
        find $file -print0 | xargs -0 wc -l
        fileLineCount="`wc -w $file`"
        echo $fileLineCount

        #Output unique domains
        perl -wne'while(/@[\w\.]+/g){print "$&\n"}' $file | sort -u > uniques.txt # TO FILE
        #perl -wne'while(/@[\w\.]+/g){print "$&\n"}' $file | sort -u # TO SCREEN

        # Create structures based on individual files
        c=0; while read line; do arrayDomain[c]=`echo "$line"`; let c=$c+1; done < uniques.txt

        arrayDomain_size=${#arrayDomain[*]}

        #ASSIGN 'DOMAIN COUNT' TO THE RELATED ARRAY and OUTPUT COUNT, PER DOMAIN
        #i=0; while[$arrayDomain_size > $i]; do arrayUniqueNum= $(grep -o ${arrayDomain} $file | wc -w); let i=$i+1; done
        max=c
        position=0
        while (( position < max))
        do
            arrayUniqueNum[position]=$(grep -o ${arrayDomain[position]} $file | wc -w)

            if [ ${arrayUniqueNum[position]} -ge 1000 ]
            then
                echo "${arrayDomain[position]} : ${arrayUniqueNum[position]}"
                #echo "\n$((${arrayUniqueNum[position]}/$fileLineCount)*100) %"
            fi
            (( position = position + 1 ))

        done

    fi
done
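One thing to watch in a script like the one above: grep treats its pattern as an unanchored regular expression, so a short entry such as "@r" is also counted wherever a longer domain begins with it, and every "." matches any character. That could explain short entries showing up with large counts, though this is only a guess without seeing the data. The sketch below contrasts the two ways of counting; the helper names count_loose and count_exact are invented for the example.

```shell
#!/bin/sh
# count_loose PATTERN FILE
# Count matches the way "grep -o PATTERN FILE | wc" does: anywhere in a line.
count_loose() {
    grep -o -- "$1" "$2" | wc -l
}

# count_exact DOMAIN FILE
# Extract whole "@domain" tokens first, then count only tokens that are
# exactly DOMAIN (-x: whole line, -F: fixed string, so "." is literal).
count_exact() {
    grep -oE '@[[:alnum:]_.-]+' -- "$2" | grep -cxF -- "$1"
}
```

For a file holding the three lines a@r, b@rest.com and c@rest.com, count_loose '@r' reports 3 while count_exact '@r' reports 1.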
Pretty much everything works, but here are the items I'm COMPLETELY stuck on:
1) Only output the analysis lines IF the count is greater than 1000.
2) For some reason, some output looks like this:
@r : 1052
@s : 2704
@t : 1406
... when it should actually be showing the entire domain. The domains that get output to uniques.txt look fine. I'm not sure why the lines aren't being read in properly or output correctly from arrayDomain[].
3) Output the percentages as well. You'll see the line I commented out in my code:
#echo "\n$((${arrayUniqueNum[position]}/$fileLineCount)*100) %"
I'm not really sure how to format this properly so that it outputs what I need (the percentage of a file that a given domain makes up):
Domain.com : xxxx unique occurrences : 23%
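Two details get in the way of the commented-out percentage line: $(( )) does integer arithmetic, so dividing before multiplying by 100 gives 0 for anything below 100%, and "wc -w $file" prints a word count followed by the file name rather than a bare line count. A minimal sketch that sidesteps both by letting awk do the counting is shown here. The name domain_share is hypothetical, the percentage is taken relative to the number of lines in the file, and the domain pattern is an assumption about what the data looks like.

```shell
#!/bin/sh
# domain_share FILE [MIN]
# Print "domain : N occurrences : P%" for each domain seen more than MIN
# times in FILE (MIN defaults to 1000). P is relative to the line count.
domain_share() {
    file=$1
    min=${2:-1000}
    awk -v min="$min" '
        # Count the first "@domain" on each line, without the "@".
        match($0, /@[A-Za-z0-9_.-]+/) {
            count[substr($0, RSTART + 1, RLENGTH - 1)]++
        }
        END {
            # NR is the total number of lines read from the file.
            for (domain in count)
                if (count[domain] > min)
                    printf "%s : %d occurrences : %.0f%%\n",
                           domain, count[domain], 100 * count[domain] / NR
        }
    ' "$file" | sort
}
```

Run against the nine-line sample file 'test' from earlier in the thread with a threshold of 1 (domain_share test 1), this prints "www.test.com : 6 occurrences : 67%" and "www.test.org : 2 occurrences : 22%".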
Help would be GREATLY appreciated. Thanks for your assistance in advance, you all are a true asset to furthering knowledge and education in the Unix community! I'm sure we can come to a solution together. I'm here to learn from the best~!
Please let me know if this needs clarification at all.
You may be getting that error with my sample if you're still using the '-n' flag.
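The reason is that -n wraps the code in an implicit "while (<>) { ... }" loop: Perl takes the file name off @ARGV to read the file itself, so $ARGV[0] is no longer set inside the code, and the code runs once for every input line. A small demonstration, assuming perl is on the PATH (the helper name show_argv is made up for this example):

```shell
#!/bin/sh
# show_argv FLAGS FILE
# Run a one-liner with the given perl flags and report whether $ARGV[0]
# is still set when the code executes.
show_argv() {
    perl "$1" 'print defined $ARGV[0] ? "defined\n" : "undefined\n"' "$2"
}

# show_argv -we  test   -> code runs once and prints "defined"
# show_argv -wne test   -> code runs once per line of 'test', printing
#                          "undefined" each time
```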
If I get a chance, I'll look at the shell script too.
That line of code worked (I think I entered something incorrectly before). I've included similar functionality in the script (per my previous post). If you'd be so kind as to see what can be done to make the other items happen, that would be FANTASTIC.
I'm really stuck and I'd appreciate the opportunity to learn how to make these other functions happen (there are just a few).
Thanks so much in advance.