Need to find occurrences of email domains in all files in a directory

Hello Everyone!

I trust you are off to a great week! I'm trying to output the name and count of each unique domain found in the files in the current directory, for a portion of a script I'm building.

Here's what I'm stuck on:

  • Need to find UNIQUE occurrences of domains (*@domain.com) in ALL files in a directory.

  • Need to output:
    uniquedomain1.com = 1234 occurrences
    uniquedomain2.com = 12345 occurrences
    ... etc

  • Every file includes ONE domain per line, with the format of the surrounding text being inconsistent and random. What WILL remain consistent is that each line will have an email address with the following syntax somewhere in each: emailaddress@domain.com

Would someone be able to help me figure out how to do this?

Thanks so much

---------- Post updated at 05:30 PM ---------- Previous update was at 04:45 PM ----------

I can run the command below to output a list of UNIQUELY occurring domains:
perl -wne'while(/@[\w\.]+/g){print "$&\n"}' filename | sort -u

Now, for all files in a directory, how do I display the count of each unique domain per file, followed by a final TOTAL count per domain across all files?

Thanks!

file 'test':

@www.test.com
@www.test.org
@www.test.com
@www.test.org
@www.test.com
@www.test.com
@www.test.com
@www.test.com
@

command

 perl -we 'my $domains = {};open FH, "<$ARGV[0]"; while (<FH>) {if (/\@([\w\.]+)/){$domains->{$1}+=1;}}foreach my $domain (sort keys %$domains){print "$domain"."=";print $domains->{$domain}."\n";};close FH;'  test

more easily read:

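my $domains = {};
# Three-argument open; stop with a message if the file cannot be read
open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";
while (<$fh>) {
  # Count the first @domain found on each line
  if (/\@([\w.]+)/) {
    $domains->{$1} += 1;
  }
}
close $fh;
foreach my $domain (sort keys %$domains) {
  print "$domain=$domains->{$domain}\n";
}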

result

www.test.com=6
www.test.org=2
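
For the per-file and TOTAL counts asked about earlier, the same tally can be sketched in plain shell. This is only a rough sketch under a few assumptions: count_domains and domain_report are made-up helper names, the character class also allows hyphens (which [\w.] does not), and grep -o is used, which GNU and BSD grep both support but POSIX does not require.

```shell
#!/bin/sh
# Sketch: per-file and overall domain counts for the regular files in a directory.
# count_domains and domain_report are hypothetical helper names.

# Read text on stdin, print one "domain = N occurrences" line per unique domain.
count_domains() {
    # -o prints only the matched text; "-" is last in the class so it is literal
    grep -oE '@[A-Za-z0-9._-]+' |
        sed 's/^@//' |          # drop the leading @
        sort | uniq -c |        # uniq -c prints "count domain"
        awk '{ print $2 " = " $1 " occurrences" }'
}

# usage: domain_report DIRECTORY
domain_report() {
    dir=$1
    for file in "$dir"/*; do
        [ -f "$file" ] || continue
        echo "== $file"
        count_domains < "$file"
    done
    echo "== TOTAL"
    # Feed every regular file through the same counter in one pass
    for file in "$dir"/*; do
        [ -f "$file" ] && cat "$file"
    done | count_domains
}
```

After defining both functions, calling domain_report . reports on the current directory. A line containing only "@" is skipped because the pattern requires at least one character after it.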

Here's what I'm seeing:

Use of uninitialized value $ARGV[0] in concatenation (.) or string at -e line 1, <> line 19273.
readline() on closed filehandle FH at -e line 1.

... once for every line.

Anyway, I made some headway on my own, so please take a look at my code below.

#!/bin/sh
for file in *
do
  if [ -f "$file" ]
  then
    # FOR EACH FILE, OUTPUT THE FILENAME + LINE COUNT
    find $file -print0 | xargs -0 wc -l
    fileLineCount="`wc -w $file`"
    echo $fileLineCount

    #Output unique domains
    perl -wne'while(/@[\w\.]+/g){print "$&\n"}' $file | sort -u > uniques.txt   # TO FILE
    #perl -wne'while(/@[\w\.]+/g){print "$&\n"}' $file | sort -u                # TO SCREEN

    # Create structures based on individual files
    c=0; while read line; do arrayDomain[c]=`echo "$line"`; let c=$c+1; done < uniques.txt

    arrayDomain_size=${#arrayDomain[*]}

    #ASSIGN 'DOMAIN COUNT' TO THE RELATED ARRAY and OUTPUT COUNT, PER DOMAIN
    #i=0; while[$arrayDomain_size > $i]; do arrayUniqueNum= $(grep -o ${arrayDomain} $file | wc -w); let i=$i+1; done
    max=c
    position=0
    while (( position < max))
    do
      arrayUniqueNum[position]=$(grep -o ${arrayDomain[position]} $file | wc -w)

      if [ ${arrayUniqueNum[position]} -ge 1000 ]
      then
        echo "${arrayDomain[position]}  :  ${arrayUniqueNum[position]}"
        #echo "\n$((${arrayUniqueNum[position]}/$fileLineCount)*100) %"
      fi
      (( position = position + 1 ))
    done
  fi
done

Pretty much everything works, but here are the items I'm COMPLETELY stuck on:

1) Only output the analysis lines IF the count is greater than 1000.

2) For some reason, some output looks like this:
@r : 1052
@s : 2704
@t : 1406
... when it should be showing the entire domain. The domains that get written to uniques.txt look fine, so I'm not sure why the lines aren't being read in or output properly from arrayDomain[].

3) Output the percentages as well. You'll see the line in my code that's commented out:

#echo "\n$((${arrayUniqueNum[position]}/$fileLineCount)*100) %"

I'm not sure how to format this properly to output what I need (the percentage of a file that a given domain makes up):

Domain.com : xxxx unique occurrences : 23%
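
For items 1 and 3, here is a rough sketch of one way to get a thresholded count with a percentage. Two things in the script above get in the way: wc -w $file counts words and prints the file name after the number, and shell arithmetic is integer-only, so dividing a count by a larger total gives 0 before the multiplication. The sketch reads the line count with wc -l < file and leaves the division to awk. domain_share and its arguments are made-up names, and the percentage is assumed to mean "share of the file's lines".

```shell
#!/bin/sh
# Sketch: per-domain count and share of a file's lines, filtered by a minimum count.
# domain_share and its arguments are hypothetical names used for illustration.

# usage: domain_share FILE [MIN_COUNT]
domain_share() {
    file=$1
    min=${2:-1000}                  # threshold from the question
    # Redirecting into wc prints only the number, without the file name
    total=$(wc -l < "$file")
    grep -oE '@[A-Za-z0-9._-]+' "$file" |
        sed 's/^@//' |
        sort | uniq -c |
        awk -v min="$min" -v total="$total" '
            # $1 is the count from uniq -c, $2 is the domain
            $1 > min {
                printf "%s : %d occurrences : %.0f%%\n", $2, $1, 100 * $1 / total
            }'
}
```

Called as domain_share somefile 1000, it prints only the domains that occur more than 1000 times. The percentage is rounded to a whole number by %.0f; use %.1f for one decimal place.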

Help would be GREATLY appreciated. Thanks for your assistance in advance, you all are a true asset to furthering knowledge and education in the Unix community! I'm sure we can come to a solution together. I'm here to learn from the best~!

Please let me know if this needs clarification at all.
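
On item 2, one possible explanation (an assumption, since the input files aren't shown) is that some domains contain a hyphen. The class [\w.] stops at a hyphen, so an address at a hyphenated domain would be cut down to something like @r, and grep -o with that short pattern then counts every domain beginning with it. The sketch below reproduces the effect with made-up addresses.

```shell
#!/bin/sh
# Sketch: how a character class without "-" can truncate a hyphenated domain.
# The sample addresses below are made up for illustration.
sample=$(mktemp)
printf '%s\n' 'alice@r-mail.example' 'bob@rest.example' 'carol@r-mail.example' > "$sample"

# Roughly the same characters as the Perl class [\w.]: letters, digits, underscore, dot
narrow=$(grep -oE '@[A-Za-z0-9_.]+' "$sample" | sort -u)

# Adding "-" (placed last so it is literal) keeps the whole domain
wide=$(grep -oE '@[A-Za-z0-9_.-]+' "$sample" | sort -u)

# Counting with the truncated pattern matches every domain that starts with it
inflated=$(grep -o '@r' "$sample" | wc -l)

echo "narrow class:"; echo "$narrow"
echo "wide class:";   echo "$wide"
echo "matches for @r: $inflated"
rm -f "$sample"
```

If this is the cause, uniques.txt itself would contain the short entries, so it is worth checking with grep -n '^@.$' uniques.txt before changing anything else.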

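You may be getting that error with my sample if you're still using the '-n' flag. With -n, Perl wraps the code in a while (<>) loop, which shifts the file name off @ARGV before the code runs, so $ARGV[0] is undefined and the open fails on every input line.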

If I get a chance, I'll look at the shell script too.

That line of code worked (I think I entered something incorrectly before). I've included similar functionality in the script (per my previous post). If you'd be so kind as to see what can be done to make the other items happen, that would be FANTASTIC.

I'm really stuck and I'd appreciate the opportunity to learn how to make these other functions happen (there are just a few).

Thanks so much in advance.