Do Not Output Duplicates

sudo · April 23, 2014, 5:27pm

Mac OS 10.9

Let me preface this by saying this is not for marketing or spamming purposes.

I have a script that scans all the email messages in a directory (~/Library/Mail/Mailboxes) and outputs a single-column list of email addresses. This will run multiple times a day and append new entries to the output file.

If an email is duplicated in the email folder, it is duplicated in the output file. How do I remove these duplicates from the output file? It's just a single column of data, one address per line. I'm not sure whether I should have the script check for and exclude duplicates as it writes, or simply scan for duplicates after the output file has been appended to.

This list is being used as input for LDAP queries.

For reference, the scanning/output portion of my script is below:

find "$SRC" -type f -name '*.emlx' |
	while IFS= read -r FILE
	do
	   awk '/^From:/ && gsub(/.*<|>.*/,x)' "$FILE"
	done > ~/Desktop/output.txt
echo "complete"

bartus11 · April 23, 2014, 5:29pm

Try:

find "$SRC" -type f -name '*.emlx' |
  while IFS= read -r FILE
  do
    awk '/^From:/ && gsub(/.*<|>.*/,x)' "$FILE"
  done | sort | uniq > ~/Desktop/output.txt
echo "complete"
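
If the order in which addresses were first seen matters, deduplicating without sorting is also possible. This is only a minimal sketch: the function name dedupe_keep_order and the sample addresses are made up for illustration.

```shell
# Hypothetical helper: print each input line once, in first-seen order.
# awk keeps a count per line in the array "seen"; a line is printed only
# when its count is still zero, i.e. the first time it appears.
dedupe_keep_order() {
    awk '!seen[$0]++'
}

# Small demonstration on inline sample data.
result=$(printf '%s\n' b@example.com a@example.com b@example.com | dedupe_keep_order)
printf '%s\n' "$result"
```

In the pipeline above, a filter like this would take the place of `sort | uniq`. It holds every distinct line in memory, which should be fine for a list of email addresses.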

sudo · April 23, 2014, 6:06pm

Well, that was easy. Thanks!!

Don_Cragun · April 24, 2014, 12:54am

You could also try:

find "$SRC" -type f -name '*.emlx' -exec awk '/^From:/ && gsub(/.*<|>.*/,x)' {} + | sort -u > ~/Desktop/output.txt
echo "complete"
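
Since the original question mentions adding to the list on every run, here is a rough sketch of merging new results into an existing file without creating duplicates. The function name merge_unique, the temporary file, and the sample addresses are hypothetical. It relies on sort accepting the same file as both an input and the -o output, which POSIX allows.

```shell
# Hypothetical helper: merge addresses read from stdin into the list file
# given as $1, keeping exactly one copy of each address.
merge_unique() {
    out=$1
    touch "$out"                 # make sure the list exists on the first run
    sort -u -o "$out" "$out" -   # old list plus stdin, written back sorted and unique
}

# Demonstration with a scratch file standing in for ~/Desktop/output.txt.
list=$(mktemp "${TMPDIR:-/tmp}/addresses.XXXXXX")
printf '%s\n' carol@example.com alice@example.com | merge_unique "$list"
printf '%s\n' alice@example.com bob@example.com   | merge_unique "$list"
cat "$list"
```

In the real script, the find/awk pipeline would be piped into merge_unique ~/Desktop/output.txt in place of the `sort -u >` redirection, so entries from earlier runs are kept.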