Remove duplicate email

lpoolfc · November 26, 2019, 3:52pm

cat path/to/dir/file.html | grep -i 'x*.com' > path/to/dir/file.txt

Before
xyz.com 
xyz.com

After 
cat path/to/dir/file.html | grep -i 'x*.com' |  sed '$!s/$/,/' | tr -d '\n'> path/to/dir/file.txt

Result --> Preferred output:
xyz.com, xyz.com

The preferred output is exactly what I want, but I also want to remove the duplicates. I tried sort | uniq, but it still won't work. Any help appreciated.

vgersh99 · November 26, 2019, 3:59pm

How about (for starters):

awk -v str='x*.com' '$1~str && !a[$1]++' myFile

lpoolfc · November 26, 2019, 4:33pm

Sorry, it still shows duplicates.

vgersh99 · November 26, 2019, 4:36pm

myFile:

xyz.com
xyz.com
1xyz.com
123.com

$ awk -v str='x*.com' '$1~str && !a[$1]++' myFile
xyz.com
1xyz.com
123.com
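
A side note on why 123.com appears in that output: read as a regular expression, x*.com means "zero or more x, then any one character, then com", so it matches far more than names that start with x. The sketch below compares it with an anchored pattern. The sample lines are made up for illustration and are not from the original file.

```shell
# Hypothetical sample lines, invented for this illustration only.
sample='xyz.com
123.com
welcome
example.org'

# Unanchored pattern from the thread: x* may match nothing, and . matches any character.
loose=$(printf '%s\n' "$sample" | grep -i 'x*.com')

# Stricter pattern: the line starts with x and ends with a literal .com.
strict=$(printf '%s\n' "$sample" | grep -i '^x.*\.com$')

printf 'loose:\n%s\n' "$loose"
printf 'strict:\n%s\n' "$strict"
```

With this sample the loose pattern keeps xyz.com, 123.com and even welcome (because of "lcom"), while the anchored pattern keeps only xyz.com.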

lpoolfc · November 26, 2019, 6:43pm

I need them to be on the same line, comma-separated, as they appear in my preferred output.

MadeInGermany · November 27, 2019, 2:56pm

Your grep does not seem precise; perhaps you mean grep -i '^x.*\.com$' (starts with an x, then any number of characters, then .com at the end).
A sed 's/$/,/' does not delete $ because it is an anchor, not a character. But after an N command (which appends the following line to the pattern space) one can replace the embedded \n character.
The following should work on any Unix-like OS:

grep -i '^x.*\.com$' file.html | sed -e ':L' -e '$!N;s/\n/, /;tL'
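
To see what the sed loop does on its own, here is a small sketch with made-up input (the three addresses are assumptions, not data from the thread). The loop only joins lines with ", "; it does not drop repeated lines, so any de-duplication has to happen before it.

```shell
# Hypothetical input, one entry per line, with one repeated line.
# :L            defines a label
# $!N           appends the next input line unless we are on the last one
# s/\n/, /      replaces the embedded newline with ", "
# tL            jumps back to L if the substitution succeeded
joined=$(printf '%s\n' xyz.com xyz.com xab.com | sed -e ':L' -e '$!N;s/\n/, /;tL')

printf '%s\n' "$joined"
```

For this input the result is a single line, xyz.com, xyz.com, xab.com, with the duplicate still present.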

lpoolfc · December 2, 2019, 10:57am

This removes the emails that are the same. I only need to remove the duplicates.

lpoolfc · December 2, 2019, 12:10pm

@vgersh99
Got your script to work, but I needed to add sed '$!s/$/,/' | paste -sd ""

 awk -v str='x*.com' '$1~str && !a[$1]++' myFile | sed '$!s/$/,/' | paste -sd ""

to get my required output. Not sure if that's the best way, but I got it to work. Any advice?
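
One portability caveat, offered as a sketch and not a definitive fix: POSIX leaves paste -d "" unspecified and expects a file operand, so on systems other than GNU the pipeline above may fail. Writing the empty delimiter as '\0' and passing - for standard input is the form POSIX describes. The input lines here are the de-duplicated sample from earlier in the thread.

```shell
# Same idea as the pipeline above, using the POSIX spelling of an empty
# delimiter ('\0') and an explicit - operand for standard input.
out=$(printf '%s\n' xyz.com 1xyz.com 123.com | sed '$!s/$/,/' | paste -sd '\0' -)

printf '%s\n' "$out"
```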

Thank you

rbatte1 · December 3, 2019, 7:30am

A clunky way:-

grep -Ei "x.*\.com" /path/to/dir/file.html | sort -u > /path/to/dir/file.txt

The expression looks for an x followed by any number of characters followed by .com; however, it is not anchored to the beginning or end of a line. What is your input data like?

This input would still give some confusing results:-

x123@hello.com
x123@hello.com.foo
hello1@xyz.com
hello2@xyz.com
hello1@not-xyz.com-either

..... and lots of other variations. It leaves me with a few questions:-

What precise conditions do you want for the search in the first place?
What output do you want? The full email address or just the domain?

We can adjust the search to get just the records you are after, but the search needs to be precise, e.g. does the line start with x or have x immediately after @; does .com have to end the line, etc. All sorts of rules can be written once you are sure what you want. If you could post a representative sample of your input and desired output (in CODE tags), that will give us more to work with.
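
As an illustration of how much the rule matters, the sketch below runs the sample lines above through the unanchored expression and through one possible stricter rule. The stricter rule ("x immediately after @, and .com ends the line") is only an assumption about what might be wanted, not a confirmed requirement.

```shell
# The five sample lines listed above.
sample='x123@hello.com
x123@hello.com.foo
hello1@xyz.com
hello2@xyz.com
hello1@not-xyz.com-either'

# Unanchored expression: count how many lines it lets through.
loose_count=$(printf '%s\n' "$sample" | grep -Eic 'x.*\.com')

# Assumed stricter rule: domain starts with x right after @, line ends in .com.
strict=$(printf '%s\n' "$sample" | grep -Ei '@x[^@]*\.com$' | sort -u)

printf 'unanchored matches: %s\n' "$loose_count"
printf 'stricter rule keeps:\n%s\n' "$strict"
```

The unanchored expression matches all five lines, while the stricter rule keeps only hello1@xyz.com and hello2@xyz.com.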

Kind regards,
Robin

MadeInGermany · December 4, 2019, 8:30am

You can use

awk '!a[$1]++' myFile | paste -sd "," -

If you want to replace each newline with ", " then

awk '!a[$1]++' myFile | sed -e ':L' -e '$!N;s/\n/, /;tL'
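
If a single process is preferred, the de-duplication and the joining can also be done in awk alone. This is a minimal sketch, not a replacement for the commands above; it keys on the whole line ($0) instead of $1, and the array name seen and the variable sep are arbitrary choices. The input is the myFile sample shown earlier in the thread.

```shell
# Print each line the first time it is seen, prefixed by the separator.
# sep is empty for the first printed line and ", " afterwards.
joined=$(printf '%s\n' xyz.com xyz.com 1xyz.com 123.com |
  awk '!seen[$0]++ { printf "%s%s", sep, $0; sep = ", " } END { print "" }')

printf '%s\n' "$joined"
```

For that sample this prints xyz.com, 1xyz.com, 123.com on one line, keeping the first occurrence of each entry in its original order.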