extracting domain names out of a text file

I need to extract and list domain names from a very large text file. The file contains TLDs (.com, .net, .org and others) as well as third-level domains, e.g. host1.domain.com, and the names are embedded within paragraphs of text.

Domains do not have an http:// prefix, so I'm thinking the only thing to match on would be the TLDs: for example, match ".com" and extract everything before it back to the nearest space character.

How would I go about doing this?

grep, sed and awk?

Thank you gurus!:o

Er, you could use any of them, but Perl is better suited:

perl -ne '/\b\S+\.(com|net|org)\b/ && print $&,"\n"'

grep '\.com'
grep '\.net'
and so on (quote the pattern and escape the dot, otherwise the shell may expand * as a glob and . matches any character).

> cat file06
blah blah www.boston.com more blah
ha ha yech yes nope not yet tomorrow
today www.unix.com future www.unix.org
forever and ever sportsillustrated.cnn.com high

> tr ' ' '\n' < file06 | grep '\.com'
www.boston.com
www.unix.com
sportsillustrated.cnn.com
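If your grep supports -o and -E (GNU grep does), the tr step can be dropped and the domains pulled out directly. A sketch assuming GNU grep, using \b (a GNU extension) so a trailing word character ends the match:

```shell
# Sketch: one-step extraction; the escaped \. keeps the dot literal, so
# unrelated words that merely contain "com" are not matched.
printf 'blah www.boston.com more\nsportsillustrated.cnn.com high\n' |
  grep -oE '[A-Za-z0-9.-]+\.(com|net|org)\b'
```

This prints www.boston.com and sportsillustrated.cnn.com, one per line.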

I am trying to extract .co.uk domains from HTML, using the command:
cat $DIR/oldfile.txt | tr " " "\n" | grep [A-Za-z0-9_\.-].co.uk > $DIR/newfile.txt

The problem is that this command matches:
/>domain.co.uk<br
/>domain.co.uk<br
/>domain.co.uk<br
etc

How do I modify my regexp to match alphanumeric chars only? (apart from the dots and possible hyphens)

Many Thanks,

Hal

Well, if you change it to match alphanumeric only, then you get:

domain.co.ukbr

So I don't think that's what you want. If your grep accepts -o, you can do:

grep -o '[A-Za-z0-9._-]*\.co\.uk'

(Escape the dots outside the bracket expression, and don't put a backslash inside it: within brackets, \ is a literal character, not an escape.)

If not, use sed instead of grep. The leading .* is greedy, so make it stop at a non-domain character just before the capture group, otherwise it swallows the name and leaves only ".co.uk":

sed -n 's/.*[^A-Za-z0-9._-]\([A-Za-z0-9._-]*\.co\.uk\).*/\1/p'
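A quick way to sanity-check the -o approach against the problem line quoted above (a sketch; the input is taken from the post):

```shell
# Sketch: the bracket expression excludes "<" and ">", so the HTML
# remnants around the domain are not part of the match.
printf '/>domain.co.uk<br\n' | grep -o '[A-Za-z0-9.-]*\.co\.uk'
```

This prints just domain.co.uk.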

Thank you Otheus. Working fine with grep -o.