Get only the domain from a URL file (BIND)

Hello everybody,
I have been trying to extract the domain name from the BIND query log with different options, but I always get stuck on domains that end in suffixes like .co.uk or .co.nz.

I tried the following, but it only keeps the last two labels:

awk -F"." '{print $(NF-1)"."$NF}' list.txt > test.txt

Current raw file:

csc.beap.bc.yahoo.com
googleads.g.doubleclick.net
dances.with.wolves.tracker.prq.to
fbcdn-sphotos-g-a.akamaihd.net
api.smoot.apple.com
glistockisti.it
apps.ad-x.co.uk
configuration.apple.com.edgekey.net
walter-producer-cdn.api.bbci.co.uk
www.google.co.nz
bbc.co.uk 

Desired output file:

yahoo.com
doubleclick.net
tracker.prq.to
akamaihd.net
apple.com
glistockisti.it
ad-x.co.uk
edgekey.net
bbci.co.uk
google.co.nz
bbc.co.uk 

Is it possible to get the domain names with a command, or must the list be compared against another file that contains a list of all domains on the internet?

You have to compare against another list that defines the sub-domains.
Follow this link, if it is still active, for more information.
A compiled list can be found here.

Thank you for the response Aia; however, that post is quite old, does not seem to be active anymore, and does not offer a solution as such. Also, thank you for the publicsuffix list; it is very helpful and has given me a new possible approach to the challenge.

Unfortunately I am very new to the shell scripting world and would appreciate assistance in this regard. Here is the idea:

The URL is longer than the entries in the publicsuffix list, and the URL is separated by "." characters, so is there a possibility to grep or search the URL starting from the right-hand side and find the longest match? Let me provide an example:

walter-producer-cdn.api.bbci.co.uk

Starting from the right-hand side, matching against the publicsuffix list.
publicsuffix list for uk:

uk
ac.uk
co.uk
gov.uk
ltd.uk
me.uk
net.uk
nhs.uk
org.uk
plc.uk
police.uk
*.sch.uk

URL lookup:

uk

-Match

co.uk

-Match

bbci.co.uk

-No Match

When "No Match" is returned, take the last matching suffix (co.uk) plus one more segment of the URL to end up with bbci.co.uk.

Would it be possible to script this in some way?
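For reference, the idea described above could look roughly like this in awk, building the candidate suffix from the right and keeping the longest one that matches (a minimal sketch; publicsuffix.lst and list.txt are placeholder names, and wildcard or exception entries in the suffix list are not handled):

awk '
NR==FNR { sfx[$0]; next }                      # first file: load the suffix list into an array
{
    n = split($0, lab, ".")                    # split the name into its labels
    best = 0
    cand = ""
    for (i = n; i >= 1; i--) {                 # grow the candidate suffix from the right
        cand = (cand == "" ? lab[i] : lab[i] "." cand)
        if (cand in sfx) best = n - i + 1      # remember the longest suffix that matched
    }
    if (best == 0) best = 1                    # TLD not in the list: fall back to the last two labels
    if (best >= n) best = n - 1                # the name itself is a suffix: keep it whole
    d = lab[n - best]                          # registrable domain = longest suffix plus one more label
    for (i = n - best + 1; i <= n; i++) d = d "." lab[i]
    print d
}' publicsuffix.lst list.txt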

A slightly different approach:

awk '
NR==FNR                 {C[$0]
                         next
                        }
$(NF-1) OFS $(NF) in C  {print $(NF-2) OFS $(NF-1) OFS $NF
                         next
                        }
                        {print $(NF-1) OFS $NF
                        }
' FS="." OFS="." publicsuffix.lst raw
yahoo.com
doubleclick.net
prq.to
akamaihd.net
apple.com
glistockisti.it
ad-x.co.uk
edgekey.net
bbci.co.uk
co.nz
bbc.co.uk

Thank you RudiC. Could I kindly ask you to elaborate on the code? As mentioned before, I am very new to this. I have two files: one that contains the URLs and the other the publicsuffix list. Thank you.

awk '
NR==FNR                 {C[$0]                  # read first file (= NR==FNR) into the indices of the associative array C
                         next                   # stop processing the actual line; proceed with next line
                        }
$(NF-1) OFS $(NF) in C  {print $(NF-2) OFS $(NF-1) OFS $NF
                                                # if the second-last ($(NF-1)) and last ($NF) fields, joined by a dot, are found in C,
                                                # print the third-last, second-last, and last fields
                         next                   # stop ... see above
                        }
                        {print $(NF-1) OFS $NF  # if above doesn't apply, print second last and last fields 
                        }
' FS="." OFS="." publicsuffix.lst raw           # supply the field separators and two files to awk

This code certainly is not perfect; e.g. co.nz is missing from the publicsuffix.lst used here, but it may serve as a starting point...

RudiC, thank you very much for providing this solution, it is truly appreciated. I checked through the publicsuffix list and found that the longest suffix is 4 labels, so I added handling for that to the script you provided. Now it works and provides all the different domains. Here is the code I am now using:

awk '
NR==FNR                 {C[$0]
                         next
                        }
$(NF-1) OFS $(NF) in C  {print $(NF-2) OFS $(NF-1) OFS $NF
                         next
                        }
$(NF-1) OFS $(NF) in C  {print $(NF-3) OFS $(NF-1) OFS $NF
                         next
                        }
$(NF-1) OFS $(NF) in C  {print $(NF-4) OFS $(NF-1) OFS $NF
                         next
                        }
                        {print $(NF-1) OFS $NF
                        }
' FS="." OFS="." public_suffix_list.dat url.txt

I'm surprised this is working for you. There seem to be a few problems:

  1. The second and third condition/action pairs in your awk script will never be executed. Since their condition is identical to the first pair's condition, and the action attached to that condition ends with a next command, the later actions can never be reached.
  2. I believe your code should explicitly ignore blank lines and comment lines in public_suffix_list.dat (unless you pruned those lines out when you downloaded the public suffix list into your file); a small loader sketch addressing points 2 to 5 follows this list.
  3. I don't see how this code handles wildcards in rules (e.g., *.sch.uk ).
  4. I don't see how this code handles exception rules (although there aren't any exception rules if you're just trying to process UK domains).
  5. And, according to the rules published for the public list, you should be loading values into your array with C[$1] instead of C[$0]; but I don't see any rule in the public list with a trailing comment, so (if you ignore comment lines and blank lines) it might not matter.
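For what it's worth, a minimal loader along the lines of points 2 to 5 might look like this (the array names and the count line at the end are made up purely for illustration; it only loads the rules and does not do any matching):

awk '
/^\/\// || /^[[:space:]]*$/ { next }            # point 2: ignore comment and blank lines
/^!/     { excep[substr($1, 2)]; nx++; next }   # point 4: exception rules start with "!"
/^\*\./  { wild[substr($1, 3)]; nw++; next }    # point 3: wildcard rules, stored without the leading "*."
         { exact[$1]; ne++ }                    # point 5: index on $1, not $0
END      { printf "loaded %d exact, %d wildcard, %d exception rules\n", ne, nw, nx }
' public_suffix_list.dat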

Don Cragun, you are correct, I got excited too early. After running the script through a few hundred examples I found it is not working as desired. Do you maybe have a suggestion for how to extract the domain from the URL?

You might want to give this a try:

awk '
/^\/\/|^ *$/    {next
                }

FNR!=NR         {for (f in FIVE)  if ($0 ~ "\." f "$")  {print $(NF-5), $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF; next}
                 for (f in FOUR)  if ($0 ~ "\." f "$")  {print $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF ; next}
                 for (t in THREE) if ($0 ~ "\." t "$")  {print $(NF-3), $(NF-2), $(NF-1), $NF; next}
                 for (t in TWO)   if ($0 ~ "\." t "$")  {print $(NF-2), $(NF-1), $NF; next}
                 for (o in ONE)   if ($0 ~ "\." o "$")  {print $(NF-1), $NF; next}
                 next
                }

/^\*/           {next}

NF==5           {FIVE[$0]}
NF==4           {FOUR[$0]}
NF==3           {THREE[$0]}
NF==2           {TWO[$0]}  
NF==1           {ONE[$0]}  
' FS="." OFS="." public_suffix_list.dat rawfile
yahoo.com
doubleclick.net
prq.to
akamaihd.net
apple.com
glistockisti.it
ad-x.co.uk
edgekey.net
bbci.co.uk
google.co.nz
bbc.co.uk

The wildcard rules (the entries beginning with "*.") haven't been handled yet; and some optimization might come in handy...

Hi RudiC, thank you for the script. I am trying to resolve one challenge to check whether it is working; I am currently getting: warning: escape sequence `\.' treated as plain `.'

Will try and figure out the sequence.

In an ERE . matches any character. The intent is to match only a period at the start of those patterns. Change each occurrence of "\." in the script to "[.]" and it should get rid of the warnings and restrict the match to what was intended. (You could also use "\\." , but I find the matching list expression easier to use than trying to remember how many times a quoted expression will be evaluated by awk in cases like this.)
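To see the difference, a quick throwaway test (the sample strings are arbitrary, and gawk is assumed for the warning behaviour):

echo "bbci.co.uk" | awk '{ print ($0 ~ "\\.co\\.") }'   # "\\." becomes the ERE \. (a literal dot): prints 1
echo "bbci.co.uk" | awk '{ print ($0 ~ "[.]co[.]") }'   # bracket expression, same effect: prints 1
echo "bbciXcoYuk" | awk '{ print ($0 ~ "\.co\.") }'     # "\." is undefined in a string; gawk warns, treats it
                                                        # as ".co." where "." matches anything: also prints 1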

Thank you RudiC, for the script and assistance, it is truly appreciated. The script works very well now and extracts the Domain from the URL.

Also thank you Don Cragun, for the assistance.

Here is the final script I am currently using that was written by RudiC:

awk '
/^\/\/|^ *$/    {next}

FNR!=NR         {for (f in FIVE)  if ($0 ~ "[.]" f "$")  {print $(NF-5), $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF; next}
                 for (f in FOUR)  if ($0 ~ "[.]" f "$")  {print $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF ; next}
                 for (t in THREE) if ($0 ~ "[.]" t "$")  {print $(NF-3), $(NF-2), $(NF-1), $NF; next}
                 for (t in TWO)   if ($0 ~ "[.]" t "$")  {print $(NF-2), $(NF-1), $NF; next}
                 for (o in ONE)   if ($0 ~ "[.]" o "$")  {print $(NF-1), $NF; next}
                 next
                }

/^\*/           {next}

NF==5           {FIVE[$0]}
NF==4           {FOUR[$0]}
NF==3           {THREE[$0]}
NF==2           {TWO[$0]}
NF==1           {ONE[$0]}
' FS="." OFS="." public_suffix_list.dat rawfile

Hi RudiC and Don Cragun, could I kindly ask you one final favor: to optimize the script that I currently have. The objective is to take the raw log from BIND, enrich it by extracting the domain from the URL and adding content categorization, and then write the results to different files to summarize them. The challenge is that the script below processes 3.83 lines a second and I have 9 million lines a day :)

The input log in the DNS1 file looks like the following:

04-Nov-2015 08:28:39.261 queries: info: client 192.168.169.122#59319: query: istatic.eshopcomp.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.269 queries: info: client 192.168.212.136#48872: query: idsync.rlcdn.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.269 queries: info: client 192.168.19.61#53970: query: 3-courier.sandbox.push.apple.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.270 queries: info: client 192.168.169.122#59319: query: ajax.googleapis.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.272 queries: info: client 192.168.251.24#37028: query: um.simpli.fi IN A + (10.10.80.50)
04-Nov-2015 08:28:39.272 queries: info: client 192.168.251.24#37028: query: www.wtp101.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.273 queries: info: client 192.168.251.24#37028: query: magnetic.t.domdex.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.273 queries: info: client 172.25.111.175#59612: query: api.smoot.apple.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.275 queries: info: client 192.168.7.181#45913: query: www.miniclip.com IN A + (10.10.80.50)
while read -r line
do
dt=$(awk -F " " '/ / {print $1}' <<< $line) #Reading the date from the log file into variable
tm=$(awk -F " " '/ / {print $2}' <<< $line) #Reading the time from the log file into variable
ipt=$(awk -F " " '/ / {print $6}'<<< $line) #Reading the IP address from the log file into variable
ip=$(cut -d'#' -f1 <<< $ipt) #removing the port from the IP address and write into variable
url=$(awk -F " " '/ / {print $8}' <<< $line) #Reading the URL from the log file into variable
type=$(awk -F " " '/ / {print $10}' <<< $line) #Reading the Record Type from the log file into variable

echo $url > temp-url #Writing the URL into a temp file as I could not get the awk statement below to read the variable directly

dom=$(awk '
/^\/\/|^ *$/    {next}

FNR!=NR         {for (f in FIVE)  if ($0 ~ "[.]" f "$")  {print $(NF-5), $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF; next}
                 for (f in FOUR)  if ($0 ~ "[.]" f "$")  {print $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF ; next}
                 for (t in THREE) if ($0 ~ "[.]" t "$")  {print $(NF-3), $(NF-2), $(NF-1), $NF; next}
                 for (t in TWO)   if ($0 ~ "[.]" t "$")  {print $(NF-2), $(NF-1), $NF; next}
                 for (o in ONE)   if ($0 ~ "[.]" o "$")  {print $(NF-1), $NF; next}
                 next
                }

/^\*/           {next}

NF==5           {FIVE[$0]}
NF==4           {FOUR[$0]}
NF==3           {THREE[$0]}
NF==2           {TWO[$0]}
NF==1           {ONE[$0]}
' FS="." OFS="." public_suffix_list.dat temp-url) #extracting the Domain from the URL

ct=$(grep -i -r $dom /opt/URL/BL/ | cut -d'/' -f5 | uniq -d | head ) #Here I am using http://www.shalla.de/ categorization database to look at every domain and read the folder location to add the category it is in

echo $dt,$tm,$ip,$url,$dom,$type,$ct >> DNS1_Logs 	#Rewriting the log line with the domain and category added and unnecessary information removed
echo $dom >> DNS1_DOM								#Writing only the domain names into a separate file
echo $dom,$ct >> DNS1_CT							#Writing only the domain and category into a separate file
done < DNS1

sort DNS1_DOM | uniq -cd | sort -nr > DNS1_Sort 	#Sorting the domains to get the most utilized ones

Thank you very much already in advance.

Obviously, replacing:

while read -r line
do
dt=$(awk -F " " '/ / {print $1}' <<< $line) #Reading the date from the log file into variable
tm=$(awk -F " " '/ / {print $2}' <<< $line) #Reading the time from the log file into variable
ipt=$(awk -F " " '/ / {print $6}'<<< $line) #Reading the IP address from the log file into variable
ip=$(cut -d'#' -f1 <<< $ipt) #removing the port from the IP address and write into variable
url=$(awk -F " " '/ / {print $8}' <<< $line) #Reading the URL from the log file into variable
type=$(awk -F " " '/ / {print $10}' <<< $line) #Reading the Record Type from the log file into variable

with:

while read -r dt tm _ _ _ ipt _ url _ type _
do	ip=${ipt%%#*}

(which eliminates 5 executions of awk and 1 execution of cut per line in your log file) should let you process MANY more lines per second. Or, just build this into an awk script that will do all of this and do the URL processing you requested before in a single awk (instead of invoking awk again for every line in your log file).

What is the format of the files in the directory /opt/URL/BL ? How many files are there? How many categories are there? Running 5 processes for every line in your log file to grab whatever it is that you want to get is going to keep things running slow. If we can preprocess those files into a table that we can search more efficiently for each line's data, that would help immensely.

Would this come close to what you need?

awk '
/^\/\/|^ *$|^\*/        {next
                        }

FNR!=NR         {split ($6, X, "#")
                 printf "%s  %s  %s  %s  ", $1, $2, X[1], $8
                 n=split ($8, U, ".")
                 found = 0
                 if (!found && n >= 5)  for (f in FIVE) if ($8 ~ "\." f "$")    {found = 1; st = 5
                                                                                }
                 if (!found && n >= 4)  for (f in FOUR)  if ($8 ~ "\." f "$")   {found = 1; st = 4
                                                                                }
                 if (!found && n >= 3)  for (t in THREE) if ($8 ~ "\." t "$")   {found = 1; st = 3
                                                                                }
                 if (!found && n >= 2)  for (t in TWO)   if ($8 ~ "\." t "$")   {found = 1; st = 2
                                                                                }
                 if (!found && n >= 1)  for (o in ONE)   if ($8 ~ "\." o "$")   {found = 1; st = 1
                                                                                }
                 for (i=st; i>0; i--) printf "%s.", U[n-i]
                 printf "%s  %s  $CT_goes_here\n", U[n], $10
                }


NF==5           {FIVE[$0]}
NF==4           {FOUR[$0]}
NF==3           {THREE[$0]}
NF==2           {TWO[$0]}  
NF==1           {ONE[$0]}  

' FS="."  public_suffix_list.dat  FS=" "  DNS1
04-Nov-2015  08:28:39.261  192.168.169.122  istatic.eshopcomp.com  eshopcomp.com  A  $CT_goes_here
04-Nov-2015  08:28:39.269  192.168.212.136  idsync.rlcdn.com  rlcdn.com  A  $CT_goes_here
04-Nov-2015  08:28:39.269  192.168.19.61  3-courier.sandbox.push.apple.com  apple.com  A  $CT_goes_here
04-Nov-2015  08:28:39.270  192.168.169.122  ajax.googleapis.com  ajax.googleapis.com  A  $CT_goes_here
04-Nov-2015  08:28:39.272  192.168.251.24  um.simpli.fi  simpli.fi  A  $CT_goes_here
04-Nov-2015  08:28:39.272  192.168.251.24  www.wtp101.com  wtp101.com  A  $CT_goes_here
04-Nov-2015  08:28:39.273  192.168.251.24  magnetic.t.domdex.com  domdex.com  A  $CT_goes_here
04-Nov-2015  08:28:39.273  172.25.111.175  api.smoot.apple.com  apple.com  A  $CT_goes_here
04-Nov-2015  08:28:39.275  192.168.7.181  www.miniclip.com  miniclip.com  A  $CT_goes_here

It still takes its time, though, as it needs to creep through on average half the suffixes for every line in DNS1.
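Just as an idea for that optimization: instead of testing every suffix with a regex per line, one could build candidate suffixes from the query name itself and check them with plain array lookups. A sketch along those lines (same field positions and file names as above; wildcard and exception rules are simply skipped, so this is not a complete public-suffix implementation):

awk '
NR==FNR { if ($0 !~ /^\/\// && $0 !~ /^[[:space:]]*$/ && $1 !~ /^[*!]/) sfx[$1]
          next                                   # load plain rules only; skip comments, blanks, "*." and "!" rules
        }
        { split($6, X, "#")                      # client IP without the port
          n = split($8, U, ".")                  # labels of the query name
          best = 0; cand = ""
          for (i = n; i >= 1; i--) {             # grow candidate suffixes from the right
              cand = (cand == "" ? U[i] : U[i] "." cand)
              if (cand in sfx) best = n - i + 1  # remember the longest matching suffix
          }
          if (best == 0) best = 1                # unknown TLD: fall back to the last two labels
          if (best >= n) best = n - 1            # the name itself is a suffix: keep it whole
          d = U[n - best]
          for (i = n - best + 1; i <= n; i++) d = d "." U[i]
          printf "%s  %s  %s  %s  %s  %s  $CT_goes_here\n", $1, $2, X[1], $8, d, $10
        }
' public_suffix_list.dat DNS1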

Thank you Don Cragun for the optimization of the code; I was not aware of this method of assigning the variables in one read. With this method I was able to improve the line-reading speed by 5 times.

Regarding the

/opt/URL/BL

directory, the downloaded archive contains different directories with files that list the URLs (here is the link: wget http://www.shallalist.de/Downloads/shallalist.tar.gz). In that directory there are the following categories:

adv
aggressive
alcohol
anonvpn
automobile
chat
COPYRIGHT
costtraps
dating
downloads
drugs
dynamic
education
finance
fortunetelling
forum
gamble
global_usage
government
hacking
hobby
homestyle
hospitals
imagehosting
isp
jobsearch
library
military
models
movies
music
news
podcasts
politics
porn
radiotv
recreation
redirector
religion
remotecontrol
ringtones
science
searchengines
sex
shopping
socialnet
spyware
tracker
updatesites
urlshortener
violence
warez
weapons
webmail
webphone
webradio
webtv

In these category directories there are two files, domains and urls. There are numerous hits for one domain, e.g. facebook.com:

adv porn movies hobby socialnet spyware redirector finance chat

So with the line below I am trying to grep all the directory names that define the category and add these to the variable $ct, with spaces between them if there is more than one.

ct=$(grep -i -r $dom /opt/URL/BL/ | cut -d'/' -f5 | uniq -d | head )

Here is the entire code now after updating:

while read -r dt tm _ _ _ ipt _ url _ type _
do      ip=${ipt%%#*}

echo $url > temp-url

dom=$(awk '
/^\/\/|^ *$/    {next}

FNR!=NR         {for (f in FIVE)  if ($0 ~ "[.]" f "$")  {print $(NF-5), $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF; next}
                 for (f in FOUR)  if ($0 ~ "[.]" f "$")  {print $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF ; next}
                 for (t in THREE) if ($0 ~ "[.]" t "$")  {print $(NF-3), $(NF-2), $(NF-1), $NF; next}
                 for (t in TWO)   if ($0 ~ "[.]" t "$")  {print $(NF-2), $(NF-1), $NF; next}
                 for (o in ONE)   if ($0 ~ "[.]" o "$")  {print $(NF-1), $NF; next}
                 next
                }

/^\*/           {next}

NF==5           {FIVE[$0]}
NF==4           {FOUR[$0]}
NF==3           {THREE[$0]}
NF==2           {TWO[$0]}
NF==1           {ONE[$0]}
' FS="." OFS="." public_suffix_list.dat temp-url)

ct=$(grep -i -r $dom /opt/URL/BL/ | cut -d'/' -f5 | uniq -d | head )

echo $dt,$tm,$ip,$url,$dom,$type,$ct >> DNS1_Logs
echo $dom >> DNS1_DOM
echo $dom,$ct >> DNS1_CT
done < DNS1

sort DNS1_DOM | uniq -cd | sort -nr > DNS1_Sort

One additional question about the domain awk code: is it possible to have it read a variable (the $url value) instead of the temp-url file that I am currently writing first? And is it possible to do additional optimization?
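For reference, one way to avoid the temp file is to pipe the variable into awk and give "-" as the second file operand, so awk reads the URL from standard input in place of temp-url while the script itself stays unchanged. A cut-down sketch of the mechanism (only two-label suffixes are handled here to keep it short; in the real loop you would keep the full script and only replace temp-url with -):

url=walter-producer-cdn.api.bbci.co.uk      # example value; in the loop, $url is already set by read
dom=$(printf '%s\n' "$url" | awk '
    /^\/\/|^ *$/ {next}                     # same list cleanup as in the full script
    /^\*/        {next}
    FNR!=NR      {if (($(NF-1) OFS $NF) in TWO) print $(NF-2), $(NF-1), $NF
                  else                          print $(NF-1), $NF
                  next
                 }
    NF==2        {TWO[$0]}
' FS="." OFS="." public_suffix_list.dat -)  # "-" makes awk read the piped-in URL here
echo "$dom"

A bash here-string (awk '...' public_suffix_list.dat - <<< "$url") would work as well; the pipe form is just more portable.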

Hi RudiC, thank you very much. WOW, this is amazing code; it is much faster than the code I am currently working with. The first challenge I am facing with this code is reading the category files from the folders under

/opt/URL/BL

(http://www.shallalist.de/Downloads/shallalist.tar.gz), and the second challenge is that the separators are spaces rather than commas. I have tried to figure out where exactly the spaces are defined, however I have not been able to find this so far.

You can easily insert the commas in the format strings used by the printf statements.
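For example, the date/time/IP/query part could be printed with commas directly; a throwaway one-liner against DNS1, using the same field numbers as the script above:

awk '{ split($6, X, "#"); printf "%s,%s,%s,%s,%s\n", $1, $2, X[1], $8, $10 }' DNS1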

Actually, I don't really understand what you want to achieve with your processing of the category files. Cutting the fifth '/'-separated field yields many empty strings; uniq -d prints duplicate lines, which occur unpredictably unless preceded by a sort; and head prints only 10 lines, yielding an unpredictable result. Please specify exactly what you want/need.

Hi RudiC, apologies for not being clear on the point of the categories. Let me try to elaborate: for every URL in the logs I would like to extract the domain; this is the script you already provided. However, I would also like to know the category of the URL, and this is where the shalla files in the BL directory come in. The domain is searched for through the directories, returning the names of the directories it appears in. The challenge is that the domains and urls files in the BL directories each contain only one item per line, i.e. a URL, domain, or IP address. So what I am trying to do is identify, from the initial URL, the category of the domain. This will enable me to check how many URLs are browsing news or socialnet, etc.

Sorry, still not clear. If you want help with categorizing your domains, you'd best describe an algorithm for how to extract that from those BL files/domains/URLs. As already stated, right now your approach and the data structures are quite diffuse to me.

Hi RudiC, I have thought about categorizing each line, however it takes too much processing power. I would rather do this at the end, once I have summarized and sorted the files. Thank you for the clarification on the uniq -d and sort part; I was not aware that the sort needs to be there, or that head prints only 10 lines. The last script you provided is working great and I have managed to get the commas in place now.

However, I am still faced with a challenge, as there are 9 million records per server and 3 servers, so in total 27 million records to be processed. The optimized script is now doing about 50 thousand lines an hour. As such, even though it works great, I must find an alternative method; I need to be able to process these records as close to real time as possible.
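For when you get to that categorization step, here is a sketch of the table approach Don Cragun hinted at earlier: preprocess the BL tree once into a domain-to-category table, then tag each domain with an array lookup instead of one grep -r per line. This assumes the layout described above (/opt/URL/BL/<category>/domains, one entry per line); the file name domain_categories.tsv is a placeholder:

# one-off preprocessing: build a "domain<TAB>categories" table from the BL tree
for d in /opt/URL/BL/*/; do
    c=$(basename "$d")
    [ -f "${d}domains" ] && awk -v c="$c" '{ print $1 "\t" c }' "${d}domains"
done | sort | awk -F'\t' '
    $1 == prev { cats = cats " " $2; next }      # same domain again: append the category
    prev != "" { print prev "\t" cats }          # new domain: flush the previous one
               { prev = $1; cats = $2 }
    END        { if (prev != "") print prev "\t" cats }
' > domain_categories.tsv

# later: tag the summarized domains with one array lookup per line
awk -F'\t' '
    NR == FNR { cat[$1] = $2; next }             # load the table
              { print $0 "\t" ($1 in cat ? cat[$1] : "") }
' domain_categories.tsv DNS1_DOM

The urls files in each category directory could be folded into the same table in the same way if URL-level matches are needed.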

Thank you RudiC, it is truly a pleasure, and your assistance is appreciated.