Get only domain from URL (BIND log file)

As I said before, crunching through that amount of data will take its time.

Are you sure that you are taking the right approach? Why don't you explain your problem here in a broader context? Somebody might come up with a more efficient solution...

Hi RudiC, OK, let me try to explain the challenge I am faced with. You are correct: there are many great minds out there who might have a possible solution for this.

So there are three DNS servers, and each of them produces around 9 million lines a day, for a total of 27 million records a day. The information produced by these DNS servers does not offer much to work with beyond the date, time, client IP, URL and record type. The objective is therefore to enrich this information by extracting additional details from the base content received.

Original Log File:

04-Nov-2015 08:28:39.261 queries: info: client 192.168.169.122#59319: query: istatic.eshopcomp.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.269 queries: info: client 192.168.212.136#48872: query: idsync.rlcdn.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.269 queries: info: client 192.168.19.61#53970: query: 3-courier.sandbox.push.apple.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.270 queries: info: client 192.168.169.122#59319: query: ajax.googleapis.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.272 queries: info: client 192.168.251.24#37028: query: um.simpli.fi IN A + (10.10.80.50)
04-Nov-2015 08:28:39.272 queries: info: client 192.168.251.24#37028: query: www.wtp101.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.273 queries: info: client 192.168.251.24#37028: query: magnetic.t.domdex.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.273 queries: info: client 172.25.111.175#59612: query: api.smoot.apple.com IN A + (10.10.80.50)
04-Nov-2015 08:28:39.275 queries: info: client 192.168.7.181#45913: query: www.miniclip.com IN A + (10.10.80.50)
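For reference, the five values used later (date, time, client IP, queried name and record type) sit at fixed field positions in each of these lines, so they can all be pulled out in a single awk pass; a minimal sketch (the log file name is only an example, taken from a later post):

awk '{
        split($6, c, "#")              # field 6 looks like "clientIP#port:"
        print $1, $2, c[1], $8, $10    # date, time, client IP, URL, record type
     }' 2015-11-09.querylogs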

From the domain or URL one should be able to assign the entry to different categories, such as socialnet or news, and some fall into multiple categories. In addition to this, the GeoIP information for the location is looked up. It is then possible to identify the destinations of these URLs, since different URLs are hosted in different countries; Google, for example, has multiple locations for different content. Below is a sample of the resulting output:

04-Nov-2015,08:28:39.261,192.168.169.122,istatic.eshopcomp.com,205.185.208.26,eshopcomp.com,A,US,UnitedStates,AZ,Arizona,Phoenix,85012,33.508301,-112.071701,602,
04-Nov-2015,08:28:39.269,192.168.212.136,idsync.rlcdn.com,54.172.162.24,rlcdn.com,A,US,UnitedStates,VA,Virginia,Ashburn,20147,39.033501,-77.483803,703,
04-Nov-2015,08:28:39.269,192.168.19.61,3-courier.sandbox.push.apple.com,17.172.232.11,apple.com,A,US,UnitedStates,CA,California,Cupertino,95014,37.304199,-122.094597,408,updatesites news forum porn movies hobby podcasts downloads shopping webradio
04-Nov-2015,08:28:39.270,192.168.169.122,ajax.googleapis.com,216.58.223.10,ajax.googleapis.com,A,US,UnitedStates,CA,California,MountainView,94043,37.419201,-122.057404,650,
04-Nov-2015,08:28:39.272,192.168.251.24,um.simpli.fi,158.85.41.203,simpli.fi,A,US,UnitedStates,VA,Virginia,Chantilly,22022,38.894299,-77.431099,703,
04-Nov-2015,08:28:39.272,192.168.251.24,www.wtp101.com,54.86.5.94,wtp101.com,A,US,UnitedStates,VA,Virginia,Ashburn,20147,39.033501,-77.483803,703,
04-Nov-2015,08:28:39.273,192.168.251.24,magnetic.t.domdex.com,54.217.251.207,domdex.com,A,IE,Ireland,07,Dublin,Dublin,N/A,53.333099,-6.248900,0,tracker
04-Nov-2015,08:28:39.273,172.25.111.175,api.smoot.apple.com,17.252.75.246,apple.com,A,US,UnitedStates,CA,California,Cupertino,95014,37.304199,-122.094597,408,updatesites news forum porn movies hobby podcasts downloads shopping webradio
04-Nov-2015,08:28:39.275,192.168.7.181,www.miniclip.com,54.192.207.82,miniclip.com,A,US,UnitedStates,WA,Washington,Seattle,98144,47.583900,-122.299500,206,hobby

Now the information has been enriched and adds additional value; it can be imported into a GIS tool, which provides a nice graphical interface, etc.
So the challenge is that the 27 million records produced per day are taking days to be processed and enriched with the additional information. Is there a solution to get all of this information into a file within minutes or hours rather than days?

Well, you still didn't say where and how to get that enriching information, and what I see in your sample above is not found in the BL files you presented earlier; at least I didn't find it there.

Some thoughts to improve processing:

  • could you remove duplicates from the input files?
  • could you preprocess (condense) those BL files?
  • could you split the info needed into separate outputs, each requiring less processing?

Yes, here is the information:
Some thoughts to improve processing:

  • could you remove duplicates from the input files?
    No, as they have different time-stamps.
  • could you preprocess (condense) those BL files?
    For the BL information it would be possible to do this only at the end, not for every line (see the sketch below this list).
  • could you split the info needed into separate outputs, each requiring less processing?
    Yes, as long as there is one file that includes everything, or the files are linked by the shared information.
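A minimal sketch of that "condense" idea, under the assumption that the same URLs recur many times within a day's log: run the expensive lookups only once per distinct URL and join the results back onto every record afterwards. Here enrich.sh is a hypothetical wrapper around the geoiplookup/BL logic from the script below, and the file names are examples only:

# Distinct queried names only (field 8 of the query log).
awk '{print $8}' 2015-11-09.querylogs | sort -u > urls.uniq

# Hypothetical: enrich.sh reads one URL per line and writes a
# "url,domain,geo...,categories" line for each, using the same lookups
# as the main script further down.
./enrich.sh < urls.uniq > urls.enriched

# Join the enrichment back onto every log record, keyed on the URL.
awk 'NR==FNR {e[$1] = $0; next}
     {ip = $6; sub(/#.*/, "", ip); print $1 "," $2 "," ip "," e[$8]}
    ' FS=, urls.enriched FS=" " 2015-11-09.querylogs > enriched.csv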

Here is the current script used:

while read -r line
do
# Pull the individual fields out of the query log line.
dt=$(awk -F " " '/ / {print $1}' <<< "$line")
tm=$(awk -F " " '/ / {print $2}' <<< "$line")
ipt=$(awk -F " " '/ / {print $6}' <<< "$line")
ip=$(cut -d'#' -f1 <<< "$ipt")
url=$(awk -F " " '/ / {print $8}' <<< "$line")
type=$(awk -F " " '/ / {print $10}' <<< "$line")

# GeoIP lookups for the queried name: resolved IP, country, AS and city data.
urlip=$(geoiplookup -i -f /usr/share/GeoIP/GeoIP.dat "$url" | awk -F ":" '/ / {print $2}' | cut -d',' -f1 | awk 'NR==2' | tr -d '[[:space:]]')
countrys=$(geoiplookup -f /usr/share/GeoIP/GeoIP.dat "$url" | awk -F ":" '/ / {print $2}' | cut -d',' -f1 | awk 'NR==1' | tr -d '[[:space:]]')
country=$(geoiplookup -f /usr/share/GeoIP/GeoIP.dat "$url" | awk -F "," '/ / {print $2}' | cut -d',' -f1 | awk 'NR==1' | tr -d '[[:space:]]')
as=$(geoiplookup -f /usr/share/GeoIP/GeoIPASNum.dat "$url" | awk -F " " '/ / {print $4}' | cut -d',' -f1 | awk 'NR==2' | tr -d '[[:space:]]')
regions=$(geoiplookup -f /usr/share/GeoIP/GeoLiteCity.dat "$url" | awk -F "," '/ / {print $3}' | cut -d',' -f1 | tr -d '[[:space:]]')
region=$(geoiplookup -f /usr/share/GeoIP/GeoLiteCity.dat "$url" | awk -F "," '/ / {print $4}' | cut -d',' -f1 | tr -d '[[:space:]]')
city=$(geoiplookup -f /usr/share/GeoIP/GeoLiteCity.dat "$url" | awk -F "," '/ / {print $5}' | cut -d',' -f1 | tr -d '[[:space:]]')
postalCode=$(geoiplookup -f /usr/share/GeoIP/GeoLiteCity.dat "$url" | awk -F "," '/ / {print $6}' | cut -d',' -f1 | tr -d '[[:space:]]')
lat=$(geoiplookup -f /usr/share/GeoIP/GeoLiteCity.dat "$url" | awk -F "," '/ / {print $7}' | cut -d',' -f1 | tr -d '[[:space:]]')
long=$(geoiplookup -f /usr/share/GeoIP/GeoLiteCity.dat "$url" | awk -F "," '/ / {print $8}' | cut -d',' -f1 | tr -d '[[:space:]]')
areaCode=$(geoiplookup -f /usr/share/GeoIP/GeoLiteCity.dat "$url" | awk -F "," '/ / {print $10}' | cut -d',' -f1 | tr -d '[[:space:]]')

# Derive the registrable domain: match the queried name against the public
# suffix list and keep the matching suffix plus one additional label.
echo "$url" > temp-url

dom=$(awk '
/^\/\/|^ *$/    {next}

FNR!=NR         {for (f in FIVE)  if ($0 ~ "[.]" f "$")  {print $(NF-5), $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF; next}
                 for (f in FOUR)  if ($0 ~ "[.]" f "$")  {print $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF ; next}
                 for (t in THREE) if ($0 ~ "[.]" t "$")  {print $(NF-3), $(NF-2), $(NF-1), $NF; next}
                 for (t in TWO)   if ($0 ~ "[.]" t "$")  {print $(NF-2), $(NF-1), $NF; next}
                 for (o in ONE)   if ($0 ~ "[.]" o "$")  {print $(NF-1), $NF; next}
                 next
                }

/^\*/           {next}

NF==5           {FIVE[$0]}
NF==4           {FOUR[$0]}
NF==3           {THREE[$0]}
NF==2           {TWO[$0]}
NF==1           {ONE[$0]}
' FS="." OFS="." public_suffix_list.dat temp-url)

# Look up the domain's categories in the BL files.
ct=$(grep -i -r "$dom" /opt/URL/BL/ | cut -d'/' -f5 | uniq -d | head )

# Append the enriched record plus the per-domain and per-category summaries.
echo $dt,$tm,$ip,$url,$urlip,$dom,$type,$countrys,$country,$regions,$region,$city,$postalCode,$lat,$long,$areaCode,$ct >> tmp_Logs
echo $dom >> tmp_DOM
echo $dom,$country,$city,$city,$lat,$long,$ct >> tmp_CT
done < tmp

sort DNS1_DOM | uniq -cd | sort -nr > tmp_Sort

Hi RudiC, quick question: is it not possible to split the file into chunks of, say, 1 million lines each and have multiple instances running, then join the output files after they complete? For the 27 million lines there would then be 27 simultaneous instances of the script running.

First thought: yes, why not, as long as you make sure the output files are unique. Still, the instances might be competing for resources like memory, CPU, ...


Thank you for the feedback. Do you maybe have any suggestions or examples on how best to script this?
I was thinking of maybe starting off with a

wc -l filename 

and then dividing that by the number of files, say 30, then splitting the file accordingly and starting 30 scripts. I just do not know exactly how to do this.

Something like this, however I am not sure how to split it into different files:

A=0
while IFS= read -r LINE ; do
  printf '%s\n' "$LINE" > newfile$A
  (( A++ ))
done < "$INPUTFILE"

man split

Hi RudiC, thank you for that hint; I was not aware of this command. I have now added the following script to split the file:

#!/usr/bin/bash
fspec=2015-11-09.querylogs
num_files=10

# Work out lines per file.
total_lines=$(wc -l <${fspec})
((lines_per_file = (total_lines + num_files - 1) / num_files))

# Split the actual file, maintaining lines.
split --lines=${lines_per_file} ${fspec} new.

Now there are 10 files; how do I start 10 scripts to run through these files?

Output files:

Total lines     = 27622480
Lines  per file = 2762248
   2762248 new.aa
   2762248 new.ab
   2762248 new.ac
   2762248 new.ad
   2762248 new.ae
   2762248 new.af
   2762248 new.ag
   2762248 new.ah
   2762248 new.ai
   2762248 new.aj
  27622480 total
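One possible way to drive these chunks in parallel, sketched here with hypothetical names: assume the per-line processing loop has been saved as process_chunk.sh and modified to read the file named in $1 and to append to output files suffixed with that name (so the instances do not write to the same files). Then:

#!/usr/bin/bash
# Start one background copy of the (hypothetical) processing script per chunk.
for f in new.*
do
    ./process_chunk.sh "$f" &
done
wait    # block until every background copy has finished

# Merge the per-chunk outputs back into single files.
cat tmp_Logs.new.* > tmp_Logs
cat tmp_DOM.new.*  > tmp_DOM
cat tmp_CT.new.*   > tmp_CT

Whether 10 copies actually run faster depends on the cores and I/O bandwidth available, as discussed below.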

Back in post #14 I showed you a way to get rid of lots of extra processes you are invoking for each line you read. In post #16 you said my suggestion improved performance of your script by a factor of 5. But, in your latest script in post #24 in this thread, you have thrown away that improvement and added several more duplicated invocations of geoiplookup piped through various combinations of awk, cut, awk again, and tr. Why? If you had taken that earlier advice and applied it to the geoiplookup invocations as well, you would easily get an improvement by at least a factor of 10.
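To make that concrete, here is a minimal sketch (not the code from any post) of collapsing the seven GeoLiteCity pipelines into a single call per record; it assumes the usual one-line "GeoIP City Edition, Rev 1: ..." output layout that the original per-field pipelines already rely on:

# One geoiplookup call, one awk to pick out the seven city fields,
# one read to assign them (unlike the original, city names keep any
# internal spaces here).
IFS=',' read -r regions region city postalCode lat long areaCode <<< "$(
    geoiplookup -f /usr/share/GeoIP/GeoLiteCity.dat "$url" |
    awk -F ', *' 'NR==1 {OFS=","; print $3, $4, $5, $6, $7, $8, $10}'
)"

The same pattern applies to the GeoIP.dat and GeoIPASNum.dat calls.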

Your script in post #24 has two distinct parts. Everything before the final sort command processes the stuff we have been talking about in this thread. That last line processes a file that is not mentioned anywhere else in the script. Why not move that command to the start of your script and run it asynchronously (and add a wait at the end of your script if your script should not be allowed to terminate before that command completes)?

How many cores can you use on your system? If you want to run thirty copies of your script simultaneously, you will probably be fighting against yourself if you don't have at least thirty cores at your disposal.

In your latest script (in post #24) there is no reference to $fspec or to any files ending with .querylogs.

If you want to run that script 10 times in parallel, are the new.* files supposed to be used as replacements for the file named tmp or for the file named DNS1_DOM or for some other input file?

What is the name of your current script?

How do you invoke your current script?

In what directory are the new.* files located?

In what directory do you invoke your current script?

Given that there are two independent actions occurring in that script, why would you want to run one of those actions 10 times instead of just running it once?