Thank you, Don Cragun, for the code optimization; I was not aware of this method of defining variables on one line. With it I was able to improve the line-reading speed five-fold.
Regarding the
/opt/URL/BL
directory: the downloaded archive (wget http://www.shallalist.de/Downloads/shallalist.tar.gz) unpacks into subdirectories, each containing files with the defined URL lists. The directory holds the following categories:
adv
aggressive
alcohol
anonvpn
automobile
chat
COPYRIGHT
costtraps
dating
downloads
drugs
dynamic
education
finance
fortunetelling
forum
gamble
global_usage
government
hacking
hobby
homestyle
hospitals
imagehosting
isp
jobsearch
library
military
models
movies
music
news
podcasts
politics
porn
radiotv
recreation
redirector
religion
remotecontrol
ringtones
science
searchengines
sex
shopping
socialnet
spyware
tracker
updatesites
urlshortener
violence
warez
weapons
webmail
webphone
webradio
webtv
In each of these category directories there are two files, domains and urls. A single domain can produce numerous hits, e.g. facebook.com:
adv porn movies hobby socialnet spyware redirector finance chat
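For reference, this is roughly what the intermediate data looks like (the path below is illustrative, mimicking the Shalla list layout): grep -r prints one matching-file path per hit, and because the tree is rooted at /opt/URL/BL/, the fifth slash-separated field of each path is the category directory name.

```shell
# Illustrative: a grep -r match line looks like
#   /opt/URL/BL/porn/domains:facebook.com
# Slash-delimited field 5 is the category name (field 1 is empty
# because the path starts with '/'):
printf '%s\n' '/opt/URL/BL/porn/domains:facebook.com' | cut -d'/' -f5
# prints: porn
```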
So with the line below I am trying to grep all the directory names that define the categories and add them to the variable $ct, separated by spaces if there is more than one:
ct=$(grep -i -r "$dom" /opt/URL/BL/ | cut -d'/' -f5 | uniq -d | head)
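One caveat (an assumption on my part, worth checking against the real data): uniq -d only collapses adjacent duplicate lines, so the pipeline relies on grep -r emitting all hits for one directory consecutively. Inserting a sort makes it independent of traversal order. The sketch below builds a throwaway tree standing in for /opt/URL/BL, uses grep -l to list matching files (avoiding URL slashes in the match text), and joins the categories with spaces:

```shell
# Sketch: build a throwaway tree mimicking /opt/URL/BL and extract
# the category names for one hypothetical domain, space-separated.
dom="facebook.com"            # hypothetical lookup domain
bl=$(mktemp -d)               # stand-in for /opt/URL/BL
for c in adv porn socialnet; do
    mkdir -p "$bl/$c"
    echo "$dom" > "$bl/$c/domains"
    echo "http://$dom/" > "$bl/$c/urls"
done
# grep -l lists each matching file once; the next-to-last path
# component is the category; sort guarantees duplicates are adjacent;
# uniq -d keeps categories hit in both files (domains and urls)
ct=$(grep -ril "$dom" "$bl"/ | awk -F/ '{print $(NF-1)}' | sort | uniq -d)
rm -rf "$bl"
echo $ct    # unquoted on purpose: newlines collapse to single spaces
# prints: adv porn socialnet
```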
Here is the entire code after the update:
while read -r dt tm _ _ _ ipt _ url _ type _
do ip=${ipt%%#*}
echo "$url" > temp-url
dom=$(awk '
/^\/\/|^ *$/ {next}
FNR!=NR {for (f in FIVE) if ($0 ~ "[.]" f "$") {print $(NF-5), $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF; next}
for (f in FOUR) if ($0 ~ "[.]" f "$") {print $(NF-4), $(NF-3), $(NF-2), $(NF-1), $NF ; next}
for (t in THREE) if ($0 ~ "[.]" t "$") {print $(NF-3), $(NF-2), $(NF-1), $NF; next}
for (t in TWO) if ($0 ~ "[.]" t "$") {print $(NF-2), $(NF-1), $NF; next}
for (o in ONE) if ($0 ~ "[.]" o "$") {print $(NF-1), $NF; next}
next
}
/^\*/ {next}
NF==5 {FIVE[$0]}
NF==4 {FOUR[$0]}
NF==3 {THREE[$0]}
NF==2 {TWO[$0]}
NF==1 {ONE[$0]}
' FS="." OFS="." public_suffix_list.dat temp-url)
ct=$(grep -i -r "$dom" /opt/URL/BL/ | cut -d'/' -f5 | uniq -d | head)
echo $dt,$tm,$ip,$url,$dom,$type,$ct >> DNS1_Logs
echo $dom >> DNS1_DOM
echo $dom,$ct >> DNS1_CT
done < DNS1
sort DNS1_DOM | uniq -cd | sort -nr > DNS1_Sort
One additional question about the domain awk code: is it possible to read from a variable like $dom instead of temp-url, which I am currently first writing out as a temporary file? And is any additional optimization possible?
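As a sketch of what I mean (a simplified stand-in, not the full multi-level suffix logic): a '-' file argument makes awk read standard input, so the URL can be piped in and the FNR==NR two-file logic still distinguishes the suffix list from the URL.

```shell
# Hypothetical, simplified sketch of feeding $url to awk via stdin
# instead of a temp file. The tiny suffix file here stands in for
# public_suffix_list.dat; only two-label suffixes are handled.
url="www.example.co.uk"
suf=$(mktemp)
printf 'co.uk\n' > "$suf"
dom=$(printf '%s\n' "$url" | awk '
    FNR == NR { two[$0]; next }            # first file: suffix list
    {   n = split($0, p, ".")              # second file: the URL (stdin)
        if ((p[n-1] "." p[n]) in two)
             print p[n-2] "." p[n-1] "." p[n]
        else print p[n-1] "." p[n]
    }
' "$suf" -)
rm -f "$suf"
echo "$dom"    # prints: example.co.uk
```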
---------- Post updated at 03:59 PM ---------- Previous update was at 03:48 PM ----------
Hi RudiC, thank you very much. WOW, this is amazing code, much faster than the code I am currently working with. I am facing two challenges with it: the first is reading the category files from the
/opt/URL/BL
folders (http://www.shallalist.de/Downloads/shallalist.tar.gz), and the second is that the output is separated by spaces rather than commas. I have tried to figure out where exactly the spaces are defined, but so far I have not been able to find it.
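Since RudiC's code is not quoted here, this is only a general observation: when an awk print statement joins its comma-separated items, the separator comes from OFS, which defaults to a single space; setting OFS="," switches the output to commas.

```shell
# Items separated by commas in awk's print are joined with OFS,
# which defaults to " ". Setting OFS="," yields comma output
# (with default OFS the same print would emit "a b c"):
echo "a b c" | awk 'BEGIN { OFS = "," } { print $1, $2, $3 }'
# prints: a,b,c
```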