Script to delete HTML tag

Guys,

I have a little script that I got off the internet and that I use in Squid to block ads.
I used the script on Linux, but now I have moved my servers to FreeBSD. It's a steep learning curve, but it is fun. Back to the script issue.

The script used to work on Linux, but FreeBSD is a bit different.
This line is causing me issues:

# cat /tmp/temp_ad_file | grep "(^|\.)" > "/usr/local/etc/squid/squid.adservers"

If I use the line above in the script below, the destination file ends up completely empty. The goal is to get rid of the HTML tags "(^|\.)" in the list served at the http address "pgl.yoyo.org" for bad ad websites; the list is then used by the Squid proxy.
The line above is unusable. The script works well if I modify the line to drop the pipe, the grep and the tags:

# cat /tmp/temp_ad_file > "/usr/local/etc/squid/squid.adservers"

Then the list is updated correctly and no longer emptied, but the HTML tags are still in it.

#!/bin/sh
# Get new ad server list
/usr/local/bin/wget -O /tmp/temp_ad_file \
        'http://pgl.yoyo.org/adservers/serverlist.php?hostformat=squid-dstdom-regex;showintro=0&mimetype=plaintext'
# Clean HTML headers out of the list
cat /tmp/temp_ad_file > "/usr/local/etc/squid/squid.adservers"
# cat /tmp/temp_ad_file | grep "(^|\.)" > "/usr/local/etc/squid/squid.adservers"
# Refresh Squid
/usr/local/sbin/squid -k reconfigure
# Remove tmp file
rm -rf /tmp/temp_ad_file

Any help is much appreciated

Kind Regards,

Can you paste the HTML tags you are referring to? The actual line in the HTML...

--ahamed

ahamed101, thanks for replying.
I have pasted the start of the file (a txt file) below.
The HTML tags would be "(^|\.)" and "$". If they are left in the list, the Squid ACL can't use the file.

(^|\.)www\.sponsor2002\.de$
(^|\.)www1\.gto-media\.com$
(^|\.)www8\.glam\.com$
(^|\.)x-traceur\.com$
(^|\.)x\.mycity\.com$
(^|\.)x6\.yakiuchi\.com$
(^|\.)xchange\.awmcenter\.eu$
(^|\.)xchange\.ro$
(^|\.)xclicks\.net$
(^|\.)xertive\.com$
(^|\.)xiti\.com$

Kind Regards,

Try this...

grep '(^|\\.)' /tmp/temp_ad_file > "/usr/local/etc/squid/squid.adservers"

--ahamed

ahamed101, thanks, but I am back where I was before: squid.adservers gets completely emptied with

# grep ' (^|\\.) ' /tmp/temp_ad_file > "/usr/local/etc/squid/squid.adservers"

My first step is to download the list as a txt file and save it to the Squid folder as squid.adservers.
Then the line above is meant to update the list once every 3 days. With it, the destination file squid.adservers gets emptied when the script runs, and the ACL inside Squid then complains.

Regards,

I am confused... You have pasted the contents of /tmp/temp_ad_file, right? The grep statement looks for the pattern (^|\.) and writes the matching lines to the squid.adservers file. Try the grep statement without the redirection and see if you get anything on the screen.
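One way to take regex escaping out of the picture while debugging (a sketch; -F makes grep treat the pattern as a fixed string, so every character is literal):

```shell
# -F: fixed-string match, no regex interpretation; -c: just count matching lines
grep -cF '(^|\.)' /tmp/temp_ad_file
```

If that count is non-zero but the regex form of the pattern matches nothing, the problem is in how the pattern is being escaped, not in the file.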

--ahamed

Sorry for the confusion, ahamed101.
I tried it without the redirection, staying in the /tmp directory with the file "temp_ad_file":

grep '(^|\\.)' /tmp/temp_ad_file

When I look in temp_ad_file, the list is complete, still with the '(^|\.)' tags.
Here is an extract below.
Basically, what I am trying to do is clean out the HTML part and redirect the entire list to the Squid folder, to a file called squid.adservers.

(^|\.)zedo\.com$
(^|\.)zencudo\.co\.uk$
(^|\.)zenzuu\.com$
(^|\.)zeus\.developershed\.com$
(^|\.)zeusclicks\.com$
(^|\.)zintext\.com$
(^|\.)zmedia\.com$

Kind Regards,

Paste the output of

grep '(^|\\.)' /tmp/temp_ad_file

--ahamed

The file is very long. I don't think it would be a good idea to paste the entire file :)
Here is an extract:

(^|\.)zanox-affiliate\.de$
(^|\.)zanox\.com$
(^|\.)zantracker\.com$
(^|\.)zde-affinity\.edgecaching\.net$
(^|\.)zedo\.com$
(^|\.)zencudo\.co\.uk$
(^|\.)zenzuu\.com$
(^|\.)zeus\.developershed\.com$
(^|\.)zeusclicks\.com$
(^|\.)zintext\.com$
(^|\.)zmedia\.com$

Boy, is this thread confusing :)

A couple of observations that might help clear up the problem. First, the original post refers to 'removing HTML' from the file. However, the file pulled from yoyo.org with wget as text/plain does not contain any HTML. More to the point, the earlier posts seem to indicate that, while incorrectly calling (^|\.) HTML, these strings are not desired. Depending on how Squid is configured this is true: they need to be removed.

The file from yoyo.org is a list of regular expressions, and if Squid isn't configured with acl ads dstdom_regex -i "[/usr/local]/etc/squid.adservers" then the regex parts will cause problems. I believe things have stopped working because the configuration on the old machine isn't the same as on FreeBSD. Given this, the original code that extracts the regex lines from the yoyo.org data using the regex string makes sense.
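For reference, the regex-formatted list only works when squid.conf declares the ACL with dstdom_regex. A sketch of the relevant lines (the ACL name "ads" and the deny rule are examples, not taken from the poster's actual config):

```
# squid.conf fragment (illustrative)
acl ads dstdom_regex -i "/usr/local/etc/squid/squid.adservers"
http_access deny ads
```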

This doesn't explain why the output file is ending up empty, but it might shift the focus of the problem a bit. If the Squid config is changed to match the old machine, then the regex file can be used as is; otherwise the regex portions should be stripped:

sed 's/[()|.$^]//g' /tmp/temp_ad_file >/usr/local/etc/squid/squid.adservers

Care should be taken if these strings are used without the regex as they might match more URLs than desired.

I'm interested in knowing whether the sed above has the same problem -- generates an empty file. If it does, then I question the permissions on the output file. What happens if the output is redirected to something like /tmp/foo instead?
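If bare domain names are the end goal, a slightly different sed (a sketch, not tested against the full list) peels off the leading "(^|\.)" group and the trailing "$", then unescapes the dots instead of deleting them:

```shell
# remove the leading parenthesised group, strip the trailing $,
# then turn every escaped dot (\.) into a plain dot
sed -e 's/^([^)]*)//' -e 's/\$$//' -e 's/\\\./\./g' /tmp/temp_ad_file
```

On a line like (^|\.)xiti\.com$ this yields xiti.com, which is the form a plain dstdomain ACL would expect.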


agama, thanks for your help here. You have understood my issue better than I could ever explain it. I am very sorry for calling those tags HTML when they were not, and for causing so much confusion. My apologies.
Coming back to the script: the tags are completely gone now. My ACL in Squid is working again, and the final destination file does not get emptied any more. Thank you so much for your help again, agama.

One small thing I am still having some difficulties with; the learning curve is steep :)

When the script is run, the contents of /tmp/temp_ad_file are displayed in the console. Is there a way to not display temp_ad_file in the console?

#!/bin/sh
# Get new ad server list
/usr/local/bin/wget -O /tmp/temp_ad_file \
        'http://pgl.yoyo.org/adservers/serverlist.php?hostformat=squid-dstdom-regex;showintro=0&mimetype=plaintext'
# Removing repeated lines
cat /tmp/temp_ad_file | uniq
# Removing blank lines
sed /^$/d /tmp/temp_ad_file
# Cleaning list
sed 's/[()|.$^]//g' /tmp/temp_ad_file > /usr/local/etc/squid/squid.adservers
# Refresh Squid
/usr/local/sbin/squid -k reconfigure
# Remove tmp file
rm -rf /tmp/temp_ad_file

Kind Regards,

---------- Post updated at 05:18 PM ---------- Previous update was at 04:53 PM ----------

So I think I have figured it out

I am sending into the bit bucket both lines below

cat /tmp/temp_ad_file | uniq 2>&1 > /dev/null
sed /^$/d /tmp/temp_ad_file 2>&1 > /dev/null
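As a side note on those two lines: redirections are processed left to right, so writing 2>&1 before the file redirect points stderr at the terminal first and only then silences stdout. A sketch of the difference, using ls on a missing path as a stand-in command:

```shell
# stderr is duplicated onto the terminal BEFORE stdout is redirected,
# so the error message still appears:
ls /nonexistent 2>&1 > /dev/null

# redirect stdout first, then point stderr at the same place -- fully silent:
ls /nonexistent > /dev/null 2>&1
```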

I have bought the book "Mastering Unix Shell Scripting" to try to understand a bit more about *nix scripting. I would recommend it for advanced users; for someone like me, it's a bit tough.

Regards,

zongo

What about :

 /usr/bin/tr -d '[(^|\\$)]' < /tmp/temp_ad_file > /usr/local/etc/squid/squid.adservers
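For what it's worth, tr -d deletes every individual character listed in the set: the square brackets here are taken literally (so [ and ] would be stripped too), and '\\' is tr's escape for a single backslash. Since '.' itself is not in the set, real dots survive while the regex characters go; on a sample line this yields a dstdomain-style entry:

```shell
# delete the characters [ ( ^ | \ $ ) ] from every line
printf '%s\n' '(^|\.)xiti\.com$' | tr -d '[(^|\\$)]'
# -> .xiti.com
```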

Glad you've got it working, and it seems you're willing to understand the why along with the how, which is important.

In general, if you aren't doing anything with the output of the above commands, then you don't need to run them; just comment them out, or remove them completely. However, looking at your script, I believe the intent was to execute these as a pipeline rather than by themselves. The data from yoyo.org needs to be sorted for uniq to work, and since you get the input from an external source (order unknown), it's best to sort with the unique option rather than assuming it's sorted. The sed to remove blank lines can be combined with the sed that deletes the regex stuff, so the pipeline is just two commands that accomplish all four things:

 # Sort, de-duplicate, drop blank lines, and strip the regex characters
 sort -u /tmp/temp_ad_file | sed '/^$/d; s/[()|.$^]//g'  > /usr/local/etc/squid/squid.adservers

agama, thanks.
I have amended the script as per your suggestions and get an even better result.
I have moved the "cleaning list" line up so it sits above the "uniq" line. That way the list is sorted, cleaned, and stripped of blank lines, and then all repetitions are removed.

I am now using the "uniq" filter on the final destination file, squid.adservers, and no longer on /tmp/temp_ad_file.

danmero, thanks for helping as well.

Kind Regards,

#!/bin/sh
# Get new ad server list
/usr/local/bin/wget -O /tmp/temp_ad_file \
        'http://pgl.yoyo.org/adservers/serverlist.php?hostformat=squid-dstdom-regex;showintro=0&mimetype=plaintext'
# Cleaning list
sort -u /tmp/temp_ad_file | sed '/^$/d; s/[()|.$^]//g' > /usr/local/etc/squid/squid.adservers
# Removing repeated lines
cat /usr/local/etc/squid/squid.adservers | uniq
# Refresh Squid
/usr/local/sbin/squid -k reconfigure
# Remove tmp file
rm -rf /tmp/temp_ad_file

This line isn't necessary:

cat /usr/local/etc/squid/squid.adservers | uniq

The sort -u already makes the records in the file unique as it sorts them; I didn't explain before that the -u option stands for unique. Further, your output isn't going anywhere except standard output, so executing the command doesn't do anything to the file.
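To illustrate with made-up input, sort -u both orders the lines and drops the duplicate in one pass:

```shell
printf '%s\n' 'zedo' 'xiti' 'zedo' | sort -u
# -> xiti
#    zedo
```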

agama, thanks

Regards,

zongo