Hi,
I would greatly appreciate it if someone could help me with my problem.
I have a crawler which collects spam URLs every day, and this data needs to be published in a blacklist.
Here's the catch:
The "Time To Live" (TTL) for each URL is 3 months (or whatever, for that matter). If I see the same URL again before its TTL expires, I need to update that URL's TTL, so it stays in the blacklist for another 3 months (TTL).
URLs which are not seen again before their TTL expires need to be removed from the list, so I don't keep old data & can manage the size of my blacklist.
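A minimal sketch of how the expiry for a URL seen today could be computed with date, assuming GNU date (the -d option with a relative offset is not portable to BSD date). The function name expiry_date and the YYYYMMDD format are assumptions for illustration; the format is chosen only because such dates compare correctly as plain integers.

```shell
#!/bin/sh
# expiry_date TTL
# Print the date that lies TTL from now as YYYYMMDD, e.g. TTL="3 months".
# Assumption: GNU date, which accepts relative offsets such as "+3 months".
expiry_date() {
    date -d "+$1" +%Y%m%d
}

# Today's date in the same format, for comparing against stored expiries.
today=$(date +%Y%m%d)
```

An entry whose stored expiry is numerically smaller than $today would then count as expired.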
Here's an example current URL list which my crawler would have got today:
[Each line is a URL followed by its expiry day, using a TTL of 3 days for this example, or whatever for that matter]
b.com 23
e.com 23
f.com 23
Here's an example of the current master URL file used for comparison:
a.com 19
b.com 20
c.com 21
d.com 21
Here's an example of the updated master URL file after comparison:
b.com 23
c.com 21
d.com 21
e.com 23
f.com 23
Here's what the final blacklist should look like:
b.com
c.com
d.com
e.com
f.com
How can I do this using sed/grep/date (if it is indeed possible)? Unfortunately, I can't install any SQL db on this machine, which I realize would make things easy.
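For reference, a rough sketch of the comparison step using awk, sort and cut rather than sed/grep alone, assuming awk is available on the machine. It assumes both files hold "URL expiry" pairs separated by a space as in the examples above, that the expiry is a plain number, and that an entry counts as expired when its expiry is smaller than today's value. The function names update_master and blacklist, and the file names in the usage comment, are made up for illustration.

```shell
#!/bin/sh
# update_master TODAY CRAWL_FILE MASTER_FILE
# Print the updated master list ("URL expiry" per line, sorted) to stdout.
update_master() {
    today=$1
    crawl=$2
    master=$3
    awk -v today="$today" '
        # Skip blank or malformed lines.
        NF < 2 { next }
        # Crawl file: every URL seen today takes the expiry given there.
        FILENAME == ARGV[1] { expiry[$1] = $2; next }
        # Master file: keep an entry only if it was not seen today
        # and its expiry has not passed yet (assumed: expiry >= today).
        !($1 in expiry) && $2 + 0 >= today + 0 { expiry[$1] = $2 }
        END { for (url in expiry) print url, expiry[url] }
    ' "$crawl" "$master" | sort
}

# blacklist MASTER_FILE
# Print only the URL column of the master list.
blacklist() {
    cut -d' ' -f1 "$1"
}

# Possible usage (hypothetical file names):
#   update_master 20 crawl.txt master.txt > master.new && mv master.new master.txt
#   blacklist master.txt > blacklist.txt
```

With today taken as 20, the example files above would give the updated master file and blacklist shown, since a.com (expiry 19) is dropped and b.com takes the new expiry from the crawl list.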
Again, any help would be much appreciated.
Thanks in advance