Calculating expiry date using date, sed, grep

Hi,

I would greatly appreciate it if someone can help me with my problem.

I have a crawler which collects spam URLs every day, and this data needs to be published in a blacklist.

Here's the catch:

The "Time To Live" (TTL) for each URL is 3 months (or whatever for that matter). If I see the same URL again before its TTL expires, I need to update that URL's TTL, so it stays in the blacklist for another 3 months (TTL).

The URLs which are not seen again within the TTL need to be removed from the list once the TTL expires, so I don't keep old data and can manage the size of my blacklist.

Here's an example of the current URL list which my crawler would have collected today:
[URL followed by the day it was seen; the TTL here is 3 days, or whatever for that matter]

b.com 23
e.com 23
f.com 23

Here's an example of the current master URL file used for comparison:

a.com 19
b.com 20
c.com 21
d.com 21

Here's an example of the updated master URL file after comparison:

b.com 23
c.com 21
d.com 21
e.com 23
f.com 23

Here's what the final blacklist should look like:

b.com
c.com
d.com
e.com
f.com

How can I do this using sed/grep/date (if it is indeed possible)? Unfortunately, I can't install any SQL db on this machine, which I realize would make things easy.

Again, any help would be much appreciated.

Thanks in advance

Try this...

awk 'NR==FNR{a[$1]=$2;next} {if($1 in a){a[$1]+=(a[$1]-$2)}else{a[$1]=$2}}
END{for(i in a){print i" "a[i]>"new_master_file";print i>"blacklist"}}' curr_master_file curr_url_file

blacklist - will have the unique URLs
new_master_file - will have the updated data

--ahamed

Doesn't work. This is what I get in the new master file:

a.com 19
b.com 17
c.com 21
d.com 21
e.com 23
f.com 23

But it should be:

b.com 23
c.com 21
d.com 21
e.com 23
f.com 23

Can you please explain once more why a.com is missing from your updated master file?
Maybe I didn't get that point.

--ahamed

Because a.com has crossed its time to live of 3 days. In the original master list the entry is:

a.com 19

and today is the 23rd. The difference is 4 days, which is greater than the TTL of 3 days, so it drops out of the master list.
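
The expiry check described above can be sketched as a small awk test. This is only an illustration, not part of the commands posted in this thread: the function name expiry_status and the variables today and ttl are assumptions, and it works on the plain day numbers from the example.

```shell
# Sketch only: label each master entry as kept or expired.
# expiry_status, today and ttl are illustrative names (assumptions).
expiry_status() {
    # $1 = day number for today, $2 = TTL in days; entries are read from stdin
    awk -v today="$1" -v ttl="$2" '{
        age = today - $2                                   # days since the URL was last seen
        print $1, age, (age > ttl ? "expired" : "kept")
    }'
}

# The master list from the example, checked on the 23rd with a TTL of 3 days.
printf '%s\n' 'a.com 19' 'b.com 20' 'c.com 21' 'd.com 21' | expiry_status 23 3
```

With the example data only a.com has an age (4) above the TTL, so it is the only entry labelled expired.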

Can you please explain your code, so I understand what's going on?

try this...

awk 'NR==FNR{a[$1]=$2;next} {d=$2;if($1 in a){a[$1]+=($2-a[$1])}else{a[$1]=$2}}
END{for(i in a){if(d-a[i]>3){continue}print i" "a[i]>"new_master_file";print i>"blacklist"}}' curr_master_file curr_url_file

--ahamed
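
Since an explanation of the code was asked for, here is a spelled-out sketch of the same refresh-and-expire idea with comments. It is a rewrite for readability, not the exact command posted above: the function name update_master, the array name seen and the ttl parameter are assumptions, and the output is piped through sort only to make the order predictable.

```shell
# Sketch only: merge the crawl from today into the master list and drop expired URLs.
# update_master, seen, today and ttl are illustrative names (assumptions).
update_master() {
    # $1 = current master file, $2 = URL file from today, $3 = TTL in days
    awk -v ttl="$3" '
        NR == FNR { seen[$1] = $2; next }   # first file: remember the last-seen day per URL
        { today = $2; seen[$1] = $2 }       # second file: refresh a known URL or add a new one
        END {
            for (url in seen)
                if (today - seen[url] <= ttl)   # keep only entries still within the TTL
                    print url, seen[url]
        }
    ' "$1" "$2" | sort
}

# Example run on the data from this thread, in a scratch directory.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 'a.com 19' 'b.com 20' 'c.com 21' 'd.com 21' > curr_master_file
printf '%s\n' 'b.com 23' 'e.com 23' 'f.com 23' > curr_url_file
update_master curr_master_file curr_url_file 3 > new_master_file
cut -d' ' -f1 new_master_file > blacklist
cat new_master_file
```

Like the original, this takes "today" from the crawl file and compares plain day numbers, so it would misbehave across a month boundary. One possible workaround, where date +%s is available, is to store a day count since the epoch (for example $(( $(date +%s) / 86400 ))) instead of the day of the month.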