FYI,
I have been quietly updating the man page database, adding "similar threads" for each man page.
STEP 1: Full Text MySQL DB Search Matches
The first step, after creating the DB columns, was to process each of the nearly 400K man pages: run a full text MySQL search of the man page against each post in the DB, score the matches, and keep the top 15 matching threadids (or fewer than 15, depending on the matches and scores).
That process took a few days and resulted in around one third of the man page entries having similar thread entries (I forgot to record the exact stats at that point).
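A minimal sketch of how step 1 could look in Python. The table and column names (`post`, `title`, `pagetext`, `threadid`) are assumptions, since the real schema is not shown here, and `top_threads` is a hypothetical helper. The query uses MySQL's `MATCH ... AGAINST` full text scoring per post, and the helper collapses the per-post scores into the best-scoring threads, keeping at most 15.

```python
# Hypothetical schema: a `post` table with a FULLTEXT index on (title, pagetext).
# Both %s placeholders receive the same search text taken from the man page.
TOP_POSTS_SQL = """
SELECT threadid,
       MATCH(title, pagetext) AGAINST (%s IN NATURAL LANGUAGE MODE) AS score
FROM post
WHERE MATCH(title, pagetext) AGAINST (%s IN NATURAL LANGUAGE MODE)
ORDER BY score DESC
"""


def top_threads(rows, limit=15):
    """Reduce (threadid, score) rows, one per post, to the best threads.

    A thread keeps the highest score of any of its posts. Rows with a
    score of zero are ignored, so fewer than `limit` ids may come back.
    """
    best = {}
    for threadid, score in rows:
        if score > 0 and score > best.get(threadid, 0.0):
            best[threadid] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [threadid for threadid, _ in ranked[:limit]]
```

The rows would come from running the query through any MySQL client library; the helper itself has no database dependency.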
STEP 2: Cross Reference Similar Man Pages in Thread DB Back to Man Page Entries
Then, for the remaining man pages with no entries from step 1, I took the similarman entries for each thread (the similar man pages per thread, created a number of weeks ago), did a simple boolean match on the man page ids associated with each similar man page, and created a list of thread matches ordered by the thread reply count in the DB. That process will complete today (in about 3 hours from now, give or take), and a lot of man pages will still have no matches after steps 1 and 2.
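The cross reference in step 2 can be sketched in a few lines of Python. This is only an illustration under assumed names: each thread is a dict with hypothetical keys `threadid`, `replycount` and `similarman` (a list of man page ids), and `threads_by_similar_man` is not the actual script.

```python
def threads_by_similar_man(threads, unmatched_man_ids, limit=15):
    """Map each still-unmatched man page id to threads that list it.

    A thread matches a man page when the man page id appears in the
    thread's similarman list (a plain boolean match, no scoring).
    Matches are ordered by reply count, highest first.
    """
    unmatched = set(unmatched_man_ids)
    candidates = {man_id: [] for man_id in unmatched}
    for thread in threads:
        for man_id in set(thread["similarman"]) & unmatched:
            candidates[man_id].append((thread["replycount"], thread["threadid"]))

    result = {}
    for man_id, found in candidates.items():
        if not found:
            continue  # this man page stays unmatched for the next step
        found.sort(key=lambda pair: pair[0], reverse=True)
        result[man_id] = [threadid for _, threadid in found[:limit]]
    return result
```

Man page ids missing from the result are the ones left over for the next matching pass.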
STEP 3: Boolean Matches of Man Page Name with Thread Tags
Then, I will take the remaining man pages without any similar threads and repeat step 2, this time matching the name of the man page (only the query, for example 'sshd') against the tags for each thread, ordering the matches by thread reply count and keeping up to 15 matches, as before.
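As a rough Python sketch of the tag match in step 3, assuming each thread is a dict with hypothetical keys `threadid`, `replycount` and `tags` (a list of strings). The comparison here is an exact, case-insensitive match of the man page name against whole tags; whether the real match is exact or partial is an assumption.

```python
def threads_by_tag(man_name, threads, limit=15):
    """Return up to `limit` thread ids tagged with the man page name.

    Matching is boolean: a thread either carries the tag or it does not.
    Results are ordered by reply count, highest first.
    """
    wanted = man_name.strip().lower()
    hits = [
        thread for thread in threads
        if wanted in {tag.strip().lower() for tag in thread["tags"]}
    ]
    hits.sort(key=lambda thread: thread["replycount"], reverse=True)
    return [thread["threadid"] for thread in hits[:limit]]
```

With this exact match, a query of 'sshd' would not pick up a thread tagged only 'ssh', which keeps the matches narrow at the cost of missing some related threads.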
After that, I will look at the man pages that still have no matching threads and decide what match to try next.
The purpose of all this is to create more relevant content for each man page in the DB by providing users with a list of discussion threads related to the man page; hence the name, "similar threads for man pages". In addition, this could help SEO, as Google currently includes only between 10 and 15% of our entire man page collection in its index. I would like to increase this percentage in 2020 to somewhere between 25 and 40%.
Currently, there are a few hours remaining for step 2:
1577593027 Time: 54 Inserts: 116 Floor: 6000 Limit: 300 ToDo: 64839 RemainingTime: 3.6 Hours QLoad: 1.06
1577593080 Time: 55 Inserts: 103 Floor: 6000 Limit: 300 ToDo: 64548 RemainingTime: 3.6 Hours QLoad: 1.17
1577593138 Time: 53 Inserts: 110 Floor: 6000 Limit: 300 ToDo: 64248 RemainingTime: 3.6 Hours QLoad: 1.27
1577593196 Time: 53 Inserts: 108 Floor: 6000 Limit: 300 ToDo: 63948 RemainingTime: 3.6 Hours QLoad: 1.23
1577593257 Time: 53 Inserts: 98 Floor: 6000 Limit: 300 ToDo: 63648 RemainingTime: 3.5 Hours QLoad: 1.04
1577593332 Time: 54 Inserts: 108 Floor: 6000 Limit: 300 ToDo: 63344 RemainingTime: 3.5 Hours QLoad: 1.01
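For anyone wanting to sanity check a progress log like this one, here is a small Python sketch that estimates the remaining time from the log lines alone. It assumes the first field is a Unix timestamp and that ToDo counts the man pages still to be processed; the other fields are ignored, and `remaining_hours` is a hypothetical helper, not how the RemainingTime column is actually computed.

```python
import re

# Capture the leading timestamp and the ToDo counter from one log line.
LOG_LINE = re.compile(r"^(\d+)\s+.*?\bToDo:\s*(\d+)")


def remaining_hours(log_lines):
    """Estimate hours left from the first and last parsable log lines.

    The rate is (drop in ToDo) / (elapsed seconds); the estimate is the
    last ToDo value divided by that rate. Returns None if no rate can be
    derived, for example with fewer than two lines or no progress.
    """
    points = []
    for line in log_lines:
        match = LOG_LINE.match(line.strip())
        if match:
            points.append((int(match.group(1)), int(match.group(2))))
    if len(points) < 2:
        return None
    (start_time, start_todo), (end_time, end_todo) = points[0], points[-1]
    done = start_todo - end_todo
    elapsed = end_time - start_time
    if done <= 0 or elapsed <= 0:
        return None
    return end_todo / (done / elapsed) / 3600
```

Fed the first and last lines of the log above, this works out to roughly 3.6 hours, which is in line with the RemainingTime values shown.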
After step 2 is done, I will start step 3 (but I will remember to record a few simple stats before I start step 3).