Similar Threads for Man Pages - In Development

FYI,

I have been quietly updating the man page database adding "similar threads" for man pages.

STEP 1: Full Text MySQL DB Search Matches

The first step, after creating the DB columns, was to process each of the nearly 400K man pages: run a full text MySQL search against each post in the DB, score the matches, and keep the top 15 threadids (or fewer than 15, depending on the matches and scores).
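A rough sketch of what step 1 could look like, not the actual code used here. The table and column names (`post`, `pagetext`, `threadid`), the 200-row cap and the helper name are assumptions for illustration; the query relies on MySQL's `MATCH ... AGAINST` full text syntax and needs a FULLTEXT index on the searched column. The `%s` placeholders are the style used by common Python MySQL drivers.

```python
# Hypothetical schema: a `post` table with `threadid` and a FULLTEXT-indexed `pagetext`.
# The same search string is bound to both %s placeholders.
FULLTEXT_SQL = (
    "SELECT threadid, MATCH(pagetext) AGAINST (%s) AS score "
    "FROM post "
    "WHERE MATCH(pagetext) AGAINST (%s) "
    "ORDER BY score DESC "
    "LIMIT 200"  # assumed cap: fetch more rows than needed, then de-duplicate
)


def top_threads(scored_rows, limit=15):
    """Reduce (threadid, score) rows, best score first, to unique thread ids.

    Several posts can belong to the same thread, so only the first
    (highest scoring) hit per thread is kept, up to `limit` threads.
    """
    kept = []
    for threadid, score in scored_rows:
        if score > 0 and threadid not in kept:
            kept.append(threadid)
            if len(kept) == limit:
                break
    return kept


# Example with made-up rows, as a driver's cursor.fetchall() might return them.
rows = [(7, 9.1), (7, 8.0), (3, 5.5), (9, 0.0)]
print(top_threads(rows))
```

A man page with no positive-scoring posts simply gets an empty list, which is what leaves it for the later steps.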

That process took a few days and resulted in around one third (forgot to record the stats at that point) of the man page entries having similar thread entries.

STEP 2: Cross Reference Similar Man Pages in Thread DB Back to Man Page Entries

Then, for the remaining man pages with no entries from the process above (step 1), I took the similarman entries for each thread and did a simple boolean match for man page ids associated with each similar man page (created a number of weeks ago) and created a list of thread matches ordered by the thread reply count in the DB. That process will complete today (in about 3 hours from now, give or take) and there will remain a lot of man pages with no matches based on steps 1 and 2.
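The cross reference in step 2 can be pictured as inverting the thread-to-man-page lists. This is only a sketch under assumptions: each thread is represented as a dict with hypothetical keys `threadid`, `replycount` and `similarman` (a list of man page ids), which is not necessarily how the data is stored in the DB.

```python
from collections import defaultdict


def threads_by_man_page(threads, limit=15):
    """Invert thread -> similar man pages into man page -> threads.

    For every man page id, threads are ordered by reply count
    (highest first) and at most `limit` thread ids are kept.
    """
    index = defaultdict(list)
    for thread in threads:
        for man_id in thread["similarman"]:
            index[man_id].append((thread["replycount"], thread["threadid"]))
    return {
        man_id: [tid for _, tid in sorted(pairs, reverse=True)[:limit]]
        for man_id, pairs in index.items()
    }


# Made-up data: two threads that both list man page 10 as similar.
threads = [
    {"threadid": 1, "replycount": 5, "similarman": [10, 11]},
    {"threadid": 2, "replycount": 9, "similarman": [10]},
]
print(threads_by_man_page(threads))
```

Man pages that never appear in any thread's similar man page list get no entry at all, so they stay orphans after this step.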

STEP 3: Boolean Matches Man Page Name with Thread Tags

Then, I will take the remaining man pages without any similar threads and repeat step two matching the name of the man page (only the query, for example ' sshd ') against the tags for each thread, and order the matches by thread reply count, and keep up to 15 matches, as before.
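As an illustration of the whole-word idea behind matching ' sshd ' with spaces around it, here is a small sketch. The thread dict keys (`threadid`, `replycount`, `tags`) and the `fields` parameter are assumptions, and the regular expression is just one way to avoid matching the name inside a longer word.

```python
import re


def matches_name(name, text):
    """True if `name` occurs in `text` as a whole token, not inside a longer word."""
    pattern = r"(?<![\w-])" + re.escape(name.lower()) + r"(?![\w-])"
    return re.search(pattern, text.lower()) is not None


def similar_threads(name, threads, fields=("tags",), limit=15):
    """Thread ids whose chosen fields mention `name`, most replies first."""
    hits = [
        t for t in threads
        if any(matches_name(name, t.get(field, "")) for field in fields)
    ]
    hits.sort(key=lambda t: t["replycount"], reverse=True)
    return [t["threadid"] for t in hits[:limit]]


# Made-up threads for illustration only.
threads = [
    {"threadid": 1, "replycount": 3, "tags": "ssh, sshd, login"},
    {"threadid": 2, "replycount": 8, "tags": "opensshd, solaris"},
    {"threadid": 3, "replycount": 12, "tags": "sshd config"},
]
print(similar_threads("sshd", threads))
```

Passing `fields=("tags", "title")` would extend the same match to another text field without changing the logic.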

After that, I will look at the man pages that still have no matching threads and decide what match I can try next.

The purpose of all this is to create more relevant content for each man page in the DB by giving users a list of discussion threads related to that man page; hence the name, "similar threads for man pages". In addition, this could help SEO, as Google is only including between 10% and 15% of our entire man page collection in its index. I would like to increase this percentage in 2020 to somewhere between 25% and 40%.

Currently, there are a few hours remaining for step 2:

1577593027 Time: 54 Inserts: 116 Floor: 6000 Limit: 300 ToDo: 64839 RemainingTime: 3.6 Hours QLoad: 1.06
1577593080 Time: 55 Inserts: 103 Floor: 6000 Limit: 300 ToDo: 64548 RemainingTime: 3.6 Hours QLoad: 1.17
1577593138 Time: 53 Inserts: 110 Floor: 6000 Limit: 300 ToDo: 64248 RemainingTime: 3.6 Hours QLoad: 1.27
1577593196 Time: 53 Inserts: 108 Floor: 6000 Limit: 300 ToDo: 63948 RemainingTime: 3.6 Hours QLoad: 1.23
1577593257 Time: 53 Inserts: 98 Floor: 6000 Limit: 300 ToDo: 63648 RemainingTime: 3.5 Hours QLoad: 1.04
1577593332 Time: 54 Inserts: 108 Floor: 6000 Limit: 300 ToDo: 63344 RemainingTime: 3.5 Hours QLoad: 1.01
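For anyone reading the log lines, the RemainingTime column looks consistent with a simple estimate: ToDo divided by Limit gives the number of runs left, at roughly one run per minute. The one-minute interval is an assumption based on the timestamps, and the function name is made up for this sketch.

```python
def remaining_hours(todo, limit, seconds_per_run=60):
    """Estimate hours left: runs still needed times the assumed run interval."""
    runs_left = todo / limit
    return runs_left * seconds_per_run / 3600


# First and last log lines above: ToDo 64839 and 63344, Limit 300.
print(round(remaining_hours(64839, 300), 1))
print(round(remaining_hours(63344, 300), 1))
```

The same arithmetic explains why a smaller Limit stretches the estimate so much: the number of runs grows in proportion.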

After step 2 is done, I will start step 3 (but I will remember to record a few simple stats before I start step 3).


OK.

Step 2 is done:

1577606166 Time: 53 Inserts: 74 Floor: 6000 Limit: 300 ToDo: 2250 RemainingTime: 0.1 Hours QLoad: 1.74
1577606242 Time: 56 Inserts: 85 Floor: 6000 Limit: 300 ToDo: 1950 RemainingTime: 0.1 Hours QLoad: 1.39
1577606278 Time: 55 Inserts: 97 Floor: 6000 Limit: 300 ToDo: 1750 RemainingTime: 0.1 Hours QLoad: 1.49
1577606342 Time: 53 Inserts: 97 Floor: 6000 Limit: 300 ToDo: 1450 RemainingTime: 0.1 Hours QLoad: 1.48
1577606398 Time: 52 Inserts: 75 Floor: 6000 Limit: 300 ToDo: 1150 RemainingTime: 0.1 Hours QLoad: 1.34
1577606470 Time: 53 Inserts: 75 Floor: 6000 Limit: 300 ToDo: 850 RemainingTime: 0.0 Hours QLoad: 1.26
1577606526 Time: 54 Inserts: 101 Floor: 6000 Limit: 300 ToDo: 550 RemainingTime: 0.0 Hours QLoad: 1.26
1577606589 Time: 55 Inserts: 60 Floor: 6000 Limit: 300 ToDo: 250 RemainingTime: 0.0 Hours QLoad: 1.17
1577606633 Time: 51 Inserts: 70 Floor: 6000 Limit: 300 ToDo:  RemainingTime: 0.0 Hours QLoad: 1.41
1577606650 Time: 1 Inserts: 0 Floor: 6000 Limit: 300 ToDo:  RemainingTime: 0.0 Hours QLoad: 1.25
1577606714 Time: 1 Inserts: 0 Floor: 6000 Limit: 300 ToDo:  RemainingTime: 0.0 Hours QLoad: 0.83
1577606764 Time: 1 Inserts: 0 Floor: 6000 Limit: 300 ToDo:  RemainingTime: 0.0 Hours QLoad: 0.48
1577606826 Time: 1 Inserts: 0 Floor: 6000 Limit: 300 ToDo:  RemainingTime: 0.0 Hours QLoad: 0.57
1577606884 Time: 0 Inserts: 0 Floor: 6000 Limit: 300 ToDo:  RemainingTime: 0.0 Hours QLoad: 0.40

But as you can see, even after processing all the orphans (man pages without matches) for MySQL full text matches (step 1) and cross referencing similarman in threads with the man pages (step 2), a whopping 63% of all man pages are still similar thread orphans:

mysql> select count(1) from neo_man_page_entry where similarthread ="none"; select count(1) from neo_man_page_entry;
+----------+
| count(1) |
+----------+
|   218444 |
+----------+
1 row in set (0.82 sec)

+----------+
| count(1) |
+----------+
|   347938 |
+----------+
1 row in set (0.00 sec)

mysql> 

So, I'm on to step 3 now. Boolean matches between the name of the man page and the tags (update: and the thread titles) for each thread, ordered by reply count. I changed the process (from my step 3 above) to match both thread tags and thread titles, to see if this helps speed things along toward the goal of all the man pages having at least one similar thread entry.

1577609125 Time: 35 Inserts: 63 Floor: 6000 Limit: 300 ToDo: 218340 RemainingTime: 12.1 Hours QLoad: 0.54
1577609318 Time: 35 Inserts: 63 Floor: 6000 Limit: 300 ToDo: 218040 RemainingTime: 12.1 Hours QLoad: 0.41
1577609392 Time: 34 Inserts: 70 Floor: 6000 Limit: 300 ToDo: 217740 RemainingTime: 12.1 Hours QLoad: 0.66
1577609437 Time: 34 Inserts: 68 Floor: 6000 Limit: 300 ToDo: 217440 RemainingTime: 12.1 Hours QLoad: 0.97

Let's see what happens twelve hours from now after this batch processing finishes.

I may move to straightforward boolean matches in the text of the posts (against the name of the man page) for step 4, but that seems too crude, so I'll need to ponder on this, and test it, later. But if I add the operating system, that might be too refined and result in a very small number of matches, since we have always had quite a hard time getting people to describe their OS when they post a question!

If everyone posted system details when they asked a question, this would make matches a lot better; but they rarely do.

Step 3 is done. The result is that the orphans have dropped from 63% to 53%.

mysql> select count(1) as count from neo_man_page_entry where similarthread = "notagsmatch"; select count(1) as count from neo_man_page_entry;
+--------+
| count  |
+--------+
| 185765 |
+--------+
1 row in set (0.93 sec)

+--------+
| count  |
+--------+
| 347938 |
+--------+
1 row in set (0.00 sec)

So, I now start step 4:

STEP 4: Boolean Matches Man Page Name with Post Text

I will take the remaining man pages without any similar threads and repeat step three but matching the name of the man page (only the query, for example ' sshd ') against the page text for each post and get the threadid from the post, and order the matches by the number of times the thread was thanked by users, and keep up to 15 matches, as before.

We will see how many man page orphans find thread relatives in this step 4 manner.

Running...... looks like this query and update will take around a week or so, so as not to overload the server.

1577672816 Time: 54 Inserts: 6 Floor: 6000 Limit: 20 ToDo: 185205 RemainingTime: 154.3 Hours QLoad: 1.65
1577672876 Time: 52 Inserts: 4 Floor: 6000 Limit: 20 ToDo: 185185 RemainingTime: 154.3 Hours QLoad: 1.53
1577672942 Time: 54 Inserts: 1 Floor: 6000 Limit: 20 ToDo: 185165 RemainingTime: 154.3 Hours QLoad: 1.04
1577673000 Time: 53 Inserts: 0 Floor: 6000 Limit: 20 ToDo: 185145 RemainingTime: 154.3 Hours QLoad: 1.20

...and so far, it looks like the orphans will only be reduced by a relatively small amount (less than 15% of the total remaining orphans, I guess... let's see).
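To show how a background job like this can be kept from overloading the server, here is a minimal sketch of one throttled run. It assumes that Limit is the number of man pages handled per run (ToDo drops by that amount on each log line) and that QLoad is something like the system load average; the function name, the `max_load` threshold and the `process_one` callback are all hypothetical.

```python
import os


def run_batch(pending_ids, process_one, limit=20, max_load=2.0,
              get_load=os.getloadavg):
    """Process at most `limit` ids; skip the run entirely if the load is high.

    `process_one(man_id)` should return True when a similar thread was stored.
    Returns (number_of_inserts, ids_still_pending).
    """
    load_1min = get_load()[0]
    if load_1min > max_load:
        # Server is busy: do nothing and leave the work for the next cron run.
        return 0, pending_ids
    batch, rest = pending_ids[:limit], pending_ids[limit:]
    inserts = sum(1 for man_id in batch if process_one(man_id))
    return inserts, rest
```

Called once a minute from cron, this spreads the work over many small runs, which is why the estimate runs to days rather than hours. Note that `os.getloadavg` is only available on Unix-like systems.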

Update:

I may stop the cron job (step 4), which processes similar threads for man pages using only the name of the man page and the text of posts.

I'm not really getting enough "bang for the buck" from loading the server with these batch jobs in the background, where we can see that approximately 15% of the queries result in a match. But the server is now at its lowest load point of the year, so I will wait a few more days before deciding whether to stop this cron job:

1577727419 Time: 57 Inserts: 4 Floor: 6000 Limit: 15 ToDo: 169355 RemainingTime: 188.2 Hours QLoad: 1.38
1577727480 Time: 57 Inserts: 4 Floor: 6000 Limit: 15 ToDo: 169340 RemainingTime: 188.2 Hours QLoad: 1.58
1577727540 Time: 57 Inserts: 3 Floor: 6000 Limit: 15 ToDo: 169325 RemainingTime: 188.1 Hours QLoad: 1.52
1577727598 Time: 56 Inserts: 2 Floor: 6000 Limit: 15 ToDo: 169310 RemainingTime: 188.1 Hours QLoad: 1.35
1577727664 Time: 56 Inserts: 3 Floor: 6000 Limit: 15 ToDo: 169295 RemainingTime: 188.1 Hours QLoad: 1.05
1577727724 Time: 55 Inserts: 2 Floor: 6000 Limit: 15 ToDo: 169280 RemainingTime: 188.1 Hours QLoad: 1.18
1577727780 Time: 54 Inserts: 3 Floor: 6000 Limit: 15 ToDo: 169265 RemainingTime: 188.1 Hours QLoad: 1.23
1577727835 Time: 53 Inserts: 1 Floor: 6000 Limit: 15 ToDo: 169250 RemainingTime: 188.1 Hours QLoad: 1.52
1577727896 Time: 55 Inserts: 2 Floor: 6000 Limit: 15 ToDo: 169235 RemainingTime: 188.0 Hours QLoad: 1.26
1577727958 Time: 55 Inserts: 3 Floor: 6000 Limit: 15 ToDo: 169220 RemainingTime: 188.0 Hours QLoad: 1.57
1577728021 Time: 55 Inserts: 2 Floor: 6000 Limit: 15 ToDo: 169205 RemainingTime: 188.0 Hours QLoad: 1.44
1577728075 Time: 53 Inserts: 3 Floor: 6000 Limit: 15 ToDo: 169190 RemainingTime: 188.0 Hours QLoad: 1.84
1577728136 Time: 55 Inserts: 1 Floor: 6000 Limit: 15 ToDo: 169175 RemainingTime: 188.0 Hours QLoad: 1.94
1577728198 Time: 57 Inserts: 1 Floor: 6000 Limit: 15 ToDo: 169160 RemainingTime: 188.0 Hours QLoad: 1.52
1577728262 Time: 57 Inserts: 0 Floor: 6000 Limit: 15 ToDo: 169145 RemainingTime: 187.9 Hours QLoad: 1.39
1577728321 Time: 58 Inserts: 1 Floor: 6000 Limit: 15 ToDo: 169130 RemainingTime: 187.9 Hours QLoad: 1.34
1577728379 Time: 55 Inserts: 0 Floor: 6000 Limit: 15 ToDo: 169115 RemainingTime: 187.9 Hours QLoad: 1.05
1577728443 Time: 54 Inserts: 2 Floor: 6000 Limit: 15 ToDo: 169100 RemainingTime: 187.9 Hours QLoad: 1.04
1577728500 Time: 58 Inserts: 0 Floor: 6000 Limit: 15 ToDo: 169086 RemainingTime: 187.9 Hours QLoad: 1.21
1577728560 Time: 57 Inserts: 0 Floor: 6000 Limit: 15 ToDo: 169071 RemainingTime: 187.9 Hours QLoad: 1.30
1577728623 Time: 57 Inserts: 1 Floor: 6000 Limit: 15 ToDo: 169056 RemainingTime: 187.8 Hours QLoad: 1.32
1577728682 Time: 57 Inserts: 1 Floor: 6000 Limit: 15 ToDo: 169041 RemainingTime: 187.8 Hours QLoad: 1.57
1577728738 Time: 55 Inserts: 0 Floor: 6000 Limit: 15 ToDo: 169026 RemainingTime: 187.8 Hours QLoad: 1.52 
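A match rate can be pulled out of log lines like these with a short parser. This sketch assumes each run attempts Limit man pages and that Inserts counts the ones that found a match; the regular expression is fitted to the line format shown in this thread, including lines where ToDo is empty.

```python
import re

LINE_RE = re.compile(
    r"^(?P<ts>\d+) Time: (?P<time>\d+) Inserts: (?P<inserts>\d+) "
    r"(?:Floor|Ceiling): (?P<bound>\d+) Limit: (?P<limit>\d+) "
    r"ToDo:\s*(?P<todo>\d*)\s+RemainingTime: (?P<hours>[\d.]+) Hours "
    r"QLoad: (?P<qload>[\d.]+)\s*$"
)


def match_rate(lines):
    """Fraction of attempted man pages that produced an insert."""
    inserts = attempted = 0
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            inserts += int(m["inserts"])
            attempted += int(m["limit"])
    return inserts / attempted if attempted else 0.0


# Three lines copied from the log above; far too few to be representative.
sample = [
    "1577727419 Time: 57 Inserts: 4 Floor: 6000 Limit: 15 ToDo: 169355 RemainingTime: 188.2 Hours QLoad: 1.38",
    "1577727480 Time: 57 Inserts: 4 Floor: 6000 Limit: 15 ToDo: 169340 RemainingTime: 188.2 Hours QLoad: 1.58",
    "1577727540 Time: 57 Inserts: 3 Floor: 6000 Limit: 15 ToDo: 169325 RemainingTime: 188.1 Hours QLoad: 1.52",
]
print(f"{match_rate(sample):.1%}")
```

Feeding the whole log file through `match_rate` gives a steadier figure than any short window, since the number of inserts per run varies a lot.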

That means, for now, I'm going to put this project on hold. Here are the intermediate results, showing 52% orphans, which is an improvement over the early 63% orphan stat:

mysql> select count(1) as count from neo_man_page_entry where similarthread = "nopagetextmatch" or similarthread = "notagsmatch"; select count(1) as count from neo_man_page_entry;
+--------+
| count  |
+--------+
| 182585 |
+--------+
1 row in set (0.96 sec)

+--------+
| count  |
+--------+
| 347938 |
+--------+
1 row in set (0.00 sec)

For now, let it run slowly in background....


Hi Neo,
my 2 cents:
You may have done this already, but if not: knowing the type of process involved, I would have chosen a calm period for the task, as you did. To avoid wasting processor time on the different caches, I would also try to optimize what I can, where I can. I am not sure you can change the cache ratio of the FS or the underlying storage (I suppose that is more the provider's duty), but since you have access to your RDBMS kernel, I would reduce its cache working storage to force it to read the true data. This is efficient for big batch processes when you know you are after data that is not often read (so there is no chance of finding it in the caches). Of course, it impacts ordinary online interactive work, but as you have fewer requests from online users at the moment, that is acceptable. It should improve your step 4 a bit.

Thanks Victor,

Sounds good; but I don't want to put much more effort into this with all the other projects I have ongoing. If you have the exact PHP code for MySQL to do as you suggested, then that might take less of my time to implement. Right now I am quite busy on non-unix.com tags working with a number of people and vendors on LoRa and NB-IoT networking gear, regulatory issues, chips, specifications, development boards, gateways, code, libs, etc.

OBTW, here is an example of this "similar thread for man pages" working (man nologin):

https://www.unix.com/man-page/linux/5/nologin/

I think these "similar threads for man pages", combined with my earlier "similar man pages for man pages", will help search engines index the man pages with thin content (SEO).

Update: Focusing on man pages with string length less than 4000, currently showing about 53% orphan man pages:

mysql> select count(1) as count from neo_man_page_entry where similarthread = "nopagetextmatch" or similarthread = "notagsmatch" and strlen < 4000; select count(1) as count from neo_man_page_entry where strlen < 4000;

+--------+
| count  |
+--------+
| 108115 |
+--------+
1 row in set (1.01 sec)

+--------+
| count  |
+--------+
| 204819 |
+--------+
1 row in set (0.04 sec)

mysql> 
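One thing to keep in mind when reading this count: in MySQL, AND binds more tightly than OR, so a WHERE clause of the form `a OR b AND c` is evaluated as `a OR (b AND c)`. As typed, the strlen filter therefore applies only to the "notagsmatch" rows. Whether that changes the number depends on how many "nopagetextmatch" rows are 4000 bytes or longer, which is not shown here. The small demonstration below uses SQLite (which follows the same AND/OR precedence) and made-up rows, purely to show the effect of the parentheses.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE neo_man_page_entry (similarthread TEXT, strlen INTEGER)")
con.executemany(
    "INSERT INTO neo_man_page_entry VALUES (?, ?)",
    [
        ("nopagetextmatch", 9000),  # orphan, long page
        ("nopagetextmatch", 1000),  # orphan, short page
        ("notagsmatch", 9000),      # orphan, long page
        ("notagsmatch", 1000),      # orphan, short page
        ("123,456", 1000),          # has similar threads
    ],
)


def count(where):
    """Count rows matching a WHERE clause (demo helper, fixed strings only)."""
    sql = f"SELECT COUNT(1) FROM neo_man_page_entry WHERE {where}"
    return con.execute(sql).fetchone()[0]


as_written = count(
    "similarthread = 'nopagetextmatch' OR similarthread = 'notagsmatch' AND strlen < 4000"
)
grouped = count(
    "(similarthread = 'nopagetextmatch' OR similarthread = 'notagsmatch') AND strlen < 4000"
)
print(as_written, grouped)
```

With the parentheses, the long "nopagetextmatch" row is excluded, so the two counts differ on this toy data.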

Still focusing on man pages with strlen of less than 4000:

ubuntu# tail -f neo_simthread_for_man_pages_using_pagetext_timing.log
1577851988 Time: 56 Inserts: 9 Ceiling: 4000 Limit: 22 ToDo: 70540 RemainingTime: 53.4 Hours QLoad: 1.39
1577852043 Time: 57 Inserts: 8 Ceiling: 4000 Limit: 22 ToDo: 70519 RemainingTime: 53.4 Hours QLoad: 1.68
1577852104 Time: 56 Inserts: 5 Ceiling: 4000 Limit: 22 ToDo: 70497 RemainingTime: 53.4 Hours QLoad: 1.55
1577852169 Time: 58 Inserts: 2 Ceiling: 4000 Limit: 22 ToDo: 70475 RemainingTime: 53.4 Hours QLoad: 1.57
1577852221 Time: 57 Inserts: 0 Ceiling: 4000 Limit: 22 ToDo: 70455 RemainingTime: 53.4 Hours QLoad: 1.56
1577852285 Time: 57 Inserts: 1 Ceiling: 4000 Limit: 22 ToDo: 70433 RemainingTime: 53.4 Hours QLoad: 1.43
1577852336 Time: 55 Inserts: 5 Ceiling: 4000 Limit: 22 ToDo: 70413 RemainingTime: 53.3 Hours QLoad: 1.22
1577852410 Time: 55 Inserts: 1 Ceiling: 4000 Limit: 22 ToDo: 70391 RemainingTime: 53.3 Hours QLoad: 0.95
1577852477 Time: 56 Inserts: 2 Ceiling: 4000 Limit: 22 ToDo: 70369 RemainingTime: 53.3 Hours QLoad: 1.18
1577852519 Time: 58 Inserts: 2 Ceiling: 4000 Limit: 22 ToDo: 70353 RemainingTime: 53.3 Hours QLoad: 1.39

Let's see how the under 4000 byte-size orphans are doing around this time tomorrow.

Either way, it is working and live on the site, as you can see from this example:

nologin(5) [linux man page]

https://www.unix.com/man-page/linux/5/nologin/

I am still considering a "Step 5" to deal with the orphans, which I am guessing will end up (after "Step 4") being around half of the total man page repo. The ideas are as follows:

  • Split the names of man pages with underscores, dashes, colons, etc. in the name of the man page and search titles, tags and posts on one or more of those substrings.
  • Use the operating system of the man page query, and cross reference to the most popular threads in corresponding forums (Linux, Solaris, AIX, etc).
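The first idea, splitting man page names into searchable pieces, could be sketched as follows. The separator set, the minimum length of 3 and the example names are assumptions chosen for illustration; short fragments are dropped because they would match far too many posts.

```python
import re


def name_parts(man_page_name, min_len=3):
    """Split a man page name on underscores, dashes, colons and dots.

    Fragments shorter than `min_len` are discarded as too generic to search on.
    """
    parts = re.split(r"[_\-:.]+", man_page_name)
    return [part for part in parts if len(part) >= min_len]


for name in ("pam_ssh_agent_auth", "Net::SSH", "git-rebase", "sshd"):
    print(name, "->", name_parts(name))
```

Each resulting substring could then be run through the same title, tag and post matches used in the earlier steps.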

Maybe something else... maybe not. I will sleep on this, since sleep always brings more ideas and solutions. I always get a lot of work done and new ideas when I sleep and dream.

Here is where we stand, looking only at the man pages under 4000 bytes for similarthreads:

mysql> connect unixmanpages;select count(1) as count from neo_man_page_entry where similarthread = "nopagetextmatch" or similarthread = "notagsmatch" and strlen < 4000; select count(1) as count from neo_man_page_entry where strlen < 4000;
Connection id:    26801183
Current database: unixmanpages

+-------+
| count |
+-------+
| 94967 |
+-------+
1 row in set (1.09 sec)

+--------+
| count  |
+--------+
| 204819 |
+--------+
1 row in set (0.03 sec)

This means that 46% of all man pages under 4000 bytes are similar thread orphans.

Looking at all man pages:

mysql> connect unixmanpages;select count(1) as count from neo_man_page_entry where similarthread = "nopagetextmatch" or similarthread = "notagsmatch" ; select count(1) as count from neo_man_page_entry;
Connection id:    26806490
Current database: unixmanpages

+--------+
| count  |
+--------+
| 166050 |
+--------+
1 row in set (1.09 sec)

+--------+
| count  |
+--------+
| 347938 |
+--------+
1 row in set (0.00 sec)

mysql> 

This means that ~48% of all man pages are similar thread orphans.

Next, I will work on how to reduce the under 4000 byte man page similar thread orphans even further. I think I will work some kind of os (operating system) match, just to reduce the orphans so they have some similar thread friends.

Done for the under 4000 byte orphans:

mysql> connect unixmanpages;select count(1) as count from neo_man_page_entry where similarthread = "nopagetextmatch"  and strlen < 4000; select count(1) as count from neo_man_page_entry;
Connection id:    27647119
Current database: unixmanpages

+-------+
| count |
+-------+
| 87617 |
+-------+
1 row in set (0.48 sec)

+--------+
| count  |
+--------+
| 347938 |
+--------+
1 row in set (0.00 sec)

That is about 25% of all man pages...

Update:

With the remaining batch of under 4000 byte man page orphans without similar threads, I matched the os against the text in the posts and ordered the similar thread results by the number of "thread thanks" and now there are zero orphans.
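Looking back over the whole project, the steps form a fallback chain: each man page goes to the next matcher only if the previous one found nothing. The sketch below is one way to express that idea, not the code that actually ran. It assumes that markers such as "notagsmatch" and "nopagetextmatch" record the last stage that failed, which is inferred from the queries in this thread; the function and parameter names are hypothetical.

```python
def first_match(man_page, matchers, limit=15):
    """Try matchers in order and return the first non-empty result.

    `matchers` is an ordered list of (miss_marker, find) pairs, where
    `find(man_page)` returns a list of thread ids. Returns a tuple
    (thread_ids, marker); marker is None on success, otherwise the
    marker of the last matcher that came up empty.
    """
    marker = "none"
    for miss_marker, find in matchers:
        thread_ids = find(man_page)
        if thread_ids:
            return thread_ids[:limit], None
        marker = miss_marker
    return [], marker


# Made-up matchers: the tag match finds nothing, the post text match does.
matchers = [
    ("notagsmatch", lambda page: []),
    ("nopagetextmatch", lambda page: [501, 502]),
]
print(first_match("sshd", matchers))
```

Ordering the matchers from most to least specific means the crude matches only fill in where nothing better was found.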

So, I am calling this project "done" for now.

Setting data point for future reference:

Total Linux Man Pages in DB: 145,728
Total Indexed by Google (GSC): 19,287

Total Unix Man Pages in DB: 133,279
Total Indexed by Google (GSC): 12,235

Linux Man Page Index Coverage: 13%
Unix  Man Page Index Coverage: 9%
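The coverage figures follow directly from the counts above, with the Linux indexed count taken as 19287 pages. A quick check, using a made-up helper name:

```python
def coverage_percent(indexed, total):
    """Share of man pages in the DB that Google reports as indexed, in percent."""
    return 100.0 * indexed / total


linux = coverage_percent(19287, 145728)
unix = coverage_percent(12235, 133279)
print(f"Linux: {linux:.1f}%  Unix: {unix:.1f}%")
```

Re-running the same calculation with fresh Google Search Console numbers later in 2020 gives a like-for-like comparison against this baseline.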

Let's see if this improves and by how much in 2020, based on all the work I did on this in 2019.