Issue with Keyboard or Char Encoding During Migration

Yes, sure, I need time to think about this too. A lot of head scratching coming on.

Here ya go.....

This is an encoding issue with an old post in the PHP vB3 app.

The encoding of the title DB field from years ago is not specified because the designers of vB 15+ years ago did not think people would post such non-standard chars in titles of posts.

It is correct in discourse, in this case, and wrong in vB3, which is very interesting.

find . -name "*.*� | xargs grep "help" - UNIX for Dummies Questions & Answers - UNIX.COM Community

Isn't that how you see it?

It looks flawless in the new post in my browser.

Yes, I see it correctly on the new site but I see replacement character symbol instead of the double quote marks on the old site.

What title do you see on the old site?

OBTW, if I add this to the showthread PHP script:

$thread['title'] = utf8_encode($thread['title']);

(which I just did for fun to the old forum).

The issue is the same.
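For what it's worth, here is a minimal Python sketch of why `utf8_encode()` may not help. It assumes (a guess, not something confirmed from the database) that the odd quote in the title is stored as the single Windows-1252 byte 0x94. PHP's `utf8_encode()` treats its input as ISO-8859-1, which the `latin-1` codec mimics here.

```python
# Assumption: the stored title holds a Windows-1252 closing curly quote (byte 0x94).
raw_title = b'find . -name "*.*\x94 | xargs grep "help"'

# Roughly what utf8_encode() does: read the bytes as ISO-8859-1, write UTF-8.
as_latin1 = raw_title.decode("latin-1").encode("utf-8")

# What the assumption above would call for: read the bytes as Windows-1252.
as_cp1252 = raw_title.decode("cp1252").encode("utf-8")

print(as_latin1)  # 0x94 turns into the control character U+0094, not a quote
print(as_cp1252)  # 0x94 turns into U+201D, the right double quotation mark
```

If the real bytes are something else, the details change, but the idea is the same: converting from the wrong source encoding just produces a different wrong character.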

I think it is an HTML encoding issue in the vB app. If you look at the source code of the page in old forum:

So, there is some underlying issue with vBulletin3, which is one reason we are migrating: to get away from this way-past-EOL forum software.

If I add this to the showthread script in old forum:

header('Content-Type: text/html; charset=utf-8');

This does not help on old forums either:

header('Content-Type: text/html; charset=utf-16');

The issue is the same because of some encoding mismatch that produces the same "non-standard" mojibake.
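A rough Python sketch of what a charset mismatch looks like, for anyone following along. The raw bytes are hypothetical (0x94 is assumed, not read from the real database), and `render` is just an illustrative helper that decodes leniently the way a browser would.

```python
# Hypothetical raw bytes for the title; the 0x94 byte is an assumption.
raw_title = b'find . -name "*.*\x94 | xargs grep "help"'

def render(raw: bytes, charset: str) -> str:
    """Decode bytes leniently, substituting U+FFFD for anything invalid."""
    return raw.decode(charset, errors="replace")

for charset in ("utf-8", "cp1252"):
    text = render(raw_title, charset)
    has_replacement = "\ufffd" in text
    # ascii() keeps the output printable on any terminal encoding.
    print(f"{charset:8} replacement char: {has_replacement}  ->  {ascii(text)}")
```

Under that assumption, declaring `charset=utf-8` for bytes that are not valid UTF-8 is exactly what puts the replacement character on the page; the header only changes how the bytes are read, not the bytes themselves.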

Mojibake - Wikipedia

Dennis, are you getting bored with this mojibake stuff yet?

Actually, I'm feeling very good that the only issues you are finding at this point are outlier encoding issues which are already broken in the old, legacy, obsolete, long-EOL forum.

Old site title is broken.... due to encoding issue:

New site, title is looking good:

If we edit the title on the old site and replace the mojibake with regular double quotes, all will be OK.

But honestly, I would not worry about it; but of course we can if we want. It's been like this for nearly a decade and no one said anything about it before :slight_smile:

Yes, I'm getting bored with this.

What you are saying is that when I see

on the old forum, there is nevertheless, something wrong with, say, the second quote mark which after migration becomes

If so, then what is wrong with the second quote mark (Unicode value) in the old site???

I just don't understand how it can be that wrong and still display perfectly on my screen (unless it's 8bit value vs 7bit value or some such).

Otherwise, if it displays perfect then it should migrate perfect. Yes?

If so, then what is wrong with the second quote mark (Unicode value) in the old site???

Those quote marks are in an encoding that the PHP / HTML in this legacy vBulletin LAMP application does not process as proper "UTF-8", and so they get replaced with the "WTF?" mojibake symbol.

When you look on the old site, you are seeing encoding processed by PHP based on the legacy PHP encoding to HTML.

The new site does this totally differently, which is why it displays properly over there in communityville.

If you edit the old title and replace those oddly-encoded chars with the same quotes as on your keyboard the encoding will change, all will be great again and the world will be as one :slight_smile:
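If someone wanted to do that replacement in bulk rather than by hand, a minimal Python sketch might look like the following. `QUOTE_MAP` and `plain_quotes` are hypothetical names for illustration, and it assumes the title has already been decoded into proper Unicode text.

```python
# Map common typographic quotes to the plain ASCII ones found on a keyboard.
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})

def plain_quotes(title: str) -> str:
    """Return the title with typographic quotes replaced by ASCII quotes."""
    return title.translate(QUOTE_MAP)

fixed = plain_quotes('find . -name "*.*\u201d | xargs grep "help"')
print(fixed)
```

Plain ASCII quotes are encoded identically in ASCII, ISO-8859-1, Windows-1252 and UTF-8, which is why retyping them sidesteps the whole problem.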

I just edited the old title.... using the double quotes on my keyboard.

This is more than likely not about 7 / 8 bit ASCII; it is more than likely about UTF-8 and UTF-16.

See also: What is the difference between UTF-8 and UTF-16? - Quora

See also: Comparison of Unicode encodings - Wikipedia

Another thing I would say is, AFAIR, the only characters replaced by this "Replacement Character" placeholder are:

Nul
' <apostrophe>
" <double quote>

Does anyone else recall seeing any others?
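One way to answer that without relying on memory would be to count the suspicious bytes directly. This is only a sketch: `non_ascii_bytes` is a hypothetical helper, and the sample titles are made up; real ones would have to be pulled as raw bytes from the old forum's database.

```python
from collections import Counter

def non_ascii_bytes(raw_titles):
    """Count every byte above 0x7F, plus NUL, across a list of raw titles."""
    counts = Counter()
    for raw in raw_titles:
        counts.update(b for b in raw if b > 0x7F or b == 0x00)
    return counts

# Hypothetical sample titles, not taken from the real forum.
samples = [
    b'find . -name "*.*\x94 | xargs grep "help"',
    b"Can\x92t mount the drive",
    b"plain ASCII title",
]

for byte, count in sorted(non_ascii_bytes(samples).items()):
    print(f"0x{byte:02x} seen {count} time(s)")
```

Any byte value that shows up here is a candidate for ending up as the replacement character on a page served as UTF-8.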

Hi Dennis,

I think we are done with this odd encoding issue for now.

We need to stick with the plan of comparing old to new, with links for each; but we are moving off that target so let's call this task done.

Thanks for your help.

If anyone has links to posts in the new forums which have errors which are not in the old forums, please post both links and let me know what you found comparing old and new. Let's stick to my instruction of testing "bottom up" which means comparing the old with the new, posting the two links to each one (old and new) with details of what is different.

Continue here: Please Help Integrity Test New Discourse Forums V2

Otherwise, let's move on to other tasks.

Thanks so much for your help again!

I am going to close this sidebar thread on encoding issues because people are posting information without links to compare old versus new. Please forgive me for pointing this out, but if people are not going to follow our testing instructions and post links to old v. new, then it's best we stop and move on to other tasks. This testing phase is not about "speculation". It is brute force, bottom up testing.

Anyway, I'm done with this sidebar task for now. I know exactly what the issues are related to this encoding issue. Most originate in the original forums, and the vast majority of those are spam posts which are not public. I'm not going to waste time cleansing spam, LOL, nor writing code for remote outliers caused by strange encoding issues originating in the old forums.

Thanks again for all the great testing, everyone!

Great job!