Issue with Keyboard or Char Encoding During Migration

There is a minor issue lingering for which we currently have no working solution.

For example, see this pagetext in the original (old) forum mysql DB (continued thanks to hicksd8 for finding these and for looking into this interesting topic):

From original DB:

| Hi have two directory with below name in "/opt"

1-Source
2-Destination 

In "Source"� directory there is a lot's of files, with extensions (doc, docx , ppt, xls,...).
In "Destination"� directory only pdf version of (doc, docx) files that exist in source stored.

Now I want to create script that use "diff"� command check "source"� and get list of only (doc, docx) files after that look for related pdf file in "Destination"� if pdf version of (doc, docx) not exist in "destination"� store list of them on a file.

E.g.

1-Source
File1.doc
File2.docx
File3.doc
File4.ppt
File5.xls
File6.doc

2-Destination

File1.pdf
File3.pdf

Expected result after run script is:

File2.docx
File6.doc

Here is my script

diff -r "/opt/source"� "/opt/destination"

Any recommendation?
Thanks

UPDATE
Follow below post and work like charm:

comm -23 <(find dir1 -type f -exec bash -c 'basename "${0%.*}"' {} \; | sort) <(find dir2 -type f -exec bash -c 'basename "${0%.*}"' {} \; | sort)
test1

filenames - diff two directories, but ignore the extensions - Unix & Linux Stack Exchange
[/NOPARSE][/CODE]

Here is the original post:

https://www.unix.com/unix-for-beginners-questions-and-answers/284088-compare-two-directory-find-differents.html

Here is the migrated post:

https://community.unix.com/t/compare-two-directory-and-find-differents/377962

Note that some of the strange chars in the original DB become the "unknown unicode char" \uFFFD (the Unicode replacement character, U+FFFD) in the new DB.
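On the new (postgres) side these are easy to count directly. A minimal sketch, assuming Discourse's posts table and its raw column hold the migrated pagetext:

-- Count migrated posts containing the U+FFFD replacement character.
-- chr(65533) is U+FFFD, so the problem char never has to be pasted in.
SELECT count(*)
FROM posts
WHERE raw LIKE '%' || chr(65533) || '%';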

This encoding issue is very noticeable in spam in our database (mostly from non-English speakers and countries).

We see this occasionally in the DB from other non-English speakers and do not have a perfect solution for this issue, so far.

Here are some clues about this:

mysql - Trouble with UTF-8 characters; what I see is not what I stored - Stack Overflow

Some suggest code like this (the search strings here are the usual UTF-8-read-as-Windows-1252 mojibake sequences; treat them as representative, not as the exact sequences in our DB):

$final = str_replace("Â", "", $final);    // stray artifact left over from a non-breaking space
$final = str_replace("â€™", "'", $final); // right single quote
$final = str_replace("â€œ", '"', $final); // left double quote
$final = str_replace("â€“", '-', $final); // en dash
$final = str_replace("â€", '"', $final);  // right double quote; must run last, since "â€" is a prefix of the sequences above

I will try this on staging server today.

Anyone have any good or better ideas?

Note: Actually, this is kinda fun in a perverted kinda way. LOL


After this test, I may give this Ruby gem a try as well:

File: README - Documentation for mojibake (1.1.2)

See also: GitHub - dekellum/mojibake: Recover mojibake text using a reverse-mapping table

See also: Clean Up Weird Characters in Database | Digging Into WordPress

FYI, from the existing old mysql DB:

mysql> SELECT count(postid)  from post where pagetext like '%"%';
+---------------+
| count(postid) |
+---------------+
|            66 |
+---------------+
1 row in set (1.64 sec)

mysql> SELECT count(postid)  from post where pagetext like  '%" %';
+---------------+
| count(postid) |
+---------------+
|            14 |
+---------------+
1 row in set (1.68 sec)


mysql> SELECT count(postid)  from post where pagetext like '%"�%';                                                                                                                    
+---------------+
| count(postid) |
+---------------+
|            45 |
+---------------+
1 row in set (1.66 sec)

mysql> SELECT count(postid)  from post where pagetext like '%'%';                                                                                                                    
+---------------+
| count(postid) |
+---------------+
|           165 |
+---------------+
1 row in set (1.63 sec)

mysql> SELECT count(postid)  from post where pagetext like '%'%';                                                                                                                    
+---------------+
| count(postid) |
+---------------+
|            38 |
+---------------+
1 row in set (1.69 sec)


mysql> SELECT count(postid)  from post where pagetext like '%-%';
+---------------+
| count(postid) |
+---------------+
|             4 |
+---------------+
1 row in set (1.70 sec)

mysql> SELECT count(postid)  from post where pagetext like '%...%';
+---------------+
| count(postid) |
+---------------+
|            23 |
+---------------+
1 row in set (1.68 sec)

Now that the SELECTs show some goodies, maybe an UPDATE on the main DB? :slight_smile:

Like this:

This spam page (on new site) was full of those "odd diamond things"...

https://community.unix.com/t/infraction-for-solosx-spammed-advertisements/271573

... but I got rid of them on old site:

https://www.unix.com/user-infractions/142431-infraction-solosx-spammed-advertisements.html

Using this mysql code against the old forum mysql DB (for some reason, I had to run it twice, but that is OK):
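Something along these lines (a sketch only, not the exact query that was run; the table and column names are from the SELECTs above):

-- One targeted UPDATE per identified mojibake sequence; nothing generic.
-- Example: 'â€œ' (a mangled left double quote) mapped to its correct char.
UPDATE post
SET pagetext = REPLACE(pagetext, 'â€œ', '"')
WHERE pagetext LIKE '%â€œ%';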

Will run against staging mysql soon....


Hmmmmmmmmmmmmmm.......

Why do i keep getting apostrophe's converted to ? when i import - beets

Unicode apostrophe standardization - Style - MetaBrainz Community Discourse

This Ruby code did not match... however, running the search-and-replace mysql query directly against the DB seems to work (step by step).

Testing now on staging server.... if it looks OK, will use this to restore community for further testing.

This is all good... but I want to focus bottom-up... only on the specific chars causing a problem in our DB.

That is what I am doing now... finding the exact offending char and then finding the correct transform to cleanse it.

Please hold off on posting links unless the link contains solutions for the exact char we are having issues with (let's stick to the bottom-up approach, not top-down, for now).

I have all the chars we have found so far covered, so hold off (on these funny chars) until I get the various staging DBs synced.

Thanks.

Right now I have all the transforms I need based on what we have found so far. We can search for more in the next round. In other words, I know what the problem is. What we need is to find them and then fix them, from a bottom-up approach, because I am not going to run any code which "transforms" problems we have not identified and tested. I do not want unintended consequences from running code and other transforms unless they solve a specific, clearly identified issue.

Will update soon.

Need to remap ASCII characters to Unicode?

No. It's not that simple. If it was that simple, there would be no issue now. (The migration script already does encoding mapping from day 1. The DB is already UNICODE... the Ruby script already does encoding mapping. It is not as simple as a "general remap" or it would be done already.)

Let's stick with the plan. Find links with problems. I know how to fix these if everyone will follow my original plan and provide specific links with specific issues (original v. migrated posts).

What I need are EXACT examples of the real problem in OUR DB (not theory). Thanks. The links I posted were directly related to the EXACT char problem I am working on today (right now). I used that code as a basis to directly address problems you guys found.

We want to work this bottom-up. Bottom-up means finding the exact issues (not theory) and fixing the exact transform for each encoding issue.

Please. I'm busy and need to get this done the way I know will work. The only way to get this done correctly and surely is bottom-up. Not top-down theory and speculation.

Thanks.

What I need from testers, in this thread, is examples of ORIGINAL versus MIGRATED posts. I can take care of the rest (finding the encoding in the DB, finding the correct transform, writing the code, running it, testing it in the DB, etc.). Please keep on track looking for issues. That is the best way to help get this done.

Everything we have identified so far, I already have a solution for, and have tested it, and it works.

What I need are more examples of any error, anomaly or other data migration integrity issue, in two links (the original post and the migrated post).


Here is the simple version.

If we have a post full of ORIGINAL v. MIGRATED threads, it is easy for me to compare, come up with code, test, and retest.

Without the links, or with links scattered all over the place (email, WhatsApp messages, carrier pigeon), it is hard for me to go back and test, and it takes me too much time because there is a great amount of work to do.

This is why I called for testing exactly as I did in my first call for testing:

Here is an image of what we need, from my first post on this caper:

Please Help Integrity Test New Discourse Forums V2


While I am doing another test run, let me try to explain this better.

Our DB is nearly 15 years old.

People have copy-and-pasted all kinds of encodings into the database. That stuff may or may not have been transformed to the encoding of the DB. In addition, over the years, the encoding of the DB has changed. It was not UNICODE in the beginning.

The same is true for keyboards. People have typed on all kinds of keyboards over the years. Sometimes this adds to the encoding problem, but generally it comes from copy-and-paste, from what I have seen. Many people like to write their posts in a desktop editor and copy and paste them into the forums.

So, running any generic encoding translation will not work for all encodings. If it did, this problem would have already been solved. Sometimes UNICODE does not work because there are encoded chars which are not part of UNICODE.
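For the record, the generic one-shot fix people usually point to (e.g., in the Stack Overflow link earlier) is something like the sketch below. It assumes every affected row is UTF-8 that was decoded exactly once as latin1, which is precisely the assumption that does not hold across 15 years of mixed sources:

-- Classic "un-double-encode" recipe (sketch; do NOT run blindly).
-- Reinterprets latin1-mangled text as the utf8mb4 it originally was,
-- table-wide, which is exactly why it cannot be trusted here.
UPDATE post
SET pagetext = CONVERT(BINARY CONVERT(pagetext USING latin1) USING utf8mb4);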

It's not a theory. It is a fact of years of having a busy forum with people all over the world copy-and-pasting their locally encoded text into our DB. Sometimes we get lucky and the encoding works.

All we can do is identify it and squash it, or ignore it.

It's not critical either, because I can fix it after migration directly in the DB, as I have been doing today. But the best place to fix it is in the legacy mysql DB when possible; it is also doable in postgres, provided the information was not lost in the migration from mysql to postgres.
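For locating affected rows on the mysql side, matching on the raw bytes works, and it avoids pasting the problem character into the query itself, where it could get mangled yet again. A sketch; EFBFBD is the UTF-8 encoding of U+FFFD:

-- Find posts already carrying the replacement character, by byte sequence.
SELECT count(postid)
FROM post
WHERE pagetext LIKE CONCAT('%', UNHEX('EFBFBD'), '%');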

This is why I am kinda begging everyone to help test. I can write the code to fix the issues if I can clearly see them. There are one million posts. The more people take a look, the more it helps.

Sorry to be begging... LOL. I have been working on this for months. My wife is starting to feel like she has no husband, which I can understand.

But I wanted everyone to understand why I have asked for this help.

This is exactly what I need..... (image from first post on this test)

-------------------------

Honestly, so far people have provided me only a total of about 3 or 4 links where this encoding issue comes up, and most of those are in non-public spam archives.

I don't want to be spending my time chasing outliers in two decades of encoding. Either there are issues or not. I am not going to spend my entire life chasing unimportant encoding issues to try to take a migration which is 99.99% perfect to 99.9999% perfect. It's not a good use of our time.

So, please provide detailed accounts of any remaining encoding issues, with links to the original and the migrated version.

Thanks.

Here is one, but the issue is in the original DB.

Retry Logic But In Cron - UNIX for Beginners Questions & Answers - UNIX.COM Community

In the mysql DB:

So, no reason to waste time on encoding issues which are not migration issues.

Retry Logic But In Cron

This illustrates the problem: chasing errors that were already in the original DB and migrated exactly as they were posted.

This is why I need the ORIGINALS and the MIGRATED versions if anyone sees any issue.

However, if anyone knows the correct replacement for that strange stuff, I will add it to the translation.

Hi All,

As Neo says, I have been spending a bit of time on this migration integrity issue.

The irritating "Thingy" (white diamond with question mark in the middle) is officially the Unicode symbol called "Replacement character". A decoder inserts this as a placeholder for a character that it doesn't understand. IMHO, the issue here is simply that the migration script (or whatever process) SHOULD understand all the characters on our old site. Yes, we already have "Replacement characters" on the old site, which probably emanated from a long-ago upgrade from ascii to Unicode, or from Unicode version x to Unicode version y. As Neo says, replacement character symbols on our old site must be ignored because there's nothing we can do about them now apart from manually editing them out as time goes on.

However, I believe that the currently used (Discourse provided??) process is stuffed because it doesn't understand some of the perfectly correct text on our old site. It even screws up a thread title on the old site containing the replacement character symbol - look at this......

Post migration
How to grep ï¿½ symbol? - Shell Programming and Scripting - UNIX.COM Community

Pre migration
How to grep � symbol?

So the process doesn't even understand its own Unicode character set!!!!

So FWIW, I've come to the conclusion that trying to modify our old DB is futile as the process will probably find something else to screw up.

Indeed, if you follow the first link I posted on this thread further back, others are having the same issue.

That's my update thus far. I'll report back again as my investigation continues.

EDIT: Replacement character symbol is U+FFFD

What are you talking about?

The migrated version is the same as the original version.

That is exactly how it should be.

This has nothing at all to do with migration, discourse or translating encoding.

Migrating a post where the encoding has already been replaced in the original DB with a "replacement char" will keep the same "replacement char".

Indeed, are you working too hard? :slight_smile:

These example posts you just posted seem "perfect" to me. The migration is just like the original. The original has already replaced the unknown encoding (you do not know if the original encoding was UNICODE or not; that information is gone from existence in the original post and has been replaced with the "wtf" encoding symbol).

:slight_smile:

Whoa, hang on a minute........

I'm talking about the thread title. Is your migrated thread title reading correctly? If so, that's news.

My thread title shown is screwed.

Yes, the thread title in the migrated post is different because there is no migration script running against titles.

That's not a big deal, it's a total outlier.

We are not processing "thread titles" in the migration (no migration script does) because the migration scripts are not designed to process any titles at all.

All processing is for the posts only.

Let's not go chasing down outlier rabbit holes that we are not even concerned about.

It takes less time to edit an outlier like that than to discuss it, LOL.

Obviously, if a thread title has a char which is not encoded properly, then it will have issues when migrated. That is like some 0.00001 kinda outlier.

It's good you found it so we can edit it later; but it's nothing to be concerned about.

So, to be clear... the migration script does not do any processing on titles. Titles are expected to be written in "basic normal charsets" and when they are not, it is quite the outlier case.

And look at this.............

Old site
find . -name �*.*� | xargs grep "help"

New site
find . -name "*.*� | xargs grep "help" - UNIX for Dummies Questions & Answers - UNIX.COM Community

The migration has taken out the replacement character placeholders and put in double quotes which, I wouldn't mind betting, is what was originally there.

What? Anybody got a good working crystal ball?

Let me check the original DB and post back..

Hold on.