Convert a Latin1 Database to UTF-8

Having mismatched character sets and collations can cause all kinds of weird display problems on a web site. The default character set for MySQL is latin1 and the default collation is latin1_swedish_ci. If your CMS is using utf8, weird characters can appear in the output (especially for foreign-language words with accents) and sometimes searches will fail for unknown reasons. Most experienced users recommend using utf8 for everything.

If you go into PhpMyAdmin and look at your database, you can check the character set and collation of the database, but you also need to check the individual tables and all of the text fields inside the tables because they may not all match. If you are a MODX user, the SiteCheck extra will check them all for you (and a lot of other things as well).

Important: Don't change the specified character sets and collations in the database directly. That will only make things worse because the actual data in the tables will not match the specifications you've set. You need to perform a process that will convert both the specifications and the data at the same time. That's the purpose of the cdc.php script.

I've tried converting databases from latin1 to utf8 several times over the last two years using various scripts available on the web. I had no great success and often a lot of grief. Most existing scripts do the conversion but require you to drop and restore indexes manually and often lose comments and some field characteristics (e.g. NOT NULL, UNIQUE, DEFAULT).

I finally wrote my own script to do it. The script in this package should convert from any charset/collation to any other charset/collation. By default it converts the current charset/collation to utf8/utf8_general_ci. The script drops and adds the indexes automatically and preserves the field information (including compound-key indexes).

Download cdc.zip

Full disclosure: I still consider character sets to be a black art, and don't can't claim to fully understand them. This script has always worked for me and for a number of others. YMMV.

 

Upgrade note: This script was updated in December of 2014. It should no longer fail on indexes with a size setting smaller than the field size. It was tested on the database for MODX 2.3.2 and now handles the new uri index properly and the compound index on the category object. It also now skips view tables.

 

MODX users: This script is meant to run outside of MODX and is not recommended for use in a MODX snippet or plugin. Clear the site cache before starting. Before Exporting your database go into the Manager and empty the error log and the manager action log (not strictly necessary, but it will make it easier on the server).

Other CMS users: Do not run this script inside your CMS — run it as a standalone script. There is nothing MODX-specific about this script and it should convert your database, but I haven't tested it with anything but MODX.

 

Important Security Note: Be careful where you put the cdc.config.php file. It contains the credentials for logging in to your database. It's not a bad idea to delete it when you're finished converting your database.

 

Quick and Daring Method

If you completely trust this script (not recommended), this method will give you a fast conversion to UTF-8 for your database. I would still back up the database first.

  1. Back up your database by selecting Export in PhpMyAdmin and save to a file.
  2. Be sure to select the "Custom - display all possible options" radio option and check the "Add DROP TABLE / VIEW / PROCEDURE / FUNCTION / EVENT statement" checkbox under "Object creation options."
  3. Create a file called cdc.config.php based on the config.cdc.sample.php file (in the same directory).
  4. Set the appropriate variables in the config file.
  5. Make sure the database is not in use during the rest of this process.
  6. Execute the script in a code editor, in a browser, or from the command line.
  7. If you use a browser, set $convertNewlines = true; in the config file.
  8. In cPanel (or the equivalent) go into PhpMyAdmin and select your database.
  9. Paste the MySQL code into the SQL window of PhpMyAdmin and click on the "Go" button
  10. Be patient, it takes a while to run (go out to lunch).

 

Whichever method you use, when backing up your original database, be sure to select the "Drop Table" option as described above! If you forget and have to restore your original database, you'll have to manually drop all tables in the database before Importing your backup.

 

Safe Method (recommended)

This method will allow you to check the conversion before using it.

  1. In cPanel (or the equivalent) go into PhpMyAdmin and select your database.
  2. Using PhpMyAdmin, "Export" the whole database to a local file. Call the file original-latin1.sql. Note that this file will serve as a backup of the database in case anything goes wrong. Be careful not to overwrite or alter it!
  3. Be sure to select the "Custom - display all possible options" radio option and check the "Add DROP TABLE / VIEW / PROCEDURE / FUNCTION / EVENT statement" checkbox under "Object creation options."
  4. Important: Be sure the DB is not in use for the rest of this process.
  5. On the "Operations" tab of PhpMyAdmin, copy the database to a new database with a new name. Use no hyphens in the DB name.
  6. Export this new database (as in step 2) to a file called before.sql
  7. (optional) using a diff program such as WinMerge, compare the original-latin1.sql file with the before.sql file. Except for the timestamp at the top, they should be identical.
  8. Create a file called cdc.config.php based on the cdc.config.sample.php file (in the same directory).
  9. Edit the variables in the config file to match your database and desired charset/collation.
  10. If you will be executing the script in a browser, set $convertNewlines = true; in the config file.
  11. Run cdc.php, either in an editor, browser, or from the command line (php cdc.php). Do *not* run it from within a CMS that uses the database. To make it easier to cut and paste, you may want to set $showHeaders = false; in the config file.
  12. Copy just the SQL query code to the clipboard. Make sure you get all of it and copy only the lines that start with ALTER TABLE.
  13. InPhpMyAdmin, select the new database you created above. Click on the SQL tab at the top.
  14. Paste in the SQL query code and click on "Go".
  15. Be patient, it takes a while to run. You may or may not get a success message when it's finished. Sometimes you just see the SQL form go blank.
  16. Export the database (as in step 2) to a local SQL file called after.sql
  17. Use a diff program like WinMerge to compare the original-latin1.sql file and the after.sql file. (See the notes below.)
  18. If everything looks good, put the new database into use. In a CMS, edit your CMS' configuration to use the new charset and the new database (in MODX, modify the config.inc.php file).
  19. Use your database and check for anomalous characters.
  20. If there are problems you can't fix, switch back to the original database

 

If your database is very large, the script may not complete. You can tell that has happened by looking at the structure of the database and the tables in PhpMyAdmin. You'll see some of the data as [BLOB ...] instead of the actual data. If this happens, start over by restoring the backup and paste each section of the generated SQL and execute it (the sections are separated by comments with a row of asterisks). As long as sections are run in the order that they appear in the SQL, you'll still get the complete conversion.

 

Notes

When you compare the original-latin1.sql and the after.sql files, you should see differences in the date and time line at the top and in the charset specifications. You may see that the order of some of the indexes has changed. This is normal and unavoidable. As long as they're all there, it's not a problem. It happens because the script has to drop some indexes and add them back. When they're added back, they may go at the end of the indexes. The actual fields in the tables should not have changed.

Remaining Problems

The script will safely change the character encoding and collation of your database, but you may still have problems with strange characters in the database.

It's fairly common to have some extraneous characters in your database before and after conversion. Odd characters have a way of creeping into any MySQL database. The script will solve problems directly related to encoding issues, but you may have other problems as well. The script won't delete oddball characters in the fields.

Pasting text from Word or other text-editing programs may give you some unusual apostrophes and quote marks (both single and double). Word's "curly" quotes (aka "smart quotes") are an example. Usually, you can leave these alone since most modern browsers display them correctly. Generally speaking, you shouldn't edit anything unless it causes trouble in a displayed page.

If you have serious problems with characters that do show up in the display afer conversion, you may need more drastic measures to fix them.

One method is to edit the Exported SQL file from the new database with a good editor (sometimes with a hex editor). Then create a new database with the proper encoding and collation and Import the SQL file to it. Don't edit anything unless you're sure you know it's wrong and what it should be. Don't edit anything but readable text.

You may see odd characters as garbage at the end of some database fields. You can leave them alone.

Sometimes you'll see extraneous characters that are not quotes or apostrophes — most often, these are non-breaking spaces. They look like a capital A with a hat (^). Usually, these can be left alone. Be careful if you delete or replace them because they are a two-character sequence (C2 A0) and if you only delete one of them, the display will break at that point. Sometimes, in a regular editor, you can press delete twice to get rid of them and type a space. Sometimes, if you want them to be spaces, you'll need to use a hex editor and replace the C2 A0 combination with 20 (The space character).

With other odd characters, if you're sure you know what the text should be, you can just delete a section of text (delete the odd characters and one or two characters on each side of the odd ones to be sure) in a regular editor and retype it.

Note that in an SQL file, an apostrophe is represented by two single quotes (e.g. can''t, don''t, won''t). Don't alter these or your Import will crash.

There are also some bits of code on the web for solving various kinds of character problems in a database. mysql strange characters and mysql invisible characters are good search phrases.

Derek Sivers provides some good information here about using fairly exotic SQL queries to fix some of the problems (you can skip the first part of his article because the cdc.php script will have done that part for you).

Preventing future trouble

CDC will convert the database itself, but to prevent trouble in the future when new tables are added, it's a good idea to manage the defaults in MySQL. To see what yours are, follow the steps below:

  • Go into PhpMyAdmin
  • Select the Database
  • Click on the SQL tab
  • Enter this code exactly
  • SHOW VARIABLES LIKE  'char%';
    
  • Click on the "Go" button

The filesystem setting should be binary, but otherwise it's best if the variables all match (though a different setting for character_set_server is usually not a problem). Methods for setting the variables vary by platform and host. Consult Google, your host, or the host's help forum to see how to change them for your server.

Final Notes for MODX Users

After Importing the converted database, edit the manager/includes/config.inc.php file, or for MODX Revolution, core/config/config.inc.php. Change the charset to utf8 and change the name of the database to the new one. Important: Be sure to clear the site cache and your browser cache and cookies before testing the new site.

It's not a bad idea to do an upgrade install (be sure to uncheck all the options for installing snippets and the sample site in MODX Evolution).

 

My book, MODX: The Official Guide - Digital Edition is now available here. The paper version of the book is available from Amazon.

If you have the book and would like to download the code, you can find it here.

If you have the book and would like to see the updates and corrections page, you can find it here.

MODX: The Official Guide is 772 pages long and goes far beyond this web site in explaining beginning and advanced MODX techniques. It includes detailed information on:

  • Installing MODX
  • How MODX Works
  • Working with MODX resources and Elements
  • Using Git with MODX
  • Using common MODX add-on components like SPForm, Login, getResources, and FormIt
  • MODX security Permissions
  • Customizing the MODX Manager
  • Using Form Customization
  • Creating Transport Packages
  • MODX and xPDO object methods
  • MODX System Events
  • Using PHP with MODX

Go here for more information about the book.

Thank you for visiting BobsGuides.com

  —  Bob Ray