MySQL Latin1 and UTF8 issues

by cw Email

In working with Bitweaver, I have come across a nasty side affect of using MySQL- it's handling of encoding is sucky.

I have posted into the Bitweaver Wiki about this, and am sure that it applies to many similar softwares ( I did a quick search and found Moodle and EZ-Publish forum postings with people fighting this, along with numerous other blogging softwares).

In a nutshell, the default MySQL encoding settings are Latin1. Most modern software needs UTF8. This stuffs up the saving of characters,into the database, that are unusual.

UTF8 is huge, literally hundreds of unique characters. If you need to convert the Latin1 characters in your MySQL DB into UTF8, then you will need to understand what is going on and how to fix it.

To understand the character conversion craziness, I recommend you create a new page/blog post/ whatever in your software, leaving it in draft mode, and inserting all the weird characters that your audience is likely to use, in a SINGLE COLUMN.At the same time, do it within a spreadsheet column. Use a character map for this, or if you are unsure, check this page.

Now save the page. And check it in your PHPMyAdmin(I cannot tell you how to do this- it varies hugely). You should have a single column of garbage characters. Congratulations! Copy/paste this into your next spreadsheet column.

Now you have a spreadsheet table detailing the conversions you need to do - first column= character in UTF8 that you want, second= Latin1/UTF8 hybrid mangle that fills your database and needs replacement.

Read the wiki link above, apply those settings to your software if you can. Fix the cause.

Now to clean your database. Use the Export feature of PHPMyAdmin, export with DROP Tables. The resulting SQL file is the code required to rebuild your database.

Examine the following shell script (unix/macos natively; windows users need to use their web host's command line, unless somebody wants to write a BATch file):

#!/bin/bash
echo I\'ll work on "$@" ...
for i in "$@" ; do
echo -n $i...
sed \
-e 's/=latin1/=utf8/g' \
-e 's/é/é/g' -e 's/è/è/g' -e 's/ê/ê/g' -e 's/ë/ë/g' \
-e 's/ / /g' -e 's/«/«/g' -e 's/»/»/g' -e 's/°/°/g' \
-e 's/ä/ä/g' -e 's/â/â/g' -e 's/î/î/g' -e 's/Â/’/g' \
-e 's/ï/ï/g' -e 's/ì/ì/g' -e 's/ò/ò/g' -e 's/ô/ô/g' \
-e 's/ö/ö/g' -e 's/ÿ/ÿ/g' -e 's/ù/ù/g' -e 's/ü/ü/g' \
-e 's/û/û/g' -e 's/ç/ç/g' -e 's/É/É/g' -e 's/È/È/g' \
-e 's/Ê/Ê/g' -e 's/Ë/Ë/g' -e 's/À/À/g' -e 's/Ä/Ä/g' \
-e 's/Â/Â/g' -e 's/Þ/Î/g' -e 's/Ï/Ï/g' -e 's/ÃŒ/Ì/g' \
-e 's/Ã’/Ò/g' -e 's/Ô/Ô/g' -e 's/Ö/Ö/g' -e 's/Ÿ/Ÿ/g' \
-e 's/…/…/g' -e 's/’/’/g' -e 's/àƒâ‚¬/ä/g' -e 's/‘//g' \
-e 's/“/“/g' -e 's/”/”/g' -e 's/‹/‹/g' -e 's/›/›/g' \
-e 's/—/—/g' -e 's/–/—/g' -e 's/â€?/”/g' -e 's/àƒ’/à/g' \
-e 's/â’€’™/’/g' -e 's/â’€’œ/“/g' -e 's/â’€?/”/g' -e 's/àƒ’©/é/g' \
-e 's/â’€’¦//g' -e 's/–/\•/g' -e 's/€/€/g' \
-e 's/Ù/Ù/g' -e 's/Ü/Ü/g' -e 's/Û/Û/g' -e 's/Ç/Ç/g' $i|
sed -e 's/Ã/à/g' > $i.clean

echo done
done


This shell script uses SED to open your exported(and thrice backed up) SQL file, fix it, then save it with .clean at the end of the file name. The code is highly customizable, and will need to be, if you have another character set giving you grief. That above is mostly Western European and abbreviated (the Cleanup script below is "Bigger, Better, Stronger!").

Now to use it, save that code as "cleanup.sh", then do chmod +x cleanup.sh, which makes it executable. Then: sh cleanup.sh olddb.sql (where olddb.sql is the saved name of your exported SQL file from PHPMyAdmin).

Check the file in your text editor (needs to be UTF8 aware), and search for weird characters like  and Ã. If they are still there, then there is some more characters to add to your script.

When you are finished, create the new database in PHPMyadmin, with Collation set to UTF8_general_ci. Then use the Import feature to lift your new clean SQL file upstream.

Don't forget to change your program's settings if you've renamed your database.

For PHP/MySQL apps you will want to send the query "SET NAMES UTF8" to the database JUST ONCE, immediately after first connection, so you will need to find the right place in your code. HINT: find mysql_connect.

eg: $dbconnect=@mysql_connect($host, $user ,$password);
$db=mysql_select_db($dbname);
$setnamesutf8 = mysql_query("SET NAMES UTF8");
...etc.

========================
Resources: (requires Unix/Linux/BSD and OpenOffice Calc)

Cleanup script

UTF8 Charset
Replacement Table
Sed for Windows
========================
Thanks to schwarzvogel.de and Tommie-Lie.

Trackback address for this post

This is a captcha-picture. It is used to prevent mass-access by robots.
Please enter the characters from the image above. (case insensitive)

1 comment

Comment from: Oliver [Visitor] · http://digitalarchitekt.de
*****
This script saved me hours I guess!

two things were missing for my case:

-e 's/ß/ß/g' -e 's/„/„/g'

and sed is complaining about missing whitespace between regex and backslash at line 6 and line 20

many thanks
Oliver
12/11/08 @ 10:24

Leave a comment


Your email address will not be revealed on this site.

Your URL will be displayed.
PoorExcellent
(Line breaks become <br />)
(Name, email & website)
(Allow users to contact you through a message form (your email will not be revealed.)
This is a captcha-picture. It is used to prevent mass-access by robots.
Please enter the characters from the image above. (case insensitive)

Details

CombatWombat Speaks is a personal blog about Computing, Faith, and Life.
It is deliberately anonymous, as it doesn't really matter who I am.
What matters is the quality of the information within, and how
relevant it is to you.

There is a heavy emphasis here on technical computing, especially Linux, Ubuntu, Mint, Microsoft, VirtualBox, Networking, and Usability.

Please drop a comment in if you would like to have more details, or wish to learn about something specifically.

Close Details
CombatWombat speaks
2 Corinthians 10:4