WP is running in a mix of semi-Unicode and ASCII/Latin modes. As a result, people whose languages need characters outside the Latin set, and therefore need UTF-8, are having major problems. The issue isn't easily detectable, since storing UTF-8 data in, and retrieving it from, an SQL database with a latin character set seems to work. Unfortunately, it doesn't really work. WP can store UTF-8 data in a database/table/field with a latin character set, but all SQL-based text functions return wrong values.
For example: SORTING, COMPARING, or MANIPULATING any string returns invalid data (not sorted properly, etc). It's about time WP started using UTF-8 everywhere.
The change to UTF-8 isn't simple. Some people think they can just "ALTER TABLE" to the UTF-8 charset, then use "SET NAMES utf8", and they'll be fine. WRONG!
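To make the "wrong values" concrete, here is a rough Python sketch (not WP or MySQL code) of what a latin column effectively sees when it holds UTF-8 bytes. The sample word is my own, and Python's "latin-1" codec only approximates MySQL's latin1, so treat it as an illustration of the idea rather than exact server behaviour:

```python
# A latin column treats every stored byte as one character.
word = "école"
stored_bytes = word.encode("utf-8")         # the bytes WP actually writes
as_latin1 = stored_bytes.decode("latin-1")  # how a latin column interprets them

true_length = len(word)          # number of real characters
latin1_length = len(as_latin1)   # one "character" per byte, so one too many

# Upper-casing in the latin view never touches the two bytes of "é".
true_upper = word.upper()
latin1_upper = as_latin1.upper().encode("latin-1").decode("utf-8")

print(true_length, latin1_length)
print(true_upper, latin1_upper)
```

The same mismatch is why length, case and comparison functions on the SQL side give answers that look almost right but aren't.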
For a new installation, it's rather easy:
1) All database and table definitions must be set to UTF-8. Some examples:
CREATE DATABASE wordpress DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
CREATE TABLE wp_users (etc...) DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
2) Modify the WP database connection to execute the following:
SET NAMES utf8;
SET COLLATION_CONNECTION=utf8_general_ci;
That's about it; a new installation can easily run with full UTF-8 support without any more changes.
Now, how about upgrading from an existing database? That's more complex. Read this carefully:
When doing an ALTER TABLE to change the character set, all TEXT (and similar) fields are converted to UTF-8. The conversion BREAKS existing text, because it expects the data to be in Latin, but it is not: WP has stored UTF-8 bytes in a latin database. As a result, we get garbage after the conversion!
The solution is to ALTER all TEXT and related fields to BLOB, then alter the character set, and finally change the BLOB fields back to TEXT.
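A small Python sketch of why the direct conversion produces garbage while the BLOB detour does not. This only models the byte handling; the sample character is my own, and the "latin-1" codec stands in for MySQL's latin character set as an approximation:

```python
original = "é"
stored_bytes = original.encode("utf-8")  # what WP wrote into the latin column

# Direct conversion: the server believes the bytes are Latin text
# and re-encodes each one as UTF-8.
converted_bytes = stored_bytes.decode("latin-1").encode("utf-8")
after_direct_conversion = converted_bytes.decode("utf-8")

# BLOB detour: the bytes are left alone and only relabelled as UTF-8.
after_blob_detour = stored_bytes.decode("utf-8")

print(repr(after_direct_conversion))
print(repr(after_blob_detour))
```

In the first path every non-ASCII character ends up encoded twice; in the second the bytes are never touched, so the text comes back intact.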
Example steps:
1) ALTER TABLE users MODIFY Last_Name BLOB;
2) ALTER DATABASE wordpress charset=utf8;
3) ALTER TABLE users charset=utf8;
4) ALTER TABLE users MODIFY Last_Name TEXT CHARACTER SET utf8;
So, we change our text fields to BLOB, switch our database and tables to UTF-8, and finally, in one go, turn the BLOB fields back into TEXT and switch them to UTF-8.
The key here is that a BLOB field will not be converted to garbage when switched to UTF-8, unlike a TEXT field.
Hopefully, the developers of WP will be able to create a conversion script to upgrade old latin databases.
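As a starting point for such a script, the four steps can be generated mechanically. The Python helper below is purely hypothetical (the function name, its arguments and the type map are my own, not anything in WP); it only builds the statements and does not talk to a database:

```python
# Hypothetical helper, not part of WP: builds the BLOB round-trip statements.
# Maps each text type to the binary type of the same size.
BINARY_TYPE = {
    "TINYTEXT": "TINYBLOB",
    "TEXT": "BLOB",
    "MEDIUMTEXT": "MEDIUMBLOB",
    "LONGTEXT": "LONGBLOB",
}

def build_conversion_sql(database, table, columns):
    """columns maps a column name to its original text type, e.g. {"Last_Name": "TEXT"}."""
    statements = []
    # Step 1: turn every text column into the matching binary type.
    for name, text_type in columns.items():
        binary_type = BINARY_TYPE[text_type.upper()]
        statements.append(f"ALTER TABLE {table} MODIFY {name} {binary_type};")
    # Steps 2 and 3: switch the database and the table to utf8.
    statements.append(f"ALTER DATABASE {database} charset=utf8;")
    statements.append(f"ALTER TABLE {table} charset=utf8;")
    # Step 4: turn the binary columns back into text, now labelled utf8.
    for name, text_type in columns.items():
        statements.append(
            f"ALTER TABLE {table} MODIFY {name} {text_type.upper()} CHARACTER SET utf8;"
        )
    return statements

for statement in build_conversion_sql("wordpress", "users", {"Last_Name": "TEXT"}):
    print(statement)
```

One caveat for anyone building on this: MODIFY replaces the whole column definition, so attributes such as NOT NULL or a DEFAULT would have to be restated in the generated statements. The sketch ignores that, and VARCHAR/CHAR columns as well.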
Some of the related tickets: #2828, #2942, #3184