PDA

查看完整版本 : MySQL and UTF-8


Stanley
January 28th, 2008, 01:55 AM
MySQL and UTF-8


Notes

Good support from 4.1

utf-8 is utf8 in MySQL.

A collation defines the sort order for the data, it may be case sensitive or not
To find out your current setup:
SHOW VARIABLES LIKE 'character_set_database';
SHOW VARIABLES LIKE 'character_set_client';
To see available character sets and collations on your database:

SHOW CHARACTER SET;
SHOW COLLATION LIKE 'utf8%';
Character set and collation can be set per server, database, table, connection;
Server (/etc/my.cnf):

[mysqld]
...
default-character-set=utf8
default-collation=utf8_general_ci
Database:

(CREATE | ALTER) DATABASE ... DEFAULT CHARACTER SET utf8Table:

(CREATE | ALTER) TABLE ... DEFAULT CHARACTER SET utf8Connection:

SET NAMES 'utf8';
A PHP mysql connection (not totally confirmed, but see tests below) defaults to a latin1 connection, so, your first query after connection should be:

mysql_query("SET NAMES 'utf8'");

The CONVERT() function can convert between charsets, eg:

INSERT INTO utf8table (utf8column)
SELECT CONVERT(latin1field USING utf8)
FROM latin1table;
As mentioned in charsets (http://www.phpwact.org/php/118n/charsets), field widths may need to be increased to deal with multi-byte characters.
Code to generate a mass change of collations:

<?php

// this script will output the queries need to change all fields/tables to a different collation
// it is HIGHLY suggested you take a MySQL dump prior to running any of the generated
// this code is provided as is and without any warranty

die("Make a backup of your MySQL database then remove this line");

set_time_limit(0);

// collation you want to change:
$convert_from = 'latin1_swedish_ci';

// collation you want to change it to:
$convert_to = 'utf8_general_ci';

// character set of new collation:
$character_set= 'utf8';

$show_alter_table = true;
$show_alter_field = true;

// DB login information
$username = 'user';
$password = 'pass';
$database = 'table';
$host = 'localhost';

mysql_connect($host, $username, $password);
mysql_select_db($database);

$rs_tables = mysql_query(" SHOW TABLES ") or die(mysql_error());

print '<pre>';
while ($row_tables = mysql_fetch_row($rs_tables)) {
$table = mysql_real_escape_string($row_tables[0]);

// Alter table collation
// ALTER TABLE `account` DEFAULT CHARACTER SET utf8
if ($show_alter_table) {
echo("ALTER TABLE `$table` DEFAULT CHARACTER SET $character_set;\r\n");
}

$rs = mysql_query(" SHOW FULL FIELDS FROM `$table` ") or die(mysql_error());
while ($row=mysql_fetch_assoc($rs)) {

if ($row['Collation']!=$convert_from)
continue;

// Is the field allowed to be null?
if ($row['Null']=='YES') {
$nullable = ' NULL ';
} else {
$nullable = ' NOT NULL';
}

// Does the field default to null, a string, or nothing?
if ($row['Default']==NULL) {
$default = " DEFAULT NULL";
} else if ($row['Default']!='') {
$default = " DEFAULT '".mysql_real_escape_string($row['Default'])."'";
} else {
$default = '';
}

// Alter field collation:
// ALTER TABLE `account` CHANGE `email` `email` VARCHAR( 50 ) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL
if ($show_alter_field) {
$field = mysql_real_escape_string($row['Field']);
echo "ALTER TABLE `$table` CHANGE `$field` `$field` $row[Type] CHARACTER SET $character_set COLLATE $convert_to $nullable $default; \r\n";
}
}
}

?>

MySQL tables, columns and connections with UTF-8

What happens when you INSERT a UTF-8 string into a MySQL database using PHP‘s MySQL extension (http://php.net/mysql)? Well that depends on what string you use. If you use something like “Iñtërnâtiônàlizætiøn” then doing just about anything in MySQL will be data-safe (but, of course, the collation will be incorrect). If you have characters that don’t encode the same in UTF-8 and latin1 (e.g. text in Chinese, Russian, ...) then the behaviour depends on both the table definition and the encoding of the connection to the database.


http://www.newstanley.com/bbs/attachment.php?attachmentid=11&d=1201503978

* It seems that MySQL will look at the data as being a series of bytes all within the latin1 codepage. If you use the same code to read and write the data then it will round-trip fine (but MySQL’s collation will obviously be wrong). If you use tools with a knowledge of the data types used by MySQL, like phpMyAdmin (http://phpmyadmin.sf.net/) or even mysql_dump, then the wrong encoding is obvious.
(this test was performed with PHP versions 4.4.2 and 5.1.2 and MySQL 5.0.20; PHP source for test (http://pastebin.com/751239))