Files
CleanArchitecture-template/.brain/.agent/skills/database-optimization/mysql/references/character-sets.md
2026-03-12 15:17:52 +07:00

2.3 KiB
Raw Blame History

title, description, tags
title description tags
Character Sets and Collations Charset config guide mysql, character-sets, utf8mb4, collation, encoding

Character Sets and Collations

Always Use utf8mb4

MySQL's utf8 = utf8mb3 (3-byte only, no emoji/many CJK). Always utf8mb4.

CREATE DATABASE myapp DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

Collation Quick Reference

Collation Behavior Use for
utf8mb4_0900_ai_ci Case-insensitive, accent-insensitive Default
utf8mb4_0900_as_cs Case/accent sensitive Exact matching
utf8mb4_bin Byte-by-byte comparison Tokens, hashes

_0900_ = Unicode 9.0 (preferred over older _unicode_ variants).

Collation Behavior

Collations affect string comparisons, sorting (ORDER BY), and pattern matching (LIKE):

  • Case-insensitive (_ci): 'A' = 'a' evaluates to true, LIKE 'a%' matches 'Apple'
  • Case-sensitive (_cs): 'A' = 'a' evaluates to false, LIKE 'a%' matches only lowercase
  • Accent-insensitive (_ai): 'e' = 'é' evaluates to true
  • Accent-sensitive (_as): 'e' = 'é' evaluates to false
  • Binary (_bin): strict byte-by-byte comparison (most restrictive)

You can override collation per query:

SELECT * FROM users
WHERE name COLLATE utf8mb4_0900_as_cs = 'José';

Migrating from utf8/utf8mb3

-- Find columns still using utf8
SELECT table_name, column_name FROM information_schema.columns
WHERE table_schema = 'mydb' AND character_set_name = 'utf8';
-- Convert
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

Warning: index key length limits depend on InnoDB row format:

  • DYNAMIC/COMPRESSED: 3072 bytes max (≈768 chars with utf8mb4)
  • REDUNDANT/COMPACT: 767 bytes max (≈191 chars with utf8mb4)

VARCHAR(255) with utf8mb4 = up to 1020 bytes (4×255). That's safe for DYNAMIC/COMPRESSED but exceeds REDUNDANT/COMPACT limits.

Connection

Ensure client uses utf8mb4: SET NAMES utf8mb4; (most modern drivers default to this).

SET NAMES utf8mb4 sets three session variables:

  • character_set_client (encoding for statements sent to server)
  • character_set_connection (encoding for statement processing)
  • character_set_results (encoding for results sent to client)

It also sets collation_connection to the default collation for utf8mb4.