add brain

2026-03-12 15:17:52 +07:00
parent fd9f558fa1
commit e7821a7a9d
355 changed files with 93784 additions and 24 deletions
--- a/.brain/.agent/skills/database-optimization/mysql/references/replication-lag.md
+++ b/.brain/.agent/skills/database-optimization/mysql/references/replication-lag.md
@@ -0,0 +1,46 @@
+---
+title: Replication Lag Awareness
+description: Read-replica consistency pitfalls and mitigations
+tags: mysql, replication, lag, read-replicas, consistency, gtid
+---
+
+# Replication Lag
+
+MySQL replication is asynchronous by default. Reads from a replica may return stale data.
+
+## The Core Problem
+1. App writes to primary: `INSERT INTO orders ...`
+2. App immediately reads from replica: `SELECT * FROM orders WHERE id = ?`
+3. Replica hasn't applied the write yet — returns empty or stale data.
+
+## Detecting Lag
+```sql
+-- On the replica
+SHOW REPLICA STATUS\G
+-- Key field: Seconds_Behind_Source (0 = caught up, NULL = not replicating)
+```
+**Warning**: `Seconds_Behind_Source` measures relay-log lag, not true wall-clock staleness. It can underreport during long-running transactions because it only updates when transactions commit.
+
+**GTID-based lag**: for more accurate tracking, compare `@@global.gtid_executed` (replica) to primary GTID position, or use `WAIT_FOR_EXECUTED_GTID_SET()` to wait for a specific transaction.
+
+**Note**: parallel replication with `replica_parallel_type=LOGICAL_CLOCK` requires `binlog_format=ROW`. Statement-based replication (`binlog_format=STATEMENT`) is more limited for parallel apply.
+
+## Mitigation Strategies
+
+| Strategy | How | Trade-off |
+|---|---|---|
+| **Read from primary** | Route critical reads to primary after writes | Increases primary load |
+| **Sticky sessions** | Pin user to primary for N seconds after a write | Adds session affinity complexity |
+| **GTID wait** | `SELECT WAIT_FOR_EXECUTED_GTID_SET('gtid', timeout)` on replica | Adds latency equal to lag |
+| **Semi-sync replication** | Primary waits for >=1 replica ACK before committing | Higher write latency |
+
+## Common Pitfalls
+- **Large transactions cause lag spikes**: A single `INSERT ... SELECT` of 1M rows replays as one big transaction on the replica. Break into batches.
+- **DDL blocks replication**: `ALTER TABLE` with `ALGORITHM=COPY` on primary replays on replica, blocking other relay-log events during execution. `INSTANT` and `INPLACE` DDL are less blocking but still require brief metadata locks.
+- **Long queries on replica**: A slow `SELECT` on the replica can block relay-log application. Use `replica_parallel_workers` (8.0+) with `replica_parallel_type=LOGICAL_CLOCK` for parallel apply. Note: LOGICAL_CLOCK requires `binlog_format=ROW` and `slave_preserve_commit_order=ON` (or `replica_preserve_commit_order=ON`) to preserve commit order.
+- **IO thread bottlenecks**: Network latency, disk I/O, or `relay_log_space_limit` exhaustion can cause lag even when the SQL apply thread isn't saturated. Monitor `Relay_Log_Space` and connectivity.
+
+## Guidelines
+- Assume replicas are always slightly behind. Design reads accordingly.
+- Use GTID-based replication for reliable failover and lag tracking.
+- Monitor `Seconds_Behind_Source` with alerting (>5s warrants investigation).