add brain

2026-03-12 15:17:52 +07:00
parent fd9f558fa1
commit e7821a7a9d
355 changed files with 93784 additions and 24 deletions


@@ -0,0 +1,476 @@
# database-designer reference
## Database Design Principles
### Normalization Forms
#### First Normal Form (1NF)
- **Atomic Values**: Each column contains indivisible values
- **Unique Column Names**: No duplicate column names within a table
- **Uniform Data Types**: Each column contains the same type of data
- **Row Uniqueness**: No duplicate rows in the table
**Example Violation:**
```sql
-- BAD: Multiple phone numbers in one column
CREATE TABLE contacts (
    id     INT PRIMARY KEY,
    name   VARCHAR(100),
    phones VARCHAR(200)  -- "123-456-7890, 098-765-4321"
);

-- GOOD: Separate table for phone numbers
CREATE TABLE contacts (
    id   INT PRIMARY KEY,
    name VARCHAR(100)
);
CREATE TABLE contact_phones (
    id           INT PRIMARY KEY,
    contact_id   INT REFERENCES contacts(id),
    phone_number VARCHAR(20),
    phone_type   VARCHAR(10)
);
```
#### Second Normal Form (2NF)
- **1NF Compliance**: Must satisfy First Normal Form
- **Full Functional Dependency**: Non-key attributes depend on the entire primary key
- **Partial Dependency Elimination**: Remove attributes that depend on part of a composite key
**Example Violation:**
```sql
-- BAD: Student course table with partial dependencies
CREATE TABLE student_courses (
    student_id   INT,
    course_id    INT,
    student_name VARCHAR(100), -- Depends only on student_id
    course_name  VARCHAR(100), -- Depends only on course_id
    grade        CHAR(1),
    PRIMARY KEY (student_id, course_id)
);

-- GOOD: Separate tables eliminate partial dependencies
CREATE TABLE students (
    id   INT PRIMARY KEY,
    name VARCHAR(100)
);
CREATE TABLE courses (
    id   INT PRIMARY KEY,
    name VARCHAR(100)
);
CREATE TABLE enrollments (
    student_id INT REFERENCES students(id),
    course_id  INT REFERENCES courses(id),
    grade      CHAR(1),
    PRIMARY KEY (student_id, course_id)
);
```
#### Third Normal Form (3NF)
- **2NF Compliance**: Must satisfy Second Normal Form
- **Transitive Dependency Elimination**: Non-key attributes should not depend on other non-key attributes
- **Direct Dependency**: Non-key attributes depend directly on the primary key
**Example Violation:**
```sql
-- BAD: Employee table with transitive dependency
CREATE TABLE employees (
    id                INT PRIMARY KEY,
    name              VARCHAR(100),
    department_id     INT,
    department_name   VARCHAR(100),  -- Depends on department_id, not employee id
    department_budget DECIMAL(10,2)  -- Transitive dependency
);

-- GOOD: Separate department information
CREATE TABLE departments (
    id     INT PRIMARY KEY,
    name   VARCHAR(100),
    budget DECIMAL(10,2)
);
CREATE TABLE employees (
    id            INT PRIMARY KEY,
    name          VARCHAR(100),
    department_id INT REFERENCES departments(id)
);
```
#### Boyce-Codd Normal Form (BCNF)
- **3NF Compliance**: Must satisfy Third Normal Form
- **Determinant Key Rule**: Every determinant must be a candidate key
- **Stricter 3NF**: Handles anomalies not covered by 3NF
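The classic BCNF case is easiest to see in a concrete schema. A minimal sketch using Python's built-in sqlite3 module (table and column names here are hypothetical, not from this document): an enrollment table keyed on (student_id, course_id) where each instructor teaches exactly one course hides the dependency instructor_id → course_id, making instructor_id a determinant that is not a candidate key. The decomposition below removes it.

```python
import sqlite3

# Decomposed schema: the instructor_id -> course_id dependency now lives
# in its own table, where instructor_id IS the key.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE instructor_courses (
        instructor_id INTEGER PRIMARY KEY,  -- each instructor teaches one course
        course_id     INTEGER NOT NULL
    );
    CREATE TABLE enrollments (
        student_id    INTEGER NOT NULL,
        instructor_id INTEGER NOT NULL REFERENCES instructor_courses(instructor_id),
        PRIMARY KEY (student_id, instructor_id)
    );
""")
con.execute("INSERT INTO instructor_courses VALUES (1, 101)")
con.execute("INSERT INTO enrollments VALUES (42, 1)")

# The course is recovered by joining through the instructor
course = con.execute("""
    SELECT ic.course_id FROM enrollments e
    JOIN instructor_courses ic USING (instructor_id)
    WHERE e.student_id = 42
""").fetchone()[0]
print(course)  # 101
```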
### Denormalization Strategies
#### When to Denormalize
1. **Read-Heavy Workloads**: High query frequency with acceptable write trade-offs
2. **Performance Bottlenecks**: Join operations causing significant latency
3. **Aggregation Needs**: Frequent calculation of derived values
4. **Caching Requirements**: Pre-computed results for common queries
#### Common Denormalization Patterns
**Redundant Storage**
```sql
-- Store calculated values to avoid expensive joins
CREATE TABLE orders (
    id            INT PRIMARY KEY,
    customer_id   INT REFERENCES customers(id),
    customer_name VARCHAR(100),  -- Denormalized from customers table
    order_total   DECIMAL(10,2), -- Denormalized calculation
    created_at    TIMESTAMP
);
```
**Materialized Aggregates**
```sql
-- Pre-computed summary tables
CREATE TABLE customer_statistics (
    customer_id     INT PRIMARY KEY,
    total_orders    INT,
    lifetime_value  DECIMAL(12,2),
    last_order_date DATE,
    updated_at      TIMESTAMP
);
```
## Index Optimization Strategies
### B-Tree Indexes
- **Default Choice**: Best for range queries, sorting, and equality matches
- **Column Order**: Most selective columns first for composite indexes
- **Prefix Matching**: Supports leading column subset queries
- **Maintenance Cost**: Balanced tree structure with logarithmic operations
### Hash Indexes
- **Equality Queries**: Optimal for exact match lookups
- **Memory Efficiency**: Constant-time access for single-value queries
- **Range Limitations**: Cannot support range or partial matches
- **Use Cases**: Primary keys, unique constraints, cache keys
### Composite Indexes
```sql
-- Query pattern determines optimal column order
-- Query: WHERE status = 'active' AND created_date > '2023-01-01' ORDER BY priority DESC
CREATE INDEX idx_task_status_date_priority
ON tasks (status, created_date, priority DESC);
-- Query: WHERE user_id = 123 AND category IN ('A', 'B') AND date_field BETWEEN '...' AND '...'
CREATE INDEX idx_user_category_date
ON user_activities (user_id, category, date_field);
```
### Covering Indexes
```sql
-- Include additional columns to avoid table lookups
CREATE INDEX idx_user_email_covering
ON users (email)
INCLUDE (first_name, last_name, status);
-- Query can be satisfied entirely from the index
-- SELECT first_name, last_name, status FROM users WHERE email = 'user@example.com';
```
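As a quick check that covering actually happens, here is a sketch with Python's built-in sqlite3 module. SQLite has no INCLUDE clause, so a composite index plays the covering role; names mirror the example above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (email TEXT, first_name TEXT, last_name TEXT, status TEXT)")
# No INCLUDE in SQLite: put the extra columns in the index key instead
con.execute("CREATE INDEX idx_users_email_covering "
            "ON users (email, first_name, last_name, status)")

# The planner reports a COVERING INDEX when no table lookup is needed
detail = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT first_name, last_name, status FROM users WHERE email = 'user@example.com'"
).fetchone()[3]
print(detail)  # e.g. SEARCH users USING COVERING INDEX idx_users_email_covering (email=?)
```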
### Partial Indexes
```sql
-- Index only the relevant subset of data
CREATE INDEX idx_active_users_email
ON users (email)
WHERE status = 'active';
-- Index for recent orders only; PostgreSQL requires an immutable
-- predicate, so use a literal cutoff and rebuild it periodically
CREATE INDEX idx_recent_orders_customer
ON orders (customer_id, created_at)
WHERE created_at > DATE '2024-01-01';
```
## Query Analysis & Optimization
### Query Patterns Recognition
1. **Equality Filters**: Single-column B-tree indexes
2. **Range Queries**: B-tree with proper column ordering
3. **Text Search**: Full-text indexes or trigram indexes
4. **Join Operations**: Foreign key indexes on both sides
5. **Sorting Requirements**: Indexes matching ORDER BY clauses
### Index Selection Algorithm
```
1. Identify WHERE clause columns
2. Determine most selective columns first
3. Consider JOIN conditions
4. Include ORDER BY columns if possible
5. Evaluate covering index opportunities
6. Check for existing overlapping indexes
```
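Steps 1-4 above can be sketched as a small Python helper. This is a simplification under stated assumptions: distinct counts are precomputed, equality columns lead (most selective first), and at most one range column trails, since a B-tree stops using the index after the first range predicate.

```python
def order_index_columns(equality_cols, range_cols, distinct_counts, row_count):
    """Order candidate index columns: equality predicates first, most
    selective leading, then at most one trailing range column.
    distinct_counts maps column name -> estimated distinct values."""
    selectivity = lambda col: distinct_counts[col] / row_count
    ordered = sorted(equality_cols, key=selectivity, reverse=True)
    return ordered + range_cols[:1]  # B-tree uses only one trailing range column

cols = order_index_columns(
    equality_cols=["status", "city"],
    range_cols=["age"],
    distinct_counts={"status": 3, "city": 100},
    row_count=10_000,
)
print(cols)  # ['city', 'status', 'age']
```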
## Data Modeling Patterns
### Star Schema (Data Warehousing)
```sql
-- Central fact table
CREATE TABLE sales_facts (
    sale_id      BIGINT PRIMARY KEY,
    product_id   INT REFERENCES products(id),
    customer_id  INT REFERENCES customers(id),
    date_id      INT REFERENCES date_dimension(id),
    store_id     INT REFERENCES stores(id),
    quantity     INT,
    unit_price   DECIMAL(8,2),
    total_amount DECIMAL(10,2)
);

-- Dimension tables
CREATE TABLE date_dimension (
    id          INT PRIMARY KEY,
    date_value  DATE,
    year        INT,
    quarter     INT,
    month       INT,
    day_of_week INT,
    is_weekend  BOOLEAN
);
```
### Snowflake Schema
```sql
-- Normalized dimension tables
CREATE TABLE products (
    id          INT PRIMARY KEY,
    name        VARCHAR(200),
    category_id INT REFERENCES product_categories(id),
    brand_id    INT REFERENCES brands(id)
);
CREATE TABLE product_categories (
    id                 INT PRIMARY KEY,
    name               VARCHAR(100),
    parent_category_id INT REFERENCES product_categories(id)
);
```
### Document Model (JSON Storage)
```sql
-- Flexible document storage with indexing
CREATE TABLE documents (
    id            UUID PRIMARY KEY,
    document_type VARCHAR(50),
    data          JSONB,
    created_at    TIMESTAMP DEFAULT NOW(),
    updated_at    TIMESTAMP DEFAULT NOW()
);

-- Expression indexes on scalar JSON properties
-- (->> returns text, so a plain B-tree index applies;
--  use GIN on the data column itself for containment queries)
CREATE INDEX idx_documents_user_id
ON documents ((data->>'user_id'));
CREATE INDEX idx_documents_status
ON documents ((data->>'status'))
WHERE document_type = 'order';
```
### Graph Data Patterns
```sql
-- Adjacency list for hierarchical data
CREATE TABLE categories (
    id        INT PRIMARY KEY,
    name      VARCHAR(100),
    parent_id INT REFERENCES categories(id),
    level     INT,
    path      VARCHAR(500)  -- Materialized path: "/1/5/12/"
);

-- Many-to-many relationships
CREATE TABLE relationships (
    id                UUID PRIMARY KEY,
    from_entity_id    UUID,
    to_entity_id      UUID,
    relationship_type VARCHAR(50),
    created_at        TIMESTAMP
);

-- Inline INDEX clauses are MySQL-specific; create indexes separately
CREATE INDEX idx_relationships_from ON relationships (from_entity_id, relationship_type);
CREATE INDEX idx_relationships_to   ON relationships (to_entity_id, relationship_type);
```
## Migration Strategies
### Zero-Downtime Migration (Expand-Contract Pattern)
**Phase 1: Expand**
```sql
-- Add new column without constraints
ALTER TABLE users ADD COLUMN new_email VARCHAR(255);
-- Backfill data in batches
UPDATE users SET new_email = email WHERE id BETWEEN 1 AND 1000;
-- Continue in batches...
-- Add constraints after backfill
ALTER TABLE users ADD CONSTRAINT users_new_email_unique UNIQUE (new_email);
ALTER TABLE users ALTER COLUMN new_email SET NOT NULL;
```
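The batched backfill loop above can be sketched in Python, shown here against an in-memory SQLite database for illustration; the batch size and loop shape are assumptions, not a prescription:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, new_email TEXT)")
con.executemany("INSERT INTO users (id, email) VALUES (?, ?)",
                [(i, f"u{i}@example.com") for i in range(1, 2501)])

BATCH = 1000  # assumed batch size; tune to keep each transaction short
last_id = 0
while con.execute("SELECT COUNT(*) FROM users WHERE new_email IS NULL").fetchone()[0]:
    con.execute("UPDATE users SET new_email = email WHERE id > ? AND id <= ?",
                (last_id, last_id + BATCH))
    con.commit()  # one short transaction per batch, so locks are held briefly
    last_id += BATCH

remaining = con.execute("SELECT COUNT(*) FROM users WHERE new_email IS NULL").fetchone()[0]
print(remaining)  # 0
```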
**Phase 2: Contract**
```sql
-- Update application to use new column
-- Deploy application changes
-- Verify new column is being used
-- Remove old column
ALTER TABLE users DROP COLUMN email;
-- Rename new column
ALTER TABLE users RENAME COLUMN new_email TO email;
```
### Data Type Changes
```sql
-- Safe string to integer conversion
ALTER TABLE products ADD COLUMN sku_number INTEGER;
UPDATE products SET sku_number = CAST(sku AS INTEGER) WHERE sku ~ '^[0-9]+$';
-- Validate conversion success before dropping old column
```
## Partitioning Strategies
### Horizontal Partitioning (Sharding)
```sql
-- Range partitioning by date (parent columns here are illustrative;
-- a partitioned table must be declared with its partition key first)
CREATE TABLE sales (sale_id BIGINT, sale_date DATE, amount DECIMAL(10,2))
    PARTITION BY RANGE (sale_date);
CREATE TABLE sales_2023 PARTITION OF sales
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE sales_2024 PARTITION OF sales
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
-- Hash partitioning by user_id
CREATE TABLE user_data (user_id BIGINT, payload JSONB)
    PARTITION BY HASH (user_id);
CREATE TABLE user_data_0 PARTITION OF user_data
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE user_data_1 PARTITION OF user_data
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
```
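Application-side routing mirrors the partition definitions. A toy Python version of the MODULUS 4 scheme; real engines hash the key before taking the remainder, so plain modulo is a simplification:

```python
def route_partition(user_id: int, modulus: int = 4) -> str:
    """Pick the hash partition for a row, mirroring the MODULUS 4 DDL above.
    Simplification: engines hash the key first; we use the raw value."""
    return f"user_data_{user_id % modulus}"

print(route_partition(7))   # user_data_3
print(route_partition(12))  # user_data_0
```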
### Vertical Partitioning
```sql
-- Separate frequently accessed columns
CREATE TABLE users_core (
    id         INT PRIMARY KEY,
    email      VARCHAR(255),
    status     VARCHAR(20),
    created_at TIMESTAMP
);

-- Less frequently accessed profile data
CREATE TABLE users_profile (
    user_id     INT PRIMARY KEY REFERENCES users_core(id),
    bio         TEXT,
    preferences JSONB,
    last_login  TIMESTAMP
);
```
## Connection Management
### Connection Pooling
- **Pool Size**: CPU cores × 2 + effective spindle count
- **Connection Lifetime**: Rotate connections to prevent resource leaks
- **Timeout Settings**: Connection, idle, and query timeouts
- **Health Checks**: Regular connection validation
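The pool-size heuristic from the first bullet as a one-liner; treat it as a starting point to validate under load, not a hard rule:

```python
def pool_size(cpu_cores: int, effective_spindles: int) -> int:
    """Starting-point heuristic: cores * 2 + effective spindle count."""
    return cpu_cores * 2 + effective_spindles

print(pool_size(8, 1))  # 17
```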
### Read Replicas Strategy
```sql
-- Write queries to primary
INSERT INTO users (email, name) VALUES ('user@example.com', 'John Doe');
-- Read queries to replicas (with appropriate read preference)
SELECT * FROM users WHERE status = 'active'; -- Route to read replica
-- Consistent reads when required
SELECT * FROM users WHERE id = LAST_INSERT_ID(); -- Route to primary
```
## Caching Layers
### Cache-Aside Pattern
```python
def get_user(user_id):
    # Try cache first
    user = cache.get(f"user:{user_id}")
    if user is None:
        # Cache miss - query database
        user = db.query("SELECT * FROM users WHERE id = %s", user_id)
        # Store in cache
        cache.set(f"user:{user_id}", user, ttl=3600)
    return user
```
### Write-Through Cache
- **Consistency**: Always keep cache and database in sync
- **Write Latency**: Higher due to dual writes
- **Data Safety**: No data loss on cache failures
### Cache Invalidation Strategies
1. **TTL-Based**: Time-based expiration
2. **Event-Driven**: Invalidate on data changes
3. **Version-Based**: Use version numbers for consistency
4. **Tag-Based**: Group related cache entries
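Version-based invalidation (strategy 3) deserves a tiny sketch: bumping a per-entity version makes every key minted under the old version unreachable, so readers naturally fall through to the database. A minimal in-memory Python illustration, where a dict stands in for a real cache such as Redis:

```python
cache = {}                     # stand-in for a real cache
versions = {"user:42": 1}      # per-entity version counters

def cache_key(entity: str) -> str:
    # Embed the current version in the key
    return f"{entity}:v{versions[entity]}"

cache[cache_key("user:42")] = {"name": "Ada"}
versions["user:42"] += 1       # data changed: bump the version
stale = cache.get(cache_key("user:42"))
print(stale)  # None -> next read falls through to the database
```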
## Database Selection Guide
### SQL Databases
**PostgreSQL**
- **Strengths**: ACID compliance, complex queries, JSON support, extensibility
- **Use Cases**: OLTP applications, data warehousing, geospatial data
- **Scale**: Vertical scaling with read replicas
**MySQL**
- **Strengths**: Performance, replication, wide ecosystem support
- **Use Cases**: Web applications, content management, e-commerce
- **Scale**: Horizontal scaling through sharding
### NoSQL Databases
**Document Stores (MongoDB, CouchDB)**
- **Strengths**: Flexible schema, horizontal scaling, developer productivity
- **Use Cases**: Content management, catalogs, user profiles
- **Trade-offs**: Eventual consistency, limited support for complex queries
**Key-Value Stores (Redis, DynamoDB)**
- **Strengths**: High performance, simple model, excellent caching
- **Use Cases**: Session storage, real-time analytics, gaming leaderboards
- **Trade-offs**: Limited query capabilities, data modeling constraints
**Column-Family (Cassandra, HBase)**
- **Strengths**: Write-heavy workloads, linear scalability, fault tolerance
- **Use Cases**: Time-series data, IoT applications, messaging systems
- **Trade-offs**: Query flexibility, consistency model complexity
**Graph Databases (Neo4j, Amazon Neptune)**
- **Strengths**: Relationship queries, pattern matching, recommendation engines
- **Use Cases**: Social networks, fraud detection, knowledge graphs
- **Trade-offs**: Specialized use cases, learning curve
### NewSQL Databases
**Distributed SQL (CockroachDB, TiDB, Spanner)**
- **Strengths**: SQL compatibility with horizontal scaling
- **Use Cases**: Global applications requiring ACID guarantees
- **Trade-offs**: Complexity, latency for distributed transactions
## Tools & Scripts
### Schema Analyzer
- **Input**: SQL DDL files, JSON schema definitions
- **Analysis**: Normalization compliance, constraint validation, naming conventions
- **Output**: Analysis report, Mermaid ERD, improvement recommendations
### Index Optimizer
- **Input**: Schema definition, query patterns
- **Analysis**: Missing indexes, redundancy detection, selectivity estimation
- **Output**: Index recommendations, CREATE INDEX statements, performance projections
### Migration Generator
- **Input**: Current and target schemas
- **Analysis**: Schema differences, dependency resolution, risk assessment
- **Output**: Migration scripts, rollback plans, validation queries


@@ -0,0 +1,373 @@
# Database Selection Decision Tree
## Overview
Choosing the right database technology is crucial for application success. This guide provides a systematic approach to database selection based on specific requirements, data patterns, and operational constraints.
## Decision Framework
### Primary Questions
1. **What is your primary use case?**
- OLTP (Online Transaction Processing)
- OLAP (Online Analytical Processing)
- Real-time analytics
- Content management
- Search and discovery
- Time-series data
- Graph relationships
2. **What are your consistency requirements?**
- Strong consistency (ACID)
- Eventual consistency
- Causal consistency
- Session consistency
3. **What are your scalability needs?**
- Vertical scaling sufficient
- Horizontal scaling required
- Global distribution needed
- Multi-region requirements
4. **What is your data structure?**
- Structured (relational)
- Semi-structured (JSON/XML)
- Unstructured (documents, media)
- Graph relationships
- Time-series data
- Key-value pairs
## Decision Tree
```
START: What is your primary use case?
├── OLTP (Transactional Applications)
│ │
│ ├── Do you need strong ACID guarantees?
│ │ ├── YES → Do you need horizontal scaling?
│ │ │ ├── YES → Distributed SQL
│ │ │ │ ├── CockroachDB (Global, multi-region)
│ │ │ │ ├── TiDB (MySQL compatibility)
│ │ │ │ └── Spanner (Google Cloud)
│ │ │ └── NO → Traditional SQL
│ │ │ ├── PostgreSQL (Feature-rich, extensions)
│ │ │ ├── MySQL (Performance, ecosystem)
│ │ │ └── SQL Server (Microsoft stack)
│ │ └── NO → Are you primarily key-value access?
│ │ ├── YES → Key-Value Stores
│ │ │ ├── Redis (In-memory, caching)
│ │ │ ├── DynamoDB (AWS managed)
│ │ │ └── Cassandra (High availability)
│ │ └── NO → Document Stores
│ │ ├── MongoDB (General purpose)
│ │ ├── CouchDB (Sync, replication)
│ │ └── Amazon DocumentDB (MongoDB compatible)
│ │
├── OLAP (Analytics and Reporting)
│ │
│ ├── What is your data volume?
│ │ ├── Small to Medium (< 1TB) → Traditional SQL with optimization
│ │ │ ├── PostgreSQL with columnar extensions
│ │ │ ├── MySQL with analytics engine
│ │ │ └── SQL Server with columnstore
│ │ ├── Large (1TB - 100TB) → Data Warehouse Solutions
│ │ │ ├── Snowflake (Cloud-native)
│ │ │ ├── BigQuery (Google Cloud)
│ │ │ ├── Redshift (AWS)
│ │ │ └── Synapse (Azure)
│ │ └── Very Large (> 100TB) → Big Data Platforms
│ │ ├── Databricks (Unified analytics)
│ │ ├── Apache Spark on cloud
│ │ └── Hadoop ecosystem
│ │
├── Real-time Analytics
│ │
│ ├── Do you need sub-second query responses?
│ │ ├── YES → Stream Processing + OLAP
│ │ │ ├── ClickHouse (Fast analytics)
│ │ │ ├── Apache Druid (Real-time OLAP)
│ │ │ ├── Pinot (LinkedIn's real-time DB)
│ │ │ └── TimescaleDB (Time-series)
│ │ └── NO → Traditional OLAP solutions
│ │
├── Search and Discovery
│ │
│ ├── What type of search?
│ │ ├── Full-text search → Search Engines
│ │ │ ├── Elasticsearch (Full-featured)
│ │ │ ├── OpenSearch (AWS fork of ES)
│ │ │ └── Solr (Apache Lucene-based)
│ │ ├── Vector/similarity search → Vector Databases
│ │ │ ├── Pinecone (Managed vector DB)
│ │ │ ├── Weaviate (Open source)
│ │ │ ├── Chroma (Embeddings)
│ │ │ └── PostgreSQL with pgvector
│ │ └── Faceted search → Search + SQL combination
│ │
├── Graph Relationships
│ │
│ ├── Do you need complex graph traversals?
│ │ ├── YES → Graph Databases
│ │ │ ├── Neo4j (Property graph)
│ │ │ ├── Amazon Neptune (Multi-model)
│ │ │ ├── ArangoDB (Multi-model)
│ │ │ └── TigerGraph (Analytics focused)
│ │ └── NO → SQL with recursive queries
│ │ └── PostgreSQL with recursive CTEs
│ │
└── Time-series Data
├── What is your write volume?
├── High (millions/sec) → Specialized Time-series
│ ├── InfluxDB (Purpose-built)
│ ├── TimescaleDB (PostgreSQL extension)
│ ├── Apache Druid (Analytics focused)
│ └── Prometheus (Monitoring)
└── Medium → SQL with time-series optimization
└── PostgreSQL with partitioning
```
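The first branches of the tree translate directly into code. A deliberately incomplete toy sketch in Python; real selection weighs the full checklist later in this guide:

```python
def pick_category(use_case: str, acid: bool = False, horizontal: bool = False) -> str:
    """Toy encoding of the decision tree's first branches; illustrative only."""
    if use_case == "oltp":
        if acid:
            return "distributed SQL" if horizontal else "traditional SQL"
        return "document or key-value store"
    if use_case == "graph":
        return "graph database"
    if use_case == "time-series":
        return "time-series database"
    return "evaluate OLAP / search options"

print(pick_category("oltp", acid=True))                   # traditional SQL
print(pick_category("oltp", acid=True, horizontal=True))  # distributed SQL
```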
## Database Categories Deep Dive
### Traditional SQL Databases
**PostgreSQL**
- **Best For**: Complex queries, JSON data, extensions, geospatial
- **Strengths**: Feature-rich, reliable, strong consistency, extensible
- **Use Cases**: OLTP, mixed workloads, JSON documents, geospatial applications
- **Scaling**: Vertical scaling, read replicas, partitioning
- **When to Choose**: Need SQL features, complex queries, moderate scale
**MySQL**
- **Best For**: Web applications, read-heavy workloads, simple schema
- **Strengths**: Performance, replication, large ecosystem
- **Use Cases**: Web apps, content management, e-commerce
- **Scaling**: Read replicas, sharding, clustering (MySQL Cluster)
- **When to Choose**: Simple schema, performance priority, large community
**SQL Server**
- **Best For**: Microsoft ecosystem, enterprise features, business intelligence
- **Strengths**: Integration, tooling, enterprise features
- **Use Cases**: Enterprise applications, .NET applications, BI
- **Scaling**: Always On availability groups, partitioning
- **When to Choose**: Microsoft stack, enterprise requirements
### Distributed SQL (NewSQL)
**CockroachDB**
- **Best For**: Global applications, strong consistency, horizontal scaling
- **Strengths**: ACID guarantees, automatic rebalancing, survives node and region failures
- **Use Cases**: Multi-region apps, financial services, global SaaS
- **Trade-offs**: Complex setup, higher latency for global transactions
- **When to Choose**: Need SQL + global scale + consistency
**TiDB**
- **Best For**: MySQL compatibility with horizontal scaling
- **Strengths**: MySQL protocol, HTAP (hybrid), cloud-native
- **Use Cases**: MySQL migrations, hybrid workloads
- **When to Choose**: Existing MySQL expertise, need scale
### NoSQL Document Stores
**MongoDB**
- **Best For**: Flexible schema, rapid development, document-centric data
- **Strengths**: Developer experience, flexible schema, rich queries
- **Use Cases**: Content management, catalogs, user profiles, IoT
- **Scaling**: Automatic sharding, replica sets
- **When to Choose**: Schema evolution, document structure, rapid development
**CouchDB**
- **Best For**: Offline-first applications, multi-master replication
- **Strengths**: HTTP API, replication, conflict resolution
- **Use Cases**: Mobile apps, distributed systems, offline scenarios
- **When to Choose**: Need offline capabilities, bi-directional sync
### Key-Value Stores
**Redis**
- **Best For**: Caching, sessions, real-time applications, pub/sub
- **Strengths**: Performance, data structures, persistence options
- **Use Cases**: Caching, leaderboards, real-time analytics, queues
- **Scaling**: Clustering, sentinel for HA
- **When to Choose**: High performance, simple data model, caching
**DynamoDB**
- **Best For**: Serverless applications, predictable performance, AWS ecosystem
- **Strengths**: Managed, auto-scaling, consistent performance
- **Use Cases**: Web applications, gaming, IoT, mobile backends
- **Trade-offs**: Vendor lock-in, limited querying
- **When to Choose**: AWS ecosystem, serverless, managed solution
### Column-Family Stores
**Cassandra**
- **Best For**: Write-heavy workloads, high availability, linear scalability
- **Strengths**: No single point of failure, tunable consistency
- **Use Cases**: Time-series, IoT, messaging, activity feeds
- **Trade-offs**: Complex operations, eventual consistency
- **When to Choose**: High write volume, availability over consistency
**HBase**
- **Best For**: Big data applications, Hadoop ecosystem
- **Strengths**: Hadoop integration, consistent reads
- **Use Cases**: Analytics on big data, time-series at scale
- **When to Choose**: Hadoop ecosystem, very large datasets
### Graph Databases
**Neo4j**
- **Best For**: Complex relationships, graph algorithms, traversals
- **Strengths**: Mature ecosystem, Cypher query language, algorithms
- **Use Cases**: Social networks, recommendation engines, fraud detection
- **Trade-offs**: Specialized use case, learning curve
- **When to Choose**: Relationship-heavy data, graph algorithms
### Time-Series Databases
**InfluxDB**
- **Best For**: Time-series data, IoT, monitoring, analytics
- **Strengths**: Purpose-built, efficient storage, query language
- **Use Cases**: IoT sensors, monitoring, DevOps metrics
- **When to Choose**: High-volume time-series data
**TimescaleDB**
- **Best For**: Time-series with SQL familiarity
- **Strengths**: PostgreSQL compatibility, SQL queries, ecosystem
- **Use Cases**: Financial data, IoT with complex queries
- **When to Choose**: Time-series + SQL requirements
### Search Engines
**Elasticsearch**
- **Best For**: Full-text search, log analysis, real-time search
- **Strengths**: Powerful search, analytics, ecosystem (ELK stack)
- **Use Cases**: Search applications, log analysis, monitoring
- **Trade-offs**: Complex operations, resource intensive
- **When to Choose**: Advanced search requirements, analytics
### Data Warehouses
**Snowflake**
- **Best For**: Cloud-native analytics, data sharing, varied workloads
- **Strengths**: Separation of compute/storage, automatic scaling
- **Use Cases**: Data warehousing, analytics, data science
- **When to Choose**: Cloud-native, analytics-focused, multi-cloud
**BigQuery**
- **Best For**: Serverless analytics, Google ecosystem, machine learning
- **Strengths**: Serverless, petabyte scale, ML integration
- **Use Cases**: Analytics, data science, reporting
- **When to Choose**: Google Cloud, serverless analytics
## Selection Criteria Matrix
| Criterion | SQL | NewSQL | Document | Key-Value | Column-Family | Graph | Time-Series |
|-----------|-----|--------|----------|-----------|---------------|-------|-------------|
| ACID Guarantees | ✅ Strong | ✅ Strong | ⚠️ Eventual | ⚠️ Eventual | ⚠️ Tunable | ⚠️ Varies | ⚠️ Varies |
| Horizontal Scaling | ❌ Limited | ✅ Native | ✅ Native | ✅ Native | ✅ Native | ⚠️ Limited | ✅ Native |
| Query Flexibility | ✅ High | ✅ High | ⚠️ Moderate | ❌ Low | ❌ Low | ✅ High | ⚠️ Specialized |
| Schema Flexibility | ❌ Rigid | ❌ Rigid | ✅ High | ✅ High | ⚠️ Moderate | ✅ High | ⚠️ Structured |
| Performance (Reads) | ⚠️ Good | ⚠️ Good | ✅ Excellent | ✅ Excellent | ✅ Excellent | ⚠️ Good | ✅ Excellent |
| Performance (Writes) | ⚠️ Good | ⚠️ Good | ✅ Excellent | ✅ Excellent | ✅ Excellent | ⚠️ Good | ✅ Excellent |
| Operational Complexity | ✅ Low | ❌ High | ⚠️ Moderate | ✅ Low | ❌ High | ⚠️ Moderate | ⚠️ Moderate |
| Ecosystem Maturity | ✅ Mature | ⚠️ Growing | ✅ Mature | ✅ Mature | ✅ Mature | ✅ Mature | ⚠️ Growing |
## Decision Checklist
### Requirements Analysis
- [ ] **Data Volume**: Current and projected data size
- [ ] **Transaction Volume**: Reads per second, writes per second
- [ ] **Consistency Requirements**: Strong vs eventual consistency needs
- [ ] **Query Patterns**: Simple lookups vs complex analytics
- [ ] **Schema Evolution**: How often does schema change?
- [ ] **Geographic Distribution**: Single region vs global
- [ ] **Availability Requirements**: Acceptable downtime
- [ ] **Team Expertise**: Existing knowledge and learning curve
- [ ] **Budget Constraints**: Licensing, infrastructure, operational costs
- [ ] **Compliance Requirements**: Data residency, audit trails
### Technical Evaluation
- [ ] **Performance Testing**: Benchmark with realistic data and queries
- [ ] **Scalability Testing**: Test scaling limits and patterns
- [ ] **Failure Scenarios**: Test backup, recovery, and failure handling
- [ ] **Integration Testing**: APIs, connectors, ecosystem tools
- [ ] **Migration Path**: How to migrate from current system
- [ ] **Monitoring and Observability**: Available tooling and metrics
### Operational Considerations
- [ ] **Management Complexity**: Setup, configuration, maintenance
- [ ] **Backup and Recovery**: Built-in vs external tools
- [ ] **Security Features**: Authentication, authorization, encryption
- [ ] **Upgrade Path**: Version compatibility and upgrade process
- [ ] **Support Options**: Community vs commercial support
- [ ] **Lock-in Risk**: Portability and vendor independence
## Common Decision Patterns
### E-commerce Platform
**Typical Choice**: PostgreSQL or MySQL
- **Primary Data**: Product catalog, orders, users (structured)
- **Query Patterns**: OLTP with some analytics
- **Consistency**: Strong consistency for financial data
- **Scale**: Moderate with read replicas
- **Additional**: Redis for caching, Elasticsearch for product search
### IoT/Sensor Data Platform
**Typical Choice**: TimescaleDB or InfluxDB
- **Primary Data**: Time-series sensor readings
- **Query Patterns**: Time-based aggregations, trend analysis
- **Scale**: High write volume, moderate read volume
- **Additional**: Kafka for ingestion, PostgreSQL for metadata
### Social Media Application
**Typical Choice**: Combination approach
- **User Profiles**: MongoDB (flexible schema)
- **Relationships**: Neo4j (graph relationships)
- **Activity Feeds**: Cassandra (high write volume)
- **Search**: Elasticsearch (content discovery)
- **Caching**: Redis (sessions, real-time data)
### Analytics Platform
**Typical Choice**: Snowflake or BigQuery
- **Primary Use**: Complex analytical queries
- **Data Volume**: Large (TB to PB scale)
- **Query Patterns**: Ad-hoc analytics, reporting
- **Users**: Data analysts, data scientists
- **Additional**: Data lake (S3/GCS) for raw data storage
### Global SaaS Application
**Typical Choice**: CockroachDB or DynamoDB
- **Requirements**: Multi-region, strong consistency
- **Scale**: Global user base
- **Compliance**: Data residency requirements
- **Availability**: High availability across regions
## Migration Strategies
### From Monolithic to Distributed
1. **Assessment**: Identify scaling bottlenecks
2. **Data Partitioning**: Plan how to split data
3. **Gradual Migration**: Move non-critical data first
4. **Dual Writes**: Run both systems temporarily
5. **Validation**: Verify data consistency
6. **Cutover**: Switch reads and writes gradually
### Technology Stack Evolution
1. **Start Simple**: Begin with PostgreSQL or MySQL
2. **Identify Bottlenecks**: Monitor performance and scaling issues
3. **Selective Scaling**: Move specific workloads to specialized databases
4. **Polyglot Persistence**: Use multiple databases for different use cases
5. **Service Boundaries**: Align database choice with service boundaries
## Conclusion
Database selection should be driven by:
1. **Specific Use Case Requirements**: Not all applications need the same database
2. **Data Characteristics**: Structure, volume, and access patterns matter
3. **Non-functional Requirements**: Consistency, availability, performance targets
4. **Team and Organizational Factors**: Expertise, operational capacity, budget
5. **Evolution Path**: How requirements and scale will change over time
The best database choice is often not a single technology, but a combination of databases that each excel at their specific use case within your application architecture.


@@ -0,0 +1,424 @@
# Index Strategy Patterns
## Overview
Database indexes are critical for query performance, but they come with trade-offs. This guide covers proven patterns for index design, optimization strategies, and common pitfalls to avoid.
## Index Types and Use Cases
### B-Tree Indexes (Default)
**Best For:**
- Equality queries (`WHERE column = value`)
- Range queries (`WHERE column BETWEEN x AND y`)
- Sorting (`ORDER BY column`)
- Prefix pattern matching (`WHERE column LIKE 'prefix%'`); leading-wildcard patterns (`LIKE '%suffix'`) cannot use the index
**Characteristics:**
- Logarithmic lookup time O(log n)
- Supports partial matches on composite indexes
- Most versatile index type
**Example:**
```sql
-- Single column B-tree index
CREATE INDEX idx_customers_email ON customers (email);
-- Composite B-tree index
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
```
### Hash Indexes
**Best For:**
- Exact equality matches only
- High-cardinality columns
- Primary key lookups
**Characteristics:**
- Constant lookup time O(1) for exact matches
- Cannot support range queries or sorting
- Memory-efficient for equality operations
**Example:**
```sql
-- Hash index for exact lookups (PostgreSQL)
CREATE INDEX idx_users_id_hash ON users USING HASH (user_id);
```
### Partial Indexes
**Best For:**
- Filtering on subset of data
- Reducing index size and maintenance overhead
- Query patterns that consistently use specific filters
**Example:**
```sql
-- Index only active users
CREATE INDEX idx_active_users_email
ON users (email)
WHERE status = 'active';
-- Index recent orders only; PostgreSQL requires an immutable
-- predicate, so use a literal cutoff and rebuild it periodically
CREATE INDEX idx_recent_orders
ON orders (customer_id, created_at)
WHERE created_at > DATE '2024-01-01';
-- Index non-null values only
CREATE INDEX idx_customers_phone
ON customers (phone_number)
WHERE phone_number IS NOT NULL;
```
### Covering Indexes
**Best For:**
- Eliminating table lookups for SELECT queries
- Frequently accessed column combinations
- Read-heavy workloads
**Example:**
```sql
-- Covering index with INCLUDE clause (SQL Server/PostgreSQL)
CREATE INDEX idx_orders_customer_covering
ON orders (customer_id, order_date)
INCLUDE (order_total, status);
-- Query can be satisfied entirely from index:
-- SELECT order_total, status FROM orders
-- WHERE customer_id = 123 AND order_date > '2024-01-01';
```
### Functional/Expression Indexes
**Best For:**
- Queries on transformed column values
- Case-insensitive searches
- Complex calculations
**Example:**
```sql
-- Case-insensitive email searches
CREATE INDEX idx_users_email_lower
ON users (LOWER(email));
-- Date part extraction
CREATE INDEX idx_orders_month
ON orders (EXTRACT(MONTH FROM order_date));
-- JSON field indexing
CREATE INDEX idx_users_preferences_theme
ON users ((preferences->>'theme'));
```
## Composite Index Design Patterns
### Column Ordering Strategy
**Rule: Most Selective Equality Columns First, Range Columns Last**
```sql
-- Query: WHERE status = 'active' AND city = 'New York' AND age > 25
-- Assume: status has 3 values, city has 100 values, age has 80 values
-- GOOD: most selective equality column first, range column (age) last
CREATE INDEX idx_users_city_status_age ON users (city, status, age);
-- BAD: least selective column first, range column in the middle
CREATE INDEX idx_users_status_age_city ON users (status, age, city);
```
**Selectivity Calculation:**
```sql
-- Estimate selectivity for each column as the distinct-value ratio
SELECT 'status' AS column_name,
       COUNT(DISTINCT status)::float / COUNT(*) AS selectivity
FROM users
UNION ALL
SELECT 'city',
       COUNT(DISTINCT city)::float / COUNT(*)
FROM users
UNION ALL
SELECT 'age',
       COUNT(DISTINCT age)::float / COUNT(*)
FROM users;
```
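The same distinct-over-total ratio can be computed outside the database. This sketch uses a hypothetical, evenly distributed data set (3 statuses × 100 cities × 40 ages) purely to illustrate the comparison:

```python
# Hypothetical cross-product data: 3 statuses x 100 cities x 40 ages = 12,000 rows
rows = [
    {"status": s, "city": c, "age": a}
    for s in ("active", "inactive", "banned")
    for c in [f"city_{i}" for i in range(100)]
    for a in range(20, 60)
]

def selectivity(rows, column):
    """Distinct values / total rows -- higher means more selective."""
    return len({r[column] for r in rows}) / len(rows)

for col in ("status", "city", "age"):
    print(f"{col}: {selectivity(rows, col):.5f}")
# city scores highest of the three, so it leads the composite index
```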
### Query Pattern Matching
**Pattern 1: Equality + Range**
```sql
-- Query: WHERE customer_id = 123 AND order_date BETWEEN '2024-01-01' AND '2024-03-31'
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
```
**Pattern 2: Multiple Equality Conditions**
```sql
-- Query: WHERE status = 'active' AND category = 'premium' AND region = 'US'
CREATE INDEX idx_users_status_category_region ON users (status, category, region);
```
**Pattern 3: Equality + Sorting**
```sql
-- Query: WHERE category = 'electronics' ORDER BY price DESC, created_at DESC
CREATE INDEX idx_products_category_price_date ON products (category, price DESC, created_at DESC);
```
### Prefix Optimization
**Efficient Prefix Usage:**
```sql
-- Index supports all these queries efficiently:
CREATE INDEX idx_users_lastname_firstname_email ON users (last_name, first_name, email);
-- ✓ Uses index: WHERE last_name = 'Smith'
-- ✓ Uses index: WHERE last_name = 'Smith' AND first_name = 'John'
-- ✓ Uses index: WHERE last_name = 'Smith' AND first_name = 'John' AND email = 'john@...'
-- ✗ Cannot use index: WHERE first_name = 'John'
-- ✗ Cannot use index: WHERE email = 'john@...'
```
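The leftmost-prefix rule is easy to confirm empirically. This sketch uses Python's `sqlite3` as an illustrative stand-in (other engines apply the same rule) and adds a `bio` column outside the index so a covering scan doesn't mask the effect:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE users (
    id INTEGER PRIMARY KEY, last_name TEXT, first_name TEXT, email TEXT, bio TEXT)""")
conn.execute("""CREATE INDEX idx_users_lastname_firstname_email
    ON users (last_name, first_name, email)""")

def plan(where_clause):
    """Return the flattened EXPLAIN QUERY PLAN detail for a query."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT bio FROM users WHERE " + where_clause
    ).fetchall()
    return " ".join(r[-1] for r in rows)

print(plan("last_name = 'Smith'"))   # SEARCH ... USING INDEX (leftmost prefix)
print(plan("first_name = 'John'"))   # SCAN -- index unusable without the prefix
```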
## Performance Optimization Patterns
### Index Intersection vs Composite Indexes
**Scenario: Multiple single-column indexes**
```sql
CREATE INDEX idx_users_age ON users (age);
CREATE INDEX idx_users_city ON users (city);
CREATE INDEX idx_users_status ON users (status);
-- Query: WHERE age > 25 AND city = 'NYC' AND status = 'active'
-- Database may use index intersection (combining multiple indexes)
-- Performance varies by database engine and data distribution
```
**Better: Purpose-built composite index**
```sql
-- More efficient for the specific query pattern
CREATE INDEX idx_users_city_status_age ON users (city, status, age);
```
### Index Size vs Performance Trade-off
**Wide Indexes (Many Columns):**
```sql
-- Pros: Covers many query patterns, excellent for covering queries
-- Cons: Large index size, slower writes, more memory usage
CREATE INDEX idx_orders_comprehensive
ON orders (customer_id, order_date, status, total_amount, shipping_method, created_at)
INCLUDE (order_notes, billing_address);
```
**Narrow Indexes (Few Columns):**
```sql
-- Pros: Smaller size, faster writes, less memory
-- Cons: May not cover all query patterns
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
CREATE INDEX idx_orders_status ON orders (status);
```
### Maintenance Optimization
**Regular Index Analysis:**
```sql
-- PostgreSQL: Check index usage statistics
SELECT
schemaname,
tablename,
indexname,
idx_scan as index_scans,
idx_tup_read as tuples_read,
idx_tup_fetch as tuples_fetched
FROM pg_stat_user_indexes
WHERE idx_scan = 0 -- Potentially unused indexes
ORDER BY schemaname, tablename;
-- Check index size
SELECT
indexname,
pg_size_pretty(pg_relation_size(indexname::regclass)) as index_size
FROM pg_indexes
WHERE schemaname = 'public'
ORDER BY pg_relation_size(indexname::regclass) DESC;
```
## Common Anti-Patterns
### 1. Over-Indexing
**Problem:**
```sql
-- Too many similar indexes
CREATE INDEX idx_orders_customer ON orders (customer_id);
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
CREATE INDEX idx_orders_customer_status ON orders (customer_id, status);
CREATE INDEX idx_orders_customer_date_status ON orders (customer_id, order_date, status);
```
**Solution:**
```sql
-- One well-designed composite index can often replace several:
-- its prefixes also serve customer_id-only and customer_id + order_date queries
CREATE INDEX idx_orders_customer_date_status ON orders (customer_id, order_date, status);
-- Drop redundant indexes: idx_orders_customer, idx_orders_customer_date
-- (benchmark customer_id + status queries before dropping idx_orders_customer_status,
-- since status is no longer a directly searchable second key column)
```
### 2. Wrong Column Order
**Problem:**
```sql
-- Query: WHERE active = true AND user_type = 'premium' AND city = 'Chicago'
-- Bad order: boolean first (lowest selectivity)
CREATE INDEX idx_users_active_type_city ON users (active, user_type, city);
```
**Solution:**
```sql
-- Good order: most selective first
CREATE INDEX idx_users_city_type_active ON users (city, user_type, active);
```
### 3. Ignoring Query Patterns
**Problem:**
```sql
-- Index doesn't match common query patterns
CREATE INDEX idx_products_name ON products (product_name);
-- But queries are: WHERE category = 'electronics' AND price BETWEEN 100 AND 500
-- Index is not helpful for these queries
```
**Solution:**
```sql
-- Match actual query patterns
CREATE INDEX idx_products_category_price ON products (category, price);
```
### 4. Function in WHERE Without Functional Index
**Problem:**
```sql
-- Query uses function but no functional index
SELECT * FROM users WHERE LOWER(email) = 'john@example.com';
-- Regular index on email won't help
```
**Solution:**
```sql
-- Create functional index
CREATE INDEX idx_users_email_lower ON users (LOWER(email));
```
## Advanced Patterns
### Multi-Column Statistics
**When Columns Are Correlated:**
```sql
-- If city and state are highly correlated, create extended statistics
CREATE STATISTICS stats_address_correlation ON city, state FROM addresses;
ANALYZE addresses;
-- Helps query planner make better decisions for:
-- WHERE city = 'New York' AND state = 'NY'
```
### Conditional Indexes for Data Lifecycle
**Pattern: Different indexes for different data ages**
```sql
-- Note: PostgreSQL requires IMMUTABLE partial index predicates, so
-- CURRENT_DATE cannot be used; use literal cutoff dates (illustrative
-- values below) and rebuild the indexes on a schedule
-- Hot data (recent orders) - optimized for OLTP
CREATE INDEX idx_orders_hot_customer_date
ON orders (customer_id, order_date DESC)
WHERE order_date > DATE '2024-02-10';
-- Warm data (older orders) - optimized for analytics
CREATE INDEX idx_orders_warm_date_total
ON orders (order_date, total_amount)
WHERE order_date <= DATE '2024-02-10'
  AND order_date > DATE '2023-03-12';
-- Cold data (archived orders) - minimal indexing
CREATE INDEX idx_orders_cold_date
ON orders (order_date)
WHERE order_date <= DATE '2023-03-12';
```
### Index-Only Scan Optimization
**Design indexes to avoid table access:**
```sql
-- Query: SELECT order_id, total_amount, status FROM orders WHERE customer_id = ?
CREATE INDEX idx_orders_customer_covering
ON orders (customer_id)
INCLUDE (order_id, total_amount, status);
-- Or as composite index (if database doesn't support INCLUDE)
CREATE INDEX idx_orders_customer_covering
ON orders (customer_id, order_id, total_amount, status);
```
## Index Monitoring and Maintenance
### Performance Monitoring Queries
**Find slow queries that might benefit from indexes:**
```sql
-- PostgreSQL: Find queries with high cost
SELECT
query,
calls,
total_exec_time,  -- named total_time before PostgreSQL 13
mean_exec_time,   -- named mean_time before PostgreSQL 13
rows
FROM pg_stat_statements
WHERE mean_exec_time > 1000 -- Queries averaging > 1 second (times in ms)
ORDER BY mean_exec_time DESC;
```
**Identify missing indexes:**
```sql
-- Look for sequential scans on large tables
SELECT
schemaname,
tablename,
seq_scan,
seq_tup_read,
idx_scan,
n_tup_ins + n_tup_upd + n_tup_del as write_activity
FROM pg_stat_user_tables
WHERE seq_scan > 100
AND seq_tup_read > 100000 -- Large sequential scans
AND (idx_scan = 0 OR seq_scan > idx_scan * 2)
ORDER BY seq_tup_read DESC;
```
### Index Maintenance Schedule
**Regular Maintenance Tasks:**
```sql
-- Rebuild fragmented indexes (SQL Server)
ALTER INDEX ALL ON orders REBUILD;
-- Update statistics (PostgreSQL)
ANALYZE orders;
-- Check for unused indexes monthly
SELECT * FROM pg_stat_user_indexes WHERE idx_scan = 0;
```
## Conclusion
Effective index strategy requires:
1. **Understanding Query Patterns**: Analyze actual application queries, not theoretical scenarios
2. **Measuring Performance**: Use query execution plans and timing to validate index effectiveness
3. **Balancing Trade-offs**: More indexes improve reads but slow writes and increase storage
4. **Regular Maintenance**: Monitor index usage and performance, remove unused indexes
5. **Iterative Improvement**: Start with essential indexes, add and optimize based on real usage
The goal is not to index every possible query pattern, but to create a focused set of indexes that provide maximum benefit for your application's specific workload while minimizing maintenance overhead.

# Database Normalization Guide
## Overview
Database normalization is the process of organizing data to minimize redundancy and dependency issues. It involves decomposing tables to eliminate data anomalies and improve data integrity.
## Normal Forms
### First Normal Form (1NF)
**Requirements:**
- Each column contains atomic (indivisible) values
- Each column contains values of the same type
- Each column has a unique name
- The order of data storage doesn't matter
**Violations and Solutions:**
**Problem: Multiple values in single column**
```sql
-- BAD: Multiple phone numbers in one column
CREATE TABLE customers (
id INT PRIMARY KEY,
name VARCHAR(100),
phones VARCHAR(500) -- "555-1234, 555-5678, 555-9012"
);
-- GOOD: Separate table for multiple phones
CREATE TABLE customers (
id INT PRIMARY KEY,
name VARCHAR(100)
);
CREATE TABLE customer_phones (
id INT PRIMARY KEY,
customer_id INT REFERENCES customers(id),
phone VARCHAR(20),
phone_type VARCHAR(10) -- 'mobile', 'home', 'work'
);
```
**Problem: Repeating groups**
```sql
-- BAD: Repeating column patterns
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
item1_name VARCHAR(100),
item1_qty INT,
item1_price DECIMAL(8,2),
item2_name VARCHAR(100),
item2_qty INT,
item2_price DECIMAL(8,2),
item3_name VARCHAR(100),
item3_qty INT,
item3_price DECIMAL(8,2)
);
-- GOOD: Separate table for order items
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE
);
CREATE TABLE order_items (
id INT PRIMARY KEY,
order_id INT REFERENCES orders(order_id),
item_name VARCHAR(100),
quantity INT,
unit_price DECIMAL(8,2)
);
```
### Second Normal Form (2NF)
**Requirements:**
- Must be in 1NF
- All non-key attributes must be fully functionally dependent on the primary key
- No partial dependencies (applies only to tables with composite primary keys)
**Violations and Solutions:**
**Problem: Partial dependency on composite key**
```sql
-- BAD: Student course enrollment with partial dependencies
CREATE TABLE student_courses (
student_id INT,
course_id INT,
student_name VARCHAR(100), -- Depends only on student_id
student_major VARCHAR(50), -- Depends only on student_id
course_title VARCHAR(200), -- Depends only on course_id
course_credits INT, -- Depends only on course_id
grade CHAR(2), -- Depends on both student_id AND course_id
PRIMARY KEY (student_id, course_id)
);
-- GOOD: Separate tables eliminate partial dependencies
CREATE TABLE students (
student_id INT PRIMARY KEY,
student_name VARCHAR(100),
student_major VARCHAR(50)
);
CREATE TABLE courses (
course_id INT PRIMARY KEY,
course_title VARCHAR(200),
course_credits INT
);
CREATE TABLE enrollments (
student_id INT,
course_id INT,
grade CHAR(2),
enrollment_date DATE,
PRIMARY KEY (student_id, course_id),
FOREIGN KEY (student_id) REFERENCES students(student_id),
FOREIGN KEY (course_id) REFERENCES courses(course_id)
);
```
### Third Normal Form (3NF)
**Requirements:**
- Must be in 2NF
- No transitive dependencies (non-key attributes should not depend on other non-key attributes)
- All non-key attributes must depend directly on the primary key
**Violations and Solutions:**
**Problem: Transitive dependency**
```sql
-- BAD: Employee table with transitive dependency
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
employee_name VARCHAR(100),
department_id INT,
department_name VARCHAR(100), -- Depends on department_id, not employee_id
department_location VARCHAR(100), -- Transitive dependency through department_id
department_budget DECIMAL(10,2), -- Transitive dependency through department_id
salary DECIMAL(8,2)
);
-- GOOD: Separate department information
CREATE TABLE departments (
department_id INT PRIMARY KEY,
department_name VARCHAR(100),
department_location VARCHAR(100),
department_budget DECIMAL(10,2)
);
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
employee_name VARCHAR(100),
department_id INT,
salary DECIMAL(8,2),
FOREIGN KEY (department_id) REFERENCES departments(department_id)
);
```
### Boyce-Codd Normal Form (BCNF)
**Requirements:**
- Must be in 3NF
- Every determinant must be a candidate key
- Stricter than 3NF - handles cases where 3NF doesn't eliminate all anomalies
**Violations and Solutions:**
**Problem: Determinant that's not a candidate key**
```sql
-- BAD: Student advisor relationship with BCNF violation
-- Assumption: Each student has one advisor per subject,
-- each advisor teaches only one subject, but can advise multiple students
CREATE TABLE student_advisor (
student_id INT,
subject VARCHAR(50),
advisor_id INT,
PRIMARY KEY (student_id, subject)
);
-- Problem: advisor_id determines subject, but advisor_id is not a candidate key
-- GOOD: Separate the functional dependencies
CREATE TABLE advisors (
advisor_id INT PRIMARY KEY,
subject VARCHAR(50)
);
CREATE TABLE student_advisor_assignments (
student_id INT,
advisor_id INT,
PRIMARY KEY (student_id, advisor_id),
FOREIGN KEY (advisor_id) REFERENCES advisors(advisor_id)
);
```
## Denormalization Strategies
### When to Denormalize
1. **Performance Requirements**: When query performance is more critical than storage efficiency
2. **Read-Heavy Workloads**: When data is read much more frequently than it's updated
3. **Reporting Systems**: When complex joins negatively impact reporting performance
4. **Caching Strategies**: When pre-computed values eliminate expensive calculations
### Common Denormalization Patterns
**1. Redundant Storage for Performance**
```sql
-- Store frequently accessed calculated values
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_total DECIMAL(10,2), -- Denormalized: sum of order_items.total
item_count INT, -- Denormalized: count of order_items
created_at TIMESTAMP
);
CREATE TABLE order_items (
item_id INT PRIMARY KEY,
order_id INT,
product_id INT,
quantity INT,
unit_price DECIMAL(8,2),
total DECIMAL(10,2) -- quantity * unit_price (denormalized)
);
```
**2. Materialized Aggregates**
```sql
-- Pre-computed summary tables for reporting
CREATE TABLE monthly_sales_summary (
year_month VARCHAR(7), -- '2024-03'
product_category VARCHAR(50),
total_sales DECIMAL(12,2),
total_units INT,
avg_order_value DECIMAL(8,2),
unique_customers INT,
updated_at TIMESTAMP
);
```
**3. Historical Data Snapshots**
```sql
-- Store historical state to avoid complex temporal queries
CREATE TABLE customer_status_history (
id INT PRIMARY KEY,
customer_id INT,
status VARCHAR(20),
tier VARCHAR(10),
total_lifetime_value DECIMAL(12,2), -- Snapshot at this point in time
snapshot_date DATE
);
```
## Trade-offs Analysis
### Normalization Benefits
- **Data Integrity**: Reduced risk of inconsistent data
- **Storage Efficiency**: Less data duplication
- **Update Efficiency**: Changes need to be made in only one place
- **Flexibility**: Easier to modify schema as requirements change
### Normalization Costs
- **Query Complexity**: More joins required for data retrieval
- **Performance Impact**: Joins can be expensive on large datasets
- **Development Complexity**: More complex data access patterns
### Denormalization Benefits
- **Query Performance**: Fewer joins, faster queries
- **Simplified Queries**: Direct access to related data
- **Read Optimization**: Optimized for data retrieval patterns
- **Reduced Load**: Less database processing for common operations
### Denormalization Costs
- **Data Redundancy**: Increased storage requirements
- **Update Complexity**: Multiple places may need updates
- **Consistency Risk**: Higher risk of data inconsistencies
- **Maintenance Overhead**: Additional code to maintain derived values
## Best Practices
### 1. Start with Full Normalization
- Begin with a fully normalized design
- Identify performance bottlenecks through testing
- Selectively denormalize based on actual performance needs
### 2. Use Triggers for Consistency
```sql
-- PostgreSQL: row triggers call a function; the TG_OP branch handles
-- DELETE, where NEW is not defined
CREATE FUNCTION refresh_order_total() RETURNS trigger AS $$
DECLARE
    target_order INT;
BEGIN
    IF TG_OP = 'DELETE' THEN
        target_order := OLD.order_id;
    ELSE
        target_order := NEW.order_id;
    END IF;
    UPDATE orders
    SET order_total = COALESCE((
        SELECT SUM(quantity * unit_price)
        FROM order_items
        WHERE order_id = target_order
    ), 0)
    WHERE order_id = target_order;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER update_order_total
AFTER INSERT OR UPDATE OR DELETE ON order_items
FOR EACH ROW EXECUTE FUNCTION refresh_order_total();
```
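A runnable miniature of the trigger-maintained total, using Python's `sqlite3` (SQLite trigger bodies are inline SQL rather than PostgreSQL-style functions; only the insert case is shown for brevity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_total REAL NOT NULL DEFAULT 0);
CREATE TABLE order_items (
    item_id INTEGER PRIMARY KEY, order_id INT, quantity INT, unit_price REAL);
-- Keep the denormalized orders.order_total in sync on every item insert
CREATE TRIGGER order_items_after_insert AFTER INSERT ON order_items
BEGIN
    UPDATE orders
    SET order_total = (SELECT COALESCE(SUM(quantity * unit_price), 0)
                       FROM order_items WHERE order_id = NEW.order_id)
    WHERE order_id = NEW.order_id;
END;
""")
conn.execute("INSERT INTO orders (order_id) VALUES (1)")
conn.executemany(
    "INSERT INTO order_items (order_id, quantity, unit_price) VALUES (?, ?, ?)",
    [(1, 2, 10.0), (1, 1, 5.0)],
)
total = conn.execute("SELECT order_total FROM orders WHERE order_id = 1").fetchone()[0]
print(total)  # 25.0 (2 * 10.0 + 1 * 5.0)
```

A production version would also cover UPDATE and DELETE so the total never drifts.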
### 3. Consider Materialized Views
```sql
-- Materialized view for complex aggregations
CREATE MATERIALIZED VIEW customer_summary AS
SELECT
c.customer_id,
c.customer_name,
COUNT(o.order_id) as order_count,
SUM(o.order_total) as lifetime_value,
AVG(o.order_total) as avg_order_value,
MAX(o.created_at) as last_order_date
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name;
```
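Engines without materialized views can emulate the refresh with `CREATE TABLE AS`. A small sketch in Python's `sqlite3`, with hypothetical sample data and a trimmed column list, shows the rebuilt summary:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INT, order_total REAL);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders VALUES (10, 1, 40.0), (11, 1, 60.0), (12, 2, 25.0);
""")
# SQLite has no MATERIALIZED VIEW; a scheduled drop-and-rebuild plays its role
conn.executescript("""
DROP TABLE IF EXISTS customer_summary;
CREATE TABLE customer_summary AS
SELECT c.customer_id,
       c.customer_name,
       COUNT(o.order_id)  AS order_count,
       SUM(o.order_total) AS lifetime_value
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name;
""")
print(conn.execute(
    "SELECT customer_name, order_count, lifetime_value "
    "FROM customer_summary ORDER BY customer_id"
).fetchall())  # [('Ada', 2, 100.0), ('Grace', 1, 25.0)]
```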
### 4. Document Denormalization Decisions
- Clearly document why denormalization was chosen
- Specify which data is derived and how it's maintained
- Include performance benchmarks that justify the decision
### 5. Monitor and Validate
- Implement validation checks for denormalized data
- Regular audits to ensure data consistency
- Performance monitoring to validate denormalization benefits
## Common Anti-Patterns
### 1. Premature Denormalization
Starting with denormalized design without understanding actual performance requirements.
### 2. Over-Normalization
Creating too many small tables that require excessive joins for simple queries.
### 3. Inconsistent Approach
Mixing normalized and denormalized patterns without clear strategy.
### 4. Ignoring Maintenance
Denormalizing without proper mechanisms to maintain data consistency.
## Conclusion
Normalization and denormalization are both valuable tools in database design. The key is understanding when to apply each approach:
- **Use normalization** for transactional systems where data integrity is paramount
- **Consider denormalization** for analytical systems or when performance testing reveals bottlenecks
- **Apply selectively** based on actual usage patterns and performance requirements
- **Maintain consistency** through proper design patterns and validation mechanisms
The goal is not to achieve perfect normalization or denormalization, but to create a design that best serves your application's specific needs while maintaining data quality and system performance.