# database-designer reference

## Database Design Principles

### Normalization Forms

#### First Normal Form (1NF)
- **Atomic Values**: Each column contains indivisible values
- **Unique Column Names**: No duplicate column names within a table
- **Uniform Data Types**: Each column contains the same type of data
- **Row Uniqueness**: No duplicate rows in the table

**Example Violation:**
```sql
-- BAD: Multiple phone numbers in one column
CREATE TABLE contacts (
  id INT PRIMARY KEY,
  name VARCHAR(100),
  phones VARCHAR(200) -- "123-456-7890, 098-765-4321"
);

-- GOOD: Separate table for phone numbers
CREATE TABLE contacts (
  id INT PRIMARY KEY,
  name VARCHAR(100)
);

CREATE TABLE contact_phones (
  id INT PRIMARY KEY,
  contact_id INT REFERENCES contacts(id),
  phone_number VARCHAR(20),
  phone_type VARCHAR(10)
);
```

#### Second Normal Form (2NF)
- **1NF Compliance**: Must satisfy First Normal Form
- **Full Functional Dependency**: Non-key attributes depend on the entire primary key
- **Partial Dependency Elimination**: Remove attributes that depend on part of a composite key

**Example Violation:**
```sql
-- BAD: Student course table with partial dependencies
CREATE TABLE student_courses (
  student_id INT,
  course_id INT,
  student_name VARCHAR(100), -- Depends only on student_id
  course_name VARCHAR(100),  -- Depends only on course_id
  grade CHAR(1),
  PRIMARY KEY (student_id, course_id)
);

-- GOOD: Separate tables eliminate partial dependencies
CREATE TABLE students (
  id INT PRIMARY KEY,
  name VARCHAR(100)
);

CREATE TABLE courses (
  id INT PRIMARY KEY,
  name VARCHAR(100)
);

CREATE TABLE enrollments (
  student_id INT REFERENCES students(id),
  course_id INT REFERENCES courses(id),
  grade CHAR(1),
  PRIMARY KEY (student_id, course_id)
);
```

#### Third Normal Form (3NF)
- **2NF Compliance**: Must satisfy Second Normal Form
- **Transitive Dependency Elimination**: Non-key attributes should not depend on other non-key attributes
- **Direct Dependency**: Non-key attributes depend directly on the primary key

**Example Violation:**
```sql
-- BAD: Employee table with transitive dependency
CREATE TABLE employees (
  id INT PRIMARY KEY,
  name VARCHAR(100),
  department_id INT,
  department_name VARCHAR(100),   -- Depends on department_id, not employee id
  department_budget DECIMAL(10,2) -- Transitive dependency
);

-- GOOD: Separate department information
CREATE TABLE departments (
  id INT PRIMARY KEY,
  name VARCHAR(100),
  budget DECIMAL(10,2)
);

CREATE TABLE employees (
  id INT PRIMARY KEY,
  name VARCHAR(100),
  department_id INT REFERENCES departments(id)
);
```

#### Boyce-Codd Normal Form (BCNF)
- **3NF Compliance**: Must satisfy Third Normal Form
- **Determinant Key Rule**: Every determinant must be a candidate key
- **Stricter 3NF**: Handles anomalies not covered by 3NF
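
**Example Violation:** (a minimal sketch for symmetry with the forms above; the tables are hypothetical, with the classic assumption that each teacher teaches exactly one subject, so `teacher` determines `subject` without being a candidate key)
```sql
-- BAD: In 3NF but not BCNF — teacher → subject,
-- yet teacher is not a candidate key
CREATE TABLE course_assignments (
  student_id INT,
  subject VARCHAR(50),
  teacher VARCHAR(50),
  PRIMARY KEY (student_id, subject)
);

-- GOOD: every determinant is now a candidate key
CREATE TABLE teachers (
  teacher VARCHAR(50) PRIMARY KEY,
  subject VARCHAR(50)
);

CREATE TABLE student_teachers (
  student_id INT,
  teacher VARCHAR(50) REFERENCES teachers(teacher),
  PRIMARY KEY (student_id, teacher)
);
```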

### Denormalization Strategies

#### When to Denormalize
1. **Read-Heavy Workloads**: High query frequency with acceptable write trade-offs
2. **Performance Bottlenecks**: Join operations causing significant latency
3. **Aggregation Needs**: Frequent calculation of derived values
4. **Caching Requirements**: Pre-computed results for common queries

#### Common Denormalization Patterns

**Redundant Storage**
```sql
-- Store calculated values to avoid expensive joins
CREATE TABLE orders (
  id INT PRIMARY KEY,
  customer_id INT REFERENCES customers(id),
  customer_name VARCHAR(100), -- Denormalized from customers table
  order_total DECIMAL(10,2),  -- Denormalized calculation
  created_at TIMESTAMP
);
```

**Materialized Aggregates**
```sql
-- Pre-computed summary tables
CREATE TABLE customer_statistics (
  customer_id INT PRIMARY KEY,
  total_orders INT,
  lifetime_value DECIMAL(12,2),
  last_order_date DATE,
  updated_at TIMESTAMP
);
```
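
A summary table like this is only useful if it stays fresh. One common approach (PostgreSQL syntax; the literal values are illustrative) is an upsert run whenever an order is recorded:
```sql
-- Maintain the aggregate alongside each new order
INSERT INTO customer_statistics
  (customer_id, total_orders, lifetime_value, last_order_date, updated_at)
VALUES (42, 1, 99.50, CURRENT_DATE, NOW())
ON CONFLICT (customer_id) DO UPDATE SET
  total_orders    = customer_statistics.total_orders + 1,
  lifetime_value  = customer_statistics.lifetime_value + EXCLUDED.lifetime_value,
  last_order_date = EXCLUDED.last_order_date,
  updated_at      = NOW();
```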

## Index Optimization Strategies

### B-Tree Indexes
- **Default Choice**: Best for range queries, sorting, and equality matches
- **Column Order**: Most selective columns first for composite indexes
- **Prefix Matching**: Supports leading column subset queries
- **Maintenance Cost**: Balanced tree structure with logarithmic operations

### Hash Indexes
- **Equality Queries**: Optimal for exact match lookups
- **Memory Efficiency**: Constant-time access for single-value queries
- **Range Limitations**: Cannot support range or partial matches
- **Use Cases**: Primary keys, unique constraints, cache keys

### Composite Indexes
```sql
-- Query pattern determines optimal column order
-- Query: WHERE status = 'active' AND created_date > '2023-01-01' ORDER BY priority DESC
CREATE INDEX idx_task_status_date_priority
ON tasks (status, created_date, priority DESC);

-- Query: WHERE user_id = 123 AND category IN ('A', 'B') AND date_field BETWEEN '...' AND '...'
CREATE INDEX idx_user_category_date
ON user_activities (user_id, category, date_field);
```

### Covering Indexes
```sql
-- Include additional columns to avoid table lookups
-- (INCLUDE syntax: PostgreSQL 11+ and SQL Server)
CREATE INDEX idx_user_email_covering
ON users (email)
INCLUDE (first_name, last_name, status);

-- Query can be satisfied entirely from the index
-- SELECT first_name, last_name, status FROM users WHERE email = 'user@example.com';
```

### Partial Indexes
```sql
-- Index only the relevant subset of data
CREATE INDEX idx_active_users_email
ON users (email)
WHERE status = 'active';

-- Index for recent orders only. The predicate must be immutable,
-- so use a fixed cutoff date (not CURRENT_DATE) and rebuild periodically
CREATE INDEX idx_recent_orders_customer
ON orders (customer_id, created_at)
WHERE created_at > DATE '2024-01-01';
```

## Query Analysis & Optimization

### Query Patterns Recognition
1. **Equality Filters**: Single-column B-tree indexes
2. **Range Queries**: B-tree with proper column ordering
3. **Text Search**: Full-text indexes or trigram indexes
4. **Join Operations**: Foreign key indexes on both sides
5. **Sorting Requirements**: Indexes matching ORDER BY clauses

### Index Selection Algorithm
```
1. Identify WHERE clause columns
2. Determine most selective columns first
3. Consider JOIN conditions
4. Include ORDER BY columns if possible
5. Evaluate covering index opportunities
6. Check for existing overlapping indexes
```
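
The column-ordering steps above can be sketched in code. This is a simplified heuristic only; the query shape and the selectivity figures are assumptions standing in for real table statistics:

```python
def suggest_index(query, selectivity):
    """Order candidate index columns per the heuristic above.

    query: dict with optional 'where', 'join', 'order_by' column lists.
    selectivity: estimated fraction of rows matched per column
    (lower = more selective), assumed to come from table statistics.
    """
    # Steps 1-2: WHERE columns, most selective first
    cols = sorted(query.get("where", []), key=lambda c: selectivity.get(c, 1.0))
    # Step 3: add JOIN columns not already present
    cols += [c for c in query.get("join", []) if c not in cols]
    # Step 4: trailing ORDER BY columns can make the sort index-only
    cols += [c for c in query.get("order_by", []) if c not in cols]
    return cols

query = {"where": ["status", "created_date"], "order_by": ["priority"]}
order = suggest_index(query, {"status": 0.5, "created_date": 0.1})
# created_date is the more selective predicate, so it leads the index
```

Steps 5-6 (covering columns, overlap with existing indexes) would be applied against the catalog afterwards; they are omitted here for brevity.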

## Data Modeling Patterns

### Star Schema (Data Warehousing)
```sql
-- Central fact table
CREATE TABLE sales_facts (
  sale_id BIGINT PRIMARY KEY,
  product_id INT REFERENCES products(id),
  customer_id INT REFERENCES customers(id),
  date_id INT REFERENCES date_dimension(id),
  store_id INT REFERENCES stores(id),
  quantity INT,
  unit_price DECIMAL(8,2),
  total_amount DECIMAL(10,2)
);

-- Dimension tables
CREATE TABLE date_dimension (
  id INT PRIMARY KEY,
  date_value DATE,
  year INT,
  quarter INT,
  month INT,
  day_of_week INT,
  is_weekend BOOLEAN
);
```

### Snowflake Schema
```sql
-- Normalized dimension tables
CREATE TABLE products (
  id INT PRIMARY KEY,
  name VARCHAR(200),
  category_id INT REFERENCES product_categories(id),
  brand_id INT REFERENCES brands(id)
);

CREATE TABLE product_categories (
  id INT PRIMARY KEY,
  name VARCHAR(100),
  parent_category_id INT REFERENCES product_categories(id)
);
```

### Document Model (JSON Storage)
```sql
-- Flexible document storage with indexing
CREATE TABLE documents (
  id UUID PRIMARY KEY,
  document_type VARCHAR(50),
  data JSONB,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

-- B-tree expression index for equality lookups on a JSON property
CREATE INDEX idx_documents_user_id
ON documents ((data->>'user_id'));

-- GIN index on the whole document for containment (@>) queries
CREATE INDEX idx_documents_data
ON documents USING GIN (data);

-- Partial expression index scoped to one document type
CREATE INDEX idx_documents_status
ON documents ((data->>'status'))
WHERE document_type = 'order';
```

### Graph Data Patterns
```sql
-- Adjacency list for hierarchical data
CREATE TABLE categories (
  id INT PRIMARY KEY,
  name VARCHAR(100),
  parent_id INT REFERENCES categories(id),
  level INT,
  path VARCHAR(500) -- Materialized path: "/1/5/12/"
);

-- Many-to-many relationships
CREATE TABLE relationships (
  id UUID PRIMARY KEY,
  from_entity_id UUID,
  to_entity_id UUID,
  relationship_type VARCHAR(50),
  created_at TIMESTAMP
);

-- Inline INDEX clauses are MySQL-specific; use separate statements elsewhere
CREATE INDEX idx_relationships_from ON relationships (from_entity_id, relationship_type);
CREATE INDEX idx_relationships_to ON relationships (to_entity_id, relationship_type);
```
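
Traversing the adjacency list above is typically done with a recursive CTE (PostgreSQL syntax; the starting id is illustrative):
```sql
-- All descendants of category 1, with depth
WITH RECURSIVE subtree AS (
  SELECT id, name, parent_id, 0 AS depth
  FROM categories
  WHERE id = 1
  UNION ALL
  SELECT c.id, c.name, c.parent_id, s.depth + 1
  FROM categories c
  JOIN subtree s ON c.parent_id = s.id
)
SELECT * FROM subtree;
```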

## Migration Strategies

### Zero-Downtime Migration (Expand-Contract Pattern)

**Phase 1: Expand**
```sql
-- Add new column without constraints
ALTER TABLE users ADD COLUMN new_email VARCHAR(255);

-- Backfill data in batches
UPDATE users SET new_email = email WHERE id BETWEEN 1 AND 1000;
-- Continue in batches...

-- Add constraints after backfill
ALTER TABLE users ADD CONSTRAINT users_new_email_unique UNIQUE (new_email);
ALTER TABLE users ALTER COLUMN new_email SET NOT NULL;
```
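
The batched backfill is usually driven by a small script so each batch commits separately and keeps lock times short. A sketch — the `execute` callback and the batch size are assumptions, not a prescribed interface:

```python
def backfill(execute, max_id, batch_size=1000):
    """Copy email -> new_email in id ranges, one statement per batch.

    execute(sql, params) is an assumed helper that runs one statement
    and commits; committing per batch keeps row locks short-lived.
    """
    batches = []
    for start in range(1, max_id + 1, batch_size):
        end = min(start + batch_size - 1, max_id)
        execute(
            "UPDATE users SET new_email = email "
            "WHERE id BETWEEN %s AND %s AND new_email IS NULL",
            (start, end),
        )
        batches.append((start, end))
    return batches

# With max_id=2500 this issues three batches:
# (1, 1000), (1001, 2000), (2001, 2500)
```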

**Phase 2: Contract**
```sql
-- Update application to use new column
-- Deploy application changes
-- Verify new column is being used

-- Remove old column
ALTER TABLE users DROP COLUMN email;
-- Rename new column
ALTER TABLE users RENAME COLUMN new_email TO email;
```

### Data Type Changes
```sql
-- Safe string-to-integer conversion
ALTER TABLE products ADD COLUMN sku_number INTEGER;
UPDATE products SET sku_number = CAST(sku AS INTEGER) WHERE sku ~ '^[0-9]+$';
-- Validate conversion success before dropping the old column
```

## Partitioning Strategies

### Horizontal Partitioning (Sharding)
```sql
-- Parent tables must be declared partitioned first (columns illustrative)
CREATE TABLE sales (
  id BIGINT,
  sale_date DATE,
  amount DECIMAL(10,2)
) PARTITION BY RANGE (sale_date);

-- Range partitioning by date
CREATE TABLE sales_2023 PARTITION OF sales
FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

CREATE TABLE sales_2024 PARTITION OF sales
FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- Hash partitioning by user_id
CREATE TABLE user_data (
  user_id BIGINT,
  payload JSONB
) PARTITION BY HASH (user_id);

CREATE TABLE user_data_0 PARTITION OF user_data
FOR VALUES WITH (MODULUS 4, REMAINDER 0);

CREATE TABLE user_data_1 PARTITION OF user_data
FOR VALUES WITH (MODULUS 4, REMAINDER 1);
```

### Vertical Partitioning
```sql
-- Separate frequently accessed columns
CREATE TABLE users_core (
  id INT PRIMARY KEY,
  email VARCHAR(255),
  status VARCHAR(20),
  created_at TIMESTAMP
);

-- Less frequently accessed profile data
CREATE TABLE users_profile (
  user_id INT PRIMARY KEY REFERENCES users_core(id),
  bio TEXT,
  preferences JSONB,
  last_login TIMESTAMP
);
```

## Connection Management

### Connection Pooling
- **Pool Size**: CPU cores × 2 + effective spindle count
- **Connection Lifetime**: Rotate connections to prevent resource leaks
- **Timeout Settings**: Connection, idle, and query timeouts
- **Health Checks**: Regular connection validation
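
The pool-size rule of thumb above is easy to compute; a sketch, treating a single SSD as one "effective spindle". It is a tuning starting point, not a hard rule — benchmark against the real workload before settling on a value:

```python
import os

def suggested_pool_size(cpu_cores=None, effective_spindles=1):
    """Heuristic starting point: cores * 2 + effective spindle count."""
    if cpu_cores is None:
        # Fall back to the cores visible to this process
        cpu_cores = os.cpu_count() or 1
    return cpu_cores * 2 + effective_spindles

# 8 cores and a single SSD -> 17 connections
size = suggested_pool_size(cpu_cores=8, effective_spindles=1)
```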

### Read Replicas Strategy
```sql
-- Write queries to primary
INSERT INTO users (email, name) VALUES ('user@example.com', 'John Doe');

-- Read queries to replicas (with appropriate read preference)
SELECT * FROM users WHERE status = 'active'; -- Route to read replica

-- Consistent reads when required (LAST_INSERT_ID() is MySQL;
-- in PostgreSQL use INSERT ... RETURNING id instead)
SELECT * FROM users WHERE id = LAST_INSERT_ID(); -- Route to primary
```

## Caching Layers

### Cache-Aside Pattern
```python
def get_user(user_id):
    # Try cache first
    user = cache.get(f"user:{user_id}")
    if user is None:
        # Cache miss - query database
        user = db.query("SELECT * FROM users WHERE id = %s", user_id)
        # Store in cache
        cache.set(f"user:{user_id}", user, ttl=3600)
    return user
```

### Write-Through Cache
- **Consistency**: Always keep cache and database in sync
- **Write Latency**: Higher due to dual writes
- **Data Safety**: No data loss on cache failures
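
Mirroring the cache-aside function above, a write-through update writes the database first and the cache second in the same code path. A sketch — the `db.save` and `cache.set` interfaces are assumptions:

```python
class WriteThroughStore:
    """Minimal write-through wrapper over a database and a cache."""

    def __init__(self, db, cache, ttl=3600):
        self.db, self.cache, self.ttl = db, cache, ttl

    def save_user(self, user_id, user):
        # Database first: if this raises, the cache is never left stale
        self.db.save(user_id, user)
        self.cache.set(f"user:{user_id}", user, ttl=self.ttl)
        return user
```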

### Cache Invalidation Strategies
1. **TTL-Based**: Time-based expiration
2. **Event-Driven**: Invalidate on data changes
3. **Version-Based**: Use version numbers for consistency
4. **Tag-Based**: Group related cache entries
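
Strategy 3 (version-based) is often the simplest to implement safely: bump a version counter on every change and embed it in the key, so stale entries are never read again. A sketch — the in-memory dict stands in for a real cache such as Redis:

```python
class VersionedCache:
    """Version-based invalidation: stale keys are orphaned, not deleted."""

    def __init__(self):
        self.store = {}     # stands in for a real cache backend
        self.versions = {}  # entity -> current version number

    def _key(self, entity):
        return f"{entity}:v{self.versions.get(entity, 0)}"

    def get(self, entity):
        return self.store.get(self._key(entity))

    def set(self, entity, value):
        self.store[self._key(entity)] = value

    def invalidate(self, entity):
        # Bumping the version makes all previous keys unreachable;
        # the backend's TTL eventually evicts the orphaned entries
        self.versions[entity] = self.versions.get(entity, 0) + 1

cache = VersionedCache()
cache.set("user:1", {"name": "Ada"})
cache.invalidate("user:1")
# After invalidation the old entry is no longer visible
assert cache.get("user:1") is None
```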

## Database Selection Guide

### SQL Databases
**PostgreSQL**
- **Strengths**: ACID compliance, complex queries, JSON support, extensibility
- **Use Cases**: OLTP applications, data warehousing, geospatial data
- **Scale**: Vertical scaling with read replicas

**MySQL**
- **Strengths**: Performance, replication, wide ecosystem support
- **Use Cases**: Web applications, content management, e-commerce
- **Scale**: Horizontal scaling through sharding

### NoSQL Databases

**Document Stores (MongoDB, CouchDB)**
- **Strengths**: Flexible schema, horizontal scaling, developer productivity
- **Use Cases**: Content management, catalogs, user profiles
- **Trade-offs**: Eventual consistency, limited support for complex queries

**Key-Value Stores (Redis, DynamoDB)**
- **Strengths**: High performance, simple model, excellent caching
- **Use Cases**: Session storage, real-time analytics, gaming leaderboards
- **Trade-offs**: Limited query capabilities, data modeling constraints

**Column-Family (Cassandra, HBase)**
- **Strengths**: Write-heavy workloads, linear scalability, fault tolerance
- **Use Cases**: Time-series data, IoT applications, messaging systems
- **Trade-offs**: Limited query flexibility, consistency model complexity

**Graph Databases (Neo4j, Amazon Neptune)**
- **Strengths**: Relationship queries, pattern matching, recommendation engines
- **Use Cases**: Social networks, fraud detection, knowledge graphs
- **Trade-offs**: Specialized use cases, learning curve

### NewSQL Databases
**Distributed SQL (CockroachDB, TiDB, Spanner)**
- **Strengths**: SQL compatibility with horizontal scaling
- **Use Cases**: Global applications requiring ACID guarantees
- **Trade-offs**: Complexity, latency for distributed transactions

## Tools & Scripts

### Schema Analyzer
- **Input**: SQL DDL files, JSON schema definitions
- **Analysis**: Normalization compliance, constraint validation, naming conventions
- **Output**: Analysis report, Mermaid ERD, improvement recommendations

### Index Optimizer
- **Input**: Schema definition, query patterns
- **Analysis**: Missing indexes, redundancy detection, selectivity estimation
- **Output**: Index recommendations, CREATE INDEX statements, performance projections

### Migration Generator
- **Input**: Current and target schemas
- **Analysis**: Schema differences, dependency resolution, risk assessment
- **Output**: Migration scripts, rollback plans, validation queries

# Database Selection Decision Tree

## Overview

Choosing the right database technology is crucial for application success. This guide provides a systematic approach to database selection based on specific requirements, data patterns, and operational constraints.

## Decision Framework

### Primary Questions

1. **What is your primary use case?**
   - OLTP (Online Transaction Processing)
   - OLAP (Online Analytical Processing)
   - Real-time analytics
   - Content management
   - Search and discovery
   - Time-series data
   - Graph relationships

2. **What are your consistency requirements?**
   - Strong consistency (ACID)
   - Eventual consistency
   - Causal consistency
   - Session consistency

3. **What are your scalability needs?**
   - Vertical scaling sufficient
   - Horizontal scaling required
   - Global distribution needed
   - Multi-region requirements

4. **What is your data structure?**
   - Structured (relational)
   - Semi-structured (JSON/XML)
   - Unstructured (documents, media)
   - Graph relationships
   - Time-series data
   - Key-value pairs

## Decision Tree

```
START: What is your primary use case?
│
├── OLTP (Transactional Applications)
│   │
│   ├── Do you need strong ACID guarantees?
│   │   ├── YES → Do you need horizontal scaling?
│   │   │   ├── YES → Distributed SQL
│   │   │   │   ├── CockroachDB (Global, multi-region)
│   │   │   │   ├── TiDB (MySQL compatibility)
│   │   │   │   └── Spanner (Google Cloud)
│   │   │   └── NO → Traditional SQL
│   │   │       ├── PostgreSQL (Feature-rich, extensions)
│   │   │       ├── MySQL (Performance, ecosystem)
│   │   │       └── SQL Server (Microsoft stack)
│   │   └── NO → Are you primarily key-value access?
│   │       ├── YES → Key-Value Stores
│   │       │   ├── Redis (In-memory, caching)
│   │       │   ├── DynamoDB (AWS managed)
│   │       │   └── Cassandra (High availability)
│   │       └── NO → Document Stores
│   │           ├── MongoDB (General purpose)
│   │           ├── CouchDB (Sync, replication)
│   │           └── Amazon DocumentDB (MongoDB compatible)
│
├── OLAP (Analytics and Reporting)
│   │
│   ├── What is your data volume?
│   │   ├── Small to Medium (< 1TB) → Traditional SQL with optimization
│   │   │   ├── PostgreSQL with columnar extensions
│   │   │   ├── MySQL with analytics engine
│   │   │   └── SQL Server with columnstore
│   │   ├── Large (1TB - 100TB) → Data Warehouse Solutions
│   │   │   ├── Snowflake (Cloud-native)
│   │   │   ├── BigQuery (Google Cloud)
│   │   │   ├── Redshift (AWS)
│   │   │   └── Synapse (Azure)
│   │   └── Very Large (> 100TB) → Big Data Platforms
│   │       ├── Databricks (Unified analytics)
│   │       ├── Apache Spark on cloud
│   │       └── Hadoop ecosystem
│
├── Real-time Analytics
│   │
│   ├── Do you need sub-second query responses?
│   │   ├── YES → Stream Processing + OLAP
│   │   │   ├── ClickHouse (Fast analytics)
│   │   │   ├── Apache Druid (Real-time OLAP)
│   │   │   ├── Pinot (LinkedIn's real-time DB)
│   │   │   └── TimescaleDB (Time-series)
│   │   └── NO → Traditional OLAP solutions
│
├── Search and Discovery
│   │
│   ├── What type of search?
│   │   ├── Full-text search → Search Engines
│   │   │   ├── Elasticsearch (Full-featured)
│   │   │   ├── OpenSearch (AWS fork of ES)
│   │   │   └── Solr (Apache Lucene-based)
│   │   ├── Vector/similarity search → Vector Databases
│   │   │   ├── Pinecone (Managed vector DB)
│   │   │   ├── Weaviate (Open source)
│   │   │   ├── Chroma (Embeddings)
│   │   │   └── PostgreSQL with pgvector
│   │   └── Faceted search → Search + SQL combination
│
├── Graph Relationships
│   │
│   ├── Do you need complex graph traversals?
│   │   ├── YES → Graph Databases
│   │   │   ├── Neo4j (Property graph)
│   │   │   ├── Amazon Neptune (Multi-model)
│   │   │   ├── ArangoDB (Multi-model)
│   │   │   └── TigerGraph (Analytics focused)
│   │   └── NO → SQL with recursive queries
│   │       └── PostgreSQL with recursive CTEs
│
└── Time-series Data
    │
    ├── What is your write volume?
    │   ├── High (millions/sec) → Specialized Time-series
    │   │   ├── InfluxDB (Purpose-built)
    │   │   ├── TimescaleDB (PostgreSQL extension)
    │   │   ├── Apache Druid (Analytics focused)
    │   │   └── Prometheus (Monitoring)
    │   └── Medium → SQL with time-series optimization
    │       └── PostgreSQL with partitioning
```

## Database Categories Deep Dive

### Traditional SQL Databases

**PostgreSQL**
- **Best For**: Complex queries, JSON data, extensions, geospatial
- **Strengths**: Feature-rich, reliable, strong consistency, extensible
- **Use Cases**: OLTP, mixed workloads, JSON documents, geospatial applications
- **Scaling**: Vertical scaling, read replicas, partitioning
- **When to Choose**: Need SQL features, complex queries, moderate scale

**MySQL**
- **Best For**: Web applications, read-heavy workloads, simple schemas
- **Strengths**: Performance, replication, large ecosystem
- **Use Cases**: Web apps, content management, e-commerce
- **Scaling**: Read replicas, sharding, clustering (MySQL Cluster)
- **When to Choose**: Simple schema, performance priority, large community

**SQL Server**
- **Best For**: Microsoft ecosystem, enterprise features, business intelligence
- **Strengths**: Integration, tooling, enterprise features
- **Use Cases**: Enterprise applications, .NET applications, BI
- **Scaling**: Always On availability groups, partitioning
- **When to Choose**: Microsoft stack, enterprise requirements

### Distributed SQL (NewSQL)

**CockroachDB**
- **Best For**: Global applications, strong consistency, horizontal scaling
- **Strengths**: ACID guarantees, automatic scaling, fault survivability
- **Use Cases**: Multi-region apps, financial services, global SaaS
- **Trade-offs**: Complex setup, higher latency for global transactions
- **When to Choose**: Need SQL + global scale + consistency

**TiDB**
- **Best For**: MySQL compatibility with horizontal scaling
- **Strengths**: MySQL protocol, HTAP (hybrid transactional/analytical), cloud-native
- **Use Cases**: MySQL migrations, hybrid workloads
- **When to Choose**: Existing MySQL expertise, need scale

### NoSQL Document Stores

**MongoDB**
- **Best For**: Flexible schema, rapid development, document-centric data
- **Strengths**: Developer experience, flexible schema, rich queries
- **Use Cases**: Content management, catalogs, user profiles, IoT
- **Scaling**: Automatic sharding, replica sets
- **When to Choose**: Schema evolution, document structure, rapid development

**CouchDB**
- **Best For**: Offline-first applications, multi-master replication
- **Strengths**: HTTP API, replication, conflict resolution
- **Use Cases**: Mobile apps, distributed systems, offline scenarios
- **When to Choose**: Need offline capabilities, bi-directional sync

### Key-Value Stores

**Redis**
- **Best For**: Caching, sessions, real-time applications, pub/sub
- **Strengths**: Performance, data structures, persistence options
- **Use Cases**: Caching, leaderboards, real-time analytics, queues
- **Scaling**: Clustering, Sentinel for HA
- **When to Choose**: High performance, simple data model, caching

**DynamoDB**
- **Best For**: Serverless applications, predictable performance, AWS ecosystem
- **Strengths**: Managed, auto-scaling, consistent performance
- **Use Cases**: Web applications, gaming, IoT, mobile backends
- **Trade-offs**: Vendor lock-in, limited querying
- **When to Choose**: AWS ecosystem, serverless, managed solution

### Column-Family Stores

**Cassandra**
- **Best For**: Write-heavy workloads, high availability, linear scalability
- **Strengths**: No single point of failure, tunable consistency
- **Use Cases**: Time-series, IoT, messaging, activity feeds
- **Trade-offs**: Complex operations, eventual consistency
- **When to Choose**: High write volume, availability over consistency

**HBase**
- **Best For**: Big data applications, Hadoop ecosystem
- **Strengths**: Hadoop integration, consistent reads
- **Use Cases**: Analytics on big data, time-series at scale
- **When to Choose**: Hadoop ecosystem, very large datasets

### Graph Databases

**Neo4j**
- **Best For**: Complex relationships, graph algorithms, traversals
- **Strengths**: Mature ecosystem, Cypher query language, algorithms
- **Use Cases**: Social networks, recommendation engines, fraud detection
- **Trade-offs**: Specialized use case, learning curve
- **When to Choose**: Relationship-heavy data, graph algorithms

### Time-Series Databases

**InfluxDB**
- **Best For**: Time-series data, IoT, monitoring, analytics
- **Strengths**: Purpose-built, efficient storage, query language
- **Use Cases**: IoT sensors, monitoring, DevOps metrics
- **When to Choose**: High-volume time-series data

**TimescaleDB**
- **Best For**: Time-series with SQL familiarity
- **Strengths**: PostgreSQL compatibility, SQL queries, ecosystem
- **Use Cases**: Financial data, IoT with complex queries
- **When to Choose**: Time-series + SQL requirements

### Search Engines

**Elasticsearch**
- **Best For**: Full-text search, log analysis, real-time search
- **Strengths**: Powerful search, analytics, ecosystem (ELK stack)
- **Use Cases**: Search applications, log analysis, monitoring
- **Trade-offs**: Complex operations, resource intensive
- **When to Choose**: Advanced search requirements, analytics

### Data Warehouses

**Snowflake**
- **Best For**: Cloud-native analytics, data sharing, varied workloads
- **Strengths**: Separation of compute/storage, automatic scaling
- **Use Cases**: Data warehousing, analytics, data science
- **When to Choose**: Cloud-native, analytics-focused, multi-cloud

**BigQuery**
- **Best For**: Serverless analytics, Google ecosystem, machine learning
- **Strengths**: Serverless, petabyte scale, ML integration
- **Use Cases**: Analytics, data science, reporting
- **When to Choose**: Google Cloud, serverless analytics

## Selection Criteria Matrix

| Criterion | SQL | NewSQL | Document | Key-Value | Column-Family | Graph | Time-Series |
|-----------|-----|--------|----------|-----------|---------------|-------|-------------|
| ACID Guarantees | ✅ Strong | ✅ Strong | ⚠️ Eventual | ⚠️ Eventual | ⚠️ Tunable | ⚠️ Varies | ⚠️ Varies |
| Horizontal Scaling | ❌ Limited | ✅ Native | ✅ Native | ✅ Native | ✅ Native | ⚠️ Limited | ✅ Native |
| Query Flexibility | ✅ High | ✅ High | ⚠️ Moderate | ❌ Low | ❌ Low | ✅ High | ⚠️ Specialized |
| Schema Flexibility | ❌ Rigid | ❌ Rigid | ✅ High | ✅ High | ⚠️ Moderate | ✅ High | ⚠️ Structured |
| Performance (Reads) | ⚠️ Good | ⚠️ Good | ✅ Excellent | ✅ Excellent | ✅ Excellent | ⚠️ Good | ✅ Excellent |
| Performance (Writes) | ⚠️ Good | ⚠️ Good | ✅ Excellent | ✅ Excellent | ✅ Excellent | ⚠️ Good | ✅ Excellent |
| Operational Complexity | ✅ Low | ❌ High | ⚠️ Moderate | ✅ Low | ❌ High | ⚠️ Moderate | ⚠️ Moderate |
| Ecosystem Maturity | ✅ Mature | ⚠️ Growing | ✅ Mature | ✅ Mature | ✅ Mature | ✅ Mature | ⚠️ Growing |

## Decision Checklist

### Requirements Analysis
- [ ] **Data Volume**: Current and projected data size
- [ ] **Transaction Volume**: Reads per second, writes per second
- [ ] **Consistency Requirements**: Strong vs eventual consistency needs
- [ ] **Query Patterns**: Simple lookups vs complex analytics
- [ ] **Schema Evolution**: How often does the schema change?
- [ ] **Geographic Distribution**: Single region vs global
- [ ] **Availability Requirements**: Acceptable downtime
- [ ] **Team Expertise**: Existing knowledge and learning curve
- [ ] **Budget Constraints**: Licensing, infrastructure, operational costs
- [ ] **Compliance Requirements**: Data residency, audit trails

### Technical Evaluation
- [ ] **Performance Testing**: Benchmark with realistic data and queries
- [ ] **Scalability Testing**: Test scaling limits and patterns
- [ ] **Failure Scenarios**: Test backup, recovery, and failure handling
- [ ] **Integration Testing**: APIs, connectors, ecosystem tools
- [ ] **Migration Path**: How to migrate from the current system
- [ ] **Monitoring and Observability**: Available tooling and metrics

### Operational Considerations
- [ ] **Management Complexity**: Setup, configuration, maintenance
- [ ] **Backup and Recovery**: Built-in vs external tools
- [ ] **Security Features**: Authentication, authorization, encryption
- [ ] **Upgrade Path**: Version compatibility and upgrade process
- [ ] **Support Options**: Community vs commercial support
- [ ] **Lock-in Risk**: Portability and vendor independence

## Common Decision Patterns

### E-commerce Platform
**Typical Choice**: PostgreSQL or MySQL
- **Primary Data**: Product catalog, orders, users (structured)
- **Query Patterns**: OLTP with some analytics
- **Consistency**: Strong consistency for financial data
- **Scale**: Moderate with read replicas
- **Additional**: Redis for caching, Elasticsearch for product search

### IoT/Sensor Data Platform
**Typical Choice**: TimescaleDB or InfluxDB
- **Primary Data**: Time-series sensor readings
- **Query Patterns**: Time-based aggregations, trend analysis
- **Scale**: High write volume, moderate read volume
- **Additional**: Kafka for ingestion, PostgreSQL for metadata

### Social Media Application
**Typical Choice**: Combination approach
- **User Profiles**: MongoDB (flexible schema)
- **Relationships**: Neo4j (graph relationships)
- **Activity Feeds**: Cassandra (high write volume)
- **Search**: Elasticsearch (content discovery)
- **Caching**: Redis (sessions, real-time data)

### Analytics Platform
**Typical Choice**: Snowflake or BigQuery
- **Primary Use**: Complex analytical queries
- **Data Volume**: Large (TB to PB scale)
- **Query Patterns**: Ad-hoc analytics, reporting
- **Users**: Data analysts, data scientists
- **Additional**: Data lake (S3/GCS) for raw data storage

### Global SaaS Application
**Typical Choice**: CockroachDB or DynamoDB
- **Requirements**: Multi-region, strong consistency
- **Scale**: Global user base
- **Compliance**: Data residency requirements
- **Availability**: High availability across regions


## Migration Strategies

### From Monolithic to Distributed

1. **Assessment**: Identify scaling bottlenecks
2. **Data Partitioning**: Plan how to split data
3. **Gradual Migration**: Move non-critical data first
4. **Dual Writes**: Run both systems temporarily
5. **Validation**: Verify data consistency
6. **Cutover**: Switch reads and writes gradually
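
The dual-write and validation phases (steps 4 and 5) can be sketched in application code. This is a minimal illustration, not a production pattern; `old_db` and `new_db` are hypothetical stand-ins for real database clients:

```python
# Hypothetical dual-write wrapper for a gradual database migration.
class DualWriter:
    """Write to both systems; read from the old one until cutover."""

    def __init__(self, old_db, new_db):
        self.old_db = old_db
        self.new_db = new_db
        self.failed_keys = []  # queued for backfill

    def write(self, key, value):
        # The old system stays authoritative; a failure there aborts the write.
        self.old_db[key] = value
        try:
            self.new_db[key] = value
        except Exception:
            # New-system failures are recorded for later backfill, not surfaced.
            self.failed_keys.append(key)

    def validate(self):
        # Step 5: verify data consistency between the two systems.
        return [k for k in self.old_db if self.old_db.get(k) != self.new_db.get(k)]

old, new = {}, {}
writer = DualWriter(old, new)
writer.write("user:1", {"name": "Ada"})
writer.write("user:2", {"name": "Grace"})
print(writer.validate())  # [] -- both stores agree
```

A real implementation would also handle deletes, retries, and ordering; the key idea is that the old system remains the source of truth until validation passes.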

### Technology Stack Evolution

1. **Start Simple**: Begin with PostgreSQL or MySQL
2. **Identify Bottlenecks**: Monitor performance and scaling issues
3. **Selective Scaling**: Move specific workloads to specialized databases
4. **Polyglot Persistence**: Use multiple databases for different use cases
5. **Service Boundaries**: Align database choice with service boundaries

## Conclusion

Database selection should be driven by:

1. **Specific Use Case Requirements**: Not all applications need the same database
2. **Data Characteristics**: Structure, volume, and access patterns matter
3. **Non-functional Requirements**: Consistency, availability, performance targets
4. **Team and Organizational Factors**: Expertise, operational capacity, budget
5. **Evolution Path**: How requirements and scale will change over time

The best database choice is often not a single technology, but a combination of databases that each excel at their specific use case within your application architecture.

# Index Strategy Patterns

## Overview

Database indexes are critical for query performance, but they come with trade-offs. This guide covers proven patterns for index design, optimization strategies, and common pitfalls to avoid.

## Index Types and Use Cases

### B-Tree Indexes (Default)

**Best For:**
- Equality queries (`WHERE column = value`)
- Range queries (`WHERE column BETWEEN x AND y`)
- Sorting (`ORDER BY column`)
- Prefix pattern matching without a leading wildcard (`WHERE column LIKE 'prefix%'`)

**Characteristics:**
- Logarithmic lookup time O(log n)
- Supports leftmost-prefix matches on composite indexes
- Most versatile index type

**Example:**
```sql
-- Single column B-tree index
CREATE INDEX idx_customers_email ON customers (email);

-- Composite B-tree index
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
```
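
The two access paths a B-tree provides can be roughly modeled with a sorted array and binary search: locate a key (or both range bounds) in O(log n), then read entries in order. A sketch, not how a real B-tree is implemented:

```python
import bisect

# Sorted keys stand in for a B-tree's ordered leaf entries.
emails = sorted(["ada@example.com", "bob@example.com",
                 "eve@example.com", "mal@example.com"])

def point_lookup(keys, key):
    # Equality query: binary search to the candidate position.
    i = bisect.bisect_left(keys, key)
    return i < len(keys) and keys[i] == key

def range_scan(keys, lo, hi):
    # BETWEEN lo AND hi: locate both bounds, then read the ordered slice.
    return keys[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)]

print(point_lookup(emails, "bob@example.com"))  # True
print(range_scan(emails, "b", "f"))  # ['bob@example.com', 'eve@example.com']
```

The same ordering is what lets a B-tree serve `ORDER BY` and prefix `LIKE` patterns without a sort step.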

### Hash Indexes

**Best For:**
- Exact equality matches only
- High-cardinality columns
- Primary key lookups

**Characteristics:**
- Constant lookup time O(1) for exact matches
- Cannot support range queries or sorting
- Memory-efficient for equality operations

**Example:**
```sql
-- Hash index for exact lookups (PostgreSQL)
CREATE INDEX idx_users_id_hash ON users USING HASH (user_id);
```
### Partial Indexes

**Best For:**
- Filtering on a subset of data
- Reducing index size and maintenance overhead
- Query patterns that consistently use specific filters

**Example:**
```sql
-- Index only active users
CREATE INDEX idx_active_users_email
ON users (email)
WHERE status = 'active';

-- Index recent orders only
-- (partial-index predicates must be immutable, so CURRENT_DATE is not
-- allowed here in PostgreSQL; use a fixed cutoff and recreate the
-- index periodically as the window moves)
CREATE INDEX idx_recent_orders
ON orders (customer_id, created_at)
WHERE created_at > DATE '2024-01-01';

-- Index non-null values only
CREATE INDEX idx_customers_phone
ON customers (phone_number)
WHERE phone_number IS NOT NULL;
```

### Covering Indexes

**Best For:**
- Eliminating table lookups for SELECT queries
- Frequently accessed column combinations
- Read-heavy workloads

**Example:**
```sql
-- Covering index with INCLUDE clause (SQL Server / PostgreSQL 11+)
CREATE INDEX idx_orders_customer_covering
ON orders (customer_id, order_date)
INCLUDE (order_total, status);

-- Query can be satisfied entirely from the index:
-- SELECT order_total, status FROM orders
-- WHERE customer_id = 123 AND order_date > '2024-01-01';
```

### Functional/Expression Indexes

**Best For:**
- Queries on transformed column values
- Case-insensitive searches
- Complex calculations

**Example:**
```sql
-- Case-insensitive email searches
CREATE INDEX idx_users_email_lower
ON users (LOWER(email));

-- Date part extraction
CREATE INDEX idx_orders_month
ON orders ((EXTRACT(MONTH FROM order_date)));

-- JSON field indexing
CREATE INDEX idx_users_preferences_theme
ON users ((preferences->>'theme'));
```

## Composite Index Design Patterns

### Column Ordering Strategy

**Rule: Most Selective Equality Column First, Range Column Last**
```sql
-- Query: WHERE status = 'active' AND city = 'New York' AND age > 25
-- Assume: status has 3 values, city has 100 values, age has 80 values

-- GOOD: most selective equality column first; the range column (age) last,
-- since columns after a range condition cannot narrow the index scan
CREATE INDEX idx_users_city_status_age ON users (city, status, age);

-- BAD: least selective column first
CREATE INDEX idx_users_status_city_age ON users (status, city, age);
```

**Selectivity Calculation:**
```sql
-- Estimate selectivity for each column
SELECT
    'status' as column_name,
    COUNT(DISTINCT status)::float / COUNT(*) as selectivity
FROM users
UNION ALL
SELECT
    'city' as column_name,
    COUNT(DISTINCT city)::float / COUNT(*) as selectivity
FROM users
UNION ALL
SELECT
    'age' as column_name,
    COUNT(DISTINCT age)::float / COUNT(*) as selectivity
FROM users;
```
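
The same distinct-ratio estimate can be computed in application code. A sketch over illustrative in-memory rows (not data from this document):

```python
# Selectivity ~= distinct values / total rows; a higher ratio means the
# column narrows a search more. Sample rows stand in for the users table.
rows = [
    {"status": "active",   "city": "NYC",     "age": 31},
    {"status": "active",   "city": "Chicago", "age": 45},
    {"status": "inactive", "city": "NYC",     "age": 31},
    {"status": "active",   "city": "Boston",  "age": 28},
]

def selectivity(rows, column):
    values = [r[column] for r in rows]
    return len(set(values)) / len(values)

# Rank columns most selective first; here status (2 of 4 distinct) ranks last.
ranked = sorted(("status", "city", "age"),
                key=lambda c: selectivity(rows, c), reverse=True)
print(ranked)
```

On real tables, prefer the SQL above (or the optimizer's own statistics) over sampling rows into the application.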

### Query Pattern Matching

**Pattern 1: Equality + Range**
```sql
-- Query: WHERE customer_id = 123 AND order_date BETWEEN '2024-01-01' AND '2024-03-31'
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
```

**Pattern 2: Multiple Equality Conditions**
```sql
-- Query: WHERE status = 'active' AND category = 'premium' AND region = 'US'
CREATE INDEX idx_users_status_category_region ON users (status, category, region);
```

**Pattern 3: Equality + Sorting**
```sql
-- Query: WHERE category = 'electronics' ORDER BY price DESC, created_at DESC
CREATE INDEX idx_products_category_price_date ON products (category, price DESC, created_at DESC);
```

### Prefix Optimization

**Efficient Prefix Usage:**
```sql
-- Index supports all these queries efficiently:
CREATE INDEX idx_users_lastname_firstname_email ON users (last_name, first_name, email);

-- ✓ Uses index: WHERE last_name = 'Smith'
-- ✓ Uses index: WHERE last_name = 'Smith' AND first_name = 'John'
-- ✓ Uses index: WHERE last_name = 'Smith' AND first_name = 'John' AND email = 'john@...'
-- ✗ Cannot use index: WHERE first_name = 'John'
-- ✗ Cannot use index: WHERE email = 'john@...'
```
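
The leftmost-prefix rule above can be modeled in a few lines — a simplification (it ignores range conditions and index-only scans), but it reproduces the ✓/✗ table:

```python
# A composite index can drive a search only if the filtered columns
# cover a leading run of the index's column list.
def can_use_index(index_columns, query_columns):
    prefix_len = 0
    for col in index_columns:
        if col in query_columns:
            prefix_len += 1
        else:
            break  # a gap ends the usable prefix
    return prefix_len > 0

idx = ["last_name", "first_name", "email"]
print(can_use_index(idx, {"last_name"}))                # True
print(can_use_index(idx, {"last_name", "first_name"}))  # True
print(can_use_index(idx, {"first_name"}))               # False
print(can_use_index(idx, {"email"}))                    # False
```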

## Performance Optimization Patterns

### Index Intersection vs Composite Indexes

**Scenario: Multiple single-column indexes**
```sql
CREATE INDEX idx_users_age ON users (age);
CREATE INDEX idx_users_city ON users (city);
CREATE INDEX idx_users_status ON users (status);

-- Query: WHERE age > 25 AND city = 'NYC' AND status = 'active'
-- Database may use index intersection (combining multiple indexes)
-- Performance varies by database engine and data distribution
```

**Better: Purpose-built composite index**
```sql
-- More efficient for the specific query pattern
CREATE INDEX idx_users_city_status_age ON users (city, status, age);
```

### Index Size vs Performance Trade-off

**Wide Indexes (Many Columns):**
```sql
-- Pros: Covers many query patterns, excellent for covering queries
-- Cons: Large index size, slower writes, more memory usage
CREATE INDEX idx_orders_comprehensive
ON orders (customer_id, order_date, status, total_amount, shipping_method, created_at)
INCLUDE (order_notes, billing_address);
```

**Narrow Indexes (Few Columns):**
```sql
-- Pros: Smaller size, faster writes, less memory
-- Cons: May not cover all query patterns
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
CREATE INDEX idx_orders_status ON orders (status);
```

### Maintenance Optimization

**Regular Index Analysis:**
```sql
-- PostgreSQL: Check index usage statistics
SELECT
    schemaname,
    relname AS tablename,
    indexrelname AS indexname,
    idx_scan AS index_scans,
    idx_tup_read AS tuples_read,
    idx_tup_fetch AS tuples_fetched
FROM pg_stat_user_indexes
WHERE idx_scan = 0 -- Potentially unused indexes
ORDER BY schemaname, relname;

-- Check index size
SELECT
    indexname,
    pg_size_pretty(pg_relation_size(indexname::regclass)) as index_size
FROM pg_indexes
WHERE schemaname = 'public'
ORDER BY pg_relation_size(indexname::regclass) DESC;
```

## Common Anti-Patterns

### 1. Over-Indexing

**Problem:**
```sql
-- Too many similar indexes
CREATE INDEX idx_orders_customer ON orders (customer_id);
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
CREATE INDEX idx_orders_customer_status ON orders (customer_id, status);
CREATE INDEX idx_orders_customer_date_status ON orders (customer_id, order_date, status);
```

**Solution:**
```sql
-- One well-designed composite index can often replace several
CREATE INDEX idx_orders_customer_date_status ON orders (customer_id, order_date, status);
-- Drop redundant prefix indexes: idx_orders_customer, idx_orders_customer_date
-- (idx_orders_customer_status is not a strict prefix -- verify its query
-- patterns are still fast enough before dropping it too)
```
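
Finding strict-prefix redundancy can be automated. This sketch flags any index whose column list is a leading prefix of another index on the same table; note it deliberately does not flag `(customer_id, status)`, which is not a strict prefix — that call requires workload knowledge:

```python
# Flag indexes that are leftmost prefixes of a longer index --
# candidates for removal after checking actual usage statistics.
def redundant_indexes(indexes):
    names = []
    for name, cols in indexes.items():
        for other, other_cols in indexes.items():
            if other != name and other_cols[:len(cols)] == cols:
                names.append(name)
                break
    return sorted(names)

indexes = {
    "idx_orders_customer": ["customer_id"],
    "idx_orders_customer_date": ["customer_id", "order_date"],
    "idx_orders_customer_status": ["customer_id", "status"],
    "idx_orders_customer_date_status": ["customer_id", "order_date", "status"],
}
print(redundant_indexes(indexes))
# ['idx_orders_customer', 'idx_orders_customer_date']
```

In practice, feed this from the catalog (e.g. `pg_indexes`) and cross-check against `pg_stat_user_indexes` before dropping anything.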

### 2. Wrong Column Order

**Problem:**
```sql
-- Query: WHERE active = true AND user_type = 'premium' AND city = 'Chicago'
-- Bad order: boolean first (lowest selectivity)
CREATE INDEX idx_users_active_type_city ON users (active, user_type, city);
```

**Solution:**
```sql
-- Good order: most selective first
CREATE INDEX idx_users_city_type_active ON users (city, user_type, active);
```

### 3. Ignoring Query Patterns

**Problem:**
```sql
-- Index doesn't match common query patterns
CREATE INDEX idx_products_name ON products (product_name);

-- But queries are: WHERE category = 'electronics' AND price BETWEEN 100 AND 500
-- Index is not helpful for these queries
```

**Solution:**
```sql
-- Match actual query patterns
CREATE INDEX idx_products_category_price ON products (category, price);
```

### 4. Function in WHERE Without Functional Index

**Problem:**
```sql
-- Query uses function but no functional index
SELECT * FROM users WHERE LOWER(email) = 'john@example.com';
-- Regular index on email won't help
```

**Solution:**
```sql
-- Create functional index
CREATE INDEX idx_users_email_lower ON users (LOWER(email));
```

## Advanced Patterns

### Multi-Column Statistics

**When Columns Are Correlated:**
```sql
-- If city and state are highly correlated, create extended statistics
CREATE STATISTICS stats_address_correlation ON city, state FROM addresses;
ANALYZE addresses;

-- Helps query planner make better decisions for:
-- WHERE city = 'New York' AND state = 'NY'
```

### Conditional Indexes for Data Lifecycle

**Pattern: Different indexes for different data ages**
```sql
-- Note: partial-index predicates must be immutable, so CURRENT_DATE
-- cannot appear in them (PostgreSQL). Use fixed cutoff dates
-- (illustrative values below) and recreate the indexes on a schedule
-- as the windows move.

-- Hot data (recent orders) - optimized for OLTP
CREATE INDEX idx_orders_hot_customer_date
ON orders (customer_id, order_date DESC)
WHERE order_date > DATE '2024-05-01';

-- Warm data (older orders) - optimized for analytics
CREATE INDEX idx_orders_warm_date_total
ON orders (order_date, total_amount)
WHERE order_date <= DATE '2024-05-01'
  AND order_date > DATE '2023-06-01';

-- Cold data (archived orders) - minimal indexing
CREATE INDEX idx_orders_cold_date
ON orders (order_date)
WHERE order_date <= DATE '2023-06-01';
```

### Index-Only Scan Optimization

**Design indexes to avoid table access:**
```sql
-- Query: SELECT order_id, total_amount, status FROM orders WHERE customer_id = ?
CREATE INDEX idx_orders_customer_covering
ON orders (customer_id)
INCLUDE (order_id, total_amount, status);

-- Or as a composite index (if the database doesn't support INCLUDE)
CREATE INDEX idx_orders_customer_covering
ON orders (customer_id, order_id, total_amount, status);
```

## Index Monitoring and Maintenance

### Performance Monitoring Queries

**Find slow queries that might benefit from indexes:**
```sql
-- PostgreSQL: Find slow queries (requires the pg_stat_statements
-- extension; before PostgreSQL 13 the columns were named
-- total_time and mean_time)
SELECT
    query,
    calls,
    total_exec_time,
    mean_exec_time,
    rows
FROM pg_stat_statements
WHERE mean_exec_time > 1000 -- Queries averaging > 1 second
ORDER BY mean_exec_time DESC;
```

**Identify missing indexes:**
```sql
-- Look for sequential scans on large tables
SELECT
    schemaname,
    relname AS tablename,
    seq_scan,
    seq_tup_read,
    idx_scan,
    n_tup_ins + n_tup_upd + n_tup_del as write_activity
FROM pg_stat_user_tables
WHERE seq_scan > 100
  AND seq_tup_read > 100000 -- Large sequential scans
  AND (idx_scan = 0 OR seq_scan > idx_scan * 2)
ORDER BY seq_tup_read DESC;
```

### Index Maintenance Schedule

**Regular Maintenance Tasks:**
```sql
-- Rebuild fragmented indexes (SQL Server)
ALTER INDEX ALL ON orders REBUILD;

-- Update statistics (PostgreSQL)
ANALYZE orders;

-- Check for unused indexes monthly
SELECT * FROM pg_stat_user_indexes WHERE idx_scan = 0;
```

## Conclusion

Effective index strategy requires:

1. **Understanding Query Patterns**: Analyze actual application queries, not theoretical scenarios
2. **Measuring Performance**: Use query execution plans and timing to validate index effectiveness
3. **Balancing Trade-offs**: More indexes improve reads but slow writes and increase storage
4. **Regular Maintenance**: Monitor index usage and performance, remove unused indexes
5. **Iterative Improvement**: Start with essential indexes, add and optimize based on real usage

The goal is not to index every possible query pattern, but to create a focused set of indexes that provide maximum benefit for your application's specific workload while minimizing maintenance overhead.
# Database Normalization Guide

## Overview

Database normalization is the process of organizing data to minimize redundancy and dependency issues. It involves decomposing tables to eliminate data anomalies and improve data integrity.

## Normal Forms

### First Normal Form (1NF)

**Requirements:**
- Each column contains atomic (indivisible) values
- Each column contains values of the same type
- Each column has a unique name
- The order of data storage doesn't matter

**Violations and Solutions:**

**Problem: Multiple values in single column**
```sql
-- BAD: Multiple phone numbers in one column
CREATE TABLE customers (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    phones VARCHAR(500) -- "555-1234, 555-5678, 555-9012"
);

-- GOOD: Separate table for multiple phones
CREATE TABLE customers (
    id INT PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE customer_phones (
    id INT PRIMARY KEY,
    customer_id INT REFERENCES customers(id),
    phone VARCHAR(20),
    phone_type VARCHAR(10) -- 'mobile', 'home', 'work'
);
```

**Problem: Repeating groups**
```sql
-- BAD: Repeating column patterns
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    item1_name VARCHAR(100),
    item1_qty INT,
    item1_price DECIMAL(8,2),
    item2_name VARCHAR(100),
    item2_qty INT,
    item2_price DECIMAL(8,2),
    item3_name VARCHAR(100),
    item3_qty INT,
    item3_price DECIMAL(8,2)
);

-- GOOD: Separate table for order items
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    order_date DATE
);

CREATE TABLE order_items (
    id INT PRIMARY KEY,
    order_id INT REFERENCES orders(order_id),
    item_name VARCHAR(100),
    quantity INT,
    unit_price DECIMAL(8,2)
);
```

### Second Normal Form (2NF)

**Requirements:**
- Must be in 1NF
- All non-key attributes must be fully functionally dependent on the primary key
- No partial dependencies (applies only to tables with composite primary keys)

**Violations and Solutions:**

**Problem: Partial dependency on composite key**
```sql
-- BAD: Student course enrollment with partial dependencies
CREATE TABLE student_courses (
    student_id INT,
    course_id INT,
    student_name VARCHAR(100), -- Depends only on student_id
    student_major VARCHAR(50), -- Depends only on student_id
    course_title VARCHAR(200), -- Depends only on course_id
    course_credits INT,        -- Depends only on course_id
    grade CHAR(2),             -- Depends on both student_id AND course_id
    PRIMARY KEY (student_id, course_id)
);

-- GOOD: Separate tables eliminate partial dependencies
CREATE TABLE students (
    student_id INT PRIMARY KEY,
    student_name VARCHAR(100),
    student_major VARCHAR(50)
);

CREATE TABLE courses (
    course_id INT PRIMARY KEY,
    course_title VARCHAR(200),
    course_credits INT
);

CREATE TABLE enrollments (
    student_id INT,
    course_id INT,
    grade CHAR(2),
    enrollment_date DATE,
    PRIMARY KEY (student_id, course_id),
    FOREIGN KEY (student_id) REFERENCES students(student_id),
    FOREIGN KEY (course_id) REFERENCES courses(course_id)
);
```

### Third Normal Form (3NF)

**Requirements:**
- Must be in 2NF
- No transitive dependencies (non-key attributes should not depend on other non-key attributes)
- All non-key attributes must depend directly on the primary key

**Violations and Solutions:**

**Problem: Transitive dependency**
```sql
-- BAD: Employee table with transitive dependency
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    employee_name VARCHAR(100),
    department_id INT,
    department_name VARCHAR(100),     -- Depends on department_id, not employee_id
    department_location VARCHAR(100), -- Transitive dependency through department_id
    department_budget DECIMAL(10,2),  -- Transitive dependency through department_id
    salary DECIMAL(8,2)
);

-- GOOD: Separate department information
CREATE TABLE departments (
    department_id INT PRIMARY KEY,
    department_name VARCHAR(100),
    department_location VARCHAR(100),
    department_budget DECIMAL(10,2)
);

CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    employee_name VARCHAR(100),
    department_id INT,
    salary DECIMAL(8,2),
    FOREIGN KEY (department_id) REFERENCES departments(department_id)
);
```

### Boyce-Codd Normal Form (BCNF)

**Requirements:**
- Must be in 3NF
- Every determinant must be a candidate key
- Stricter than 3NF - handles cases where 3NF doesn't eliminate all anomalies

**Violations and Solutions:**

**Problem: Determinant that's not a candidate key**
```sql
-- BAD: Student advisor relationship with BCNF violation
-- Assumption: Each student has one advisor per subject,
-- each advisor teaches only one subject, but can advise multiple students
CREATE TABLE student_advisor (
    student_id INT,
    subject VARCHAR(50),
    advisor_id INT,
    PRIMARY KEY (student_id, subject)
);
-- Problem: advisor_id determines subject, but advisor_id is not a candidate key

-- GOOD: Separate the functional dependencies
CREATE TABLE advisors (
    advisor_id INT PRIMARY KEY,
    subject VARCHAR(50)
);

CREATE TABLE student_advisor_assignments (
    student_id INT,
    advisor_id INT,
    PRIMARY KEY (student_id, advisor_id),
    FOREIGN KEY (advisor_id) REFERENCES advisors(advisor_id)
);
```

## Denormalization Strategies

### When to Denormalize

1. **Performance Requirements**: When query performance is more critical than storage efficiency
2. **Read-Heavy Workloads**: When data is read much more frequently than it's updated
3. **Reporting Systems**: When complex joins negatively impact reporting performance
4. **Caching Strategies**: When pre-computed values eliminate expensive calculations

### Common Denormalization Patterns

**1. Redundant Storage for Performance**
```sql
-- Store frequently accessed calculated values
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    order_total DECIMAL(10,2), -- Denormalized: sum of order_items.total
    item_count INT,            -- Denormalized: count of order_items
    created_at TIMESTAMP
);

CREATE TABLE order_items (
    item_id INT PRIMARY KEY,
    order_id INT,
    product_id INT,
    quantity INT,
    unit_price DECIMAL(8,2),
    total DECIMAL(10,2) -- quantity * unit_price (denormalized)
);
```

**2. Materialized Aggregates**
```sql
-- Pre-computed summary tables for reporting
CREATE TABLE monthly_sales_summary (
    year_month VARCHAR(7), -- '2024-03'
    product_category VARCHAR(50),
    total_sales DECIMAL(12,2),
    total_units INT,
    avg_order_value DECIMAL(8,2),
    unique_customers INT,
    updated_at TIMESTAMP
);
```

**3. Historical Data Snapshots**
```sql
-- Store historical state to avoid complex temporal queries
CREATE TABLE customer_status_history (
    id INT PRIMARY KEY,
    customer_id INT,
    status VARCHAR(20),
    tier VARCHAR(10),
    total_lifetime_value DECIMAL(12,2), -- Snapshot at this point in time
    snapshot_date DATE
);
```

## Trade-offs Analysis

### Normalization Benefits
- **Data Integrity**: Reduced risk of inconsistent data
- **Storage Efficiency**: Less data duplication
- **Update Efficiency**: Changes need to be made in only one place
- **Flexibility**: Easier to modify schema as requirements change

### Normalization Costs
- **Query Complexity**: More joins required for data retrieval
- **Performance Impact**: Joins can be expensive on large datasets
- **Development Complexity**: More complex data access patterns

### Denormalization Benefits
- **Query Performance**: Fewer joins, faster queries
- **Simplified Queries**: Direct access to related data
- **Read Optimization**: Optimized for data retrieval patterns
- **Reduced Load**: Less database processing for common operations

### Denormalization Costs
- **Data Redundancy**: Increased storage requirements
- **Update Complexity**: Multiple places may need updates
- **Consistency Risk**: Higher risk of data inconsistencies
- **Maintenance Overhead**: Additional code to maintain derived values

## Best Practices

### 1. Start with Full Normalization
- Begin with a fully normalized design
- Identify performance bottlenecks through testing
- Selectively denormalize based on actual performance needs
### 2. Use Triggers for Consistency
```sql
-- Trigger to maintain the denormalized order_total
-- (illustrative syntax; details vary by database -- PostgreSQL, for
-- example, requires a separate trigger function, and a DELETE trigger
-- must reference OLD.order_id rather than NEW.order_id)
CREATE TRIGGER update_order_total
AFTER INSERT OR UPDATE OR DELETE ON order_items
FOR EACH ROW
BEGIN
    UPDATE orders
    SET order_total = (
        SELECT COALESCE(SUM(quantity * unit_price), 0)
        FROM order_items
        WHERE order_id = NEW.order_id
    )
    WHERE order_id = NEW.order_id;
END;
```

### 3. Consider Materialized Views
```sql
-- Materialized view for complex aggregations
CREATE MATERIALIZED VIEW customer_summary AS
SELECT
    c.customer_id,
    c.customer_name,
    COUNT(o.order_id) as order_count,
    SUM(o.order_total) as lifetime_value,
    AVG(o.order_total) as avg_order_value,
    MAX(o.created_at) as last_order_date
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name;
```

### 4. Document Denormalization Decisions
- Clearly document why denormalization was chosen
- Specify which data is derived and how it's maintained
- Include performance benchmarks that justify the decision

### 5. Monitor and Validate
- Implement validation checks for denormalized data
- Regular audits to ensure data consistency
- Performance monitoring to validate denormalization benefits
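
A consistency audit for a denormalized total can be as simple as recomputing from the detail rows and diffing against the stored value. A sketch over illustrative in-memory data (real audits should run in SQL, and compare decimals with a tolerance):

```python
# Stored (denormalized) totals, keyed by order_id.
orders = {1: {"order_total": 50.0}, 2: {"order_total": 99.0}}
order_items = [
    {"order_id": 1, "quantity": 2, "unit_price": 10.0},
    {"order_id": 1, "quantity": 3, "unit_price": 10.0},
    {"order_id": 2, "quantity": 1, "unit_price": 90.0},  # stored total is stale
]

def audit_totals(orders, order_items):
    # Recompute each order's total from its items...
    recomputed = {}
    for item in order_items:
        oid = item["order_id"]
        recomputed[oid] = recomputed.get(oid, 0.0) + item["quantity"] * item["unit_price"]
    # ...and report orders whose stored total disagrees.
    return [oid for oid, o in orders.items()
            if recomputed.get(oid, 0.0) != o["order_total"]]

print(audit_totals(orders, order_items))  # [2]
```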

## Common Anti-Patterns

### 1. Premature Denormalization
Starting with denormalized design without understanding actual performance requirements.

### 2. Over-Normalization
Creating too many small tables that require excessive joins for simple queries.

### 3. Inconsistent Approach
Mixing normalized and denormalized patterns without clear strategy.

### 4. Ignoring Maintenance
Denormalizing without proper mechanisms to maintain data consistency.

## Conclusion

Normalization and denormalization are both valuable tools in database design. The key is understanding when to apply each approach:

- **Use normalization** for transactional systems where data integrity is paramount
- **Consider denormalization** for analytical systems or when performance testing reveals bottlenecks
- **Apply selectively** based on actual usage patterns and performance requirements
- **Maintain consistency** through proper design patterns and validation mechanisms

The goal is not to achieve perfect normalization or denormalization, but to create a design that best serves your application's specific needs while maintaining data quality and system performance.