# Database Designer - POWERFUL Tier Skill

A comprehensive database design and analysis toolkit that provides expert-level schema analysis, index optimization, and migration generation capabilities for modern database systems.

## Features

### 🔍 Schema Analyzer
- **Normalization Analysis**: Automated detection of 1NF through BCNF violations
- **Data Type Optimization**: Identifies antipatterns and inappropriate types
- **Constraint Analysis**: Finds missing foreign keys, unique constraints, and checks
- **ERD Generation**: Creates Mermaid diagrams from DDL or JSON schema
- **Naming Convention Validation**: Ensures consistent naming patterns

### ⚡ Index Optimizer
- **Missing Index Detection**: Identifies indexes needed for query patterns
- **Composite Index Design**: Optimizes column ordering for maximum efficiency
- **Redundancy Analysis**: Finds duplicate and overlapping indexes
- **Performance Modeling**: Estimates selectivity and query performance impact
- **Covering Index Recommendations**: Eliminates table lookups

### 🚀 Migration Generator
- **Zero-Downtime Migrations**: Implements expand-contract patterns
- **Schema Evolution**: Handles column changes, table renames, constraint updates
- **Data Migration Scripts**: Automated data transformation and validation
- **Rollback Planning**: Complete reversal capabilities for all changes
- **Execution Orchestration**: Dependency-aware migration ordering

## Quick Start

### Prerequisites
- Python 3.7+ (no external dependencies required)
- Database schema in SQL DDL format or JSON
- Query patterns (for index optimization)
### Installation
```bash
# Clone or download the database-designer skill
cd engineering/database-designer/

# Make scripts executable
chmod +x *.py
```
## Usage Examples

### Schema Analysis

**Analyze SQL DDL file:**
```bash
python schema_analyzer.py --input assets/sample_schema.sql --output-format text
```

**Generate ERD diagram:**
```bash
python schema_analyzer.py --input assets/sample_schema.sql --generate-erd --output analysis.txt
```

**JSON schema analysis:**
```bash
python schema_analyzer.py --input assets/sample_schema.json --output-format json --output results.json
```

### Index Optimization

**Basic index analysis:**
```bash
python index_optimizer.py --schema assets/sample_schema.json --queries assets/sample_query_patterns.json
```

**High-priority recommendations only:**
```bash
python index_optimizer.py --schema assets/sample_schema.json --queries assets/sample_query_patterns.json --min-priority 2
```

**JSON output with existing index analysis:**
```bash
python index_optimizer.py --schema assets/sample_schema.json --queries assets/sample_query_patterns.json --format json --analyze-existing
```

### Migration Generation

**Generate migration between schemas:**
```bash
python migration_generator.py --current assets/current_schema.json --target assets/target_schema.json
```

**Zero-downtime migration:**
```bash
python migration_generator.py --current current.json --target target.json --zero-downtime --format sql
```

**Include validation queries:**
```bash
python migration_generator.py --current current.json --target target.json --include-validations --output migration_plan.txt
```
## Tool Documentation

### Schema Analyzer

**Input Formats:**
- SQL DDL files (.sql)
- JSON schema definitions (.json)

**Key Capabilities:**
- Detects 1NF violations (non-atomic values, repeating groups)
- Identifies 2NF issues (partial dependencies in composite keys)
- Finds 3NF problems (transitive dependencies)
- Checks BCNF compliance (determinant key requirements)
- Validates data types (VARCHAR(255) antipattern, inappropriate types)
- Flags missing constraints (NOT NULL, UNIQUE, CHECK, foreign keys)
- Verifies naming convention adherence
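As a hedged sketch of the kind of heuristic checks listed above (the shipped analyzer's logic may differ), a data-type and constraint pass over the JSON schema format used by these tools could look like this. The field names follow the JSON Schema Format section below; the specific rules shown are illustrative assumptions.

```python
# Hypothetical heuristic pass over the toolkit's JSON schema format;
# the real schema_analyzer.py may implement different rules.
def find_type_issues(schema):
    issues = []
    for table, tdef in schema.get("tables", {}).items():
        for col, cdef in tdef.get("columns", {}).items():
            # VARCHAR(255) is flagged as the "default size" antipattern
            if cdef.get("type", "").upper() == "VARCHAR(255)":
                issues.append(f"{table}.{col}: VARCHAR(255) antipattern")
            # Nullable foreign key columns usually deserve a second look
            if cdef.get("foreign_key") and cdef.get("nullable", True):
                issues.append(f"{table}.{col}: nullable foreign key")
    return issues

schema = {"tables": {"users": {"columns": {
    "email": {"type": "VARCHAR(255)", "nullable": False},
    "org_id": {"type": "INTEGER", "nullable": True, "foreign_key": "orgs.id"},
}}}}
print(find_type_issues(schema))
```

Running this prints one finding per violating column, keyed as `table.column`, which mirrors the prioritized-findings style of the analyzer's report.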
**Sample Command:**
```bash
python schema_analyzer.py \
  --input sample_schema.sql \
  --generate-erd \
  --output-format text \
  --output analysis.txt
```

**Output:**
- Comprehensive text or JSON analysis report
- Mermaid ERD diagram
- Prioritized recommendations
- SQL statements for improvements
### Index Optimizer

**Input Requirements:**
- Schema definition (JSON format)
- Query patterns with frequency and selectivity data

**Analysis Features:**
- Selectivity estimation based on column patterns
- Composite index column ordering optimization
- Covering index recommendations for SELECT queries
- Foreign key index validation
- Redundancy detection (duplicates, overlaps, unused indexes)
- Performance impact modeling
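A common heuristic for composite index column ordering, sketched below under stated assumptions: equality-predicate columns go before range-predicate columns, with the most selective equality columns first (using the convention of the sample files, where a higher `selectivity` value indicates a stronger filter). This is an illustrative simplification, not the optimizer's actual scoring model.

```python
# Hedged sketch of composite-index column ordering; the real
# index_optimizer.py model is richer than this two-rule heuristic.
def order_index_columns(where_conditions):
    eq = [c for c in where_conditions if c["operator"] == "="]
    rng = [c for c in where_conditions if c["operator"] != "="]
    # Most selective equality predicates first, range predicates last
    eq.sort(key=lambda c: c["selectivity"], reverse=True)
    return [c["column"] for c in eq + rng]

conds = [
    {"column": "price", "operator": "BETWEEN", "selectivity": 0.3},
    {"column": "brand", "operator": "=", "selectivity": 0.05},
    {"column": "is_active", "operator": "=", "selectivity": 0.1},
]
print(order_index_columns(conds))
```

The range predicate on `price` lands last because a B-tree index can only use one range column efficiently, after all equality columns have narrowed the scan.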
**Sample Command:**
```bash
python index_optimizer.py \
  --schema schema.json \
  --queries query_patterns.json \
  --format text \
  --min-priority 3 \
  --output recommendations.txt
```

**Output:**
- Prioritized index recommendations
- CREATE INDEX statements
- DROP INDEX statements for redundant indexes
- Performance impact analysis
- Storage size estimates
### Migration Generator

**Input Requirements:**
- Current schema (JSON format)
- Target schema (JSON format)

**Migration Strategies:**
- Standard migrations with ALTER statements
- Zero-downtime expand-contract patterns
- Data migration and transformation scripts
- Constraint management (add/drop in correct order)
- Index management with timing estimates
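The dependency-aware ordering mentioned above can be sketched as a plain topological sort over named migration steps. This is a minimal illustration with hypothetical step names, not the generator's actual orchestration code.

```python
# Hedged sketch of dependency-aware migration step ordering via
# depth-first topological sort; migration_generator.py's real
# orchestration also handles cycles and timing estimates.
def order_steps(steps):
    # steps: {step_name: [names of steps it depends on]}
    ordered, done = [], set()

    def visit(name):
        if name in done:
            return
        for dep in steps.get(name, []):
            visit(dep)  # emit dependencies before the dependent step
        done.add(name)
        ordered.append(name)

    for name in steps:
        visit(name)
    return ordered

plan = {
    "add_orders_table": ["add_users_table"],
    "add_users_table": [],
    "add_fk_orders_user": ["add_orders_table", "add_users_table"],
}
print(order_steps(plan))
```

The foreign-key step runs last because both tables it references must exist first; this is the same reason the tool adds and drops constraints "in correct order".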
**Sample Command:**
```bash
python migration_generator.py \
  --current current_schema.json \
  --target target_schema.json \
  --zero-downtime \
  --include-validations \
  --format text
```

**Output:**
- Step-by-step migration plan
- Forward and rollback SQL statements
- Risk assessment for each step
- Validation queries
- Execution time estimates
## File Structure

```
database-designer/
├── README.md                          # This file
├── SKILL.md                           # Comprehensive database design guide
├── schema_analyzer.py                 # Schema analysis tool
├── index_optimizer.py                 # Index optimization tool
├── migration_generator.py             # Migration generation tool
├── references/                        # Reference documentation
│   ├── normalization_guide.md         # Normalization principles and patterns
│   ├── index_strategy_patterns.md     # Index design and optimization guide
│   └── database_selection_decision_tree.md  # Database technology selection
├── assets/                            # Sample files and test data
│   ├── sample_schema.sql              # Sample DDL with various issues
│   ├── sample_schema.json             # JSON schema definition
│   └── sample_query_patterns.json     # Query patterns for index analysis
└── expected_outputs/                  # Example tool outputs
    ├── schema_analysis_sample.txt     # Sample schema analysis report
    ├── index_optimization_sample.txt  # Sample index recommendations
    └── migration_sample.txt           # Sample migration plan
```
## JSON Schema Format

The tools use a standardized JSON format for schema definitions:

```json
{
  "tables": {
    "table_name": {
      "columns": {
        "column_name": {
          "type": "VARCHAR(255)",
          "nullable": true,
          "unique": false,
          "foreign_key": "other_table.column",
          "default": "default_value",
          "cardinality_estimate": 1000
        }
      },
      "primary_key": ["id"],
      "unique_constraints": [["email"], ["username"]],
      "check_constraints": {
        "chk_positive_price": "price > 0"
      },
      "indexes": [
        {
          "name": "idx_table_column",
          "columns": ["column_name"],
          "unique": false,
          "partial_condition": "status = 'active'"
        }
      ]
    }
  }
}
```
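Since the tools have no external dependencies, loading and sanity-checking a file in this format is plain standard-library work. A minimal sketch (the tools' own validation is presumably stricter; `load_schema` is an illustrative helper, not part of the toolkit):

```python
# Minimal loader for the JSON schema format shown above.
import json

def load_schema(text):
    schema = json.loads(text)
    assert "tables" in schema, "top-level 'tables' key is required"
    for name, tdef in schema["tables"].items():
        assert "columns" in tdef, f"table {name} needs a 'columns' mapping"
    return schema

doc = '{"tables": {"users": {"columns": {"id": {"type": "INTEGER"}}, "primary_key": ["id"]}}}'
schema = load_schema(doc)
print(sorted(schema["tables"]))  # the table names that were parsed
```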
## Query Patterns Format

For index optimization, provide query patterns in this format:

```json
{
  "queries": [
    {
      "id": "user_lookup",
      "type": "SELECT",
      "table": "users",
      "where_conditions": [
        {
          "column": "email",
          "operator": "=",
          "selectivity": 0.95
        }
      ],
      "join_conditions": [
        {
          "local_column": "user_id",
          "foreign_table": "orders",
          "foreign_column": "id",
          "join_type": "INNER"
        }
      ],
      "order_by": [
        {"column": "created_at", "direction": "DESC"}
      ],
      "frequency": 1000,
      "avg_execution_time_ms": 5.2
    }
  ]
}
```
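The `frequency` and `avg_execution_time_ms` fields are what make prioritization possible: a query's total time budget (frequency times average latency) is a simple proxy for how much an index on it would pay off. A hedged sketch of that idea (the real optimizer's model also weighs selectivity, covering columns, and write overhead):

```python
# Illustrative frequency-weighted scoring over the query patterns format;
# not the actual scoring used by index_optimizer.py.
def priority_score(query):
    # total milliseconds spent on this query shape per measurement window
    return query["frequency"] * query["avg_execution_time_ms"]

queries = [
    {"id": "user_lookup", "frequency": 1000, "avg_execution_time_ms": 5.2},
    {"id": "daily_report", "frequency": 10, "avg_execution_time_ms": 250.8},
]
ranked = sorted(queries, key=priority_score, reverse=True)
print([q["id"] for q in ranked])
```

Note how the frequent cheap query outranks the rare expensive one (5200 ms vs. 2508 ms of total time), which is why the Best Practices below insist on realistic frequency estimates.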
## Best Practices

### Schema Analysis
1. **Start with DDL**: Use actual CREATE TABLE statements when possible
2. **Include Constraints**: Capture all existing constraints and indexes
3. **Consider History**: Some denormalization may be intentional for performance
4. **Validate Results**: Review recommendations against business requirements

### Index Optimization
1. **Real Query Patterns**: Use actual application queries, not theoretical ones
2. **Include Frequency**: Query frequency is crucial for prioritization
3. **Monitor Performance**: Validate recommendations with actual performance testing
4. **Gradual Implementation**: Add indexes incrementally and monitor impact

### Migration Planning
1. **Test Migrations**: Always test on non-production environments first
2. **Backup First**: Ensure complete backups before running migrations
3. **Monitor Progress**: Watch for locks and performance impacts during execution
4. **Rollback Ready**: Have rollback procedures tested and ready
## Advanced Usage

### Custom Selectivity Estimation
The index optimizer uses pattern-based selectivity estimation. You can improve accuracy by providing cardinality estimates in your schema JSON:

```json
{
  "columns": {
    "status": {
      "type": "VARCHAR(20)",
      "cardinality_estimate": 5
    }
  }
}
```

Here `"cardinality_estimate": 5` tells the optimizer the column holds only five distinct values (JSON does not allow comments, so the estimate is expressed as a plain numeric field).

### Zero-Downtime Migration Strategy
For production systems, use the zero-downtime flag to generate expand-contract migrations:

1. **Expand Phase**: Add new columns/tables without constraints
2. **Dual Write**: Application writes to both old and new structures
3. **Backfill**: Populate new structures with existing data
4. **Contract Phase**: Remove old structures after validation
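The database-side phases above can be sketched as paired forward/rollback steps. The SQL strings here are illustrative, assuming a hypothetical `users.email` widening, and are not output of migration_generator.py; the dual-write step lives in the application between expand and backfill, so it does not appear as a SQL phase.

```python
# Illustrative expand-contract phase list with forward and rollback SQL;
# the column names and statements are hypothetical examples.
PHASES = [
    ("expand",   "ALTER TABLE users ADD COLUMN email_new VARCHAR(320)",
                 "ALTER TABLE users DROP COLUMN email_new"),
    ("backfill", "UPDATE users SET email_new = email WHERE email_new IS NULL",
                 "-- no rollback needed; the expand rollback drops the column"),
    ("contract", "ALTER TABLE users DROP COLUMN email",
                 "-- irreversible without a backup: take one before this phase"),
]

def forward_plan():
    # forward statements in execution order
    return [(name, sql) for name, sql, _ in PHASES]

for name, sql in forward_plan():
    print(f"{name}: {sql}")
```

Keeping the rollback next to each forward step is what makes the generated plan "rollback ready": at any phase boundary you know exactly how to back out.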
### Integration with CI/CD
Integrate these tools into your deployment pipeline:

```bash
# Schema validation in CI: fail the build if any constraint issues are found
issues=$(python schema_analyzer.py --input schema.sql --output-format json | \
  jq '.constraint_analysis.total_issues')
[ "$issues" -eq 0 ] || exit 1

# Generate migrations automatically
python migration_generator.py \
  --current prod_schema.json \
  --target new_schema.json \
  --zero-downtime \
  --output migration.sql
```
## Troubleshooting

### Common Issues

**"No tables found in input file"**
- Ensure the SQL DDL uses standard CREATE TABLE syntax
- Check for syntax errors in the DDL
- Verify the file encoding (UTF-8 recommended)

**"Invalid JSON schema"**
- Validate the JSON syntax with a JSON validator
- Ensure all required fields are present
- Check that foreign key references use the "table.column" format

**"Analysis shows no issues but problems exist"**
- The tools use heuristic analysis, so review recommendations carefully
- Some design decisions may be intentional (e.g., denormalization for performance)
- Consider domain-specific requirements not captured by general rules

### Performance Tips

**Large Schemas:**
- Use `--output-format json` for machine processing
- Consider analyzing subsets of tables for very large schemas
- Provide cardinality estimates for better index recommendations

**Complex Queries:**
- Include actual execution times in query patterns
- Provide realistic frequency estimates
- Consider seasonal or usage pattern variations
## Contributing

This is a self-contained skill with no external dependencies. To extend functionality:

1. Follow the existing code patterns
2. Maintain the standard-library-only requirement
3. Add comprehensive test cases for new features
4. Update documentation and examples

## License

This database designer skill is part of the claude-skills collection and follows the same licensing terms.
---
name: "database-designer"
description: "Database Designer - POWERFUL Tier Skill"
---

# Database Designer - POWERFUL Tier Skill

## Overview

A comprehensive database design skill that provides expert-level analysis, optimization, and migration capabilities for modern database systems. This skill combines theoretical principles with practical tools to help architects and developers create scalable, performant, and maintainable database schemas.

## Core Competencies

### Schema Design & Analysis
- **Normalization Analysis**: Automated detection of normalization levels (1NF through BCNF)
- **Denormalization Strategy**: Smart recommendations for performance optimization
- **Data Type Optimization**: Identification of inappropriate types and size issues
- **Constraint Analysis**: Missing foreign keys, unique constraints, and null checks
- **Naming Convention Validation**: Consistent table and column naming patterns
- **ERD Generation**: Automatic Mermaid diagram creation from DDL

### Index Optimization
- **Index Gap Analysis**: Identification of missing indexes on foreign keys and query patterns
- **Composite Index Strategy**: Optimal column ordering for multi-column indexes
- **Index Redundancy Detection**: Elimination of overlapping and unused indexes
- **Performance Impact Modeling**: Selectivity estimation and query cost analysis
- **Index Type Selection**: B-tree, hash, partial, covering, and specialized indexes

### Migration Management
- **Zero-Downtime Migrations**: Expand-contract pattern implementation
- **Schema Evolution**: Safe column additions, deletions, and type changes
- **Data Migration Scripts**: Automated data transformation and validation
- **Rollback Strategy**: Complete reversal capabilities with validation
- **Execution Planning**: Ordered migration steps with dependency resolution

## Database Design Principles

→ See references/database-design-reference.md for details

## Best Practices

### Schema Design
1. **Use meaningful names**: Clear, consistent naming conventions
2. **Choose appropriate data types**: Right-sized columns for storage efficiency
3. **Define proper constraints**: Foreign keys, check constraints, unique indexes
4. **Consider future growth**: Plan for scale from the beginning
5. **Document relationships**: Clear foreign key relationships and business rules

### Performance Optimization
1. **Index strategically**: Cover common query patterns without over-indexing
2. **Monitor query performance**: Regular analysis of slow queries
3. **Partition large tables**: Improve query performance and maintenance
4. **Use appropriate isolation levels**: Balance consistency with performance
5. **Implement connection pooling**: Efficient resource utilization

### Security Considerations
1. **Principle of least privilege**: Grant minimal necessary permissions
2. **Encrypt sensitive data**: At rest and in transit
3. **Audit access patterns**: Monitor and log database access
4. **Validate inputs**: Prevent SQL injection attacks
5. **Regular security updates**: Keep database software current

## Conclusion

Effective database design requires balancing multiple competing concerns: performance, scalability, maintainability, and business requirements. This skill provides the tools and knowledge to make informed decisions throughout the database lifecycle, from initial schema design through production optimization and evolution.

The included tools automate common analysis and optimization tasks, while the comprehensive guides provide the theoretical foundation for making sound architectural decisions. Whether building a new system or optimizing an existing one, these resources provide expert-level guidance for creating robust, scalable database solutions.
{
  "queries": [
    {
      "id": "user_login",
      "type": "SELECT",
      "table": "users",
      "description": "User authentication lookup by email",
      "where_conditions": [
        {"column": "email", "operator": "=", "selectivity": 0.95}
      ],
      "join_conditions": [],
      "order_by": [],
      "group_by": [],
      "frequency": 5000,
      "avg_execution_time_ms": 2.5
    },
    {
      "id": "product_search_category",
      "type": "SELECT",
      "table": "products",
      "description": "Product search within category with pagination",
      "where_conditions": [
        {"column": "category_id", "operator": "=", "selectivity": 0.2},
        {"column": "is_active", "operator": "=", "selectivity": 0.1}
      ],
      "join_conditions": [],
      "order_by": [
        {"column": "created_at", "direction": "DESC"}
      ],
      "group_by": [],
      "frequency": 2500,
      "avg_execution_time_ms": 15.2
    },
    {
      "id": "product_search_price_range",
      "type": "SELECT",
      "table": "products",
      "description": "Product search by price range and brand",
      "where_conditions": [
        {"column": "price", "operator": "BETWEEN", "selectivity": 0.3},
        {"column": "brand", "operator": "=", "selectivity": 0.05},
        {"column": "is_active", "operator": "=", "selectivity": 0.1}
      ],
      "join_conditions": [],
      "order_by": [
        {"column": "price", "direction": "ASC"}
      ],
      "group_by": [],
      "frequency": 800,
      "avg_execution_time_ms": 25.7
    },
    {
      "id": "user_orders_history",
      "type": "SELECT",
      "table": "orders",
      "description": "User order history with pagination",
      "where_conditions": [
        {"column": "user_id", "operator": "=", "selectivity": 0.8}
      ],
      "join_conditions": [],
      "order_by": [
        {"column": "created_at", "direction": "DESC"}
      ],
      "group_by": [],
      "frequency": 1200,
      "avg_execution_time_ms": 8.3
    },
    {
      "id": "order_details_with_items",
      "type": "SELECT",
      "table": "orders",
      "description": "Order details with order items (JOIN query)",
      "where_conditions": [
        {"column": "id", "operator": "=", "selectivity": 1.0}
      ],
      "join_conditions": [
        {"local_column": "id", "foreign_table": "order_items", "foreign_column": "order_id", "join_type": "INNER"}
      ],
      "order_by": [],
      "group_by": [],
      "frequency": 3000,
      "avg_execution_time_ms": 12.1
    },
    {
      "id": "pending_orders_processing",
      "type": "SELECT",
      "table": "orders",
      "description": "Processing queue - pending orders by date",
      "where_conditions": [
        {"column": "status", "operator": "=", "selectivity": 0.15},
        {"column": "created_at", "operator": ">=", "selectivity": 0.3}
      ],
      "join_conditions": [],
      "order_by": [
        {"column": "created_at", "direction": "ASC"}
      ],
      "group_by": [],
      "frequency": 150,
      "avg_execution_time_ms": 45.2
    },
    {
      "id": "user_orders_by_status",
      "type": "SELECT",
      "table": "orders",
      "description": "User orders filtered by status",
      "where_conditions": [
        {"column": "user_id", "operator": "=", "selectivity": 0.8},
        {"column": "status", "operator": "IN", "selectivity": 0.4}
      ],
      "join_conditions": [],
      "order_by": [
        {"column": "created_at", "direction": "DESC"}
      ],
      "group_by": [],
      "frequency": 600,
      "avg_execution_time_ms": 18.5
    },
    {
      "id": "product_reviews_summary",
      "type": "SELECT",
      "table": "product_reviews",
      "description": "Product review aggregation",
      "where_conditions": [
        {"column": "product_id", "operator": "=", "selectivity": 0.85}
      ],
      "join_conditions": [],
      "order_by": [],
      "group_by": ["product_id"],
      "frequency": 1800,
      "avg_execution_time_ms": 22.3
    },
    {
      "id": "inventory_low_stock",
      "type": "SELECT",
      "table": "products",
      "description": "Low inventory alert query",
      "where_conditions": [
        {"column": "inventory_count", "operator": "<=", "selectivity": 0.1},
        {"column": "is_active", "operator": "=", "selectivity": 0.1}
      ],
      "join_conditions": [],
      "order_by": [
        {"column": "inventory_count", "direction": "ASC"}
      ],
      "group_by": [],
      "frequency": 50,
      "avg_execution_time_ms": 35.8
    },
    {
      "id": "popular_products_by_category",
      "type": "SELECT",
      "table": "order_items",
      "description": "Popular products analysis with category join",
      "where_conditions": [
        {"column": "created_at", "operator": ">=", "selectivity": 0.2}
      ],
      "join_conditions": [
        {"local_column": "product_id", "foreign_table": "products", "foreign_column": "id", "join_type": "INNER"},
        {"local_column": "category_id", "foreign_table": "categories", "foreign_column": "id", "join_type": "INNER"}
      ],
      "order_by": [
        {"column": "total_quantity", "direction": "DESC"}
      ],
      "group_by": ["product_id", "category_id"],
      "frequency": 25,
      "avg_execution_time_ms": 180.5
    },
    {
      "id": "customer_purchase_history",
      "type": "SELECT",
      "table": "orders",
      "description": "Customer analytics - purchase history with items",
      "where_conditions": [
        {"column": "user_id", "operator": "=", "selectivity": 0.8},
        {"column": "status", "operator": "IN", "selectivity": 0.6}
      ],
      "join_conditions": [
        {"local_column": "id", "foreign_table": "order_items", "foreign_column": "order_id", "join_type": "INNER"}
      ],
      "order_by": [
        {"column": "created_at", "direction": "DESC"}
      ],
      "group_by": [],
      "frequency": 300,
      "avg_execution_time_ms": 65.2
    },
    {
      "id": "daily_sales_report",
      "type": "SELECT",
      "table": "orders",
      "description": "Daily sales aggregation report",
      "where_conditions": [
        {"column": "created_at", "operator": ">=", "selectivity": 0.05},
        {"column": "status", "operator": "IN", "selectivity": 0.6}
      ],
      "join_conditions": [],
      "order_by": [
        {"column": "order_date", "direction": "DESC"}
      ],
      "group_by": ["DATE(created_at)"],
      "frequency": 10,
      "avg_execution_time_ms": 250.8
    },
    {
      "id": "category_hierarchy_nav",
      "type": "SELECT",
      "table": "categories",
      "description": "Category navigation - parent-child relationships",
      "where_conditions": [
        {"column": "parent_id", "operator": "=", "selectivity": 0.2},
        {"column": "is_active", "operator": "=", "selectivity": 0.1}
      ],
      "join_conditions": [],
      "order_by": [
        {"column": "sort_order", "direction": "ASC"}
      ],
      "group_by": [],
      "frequency": 800,
      "avg_execution_time_ms": 5.1
    },
    {
      "id": "recent_user_reviews",
      "type": "SELECT",
      "table": "product_reviews",
      "description": "Recent product reviews by user",
      "where_conditions": [
        {"column": "user_id", "operator": "=", "selectivity": 0.95}
      ],
      "join_conditions": [
        {"local_column": "product_id", "foreign_table": "products", "foreign_column": "id", "join_type": "INNER"}
      ],
      "order_by": [
        {"column": "created_at", "direction": "DESC"}
      ],
      "group_by": [],
      "frequency": 200,
      "avg_execution_time_ms": 12.7
    },
    {
      "id": "product_avg_rating",
      "type": "SELECT",
      "table": "product_reviews",
      "description": "Product average rating calculation",
      "where_conditions": [
        {"column": "product_id", "operator": "IN", "selectivity": 0.1}
      ],
      "join_conditions": [],
      "order_by": [],
      "group_by": ["product_id"],
      "frequency": 400,
      "avg_execution_time_ms": 35.4
    }
  ]
}
{
  "tables": {
    "users": {
      "columns": {
        "id": {"type": "INTEGER", "nullable": false, "unique": true, "cardinality_estimate": 50000},
        "email": {"type": "VARCHAR(255)", "nullable": false, "unique": true, "cardinality_estimate": 50000},
        "username": {"type": "VARCHAR(50)", "nullable": false, "unique": true, "cardinality_estimate": 50000},
        "password_hash": {"type": "VARCHAR(255)", "nullable": false, "cardinality_estimate": 50000},
        "first_name": {"type": "VARCHAR(100)", "nullable": true, "cardinality_estimate": 25000},
        "last_name": {"type": "VARCHAR(100)", "nullable": true, "cardinality_estimate": 30000},
        "status": {"type": "VARCHAR(20)", "nullable": false, "default": "active", "cardinality_estimate": 5},
        "created_at": {"type": "TIMESTAMP", "nullable": false, "default": "CURRENT_TIMESTAMP"}
      },
      "primary_key": ["id"],
      "unique_constraints": [["email"], ["username"]],
      "check_constraints": {
        "chk_status_valid": "status IN ('active', 'inactive', 'suspended', 'deleted')"
      },
      "indexes": [
        {"name": "idx_users_email", "columns": ["email"], "unique": true},
        {"name": "idx_users_status", "columns": ["status"]}
      ]
    },
    "products": {
      "columns": {
        "id": {"type": "INTEGER", "nullable": false, "unique": true, "cardinality_estimate": 10000},
        "name": {"type": "VARCHAR(255)", "nullable": false, "cardinality_estimate": 9500},
        "sku": {"type": "VARCHAR(50)", "nullable": false, "unique": true, "cardinality_estimate": 10000},
        "price": {"type": "DECIMAL(10,2)", "nullable": false, "cardinality_estimate": 5000},
        "category_id": {"type": "INTEGER", "nullable": false, "foreign_key": "categories.id", "cardinality_estimate": 50},
        "brand": {"type": "VARCHAR(100)", "nullable": true, "cardinality_estimate": 200},
        "is_active": {"type": "BOOLEAN", "nullable": false, "default": true, "cardinality_estimate": 2},
        "inventory_count": {"type": "INTEGER", "nullable": false, "default": 0, "cardinality_estimate": 1000},
        "created_at": {"type": "TIMESTAMP", "nullable": false, "default": "CURRENT_TIMESTAMP"}
      },
      "primary_key": ["id"],
      "unique_constraints": [["sku"]],
      "check_constraints": {
        "chk_price_positive": "price > 0",
        "chk_inventory_non_negative": "inventory_count >= 0"
      },
      "indexes": [
        {"name": "idx_products_category", "columns": ["category_id"]},
        {"name": "idx_products_brand", "columns": ["brand"]},
        {"name": "idx_products_price", "columns": ["price"]},
        {"name": "idx_products_active_category", "columns": ["is_active", "category_id"], "partial_condition": "is_active = true"}
      ]
    },
    "orders": {
      "columns": {
        "id": {"type": "INTEGER", "nullable": false, "unique": true, "cardinality_estimate": 200000},
        "order_number": {"type": "VARCHAR(50)", "nullable": false, "unique": true, "cardinality_estimate": 200000},
        "user_id": {"type": "INTEGER", "nullable": false, "foreign_key": "users.id", "cardinality_estimate": 40000},
        "status": {"type": "VARCHAR(50)", "nullable": false, "default": "pending", "cardinality_estimate": 8},
        "total_amount": {"type": "DECIMAL(10,2)", "nullable": false, "cardinality_estimate": 50000},
        "payment_method": {"type": "VARCHAR(50)", "nullable": true, "cardinality_estimate": 10},
        "created_at": {"type": "TIMESTAMP", "nullable": false, "default": "CURRENT_TIMESTAMP"},
        "shipped_at": {"type": "TIMESTAMP", "nullable": true}
      },
      "primary_key": ["id"],
      "unique_constraints": [["order_number"]],
      "check_constraints": {
        "chk_total_positive": "total_amount > 0",
        "chk_status_valid": "status IN ('pending', 'processing', 'shipped', 'delivered', 'cancelled')"
      },
      "indexes": [
        {"name": "idx_orders_user", "columns": ["user_id"]},
        {"name": "idx_orders_status", "columns": ["status"]},
        {"name": "idx_orders_created", "columns": ["created_at"]},
        {"name": "idx_orders_user_status", "columns": ["user_id", "status"]}
      ]
    },
    "order_items": {
      "columns": {
        "id": {"type": "INTEGER", "nullable": false, "unique": true, "cardinality_estimate": 800000},
        "order_id": {"type": "INTEGER", "nullable": false, "foreign_key": "orders.id", "cardinality_estimate": 200000},
        "product_id": {"type": "INTEGER", "nullable": false, "foreign_key": "products.id", "cardinality_estimate": 8000},
        "quantity": {"type": "INTEGER", "nullable": false, "cardinality_estimate": 20},
        "unit_price": {"type": "DECIMAL(10,2)", "nullable": false, "cardinality_estimate": 5000},
        "total_price": {"type": "DECIMAL(10,2)", "nullable": false, "cardinality_estimate": 10000}
      },
      "primary_key": ["id"],
      "check_constraints": {
        "chk_quantity_positive": "quantity > 0",
        "chk_unit_price_positive": "unit_price > 0"
      },
      "indexes": [
        {"name": "idx_order_items_order", "columns": ["order_id"]},
        {"name": "idx_order_items_product", "columns": ["product_id"]}
      ]
    },
    "categories": {
      "columns": {
        "id": {
          "type": "INTEGER",
|
||||
"nullable": false,
|
||||
"unique": true,
|
||||
"cardinality_estimate": 100
|
||||
},
|
||||
"name": {
|
||||
"type": "VARCHAR(100)",
|
||||
"nullable": false,
|
||||
"cardinality_estimate": 100
|
||||
},
|
||||
"parent_id": {
|
||||
"type": "INTEGER",
|
||||
"nullable": true,
|
||||
"foreign_key": "categories.id",
|
||||
"cardinality_estimate": 20
|
||||
},
|
||||
"is_active": {
|
||||
"type": "BOOLEAN",
|
||||
"nullable": false,
|
||||
"default": true,
|
||||
"cardinality_estimate": 2
|
||||
}
|
||||
},
|
||||
"primary_key": ["id"],
|
||||
"indexes": [
|
||||
{
|
||||
"name": "idx_categories_parent",
|
||||
"columns": ["parent_id"]
|
||||
},
|
||||
{
|
||||
"name": "idx_categories_active",
|
||||
"columns": ["is_active"]
|
||||
}
|
||||
]
|
||||
},
|
||||
"product_reviews": {
|
||||
"columns": {
|
||||
"id": {
|
||||
"type": "INTEGER",
|
||||
"nullable": false,
|
||||
"unique": true,
|
||||
"cardinality_estimate": 150000
|
||||
},
|
||||
"product_id": {
|
||||
"type": "INTEGER",
|
||||
"nullable": false,
|
||||
"foreign_key": "products.id",
|
||||
"cardinality_estimate": 8000
|
||||
},
|
||||
"user_id": {
|
||||
"type": "INTEGER",
|
||||
"nullable": false,
|
||||
"foreign_key": "users.id",
|
||||
"cardinality_estimate": 30000
|
||||
},
|
||||
"rating": {
|
||||
"type": "INTEGER",
|
||||
"nullable": false,
|
||||
"cardinality_estimate": 5
|
||||
},
|
||||
"review_text": {
|
||||
"type": "TEXT",
|
||||
"nullable": true
|
||||
},
|
||||
"created_at": {
|
||||
"type": "TIMESTAMP",
|
||||
"nullable": false,
|
||||
"default": "CURRENT_TIMESTAMP"
|
||||
}
|
||||
},
|
||||
"primary_key": ["id"],
|
||||
"unique_constraints": [
|
||||
["product_id", "user_id"]
|
||||
],
|
||||
"check_constraints": {
|
||||
"chk_rating_valid": "rating BETWEEN 1 AND 5"
|
||||
},
|
||||
"indexes": [
|
||||
{
|
||||
"name": "idx_reviews_product",
|
||||
"columns": ["product_id"]
|
||||
},
|
||||
{
|
||||
"name": "idx_reviews_user",
|
||||
"columns": ["user_id"]
|
||||
},
|
||||
{
|
||||
"name": "idx_reviews_rating",
|
||||
"columns": ["rating"]
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
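A schema in this JSON shape can be sanity-checked before analysis. The sketch below, which is illustrative and not part of the skill's shipped code, verifies that every `foreign_key` value of the form `"table.column"` points at a column that actually exists in the schema dict:

```python
def check_foreign_keys(tables: dict) -> list:
    """Return 'table.column -> target' strings for dangling foreign keys."""
    problems = []
    for table_name, table in tables.items():
        for col_name, col in table.get("columns", {}).items():
            target = col.get("foreign_key")
            if not target:
                continue
            ref_table, _, ref_col = target.partition(".")
            # The target column must be defined under the referenced table.
            if ref_col not in tables.get(ref_table, {}).get("columns", {}):
                problems.append(f"{table_name}.{col_name} -> {target}")
    return problems

# Minimal example: coupons is not defined, so its reference is flagged.
tables = {
    "users": {"columns": {"id": {"type": "INTEGER"}}},
    "orders": {"columns": {
        "id": {"type": "INTEGER"},
        "user_id": {"type": "INTEGER", "foreign_key": "users.id"},
        "coupon_id": {"type": "INTEGER", "foreign_key": "coupons.id"},
    }},
}
print(check_foreign_keys(tables))  # flags the dangling coupons.id reference
```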
@@ -0,0 +1,207 @@
-- Sample E-commerce Database Schema
-- Demonstrates various normalization levels and common patterns

-- Users table - well normalized
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    email VARCHAR(255) NOT NULL UNIQUE,
    username VARCHAR(50) NOT NULL UNIQUE,
    password_hash VARCHAR(255) NOT NULL,
    first_name VARCHAR(100),
    last_name VARCHAR(100),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    status VARCHAR(20) DEFAULT 'active'
);

-- Categories table - hierarchical structure
CREATE TABLE categories (
    id INTEGER PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    slug VARCHAR(100) NOT NULL UNIQUE,
    parent_id INTEGER REFERENCES categories(id),
    description TEXT,
    is_active BOOLEAN DEFAULT true,
    sort_order INTEGER DEFAULT 0,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Products table - potential normalization issues
CREATE TABLE products (
    id INTEGER PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    sku VARCHAR(50) NOT NULL UNIQUE,
    description TEXT,
    price DECIMAL(10,2) NOT NULL,
    cost DECIMAL(10,2),
    weight DECIMAL(8,2),
    dimensions VARCHAR(50), -- Potential 1NF violation: "10x5x3 inches"
    category_id INTEGER REFERENCES categories(id),
    category_name VARCHAR(100), -- Redundant with categories.name (3NF violation)
    brand VARCHAR(100), -- Should be normalized to separate brands table
    tags VARCHAR(500), -- Potential 1NF violation: comma-separated tags
    inventory_count INTEGER DEFAULT 0,
    reorder_point INTEGER DEFAULT 10,
    supplier_name VARCHAR(100), -- Should be normalized
    supplier_contact VARCHAR(255), -- Should be normalized
    is_active BOOLEAN DEFAULT true,
    featured BOOLEAN DEFAULT false,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Addresses table - good normalization
CREATE TABLE addresses (
    id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    address_type VARCHAR(20) DEFAULT 'shipping', -- 'shipping', 'billing'
    street_address VARCHAR(255) NOT NULL,
    street_address_2 VARCHAR(255),
    city VARCHAR(100) NOT NULL,
    state VARCHAR(50) NOT NULL,
    postal_code VARCHAR(20) NOT NULL,
    country VARCHAR(50) NOT NULL DEFAULT 'US',
    is_default BOOLEAN DEFAULT false,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Orders table - mixed normalization issues
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    order_number VARCHAR(50) NOT NULL UNIQUE,
    user_id INTEGER REFERENCES users(id),
    user_email VARCHAR(255), -- Denormalized for performance/historical reasons
    user_name VARCHAR(200), -- Denormalized for performance/historical reasons
    status VARCHAR(50) NOT NULL DEFAULT 'pending',
    total_amount DECIMAL(10,2) NOT NULL,
    tax_amount DECIMAL(10,2) NOT NULL,
    shipping_amount DECIMAL(10,2) NOT NULL,
    discount_amount DECIMAL(10,2) DEFAULT 0,
    payment_method VARCHAR(50), -- Should be normalized to payment_methods
    payment_status VARCHAR(50) DEFAULT 'pending',
    shipping_address_id INTEGER REFERENCES addresses(id),
    billing_address_id INTEGER REFERENCES addresses(id),
    -- Denormalized shipping address for historical preservation
    shipping_street VARCHAR(255),
    shipping_city VARCHAR(100),
    shipping_state VARCHAR(50),
    shipping_postal_code VARCHAR(20),
    shipping_country VARCHAR(50),
    notes TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    shipped_at TIMESTAMP,
    delivered_at TIMESTAMP
);

-- Order items table - properly normalized
CREATE TABLE order_items (
    id INTEGER PRIMARY KEY,
    order_id INTEGER REFERENCES orders(id),
    product_id INTEGER REFERENCES products(id),
    product_name VARCHAR(255), -- Denormalized for historical reasons
    product_sku VARCHAR(50), -- Denormalized for historical reasons
    quantity INTEGER NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    total_price DECIMAL(10,2) NOT NULL, -- Calculated field (could be computed)
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Shopping cart table - session-based data
CREATE TABLE shopping_cart (
    id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    session_id VARCHAR(255), -- For anonymous users
    product_id INTEGER REFERENCES products(id),
    quantity INTEGER NOT NULL DEFAULT 1,
    added_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(user_id, product_id),
    UNIQUE(session_id, product_id)
);

-- Product reviews - user-generated content
CREATE TABLE product_reviews (
    id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES products(id),
    user_id INTEGER REFERENCES users(id),
    rating INTEGER NOT NULL CHECK (rating BETWEEN 1 AND 5),
    title VARCHAR(200),
    review_text TEXT,
    verified_purchase BOOLEAN DEFAULT false,
    helpful_count INTEGER DEFAULT 0,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(product_id, user_id) -- One review per user per product
);

-- Coupons table - promotional data
CREATE TABLE coupons (
    id INTEGER PRIMARY KEY,
    code VARCHAR(50) NOT NULL UNIQUE,
    description VARCHAR(255),
    discount_type VARCHAR(20) NOT NULL, -- 'percentage', 'fixed_amount'
    discount_value DECIMAL(8,2) NOT NULL,
    minimum_amount DECIMAL(10,2),
    maximum_discount DECIMAL(10,2),
    usage_limit INTEGER,
    usage_count INTEGER DEFAULT 0,
    valid_from TIMESTAMP NOT NULL,
    valid_until TIMESTAMP NOT NULL,
    is_active BOOLEAN DEFAULT true,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Audit log table - for tracking changes
CREATE TABLE audit_log (
    id INTEGER PRIMARY KEY,
    table_name VARCHAR(50) NOT NULL,
    record_id INTEGER NOT NULL,
    action VARCHAR(20) NOT NULL, -- 'INSERT', 'UPDATE', 'DELETE'
    old_values TEXT, -- JSON format
    new_values TEXT, -- JSON format
    user_id INTEGER REFERENCES users(id),
    ip_address VARCHAR(45),
    user_agent TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Problematic table - multiple normalization violations
CREATE TABLE user_preferences (
    user_id INTEGER PRIMARY KEY REFERENCES users(id),
    preferred_categories VARCHAR(500), -- CSV list - 1NF violation
    email_notifications VARCHAR(255), -- "daily,weekly,promotions" - 1NF violation
    user_name VARCHAR(200), -- Redundant with users table - 3NF violation
    user_email VARCHAR(255), -- Redundant with users table - 3NF violation
    theme VARCHAR(50) DEFAULT 'light',
    language VARCHAR(10) DEFAULT 'en',
    timezone VARCHAR(50) DEFAULT 'UTC',
    currency VARCHAR(3) DEFAULT 'USD',
    date_format VARCHAR(20) DEFAULT 'YYYY-MM-DD',
    newsletter_subscribed BOOLEAN DEFAULT true,
    sms_notifications BOOLEAN DEFAULT false,
    push_notifications BOOLEAN DEFAULT true,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create some basic indexes (some missing, some redundant for demonstration)
CREATE INDEX idx_users_email ON users (email);
CREATE INDEX idx_users_username ON users (username); -- Redundant due to UNIQUE constraint
CREATE INDEX idx_products_category ON products (category_id);
CREATE INDEX idx_products_brand ON products (brand);
CREATE INDEX idx_products_sku ON products (sku); -- Redundant due to UNIQUE constraint
CREATE INDEX idx_orders_user ON orders (user_id);
CREATE INDEX idx_orders_status ON orders (status);
CREATE INDEX idx_orders_created ON orders (created_at);
CREATE INDEX idx_order_items_order ON order_items (order_id);
CREATE INDEX idx_order_items_product ON order_items (product_id);
-- Missing index on addresses.user_id
-- Missing composite index on orders (user_id, status)
-- Missing index on product_reviews.product_id

-- Constraints that should exist but are missing
-- ALTER TABLE products ADD CONSTRAINT chk_price_positive CHECK (price > 0);
-- ALTER TABLE products ADD CONSTRAINT chk_inventory_non_negative CHECK (inventory_count >= 0);
-- ALTER TABLE order_items ADD CONSTRAINT chk_quantity_positive CHECK (quantity > 0);
-- ALTER TABLE orders ADD CONSTRAINT chk_total_positive CHECK (total_amount > 0);
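The 1NF violations called out in the comments above (`dimensions`, `tags`, `preferred_categories`) can be detected with a simple heuristic over sampled values. This is a hedged sketch of one such heuristic, not the analyzer's actual implementation: a column is flagged when at least half of its sampled values split into multiple tokens on a delimiter.

```python
def looks_multivalued(samples, delimiter=","):
    """Heuristic 1NF check: True if most samples contain delimited lists."""
    if not samples:
        return False
    hits = sum(1 for value in samples if len(value.split(delimiter)) > 1)
    return hits / len(samples) >= 0.5

# Comma-separated tags trip the heuristic; plain SKUs do not.
tags = ["red,summer,sale", "blue,winter", "clearance"]
skus = ["AB-1001", "AB-1002"]
print(looks_multivalued(tags))
print(looks_multivalued(skus))
```

A real analyzer would also try other delimiters (`;`, `|`, whitespace) and exclude free-text columns, where commas are expected.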
@@ -0,0 +1,60 @@
DATABASE INDEX OPTIMIZATION REPORT
==================================================

ANALYSIS SUMMARY
----------------
Tables Analyzed: 6
Query Patterns: 15
Existing Indexes: 12
New Recommendations: 8
High Priority: 4
Redundancy Issues: 2

HIGH PRIORITY RECOMMENDATIONS (4)
---------------------------------
1. orders: Optimize multi-column WHERE conditions: user_id, status, created_at
   Columns: user_id, status, created_at
   Benefit: Very High
   SQL: CREATE INDEX idx_orders_user_status_created ON orders (user_id, status, created_at);

2. products: Optimize WHERE category_id = ? AND is_active = ? queries
   Columns: category_id, is_active
   Benefit: High
   SQL: CREATE INDEX idx_products_category_active ON products (category_id, is_active);

3. order_items: Optimize JOIN with products table on product_id
   Columns: product_id
   Benefit: High (frequent JOINs)
   SQL: CREATE INDEX idx_order_items_product_join ON order_items (product_id);

4. product_reviews: Covering index for WHERE + ORDER BY optimization
   Columns: product_id, created_at
   Benefit: High (eliminates table lookups for SELECT)
   SQL: CREATE INDEX idx_product_reviews_covering_product_created ON product_reviews (product_id, created_at) INCLUDE (rating, review_text);

REDUNDANCY ISSUES (2)
---------------------
• DUPLICATE: Indexes 'idx_users_email' and 'unique_users_email' are identical
  Recommendation: Drop one of the duplicate indexes
  SQL: DROP INDEX idx_users_email;

• OVERLAPPING: Index 'idx_products_category' overlaps 85% with 'idx_products_category_active'
  Recommendation: Consider dropping 'idx_products_category' as it's largely covered by 'idx_products_category_active'
  SQL: DROP INDEX idx_products_category;

PERFORMANCE IMPACT ANALYSIS
---------------------------
Queries to be optimized: 12
High impact optimizations: 6
Estimated insert overhead: 40%

RECOMMENDED CREATE INDEX STATEMENTS
-----------------------------------
1. CREATE INDEX idx_orders_user_status_created ON orders (user_id, status, created_at);
2. CREATE INDEX idx_products_category_active ON products (category_id, is_active);
3. CREATE INDEX idx_order_items_product_join ON order_items (product_id);
4. CREATE INDEX idx_product_reviews_covering_product_created ON product_reviews (product_id, created_at) INCLUDE (rating, review_text);
5. CREATE INDEX idx_products_price_brand ON products (price, brand);
6. CREATE INDEX idx_orders_status_created ON orders (status, created_at);
7. CREATE INDEX idx_categories_parent_active ON categories (parent_id, is_active);
8. CREATE INDEX idx_product_reviews_user_created ON product_reviews (user_id, created_at);
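The column ordering in recommendation 1 (user_id, status, created_at) follows a common rule of thumb: equality-predicate columns first, ordered by descending selectivity, with range or sort columns last. A minimal sketch of that rule, using the cardinality estimates from the sample schema (names illustrative, not the skill's actual API):

```python
def order_composite(columns, cardinality, row_count, range_cols=()):
    """Order composite index columns: equality cols first, most selective first."""
    def key(col):
        selectivity = cardinality.get(col, 1) / row_count
        # Tuples sort False-before-True, so equality columns lead;
        # negated selectivity puts the most selective column first.
        return (col in range_cols, -selectivity)
    return sorted(columns, key=key)

# user_id (40k distinct of 200k rows) leads, low-cardinality status follows,
# and created_at trails because it is used as a range/sort column.
cardinality = {"user_id": 40_000, "status": 8, "created_at": 150_000}
print(order_composite(["status", "created_at", "user_id"],
                      cardinality, 200_000, {"created_at"}))
```

This is only a heuristic; a real optimizer would also weigh query frequency and whether the index can satisfy ORDER BY.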
@@ -0,0 +1,124 @@
DATABASE MIGRATION PLAN
==================================================
Migration ID: a7b3c9d2
Created: 2024-02-16T15:30:00Z
Zero Downtime: false

MIGRATION SUMMARY
-----------------
Total Steps: 12
Tables Added: 1
Tables Dropped: 0
Tables Renamed: 0
Columns Added: 3
Columns Dropped: 0
Columns Modified: 2
Constraints Added: 4
Constraints Dropped: 0
Indexes Added: 2
Indexes Dropped: 0

RISK ASSESSMENT
---------------
High Risk Steps: 1
Medium Risk Steps: 4
Low Risk Steps: 7

MIGRATION STEPS
---------------
1. Create table brands with 4 columns (LOW risk)
   Type: CREATE_TABLE
   Forward SQL: CREATE TABLE brands (
       id INTEGER NOT NULL,
       name VARCHAR(100) NOT NULL,
       description TEXT,
       created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
   );
   Rollback SQL: DROP TABLE IF EXISTS brands;

2. Add column brand_id to products (LOW risk)
   Type: ADD_COLUMN
   Forward SQL: ALTER TABLE products ADD COLUMN brand_id INTEGER;
   Rollback SQL: ALTER TABLE products DROP COLUMN brand_id;

3. Add column email_verified to users (LOW risk)
   Type: ADD_COLUMN
   Forward SQL: ALTER TABLE users ADD COLUMN email_verified BOOLEAN DEFAULT false;
   Rollback SQL: ALTER TABLE users DROP COLUMN email_verified;

4. Add column last_login to users (LOW risk)
   Type: ADD_COLUMN
   Forward SQL: ALTER TABLE users ADD COLUMN last_login TIMESTAMP;
   Rollback SQL: ALTER TABLE users DROP COLUMN last_login;

5. Modify column price: type DECIMAL(10,2) -> DECIMAL(12,2) (LOW risk)
   Type: MODIFY_COLUMN
   Forward SQL: ALTER TABLE products
                ALTER COLUMN price TYPE DECIMAL(12,2);
   Rollback SQL: ALTER TABLE products
                ALTER COLUMN price TYPE DECIMAL(10,2);

6. Modify column inventory_count: nullable true -> false (HIGH risk)
   Type: MODIFY_COLUMN
   Forward SQL: ALTER TABLE products
                ALTER COLUMN inventory_count SET NOT NULL;
   Rollback SQL: ALTER TABLE products
                ALTER COLUMN inventory_count DROP NOT NULL;

7. Add primary key on brands(id) (MEDIUM risk)
   Type: ADD_CONSTRAINT
   Forward SQL: ALTER TABLE brands ADD CONSTRAINT pk_brands PRIMARY KEY (id);
   Rollback SQL: ALTER TABLE brands DROP CONSTRAINT pk_brands;

8. Add foreign key constraint on products.brand_id (MEDIUM risk)
   Type: ADD_CONSTRAINT
   Forward SQL: ALTER TABLE products ADD CONSTRAINT fk_products_brand_id FOREIGN KEY (brand_id) REFERENCES brands(id);
   Rollback SQL: ALTER TABLE products DROP CONSTRAINT fk_products_brand_id;

9. Add unique constraint on brands(name) (MEDIUM risk)
   Type: ADD_CONSTRAINT
   Forward SQL: ALTER TABLE brands ADD CONSTRAINT uq_brands_name UNIQUE (name);
   Rollback SQL: ALTER TABLE brands DROP CONSTRAINT uq_brands_name;

10. Add check constraint: price > 0 (MEDIUM risk)
    Type: ADD_CONSTRAINT
    Forward SQL: ALTER TABLE products ADD CONSTRAINT chk_products_price_positive CHECK (price > 0);
    Rollback SQL: ALTER TABLE products DROP CONSTRAINT chk_products_price_positive;

11. Create index idx_products_brand_id on (brand_id) (LOW risk)
    Type: ADD_INDEX
    Forward SQL: CREATE INDEX idx_products_brand_id ON products (brand_id);
    Rollback SQL: DROP INDEX idx_products_brand_id;
    Estimated Time: 1-5 minutes depending on table size

12. Create index idx_users_email_verified on (email_verified) (LOW risk)
    Type: ADD_INDEX
    Forward SQL: CREATE INDEX idx_users_email_verified ON users (email_verified);
    Rollback SQL: DROP INDEX idx_users_email_verified;
    Estimated Time: 1-5 minutes depending on table size

VALIDATION CHECKS
-----------------
• Verify table brands exists
  SQL: SELECT COUNT(*) FROM information_schema.tables WHERE table_name = 'brands';
  Expected: 1

• Verify column brand_id exists in products
  SQL: SELECT COUNT(*) FROM information_schema.columns WHERE table_name = 'products' AND column_name = 'brand_id';
  Expected: 1

• Verify column email_verified exists in users
  SQL: SELECT COUNT(*) FROM information_schema.columns WHERE table_name = 'users' AND column_name = 'email_verified';
  Expected: 1

• Verify column modification in products
  SQL: SELECT data_type, is_nullable FROM information_schema.columns WHERE table_name = 'products' AND column_name = 'price';
  Expected: data_type reflects DECIMAL(12,2)

• Verify index idx_products_brand_id exists
  SQL: SELECT COUNT(*) FROM information_schema.statistics WHERE index_name = 'idx_products_brand_id';
  Expected: 1

• Verify index idx_users_email_verified exists
  SQL: SELECT COUNT(*) FROM information_schema.statistics WHERE index_name = 'idx_users_email_verified';
  Expected: 1
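Because every step carries both forward and rollback SQL, a full rollback script is just the rollback statements replayed in reverse step order, so dependents (indexes, constraints) are dropped before the objects they depend on. A minimal sketch of that idea, with step data abbreviated from the plan above:

```python
# Each step is a (forward_sql, rollback_sql) pair, in execution order.
steps = [
    ("CREATE TABLE brands (id INTEGER NOT NULL);", "DROP TABLE IF EXISTS brands;"),
    ("ALTER TABLE products ADD COLUMN brand_id INTEGER;",
     "ALTER TABLE products DROP COLUMN brand_id;"),
    ("CREATE INDEX idx_products_brand_id ON products (brand_id);",
     "DROP INDEX idx_products_brand_id;"),
]

def rollback_script(steps):
    """Emit rollback SQL in reverse order of the forward migration."""
    return "\n".join(rollback for _, rollback in reversed(steps))

print(rollback_script(steps))  # index drop first, table drop last
```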
@@ -0,0 +1,222 @@
DATABASE SCHEMA ANALYSIS REPORT
==================================================

SCHEMA OVERVIEW
---------------
Total Tables: 8
Total Columns: 52
Tables with Primary Keys: 8
Total Foreign Keys: 6
Total Indexes: 15

KEY RECOMMENDATIONS
-------------------
1. Address 3 high-severity issues immediately
2. Review 4 oversized VARCHAR columns for right-sizing
3. Consider adding 2 foreign key constraints for referential integrity
4. Review 8 normalization issues for schema optimization

NORMALIZATION ISSUES (8 total)
------------------------------
High: 2, Medium: 3, Low: 2, Warning: 1

• products: Column 'dimensions' appears to store delimited values
  Suggestion: Create separate table for individual values with foreign key relationship

• products: Column 'tags' appears to store delimited values
  Suggestion: Create separate table for individual values with foreign key relationship

• products: Columns ['category_name'] may have transitive dependency through 'category_id'
  Suggestion: Consider creating separate 'category' table with these columns

• orders: Columns ['shipping_street', 'shipping_city', 'shipping_state', 'shipping_postal_code', 'shipping_country'] may have transitive dependency through 'shipping_address_id'
  Suggestion: Consider creating separate 'shipping_address' table with these columns

• user_preferences: Column 'preferred_categories' appears to store delimited values
  Suggestion: Create separate table for individual values with foreign key relationship

DATA TYPE ISSUES (4 total)
--------------------------
• products.dimensions: string sizing antipattern
  Current: VARCHAR(50) → Suggested: appropriately sized VARCHAR or TEXT
  Rationale: Arbitrary VARCHAR lengths are often chosen without considering actual data length requirements

• products.tags: string sizing antipattern
  Current: VARCHAR(500) → Suggested: appropriately sized VARCHAR or TEXT
  Rationale: Arbitrary VARCHAR lengths are often chosen without considering actual data length requirements

• user_preferences.preferred_categories: string sizing antipattern
  Current: VARCHAR(500) → Suggested: appropriately sized VARCHAR or TEXT
  Rationale: Arbitrary VARCHAR lengths are often chosen without considering actual data length requirements

• user_preferences.email_notifications: VARCHAR(255) antipattern
  Current: VARCHAR(255) → Suggested: appropriately sized VARCHAR or TEXT
  Rationale: VARCHAR(255) is often used as a default without considering actual data length requirements

CONSTRAINT ISSUES (12 total)
----------------------------
High: 0, Medium: 4, Low: 8

• products: Column 'price' should validate positive values
  Suggestion: Add CHECK constraint: price > 0

• products: Column 'inventory_count' should validate non-negative values
  Suggestion: Add CHECK constraint: inventory_count >= 0

• orders: Column 'total_amount' should validate positive values
  Suggestion: Add CHECK constraint: total_amount > 0

• order_items: Column 'quantity' should validate positive values
  Suggestion: Add CHECK constraint: quantity > 0

• order_items: Column 'unit_price' should validate positive values
  Suggestion: Add CHECK constraint: unit_price > 0

MISSING INDEXES (3 total)
-------------------------
• addresses.user_id (foreign_key)
  SQL: CREATE INDEX idx_addresses_user_id ON addresses (user_id);

• product_reviews.product_id (foreign_key)
  SQL: CREATE INDEX idx_product_reviews_product_id ON product_reviews (product_id);

• shopping_cart.user_id (foreign_key)
  SQL: CREATE INDEX idx_shopping_cart_user_id ON shopping_cart (user_id);

MERMAID ERD
===========
erDiagram
    USERS {
        INTEGER id "PK"
        VARCHAR(255) email "NOT NULL"
        VARCHAR(50) username "NOT NULL"
        VARCHAR(255) password_hash "NOT NULL"
        VARCHAR(100) first_name
        VARCHAR(100) last_name
        TIMESTAMP created_at
        TIMESTAMP updated_at
        VARCHAR(20) status
    }

    CATEGORIES {
        INTEGER id "PK"
        VARCHAR(100) name "NOT NULL"
        VARCHAR(100) slug "NOT NULL UNIQUE"
        INTEGER parent_id "FK"
        TEXT description
        BOOLEAN is_active
        INTEGER sort_order
        TIMESTAMP created_at
    }

    PRODUCTS {
        INTEGER id "PK"
        VARCHAR(255) name "NOT NULL"
        VARCHAR(50) sku "NOT NULL UNIQUE"
        TEXT description
        DECIMAL(10,2) price "NOT NULL"
        DECIMAL(10,2) cost
        DECIMAL(8,2) weight
        VARCHAR(50) dimensions
        INTEGER category_id "FK"
        VARCHAR(100) category_name
        VARCHAR(100) brand
        VARCHAR(500) tags
        INTEGER inventory_count
        INTEGER reorder_point
        VARCHAR(100) supplier_name
        VARCHAR(255) supplier_contact
        BOOLEAN is_active
        BOOLEAN featured
        TIMESTAMP created_at
        TIMESTAMP updated_at
    }

    ADDRESSES {
        INTEGER id "PK"
        INTEGER user_id "FK"
        VARCHAR(20) address_type
        VARCHAR(255) street_address "NOT NULL"
        VARCHAR(255) street_address_2
        VARCHAR(100) city "NOT NULL"
        VARCHAR(50) state "NOT NULL"
        VARCHAR(20) postal_code "NOT NULL"
        VARCHAR(50) country "NOT NULL"
        BOOLEAN is_default
        TIMESTAMP created_at
    }

    ORDERS {
        INTEGER id "PK"
        VARCHAR(50) order_number "NOT NULL UNIQUE"
        INTEGER user_id "FK"
        VARCHAR(255) user_email
        VARCHAR(200) user_name
        VARCHAR(50) status "NOT NULL"
        DECIMAL(10,2) total_amount "NOT NULL"
        DECIMAL(10,2) tax_amount "NOT NULL"
        DECIMAL(10,2) shipping_amount "NOT NULL"
        DECIMAL(10,2) discount_amount
        VARCHAR(50) payment_method
        VARCHAR(50) payment_status
        INTEGER shipping_address_id "FK"
        INTEGER billing_address_id "FK"
        VARCHAR(255) shipping_street
        VARCHAR(100) shipping_city
        VARCHAR(50) shipping_state
        VARCHAR(20) shipping_postal_code
        VARCHAR(50) shipping_country
        TEXT notes
        TIMESTAMP created_at
        TIMESTAMP updated_at
        TIMESTAMP shipped_at
        TIMESTAMP delivered_at
    }

    ORDER_ITEMS {
        INTEGER id "PK"
        INTEGER order_id "FK"
        INTEGER product_id "FK"
        VARCHAR(255) product_name
        VARCHAR(50) product_sku
        INTEGER quantity "NOT NULL"
        DECIMAL(10,2) unit_price "NOT NULL"
        DECIMAL(10,2) total_price "NOT NULL"
        TIMESTAMP created_at
    }

    SHOPPING_CART {
        INTEGER id "PK"
        INTEGER user_id "FK"
        VARCHAR(255) session_id
        INTEGER product_id "FK"
        INTEGER quantity "NOT NULL"
        TIMESTAMP added_at
        TIMESTAMP updated_at
    }

    PRODUCT_REVIEWS {
        INTEGER id "PK"
        INTEGER product_id "FK"
        INTEGER user_id "FK"
        INTEGER rating "NOT NULL"
        VARCHAR(200) title
        TEXT review_text
        BOOLEAN verified_purchase
        INTEGER helpful_count
        TIMESTAMP created_at
        TIMESTAMP updated_at
    }

    CATEGORIES ||--o{ CATEGORIES : has
    CATEGORIES ||--o{ PRODUCTS : has
    USERS ||--o{ ADDRESSES : has
    USERS ||--o{ ORDERS : has
    USERS ||--o{ SHOPPING_CART : has
    USERS ||--o{ PRODUCT_REVIEWS : has
    ADDRESSES ||--o{ ORDERS : has
    ORDERS ||--o{ ORDER_ITEMS : has
    PRODUCTS ||--o{ ORDER_ITEMS : has
    PRODUCTS ||--o{ SHOPPING_CART : has
    PRODUCTS ||--o{ PRODUCT_REVIEWS : has
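Entity blocks like the ones in the ERD above can be generated mechanically from the column metadata. A hedged sketch of one way to do it (the field names `primary_key`, `foreign_key`, and `nullable` mirror the sample schema JSON; this is not the skill's shipped generator, and it annotates a primary key as "PK NOT NULL" rather than bare "PK"):

```python
def mermaid_entity(name, columns):
    """Render one Mermaid erDiagram entity block from a column mapping."""
    lines = [f"    {name.upper()} {{"]
    for col, meta in columns.items():
        notes = []
        if meta.get("primary_key"):
            notes.append("PK")
        if meta.get("foreign_key"):
            notes.append("FK")
        if not meta.get("nullable", True):
            notes.append("NOT NULL")
        suffix = f' "{" ".join(notes)}"' if notes else ""
        lines.append(f'        {meta["type"]} {col}{suffix}')
    lines.append("    }")
    return "\n".join(lines)

print(mermaid_entity("users", {
    "id": {"type": "INTEGER", "primary_key": True, "nullable": False},
    "email": {"type": "VARCHAR(255)", "nullable": False},
}))
```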
@@ -0,0 +1,926 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Database Index Optimizer
|
||||
|
||||
Analyzes schema definitions and query patterns to recommend optimal indexes:
|
||||
- Identifies missing indexes for common query patterns
|
||||
- Detects redundant and overlapping indexes
|
||||
- Suggests composite index column ordering
|
||||
- Estimates selectivity and performance impact
|
||||
- Generates CREATE INDEX statements with rationale
|
||||
|
||||
Input: Schema JSON + Query patterns JSON
|
||||
Output: Index recommendations + CREATE INDEX SQL + before/after analysis
|
||||
|
||||
Usage:
|
||||
python index_optimizer.py --schema schema.json --queries queries.json --output recommendations.json
|
||||
python index_optimizer.py --schema schema.json --queries queries.json --format text
|
||||
python index_optimizer.py --schema schema.json --queries queries.json --analyze-existing
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import re
|
||||
import sys
|
||||
from collections import defaultdict, namedtuple, Counter
|
||||
from typing import Dict, List, Set, Tuple, Optional, Any
|
||||
from dataclasses import dataclass, asdict
|
||||
import hashlib
|
||||
|
||||
|
||||
@dataclass
class Column:
    name: str
    data_type: str
    nullable: bool = True
    unique: bool = False
    cardinality_estimate: Optional[int] = None


@dataclass
class Index:
    name: str
    table: str
    columns: List[str]
    unique: bool = False
    index_type: str = "btree"
    partial_condition: Optional[str] = None
    include_columns: Optional[List[str]] = None
    size_estimate: Optional[int] = None


@dataclass
class QueryPattern:
    query_id: str
    query_type: str  # SELECT, INSERT, UPDATE, DELETE
    table: str
    where_conditions: List[Dict[str, Any]]
    join_conditions: List[Dict[str, Any]]
    order_by: List[Dict[str, str]]  # column, direction
    group_by: List[str]
    frequency: int = 1
    avg_execution_time_ms: Optional[float] = None


@dataclass
class IndexRecommendation:
    recommendation_id: str
    table: str
    recommended_index: Index
    reason: str
    query_patterns_helped: List[str]
    estimated_benefit: str
    estimated_overhead: str
    priority: int  # 1 = highest priority
    sql_statement: str
    selectivity_analysis: Dict[str, Any]


@dataclass
class RedundancyIssue:
    issue_type: str  # DUPLICATE, OVERLAPPING, UNUSED
    affected_indexes: List[str]
    table: str
    description: str
    recommendation: str
    sql_statements: List[str]


class SelectivityEstimator:
    """Estimates column selectivity based on naming patterns and data types."""

    def __init__(self):
        # Selectivity patterns based on common column names and types
        self.high_selectivity_patterns = [
            r'.*_id$', r'^id$', r'uuid', r'guid', r'email', r'username', r'ssn',
            r'account.*number', r'transaction.*id', r'reference.*number'
        ]

        self.medium_selectivity_patterns = [
            r'name$', r'title$', r'description$', r'address', r'phone', r'zip',
            r'postal.*code', r'serial.*number', r'sku', r'product.*code'
        ]

        self.low_selectivity_patterns = [
            r'status$', r'type$', r'category', r'state$', r'flag$', r'active$',
            r'enabled$', r'deleted$', r'visible$', r'gender$', r'priority$'
        ]

        self.very_low_selectivity_patterns = [
            r'is_.*', r'has_.*', r'can_.*', r'boolean', r'bool'
        ]

    def estimate_selectivity(self, column: Column, table_size_estimate: int = 10000) -> float:
        """Estimate column selectivity (0.0 = all same values, 1.0 = all unique values)."""
        column_name_lower = column.name.lower()

        # Primary key or unique columns
        if column.unique or column.name.lower() in ['id', 'uuid', 'guid']:
            return 1.0

        # Check cardinality estimate if available
        if column.cardinality_estimate:
            return min(column.cardinality_estimate / table_size_estimate, 1.0)

        # Pattern-based estimation
        for pattern in self.high_selectivity_patterns:
            if re.search(pattern, column_name_lower):
                return 0.9  # Very high selectivity

        for pattern in self.medium_selectivity_patterns:
            if re.search(pattern, column_name_lower):
                return 0.7  # Good selectivity

        for pattern in self.low_selectivity_patterns:
            if re.search(pattern, column_name_lower):
                return 0.2  # Poor selectivity

        for pattern in self.very_low_selectivity_patterns:
            if re.search(pattern, column_name_lower):
                return 0.1  # Very poor selectivity

        # Data type based estimation
        data_type_upper = column.data_type.upper()
        if data_type_upper.startswith('BOOL'):
            return 0.1
        elif data_type_upper.startswith(('TINYINT', 'SMALLINT')):
            return 0.3
        elif data_type_upper.startswith('INT'):
            return 0.8
        elif data_type_upper.startswith(('VARCHAR', 'TEXT')):
            # Estimate based on column name
            if 'name' in column_name_lower:
                return 0.7
            elif 'description' in column_name_lower or 'comment' in column_name_lower:
                return 0.9
            else:
                return 0.6

        # Default moderate selectivity
        return 0.5


class IndexOptimizer:
    def __init__(self):
        self.tables: Dict[str, Dict[str, Column]] = {}
        self.existing_indexes: Dict[str, List[Index]] = {}
        self.query_patterns: List[QueryPattern] = []
        self.selectivity_estimator = SelectivityEstimator()

        # Configuration
        self.max_composite_index_columns = 6
        self.min_selectivity_for_index = 0.1
        self.redundancy_overlap_threshold = 0.8

    def load_schema(self, schema_data: Dict[str, Any]) -> None:
        """Load schema definition."""
        if 'tables' not in schema_data:
            raise ValueError("Schema must contain 'tables' key")

        for table_name, table_def in schema_data['tables'].items():
            self.tables[table_name] = {}
            self.existing_indexes[table_name] = []

            # Load columns
            for col_name, col_def in table_def.get('columns', {}).items():
                column = Column(
                    name=col_name,
                    data_type=col_def.get('type', 'VARCHAR(255)'),
                    nullable=col_def.get('nullable', True),
                    unique=col_def.get('unique', False),
                    cardinality_estimate=col_def.get('cardinality_estimate')
                )
                self.tables[table_name][col_name] = column

            # Load existing indexes
            for idx_def in table_def.get('indexes', []):
                index = Index(
                    name=idx_def['name'],
                    table=table_name,
                    columns=idx_def['columns'],
                    unique=idx_def.get('unique', False),
                    index_type=idx_def.get('type', 'btree'),
                    partial_condition=idx_def.get('partial_condition'),
                    include_columns=idx_def.get('include_columns', [])
                )
                self.existing_indexes[table_name].append(index)

    def load_query_patterns(self, query_data: Dict[str, Any]) -> None:
        """Load query patterns for analysis."""
        if 'queries' not in query_data:
            raise ValueError("Query data must contain 'queries' key")

        for query_def in query_data['queries']:
            pattern = QueryPattern(
                query_id=query_def['id'],
                query_type=query_def.get('type', 'SELECT').upper(),
                table=query_def['table'],
                where_conditions=query_def.get('where_conditions', []),
                join_conditions=query_def.get('join_conditions', []),
                order_by=query_def.get('order_by', []),
                group_by=query_def.get('group_by', []),
                frequency=query_def.get('frequency', 1),
                avg_execution_time_ms=query_def.get('avg_execution_time_ms')
            )
            self.query_patterns.append(pattern)

    def analyze_missing_indexes(self) -> List[IndexRecommendation]:
        """Identify missing indexes based on query patterns."""
        recommendations = []

        for pattern in self.query_patterns:
            table_name = pattern.table
            if table_name not in self.tables:
                continue

            # Analyze WHERE conditions for single-column indexes
            for condition in pattern.where_conditions:
                column = condition.get('column')
                operator = condition.get('operator', '=')

                if column and column in self.tables[table_name]:
                    if not self._has_covering_index(table_name, [column]):
                        recommendation = self._create_single_column_recommendation(
                            table_name, column, pattern, operator
                        )
                        if recommendation:
                            recommendations.append(recommendation)

            # Analyze composite indexes for multi-column WHERE conditions
            where_columns = [cond.get('column') for cond in pattern.where_conditions
                             if cond.get('column') and cond.get('column') in self.tables[table_name]]

            if len(where_columns) > 1:
                composite_recommendation = self._create_composite_recommendation(
                    table_name, where_columns, pattern
                )
                if composite_recommendation:
                    recommendations.append(composite_recommendation)

            # Analyze covering indexes for SELECT with ORDER BY
            if pattern.order_by and where_columns:
                covering_recommendation = self._create_covering_index_recommendation(
                    table_name, where_columns, pattern
                )
                if covering_recommendation:
                    recommendations.append(covering_recommendation)

            # Analyze JOIN conditions
            for join_condition in pattern.join_conditions:
                local_column = join_condition.get('local_column')
                if local_column and local_column in self.tables[table_name]:
                    if not self._has_covering_index(table_name, [local_column]):
                        recommendation = self._create_join_index_recommendation(
                            table_name, local_column, pattern, join_condition
                        )
                        if recommendation:
                            recommendations.append(recommendation)

        # Remove duplicates and prioritize
        recommendations = self._deduplicate_recommendations(recommendations)
        recommendations = self._prioritize_recommendations(recommendations)

        return recommendations

    def _has_covering_index(self, table_name: str, columns: List[str]) -> bool:
        """Check if existing indexes cover the specified columns."""
        if table_name not in self.existing_indexes:
            return False

        for index in self.existing_indexes[table_name]:
            # Check if index starts with required columns (prefix match for composite)
            if len(index.columns) >= len(columns):
                if index.columns[:len(columns)] == columns:
                    return True

        return False

    def _create_single_column_recommendation(
        self,
        table_name: str,
        column: str,
        pattern: QueryPattern,
        operator: str
    ) -> Optional[IndexRecommendation]:
        """Create recommendation for single-column index."""
        column_obj = self.tables[table_name][column]
        selectivity = self.selectivity_estimator.estimate_selectivity(column_obj)

        # Skip very low selectivity columns unless frequently used
        if selectivity < self.min_selectivity_for_index and pattern.frequency < 100:
            return None

        index_name = f"idx_{table_name}_{column}"
        index = Index(
            name=index_name,
            table=table_name,
            columns=[column],
            unique=column_obj.unique,
            index_type="btree"
        )

        reason = f"Optimize WHERE {column} {operator} queries"
        if pattern.frequency > 10:
            reason += f" (used {pattern.frequency} times)"

        return IndexRecommendation(
            recommendation_id=self._generate_recommendation_id(table_name, [column]),
            table=table_name,
            recommended_index=index,
            reason=reason,
            query_patterns_helped=[pattern.query_id],
            estimated_benefit=self._estimate_benefit(selectivity, pattern.frequency),
            estimated_overhead="Low (single column)",
            priority=self._calculate_priority(selectivity, pattern.frequency, 1),
            sql_statement=f"CREATE INDEX {index_name} ON {table_name} ({column});",
            selectivity_analysis={
                "column_selectivity": selectivity,
                "estimated_reduction": f"{int(selectivity * 100)}%"
            }
        )

    def _create_composite_recommendation(
        self,
        table_name: str,
        columns: List[str],
        pattern: QueryPattern
    ) -> Optional[IndexRecommendation]:
        """Create recommendation for composite index."""
        if len(columns) > self.max_composite_index_columns:
            columns = columns[:self.max_composite_index_columns]

        # Order columns by selectivity (most selective first)
        column_selectivities = []
        for col in columns:
            col_obj = self.tables[table_name][col]
            selectivity = self.selectivity_estimator.estimate_selectivity(col_obj)
            column_selectivities.append((col, selectivity))

        # Sort by selectivity descending
        column_selectivities.sort(key=lambda x: x[1], reverse=True)
        ordered_columns = [col for col, _ in column_selectivities]

        # Calculate combined selectivity
        combined_selectivity = min(sum(sel for _, sel in column_selectivities) / len(columns), 0.95)

        index_name = f"idx_{table_name}_{'_'.join(ordered_columns)}"
        if len(index_name) > 63:  # PostgreSQL identifier length limit
            index_name = f"idx_{table_name}_composite_{abs(hash('_'.join(ordered_columns))) % 10000}"

        index = Index(
            name=index_name,
            table=table_name,
            columns=ordered_columns,
            index_type="btree"
        )

        reason = f"Optimize multi-column WHERE conditions: {', '.join(ordered_columns)}"

        return IndexRecommendation(
            recommendation_id=self._generate_recommendation_id(table_name, ordered_columns),
            table=table_name,
            recommended_index=index,
            reason=reason,
            query_patterns_helped=[pattern.query_id],
            estimated_benefit=self._estimate_benefit(combined_selectivity, pattern.frequency),
            estimated_overhead=f"Medium (composite index with {len(ordered_columns)} columns)",
            priority=self._calculate_priority(combined_selectivity, pattern.frequency, len(ordered_columns)),
            sql_statement=f"CREATE INDEX {index_name} ON {table_name} ({', '.join(ordered_columns)});",
            selectivity_analysis={
                "column_selectivities": {col: sel for col, sel in column_selectivities},
                "combined_selectivity": combined_selectivity,
                "column_order_rationale": "Ordered by selectivity (most selective first)"
            }
        )

    def _create_covering_index_recommendation(
        self,
        table_name: str,
        where_columns: List[str],
        pattern: QueryPattern
    ) -> Optional[IndexRecommendation]:
        """Create recommendation for covering index."""
        order_columns = [col['column'] for col in pattern.order_by if col['column'] in self.tables[table_name]]

        # Combine WHERE and ORDER BY columns
        index_columns = where_columns.copy()
        include_columns = []

        # Add ORDER BY columns to index columns
        for col in order_columns:
            if col not in index_columns:
                index_columns.append(col)

        # Limit index columns; overflow columns become INCLUDE columns
        if len(index_columns) > self.max_composite_index_columns:
            include_columns = index_columns[self.max_composite_index_columns:]
            index_columns = index_columns[:self.max_composite_index_columns]

        index_name = f"idx_{table_name}_covering_{'_'.join(index_columns[:3])}"
        if len(index_name) > 63:
            index_name = f"idx_{table_name}_covering_{abs(hash('_'.join(index_columns))) % 10000}"

        index = Index(
            name=index_name,
            table=table_name,
            columns=index_columns,
            include_columns=include_columns,
            index_type="btree"
        )

        reason = "Covering index for WHERE + ORDER BY optimization"

        # Calculate selectivity for main columns
        main_selectivity = 0.5  # Default for covering indexes
        if where_columns:
            selectivities = [
                self.selectivity_estimator.estimate_selectivity(self.tables[table_name][col])
                for col in where_columns[:2]  # Consider first 2 columns
            ]
            main_selectivity = max(selectivities)

        sql_parts = [f"CREATE INDEX {index_name} ON {table_name} ({', '.join(index_columns)})"]
        if include_columns:
            sql_parts.append(f" INCLUDE ({', '.join(include_columns)})")
        sql_statement = ''.join(sql_parts) + ";"

        return IndexRecommendation(
            recommendation_id=self._generate_recommendation_id(table_name, index_columns, "covering"),
            table=table_name,
            recommended_index=index,
            reason=reason,
            query_patterns_helped=[pattern.query_id],
            estimated_benefit="High (eliminates table lookups for SELECT)",
            estimated_overhead=f"High (covering index with {len(index_columns)} columns)",
            priority=self._calculate_priority(main_selectivity, pattern.frequency, len(index_columns)),
            sql_statement=sql_statement,
            selectivity_analysis={
                "main_columns_selectivity": main_selectivity,
                "covering_benefit": "Eliminates table lookup for SELECT queries"
            }
        )

    def _create_join_index_recommendation(
        self,
        table_name: str,
        column: str,
        pattern: QueryPattern,
        join_condition: Dict[str, Any]
    ) -> Optional[IndexRecommendation]:
        """Create recommendation for JOIN optimization index."""
        column_obj = self.tables[table_name][column]
        selectivity = self.selectivity_estimator.estimate_selectivity(column_obj)

        index_name = f"idx_{table_name}_{column}_join"
        index = Index(
            name=index_name,
            table=table_name,
            columns=[column],
            index_type="btree"
        )

        foreign_table = join_condition.get('foreign_table', 'unknown')
        reason = f"Optimize JOIN with {foreign_table} table on {column}"

        return IndexRecommendation(
            recommendation_id=self._generate_recommendation_id(table_name, [column], "join"),
            table=table_name,
            recommended_index=index,
            reason=reason,
            query_patterns_helped=[pattern.query_id],
            estimated_benefit=self._estimate_join_benefit(pattern.frequency),
            estimated_overhead="Low (single column for JOIN)",
            priority=2,  # JOINs are generally high priority
            sql_statement=f"CREATE INDEX {index_name} ON {table_name} ({column});",
            selectivity_analysis={
                "column_selectivity": selectivity,
                "join_optimization": True
            }
        )

    def _generate_recommendation_id(self, table: str, columns: List[str], suffix: str = "") -> str:
        """Generate unique recommendation ID."""
        content = f"{table}_{'_'.join(sorted(columns))}_{suffix}"
        return hashlib.md5(content.encode()).hexdigest()[:8]

    def _estimate_benefit(self, selectivity: float, frequency: int) -> str:
        """Estimate performance benefit of index."""
        if selectivity > 0.8 and frequency > 50:
            return "Very High"
        elif selectivity > 0.6 and frequency > 20:
            return "High"
        elif selectivity > 0.4 or frequency > 10:
            return "Medium"
        else:
            return "Low"

    def _estimate_join_benefit(self, frequency: int) -> str:
        """Estimate benefit for JOIN indexes."""
        if frequency > 50:
            return "Very High (frequent JOINs)"
        elif frequency > 20:
            return "High (regular JOINs)"
        elif frequency > 5:
            return "Medium (occasional JOINs)"
        else:
            return "Low (rare JOINs)"

    def _calculate_priority(self, selectivity: float, frequency: int, column_count: int) -> int:
        """Calculate priority score (1 = highest priority)."""
        # Base score calculation
        score = 0

        # Selectivity contribution (0-50 points)
        score += int(selectivity * 50)

        # Frequency contribution (0-30 points)
        score += min(frequency, 30)

        # Penalty for complex indexes (subtract points)
        score -= (column_count - 1) * 5

        # Convert to priority levels
        if score >= 70:
            return 1  # Highest
        elif score >= 50:
            return 2  # High
        elif score >= 30:
            return 3  # Medium
        else:
            return 4  # Low

    def _deduplicate_recommendations(self, recommendations: List[IndexRecommendation]) -> List[IndexRecommendation]:
        """Remove duplicate recommendations."""
        seen_indexes = set()
        unique_recommendations = []

        for rec in recommendations:
            index_signature = (rec.table, tuple(rec.recommended_index.columns))
            if index_signature not in seen_indexes:
                seen_indexes.add(index_signature)
                unique_recommendations.append(rec)
            else:
                # Merge query patterns helped
                for existing_rec in unique_recommendations:
                    if (existing_rec.table == rec.table and
                            existing_rec.recommended_index.columns == rec.recommended_index.columns):
                        existing_rec.query_patterns_helped.extend(rec.query_patterns_helped)
                        break

        return unique_recommendations

    def _prioritize_recommendations(self, recommendations: List[IndexRecommendation]) -> List[IndexRecommendation]:
        """Sort recommendations by priority."""
        return sorted(recommendations, key=lambda x: (x.priority, -len(x.query_patterns_helped)))

    def analyze_redundant_indexes(self) -> List[RedundancyIssue]:
        """Identify redundant, overlapping, and potentially unused indexes."""
        redundancy_issues = []

        for table_name, indexes in self.existing_indexes.items():
            if len(indexes) < 2:
                continue

            # Find duplicate indexes
            duplicates = self._find_duplicate_indexes(table_name, indexes)
            redundancy_issues.extend(duplicates)

            # Find overlapping indexes
            overlapping = self._find_overlapping_indexes(table_name, indexes)
            redundancy_issues.extend(overlapping)

            # Find potentially unused indexes
            unused = self._find_unused_indexes(table_name, indexes)
            redundancy_issues.extend(unused)

        return redundancy_issues

    def _find_duplicate_indexes(self, table_name: str, indexes: List[Index]) -> List[RedundancyIssue]:
        """Find exact duplicate indexes."""
        issues = []
        seen_signatures = {}

        for index in indexes:
            signature = (tuple(index.columns), index.unique, index.partial_condition)
            if signature in seen_signatures:
                existing_index = seen_signatures[signature]
                issues.append(RedundancyIssue(
                    issue_type="DUPLICATE",
                    affected_indexes=[existing_index.name, index.name],
                    table=table_name,
                    description=f"Indexes '{existing_index.name}' and '{index.name}' are identical",
                    recommendation="Drop one of the duplicate indexes",
                    sql_statements=[f"DROP INDEX {index.name};"]
                ))
            else:
                seen_signatures[signature] = index

        return issues

    def _find_overlapping_indexes(self, table_name: str, indexes: List[Index]) -> List[RedundancyIssue]:
        """Find overlapping indexes that might be redundant."""
        issues = []

        for i, index1 in enumerate(indexes):
            for index2 in indexes[i + 1:]:
                overlap_ratio = self._calculate_overlap_ratio(index1, index2)

                if overlap_ratio >= self.redundancy_overlap_threshold:
                    # Determine which index to keep (prefer the wider one)
                    if len(index1.columns) <= len(index2.columns):
                        redundant_index = index1
                        keep_index = index2
                    else:
                        redundant_index = index2
                        keep_index = index1

                    issues.append(RedundancyIssue(
                        issue_type="OVERLAPPING",
                        affected_indexes=[index1.name, index2.name],
                        table=table_name,
                        description=f"Index '{redundant_index.name}' overlaps {int(overlap_ratio * 100)}% "
                                    f"with '{keep_index.name}'",
                        recommendation=f"Consider dropping '{redundant_index.name}' as it's largely "
                                       f"covered by '{keep_index.name}'",
                        sql_statements=[f"DROP INDEX {redundant_index.name};"]
                    ))

        return issues

    def _calculate_overlap_ratio(self, index1: Index, index2: Index) -> float:
        """Calculate the overlap ratio (Jaccard similarity) between two indexes' column sets."""
        cols1 = set(index1.columns)
        cols2 = set(index2.columns)

        if not cols1 or not cols2:
            return 0.0

        intersection = len(cols1.intersection(cols2))
        union = len(cols1.union(cols2))

        return intersection / union if union > 0 else 0.0

    def _find_unused_indexes(self, table_name: str, indexes: List[Index]) -> List[RedundancyIssue]:
        """Find potentially unused indexes based on query patterns."""
        issues = []

        # Collect all columns used in query patterns for this table
        used_columns = set()
        table_patterns = [p for p in self.query_patterns if p.table == table_name]

        for pattern in table_patterns:
            # Add WHERE condition columns
            for condition in pattern.where_conditions:
                if condition.get('column'):
                    used_columns.add(condition['column'])

            # Add JOIN columns
            for join in pattern.join_conditions:
                if join.get('local_column'):
                    used_columns.add(join['local_column'])

            # Add ORDER BY columns
            for order in pattern.order_by:
                if order.get('column'):
                    used_columns.add(order['column'])

            # Add GROUP BY columns
            used_columns.update(pattern.group_by)

        if not used_columns:
            return issues  # Can't determine usage without query patterns

        for index in indexes:
            index_columns = set(index.columns)
            if not index_columns.intersection(used_columns):
                issues.append(RedundancyIssue(
                    issue_type="UNUSED",
                    affected_indexes=[index.name],
                    table=table_name,
                    description=f"Index '{index.name}' columns {index.columns} are not used in any query patterns",
                    recommendation="Consider dropping this index if it's truly unused (verify with query logs)",
                    sql_statements=[f"-- Review usage before dropping\n-- DROP INDEX {index.name};"]
                ))

        return issues

    def estimate_index_sizes(self) -> Dict[str, Dict[str, Any]]:
        """Estimate storage requirements for recommended indexes."""
        size_estimates = {}

        # This is a simplified estimation - in practice, would need actual table statistics
        for table_name in self.tables:
            size_estimates[table_name] = {
                "estimated_table_rows": 10000,  # Default estimate
                "existing_indexes_size_mb": len(self.existing_indexes.get(table_name, [])) * 5,  # Rough estimate
                "index_overhead_per_column_mb": 2  # Rough estimate per column
            }

        return size_estimates

    def generate_analysis_report(self) -> Dict[str, Any]:
        """Generate comprehensive analysis report."""
        recommendations = self.analyze_missing_indexes()
        redundancy_issues = self.analyze_redundant_indexes()
        size_estimates = self.estimate_index_sizes()

        # Calculate statistics
        total_existing_indexes = sum(len(indexes) for indexes in self.existing_indexes.values())
        tables_analyzed = len(self.tables)
        query_patterns_analyzed = len(self.query_patterns)

        # Categorize recommendations by priority
        high_priority = [r for r in recommendations if r.priority <= 2]
        medium_priority = [r for r in recommendations if r.priority == 3]
        low_priority = [r for r in recommendations if r.priority >= 4]

        return {
            "analysis_summary": {
                "tables_analyzed": tables_analyzed,
                "query_patterns_analyzed": query_patterns_analyzed,
                "existing_indexes": total_existing_indexes,
                "total_recommendations": len(recommendations),
                "high_priority_recommendations": len(high_priority),
                "redundancy_issues_found": len(redundancy_issues)
            },
            "index_recommendations": {
                "high_priority": [asdict(r) for r in high_priority],
                "medium_priority": [asdict(r) for r in medium_priority],
                "low_priority": [asdict(r) for r in low_priority]
            },
            "redundancy_analysis": [asdict(issue) for issue in redundancy_issues],
            "size_estimates": size_estimates,
            "sql_statements": {
                "create_indexes": [rec.sql_statement for rec in recommendations],
                "drop_redundant": [
                    stmt for issue in redundancy_issues
                    for stmt in issue.sql_statements
                ]
            },
            "performance_impact": self._generate_performance_impact_analysis(recommendations)
        }

    def _generate_performance_impact_analysis(self, recommendations: List[IndexRecommendation]) -> Dict[str, Any]:
        """Generate performance impact analysis."""
        impact_analysis = {
            "query_optimization": {},
            "write_overhead": {},
            "storage_impact": {}
        }

        # Analyze query optimization impact
        query_benefits = defaultdict(list)
        for rec in recommendations:
            for query_id in rec.query_patterns_helped:
                query_benefits[query_id].append(rec.estimated_benefit)

        impact_analysis["query_optimization"] = {
            "queries_improved": len(query_benefits),
            "high_impact_queries": len([q for q, benefits in query_benefits.items()
                                        if any("High" in benefit for benefit in benefits)]),
            "benefit_distribution": dict(Counter(
                rec.estimated_benefit for rec in recommendations
            ))
        }

        # Analyze write overhead
        impact_analysis["write_overhead"] = {
            "total_new_indexes": len(recommendations),
            "estimated_insert_overhead": f"{len(recommendations) * 5}%",  # Rough estimate
            "tables_most_affected": list(Counter(rec.table for rec in recommendations).most_common(3))
        }

        return impact_analysis

    def format_text_report(self, analysis: Dict[str, Any]) -> str:
        """Format analysis as human-readable text report."""
        lines = []
        lines.append("DATABASE INDEX OPTIMIZATION REPORT")
        lines.append("=" * 50)
        lines.append("")

        # Summary
        summary = analysis["analysis_summary"]
        lines.append("ANALYSIS SUMMARY")
        lines.append("-" * 16)
        lines.append(f"Tables Analyzed: {summary['tables_analyzed']}")
        lines.append(f"Query Patterns: {summary['query_patterns_analyzed']}")
        lines.append(f"Existing Indexes: {summary['existing_indexes']}")
        lines.append(f"New Recommendations: {summary['total_recommendations']}")
        lines.append(f"High Priority: {summary['high_priority_recommendations']}")
        lines.append(f"Redundancy Issues: {summary['redundancy_issues_found']}")
        lines.append("")

        # High Priority Recommendations
        high_priority = analysis["index_recommendations"]["high_priority"]
        if high_priority:
            lines.append(f"HIGH PRIORITY RECOMMENDATIONS ({len(high_priority)})")
            lines.append("-" * 35)
            for i, rec in enumerate(high_priority[:10], 1):  # Show top 10
                lines.append(f"{i}. {rec['table']}: {rec['reason']}")
                lines.append(f"   Columns: {', '.join(rec['recommended_index']['columns'])}")
                lines.append(f"   Benefit: {rec['estimated_benefit']}")
                lines.append(f"   SQL: {rec['sql_statement']}")
                lines.append("")

        # Redundancy Issues
        redundancy = analysis["redundancy_analysis"]
        if redundancy:
            lines.append(f"REDUNDANCY ISSUES ({len(redundancy)})")
            lines.append("-" * 20)
            for issue in redundancy[:5]:  # Show first 5
                lines.append(f"• {issue['issue_type']}: {issue['description']}")
                lines.append(f"  Recommendation: {issue['recommendation']}")
                if issue['sql_statements']:
                    lines.append(f"  SQL: {issue['sql_statements'][0]}")
                lines.append("")

        # Performance Impact
        perf_impact = analysis["performance_impact"]
        lines.append("PERFORMANCE IMPACT ANALYSIS")
        lines.append("-" * 30)
        query_opt = perf_impact["query_optimization"]
        lines.append(f"Queries to be optimized: {query_opt['queries_improved']}")
        lines.append(f"High impact optimizations: {query_opt['high_impact_queries']}")

        write_overhead = perf_impact["write_overhead"]
        lines.append(f"Estimated insert overhead: {write_overhead['estimated_insert_overhead']}")
        lines.append("")

        # SQL Statements Summary
        sql_statements = analysis["sql_statements"]
        create_statements = sql_statements["create_indexes"]
        if create_statements:
            lines.append("RECOMMENDED CREATE INDEX STATEMENTS")
            lines.append("-" * 36)
            for i, stmt in enumerate(create_statements[:10], 1):
                lines.append(f"{i}. {stmt}")

            if len(create_statements) > 10:
                lines.append(f"... and {len(create_statements) - 10} more")
            lines.append("")

        return "\n".join(lines)


def main():
    parser = argparse.ArgumentParser(description="Optimize database indexes based on schema and query patterns")
    parser.add_argument("--schema", "-s", required=True, help="Schema definition JSON file")
    parser.add_argument("--queries", "-q", required=True, help="Query patterns JSON file")
    parser.add_argument("--output", "-o", help="Output file (default: stdout)")
    parser.add_argument("--format", "-f", choices=["json", "text"], default="text",
                        help="Output format")
    parser.add_argument("--analyze-existing", "-e", action="store_true",
                        help="Include analysis of existing indexes")
    parser.add_argument("--min-priority", "-p", type=int, default=4,
                        help="Minimum priority level to include (1=highest, 4=lowest)")

    args = parser.parse_args()

    try:
        # Load schema
        with open(args.schema, 'r') as f:
            schema_data = json.load(f)

        # Load queries
        with open(args.queries, 'r') as f:
            query_data = json.load(f)

        # Initialize optimizer
        optimizer = IndexOptimizer()
        optimizer.load_schema(schema_data)
        optimizer.load_query_patterns(query_data)

        # Generate analysis
        analysis = optimizer.generate_analysis_report()

        # Filter by priority if specified
        if args.min_priority < 4:
            for priority_level in ["high_priority", "medium_priority", "low_priority"]:
                analysis["index_recommendations"][priority_level] = [
                    rec for rec in analysis["index_recommendations"][priority_level]
                    if rec["priority"] <= args.min_priority
                ]

        # Format output
        if args.format == "json":
            output = json.dumps(analysis, indent=2)
        else:
            output = optimizer.format_text_report(analysis)

        # Write output
        if args.output:
            with open(args.output, 'w') as f:
                f.write(output)
        else:
            print(output)

        return 0

    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        return 1


if __name__ == "__main__":
    sys.exit(main())
File diff suppressed because it is too large
@@ -0,0 +1,476 @@
# database-designer reference

## Database Design Principles

### Normalization Forms

#### First Normal Form (1NF)
- **Atomic Values**: Each column contains indivisible values
- **Unique Column Names**: No duplicate column names within a table
- **Uniform Data Types**: Each column contains the same type of data
- **Row Uniqueness**: No duplicate rows in the table

**Example Violation:**
```sql
-- BAD: Multiple phone numbers in one column
CREATE TABLE contacts (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    phones VARCHAR(200) -- "123-456-7890, 098-765-4321"
);

-- GOOD: Separate table for phone numbers
CREATE TABLE contacts (
    id INT PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE contact_phones (
    id INT PRIMARY KEY,
    contact_id INT REFERENCES contacts(id),
    phone_number VARCHAR(20),
    phone_type VARCHAR(10)
);
```

#### Second Normal Form (2NF)
- **1NF Compliance**: Must satisfy First Normal Form
- **Full Functional Dependency**: Non-key attributes depend on the entire primary key
- **Partial Dependency Elimination**: Remove attributes that depend on part of a composite key

**Example Violation:**
```sql
-- BAD: Student course table with partial dependencies
CREATE TABLE student_courses (
    student_id INT,
    course_id INT,
    student_name VARCHAR(100), -- Depends only on student_id
    course_name VARCHAR(100),  -- Depends only on course_id
    grade CHAR(1),
    PRIMARY KEY (student_id, course_id)
);

-- GOOD: Separate tables eliminate partial dependencies
CREATE TABLE students (
    id INT PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE courses (
    id INT PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE enrollments (
    student_id INT REFERENCES students(id),
    course_id INT REFERENCES courses(id),
    grade CHAR(1),
    PRIMARY KEY (student_id, course_id)
);
```

#### Third Normal Form (3NF)
- **2NF Compliance**: Must satisfy Second Normal Form
- **Transitive Dependency Elimination**: Non-key attributes should not depend on other non-key attributes
- **Direct Dependency**: Non-key attributes depend directly on the primary key

**Example Violation:**
```sql
-- BAD: Employee table with transitive dependency
CREATE TABLE employees (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    department_id INT,
    department_name VARCHAR(100),   -- Depends on department_id, not employee id
    department_budget DECIMAL(10,2) -- Transitive dependency
);

-- GOOD: Separate department information
CREATE TABLE departments (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    budget DECIMAL(10,2)
);

CREATE TABLE employees (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    department_id INT REFERENCES departments(id)
);
```

#### Boyce-Codd Normal Form (BCNF)
- **3NF Compliance**: Must satisfy Third Normal Form
- **Determinant Key Rule**: Every determinant must be a candidate key
- **Stricter 3NF**: Handles anomalies not covered by 3NF

### Denormalization Strategies

#### When to Denormalize
1. **Read-Heavy Workloads**: High query frequency with acceptable write trade-offs
2. **Performance Bottlenecks**: Join operations causing significant latency
3. **Aggregation Needs**: Frequent calculation of derived values
4. **Caching Requirements**: Pre-computed results for common queries

#### Common Denormalization Patterns

**Redundant Storage**
```sql
-- Store calculated values to avoid expensive joins
CREATE TABLE orders (
    id INT PRIMARY KEY,
    customer_id INT REFERENCES customers(id),
    customer_name VARCHAR(100), -- Denormalized from customers table
    order_total DECIMAL(10,2),  -- Denormalized calculation
    created_at TIMESTAMP
);
```

**Materialized Aggregates**
```sql
-- Pre-computed summary tables
CREATE TABLE customer_statistics (
    customer_id INT PRIMARY KEY,
    total_orders INT,
    lifetime_value DECIMAL(12,2),
    last_order_date DATE,
    updated_at TIMESTAMP
);
```

## Index Optimization Strategies

### B-Tree Indexes
- **Default Choice**: Best for range queries, sorting, and equality matches
- **Column Order**: Most selective columns first for composite indexes
- **Prefix Matching**: Supports leading column subset queries
- **Maintenance Cost**: Balanced tree structure with logarithmic operations

### Hash Indexes
- **Equality Queries**: Optimal for exact match lookups
- **Memory Efficiency**: Constant-time access for single-value queries
- **Range Limitations**: Cannot support range or partial matches
- **Use Cases**: Primary keys, unique constraints, cache keys

### Composite Indexes
```sql
-- Query pattern determines optimal column order
-- Query: WHERE status = 'active' AND created_date > '2023-01-01' ORDER BY priority DESC
CREATE INDEX idx_task_status_date_priority
ON tasks (status, created_date, priority DESC);

-- Query: WHERE user_id = 123 AND category IN ('A', 'B') AND date_field BETWEEN '...' AND '...'
CREATE INDEX idx_user_category_date
ON user_activities (user_id, category, date_field);
```

### Covering Indexes
```sql
-- Include additional columns to avoid table lookups
CREATE INDEX idx_user_email_covering
ON users (email)
INCLUDE (first_name, last_name, status);

-- Query can be satisfied entirely from the index
-- SELECT first_name, last_name, status FROM users WHERE email = 'user@example.com';
```

### Partial Indexes
```sql
-- Index only relevant subset of data
CREATE INDEX idx_active_users_email
ON users (email)
WHERE status = 'active';

-- Index for recent orders only. Note: PostgreSQL requires an immutable
-- predicate, so use a fixed cutoff date and rebuild the index periodically.
CREATE INDEX idx_recent_orders_customer
ON orders (customer_id, created_at)
WHERE created_at > DATE '2024-01-01';
```

## Query Analysis & Optimization

### Query Patterns Recognition
1. **Equality Filters**: Single-column B-tree indexes
2. **Range Queries**: B-tree with proper column ordering
3. **Text Search**: Full-text indexes or trigram indexes
4. **Join Operations**: Foreign key indexes on both sides
5. **Sorting Requirements**: Indexes matching ORDER BY clauses

### Index Selection Algorithm
```
1. Identify WHERE clause columns
2. Determine most selective columns first
3. Consider JOIN conditions
4. Include ORDER BY columns if possible
5. Evaluate covering index opportunities
6. Check for existing overlapping indexes
```
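The steps above can be sketched as a small heuristic in Python. This is a simplified illustration, not the optimizer's actual API; the column names and selectivity figures below are assumptions for the example:

```python
def suggest_index(where_columns, selectivity, order_by=(), existing=()):
    """Propose a composite index key: most selective WHERE columns first,
    then ORDER BY columns, skipping keys an existing index already covers."""
    # Lower selectivity fraction = fewer rows matched = more selective
    key = sorted(where_columns, key=lambda c: selectivity.get(c, 1.0))
    # Append ORDER BY columns not already in the key
    key += [c for c in order_by if c not in key]
    candidate = tuple(key)
    # An existing index covers the candidate if its leading columns match
    for idx in existing:
        if tuple(idx[: len(candidate)]) == candidate:
            return None
    return candidate

# user_id is assumed far more selective than status, so it leads the key
print(suggest_index(["status", "user_id"],
                    {"user_id": 0.001, "status": 0.3},
                    order_by=["created_at"]))
# → ('user_id', 'status', 'created_at')
```

A real optimizer would also weigh write overhead and covering-index opportunities (steps 5-6), which this sketch omits.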

## Data Modeling Patterns

### Star Schema (Data Warehousing)
```sql
-- Central fact table
CREATE TABLE sales_facts (
    sale_id BIGINT PRIMARY KEY,
    product_id INT REFERENCES products(id),
    customer_id INT REFERENCES customers(id),
    date_id INT REFERENCES date_dimension(id),
    store_id INT REFERENCES stores(id),
    quantity INT,
    unit_price DECIMAL(8,2),
    total_amount DECIMAL(10,2)
);

-- Dimension tables
CREATE TABLE date_dimension (
    id INT PRIMARY KEY,
    date_value DATE,
    year INT,
    quarter INT,
    month INT,
    day_of_week INT,
    is_weekend BOOLEAN
);
```

### Snowflake Schema
```sql
-- Normalized dimension tables
CREATE TABLE products (
    id INT PRIMARY KEY,
    name VARCHAR(200),
    category_id INT REFERENCES product_categories(id),
    brand_id INT REFERENCES brands(id)
);

CREATE TABLE product_categories (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    parent_category_id INT REFERENCES product_categories(id)
);
```

### Document Model (JSON Storage)
```sql
-- Flexible document storage with indexing
CREATE TABLE documents (
    id UUID PRIMARY KEY,
    document_type VARCHAR(50),
    data JSONB,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Index on JSON properties (B-tree expression index; ->> extracts text,
-- so a plain expression index is appropriate here, not GIN)
CREATE INDEX idx_documents_user_id
ON documents ((data->>'user_id'));

CREATE INDEX idx_documents_status
ON documents ((data->>'status'))
WHERE document_type = 'order';
```

### Graph Data Patterns
```sql
-- Adjacency list for hierarchical data
CREATE TABLE categories (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    parent_id INT REFERENCES categories(id),
    level INT,
    path VARCHAR(500) -- Materialized path: "/1/5/12/"
);

-- Many-to-many relationships
CREATE TABLE relationships (
    id UUID PRIMARY KEY,
    from_entity_id UUID,
    to_entity_id UUID,
    relationship_type VARCHAR(50),
    created_at TIMESTAMP
);

CREATE INDEX idx_relationships_from ON relationships (from_entity_id, relationship_type);
CREATE INDEX idx_relationships_to ON relationships (to_entity_id, relationship_type);
```

## Migration Strategies

### Zero-Downtime Migration (Expand-Contract Pattern)

**Phase 1: Expand**
```sql
-- Add new column without constraints
ALTER TABLE users ADD COLUMN new_email VARCHAR(255);

-- Backfill data in batches
UPDATE users SET new_email = email WHERE id BETWEEN 1 AND 1000;
-- Continue in batches...

-- Add constraints after backfill
ALTER TABLE users ADD CONSTRAINT users_new_email_unique UNIQUE (new_email);
ALTER TABLE users ALTER COLUMN new_email SET NOT NULL;
```
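Batched backfills keep each transaction short so locks are held only briefly. A minimal sketch of the driver loop, shown against an in-memory SQLite database for illustration (the table and column names mirror the example above; a production script would run against the real database with pacing between batches):

```python
import sqlite3

def backfill_new_email(conn, batch_size=1000):
    """Copy email -> new_email in small batches until no rows remain."""
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE users SET new_email = email "
            "WHERE new_email IS NULL AND id IN ("
            "  SELECT id FROM users WHERE new_email IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # one short transaction per batch
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total

# Demo on an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, new_email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [(f"u{i}@example.com",) for i in range(2500)])
print(backfill_new_email(conn))  # prints 2500 (three non-empty batches)
```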

**Phase 2: Contract**
```sql
-- Update application to use new column
-- Deploy application changes
-- Verify new column is being used

-- Remove old column
ALTER TABLE users DROP COLUMN email;
-- Rename new column
ALTER TABLE users RENAME COLUMN new_email TO email;
```

### Data Type Changes
```sql
-- Safe string to integer conversion
ALTER TABLE products ADD COLUMN sku_number INTEGER;
UPDATE products SET sku_number = CAST(sku AS INTEGER) WHERE sku ~ '^[0-9]+$';
-- Validate conversion success before dropping old column
```

## Partitioning Strategies

### Horizontal Partitioning (Sharding)
```sql
-- Range partitioning by date
CREATE TABLE sales_2023 PARTITION OF sales
FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

CREATE TABLE sales_2024 PARTITION OF sales
FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- Hash partitioning by user_id
CREATE TABLE user_data_0 PARTITION OF user_data
FOR VALUES WITH (MODULUS 4, REMAINDER 0);

CREATE TABLE user_data_1 PARTITION OF user_data
FOR VALUES WITH (MODULUS 4, REMAINDER 1);
```

### Vertical Partitioning
```sql
-- Separate frequently accessed columns
CREATE TABLE users_core (
    id INT PRIMARY KEY,
    email VARCHAR(255),
    status VARCHAR(20),
    created_at TIMESTAMP
);

-- Less frequently accessed profile data
CREATE TABLE users_profile (
    user_id INT PRIMARY KEY REFERENCES users_core(id),
    bio TEXT,
    preferences JSONB,
    last_login TIMESTAMP
);
```

## Connection Management

### Connection Pooling
- **Pool Size**: CPU cores × 2 + effective spindle count
- **Connection Lifetime**: Rotate connections to prevent resource leaks
- **Timeout Settings**: Connection, idle, and query timeouts
- **Health Checks**: Regular connection validation
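
The pool-size rule above (a common PostgreSQL sizing heuristic) is easy to compute directly; a small sketch, treating SSD-backed storage as a single effective spindle:

```python
import os

def recommended_pool_size(spindle_count=1):
    """Connection-pool sizing heuristic: CPU cores x 2 + effective spindles.
    SSDs are usually counted as one effective spindle."""
    cores = os.cpu_count() or 1  # fall back to 1 if undetectable
    return cores * 2 + spindle_count
```

This is a starting point, not a hard rule: the right size also depends on query mix and on how many application instances share the database.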

### Read Replicas Strategy
```sql
-- Write queries to primary
INSERT INTO users (email, name) VALUES ('user@example.com', 'John Doe');

-- Read queries to replicas (with appropriate read preference)
SELECT * FROM users WHERE status = 'active'; -- Route to read replica

-- Consistent reads when required
SELECT * FROM users WHERE id = LAST_INSERT_ID(); -- Route to primary
```
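The routing decisions in the comments above are often made by a thin layer in the application. A minimal sketch (the statement-prefix check is a deliberate simplification; real routers also pin reads to the primary right after a write to avoid replication lag):

```python
# Statements that mutate state must always go to the primary
WRITE_PREFIXES = ("INSERT", "UPDATE", "DELETE", "ALTER", "CREATE", "DROP")

def route(sql, require_consistency=False):
    """Return which node should run this statement:
    writes and consistency-sensitive reads -> primary, other reads -> replica."""
    stmt = sql.lstrip().upper()
    if require_consistency or stmt.startswith(WRITE_PREFIXES):
        return "primary"
    return "replica"

print(route("SELECT * FROM users WHERE status = 'active'"))  # replica
print(route("INSERT INTO users (email) VALUES ('a@b.c')"))   # primary
```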

## Caching Layers

### Cache-Aside Pattern
```python
def get_user(user_id):
    # Try cache first
    user = cache.get(f"user:{user_id}")
    if user is None:
        # Cache miss - query database
        user = db.query("SELECT * FROM users WHERE id = %s", user_id)
        # Store in cache
        cache.set(f"user:{user_id}", user, ttl=3600)
    return user
```

### Write-Through Cache
- **Consistency**: Always keep cache and database in sync
- **Write Latency**: Higher due to dual writes
- **Data Safety**: No data loss on cache failures

### Cache Invalidation Strategies
1. **TTL-Based**: Time-based expiration
2. **Event-Driven**: Invalidate on data changes
3. **Version-Based**: Use version numbers for consistency
4. **Tag-Based**: Group related cache entries
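
The TTL-based strategy can be sketched with a tiny in-process cache (illustrative only; a production system would use Redis or a similar store, which handles expiration server-side):

```python
import time

class TTLCache:
    """Minimal TTL cache with lazy expiration on read."""

    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl=3600):
        self._store[key] = (value, time.monotonic() + ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: drop and report a miss
            return None
        return value
```

Lazy expiration keeps the implementation simple but means stale entries linger until the next read; event-driven or tag-based invalidation addresses that at the cost of extra bookkeeping.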

## Database Selection Guide

### SQL Databases
**PostgreSQL**
- **Strengths**: ACID compliance, complex queries, JSON support, extensibility
- **Use Cases**: OLTP applications, data warehousing, geospatial data
- **Scale**: Vertical scaling with read replicas

**MySQL**
- **Strengths**: Performance, replication, wide ecosystem support
- **Use Cases**: Web applications, content management, e-commerce
- **Scale**: Horizontal scaling through sharding

### NoSQL Databases

**Document Stores (MongoDB, CouchDB)**
- **Strengths**: Flexible schema, horizontal scaling, developer productivity
- **Use Cases**: Content management, catalogs, user profiles
- **Trade-offs**: Eventual consistency, limited support for complex queries

**Key-Value Stores (Redis, DynamoDB)**
- **Strengths**: High performance, simple model, excellent caching
- **Use Cases**: Session storage, real-time analytics, gaming leaderboards
- **Trade-offs**: Limited query capabilities, data modeling constraints

**Column-Family (Cassandra, HBase)**
- **Strengths**: Write-heavy workloads, linear scalability, fault tolerance
- **Use Cases**: Time-series data, IoT applications, messaging systems
- **Trade-offs**: Query flexibility, consistency model complexity

**Graph Databases (Neo4j, Amazon Neptune)**
- **Strengths**: Relationship queries, pattern matching, recommendation engines
- **Use Cases**: Social networks, fraud detection, knowledge graphs
- **Trade-offs**: Specialized use cases, learning curve

### NewSQL Databases
**Distributed SQL (CockroachDB, TiDB, Spanner)**
- **Strengths**: SQL compatibility with horizontal scaling
- **Use Cases**: Global applications requiring ACID guarantees
- **Trade-offs**: Complexity, latency for distributed transactions

## Tools & Scripts

### Schema Analyzer
- **Input**: SQL DDL files, JSON schema definitions
- **Analysis**: Normalization compliance, constraint validation, naming conventions
- **Output**: Analysis report, Mermaid ERD, improvement recommendations

### Index Optimizer
- **Input**: Schema definition, query patterns
- **Analysis**: Missing indexes, redundancy detection, selectivity estimation
- **Output**: Index recommendations, CREATE INDEX statements, performance projections

### Migration Generator
- **Input**: Current and target schemas
- **Analysis**: Schema differences, dependency resolution, risk assessment
- **Output**: Migration scripts, rollback plans, validation queries
@@ -0,0 +1,373 @@

# Database Selection Decision Tree

## Overview

Choosing the right database technology is crucial for application success. This guide provides a systematic approach to database selection based on specific requirements, data patterns, and operational constraints.

## Decision Framework

### Primary Questions

1. **What is your primary use case?**
   - OLTP (Online Transaction Processing)
   - OLAP (Online Analytical Processing)
   - Real-time analytics
   - Content management
   - Search and discovery
   - Time-series data
   - Graph relationships

2. **What are your consistency requirements?**
   - Strong consistency (ACID)
   - Eventual consistency
   - Causal consistency
   - Session consistency

3. **What are your scalability needs?**
   - Vertical scaling sufficient
   - Horizontal scaling required
   - Global distribution needed
   - Multi-region requirements

4. **What is your data structure?**
   - Structured (relational)
   - Semi-structured (JSON/XML)
   - Unstructured (documents, media)
   - Graph relationships
   - Time-series data
   - Key-value pairs

## Decision Tree

```
START: What is your primary use case?
│
├── OLTP (Transactional Applications)
│   │
│   ├── Do you need strong ACID guarantees?
│   │   ├── YES → Do you need horizontal scaling?
│   │   │   ├── YES → Distributed SQL
│   │   │   │   ├── CockroachDB (Global, multi-region)
│   │   │   │   ├── TiDB (MySQL compatibility)
│   │   │   │   └── Spanner (Google Cloud)
│   │   │   └── NO → Traditional SQL
│   │   │       ├── PostgreSQL (Feature-rich, extensions)
│   │   │       ├── MySQL (Performance, ecosystem)
│   │   │       └── SQL Server (Microsoft stack)
│   │   └── NO → Are you primarily key-value access?
│   │       ├── YES → Key-Value Stores
│   │       │   ├── Redis (In-memory, caching)
│   │       │   ├── DynamoDB (AWS managed)
│   │       │   └── Cassandra (High availability)
│   │       └── NO → Document Stores
│   │           ├── MongoDB (General purpose)
│   │           ├── CouchDB (Sync, replication)
│   │           └── Amazon DocumentDB (MongoDB compatible)
│
├── OLAP (Analytics and Reporting)
│   │
│   ├── What is your data volume?
│   │   ├── Small to Medium (< 1TB) → Traditional SQL with optimization
│   │   │   ├── PostgreSQL with columnar extensions
│   │   │   ├── MySQL with analytics engine
│   │   │   └── SQL Server with columnstore
│   │   ├── Large (1TB - 100TB) → Data Warehouse Solutions
│   │   │   ├── Snowflake (Cloud-native)
│   │   │   ├── BigQuery (Google Cloud)
│   │   │   ├── Redshift (AWS)
│   │   │   └── Synapse (Azure)
│   │   └── Very Large (> 100TB) → Big Data Platforms
│   │       ├── Databricks (Unified analytics)
│   │       ├── Apache Spark on cloud
│   │       └── Hadoop ecosystem
│
├── Real-time Analytics
│   │
│   ├── Do you need sub-second query responses?
│   │   ├── YES → Stream Processing + OLAP
│   │   │   ├── ClickHouse (Fast analytics)
│   │   │   ├── Apache Druid (Real-time OLAP)
│   │   │   ├── Pinot (LinkedIn's real-time DB)
│   │   │   └── TimescaleDB (Time-series)
│   │   └── NO → Traditional OLAP solutions
│
├── Search and Discovery
│   │
│   ├── What type of search?
│   │   ├── Full-text search → Search Engines
│   │   │   ├── Elasticsearch (Full-featured)
│   │   │   ├── OpenSearch (AWS fork of ES)
│   │   │   └── Solr (Apache Lucene-based)
│   │   ├── Vector/similarity search → Vector Databases
│   │   │   ├── Pinecone (Managed vector DB)
│   │   │   ├── Weaviate (Open source)
│   │   │   ├── Chroma (Embeddings)
│   │   │   └── PostgreSQL with pgvector
│   │   └── Faceted search → Search + SQL combination
│
├── Graph Relationships
│   │
│   ├── Do you need complex graph traversals?
│   │   ├── YES → Graph Databases
│   │   │   ├── Neo4j (Property graph)
│   │   │   ├── Amazon Neptune (Multi-model)
│   │   │   ├── ArangoDB (Multi-model)
│   │   │   └── TigerGraph (Analytics focused)
│   │   └── NO → SQL with recursive queries
│   │       └── PostgreSQL with recursive CTEs
│
└── Time-series Data
    │
    ├── What is your write volume?
    │   ├── High (millions/sec) → Specialized Time-series
    │   │   ├── InfluxDB (Purpose-built)
    │   │   ├── TimescaleDB (PostgreSQL extension)
    │   │   ├── Apache Druid (Analytics focused)
    │   │   └── Prometheus (Monitoring)
    │   └── Medium → SQL with time-series optimization
    │       └── PostgreSQL with partitioning
```
|
||||
|
||||
## Database Categories Deep Dive
|
||||
|
||||
### Traditional SQL Databases
|
||||
|
||||
**PostgreSQL**
|
||||
- **Best For**: Complex queries, JSON data, extensions, geospatial
|
||||
- **Strengths**: Feature-rich, reliable, strong consistency, extensible
|
||||
- **Use Cases**: OLTP, mixed workloads, JSON documents, geospatial applications
|
||||
- **Scaling**: Vertical scaling, read replicas, partitioning
|
||||
- **When to Choose**: Need SQL features, complex queries, moderate scale
|
||||
|
||||
**MySQL**
|
||||
- **Best For**: Web applications, read-heavy workloads, simple schema
|
||||
- **Strengths**: Performance, replication, large ecosystem
|
||||
- **Use Cases**: Web apps, content management, e-commerce
|
||||
- **Scaling**: Read replicas, sharding, clustering (MySQL Cluster)
|
||||
- **When to Choose**: Simple schema, performance priority, large community
|
||||
|
||||
**SQL Server**
|
||||
- **Best For**: Microsoft ecosystem, enterprise features, business intelligence
|
||||
- **Strengths**: Integration, tooling, enterprise features
|
||||
- **Use Cases**: Enterprise applications, .NET applications, BI
|
||||
- **Scaling**: Always On availability groups, partitioning
|
||||
- **When to Choose**: Microsoft stack, enterprise requirements
|
||||
|
||||
### Distributed SQL (NewSQL)
|
||||
|
||||
**CockroachDB**
|
||||
- **Best For**: Global applications, strong consistency, horizontal scaling
|
||||
- **Strengths**: ACID guarantees, automatic scaling, survival
|
||||
- **Use Cases**: Multi-region apps, financial services, global SaaS
|
||||
- **Trade-offs**: Complex setup, higher latency for global transactions
|
||||
- **When to Choose**: Need SQL + global scale + consistency
|
||||
|
||||
**TiDB**
|
||||
- **Best For**: MySQL compatibility with horizontal scaling
|
||||
- **Strengths**: MySQL protocol, HTAP (hybrid), cloud-native
|
||||
- **Use Cases**: MySQL migrations, hybrid workloads
|
||||
- **When to Choose**: Existing MySQL expertise, need scale
|
||||
|
||||
### NoSQL Document Stores
|
||||
|
||||
**MongoDB**
|
||||
- **Best For**: Flexible schema, rapid development, document-centric data
|
||||
- **Strengths**: Developer experience, flexible schema, rich queries
|
||||
- **Use Cases**: Content management, catalogs, user profiles, IoT
|
||||
- **Scaling**: Automatic sharding, replica sets
|
||||
- **When to Choose**: Schema evolution, document structure, rapid development
|
||||
|
||||
**CouchDB**
|
||||
- **Best For**: Offline-first applications, multi-master replication
|
||||
- **Strengths**: HTTP API, replication, conflict resolution
|
||||
- **Use Cases**: Mobile apps, distributed systems, offline scenarios
|
||||
- **When to Choose**: Need offline capabilities, bi-directional sync
|
||||
|
||||
### Key-Value Stores
|
||||
|
||||
**Redis**
|
||||
- **Best For**: Caching, sessions, real-time applications, pub/sub
|
||||
- **Strengths**: Performance, data structures, persistence options
|
||||
- **Use Cases**: Caching, leaderboards, real-time analytics, queues
|
||||
- **Scaling**: Clustering, sentinel for HA
|
||||
- **When to Choose**: High performance, simple data model, caching
|
||||
|
||||
**DynamoDB**
|
||||
- **Best For**: Serverless applications, predictable performance, AWS ecosystem
|
||||
- **Strengths**: Managed, auto-scaling, consistent performance
|
||||
- **Use Cases**: Web applications, gaming, IoT, mobile backends
|
||||
- **Trade-offs**: Vendor lock-in, limited querying
|
||||
- **When to Choose**: AWS ecosystem, serverless, managed solution
|
||||
|
||||
### Column-Family Stores
|
||||
|
||||
**Cassandra**
|
||||
- **Best For**: Write-heavy workloads, high availability, linear scalability
|
||||
- **Strengths**: No single point of failure, tunable consistency
|
||||
- **Use Cases**: Time-series, IoT, messaging, activity feeds
|
||||
- **Trade-offs**: Complex operations, eventual consistency
|
||||
- **When to Choose**: High write volume, availability over consistency
|
||||
|
||||
**HBase**
|
||||
- **Best For**: Big data applications, Hadoop ecosystem
|
||||
- **Strengths**: Hadoop integration, consistent reads
|
||||
- **Use Cases**: Analytics on big data, time-series at scale
|
||||
- **When to Choose**: Hadoop ecosystem, very large datasets
|
||||
|
||||
### Graph Databases
|
||||
|
||||
**Neo4j**
|
||||
- **Best For**: Complex relationships, graph algorithms, traversals
|
||||
- **Strengths**: Mature ecosystem, Cypher query language, algorithms
|
||||
- **Use Cases**: Social networks, recommendation engines, fraud detection
|
||||
- **Trade-offs**: Specialized use case, learning curve
|
||||
- **When to Choose**: Relationship-heavy data, graph algorithms
|
||||
|
||||
### Time-Series Databases
|
||||
|
||||
**InfluxDB**
|
||||
- **Best For**: Time-series data, IoT, monitoring, analytics
|
||||
- **Strengths**: Purpose-built, efficient storage, query language
|
||||
- **Use Cases**: IoT sensors, monitoring, DevOps metrics
|
||||
- **When to Choose**: High-volume time-series data
|
||||
|
||||
**TimescaleDB**
|
||||
- **Best For**: Time-series with SQL familiarity
|
||||
- **Strengths**: PostgreSQL compatibility, SQL queries, ecosystem
|
||||
- **Use Cases**: Financial data, IoT with complex queries
|
||||
- **When to Choose**: Time-series + SQL requirements
|
||||
|
||||
### Search Engines
|
||||
|
||||
**Elasticsearch**
|
||||
- **Best For**: Full-text search, log analysis, real-time search
|
||||
- **Strengths**: Powerful search, analytics, ecosystem (ELK stack)
|
||||
- **Use Cases**: Search applications, log analysis, monitoring
|
||||
- **Trade-offs**: Complex operations, resource intensive
|
||||
- **When to Choose**: Advanced search requirements, analytics
|
||||
|
||||
### Data Warehouses
|
||||
|
||||
**Snowflake**
|
||||
- **Best For**: Cloud-native analytics, data sharing, varied workloads
|
||||
- **Strengths**: Separation of compute/storage, automatic scaling
|
||||
- **Use Cases**: Data warehousing, analytics, data science
|
||||
- **When to Choose**: Cloud-native, analytics-focused, multi-cloud
|
||||
|
||||
**BigQuery**
|
||||
- **Best For**: Serverless analytics, Google ecosystem, machine learning
|
||||
- **Strengths**: Serverless, petabyte scale, ML integration
|
||||
- **Use Cases**: Analytics, data science, reporting
|
||||
- **When to Choose**: Google Cloud, serverless analytics
|
||||
|
||||
## Selection Criteria Matrix
|
||||
|
||||
| Criterion | SQL | NewSQL | Document | Key-Value | Column-Family | Graph | Time-Series |
|
||||
|-----------|-----|--------|----------|-----------|---------------|-------|-------------|
|
||||
| ACID Guarantees | ✅ Strong | ✅ Strong | ⚠️ Eventual | ⚠️ Eventual | ⚠️ Tunable | ⚠️ Varies | ⚠️ Varies |
|
||||
| Horizontal Scaling | ❌ Limited | ✅ Native | ✅ Native | ✅ Native | ✅ Native | ⚠️ Limited | ✅ Native |
|
||||
| Query Flexibility | ✅ High | ✅ High | ⚠️ Moderate | ❌ Low | ❌ Low | ✅ High | ⚠️ Specialized |
|
||||
| Schema Flexibility | ❌ Rigid | ❌ Rigid | ✅ High | ✅ High | ⚠️ Moderate | ✅ High | ⚠️ Structured |
|
||||
| Performance (Reads) | ⚠️ Good | ⚠️ Good | ✅ Excellent | ✅ Excellent | ✅ Excellent | ⚠️ Good | ✅ Excellent |
|
||||
| Performance (Writes) | ⚠️ Good | ⚠️ Good | ✅ Excellent | ✅ Excellent | ✅ Excellent | ⚠️ Good | ✅ Excellent |
|
||||
| Operational Complexity | ✅ Low | ❌ High | ⚠️ Moderate | ✅ Low | ❌ High | ⚠️ Moderate | ⚠️ Moderate |
|
||||
| Ecosystem Maturity | ✅ Mature | ⚠️ Growing | ✅ Mature | ✅ Mature | ✅ Mature | ✅ Mature | ⚠️ Growing |
|
||||
|
||||
## Decision Checklist

### Requirements Analysis
- [ ] **Data Volume**: Current and projected data size
- [ ] **Transaction Volume**: Reads per second, writes per second
- [ ] **Consistency Requirements**: Strong vs eventual consistency needs
- [ ] **Query Patterns**: Simple lookups vs complex analytics
- [ ] **Schema Evolution**: How often does the schema change?
- [ ] **Geographic Distribution**: Single region vs global
- [ ] **Availability Requirements**: Acceptable downtime
- [ ] **Team Expertise**: Existing knowledge and learning curve
- [ ] **Budget Constraints**: Licensing, infrastructure, operational costs
- [ ] **Compliance Requirements**: Data residency, audit trails

### Technical Evaluation
- [ ] **Performance Testing**: Benchmark with realistic data and queries
- [ ] **Scalability Testing**: Test scaling limits and patterns
- [ ] **Failure Scenarios**: Test backup, recovery, and failure handling
- [ ] **Integration Testing**: APIs, connectors, ecosystem tools
- [ ] **Migration Path**: How to migrate from the current system
- [ ] **Monitoring and Observability**: Available tooling and metrics

### Operational Considerations
- [ ] **Management Complexity**: Setup, configuration, maintenance
- [ ] **Backup and Recovery**: Built-in vs external tools
- [ ] **Security Features**: Authentication, authorization, encryption
- [ ] **Upgrade Path**: Version compatibility and upgrade process
- [ ] **Support Options**: Community vs commercial support
- [ ] **Lock-in Risk**: Portability and vendor independence

## Common Decision Patterns

### E-commerce Platform
**Typical Choice**: PostgreSQL or MySQL
- **Primary Data**: Product catalog, orders, users (structured)
- **Query Patterns**: OLTP with some analytics
- **Consistency**: Strong consistency for financial data
- **Scale**: Moderate, with read replicas
- **Additional**: Redis for caching, Elasticsearch for product search

### IoT/Sensor Data Platform
**Typical Choice**: TimescaleDB or InfluxDB
- **Primary Data**: Time-series sensor readings
- **Query Patterns**: Time-based aggregations, trend analysis
- **Scale**: High write volume, moderate read volume
- **Additional**: Kafka for ingestion, PostgreSQL for metadata

### Social Media Application
**Typical Choice**: Combination approach
- **User Profiles**: MongoDB (flexible schema)
- **Relationships**: Neo4j (graph relationships)
- **Activity Feeds**: Cassandra (high write volume)
- **Search**: Elasticsearch (content discovery)
- **Caching**: Redis (sessions, real-time data)

### Analytics Platform
**Typical Choice**: Snowflake or BigQuery
- **Primary Use**: Complex analytical queries
- **Data Volume**: Large (TB to PB scale)
- **Query Patterns**: Ad-hoc analytics, reporting
- **Users**: Data analysts, data scientists
- **Additional**: Data lake (S3/GCS) for raw data storage

### Global SaaS Application
**Typical Choice**: CockroachDB or DynamoDB
- **Requirements**: Multi-region, strong consistency
- **Scale**: Global user base
- **Compliance**: Data residency requirements
- **Availability**: High availability across regions

## Migration Strategies

### From Monolithic to Distributed
1. **Assessment**: Identify scaling bottlenecks
2. **Data Partitioning**: Plan how to split data
3. **Gradual Migration**: Move non-critical data first
4. **Dual Writes**: Run both systems temporarily
5. **Validation**: Verify data consistency
6. **Cutover**: Switch reads and writes gradually
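
Step 4 (dual writes) is usually the riskiest part of this sequence, since the two stores can drift. A minimal sketch of the pattern, using two in-memory SQLite databases as stand-ins for the old and new systems (the `DualWriter` class and its `users` table are hypothetical):

```python
import sqlite3

class DualWriter:
    """Write to both stores; serve reads from the old one until cutover."""

    def __init__(self, old_db, new_db):
        self.old_db = old_db
        self.new_db = new_db

    def write_user(self, user_id, name):
        # Apply the write to both systems. In production the new-store write
        # would typically be queued and retried rather than failing the request.
        for db in (self.old_db, self.new_db):
            db.execute("INSERT OR REPLACE INTO users VALUES (?, ?)", (user_id, name))
            db.commit()

    def read_user(self, user_id):
        # Reads stay on the old system until validation passes.
        row = self.old_db.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
        return row[0] if row else None

    def validate(self):
        # Compare row counts; a real migration would also checksum row contents.
        counts = [db.execute("SELECT COUNT(*) FROM users").fetchone()[0]
                  for db in (self.old_db, self.new_db)]
        return counts[0] == counts[1]

old = sqlite3.connect(":memory:")
new = sqlite3.connect(":memory:")
for db in (old, new):
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

writer = DualWriter(old, new)
writer.write_user(1, "Ada")
print(writer.read_user(1), writer.validate())
```

Once `validate()` passes consistently over a burn-in period, reads can be flipped to the new store one service at a time.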

### Technology Stack Evolution
1. **Start Simple**: Begin with PostgreSQL or MySQL
2. **Identify Bottlenecks**: Monitor performance and scaling issues
3. **Selective Scaling**: Move specific workloads to specialized databases
4. **Polyglot Persistence**: Use multiple databases for different use cases
5. **Service Boundaries**: Align database choice with service boundaries

## Conclusion

Database selection should be driven by:

1. **Specific Use Case Requirements**: Not all applications need the same database
2. **Data Characteristics**: Structure, volume, and access patterns matter
3. **Non-functional Requirements**: Consistency, availability, performance targets
4. **Team and Organizational Factors**: Expertise, operational capacity, budget
5. **Evolution Path**: How requirements and scale will change over time

The best database choice is often not a single technology, but a combination of databases that each excel at their specific use case within your application architecture.

@@ -0,0 +1,424 @@
# Index Strategy Patterns

## Overview

Database indexes are critical for query performance, but they come with trade-offs. This guide covers proven patterns for index design, optimization strategies, and common pitfalls to avoid.

## Index Types and Use Cases

### B-Tree Indexes (Default)

**Best For:**
- Equality queries (`WHERE column = value`)
- Range queries (`WHERE column BETWEEN x AND y`)
- Sorting (`ORDER BY column`)
- Prefix pattern matching (`WHERE column LIKE 'prefix%'`; a leading wildcard defeats the index)

**Characteristics:**
- Logarithmic lookup time O(log n)
- Supports leftmost-prefix matches on composite indexes
- Most versatile index type

**Example:**
```sql
-- Single-column B-tree index
CREATE INDEX idx_customers_email ON customers (email);

-- Composite B-tree index
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
```

### Hash Indexes

**Best For:**
- Exact equality matches only
- High-cardinality columns
- Primary key lookups

**Characteristics:**
- Constant lookup time O(1) for exact matches
- Cannot support range queries or sorting
- Memory-efficient for equality operations

**Example:**
```sql
-- Hash index for exact lookups (PostgreSQL)
CREATE INDEX idx_users_id_hash ON users USING HASH (user_id);
```

### Partial Indexes

**Best For:**
- Filtering on a subset of data
- Reducing index size and maintenance overhead
- Query patterns that consistently use specific filters

**Example:**
```sql
-- Index only active users
CREATE INDEX idx_active_users_email
ON users (email)
WHERE status = 'active';

-- Index recent orders only
-- (PostgreSQL requires an immutable predicate, so use a constant
-- cutoff date and recreate the index on a schedule)
CREATE INDEX idx_recent_orders
ON orders (customer_id, created_at)
WHERE created_at > DATE '2024-01-01';

-- Index non-null values only
CREATE INDEX idx_customers_phone
ON customers (phone_number)
WHERE phone_number IS NOT NULL;
```
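
SQLite also supports partial indexes, which makes it easy to check that a predicate is actually used. A small sketch (table and index names follow the example above and are illustrative); the planner can use the partial index only when the query's WHERE clause implies the index predicate:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, status TEXT)")
con.execute("""
    CREATE INDEX idx_active_users_email
    ON users (email) WHERE status = 'active'
""")

# The literal status = 'active' matches the index predicate, so the
# partial index is eligible for this query.
plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT id FROM users WHERE email = ? AND status = 'active'
""", ("a@example.com",)).fetchall()
detail = plan[0][-1]
print(detail)
```

A query without the `status = 'active'` term would fall back to a table scan, since the planner cannot prove the predicate holds.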

### Covering Indexes

**Best For:**
- Eliminating table lookups for SELECT queries
- Frequently accessed column combinations
- Read-heavy workloads

**Example:**
```sql
-- Covering index with INCLUDE clause (SQL Server / PostgreSQL 11+)
CREATE INDEX idx_orders_customer_covering
ON orders (customer_id, order_date)
INCLUDE (order_total, status);

-- Query can be satisfied entirely from the index:
-- SELECT order_total, status FROM orders
-- WHERE customer_id = 123 AND order_date > '2024-01-01';
```
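
SQLite has no INCLUDE clause, but a plain composite index covers a query the same way, and `EXPLAIN QUERY PLAN` reports it explicitly. A minimal check (the schema is an illustrative cut-down of the one above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER, order_date TEXT,
        order_total REAL, status TEXT)
""")
# All referenced columns are in the index, so no table lookup is needed.
con.execute("""
    CREATE INDEX idx_orders_customer_covering
    ON orders (customer_id, order_date, order_total, status)
""")

plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT order_total, status FROM orders
    WHERE customer_id = 123 AND order_date > '2024-01-01'
""").fetchall()
detail = plan[0][-1]
print(detail)  # the plan should mention a COVERING INDEX
```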

### Functional/Expression Indexes

**Best For:**
- Queries on transformed column values
- Case-insensitive searches
- Complex calculations

**Example:**
```sql
-- Case-insensitive email searches
CREATE INDEX idx_users_email_lower
ON users (LOWER(email));

-- Date part extraction
CREATE INDEX idx_orders_month
ON orders (EXTRACT(MONTH FROM order_date));

-- JSON field indexing
CREATE INDEX idx_users_preferences_theme
ON users ((preferences->>'theme'));
```
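
Expression indexes can be verified in SQLite as well; note that the query's expression must match the indexed expression for the planner to use it (names below are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
# Index on the transformed value, not the raw column
con.execute("CREATE INDEX idx_users_email_lower ON users (LOWER(email))")
con.execute("INSERT INTO users (email) VALUES ('John@Example.com')")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE LOWER(email) = ?",
    ("john@example.com",)).fetchall()
detail = plan[0][-1]
print(detail)

row = con.execute(
    "SELECT id FROM users WHERE LOWER(email) = ?",
    ("john@example.com",)).fetchone()
print(row)  # → (1,)
```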

## Composite Index Design Patterns

### Column Ordering Strategy

**Rule: Most Selective Equality Columns First, Range Columns Last**
```sql
-- Query: WHERE status = 'active' AND city = 'New York' AND age > 25
-- Assume: status has 3 distinct values, city has 100, age has 80

-- GOOD: most selective equality column first, range column (age) last
CREATE INDEX idx_users_city_status_age ON users (city, status, age);

-- BAD: least selective column first
CREATE INDEX idx_users_status_city_age ON users (status, city, age);
```

**Selectivity Calculation:**
```sql
-- Estimate selectivity for each column
SELECT
    'status' as column_name,
    COUNT(DISTINCT status)::float / COUNT(*) as selectivity
FROM users
UNION ALL
SELECT
    'city' as column_name,
    COUNT(DISTINCT city)::float / COUNT(*) as selectivity
FROM users
UNION ALL
SELECT
    'age' as column_name,
    COUNT(DISTINCT age)::float / COUNT(*) as selectivity
FROM users;
```
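
The same distinct-ratio estimate can be computed from application code. A small sketch against an in-memory SQLite table with made-up data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (status TEXT, city TEXT, age INTEGER)")
# 8 rows: 1 distinct status, 4 distinct cities, 8 distinct ages
rows = [("active", f"city{i % 4}", 20 + i) for i in range(8)]
con.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)

def selectivity(column):
    # Distinct-value ratio: 1.0 means every value is unique,
    # values near 0 mean the column is a poor leading index column.
    n_distinct, n_total = con.execute(
        f"SELECT COUNT(DISTINCT {column}), COUNT(*) FROM users").fetchone()
    return n_distinct / n_total

print({c: selectivity(c) for c in ("status", "city", "age")})
# → {'status': 0.125, 'city': 0.5, 'age': 1.0}
```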

### Query Pattern Matching

**Pattern 1: Equality + Range**
```sql
-- Query: WHERE customer_id = 123 AND order_date BETWEEN '2024-01-01' AND '2024-03-31'
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
```

**Pattern 2: Multiple Equality Conditions**
```sql
-- Query: WHERE status = 'active' AND category = 'premium' AND region = 'US'
CREATE INDEX idx_users_status_category_region ON users (status, category, region);
```

**Pattern 3: Equality + Sorting**
```sql
-- Query: WHERE category = 'electronics' ORDER BY price DESC, created_at DESC
CREATE INDEX idx_products_category_price_date ON products (category, price DESC, created_at DESC);
```

### Prefix Optimization

**Efficient Prefix Usage:**
```sql
-- This index supports all of these queries efficiently:
CREATE INDEX idx_users_lastname_firstname_email ON users (last_name, first_name, email);

-- ✓ Uses index: WHERE last_name = 'Smith'
-- ✓ Uses index: WHERE last_name = 'Smith' AND first_name = 'John'
-- ✓ Uses index: WHERE last_name = 'Smith' AND first_name = 'John' AND email = 'john@...'
-- ✗ Cannot use index: WHERE first_name = 'John'
-- ✗ Cannot use index: WHERE email = 'john@...'
```
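
The leftmost-prefix rule can be observed directly with SQLite's `EXPLAIN QUERY PLAN` (the schema is illustrative; `bio` is a non-indexed column added so the covering-index case doesn't mask the effect):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE users (id INTEGER PRIMARY KEY,
                        last_name TEXT, first_name TEXT, email TEXT, bio TEXT)
""")
con.execute("""
    CREATE INDEX idx_users_lastname_firstname_email
    ON users (last_name, first_name, email)
""")

def plan(where):
    rows = con.execute(
        f"EXPLAIN QUERY PLAN SELECT bio FROM users WHERE {where}").fetchall()
    return rows[0][-1]

# Leading columns present: the composite index is searched
uses_index = plan("last_name = 'Smith' AND first_name = 'John'")
# Leftmost column missing: full table scan
no_index = plan("first_name = 'John'")
print(uses_index)
print(no_index)
```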

## Performance Optimization Patterns

### Index Intersection vs Composite Indexes

**Scenario: Multiple single-column indexes**
```sql
CREATE INDEX idx_users_age ON users (age);
CREATE INDEX idx_users_city ON users (city);
CREATE INDEX idx_users_status ON users (status);

-- Query: WHERE age > 25 AND city = 'NYC' AND status = 'active'
-- The database may use index intersection (combining multiple indexes);
-- performance varies by database engine and data distribution.
```

**Better: Purpose-built composite index**
```sql
-- More efficient for this specific query pattern
CREATE INDEX idx_users_city_status_age ON users (city, status, age);
```

### Index Size vs Performance Trade-off

**Wide Indexes (Many Columns):**
```sql
-- Pros: cover many query patterns, excellent for covering queries
-- Cons: large index size, slower writes, more memory usage
CREATE INDEX idx_orders_comprehensive
ON orders (customer_id, order_date, status, total_amount, shipping_method, created_at)
INCLUDE (order_notes, billing_address);
```

**Narrow Indexes (Few Columns):**
```sql
-- Pros: smaller size, faster writes, less memory
-- Cons: may not cover all query patterns
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
CREATE INDEX idx_orders_status ON orders (status);
```

### Maintenance Optimization

**Regular Index Analysis:**
```sql
-- PostgreSQL: check index usage statistics
SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan AS index_scans,
    idx_tup_read AS tuples_read,
    idx_tup_fetch AS tuples_fetched
FROM pg_stat_user_indexes
WHERE idx_scan = 0 -- potentially unused indexes
ORDER BY schemaname, relname;

-- Check index size
SELECT
    indexname,
    pg_size_pretty(pg_relation_size(indexname::regclass)) AS index_size
FROM pg_indexes
WHERE schemaname = 'public'
ORDER BY pg_relation_size(indexname::regclass) DESC;
```

## Common Anti-Patterns

### 1. Over-Indexing

**Problem:**
```sql
-- Too many similar indexes
CREATE INDEX idx_orders_customer ON orders (customer_id);
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
CREATE INDEX idx_orders_customer_status ON orders (customer_id, status);
CREATE INDEX idx_orders_customer_date_status ON orders (customer_id, order_date, status);
```

**Solution:**
```sql
-- One well-designed composite index can often replace several
CREATE INDEX idx_orders_customer_date_status ON orders (customer_id, order_date, status);
-- Drop the prefix-redundant indexes: idx_orders_customer, idx_orders_customer_date.
-- Note that idx_orders_customer_status is NOT a prefix of the new index;
-- drop it only if customer_id + status lookups are rare.
```
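
Prefix redundancy is mechanical to detect: an index is redundant when its column list is a strict leading prefix of another index on the same table. A small sketch (index names taken from the example above; unique indexes would need extra care and are ignored here):

```python
def redundant_indexes(indexes):
    """Return names of indexes whose columns are a strict leading prefix
    of another index's columns (and are therefore usually droppable)."""
    redundant = set()
    for name_a, cols_a in indexes.items():
        for name_b, cols_b in indexes.items():
            if name_a != name_b and cols_b[:len(cols_a)] == cols_a:
                redundant.add(name_a)
    return sorted(redundant)

orders_indexes = {
    "idx_orders_customer": ["customer_id"],
    "idx_orders_customer_date": ["customer_id", "order_date"],
    "idx_orders_customer_status": ["customer_id", "status"],
    "idx_orders_customer_date_status": ["customer_id", "order_date", "status"],
}
print(redundant_indexes(orders_indexes))
# → ['idx_orders_customer', 'idx_orders_customer_date']
```

Note that `idx_orders_customer_status` is correctly kept: `(customer_id, status)` is not a prefix of `(customer_id, order_date, status)`.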

### 2. Wrong Column Order

**Problem:**
```sql
-- Query: WHERE active = true AND user_type = 'premium' AND city = 'Chicago'
-- Bad order: boolean first (lowest selectivity)
CREATE INDEX idx_users_active_type_city ON users (active, user_type, city);
```

**Solution:**
```sql
-- Good order: most selective first
CREATE INDEX idx_users_city_type_active ON users (city, user_type, active);
```

### 3. Ignoring Query Patterns

**Problem:**
```sql
-- Index doesn't match common query patterns
CREATE INDEX idx_products_name ON products (product_name);

-- But the actual queries are: WHERE category = 'electronics' AND price BETWEEN 100 AND 500
-- The index is useless for these queries
```

**Solution:**
```sql
-- Match actual query patterns
CREATE INDEX idx_products_category_price ON products (category, price);
```

### 4. Function in WHERE Without Functional Index

**Problem:**
```sql
-- Query applies a function, so a plain index on email won't help
SELECT * FROM users WHERE LOWER(email) = 'john@example.com';
```

**Solution:**
```sql
-- Create a functional index matching the expression
CREATE INDEX idx_users_email_lower ON users (LOWER(email));
```

## Advanced Patterns

### Multi-Column Statistics

**When Columns Are Correlated:**
```sql
-- If city and state are highly correlated, create extended statistics (PostgreSQL)
CREATE STATISTICS stats_address_correlation ON city, state FROM addresses;
ANALYZE addresses;

-- Helps the query planner make better row estimates for:
-- WHERE city = 'New York' AND state = 'NY'
```

### Conditional Indexes for Data Lifecycle

**Pattern: Different indexes for different data ages**

Note: partial-index predicates must be immutable in PostgreSQL, so use constant cutoff dates (the dates below are examples) and rebuild the indexes on a schedule.

```sql
-- Hot data (recent orders) - optimized for OLTP
CREATE INDEX idx_orders_hot_customer_date
ON orders (customer_id, order_date DESC)
WHERE order_date > DATE '2024-03-01';

-- Warm data (older orders) - optimized for analytics
CREATE INDEX idx_orders_warm_date_total
ON orders (order_date, total_amount)
WHERE order_date <= DATE '2024-03-01'
  AND order_date > DATE '2023-04-01';

-- Cold data (archived orders) - minimal indexing
CREATE INDEX idx_orders_cold_date
ON orders (order_date)
WHERE order_date <= DATE '2023-04-01';
```

### Index-Only Scan Optimization

**Design indexes to avoid table access:**
```sql
-- Query: SELECT order_id, total_amount, status FROM orders WHERE customer_id = ?
CREATE INDEX idx_orders_customer_covering
ON orders (customer_id)
INCLUDE (order_id, total_amount, status);

-- Or as a composite index (if the database doesn't support INCLUDE)
CREATE INDEX idx_orders_customer_covering
ON orders (customer_id, order_id, total_amount, status);
```

## Index Monitoring and Maintenance

### Performance Monitoring Queries

**Find slow queries that might benefit from indexes:**
```sql
-- PostgreSQL: find high-cost queries (requires the pg_stat_statements
-- extension; on PostgreSQL 13+ the columns are total_exec_time / mean_exec_time)
SELECT
    query,
    calls,
    total_time,
    mean_time,
    rows
FROM pg_stat_statements
WHERE mean_time > 1000 -- queries averaging > 1 second
ORDER BY mean_time DESC;
```

**Identify missing indexes:**
```sql
-- Look for sequential scans on large tables
SELECT
    schemaname,
    relname,
    seq_scan,
    seq_tup_read,
    idx_scan,
    n_tup_ins + n_tup_upd + n_tup_del AS write_activity
FROM pg_stat_user_tables
WHERE seq_scan > 100
  AND seq_tup_read > 100000 -- large sequential scans
  AND (idx_scan = 0 OR seq_scan > idx_scan * 2)
ORDER BY seq_tup_read DESC;
```

### Index Maintenance Schedule

**Regular Maintenance Tasks:**
```sql
-- Rebuild fragmented indexes (SQL Server)
ALTER INDEX ALL ON orders REBUILD;

-- Update planner statistics (PostgreSQL)
ANALYZE orders;

-- Check for unused indexes monthly (PostgreSQL)
SELECT * FROM pg_stat_user_indexes WHERE idx_scan = 0;
```

## Conclusion

Effective index strategy requires:

1. **Understanding Query Patterns**: Analyze actual application queries, not theoretical scenarios
2. **Measuring Performance**: Use query execution plans and timing to validate index effectiveness
3. **Balancing Trade-offs**: More indexes improve reads but slow writes and increase storage
4. **Regular Maintenance**: Monitor index usage and performance, remove unused indexes
5. **Iterative Improvement**: Start with essential indexes, add and optimize based on real usage

The goal is not to index every possible query pattern, but to create a focused set of indexes that provide maximum benefit for your application's specific workload while minimizing maintenance overhead.

@@ -0,0 +1,354 @@
# Database Normalization Guide

## Overview

Database normalization is the process of organizing data to minimize redundancy and dependency issues. It involves decomposing tables to eliminate data anomalies and improve data integrity.

## Normal Forms

### First Normal Form (1NF)

**Requirements:**
- Each column contains atomic (indivisible) values
- Each column contains values of the same type
- Each column has a unique name
- The order in which data is stored doesn't matter

**Violations and Solutions:**

**Problem: Multiple values in a single column**
```sql
-- BAD: Multiple phone numbers in one column
CREATE TABLE customers (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    phones VARCHAR(500) -- "555-1234, 555-5678, 555-9012"
);

-- GOOD: Separate table for multiple phones
CREATE TABLE customers (
    id INT PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE customer_phones (
    id INT PRIMARY KEY,
    customer_id INT REFERENCES customers(id),
    phone VARCHAR(20),
    phone_type VARCHAR(10) -- 'mobile', 'home', 'work'
);
```

**Problem: Repeating groups**
```sql
-- BAD: Repeating column patterns
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    item1_name VARCHAR(100),
    item1_qty INT,
    item1_price DECIMAL(8,2),
    item2_name VARCHAR(100),
    item2_qty INT,
    item2_price DECIMAL(8,2),
    item3_name VARCHAR(100),
    item3_qty INT,
    item3_price DECIMAL(8,2)
);

-- GOOD: Separate table for order items
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    order_date DATE
);

CREATE TABLE order_items (
    id INT PRIMARY KEY,
    order_id INT REFERENCES orders(order_id),
    item_name VARCHAR(100),
    quantity INT,
    unit_price DECIMAL(8,2)
);
```
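
Migrating data out of a repeating-group table is a straightforward unpivot. A sketch of the transformation in plain Python (column names follow the example above; the row dict is made-up sample data):

```python
# One denormalized row with repeating item1_/item2_/item3_ groups
wide_row = {
    "order_id": 1, "customer_id": 42,
    "item1_name": "pen", "item1_qty": 2, "item1_price": 1.50,
    "item2_name": "pad", "item2_qty": 1, "item2_price": 3.00,
    "item3_name": None, "item3_qty": None, "item3_price": None,
}

def unpivot_items(row, groups=3):
    """Turn itemN_* columns into normalized order_items rows."""
    items = []
    for n in range(1, groups + 1):
        name = row[f"item{n}_name"]
        if name is None:  # empty slot in the repeating group
            continue
        items.append({
            "order_id": row["order_id"],
            "item_name": name,
            "quantity": row[f"item{n}_qty"],
            "unit_price": row[f"item{n}_price"],
        })
    return items

print(unpivot_items(wide_row))
```

Each returned dict maps directly onto an INSERT into the normalized `order_items` table.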

### Second Normal Form (2NF)

**Requirements:**
- Must be in 1NF
- All non-key attributes must be fully functionally dependent on the primary key
- No partial dependencies (applies only to tables with composite primary keys)

**Violations and Solutions:**

**Problem: Partial dependency on a composite key**
```sql
-- BAD: Student course enrollment with partial dependencies
CREATE TABLE student_courses (
    student_id INT,
    course_id INT,
    student_name VARCHAR(100),  -- Depends only on student_id
    student_major VARCHAR(50),  -- Depends only on student_id
    course_title VARCHAR(200),  -- Depends only on course_id
    course_credits INT,         -- Depends only on course_id
    grade CHAR(2),              -- Depends on both student_id AND course_id
    PRIMARY KEY (student_id, course_id)
);

-- GOOD: Separate tables eliminate the partial dependencies
CREATE TABLE students (
    student_id INT PRIMARY KEY,
    student_name VARCHAR(100),
    student_major VARCHAR(50)
);

CREATE TABLE courses (
    course_id INT PRIMARY KEY,
    course_title VARCHAR(200),
    course_credits INT
);

CREATE TABLE enrollments (
    student_id INT,
    course_id INT,
    grade CHAR(2),
    enrollment_date DATE,
    PRIMARY KEY (student_id, course_id),
    FOREIGN KEY (student_id) REFERENCES students(student_id),
    FOREIGN KEY (course_id) REFERENCES courses(course_id)
);
```

### Third Normal Form (3NF)

**Requirements:**
- Must be in 2NF
- No transitive dependencies (non-key attributes must not depend on other non-key attributes)
- All non-key attributes must depend directly on the primary key

**Violations and Solutions:**

**Problem: Transitive dependency**
```sql
-- BAD: Employee table with transitive dependencies
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    employee_name VARCHAR(100),
    department_id INT,
    department_name VARCHAR(100),     -- Depends on department_id, not employee_id
    department_location VARCHAR(100), -- Transitive dependency through department_id
    department_budget DECIMAL(10,2),  -- Transitive dependency through department_id
    salary DECIMAL(8,2)
);

-- GOOD: Separate department information
CREATE TABLE departments (
    department_id INT PRIMARY KEY,
    department_name VARCHAR(100),
    department_location VARCHAR(100),
    department_budget DECIMAL(10,2)
);

CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    employee_name VARCHAR(100),
    department_id INT,
    salary DECIMAL(8,2),
    FOREIGN KEY (department_id) REFERENCES departments(department_id)
);
```

### Boyce-Codd Normal Form (BCNF)

**Requirements:**
- Must be in 3NF
- Every determinant must be a candidate key
- Stricter than 3NF - handles cases where 3NF doesn't eliminate all anomalies

**Violations and Solutions:**

**Problem: A determinant that is not a candidate key**
```sql
-- BAD: Student advisor relationship with a BCNF violation
-- Assumption: each student has one advisor per subject;
-- each advisor teaches only one subject but can advise multiple students
CREATE TABLE student_advisor (
    student_id INT,
    subject VARCHAR(50),
    advisor_id INT,
    PRIMARY KEY (student_id, subject)
);
-- Problem: advisor_id determines subject, but advisor_id is not a candidate key

-- GOOD: Separate the functional dependencies
CREATE TABLE advisors (
    advisor_id INT PRIMARY KEY,
    subject VARCHAR(50)
);

CREATE TABLE student_advisor_assignments (
    student_id INT,
    advisor_id INT,
    PRIMARY KEY (student_id, advisor_id),
    FOREIGN KEY (advisor_id) REFERENCES advisors(advisor_id)
);
```

## Denormalization Strategies

### When to Denormalize

1. **Performance Requirements**: When query performance is more critical than storage efficiency
2. **Read-Heavy Workloads**: When data is read much more frequently than it's updated
3. **Reporting Systems**: When complex joins negatively impact reporting performance
4. **Caching Strategies**: When pre-computed values eliminate expensive calculations

### Common Denormalization Patterns

**1. Redundant Storage for Performance**
```sql
-- Store frequently accessed calculated values
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    order_total DECIMAL(10,2), -- Denormalized: sum of order_items.total
    item_count INT,            -- Denormalized: count of order_items
    created_at TIMESTAMP
);

CREATE TABLE order_items (
    item_id INT PRIMARY KEY,
    order_id INT,
    product_id INT,
    quantity INT,
    unit_price DECIMAL(8,2),
    total DECIMAL(10,2) -- quantity * unit_price (denormalized)
);
```

**2. Materialized Aggregates**
```sql
-- Pre-computed summary tables for reporting
CREATE TABLE monthly_sales_summary (
    year_month VARCHAR(7), -- '2024-03'
    product_category VARCHAR(50),
    total_sales DECIMAL(12,2),
    total_units INT,
    avg_order_value DECIMAL(8,2),
    unique_customers INT,
    updated_at TIMESTAMP
);
```

**3. Historical Data Snapshots**
```sql
-- Store historical state to avoid complex temporal queries
CREATE TABLE customer_status_history (
    id INT PRIMARY KEY,
    customer_id INT,
    status VARCHAR(20),
    tier VARCHAR(10),
    total_lifetime_value DECIMAL(12,2), -- Snapshot at this point in time
    snapshot_date DATE
);
```

## Trade-offs Analysis

### Normalization Benefits
- **Data Integrity**: Reduced risk of inconsistent data
- **Storage Efficiency**: Less data duplication
- **Update Efficiency**: Changes need to be made in only one place
- **Flexibility**: Easier to modify the schema as requirements change

### Normalization Costs
- **Query Complexity**: More joins required for data retrieval
- **Performance Impact**: Joins can be expensive on large datasets
- **Development Complexity**: More complex data access patterns

### Denormalization Benefits
- **Query Performance**: Fewer joins, faster queries
- **Simplified Queries**: Direct access to related data
- **Read Optimization**: Optimized for data retrieval patterns
- **Reduced Load**: Less database processing for common operations

### Denormalization Costs
- **Data Redundancy**: Increased storage requirements
- **Update Complexity**: Multiple places may need updates
- **Consistency Risk**: Higher risk of data inconsistencies
- **Maintenance Overhead**: Additional code to maintain derived values

## Best Practices

### 1. Start with Full Normalization
- Begin with a fully normalized design
- Identify performance bottlenecks through testing
- Selectively denormalize based on actual performance needs

### 2. Use Triggers for Consistency
```sql
-- PostgreSQL: trigger function to maintain the denormalized order_total.
-- TG_OP is checked because NEW is not defined for DELETE (nor OLD for INSERT).
CREATE FUNCTION refresh_order_total() RETURNS trigger AS $$
DECLARE
    affected_order INT;
BEGIN
    IF TG_OP = 'DELETE' THEN
        affected_order := OLD.order_id;
    ELSE
        affected_order := NEW.order_id;
    END IF;

    UPDATE orders
    SET order_total = (
        SELECT COALESCE(SUM(quantity * unit_price), 0)
        FROM order_items
        WHERE order_id = affected_order
    )
    WHERE order_id = affected_order;
    RETURN NULL; -- return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER update_order_total
AFTER INSERT OR UPDATE OR DELETE ON order_items
FOR EACH ROW EXECUTE FUNCTION refresh_order_total();
```
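
The same idea can be tested quickly in SQLite, which needs one trigger per operation; here is the INSERT case, with the schema trimmed to just what the trigger touches:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_total REAL DEFAULT 0);
    CREATE TABLE order_items (order_id INTEGER, quantity INTEGER, unit_price REAL);

    -- Maintain the denormalized orders.order_total on every insert
    CREATE TRIGGER order_items_ai AFTER INSERT ON order_items
    BEGIN
        UPDATE orders
        SET order_total = (SELECT COALESCE(SUM(quantity * unit_price), 0)
                           FROM order_items WHERE order_id = NEW.order_id)
        WHERE order_id = NEW.order_id;
    END;
""")
con.execute("INSERT INTO orders (order_id) VALUES (1)")
con.execute("INSERT INTO order_items VALUES (1, 2, 10.0)")
con.execute("INSERT INTO order_items VALUES (1, 1, 5.0)")
total = con.execute(
    "SELECT order_total FROM orders WHERE order_id = 1").fetchone()[0]
print(total)  # → 25.0
```

A complete setup would add matching AFTER UPDATE and AFTER DELETE triggers.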

### 3. Consider Materialized Views
```sql
-- Materialized view for complex aggregations (PostgreSQL)
CREATE MATERIALIZED VIEW customer_summary AS
SELECT
    c.customer_id,
    c.customer_name,
    COUNT(o.order_id) AS order_count,
    SUM(o.order_total) AS lifetime_value,
    AVG(o.order_total) AS avg_order_value,
    MAX(o.created_at) AS last_order_date
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name;
```

### 4. Document Denormalization Decisions
- Clearly document why denormalization was chosen
- Specify which data is derived and how it's maintained
- Include performance benchmarks that justify the decision

### 5. Monitor and Validate
- Implement validation checks for denormalized data
- Run regular audits to ensure data consistency
- Monitor performance to validate the denormalization benefits

## Common Anti-Patterns

### 1. Premature Denormalization
Starting with a denormalized design without understanding actual performance requirements.

### 2. Over-Normalization
Creating too many small tables that require excessive joins for simple queries.

### 3. Inconsistent Approach
Mixing normalized and denormalized patterns without a clear strategy.

### 4. Ignoring Maintenance
Denormalizing without proper mechanisms to maintain data consistency.

## Conclusion

Normalization and denormalization are both valuable tools in database design. The key is understanding when to apply each approach:

- **Use normalization** for transactional systems where data integrity is paramount
- **Consider denormalization** for analytical systems, or when performance testing reveals bottlenecks
- **Apply selectively** based on actual usage patterns and performance requirements
- **Maintain consistency** through proper design patterns and validation mechanisms

The goal is not to achieve perfect normalization or denormalization, but to create a design that best serves your application's specific needs while maintaining data quality and system performance.

@@ -0,0 +1,982 @@
#!/usr/bin/env python3
|
||||
"""
|
||||
Database Schema Analyzer
|
||||
|
||||
Analyzes SQL DDL statements and JSON schema definitions for:
|
||||
- Normalization level compliance (1NF-BCNF)
|
||||
- Missing constraints (FK, NOT NULL, UNIQUE)
|
||||
- Data type issues and antipatterns
|
||||
- Naming convention violations
|
||||
- Missing indexes on foreign key columns
|
||||
- Table relationship mapping
|
||||
- Generates Mermaid ERD diagrams
|
||||
|
||||
Input: SQL DDL file or JSON schema definition
|
||||
Output: Analysis report + Mermaid ERD + recommendations
|
||||
|
||||
Usage:
|
||||
python schema_analyzer.py --input schema.sql --output-format json
|
||||
python schema_analyzer.py --input schema.json --output-format text
|
||||
python schema_analyzer.py --input schema.sql --generate-erd --output analysis.json
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import re
|
||||
import sys
|
||||
from collections import defaultdict, namedtuple
|
||||
from typing import Dict, List, Set, Tuple, Optional, Any
|
||||
from dataclasses import dataclass, asdict
|
||||
|
||||
|
||||
@dataclass
|
||||
class Column:
|
||||
name: str
|
||||
data_type: str
|
||||
nullable: bool = True
|
||||
primary_key: bool = False
|
||||
unique: bool = False
|
||||
foreign_key: Optional[str] = None
|
||||
default_value: Optional[str] = None
|
||||
check_constraint: Optional[str] = None
|
||||
|
||||
|
||||
@dataclass
|
||||
class Index:
|
||||
name: str
|
||||
table: str
|
||||
columns: List[str]
|
||||
unique: bool = False
|
||||
index_type: str = "btree"
|
||||
|
||||
|
||||
@dataclass
|
||||
class Table:
|
||||
name: str
|
||||
columns: List[Column]
|
||||
primary_key: List[str]
|
||||
foreign_keys: List[Tuple[str, str]] # (column, referenced_table.column)
|
||||
unique_constraints: List[List[str]]
|
||||
check_constraints: Dict[str, str]
|
||||
indexes: List[Index]
|
||||
|
||||
|
||||
@dataclass
|
||||
class NormalizationIssue:
|
||||
table: str
|
||||
issue_type: str
|
||||
severity: str
|
||||
description: str
|
||||
suggestion: str
|
||||
columns_affected: List[str]
|
||||
|
||||
|
||||
@dataclass
|
||||
class DataTypeIssue:
|
||||
table: str
|
||||
column: str
|
||||
current_type: str
|
||||
issue: str
|
||||
suggested_type: str
|
||||
rationale: str
|
||||
|
||||
|
||||
@dataclass
|
||||
class ConstraintIssue:
|
||||
table: str
|
||||
issue_type: str
|
||||
severity: str
|
||||
description: str
|
||||
suggestion: str
|
||||
columns_affected: List[str]
|
||||
|
||||
|
||||
@dataclass
|
||||
class NamingIssue:
|
||||
table: str
|
||||
column: Optional[str]
|
||||
issue: str
|
||||
current_name: str
|
||||
suggested_name: str
|
||||
|
||||
|
||||
class SchemaAnalyzer:
    def __init__(self):
        self.tables: Dict[str, Table] = {}
        self.normalization_issues: List[NormalizationIssue] = []
        self.datatype_issues: List[DataTypeIssue] = []
        self.constraint_issues: List[ConstraintIssue] = []
        self.naming_issues: List[NamingIssue] = []

        # Data type antipatterns
        self.varchar_255_pattern = re.compile(r'VARCHAR\(255\)', re.IGNORECASE)
        self.bad_datetime_patterns = [
            re.compile(r'VARCHAR\(\d+\)', re.IGNORECASE),
            re.compile(r'CHAR\(\d+\)', re.IGNORECASE)
        ]

        # Naming conventions
        self.table_naming_pattern = re.compile(r'^[a-z][a-z0-9_]*[a-z0-9]$')
        self.column_naming_pattern = re.compile(r'^[a-z][a-z0-9_]*[a-z0-9]$')

    def parse_sql_ddl(self, ddl_content: str) -> None:
        """Parse SQL DDL statements and extract schema information."""
        # Remove comments and normalize whitespace
        ddl_content = re.sub(r'--.*$', '', ddl_content, flags=re.MULTILINE)
        ddl_content = re.sub(r'/\*.*?\*/', '', ddl_content, flags=re.DOTALL)
        ddl_content = re.sub(r'\s+', ' ', ddl_content.strip())

        # Extract CREATE TABLE statements. Match through the closing ');' so
        # sizes like VARCHAR(255) inside the definition do not end the match
        # early (assumes semicolon-terminated statements).
        create_table_pattern = re.compile(
            r'CREATE\s+TABLE\s+(\w+)\s*\(\s*(.*?)\s*\)\s*;',
            re.IGNORECASE | re.DOTALL
        )

        for match in create_table_pattern.finditer(ddl_content):
            table_name = match.group(1).lower()
            table_definition = match.group(2)

            table = self._parse_table_definition(table_name, table_definition)
            self.tables[table_name] = table

        # Extract CREATE INDEX statements
        self._parse_indexes(ddl_content)

    def _parse_table_definition(self, table_name: str, definition: str) -> Table:
        """Parse individual table definition."""
        columns = []
        primary_key = []
        foreign_keys = []
        unique_constraints = []
        check_constraints = {}

        # Split by commas, but handle nested parentheses
        parts = self._split_table_parts(definition)

        for part in parts:
            part = part.strip()
            if not part:
                continue

            if part.upper().startswith('PRIMARY KEY'):
                primary_key = self._parse_primary_key(part)
            elif part.upper().startswith('FOREIGN KEY'):
                fk = self._parse_foreign_key(part)
                if fk:
                    foreign_keys.append(fk)
            elif part.upper().startswith('UNIQUE'):
                unique = self._parse_unique_constraint(part)
                if unique:
                    unique_constraints.append(unique)
            elif part.upper().startswith('CHECK'):
                check = self._parse_check_constraint(part)
                if check:
                    check_constraints.update(check)
            else:
                # Column definition
                column = self._parse_column_definition(part)
                if column:
                    columns.append(column)
                    if column.primary_key:
                        primary_key.append(column.name)

        return Table(
            name=table_name,
            columns=columns,
            primary_key=primary_key,
            foreign_keys=foreign_keys,
            unique_constraints=unique_constraints,
            check_constraints=check_constraints,
            indexes=[]
        )

    def _split_table_parts(self, definition: str) -> List[str]:
        """Split table definition by commas, respecting nested parentheses."""
        parts = []
        current_part = ""
        paren_count = 0

        for char in definition:
            if char == '(':
                paren_count += 1
            elif char == ')':
                paren_count -= 1
            elif char == ',' and paren_count == 0:
                parts.append(current_part.strip())
                current_part = ""
                continue

            current_part += char

        if current_part.strip():
            parts.append(current_part.strip())

        return parts

    def _parse_column_definition(self, definition: str) -> Optional[Column]:
        """Parse individual column definition."""
        # Pattern: column name, data type (with optional size), remaining constraints
        pattern = re.compile(
            r'(\w+)\s+([A-Z]+(?:\(\d+(?:,\d+)?\))?)\s*(.*)',
            re.IGNORECASE
        )

        match = pattern.match(definition.strip())
        if not match:
            return None

        column_name = match.group(1).lower()
        data_type = match.group(2).upper()
        constraints = match.group(3).upper() if match.group(3) else ""

        column = Column(
            name=column_name,
            data_type=data_type,
            nullable='NOT NULL' not in constraints,
            primary_key='PRIMARY KEY' in constraints,
            unique='UNIQUE' in constraints
        )

        # Parse foreign key reference
        fk_pattern = re.compile(r'REFERENCES\s+(\w+)\s*\(\s*(\w+)\s*\)', re.IGNORECASE)
        fk_match = fk_pattern.search(constraints)
        if fk_match:
            column.foreign_key = f"{fk_match.group(1).lower()}.{fk_match.group(2).lower()}"

        # Parse default value
        default_pattern = re.compile(r'DEFAULT\s+([^,\s]+)', re.IGNORECASE)
        default_match = default_pattern.search(constraints)
        if default_match:
            column.default_value = default_match.group(1)

        return column

    def _parse_primary_key(self, definition: str) -> List[str]:
        """Parse PRIMARY KEY constraint."""
        pattern = re.compile(r'PRIMARY\s+KEY\s*\(\s*(.*?)\s*\)', re.IGNORECASE)
        match = pattern.search(definition)
        if match:
            columns = [col.strip().lower() for col in match.group(1).split(',')]
            return columns
        return []

    def _parse_foreign_key(self, definition: str) -> Optional[Tuple[str, str]]:
        """Parse FOREIGN KEY constraint."""
        pattern = re.compile(
            r'FOREIGN\s+KEY\s*\(\s*(\w+)\s*\)\s+REFERENCES\s+(\w+)\s*\(\s*(\w+)\s*\)',
            re.IGNORECASE
        )
        match = pattern.search(definition)
        if match:
            column = match.group(1).lower()
            ref_table = match.group(2).lower()
            ref_column = match.group(3).lower()
            return (column, f"{ref_table}.{ref_column}")
        return None

    def _parse_unique_constraint(self, definition: str) -> Optional[List[str]]:
        """Parse UNIQUE constraint."""
        pattern = re.compile(r'UNIQUE\s*\(\s*(.*?)\s*\)', re.IGNORECASE)
        match = pattern.search(definition)
        if match:
            columns = [col.strip().lower() for col in match.group(1).split(',')]
            return columns
        return None

    def _parse_check_constraint(self, definition: str) -> Optional[Dict[str, str]]:
        """Parse CHECK constraint."""
        pattern = re.compile(r'CHECK\s*\(\s*(.*?)\s*\)', re.IGNORECASE)
        match = pattern.search(definition)
        if match:
            constraint_name = f"check_constraint_{len(self.tables)}"
            return {constraint_name: match.group(1)}
        return None

    def _parse_indexes(self, ddl_content: str) -> None:
        """Parse CREATE INDEX statements."""
        index_pattern = re.compile(
            r'CREATE\s+(?:(UNIQUE)\s+)?INDEX\s+(\w+)\s+ON\s+(\w+)\s*\(\s*(.*?)\s*\)',
            re.IGNORECASE
        )

        for match in index_pattern.finditer(ddl_content):
            unique = match.group(1) is not None
            index_name = match.group(2).lower()
            table_name = match.group(3).lower()
            columns_str = match.group(4)

            columns = [col.strip().lower() for col in columns_str.split(',')]

            index = Index(
                name=index_name,
                table=table_name,
                columns=columns,
                unique=unique
            )

            if table_name in self.tables:
                self.tables[table_name].indexes.append(index)

    def parse_json_schema(self, json_content: str) -> None:
        """Parse JSON schema definition."""
        try:
            schema = json.loads(json_content)

            if 'tables' not in schema:
                raise ValueError("JSON schema must contain 'tables' key")

            for table_name, table_def in schema['tables'].items():
                table = self._parse_json_table(table_name.lower(), table_def)
                self.tables[table_name.lower()] = table

        except json.JSONDecodeError as e:
            raise ValueError(f"Invalid JSON: {e}")

    def _parse_json_table(self, table_name: str, table_def: Dict[str, Any]) -> Table:
        """Parse JSON table definition."""
        columns = []
        primary_key = table_def.get('primary_key', [])
        foreign_keys = []
        unique_constraints = table_def.get('unique_constraints', [])
        check_constraints = table_def.get('check_constraints', {})

        for col_name, col_def in table_def.get('columns', {}).items():
            column = Column(
                name=col_name.lower(),
                data_type=col_def.get('type', 'VARCHAR(255)').upper(),
                nullable=col_def.get('nullable', True),
                primary_key=col_name.lower() in [pk.lower() for pk in primary_key],
                unique=col_def.get('unique', False),
                foreign_key=col_def.get('foreign_key'),
                default_value=col_def.get('default')
            )

            columns.append(column)

            if column.foreign_key:
                foreign_keys.append((column.name, column.foreign_key))

        return Table(
            name=table_name,
            columns=columns,
            primary_key=[pk.lower() for pk in primary_key],
            foreign_keys=foreign_keys,
            unique_constraints=unique_constraints,
            check_constraints=check_constraints,
            indexes=[]
        )

    def analyze_normalization(self) -> None:
        """Analyze normalization compliance."""
        for table_name, table in self.tables.items():
            self._check_first_normal_form(table)
            self._check_second_normal_form(table)
            self._check_third_normal_form(table)
            self._check_bcnf(table)

    def _check_first_normal_form(self, table: Table) -> None:
        """Check First Normal Form compliance."""
        # Check for atomic values (no arrays or delimited strings)
        for column in table.columns:
            if any(pattern in column.data_type.upper() for pattern in ['ARRAY', 'JSON', 'TEXT']):
                if 'JSON' in column.data_type.upper():
                    # JSON columns can violate 1NF if storing arrays
                    self.normalization_issues.append(NormalizationIssue(
                        table=table.name,
                        issue_type="1NF_VIOLATION",
                        severity="WARNING",
                        description=f"Column '{column.name}' uses JSON type which may contain non-atomic values",
                        suggestion="Consider normalizing JSON arrays into separate tables",
                        columns_affected=[column.name]
                    ))

            # Check for potential delimited values in VARCHAR/TEXT
            if column.data_type.upper().startswith(('VARCHAR', 'CHAR', 'TEXT')):
                if any(delimiter in column.name.lower() for delimiter in ['list', 'array', 'tags', 'items']):
                    self.normalization_issues.append(NormalizationIssue(
                        table=table.name,
                        issue_type="1NF_VIOLATION",
                        severity="HIGH",
                        description=f"Column '{column.name}' appears to store delimited values",
                        suggestion="Create separate table for individual values with foreign key relationship",
                        columns_affected=[column.name]
                    ))

    def _check_second_normal_form(self, table: Table) -> None:
        """Check Second Normal Form compliance."""
        if len(table.primary_key) <= 1:
            return  # 2NF only applies to tables with composite primary keys

        # Look for potential partial dependencies
        non_key_columns = [col for col in table.columns if col.name not in table.primary_key]

        for column in non_key_columns:
            # Heuristic: columns that seem related to only part of the composite key
            for pk_part in table.primary_key:
                if pk_part in column.name or column.name.startswith(pk_part.split('_')[0]):
                    self.normalization_issues.append(NormalizationIssue(
                        table=table.name,
                        issue_type="2NF_VIOLATION",
                        severity="MEDIUM",
                        description=f"Column '{column.name}' may have partial dependency on '{pk_part}'",
                        suggestion=f"Consider moving '{column.name}' to a separate table related to '{pk_part}'",
                        columns_affected=[column.name, pk_part]
                    ))
                    break

    def _check_third_normal_form(self, table: Table) -> None:
        """Check Third Normal Form compliance."""
        # Look for transitive dependencies
        non_key_columns = [col for col in table.columns if col.name not in table.primary_key]

        # Group columns by potential entities they describe
        entity_groups = defaultdict(list)
        for column in non_key_columns:
            # Simple heuristic: group by prefix before underscore
            prefix = column.name.split('_')[0]
            if prefix != column.name:  # Has underscore
                entity_groups[prefix].append(column.name)

        for entity, columns in entity_groups.items():
            if len(columns) > 1 and entity != table.name.split('_')[0]:
                # Potential entity that should be in its own table
                id_column = f"{entity}_id"
                if id_column in [col.name for col in table.columns]:
                    self.normalization_issues.append(NormalizationIssue(
                        table=table.name,
                        issue_type="3NF_VIOLATION",
                        severity="MEDIUM",
                        description=f"Columns {columns} may have transitive dependency through '{id_column}'",
                        suggestion=f"Consider creating separate '{entity}' table with these columns",
                        columns_affected=columns + [id_column]
                    ))

    def _check_bcnf(self, table: Table) -> None:
        """Check Boyce-Codd Normal Form compliance."""
        # BCNF violations are complex to detect without functional dependencies.
        # Provide general guidance for composite keys.
        if len(table.primary_key) > 2:
            self.normalization_issues.append(NormalizationIssue(
                table=table.name,
                issue_type="BCNF_WARNING",
                severity="LOW",
                description=f"Table has composite primary key with {len(table.primary_key)} columns",
                suggestion="Review functional dependencies to ensure BCNF compliance",
                columns_affected=table.primary_key
            ))

    def analyze_data_types(self) -> None:
        """Analyze data type usage for antipatterns."""
        for table_name, table in self.tables.items():
            for column in table.columns:
                self._check_varchar_255_antipattern(table.name, column)
                self._check_inappropriate_types(table.name, column)
                self._check_size_optimization(table.name, column)

    def _check_varchar_255_antipattern(self, table_name: str, column: Column) -> None:
        """Check for VARCHAR(255) antipattern."""
        if self.varchar_255_pattern.match(column.data_type):
            self.datatype_issues.append(DataTypeIssue(
                table=table_name,
                column=column.name,
                current_type=column.data_type,
                issue="VARCHAR(255) antipattern",
                suggested_type="Appropriately sized VARCHAR or TEXT",
                rationale="VARCHAR(255) is often used as default without considering actual data length requirements"
            ))

    def _check_inappropriate_types(self, table_name: str, column: Column) -> None:
        """Check for inappropriate data types."""
        # Date/time stored as string
        if column.name.lower() in ['date', 'time', 'created', 'updated', 'modified', 'timestamp']:
            if column.data_type.upper().startswith(('VARCHAR', 'CHAR', 'TEXT')):
                self.datatype_issues.append(DataTypeIssue(
                    table=table_name,
                    column=column.name,
                    current_type=column.data_type,
                    issue="Date/time stored as string",
                    suggested_type="TIMESTAMP, DATE, or TIME",
                    rationale="Proper date/time types enable date arithmetic and indexing optimization"
                ))

        # Boolean stored as string/integer
        if column.name.lower() in ['active', 'enabled', 'deleted', 'visible', 'published']:
            if not column.data_type.upper().startswith('BOOL'):
                self.datatype_issues.append(DataTypeIssue(
                    table=table_name,
                    column=column.name,
                    current_type=column.data_type,
                    issue="Boolean value stored as non-boolean type",
                    suggested_type="BOOLEAN",
                    rationale="Boolean type is more explicit and can be more storage efficient"
                ))

        # Numeric IDs as VARCHAR
        if column.name.lower().endswith('_id') or column.name.lower() == 'id':
            if column.data_type.upper().startswith(('VARCHAR', 'CHAR')):
                self.datatype_issues.append(DataTypeIssue(
                    table=table_name,
                    column=column.name,
                    current_type=column.data_type,
                    issue="Numeric ID stored as string",
                    suggested_type="INTEGER, BIGINT, or UUID",
                    rationale="Numeric types are more efficient for ID columns and enable better indexing"
                ))

    def _check_size_optimization(self, table_name: str, column: Column) -> None:
        """Check for size optimization opportunities."""
        # Oversized integer types
        if column.data_type.upper() == 'BIGINT':
            if not any(keyword in column.name.lower() for keyword in ['timestamp', 'big', 'large', 'count']):
                self.datatype_issues.append(DataTypeIssue(
                    table=table_name,
                    column=column.name,
                    current_type=column.data_type,
                    issue="Potentially oversized integer type",
                    suggested_type="INTEGER",
                    rationale="INTEGER is sufficient for most ID and count fields unless very large values are expected"
                ))

    def analyze_constraints(self) -> None:
        """Analyze missing constraints."""
        for table_name, table in self.tables.items():
            self._check_missing_primary_key(table)
            self._check_missing_foreign_key_constraints(table)
            self._check_missing_not_null_constraints(table)
            self._check_missing_unique_constraints(table)
            self._check_missing_check_constraints(table)

    def _check_missing_primary_key(self, table: Table) -> None:
        """Check for missing primary key."""
        if not table.primary_key:
            self.constraint_issues.append(ConstraintIssue(
                table=table.name,
                issue_type="MISSING_PRIMARY_KEY",
                severity="HIGH",
                description="Table has no primary key defined",
                suggestion="Add a primary key column (e.g., 'id' with auto-increment)",
                columns_affected=[]
            ))

    def _check_missing_foreign_key_constraints(self, table: Table) -> None:
        """Check for missing foreign key constraints."""
        for column in table.columns:
            if column.name.endswith('_id') and column.name != 'id':
                # Potential foreign key column
                if not column.foreign_key:
                    referenced_table = column.name[:-3]  # Remove '_id' suffix
                    if referenced_table in self.tables or referenced_table + 's' in self.tables:
                        self.constraint_issues.append(ConstraintIssue(
                            table=table.name,
                            issue_type="MISSING_FOREIGN_KEY",
                            severity="MEDIUM",
                            description=f"Column '{column.name}' appears to be a foreign key but has no constraint",
                            suggestion=f"Add foreign key constraint referencing {referenced_table} table",
                            columns_affected=[column.name]
                        ))

    def _check_missing_not_null_constraints(self, table: Table) -> None:
        """Check for missing NOT NULL constraints."""
        for column in table.columns:
            if column.nullable and column.name in ['email', 'name', 'title', 'status']:
                self.constraint_issues.append(ConstraintIssue(
                    table=table.name,
                    issue_type="MISSING_NOT_NULL",
                    severity="LOW",
                    description=f"Column '{column.name}' allows NULL but typically should not",
                    suggestion=f"Consider adding NOT NULL constraint to '{column.name}'",
                    columns_affected=[column.name]
                ))

    def _check_missing_unique_constraints(self, table: Table) -> None:
        """Check for missing unique constraints."""
        for column in table.columns:
            if column.name in ['email', 'username', 'slug', 'code'] and not column.unique:
                if column.name not in table.primary_key:
                    self.constraint_issues.append(ConstraintIssue(
                        table=table.name,
                        issue_type="MISSING_UNIQUE",
                        severity="MEDIUM",
                        description=f"Column '{column.name}' should likely have UNIQUE constraint",
                        suggestion=f"Add UNIQUE constraint to '{column.name}'",
                        columns_affected=[column.name]
                    ))

    def _check_missing_check_constraints(self, table: Table) -> None:
        """Check for missing check constraints."""
        for column in table.columns:
            # Email format validation
            if column.name == 'email' and 'email' not in str(table.check_constraints):
                self.constraint_issues.append(ConstraintIssue(
                    table=table.name,
                    issue_type="MISSING_CHECK_CONSTRAINT",
                    severity="LOW",
                    description="Email column lacks format validation",
                    suggestion="Add CHECK constraint for email format validation",
                    columns_affected=[column.name]
                ))

            # Positive values for counts, prices, etc.
            if column.name.lower() in ['price', 'amount', 'count', 'quantity', 'age']:
                if column.name not in str(table.check_constraints):
                    self.constraint_issues.append(ConstraintIssue(
                        table=table.name,
                        issue_type="MISSING_CHECK_CONSTRAINT",
                        severity="LOW",
                        description=f"Column '{column.name}' should validate positive values",
                        suggestion=f"Add CHECK constraint: {column.name} > 0",
                        columns_affected=[column.name]
                    ))

    def analyze_naming_conventions(self) -> None:
        """Analyze naming convention compliance."""
        for table_name, table in self.tables.items():
            self._check_table_naming(table_name)
            for column in table.columns:
                self._check_column_naming(table_name, column.name)

    def _check_table_naming(self, table_name: str) -> None:
        """Check table naming conventions."""
        if not self.table_naming_pattern.match(table_name):
            suggested_name = self._suggest_table_name(table_name)
            self.naming_issues.append(NamingIssue(
                table=table_name,
                column=None,
                issue="Invalid table naming convention",
                current_name=table_name,
                suggested_name=suggested_name
            ))

        # Check for plural naming
        if not table_name.endswith('s') and table_name not in ['data', 'information']:
            self.naming_issues.append(NamingIssue(
                table=table_name,
                column=None,
                issue="Table name should be plural",
                current_name=table_name,
                suggested_name=table_name + 's'
            ))

    def _check_column_naming(self, table_name: str, column_name: str) -> None:
        """Check column naming conventions."""
        if not self.column_naming_pattern.match(column_name):
            suggested_name = self._suggest_column_name(column_name)
            self.naming_issues.append(NamingIssue(
                table=table_name,
                column=column_name,
                issue="Invalid column naming convention",
                current_name=column_name,
                suggested_name=suggested_name
            ))

    def _suggest_table_name(self, table_name: str) -> str:
        """Suggest corrected table name."""
        # Convert to snake_case and make plural
        name = re.sub(r'([A-Z])', r'_\1', table_name).lower().strip('_')
        return name + 's' if not name.endswith('s') else name

    def _suggest_column_name(self, column_name: str) -> str:
        """Suggest corrected column name."""
        # Convert to snake_case
        return re.sub(r'([A-Z])', r'_\1', column_name).lower().strip('_')

    def check_missing_indexes(self) -> List[Dict[str, Any]]:
        """Check for missing indexes on foreign key columns."""
        missing_indexes = []

        for table_name, table in self.tables.items():
            existing_indexed_columns = set()

            # Collect existing indexed columns
            for index in table.indexes:
                existing_indexed_columns.update(index.columns)

            # Primary key columns are automatically indexed
            existing_indexed_columns.update(table.primary_key)

            # Check foreign key columns
            for column in table.columns:
                if column.foreign_key and column.name not in existing_indexed_columns:
                    missing_indexes.append({
                        'table': table_name,
                        'column': column.name,
                        'type': 'foreign_key',
                        'suggestion': f"CREATE INDEX idx_{table_name}_{column.name} ON {table_name} ({column.name});"
                    })

        return missing_indexes

    def generate_mermaid_erd(self) -> str:
        """Generate Mermaid ERD diagram."""
        erd_lines = ["erDiagram"]

        # Add table definitions
        for table_name, table in self.tables.items():
            erd_lines.append(f"    {table_name.upper()} {{")

            for column in table.columns:
                # Mermaid attribute types cannot contain parentheses, so strip
                # size specifiers, e.g. VARCHAR(255) -> VARCHAR
                data_type = re.sub(r'\(.*?\)', '', column.data_type)
                constraints = []

                if column.primary_key:
                    constraints.append("PK")
                if column.foreign_key:
                    constraints.append("FK")
                if not column.nullable:
                    constraints.append("NOT NULL")
                if column.unique:
                    constraints.append("UNIQUE")

                constraint_str = " ".join(constraints)
                if constraint_str:
                    constraint_str = f" \"{constraint_str}\""

                erd_lines.append(f"        {data_type} {column.name}{constraint_str}")

            erd_lines.append("    }")

        # Add relationships
        relationships = set()
        for table_name, table in self.tables.items():
            for column in table.columns:
                if column.foreign_key:
                    ref_table = column.foreign_key.split('.')[0]
                    if ref_table in self.tables:
                        relationship = f"    {ref_table.upper()} ||--o{{ {table_name.upper()} : has"
                        relationships.add(relationship)

        erd_lines.extend(sorted(relationships))

        return "\n".join(erd_lines)

    def get_analysis_summary(self) -> Dict[str, Any]:
        """Get comprehensive analysis summary."""
        return {
            "schema_overview": {
                "total_tables": len(self.tables),
                "total_columns": sum(len(table.columns) for table in self.tables.values()),
                "tables_with_primary_keys": len([t for t in self.tables.values() if t.primary_key]),
                "total_foreign_keys": sum(len(table.foreign_keys) for table in self.tables.values()),
                "total_indexes": sum(len(table.indexes) for table in self.tables.values())
            },
            "normalization_analysis": {
                "total_issues": len(self.normalization_issues),
                "by_severity": {
                    "high": len([i for i in self.normalization_issues if i.severity == "HIGH"]),
                    "medium": len([i for i in self.normalization_issues if i.severity == "MEDIUM"]),
                    "low": len([i for i in self.normalization_issues if i.severity == "LOW"]),
                    "warning": len([i for i in self.normalization_issues if i.severity == "WARNING"])
                },
                "issues": [asdict(issue) for issue in self.normalization_issues]
            },
            "data_type_analysis": {
                "total_issues": len(self.datatype_issues),
                "issues": [asdict(issue) for issue in self.datatype_issues]
            },
            "constraint_analysis": {
                "total_issues": len(self.constraint_issues),
                "by_severity": {
                    "high": len([i for i in self.constraint_issues if i.severity == "HIGH"]),
                    "medium": len([i for i in self.constraint_issues if i.severity == "MEDIUM"]),
                    "low": len([i for i in self.constraint_issues if i.severity == "LOW"])
                },
                "issues": [asdict(issue) for issue in self.constraint_issues]
            },
            "naming_analysis": {
                "total_issues": len(self.naming_issues),
                "issues": [asdict(issue) for issue in self.naming_issues]
            },
            "missing_indexes": self.check_missing_indexes(),
            "recommendations": self._generate_recommendations()
        }

    def _generate_recommendations(self) -> List[str]:
        """Generate high-level recommendations."""
        recommendations = []

        # High severity issues
        high_severity_issues = [
            i for i in self.normalization_issues + self.constraint_issues
            if i.severity == "HIGH"
        ]

        if high_severity_issues:
            recommendations.append(f"Address {len(high_severity_issues)} high-severity issues immediately")

        # Missing primary keys
        tables_without_pk = [name for name, table in self.tables.items() if not table.primary_key]
        if tables_without_pk:
            recommendations.append(f"Add primary keys to tables: {', '.join(tables_without_pk)}")

        # Data type improvements
        varchar_255_issues = [i for i in self.datatype_issues if "VARCHAR(255)" in i.issue]
        if varchar_255_issues:
            recommendations.append(f"Review {len(varchar_255_issues)} VARCHAR(255) columns for right-sizing")

        # Missing foreign keys
        missing_fks = [i for i in self.constraint_issues if i.issue_type == "MISSING_FOREIGN_KEY"]
        if missing_fks:
            recommendations.append(f"Consider adding {len(missing_fks)} foreign key constraints for referential integrity")

        # Normalization improvements
        normalization_issues_count = len(self.normalization_issues)
        if normalization_issues_count > 0:
            recommendations.append(f"Review {normalization_issues_count} normalization issues for schema optimization")

        return recommendations

    def format_text_report(self, analysis: Dict[str, Any]) -> str:
        """Format analysis as human-readable text report."""
        lines = []
        lines.append("DATABASE SCHEMA ANALYSIS REPORT")
        lines.append("=" * 50)
        lines.append("")

        # Overview
        overview = analysis["schema_overview"]
        lines.append("SCHEMA OVERVIEW")
        lines.append("-" * 15)
        lines.append(f"Total Tables: {overview['total_tables']}")
        lines.append(f"Total Columns: {overview['total_columns']}")
        lines.append(f"Tables with Primary Keys: {overview['tables_with_primary_keys']}")
        lines.append(f"Total Foreign Keys: {overview['total_foreign_keys']}")
        lines.append(f"Total Indexes: {overview['total_indexes']}")
        lines.append("")

        # Recommendations
        if analysis["recommendations"]:
            lines.append("KEY RECOMMENDATIONS")
            lines.append("-" * 18)
            for i, rec in enumerate(analysis["recommendations"], 1):
                lines.append(f"{i}. {rec}")
            lines.append("")

        # Normalization Issues
        norm_analysis = analysis["normalization_analysis"]
        if norm_analysis["total_issues"] > 0:
            lines.append(f"NORMALIZATION ISSUES ({norm_analysis['total_issues']} total)")
            lines.append("-" * 25)
            severity_counts = norm_analysis["by_severity"]
            lines.append(f"High: {severity_counts['high']}, Medium: {severity_counts['medium']}, "
                         f"Low: {severity_counts['low']}, Warning: {severity_counts['warning']}")
            lines.append("")

            for issue in norm_analysis["issues"][:5]:  # Show first 5
                lines.append(f"• {issue['table']}: {issue['description']}")
                lines.append(f"  Suggestion: {issue['suggestion']}")
                lines.append("")

        # Data Type Issues
        dt_analysis = analysis["data_type_analysis"]
        if dt_analysis["total_issues"] > 0:
            lines.append(f"DATA TYPE ISSUES ({dt_analysis['total_issues']} total)")
            lines.append("-" * 20)
            for issue in dt_analysis["issues"][:5]:  # Show first 5
                lines.append(f"• {issue['table']}.{issue['column']}: {issue['issue']}")
                lines.append(f"  Current: {issue['current_type']} → Suggested: {issue['suggested_type']}")
                lines.append(f"  Rationale: {issue['rationale']}")
                lines.append("")

        # Constraint Issues
        const_analysis = analysis["constraint_analysis"]
        if const_analysis["total_issues"] > 0:
            lines.append(f"CONSTRAINT ISSUES ({const_analysis['total_issues']} total)")
            lines.append("-" * 20)
            severity_counts = const_analysis["by_severity"]
            lines.append(f"High: {severity_counts['high']}, Medium: {severity_counts['medium']}, "
                         f"Low: {severity_counts['low']}")
            lines.append("")

            for issue in const_analysis["issues"][:5]:  # Show first 5
                lines.append(f"• {issue['table']}: {issue['description']}")
|
||||
lines.append(f" Suggestion: {issue['suggestion']}")
|
||||
lines.append("")
|
||||
|
||||
# Missing Indexes
|
||||
missing_idx = analysis["missing_indexes"]
|
||||
if missing_idx:
|
||||
lines.append(f"MISSING INDEXES ({len(missing_idx)} total)")
|
||||
lines.append("-" * 17)
|
||||
for idx in missing_idx[:5]: # Show first 5
|
||||
lines.append(f"• {idx['table']}.{idx['column']} ({idx['type']})")
|
||||
lines.append(f" SQL: {idx['suggestion']}")
|
||||
lines.append("")
|
||||
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def main():
    parser = argparse.ArgumentParser(description="Analyze database schema for design issues and generate ERD")
    parser.add_argument("--input", "-i", required=True, help="Input file (SQL DDL or JSON schema)")
    parser.add_argument("--output", "-o", help="Output file (default: stdout)")
    parser.add_argument("--output-format", "-f", choices=["json", "text"], default="text",
                        help="Output format")
    parser.add_argument("--generate-erd", "-e", action="store_true", help="Include Mermaid ERD in output")
    parser.add_argument("--erd-only", action="store_true", help="Output only the Mermaid ERD")

    args = parser.parse_args()

    try:
        # Read input file
        with open(args.input, 'r') as f:
            content = f.read()

        # Initialize analyzer
        analyzer = SchemaAnalyzer()

        # Parse input based on file extension
        if args.input.lower().endswith('.json'):
            analyzer.parse_json_schema(content)
        else:
            analyzer.parse_sql_ddl(content)

        if not analyzer.tables:
            print("Error: No tables found in input file", file=sys.stderr)
            return 1

        if args.erd_only:
            # Output only the ERD
            erd = analyzer.generate_mermaid_erd()
            if args.output:
                with open(args.output, 'w') as f:
                    f.write(erd)
            else:
                print(erd)
            return 0

        # Perform analysis
        analyzer.analyze_normalization()
        analyzer.analyze_data_types()
        analyzer.analyze_constraints()
        analyzer.analyze_naming_conventions()

        # Generate report
        analysis = analyzer.get_analysis_summary()

        if args.generate_erd:
            analysis["mermaid_erd"] = analyzer.generate_mermaid_erd()

        # Output results
        if args.output_format == "json":
            output = json.dumps(analysis, indent=2)
        else:
            output = analyzer.format_text_report(analysis)
            if args.generate_erd:
                output += "\n\nMERMAID ERD\n" + "=" * 11 + "\n"
                output += analysis["mermaid_erd"]

        if args.output:
            with open(args.output, 'w') as f:
                f.write(output)
        else:
            print(output)

        return 0

    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        return 1


if __name__ == "__main__":
    sys.exit(main())