2026-03-12 15:17:52 +07:00

Database Normalization Guide

Overview

Database normalization is the process of organizing data to minimize redundancy and to avoid update, insertion, and deletion anomalies. It works by decomposing tables so that each fact is stored in one place, which improves data integrity.
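To make "anomaly" concrete, here is a minimal sketch of an update anomaly using Python's built-in sqlite3 module and an in-memory database. The `orders_flat` table and its sample rows are hypothetical and exist only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical unnormalized table: the customer's email is repeated on every order
    CREATE TABLE orders_flat (
        order_id INTEGER PRIMARY KEY,
        customer_id INT,
        customer_email VARCHAR(100)
    );
    INSERT INTO orders_flat VALUES
        (1, 7, 'ada@example.com'),
        (2, 7, 'ada@example.com');
""")

# Update anomaly: the email is changed on one row only, the other copy is missed
conn.execute(
    "UPDATE orders_flat SET customer_email = 'ada@new.example' WHERE order_id = 1"
)

# The same customer now has two conflicting emails
emails_for_customer = [
    email
    for (email,) in conn.execute(
        "SELECT DISTINCT customer_email FROM orders_flat "
        "WHERE customer_id = 7 ORDER BY customer_email"
    )
]
print(emails_for_customer)
```

Storing the email once, in a separate customers table, would leave no second copy to drift out of sync.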

Normal Forms

First Normal Form (1NF)

Requirements:

Each column contains atomic (indivisible) values
Each column contains values of the same type
Each column has a unique name
The order in which rows and columns are stored carries no meaning

Violations and Solutions:

Problem: Multiple values in single column

-- BAD: Multiple phone numbers in one column
CREATE TABLE customers (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    phones VARCHAR(500)  -- "555-1234, 555-5678, 555-9012"
);

-- GOOD: Separate table for multiple phones
CREATE TABLE customers (
    id INT PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE customer_phones (
    id INT PRIMARY KEY,
    customer_id INT REFERENCES customers(id),
    phone VARCHAR(20),
    phone_type VARCHAR(10) -- 'mobile', 'home', 'work'
);
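The following sketch shows what the split buys at query time. It uses Python's built-in sqlite3 module with an in-memory database. The sample rows and the `find_customers_by_phone` helper are assumptions made for illustration, and SQLite is chosen only because it needs no server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        name VARCHAR(100)
    );
    CREATE TABLE customer_phones (
        id INTEGER PRIMARY KEY,
        customer_id INT REFERENCES customers(id),
        phone VARCHAR(20),
        phone_type VARCHAR(10)
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO customer_phones (customer_id, phone, phone_type) VALUES
        (1, '555-1234', 'mobile'),
        (1, '555-5678', 'home'),
        (2, '555-9012', 'work');
""")

def find_customers_by_phone(phone):
    """Exact-match lookup: no string splitting or LIKE patterns are needed."""
    rows = conn.execute(
        "SELECT c.name FROM customers c "
        "JOIN customer_phones p ON p.customer_id = c.id "
        "WHERE p.phone = ?",
        (phone,),
    ).fetchall()
    return [name for (name,) in rows]

print(find_customers_by_phone("555-5678"))
```

With the comma-separated `phones` column, the same lookup would need a pattern match over free text, which cannot use an ordinary index on the phone value.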

Problem: Repeating groups

-- BAD: Repeating column patterns
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    item1_name VARCHAR(100),
    item1_qty INT,
    item1_price DECIMAL(8,2),
    item2_name VARCHAR(100),
    item2_qty INT,
    item2_price DECIMAL(8,2),
    item3_name VARCHAR(100),
    item3_qty INT,
    item3_price DECIMAL(8,2)
);

-- GOOD: Separate table for order items
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    order_date DATE
);

CREATE TABLE order_items (
    id INT PRIMARY KEY,
    order_id INT REFERENCES orders(order_id),
    item_name VARCHAR(100),
    quantity INT,
    unit_price DECIMAL(8,2)
);
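As a rough illustration, the sketch below loads an order with four items, one more than the item1..item3 layout could hold, and totals it with a single aggregate. It assumes Python's sqlite3 module and made-up sample data, and `order_total` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INT,
        order_date DATE
    );
    CREATE TABLE order_items (
        id INTEGER PRIMARY KEY,
        order_id INT REFERENCES orders(order_id),
        item_name VARCHAR(100),
        quantity INT,
        unit_price DECIMAL(8,2)
    );
    INSERT INTO orders VALUES (1, 42, '2024-03-01');
    INSERT INTO order_items (order_id, item_name, quantity, unit_price) VALUES
        (1, 'widget', 2, 5.00),
        (1, 'gadget', 1, 12.50),
        (1, 'gizmo', 4, 1.25),
        (1, 'doohickey', 1, 3.00);
""")

def order_total(order_id):
    """Sum every item row of the order; returns 0 when the order has no items."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(quantity * unit_price), 0) "
        "FROM order_items WHERE order_id = ?",
        (order_id,),
    ).fetchone()
    return total

print(order_total(1))
```

The query does not change when an order grows to a fifth or fiftieth item, whereas the repeating-column design would need a schema change.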

Second Normal Form (2NF)

Requirements:

Must be in 1NF
Every non-key attribute must be fully functionally dependent on the whole primary key, not on part of it
No partial dependencies (these can only occur when the key is composite)

Violations and Solutions:

Problem: Partial dependency on composite key

-- BAD: Student course enrollment with partial dependencies
CREATE TABLE student_courses (
    student_id INT,
    course_id INT,
    student_name VARCHAR(100),    -- Depends only on student_id
    student_major VARCHAR(50),    -- Depends only on student_id
    course_title VARCHAR(200),    -- Depends only on course_id
    course_credits INT,           -- Depends only on course_id
    grade CHAR(2),                -- Depends on both student_id AND course_id
    PRIMARY KEY (student_id, course_id)
);

-- GOOD: Separate tables eliminate partial dependencies
CREATE TABLE students (
    student_id INT PRIMARY KEY,
    student_name VARCHAR(100),
    student_major VARCHAR(50)
);

CREATE TABLE courses (
    course_id INT PRIMARY KEY,
    course_title VARCHAR(200),
    course_credits INT
);

CREATE TABLE enrollments (
    student_id INT,
    course_id INT,
    grade CHAR(2),
    enrollment_date DATE,
    PRIMARY KEY (student_id, course_id),
    FOREIGN KEY (student_id) REFERENCES students(student_id),
    FOREIGN KEY (course_id) REFERENCES courses(course_id)
);
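A small sketch of the payoff, assuming Python's sqlite3 module and invented sample rows: changing a student's major touches exactly one row, and every enrollment sees the new value through the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (
        student_id INTEGER PRIMARY KEY,
        student_name VARCHAR(100),
        student_major VARCHAR(50)
    );
    CREATE TABLE courses (
        course_id INTEGER PRIMARY KEY,
        course_title VARCHAR(200),
        course_credits INT
    );
    CREATE TABLE enrollments (
        student_id INT,
        course_id INT,
        grade CHAR(2),
        enrollment_date DATE,
        PRIMARY KEY (student_id, course_id),
        FOREIGN KEY (student_id) REFERENCES students(student_id),
        FOREIGN KEY (course_id) REFERENCES courses(course_id)
    );
    INSERT INTO students VALUES (1, 'Ada', 'Math');
    INSERT INTO courses VALUES (10, 'Databases', 4), (20, 'Algorithms', 3);
    INSERT INTO enrollments VALUES
        (1, 10, 'A', '2024-01-15'),
        (1, 20, 'B+', '2024-01-15');
""")

# One UPDATE on one row, regardless of how many courses the student takes
cursor = conn.execute(
    "UPDATE students SET student_major = 'Computer Science' WHERE student_id = 1"
)
rows_touched = cursor.rowcount

# Every enrollment now reports the same, current major
majors_seen = conn.execute("""
    SELECT DISTINCT s.student_major
    FROM enrollments e
    JOIN students s ON s.student_id = e.student_id
""").fetchall()
print(rows_touched, majors_seen)
```

In the single-table version the same change would have to be applied to one row per enrollment, and missing any of them would leave conflicting majors.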

Third Normal Form (3NF)

Requirements:

Must be in 2NF
No transitive dependencies (non-key attributes should not depend on other non-key attributes)
All non-key attributes must depend directly on the primary key

Violations and Solutions:

Problem: Transitive dependency

-- BAD: Employee table with transitive dependency
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    employee_name VARCHAR(100),
    department_id INT,
    department_name VARCHAR(100),     -- Depends on department_id, so only transitively on employee_id
    department_location VARCHAR(100), -- Transitive dependency through department_id
    department_budget DECIMAL(10,2),  -- Transitive dependency through department_id
    salary DECIMAL(8,2)
);

-- GOOD: Separate department information
CREATE TABLE departments (
    department_id INT PRIMARY KEY,
    department_name VARCHAR(100),
    department_location VARCHAR(100),
    department_budget DECIMAL(10,2)
);

CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    employee_name VARCHAR(100),
    department_id INT,
    salary DECIMAL(8,2),
    FOREIGN KEY (department_id) REFERENCES departments(department_id)
);
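Besides removing duplicated department data, the split also avoids insertion and deletion anomalies. The sketch below, which assumes Python's sqlite3 module and hypothetical sample rows, records a department before it has any staff and keeps another department after its last employee is removed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (
        department_id INTEGER PRIMARY KEY,
        department_name VARCHAR(100),
        department_location VARCHAR(100),
        department_budget DECIMAL(10,2)
    );
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        employee_name VARCHAR(100),
        department_id INT,
        salary DECIMAL(8,2),
        FOREIGN KEY (department_id) REFERENCES departments(department_id)
    );
    -- Insertion: 'Research' exists although nobody works there yet
    INSERT INTO departments VALUES (1, 'Research', 'Berlin', 500000.00);
    INSERT INTO departments VALUES (2, 'Sales', 'Austin', 250000.00);
    INSERT INTO employees VALUES (100, 'Ada', 2, 85000.00);
""")

# Deletion: removing the only Sales employee does not erase the Sales department
conn.execute("DELETE FROM employees WHERE employee_id = 100")

department_names = [
    name
    for (name,) in conn.execute(
        "SELECT department_name FROM departments ORDER BY department_id"
    )
]
print(department_names)
```

In the combined table, a department's name, location, and budget exist only on employee rows, so both operations would either be impossible or lose data.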

Boyce-Codd Normal Form (BCNF)

Requirements:

Must be in 3NF
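Every determinant must be a candidate key (more precisely, the left side of every non-trivial functional dependency must be a superkey)
Stricter than 3NF: it also removes anomalies that 3NF can leave behind, typically in tables with overlapping composite candidate keys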

Violations and Solutions:

Problem: Determinant that's not a candidate key

-- BAD: Student advisor relationship with BCNF violation
-- Assumption: Each student has one advisor per subject,
-- each advisor covers only one subject, but can advise multiple students
CREATE TABLE student_advisor (
    student_id INT,
    subject VARCHAR(50),
    advisor_id INT,
    PRIMARY KEY (student_id, subject)
);
-- Problem: advisor_id determines subject, but advisor_id is not a candidate key

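-- GOOD: Separate the functional dependencies
-- Trade-off: (student_id, subject) -> advisor_id is no longer enforced by a key,
-- so "one advisor per student per subject" must be checked by a trigger or the application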
CREATE TABLE advisors (
    advisor_id INT PRIMARY KEY,
    subject VARCHAR(50)
);

CREATE TABLE student_advisor_assignments (
    student_id INT,
    advisor_id INT,
    PRIMARY KEY (student_id, advisor_id),
    FOREIGN KEY (advisor_id) REFERENCES advisors(advisor_id)
);
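The sketch below checks two properties of the decomposed design: the primary key on `advisors` rejects a second subject for the same advisor, and a join recovers each student's subject. It assumes Python's sqlite3 module and invented sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE advisors (
        advisor_id INTEGER PRIMARY KEY,
        subject VARCHAR(50)
    );
    CREATE TABLE student_advisor_assignments (
        student_id INT,
        advisor_id INT,
        PRIMARY KEY (student_id, advisor_id),
        FOREIGN KEY (advisor_id) REFERENCES advisors(advisor_id)
    );
    INSERT INTO advisors VALUES (7, 'Physics');
    INSERT INTO student_advisor_assignments VALUES (1, 7), (2, 7);
""")

# advisor_id -> subject is now enforced by the primary key of advisors
try:
    conn.execute("INSERT INTO advisors VALUES (7, 'Chemistry')")
    second_subject_rejected = False
except sqlite3.IntegrityError:
    second_subject_rejected = True

# The subject is stored once per advisor and recovered with a join
subjects_by_student = conn.execute("""
    SELECT a.student_id, v.subject
    FROM student_advisor_assignments a
    JOIN advisors v ON v.advisor_id = a.advisor_id
    ORDER BY a.student_id
""").fetchall()
print(second_subject_rejected, subjects_by_student)
```

In the original `student_advisor` table nothing stopped two rows from pairing the same advisor with different subjects.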

Denormalization Strategies

When to Denormalize

Performance Requirements: When query performance is more critical than storage efficiency
Read-Heavy Workloads: When data is read much more frequently than it's updated
Reporting Systems: When complex joins negatively impact reporting performance
Caching Strategies: When pre-computed values eliminate expensive calculations

Common Denormalization Patterns

1. Redundant Storage for Performance

-- Store frequently accessed calculated values
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    order_total DECIMAL(10,2),     -- Denormalized: sum of order_items.total
    item_count INT,                -- Denormalized: count of order_items
    created_at TIMESTAMP
);

CREATE TABLE order_items (
    item_id INT PRIMARY KEY,
    order_id INT,
    product_id INT,
    quantity INT,
    unit_price DECIMAL(8,2),
    total DECIMAL(10,2)            -- quantity * unit_price (denormalized)
);

2. Materialized Aggregates

-- Pre-computed summary tables for reporting
CREATE TABLE monthly_sales_summary (
    year_month VARCHAR(7),         -- '2024-03'
    product_category VARCHAR(50),
    total_sales DECIMAL(12,2),
    total_units INT,
    avg_order_value DECIMAL(8,2),
    unique_customers INT,
    updated_at TIMESTAMP,
    PRIMARY KEY (year_month, product_category)  -- one summary row per month and category
);
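A summary table is only useful if something repopulates it. The sketch below shows one possible full-rebuild refresh using Python's sqlite3 module. The `sales` base table, its columns, and the `refresh_monthly_summary` helper are assumptions for illustration, and the summary table is trimmed to the columns the example can derive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical base data: one row per product sold
    CREATE TABLE sales (
        order_id INT,
        customer_id INT,
        product_category VARCHAR(50),
        quantity INT,
        unit_price DECIMAL(8,2),
        sold_at TIMESTAMP
    );
    CREATE TABLE monthly_sales_summary (
        year_month VARCHAR(7),
        product_category VARCHAR(50),
        total_sales DECIMAL(12,2),
        total_units INT,
        unique_customers INT,
        PRIMARY KEY (year_month, product_category)
    );
    INSERT INTO sales VALUES
        (1, 1, 'books', 2, 10.00, '2024-03-02 09:00:00'),
        (2, 2, 'books', 1, 20.00, '2024-03-15 12:30:00'),
        (3, 1, 'games', 1, 60.00, '2024-03-20 18:45:00'),
        (4, 1, 'books', 3, 10.00, '2024-04-01 08:00:00');
""")

def refresh_monthly_summary():
    """Rebuild the whole summary inside one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM monthly_sales_summary")
        conn.execute("""
            INSERT INTO monthly_sales_summary
            SELECT strftime('%Y-%m', sold_at),
                   product_category,
                   SUM(quantity * unit_price),
                   SUM(quantity),
                   COUNT(DISTINCT customer_id)
            FROM sales
            GROUP BY 1, 2
        """)

refresh_monthly_summary()
summary = conn.execute(
    "SELECT * FROM monthly_sales_summary ORDER BY year_month, product_category"
).fetchall()
print(summary)
```

A full rebuild is the simplest strategy. For large tables an incremental refresh of only the affected months is a common alternative, at the cost of more bookkeeping.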

3. Historical Data Snapshots

-- Store historical state to avoid complex temporal queries
CREATE TABLE customer_status_history (
    id INT PRIMARY KEY,
    customer_id INT,
    status VARCHAR(20),
    tier VARCHAR(10),
    total_lifetime_value DECIMAL(12,2), -- Snapshot at this point in time
    snapshot_date DATE
);

Trade-offs Analysis

Normalization Benefits

Data Integrity: Reduced risk of inconsistent data
Storage Efficiency: Less data duplication
Update Efficiency: Changes need to be made in only one place
Flexibility: Easier to modify schema as requirements change

Normalization Costs

Query Complexity: More joins required for data retrieval
Performance Impact: Joins can be expensive on large datasets
Development Complexity: More complex data access patterns

Denormalization Benefits

Query Performance: Fewer joins, faster queries
Simplified Queries: Direct access to related data
Read Optimization: Optimized for data retrieval patterns
Reduced Load: Less database processing for common operations

Denormalization Costs

Data Redundancy: Increased storage requirements
Update Complexity: Multiple places may need updates
Consistency Risk: Higher risk of data inconsistencies
Maintenance Overhead: Additional code to maintain derived values

Best Practices

1. Start with Full Normalization

Begin with a fully normalized design
Identify performance bottlenecks through testing
Selectively denormalize based on actual performance needs

2. Use Triggers for Consistency

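-- Trigger to maintain denormalized order_total (PostgreSQL 11+ syntax)
CREATE FUNCTION refresh_order_total() RETURNS trigger AS $$
DECLARE
    target_order_id INT;
BEGIN
    -- NEW is not available for DELETE, so fall back to OLD
    IF TG_OP = 'DELETE' THEN
        target_order_id := OLD.order_id;
    ELSE
        target_order_id := NEW.order_id;
    END IF;
    UPDATE orders
    SET order_total = (
        SELECT COALESCE(SUM(quantity * unit_price), 0)  -- 0 when no items remain
        FROM order_items
        WHERE order_id = target_order_id
    )
    WHERE order_id = target_order_id;
    RETURN NULL;  -- the return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER update_order_total
AFTER INSERT OR UPDATE OR DELETE ON order_items
FOR EACH ROW
EXECUTE FUNCTION refresh_order_total();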
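Trigger syntax differs between database systems, so the following is a sketch of the same idea in SQLite, run through Python's sqlite3 module so it can be tried without a server. SQLite needs one trigger per event. The trigger names, the sample data, and the `order_total` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        order_total DECIMAL(10,2) DEFAULT 0
    );
    CREATE TABLE order_items (
        item_id INTEGER PRIMARY KEY,
        order_id INT,
        quantity INT,
        unit_price DECIMAL(8,2)
    );

    -- INSERT: only NEW is available
    CREATE TRIGGER order_items_ai AFTER INSERT ON order_items
    BEGIN
        UPDATE orders SET order_total = (
            SELECT COALESCE(SUM(quantity * unit_price), 0)
            FROM order_items WHERE order_items.order_id = NEW.order_id
        ) WHERE order_id = NEW.order_id;
    END;

    -- DELETE: only OLD is available
    CREATE TRIGGER order_items_ad AFTER DELETE ON order_items
    BEGIN
        UPDATE orders SET order_total = (
            SELECT COALESCE(SUM(quantity * unit_price), 0)
            FROM order_items WHERE order_items.order_id = OLD.order_id
        ) WHERE order_id = OLD.order_id;
    END;

    -- UPDATE: refresh both the old and the new order, in case the item moved
    CREATE TRIGGER order_items_au AFTER UPDATE ON order_items
    BEGIN
        UPDATE orders SET order_total = (
            SELECT COALESCE(SUM(quantity * unit_price), 0)
            FROM order_items WHERE order_items.order_id = orders.order_id
        ) WHERE order_id IN (OLD.order_id, NEW.order_id);
    END;
""")

def order_total(order_id):
    """Read the denormalized total straight from the orders table."""
    return conn.execute(
        "SELECT order_total FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()[0]

conn.execute("INSERT INTO orders (order_id) VALUES (1), (2)")
conn.execute("INSERT INTO order_items VALUES (1, 1, 2, 5.00)")
conn.execute("INSERT INTO order_items VALUES (2, 1, 1, 12.50)")
total_after_insert = order_total(1)

# Move item 2 from order 1 to order 2: both totals must change
conn.execute("UPDATE order_items SET order_id = 2 WHERE item_id = 2")
totals_after_move = (order_total(1), order_total(2))

# Delete the last item of order 1: the total falls back to 0, not NULL
conn.execute("DELETE FROM order_items WHERE item_id = 1")
total_after_delete = order_total(1)
print(total_after_insert, totals_after_move, total_after_delete)
```

The two cases that are easy to miss are deletes, where no NEW row exists, and updates that move an item to another order, where both orders need a fresh total.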

3. Consider Materialized Views

-- Materialized view for complex aggregations
CREATE MATERIALIZED VIEW customer_summary AS
SELECT
    c.id AS customer_id,                  -- customers uses id and name as column names
    c.name AS customer_name,
    COUNT(o.order_id) AS order_count,
    COALESCE(SUM(o.order_total), 0) AS lifetime_value,  -- 0 for customers with no orders
    AVG(o.order_total) AS avg_order_value,
    MAX(o.created_at) AS last_order_date
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.id, c.name;
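A materialized view holds a stored result, so it is stale until it is refreshed (in PostgreSQL, with REFRESH MATERIALIZED VIEW). SQLite has no materialized views, so the sketch below imitates one with a snapshot table rebuilt on demand, purely to show the staleness behaviour. The `refresh_customer_summary` and `lifetime_value` helpers and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name VARCHAR(100));
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INT,
        order_total DECIMAL(10,2)
    );
    INSERT INTO customers VALUES (1, 'Ada');
    INSERT INTO orders VALUES (1, 1, 40.00);
""")

def refresh_customer_summary():
    """Stand-in for a materialized view refresh: drop and rebuild the snapshot."""
    with conn:
        conn.execute("DROP TABLE IF EXISTS customer_summary")
        conn.execute("""
            CREATE TABLE customer_summary AS
            SELECT c.id AS customer_id,
                   COUNT(o.order_id) AS order_count,
                   COALESCE(SUM(o.order_total), 0) AS lifetime_value
            FROM customers c
            LEFT JOIN orders o ON o.customer_id = c.id
            GROUP BY c.id
        """)

def lifetime_value(customer_id):
    """Read from the snapshot, not from the base tables."""
    (value,) = conn.execute(
        "SELECT lifetime_value FROM customer_summary WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return value

refresh_customer_summary()
conn.execute("INSERT INTO orders VALUES (2, 1, 10.00)")
stale_value = lifetime_value(1)   # snapshot still reflects the first order only

refresh_customer_summary()
fresh_value = lifetime_value(1)   # new order is included after the refresh
print(stale_value, fresh_value)
```

How often to refresh is a design decision: the interval bounds how out of date a report can be.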

4. Document Denormalization Decisions

Clearly document why denormalization was chosen
Specify which data is derived and how it's maintained
Include performance benchmarks that justify the decision

5. Monitor and Validate

Implement validation checks for denormalized data
Run regular audits to ensure data consistency
Monitor performance to confirm that denormalization still delivers its intended benefit
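One possible validation check is a query that recomputes each derived value and reports rows where the stored copy disagrees. The sketch below assumes Python's sqlite3 module, a cut-down version of the orders and order_items tables, and sample data in which order 2 is deliberately stale. `find_drifted_orders` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        order_total DECIMAL(10,2),
        item_count INT
    );
    CREATE TABLE order_items (
        item_id INTEGER PRIMARY KEY,
        order_id INT,
        quantity INT,
        unit_price DECIMAL(8,2)
    );
    INSERT INTO orders VALUES (1, 22.50, 2), (2, 99.00, 1);  -- order 2 is stale
    INSERT INTO order_items VALUES
        (1, 1, 2, 5.00),
        (2, 1, 1, 12.50),
        (3, 2, 1, 9.00);
""")

def find_drifted_orders():
    """Return (order_id, stored_total, actual_total) for every inconsistent order."""
    return conn.execute("""
        SELECT o.order_id,
               o.order_total,
               COALESCE(SUM(i.quantity * i.unit_price), 0) AS actual_total
        FROM orders o
        LEFT JOIN order_items i ON i.order_id = o.order_id
        GROUP BY o.order_id, o.order_total, o.item_count
        HAVING o.order_total <> COALESCE(SUM(i.quantity * i.unit_price), 0)
            OR o.item_count <> COUNT(i.item_id)
        ORDER BY o.order_id
    """).fetchall()

drifted = find_drifted_orders()
print(drifted)
```

A check like this can run as a scheduled audit. With floating-point or rounded money columns, comparing within a small tolerance rather than with `<>` may be more appropriate.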

Common Anti-Patterns

1. Premature Denormalization

Starting with a denormalized design without understanding the actual performance requirements.

2. Over-Normalization

Creating too many small tables that require excessive joins for simple queries.

3. Inconsistent Approach

Mixing normalized and denormalized patterns without a clear strategy.

4. Ignoring Maintenance

Denormalizing without proper mechanisms to maintain data consistency.

Conclusion

Normalization and denormalization are both valuable tools in database design. The key is understanding when to apply each approach:

Use normalization for transactional systems where data integrity is paramount
Consider denormalization for analytical systems or when performance testing reveals bottlenecks
Apply selectively based on actual usage patterns and performance requirements
Maintain consistency through proper design patterns and validation mechanisms

The goal is not to achieve perfect normalization or denormalization, but to create a design that best serves your application's specific needs while maintaining data quality and system performance.
