Database Design Principles: From Basics to Advanced Patterns
Database Design Principles: From Basics to Advanced Patterns
Database design is the foundation of every successful application. A well-designed database ensures data integrity, optimizes performance, and scales gracefully as your application grows. Poor design decisions made early can haunt projects for years, leading to performance bottlenecks, data inconsistencies, and expensive refactoring efforts.
This comprehensive guide takes you from database design fundamentals through advanced patterns used in production systems. Whether you’re designing your first database or architecting a distributed system handling millions of transactions, understanding these principles is essential for building robust, scalable applications.
Table of Contents
- Database Design Fundamentals
- Normalization Theory
- Entity-Relationship Modeling
- Indexing Strategies
- Strategic Denormalization
- SQL vs NoSQL Design Patterns
- Scalability Considerations
- Performance Optimization
- Common Design Mistakes
- Real-World Design Examples
Database Design Fundamentals
Database design begins with understanding your data, its relationships, and how it will be accessed. The core principles remain consistent whether you’re working with relational or non-relational systems.
Understanding Data Requirements
Before creating any tables or collections, invest time in requirements gathering. Identify:
- Entities: What objects or concepts does your system need to represent?
- Attributes: What information do you need to store about each entity?
- Relationships: How do entities connect to each other?
- Access patterns: How will data be queried and updated?
- Growth projections: What volume of data will you handle in 1, 5, or 10 years?
According to PostgreSQL’s database design documentation, understanding your access patterns before schema design prevents costly restructuring later.
ACID Properties and Data Integrity
Relational databases enforce ACID properties to maintain data consistency:
- Atomicity: Transactions complete fully or not at all
- Consistency: Data moves from one valid state to another
- Isolation: Concurrent transactions don’t interfere
- Durability: Committed data persists even after system failure
Design your schema to leverage these guarantees. Use constraints, foreign keys, and transactions to maintain integrity rather than relying solely on application logic.
Schema Design Principles
Strong database design follows these foundational principles:
- Single Source of Truth: Store each piece of data in exactly one place
- Minimize Redundancy: Avoid duplicating data across tables (with strategic exceptions)
- Enforce Constraints: Use database-level constraints for data validation
- Plan for Change: Design schemas that accommodate future requirements
- Document Decisions: Maintain clear documentation of schema choices and trade-offs
MySQL’s data modeling guidelines emphasize choosing appropriate data types for efficiency and accuracy.
Normalization Theory
Normalization is the systematic process of organizing data to reduce redundancy and improve data integrity. Understanding normalization forms is essential for relational database design.
First Normal Form (1NF)
A table is in 1NF when:
- Each column contains atomic (indivisible) values
- Each column contains values of a single type
- Each column has a unique name
- The order of rows doesn’t matter
Example violation:
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_name VARCHAR(100),
products VARCHAR(500) -- "Laptop, Mouse, Keyboard" - NOT ATOMIC
);
1NF compliant:
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_name VARCHAR(100)
);
CREATE TABLE order_items (
order_id INT,
product_name VARCHAR(100),
PRIMARY KEY (order_id, product_name),
FOREIGN KEY (order_id) REFERENCES orders(order_id)
);
Second Normal Form (2NF)
A table is in 2NF when:
- It’s in 1NF
- All non-key attributes are fully dependent on the entire primary key
This applies to tables with composite primary keys. No non-key column should depend on only part of the key.
Example violation:
CREATE TABLE order_items (
order_id INT,
product_id INT,
product_name VARCHAR(100), -- Depends only on product_id
quantity INT,
PRIMARY KEY (order_id, product_id)
);
2NF compliant:
CREATE TABLE products (
product_id INT PRIMARY KEY,
product_name VARCHAR(100)
);
CREATE TABLE order_items (
order_id INT,
product_id INT,
quantity INT,
PRIMARY KEY (order_id, product_id),
FOREIGN KEY (product_id) REFERENCES products(product_id)
);
Third Normal Form (3NF)
A table is in 3NF when:
- It’s in 2NF
- No non-key column depends on another non-key column (no transitive dependencies)
Example violation:
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
employee_name VARCHAR(100),
department_id INT,
department_name VARCHAR(100), -- Depends on department_id, not employee_id
department_location VARCHAR(100) -- Transitive dependency
);
3NF compliant:
CREATE TABLE departments (
department_id INT PRIMARY KEY,
department_name VARCHAR(100),
department_location VARCHAR(100)
);
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
employee_name VARCHAR(100),
department_id INT,
FOREIGN KEY (department_id) REFERENCES departments(department_id)
);
Boyce-Codd Normal Form (BCNF)
A stricter version of 3NF. A table is in BCNF when every determinant is a candidate key. BCNF eliminates all anomalies that 3NF misses in rare edge cases involving overlapping candidate keys.
For detailed normalization theory, consult Database Normalization Explained and academic resources on relational theory.
Fourth and Fifth Normal Form (4NF and 5NF)
4NF addresses multi-valued dependencies, ensuring that independent facts are stored in separate tables.
5NF (Project-Join Normal Form) handles cases where information can be reconstructed from smaller tables, eliminating all redundancy.
In practice, 3NF or BCNF suffices for most applications. Higher normal forms apply to complex scenarios with specific multi-valued or join dependencies.
Entity-Relationship Modeling
Entity-Relationship (ER) modeling provides a visual approach to database design, helping you conceptualize entities, attributes, and relationships before implementation.
Core ER Components
Entities represent objects or concepts in your domain:
- Strong entities exist independently (e.g., Customer, Product)
- Weak entities depend on strong entities (e.g., OrderItem depends on Order)
Attributes describe entity properties:
- Simple vs composite (Address vs Street/City/ZIP)
- Single-valued vs multi-valued
- Derived attributes calculated from others
Relationships connect entities:
- One-to-One (1:1): Employee to ParkingSpace
- One-to-Many (1:N): Customer to Orders
- Many-to-Many (M:N): Students to Courses
Cardinality and Participation
Cardinality specifies how many instances participate:
- Minimum cardinality (0 = optional, 1 = mandatory)
- Maximum cardinality (1, N)
Example: A customer can have zero or many orders (0:N), but each order must belong to exactly one customer (1:1).
Translating ER Models to Schema
One-to-Many relationships: Add foreign key to the “many” side
CREATE TABLE customers (customer_id INT PRIMARY KEY);
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT NOT NULL,
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
Many-to-Many relationships: Create junction table
CREATE TABLE students (student_id INT PRIMARY KEY);
CREATE TABLE courses (course_id INT PRIMARY KEY);
CREATE TABLE enrollments (
student_id INT,
course_id INT,
enrollment_date DATE,
PRIMARY KEY (student_id, course_id)
);
Tools like Lucidchart and dbdiagram.io help visualize ER models before implementation.
Indexing Strategies
Indexes are critical for query performance but come with trade-offs. Strategic indexing can make queries thousands of times faster, while excessive indexing slows writes and wastes storage.
How Indexes Work
Most databases use B-tree or B+ tree indexes that organize data in a sorted, hierarchical structure. This reduces search complexity from O(n) to O(log n).
Index structure example (simplified):
[50]
/ \
[25] [75]
/ \ / \
[10][35] [60][85]
Searching for value 60 requires checking only 3 nodes instead of scanning all rows.
Index Types
Primary Key Indexes: Automatically created, enforce uniqueness and provide fastest access
Unique Indexes: Enforce uniqueness on non-primary columns
CREATE UNIQUE INDEX idx_email ON users(email);
Composite Indexes: Cover multiple columns
CREATE INDEX idx_name_date ON orders(customer_id, order_date);
Composite indexes follow the leftmost prefix rule. An index on (A, B, C) accelerates queries filtering on:
- A
- A and B
- A, B, and C
But NOT queries filtering only B or C.
Covering Indexes: Include all columns needed by a query
CREATE INDEX idx_covering ON orders(customer_id, order_date)
INCLUDE (order_total, status);
Partial Indexes: Index only rows meeting specific criteria
CREATE INDEX idx_active_users ON users(email) WHERE active = true;
Full-Text Indexes: Optimize text search operations
CREATE FULLTEXT INDEX idx_content ON articles(title, body);
PostgreSQL indexing documentation provides deep insights into index types and their use cases.
When to Create Indexes
Create indexes for columns that:
- Appear frequently in WHERE clauses
- Join columns between tables
- ORDER BY and GROUP BY columns
- Columns with high cardinality (many distinct values)
Avoid indexing:
- Small tables (sequential scans are faster)
- Columns with low cardinality (boolean, status fields with few values)
- Frequently updated columns in high-write tables
- Columns rarely used in queries
Index Maintenance
Indexes require maintenance:
- Rebuild fragmented indexes periodically
- Update statistics so the query optimizer makes good decisions
- Monitor unused indexes and remove them
-- PostgreSQL: Find unused indexes
SELECT schemaname, tablename, indexname
FROM pg_stat_user_indexes
WHERE idx_scan = 0;
Use tools like pgAdmin for PostgreSQL or MySQL Workbench to monitor index performance.
Strategic Denormalization
While normalization reduces redundancy, denormalization strategically introduces redundancy to improve query performance. The key is knowing when and how to denormalize.
When to Denormalize
Consider denormalization when:
- Query performance is critical and joins are expensive
- Read-to-write ratio is very high (data rarely changes)
- Complex aggregations run frequently
- Data is naturally duplicate across systems
- Historical snapshots need preservation
Denormalization Patterns
Storing Calculated Values
-- Instead of calculating order total each time
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_total DECIMAL(10,2), -- Denormalized
order_date DATE
);
-- Update via trigger or application logic
CREATE TRIGGER update_order_total
AFTER INSERT OR UPDATE ON order_items
FOR EACH ROW
EXECUTE FUNCTION calculate_order_total();
Duplicating Frequently Joined Data
-- Store customer name in orders to avoid join
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
customer_name VARCHAR(100), -- Denormalized from customers table
order_date DATE
);
Materialized Views
-- PostgreSQL materialized view for expensive aggregation
CREATE MATERIALIZED VIEW monthly_sales AS
SELECT
DATE_TRUNC('month', order_date) as month,
SUM(order_total) as total_sales,
COUNT(*) as order_count
FROM orders
GROUP BY DATE_TRUNC('month', order_date);
-- Refresh periodically
REFRESH MATERIALIZED VIEW monthly_sales;
Snapshot Tables
-- Preserve historical state
CREATE TABLE order_snapshots (
snapshot_id INT PRIMARY KEY,
order_id INT,
snapshot_date DATE,
customer_name VARCHAR(100), -- Historical value
product_names TEXT,
order_total DECIMAL(10,2)
);
Managing Denormalized Data
Denormalization creates consistency challenges:
- Use triggers to keep denormalized data synchronized
- Implement caching layers instead of database denormalization
- Document clearly what data is denormalized and why
- Monitor data drift between original and denormalized copies
- Consider eventual consistency in distributed systems
High Performance MySQL discusses advanced denormalization strategies for performance optimization.
SQL vs NoSQL Design Patterns
Database design varies dramatically between SQL and NoSQL systems. Understanding when to use each paradigm is crucial.
Relational (SQL) Database Design
Strengths:
- Strong consistency and ACID guarantees
- Complex querying with JOIN operations
- Mature ecosystem and tooling
- Standardized query language
Ideal for:
- Financial transactions
- Complex reporting requirements
- Data with clear relationships
- Applications requiring strict consistency
Design patterns:
- Normalize to 3NF by default
- Use foreign keys to enforce relationships
- Leverage transactions for multi-step operations
- Create views for complex queries
Popular systems: PostgreSQL, MySQL, Microsoft SQL Server, Oracle
Document Database Design (MongoDB, CouchDB)
Strengths:
- Flexible schema for evolving data models
- Natural representation of object hierarchies
- Horizontal scaling built-in
- Fast reads for document retrieval
Design patterns:
- Embed related data in documents
{ "_id": "order123", "customer": { "name": "John Doe", "email": "john@example.com" }, "items": [ {"product": "Laptop", "quantity": 1, "price": 999.99}, {"product": "Mouse", "quantity": 2, "price": 29.99} ], "total": 1059.97 } - Reference when data changes frequently
{ "_id": "order123", "customer_id": "cust456", // Reference instead of embedding "items": [...] } - Denormalize for read performance
- Use arrays for one-to-many relationships
MongoDB’s data modeling guide offers comprehensive design patterns.
Key-Value Store Design (Redis, DynamoDB)
Strengths:
- Extremely fast lookups
- Simple data model
- Excellent for caching
- High scalability
Design patterns:
- Use meaningful key naming conventions (user:1234:profile)
- Store serialized complex objects as values
- Leverage TTL for automatic expiration
- Combine with relational databases (Redis as cache layer)
Ideal for:
- Session storage
- Caching layer
- Real-time leaderboards
- Rate limiting
Column-Family Databases (Cassandra, HBase)
Strengths:
- Massive scalability
- Optimized for write-heavy workloads
- Time-series data
- Distributed architecture
Design patterns:
- Design tables based on queries (query-first design)
- Denormalize extensively
- Duplicate data across tables for different access patterns
- Use partition keys carefully for even distribution
Graph Databases (Neo4j, Amazon Neptune)
Strengths:
- Complex relationship queries
- Pattern matching
- Recommendation engines
- Social networks
Design patterns:
- Model entities as nodes
- Model relationships as edges with properties
- Index nodes by frequently queried properties
- Use graph algorithms for traversal
Amazon’s NoSQL design guide covers design considerations for various NoSQL patterns.
Scalability Considerations
Designing for scalability from the start prevents painful migrations later. As data volume and traffic grow, your database architecture must evolve.
Vertical vs Horizontal Scaling
Vertical Scaling (Scale Up):
- Add more CPU, RAM, or storage to existing server
- Simpler to implement
- Limited by hardware constraints
- Single point of failure
Horizontal Scaling (Scale Out):
- Add more database servers
- Better fault tolerance
- Nearly unlimited scaling potential
- Complex implementation
Database Replication
Replication creates copies of your database across multiple servers for redundancy and load distribution.
Primary-Replica (Master-Slave) Replication:
[Primary DB] --writes--> [Replica 1]
| [Replica 2]
| [Replica 3]
+---reads---------> (load balanced)
- All writes go to primary
- Reads distributed across replicas
- Async replication creates slight lag
- Increases read capacity
Multi-Primary (Master-Master) Replication:
- Multiple nodes accept writes
- Conflict resolution required
- Higher complexity
- Better write scaling
PostgreSQL replication documentation details various replication configurations.
Database Sharding
Sharding distributes data across multiple databases based on a shard key.
Horizontal Sharding (by rows):
Users 1-1000 -> Shard 1
Users 1001-2000 -> Shard 2
Users 2001-3000 -> Shard 3
Vertical Sharding (by columns/tables):
User profile data -> Shard A
User transaction data -> Shard B
User analytics data -> Shard C
Sharding strategies:
- Range-based: Shard by ranges (user_id 1-1000, 1001-2000)
- Simple implementation
- Risk of uneven distribution
- Hash-based: Apply hash function to shard key
- Even distribution
- Adding shards requires rebalancing
- Geographic: Shard by location
- Reduces latency
- Regulatory compliance (data residency)
- Directory-based: Lookup table maps keys to shards
- Flexible
- Lookup table becomes bottleneck
Sharding challenges:
- Cross-shard queries are expensive
- Distributed transactions are complex
- Rebalancing data when adding shards
- Application-level routing logic
Vitess provides sharding middleware for MySQL databases.
Partitioning
Partitioning divides a large table into smaller pieces within a single database.
Range partitioning:
CREATE TABLE orders (
order_id INT,
order_date DATE,
customer_id INT
) PARTITION BY RANGE (YEAR(order_date)) (
PARTITION p2023 VALUES LESS THAN (2024),
PARTITION p2024 VALUES LESS THAN (2025),
PARTITION p2025 VALUES LESS THAN (2026)
);
List partitioning:
PARTITION BY LIST (region) (
PARTITION p_north VALUES IN ('NY', 'MA', 'CT'),
PARTITION p_south VALUES IN ('TX', 'FL', 'GA'),
PARTITION p_west VALUES IN ('CA', 'WA', 'OR')
);
Hash partitioning:
PARTITION BY HASH (customer_id) PARTITIONS 4;
Benefits:
- Query performance (partition pruning)
- Easier maintenance (drop old partitions)
- Improved manageability
Caching Strategies
Caching reduces database load by storing frequently accessed data in memory.
Application-level caching:
# Redis caching example
def get_user(user_id):
cache_key = f"user:{user_id}"
cached = redis.get(cache_key)
if cached:
return json.loads(cached)
user = database.query("SELECT * FROM users WHERE id = ?", user_id)
redis.setex(cache_key, 3600, json.dumps(user)) # Cache for 1 hour
return user
Query result caching:
- MySQL query cache (deprecated in 8.0)
- Application-level result caching
- Materialized views
Cache invalidation strategies:
- TTL (Time To Live): Expire after time period
- Write-through: Update cache on every write
- Write-behind: Async cache updates
- Cache-aside: Application manages cache
Redis documentation covers caching patterns comprehensively.
Performance Optimization
Database performance optimization requires understanding query execution, identifying bottlenecks, and applying targeted improvements.
Query Optimization
Use EXPLAIN to analyze queries:
EXPLAIN ANALYZE
SELECT o.order_id, c.customer_name, SUM(oi.quantity * p.price)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE o.order_date >= '2025-01-01'
GROUP BY o.order_id, c.customer_name;
Look for:
- Sequential scans on large tables (add indexes)
- High estimated costs
- Nested loop joins (may need better indexes)
- Large row counts filtered late in execution
Optimization techniques:
- Select only needed columns: Avoid SELECT * ```sql – Bad SELECT * FROM users WHERE id = 123;
– Good SELECT user_id, username, email FROM users WHERE id = 123;
2. **Use appropriate JOIN types**:
```sql
-- INNER JOIN when you need matching rows only
SELECT * FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;
-- LEFT JOIN when you need all rows from left table
SELECT * FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id;
- Filter early with WHERE instead of HAVING: ```sql – Bad: HAVING filters after aggregation SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id HAVING customer_id = 123;
– Good: WHERE filters before aggregation SELECT customer_id, COUNT(*) FROM orders WHERE customer_id = 123 GROUP BY customer_id;
4. **Use EXISTS instead of COUNT for existence checks**:
```sql
-- Bad
SELECT CASE WHEN COUNT(*) > 0 THEN 1 ELSE 0 END
FROM orders WHERE customer_id = 123;
-- Good
SELECT CASE WHEN EXISTS(SELECT 1 FROM orders WHERE customer_id = 123)
THEN 1 ELSE 0 END;
- Batch operations instead of row-by-row: ```sql – Bad: Multiple single inserts INSERT INTO users (name) VALUES (‘Alice’); INSERT INTO users (name) VALUES (‘Bob’); INSERT INTO users (name) VALUES (‘Charlie’);
– Good: Batch insert INSERT INTO users (name) VALUES (‘Alice’), (‘Bob’), (‘Charlie’);
[Use The Index, Luke](https://use-the-index-luke.com/) is an excellent resource for SQL query optimization.
### Connection Pooling
Opening database connections is expensive. Connection pooling reuses connections across requests.
```python
# Example with psycopg2 connection pool
from psycopg2 import pool
connection_pool = pool.SimpleConnectionPool(
minconn=5,
maxconn=20,
host='localhost',
database='mydb'
)
# Get connection from pool
conn = connection_pool.getconn()
# Use connection
cursor = conn.cursor()
cursor.execute("SELECT * FROM users")
# Return to pool
connection_pool.putconn(conn)
Configure pool size based on:
- Application concurrency
- Database connection limits
- Hardware resources
Database Configuration Tuning
PostgreSQL configuration examples:
# Memory settings
shared_buffers = 4GB # 25% of RAM
effective_cache_size = 12GB # 50-75% of RAM
work_mem = 50MB # Per operation memory
# Checkpoint settings
checkpoint_completion_target = 0.9
wal_buffers = 16MB
# Query planner
random_page_cost = 1.1 # For SSD storage
effective_io_concurrency = 200
MySQL/InnoDB tuning:
innodb_buffer_pool_size = 4G # 70-80% of RAM
innodb_log_file_size = 512M
innodb_flush_log_at_trx_commit = 2
max_connections = 200
PostgreSQL performance tuning provides detailed configuration guidance.
Monitoring and Profiling
Continuous monitoring identifies performance degradation before it impacts users.
Key metrics to monitor:
- Query response time (p50, p95, p99 percentiles)
- Queries per second (QPS)
- Connection count
- Cache hit ratio
- Lock wait time
- Disk I/O
- Replication lag
Tools:
- pgBadger - PostgreSQL log analyzer
- Percona Monitoring and Management - MySQL/PostgreSQL monitoring
- DataDog, New Relic - Application performance monitoring
- Database-specific tools (pg_stat_statements, MySQL slow query log)
Slow query logging:
-- PostgreSQL: Enable slow query logging
ALTER SYSTEM SET log_min_duration_statement = 1000; -- Log queries > 1 second
SELECT pg_reload_conf();
-- MySQL: Enable slow query log
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;
Common Design Mistakes
Learning from common mistakes helps you avoid them in your own designs.
Mistake 1: Using GUIDs as Primary Keys Everywhere
Problem: GUIDs (UUIDs) create large indexes and reduce cache efficiency.
-- Inefficient for high-volume tables
CREATE TABLE orders (
order_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
customer_id UUID,
order_date TIMESTAMP
);
Better approach: Use auto-incrementing integers for internal references, GUIDs only when needed for external APIs or distributed systems.
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
public_id UUID UNIQUE DEFAULT gen_random_uuid(), -- For API exposure
customer_id INT,
order_date TIMESTAMP
);
Mistake 2: Not Using Foreign Keys
Problem: Orphaned records and data inconsistencies.
-- No referential integrity
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT -- No guarantee customer exists
);
Solution: Enforce relationships with foreign keys.
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT NOT NULL,
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
ON DELETE CASCADE
ON UPDATE CASCADE
);
Mistake 3: Generic JSON/BLOB Columns
Problem: Unqueryable data, no type safety, no indexing.
-- Anti-pattern
CREATE TABLE entities (
id INT PRIMARY KEY,
data JSON -- Everything stuffed into JSON
);
Better: Structure data properly, use JSON only for truly dynamic attributes.
CREATE TABLE products (
product_id INT PRIMARY KEY,
name VARCHAR(200),
price DECIMAL(10,2),
category_id INT,
metadata JSON -- Only for optional, dynamic attributes
);
Mistake 4: Premature Optimization
Problem: Denormalizing before you have performance data.
Don’t denormalize based on assumptions. Start with a normalized design, measure performance, then optimize specific bottlenecks.
Mistake 5: Insufficient Indexing or Over-Indexing
Too few indexes: Slow queries Too many indexes: Slow writes, wasted storage
Solution: Index based on actual query patterns, monitor index usage, remove unused indexes.
Mistake 6: Ignoring NULL Semantics
Problem: NULL behaves differently than other values.
-- This won't work as expected
SELECT * FROM users WHERE email != 'test@example.com';
-- Excludes rows where email IS NULL
Solution: Handle NULLs explicitly.
SELECT * FROM users
WHERE email != 'test@example.com' OR email IS NULL;
Mistake 7: Not Planning for Time Zones
Problem: Storing local time without timezone information.
-- Ambiguous: What timezone?
CREATE TABLE events (
event_id INT PRIMARY KEY,
event_time TIMESTAMP
);
Solution: Store UTC, convert to local for display.
CREATE TABLE events (
event_id INT PRIMARY KEY,
event_time TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
Mistake 8: Poor Naming Conventions
Use clear, consistent naming:
- Tables: plural nouns (users, orders, products)
- Columns: descriptive names (created_at, not created)
- Avoid reserved words
- Use snake_case or camelCase consistently
Mistake 9: No Soft Delete Strategy
Hard deleting data makes recovery impossible. Consider soft deletes for critical data:
CREATE TABLE users (
user_id INT PRIMARY KEY,
email VARCHAR(255),
deleted_at TIMESTAMP NULL,
INDEX idx_active_users (email) WHERE deleted_at IS NULL
);
Mistake 10: Not Versioning Schema Changes
Use migration tools to track schema evolution:
Real-World Design Examples
Let’s apply these principles to realistic scenarios.
Example 1: E-Commerce Database
Requirements:
- Customers place orders containing multiple products
- Products belong to categories
- Track inventory
- Support reviews and ratings
- Handle multiple shipping addresses per customer
Schema design:
-- Customers
CREATE TABLE customers (
customer_id SERIAL PRIMARY KEY,
email VARCHAR(255) UNIQUE NOT NULL,
first_name VARCHAR(100),
last_name VARCHAR(100),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Shipping addresses
CREATE TABLE shipping_addresses (
address_id SERIAL PRIMARY KEY,
customer_id INT NOT NULL,
street VARCHAR(255),
city VARCHAR(100),
state VARCHAR(50),
zip_code VARCHAR(20),
country VARCHAR(50),
is_default BOOLEAN DEFAULT FALSE,
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
-- Categories
CREATE TABLE categories (
category_id SERIAL PRIMARY KEY,
category_name VARCHAR(100) NOT NULL,
parent_category_id INT,
FOREIGN KEY (parent_category_id) REFERENCES categories(category_id)
);
-- Products
CREATE TABLE products (
product_id SERIAL PRIMARY KEY,
product_name VARCHAR(200) NOT NULL,
description TEXT,
category_id INT,
price DECIMAL(10,2) NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (category_id) REFERENCES categories(category_id),
INDEX idx_category (category_id),
INDEX idx_price (price)
);
-- Inventory
CREATE TABLE inventory (
product_id INT PRIMARY KEY,
quantity INT NOT NULL DEFAULT 0,
reserved_quantity INT NOT NULL DEFAULT 0,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (product_id) REFERENCES products(product_id),
CONSTRAINT check_quantity CHECK (quantity >= 0),
CONSTRAINT check_reserved CHECK (reserved_quantity >= 0 AND reserved_quantity <= quantity)
);
-- Orders
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
customer_id INT NOT NULL,
shipping_address_id INT NOT NULL,
order_status VARCHAR(50) NOT NULL,
order_total DECIMAL(10,2) NOT NULL, -- Denormalized for performance
order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
shipped_date TIMESTAMP,
FOREIGN KEY (customer_id) REFERENCES customers(customer_id),
FOREIGN KEY (shipping_address_id) REFERENCES shipping_addresses(address_id),
INDEX idx_customer_date (customer_id, order_date),
INDEX idx_status (order_status)
);
-- Order items
CREATE TABLE order_items (
order_id INT,
product_id INT,
quantity INT NOT NULL,
unit_price DECIMAL(10,2) NOT NULL, -- Price at time of order
subtotal DECIMAL(10,2) NOT NULL, -- quantity * unit_price
PRIMARY KEY (order_id, product_id),
FOREIGN KEY (order_id) REFERENCES orders(order_id),
FOREIGN KEY (product_id) REFERENCES products(product_id)
);
-- Reviews
CREATE TABLE reviews (
review_id SERIAL PRIMARY KEY,
product_id INT NOT NULL,
customer_id INT NOT NULL,
rating INT NOT NULL,
review_text TEXT,
review_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (product_id) REFERENCES products(product_id),
FOREIGN KEY (customer_id) REFERENCES customers(customer_id),
CONSTRAINT check_rating CHECK (rating BETWEEN 1 AND 5),
UNIQUE (product_id, customer_id) -- One review per customer per product
);
-- Materialized view for product statistics
CREATE MATERIALIZED VIEW product_stats AS
SELECT
p.product_id,
p.product_name,
COALESCE(AVG(r.rating), 0) as avg_rating,
COUNT(r.review_id) as review_count,
COALESCE(SUM(oi.quantity), 0) as total_sold
FROM products p
LEFT JOIN reviews r ON p.product_id = r.product_id
LEFT JOIN order_items oi ON p.product_id = oi.product_id
GROUP BY p.product_id, p.product_name;
-- Refresh periodically or on-demand
CREATE UNIQUE INDEX ON product_stats (product_id);
Design decisions:
- Normalized to 3NF for data integrity
- Denormalized order_total and subtotal for query performance
- Store unit_price in order_items (historical price at time of order)
- Materialized view for expensive aggregations (ratings, sales)
- Indexes on foreign keys and common query columns
- Constraints ensure data validity
Example 2: Social Media Platform
Requirements:
- Users create posts and comment on posts
- Follow relationships between users
- Like posts and comments
- Tag users in posts
- Real-time feed generation
Schema design (optimized for read-heavy workload):
-- Users
CREATE TABLE users (
user_id SERIAL PRIMARY KEY,
username VARCHAR(50) UNIQUE NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL,
bio TEXT,
follower_count INT DEFAULT 0, -- Denormalized for performance
following_count INT DEFAULT 0,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_username (username)
);
-- Followers (many-to-many)
CREATE TABLE followers (
follower_id INT,
following_id INT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (follower_id, following_id),
FOREIGN KEY (follower_id) REFERENCES users(user_id),
FOREIGN KEY (following_id) REFERENCES users(user_id),
INDEX idx_following (following_id) -- For "who follows this user"
);
-- Posts
CREATE TABLE posts (
post_id SERIAL PRIMARY KEY,
user_id INT NOT NULL,
content TEXT NOT NULL,
like_count INT DEFAULT 0, -- Denormalized
comment_count INT DEFAULT 0, -- Denormalized
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (user_id) REFERENCES users(user_id),
INDEX idx_user_date (user_id, created_at DESC)
);
-- Comments
CREATE TABLE comments (
comment_id SERIAL PRIMARY KEY,
post_id INT NOT NULL,
user_id INT NOT NULL,
parent_comment_id INT, -- For nested comments
content TEXT NOT NULL,
like_count INT DEFAULT 0,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (post_id) REFERENCES posts(post_id),
FOREIGN KEY (user_id) REFERENCES users(user_id),
FOREIGN KEY (parent_comment_id) REFERENCES comments(comment_id),
INDEX idx_post_date (post_id, created_at)
);
-- Likes (separate tables for scalability)
CREATE TABLE post_likes (
post_id INT,
user_id INT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (post_id, user_id),
FOREIGN KEY (post_id) REFERENCES posts(post_id),
FOREIGN KEY (user_id) REFERENCES users(user_id),
INDEX idx_user_likes (user_id)
);
CREATE TABLE comment_likes (
comment_id INT,
user_id INT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (comment_id, user_id),
FOREIGN KEY (comment_id) REFERENCES comments(comment_id),
FOREIGN KEY (user_id) REFERENCES users(user_id)
);
-- User mentions
CREATE TABLE post_mentions (
post_id INT,
mentioned_user_id INT,
PRIMARY KEY (post_id, mentioned_user_id),
FOREIGN KEY (post_id) REFERENCES posts(post_id),
FOREIGN KEY (mentioned_user_id) REFERENCES users(user_id),
INDEX idx_mentioned_user (mentioned_user_id)
);
-- Feed cache (Redis or separate table for performance)
-- Store pre-computed feeds for active users
CREATE TABLE feed_cache (
user_id INT,
post_id INT,
score DECIMAL(10,2), -- For ranking algorithm
created_at TIMESTAMP,
PRIMARY KEY (user_id, post_id),
INDEX idx_user_score (user_id, score DESC)
);
Design decisions:
- Denormalized counts (follower_count, like_count) updated via triggers or async jobs
- Separate like tables prevent massive joins
- Feed cache table or Redis for pre-computed feeds
- Indexes optimized for timeline queries (user_id, created_at DESC)
- Parent_comment_id enables nested comment threads
- Partitioning posts by date range for large-scale deployment
Example 3: Analytics Platform (Time-Series Data)
For event tracking and analytics, use time-series optimized design:
-- Events table (partitioned by date)
CREATE TABLE events (
event_id BIGSERIAL,
user_id INT,
event_type VARCHAR(50),
event_data JSONB,
event_time TIMESTAMP NOT NULL,
session_id UUID,
PRIMARY KEY (event_time, event_id)
) PARTITION BY RANGE (event_time);
-- Create monthly partitions
CREATE TABLE events_2025_01 PARTITION OF events
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE events_2025_02 PARTITION OF events
FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
-- Indexes on partitions
CREATE INDEX idx_events_2025_01_user ON events_2025_01 (user_id, event_time);
CREATE INDEX idx_events_2025_01_type ON events_2025_01 (event_type);
-- Aggregated statistics (pre-computed)
CREATE TABLE daily_user_stats (
stat_date DATE,
user_id INT,
event_count INT,
session_count INT,
total_duration_seconds INT,
PRIMARY KEY (stat_date, user_id)
);
Design decisions:
- Time-based partitioning for efficient querying and data management
- Drop old partitions instead of DELETE for performance
- JSONB for flexible event attributes
- Pre-aggregated tables for dashboards
- Consider TimescaleDB or ClickHouse for large-scale time-series workloads
Conclusion
Database design is both an art and a science. Strong fundamentals, normalization theory, and entity-relationship modeling provide the foundation. Strategic denormalization, proper indexing, and understanding SQL versus NoSQL patterns optimize performance. Planning for scalability through replication, sharding, and caching ensures your system grows gracefully.
The key principles to remember:
- Understand your data and access patterns before designing schema
- Normalize first, denormalize strategically based on measured performance
- Use appropriate indexes for query patterns without over-indexing
- Choose the right database paradigm (SQL vs NoSQL) for your use case
- Plan for scale with replication, partitioning, and caching strategies
- Optimize based on metrics, not assumptions
- Enforce integrity with constraints and foreign keys
- Document your decisions and maintain schema version control
- Monitor continuously and iterate on design
- Learn from mistakes and avoid common anti-patterns
Database design is rarely perfect on the first attempt. Build in flexibility, measure performance, gather real-world usage data, and refine your schema over time. The best database designs evolve with the application while maintaining data integrity and performance.
Whether you’re building a simple application or a distributed system serving millions of users, these principles will guide you toward robust, scalable database architectures that serve your application reliably for years to come.
Further Reading:
- Database Design and Relational Theory by C.J. Date
- Designing Data-Intensive Applications by Martin Kleppmann
- PostgreSQL Documentation
- MongoDB Schema Design Best Practices
- AWS Database Blog