Dashboard Best Practices: Design for Insight and Action
Introduction
A well-designed dashboard is like a good story: it guides you through the data with purpose and clarity. This guide provides practical patterns for creating dashboards that inform decisions and enable quick troubleshooting.
Design Principles
The Hierarchy of Information
Primary Information (Top Third)
- Service health status
- SLO achievement
- Critical alerts
- Business KPIs
Secondary Information (Middle Third)
- Golden signals (latency, traffic, errors, saturation)
- Resource utilization
- Throughput and performance metrics
Tertiary Information (Bottom Third)
- Detailed breakdowns
- Historical trends
- Dependency status
- Debug information
Visual Design Principles
Rule of 7±2
- Limit each screen to 5-9 panels (7±2)
- Group related information together
- Use sections to organize complexity
Color Psychology
- Red: Critical issues, danger, immediate attention needed
- Yellow/Orange: Warnings, caution, degraded state
- Green: Healthy, normal operation, success
- Blue: Information, neutral metrics, capacity
- Gray: Disabled, unknown, or baseline states
Chart Selection Guide
- Line charts: Time series, trends, comparisons over time
- Bar charts: Categorical comparisons, top N lists
- Gauges: Single value with defined good/bad ranges
- Stat panels: Key metrics, percentages, counts
- Heatmaps: Distribution data, correlation analysis
- Tables: Detailed breakdowns, multi-dimensional data
Dashboard Archetypes
The Overview Dashboard
Purpose: High-level health check and business metrics
Audience: Executives, managers, cross-team stakeholders
Update Frequency: 5-15 minutes
sections:
  - title: "Business Health"
    panels:
      - service_availability_summary
      - revenue_per_hour
      - active_users
      - conversion_rate
  - title: "System Health"
    panels:
      - critical_alerts_count
      - slo_achievement_summary
      - error_budget_remaining
      - deployment_status
The SRE Operational Dashboard
Purpose: Real-time monitoring and incident response
Audience: SRE, on-call engineers
Update Frequency: 15-30 seconds
sections:
  - title: "Service Status"
    panels:
      - service_up_status
      - active_incidents
      - recent_deployments
  - title: "Golden Signals"
    panels:
      - latency_percentiles
      - request_rate
      - error_rate
      - resource_saturation
  - title: "Infrastructure"
    panels:
      - cpu_memory_utilization
      - network_io
      - disk_space
The Developer Debug Dashboard
Purpose: Deep-dive troubleshooting and performance analysis
Audience: Development teams
Update Frequency: 30 seconds - 2 minutes
sections:
  - title: "Application Performance"
    panels:
      - endpoint_latency_breakdown
      - database_query_performance
      - cache_hit_rates
      - queue_depths
  - title: "Errors and Logs"
    panels:
      - error_rate_by_endpoint
      - log_volume_by_level
      - exception_types
      - slow_queries
Layout Patterns
The F-Pattern Layout
Based on eye-tracking studies, users scan in an F-pattern:
[Critical Status] [SLO Summary ] [Error Budget ]
[Latency ] [Traffic ] [Errors ]
[Saturation ] [Resource Use ] [Detailed View]
[Historical ] [Dependencies ] [Debug Info ]
The Z-Pattern Layout
For executive dashboards, follow the Z-pattern:
[Business KPIs  ] → [System Status]
                  ↙
[Trend Analysis ] → [Key Metrics  ]
Responsive Design
Desktop (1920x1080)
- 24-column grid
- Panels can be 6, 8, 12, or 24 units wide
- 4-6 rows visible without scrolling
Laptop (1366x768)
- Stack wider panels vertically
- Reduce panel heights
- Prioritize most critical information
Mobile (768px width)
- Single column layout
- Simplified panels
- Touch-friendly controls
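The grid rules above can be sketched as a small placement routine. This is a minimal illustration, not any tool's actual layout engine: the function name `layout_panels`, the `row_height` default, and the `x`/`y`/`w`/`h` position keys are assumptions made for the example.

```python
GRID_COLUMNS = 24  # Desktop grid width from the guidelines above


def layout_panels(widths, row_height=8):
    """Place panels left to right, wrapping to a new row when the grid is full."""
    positions = []
    x = y = 0
    for w in widths:
        if w < 1 or w > GRID_COLUMNS:
            raise ValueError(f"panel width {w} does not fit a {GRID_COLUMNS}-column grid")
        if x + w > GRID_COLUMNS:
            # Not enough room left on this row: start the next one.
            x = 0
            y += row_height
        positions.append({"x": x, "y": y, "w": w, "h": row_height})
        x += w
    return positions


# Desktop: three 8-unit panels on the first row, two 12-unit panels on the second.
desktop = layout_panels([8, 8, 8, 12, 12])

# Mobile: every panel takes the full width, which yields a single column.
mobile = layout_panels([24, 24, 24])
```

Passing full-width panels is enough to produce the single-column mobile layout, so one routine can serve all three breakpoints.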
Effective Panel Design
Stat Panels
# Good: Clear value with context
- title: "API Availability"
  type: stat
  targets:
    - expr: avg(up{service="api"}) * 100
  field_config:
    unit: percent
    thresholds:
      steps:
        - color: red
          value: 0
        - color: yellow
          value: 99
        - color: green
          value: 99.9
  options:
    color_mode: background
    text_mode: value_and_name
Time Series Panels
# Good: Multiple related metrics with clear legend
# Buckets are summed by `le` so each quantile yields a single series
- title: "Request Latency"
  type: timeseries
  targets:
    - expr: histogram_quantile(0.50, sum by (le) (rate(http_duration_bucket[5m])))
      legend: "P50"
    - expr: histogram_quantile(0.95, sum by (le) (rate(http_duration_bucket[5m])))
      legend: "P95"
    - expr: histogram_quantile(0.99, sum by (le) (rate(http_duration_bucket[5m])))
      legend: "P99"
  field_config:
    unit: ms
    custom:
      draw_style: line
      fill_opacity: 10
  options:
    legend:
      display_mode: table
      placement: bottom
      values: [min, max, mean, last]
Table Panels
# Good: Top N with relevant columns
# The `le` label must survive the aggregation, or histogram_quantile has no buckets to work with
- title: "Slowest Endpoints"
  type: table
  targets:
    - expr: topk(10, histogram_quantile(0.95, sum by (handler, le) (rate(http_duration_bucket[5m]))))
      format: table
      instant: true
  transformations:
    - id: organize
      options:
        exclude_by_name:
          Time: true
        rename_by_name:
          Value: "P95 Latency (ms)"
          handler: "Endpoint"
Color and Visualization Best Practices
Threshold Configuration
# Traffic light system with meaningful boundaries
thresholds:
  steps:
    - color: green   # Good performance
      value: null    # Default
    - color: yellow  # Degraded performance
      value: 95      # 95th percentile of historical normal
    - color: orange  # Poor performance
      value: 99      # 99th percentile of historical normal
    - color: red     # Critical performance
      value: 99.9    # Worst case scenario
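To make the step semantics concrete, here is a minimal sketch of how a value could be resolved to a color. It assumes the steps are sorted by ascending threshold and that the first step is the default; `color_for_value` and `STEPS` are illustrative names, not part of any dashboard tool's API.

```python
# Mirrors the threshold steps in the configuration above.
STEPS = [
    {"color": "green", "value": None},   # Default step
    {"color": "yellow", "value": 95},
    {"color": "orange", "value": 99},
    {"color": "red", "value": 99.9},
]


def color_for_value(value, steps=STEPS):
    """Return the color of the highest step whose threshold the value has reached.

    Assumes steps are sorted by ascending threshold and that the first
    step is the default (its threshold is None).
    """
    color = steps[0]["color"]
    for step in steps[1:]:
        if value >= step["value"]:
            color = step["color"]
    return color
```

A value sitting exactly on a boundary takes the color of the step it has just reached, so 95 is already yellow.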
Color Blind Friendly Palettes
# Use patterns and shapes in addition to color
field_config:
  overrides:
    - matcher:
        id: byName
        options: "Critical"
      properties:
        - id: color
          value:
            mode: fixed
            fixed_color: "#d73027"  # Red-orange for protanopia
        - id: custom.draw_style
          value: "points"  # Different shape
Consistent Color Semantics
- Success/Health: Green (#28a745)
- Warning/Degraded: Yellow (#ffc107)
- Error/Critical: Red (#dc3545)
- Information: Blue (#007bff)
- Neutral: Gray (#6c757d)
Time Range Strategy
Default Time Ranges by Dashboard Type
Real-time Operational
- Default: Last 15 minutes
- Quick options: 5m, 15m, 1h, 4h
- Auto-refresh: 15-30 seconds
Troubleshooting
- Default: Last 1 hour
- Quick options: 15m, 1h, 4h, 12h, 1d
- Auto-refresh: 1 minute
Business Review
- Default: Last 24 hours
- Quick options: 1d, 7d, 30d, 90d
- Auto-refresh: 5 minutes
Capacity Planning
- Default: Last 7 days
- Quick options: 7d, 30d, 90d, 1y
- Auto-refresh: 15 minutes
Time Range Annotations
# Add context for time-based events
annotations:
  - name: "Deployments"
    datasource: "Prometheus"
    expr: "deployment_timestamp"
    title_format: "Deploy {{ version }}"
    text_format: "Deployed version {{ version }} to {{ environment }}"
  - name: "Incidents"
    datasource: "Incident API"
    query: "incidents.json?service={{ service }}"
    color: "red"
Interactive Features
Template Variables
# Service selector
- name: service
  type: query
  query: label_values(up, service)
  current:
    text: All
    value: $__all
  include_all: true
  multi: true
# Environment selector
# Uses a regex match (=~) because $service can hold several values or "All"
- name: environment
  type: query
  query: label_values(up{service=~"$service"}, environment)
  current:
    text: production
    value: production
Drill-Down Links
# Panel-level drill-downs
- title: "Error Rate"
  type: timeseries
  # ... other config ...
  options:
    data_links:
      - title: "View Error Logs"
        url: "/d/logs-dashboard?var-service=${__field.labels.service}&from=${__from}&to=${__to}"
      - title: "Error Traces"
        url: "/d/traces-dashboard?var-service=${__field.labels.service}"
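When drill-down links are generated by a script rather than typed by hand, building the query string with a URL encoder avoids broken links for service names that contain spaces or special characters. The sketch below shows the shape of the expanded link; `drilldown_url` is a hypothetical helper, and the millisecond timestamps are example values.

```python
from urllib.parse import urlencode


def drilldown_url(dashboard_uid, service, time_from=None, time_to=None):
    """Build a relative dashboard link that carries the service and time range."""
    params = {"var-service": service}
    if time_from is not None and time_to is not None:
        # Carry the current time window over to the target dashboard.
        params["from"] = time_from
        params["to"] = time_to
    return f"/d/{dashboard_uid}?{urlencode(params)}"


logs_link = drilldown_url("logs-dashboard", "api", 1700000000000, 1700003600000)
traces_link = drilldown_url("traces-dashboard", "checkout api")
```

Carrying `from` and `to` across keeps the target dashboard on the same time window the user was inspecting.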
Dynamic Panel Titles
- title: "${service} - Request Rate"  # Uses template variable
  type: timeseries
  # Title updates automatically when service variable changes
Performance Optimization
Query Optimization
Use Recording Rules
# Instead of complex queries in dashboards
groups:
  - name: http_requests
    rules:
      - record: http_request_rate_5m
        expr: sum(rate(http_requests_total[5m])) by (service, method, handler)
      - record: http_request_latency_p95_5m
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))
Limit Data Points
# Good: Reasonable resolution for a 1-hour dashboard window
- expr: http_request_rate_5m
  interval: 15s  # One point every 15 seconds: 240 points per series
# Bad: Too many points for visualization
- expr: http_request_rate_5m
  interval: 1s   # One point every second: 3600 points per series!
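The point counts quoted above follow from dividing the time range by the query interval. A minimal sketch of that arithmetic, with hypothetical helper names, also shows how to choose an interval that keeps a series under a target number of points:

```python
def point_count(range_seconds: int, interval_seconds: int) -> int:
    """Approximate number of samples per series for a time range and interval."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return range_seconds // interval_seconds


def min_interval(range_seconds: int, max_points: int) -> int:
    """Smallest whole-second interval that keeps a series within max_points."""
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    return max(1, -(-range_seconds // max_points))  # Ceiling division


one_hour = 3600
print(point_count(one_hour, 15))    # 240
print(point_count(one_hour, 1))     # 3600
print(min_interval(one_hour, 500))  # 8
```

Deriving the interval from the time range, rather than hard-coding it, keeps panels responsive when users zoom out to days or weeks.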
Dashboard Performance
Panel Limits
- Maximum panels per dashboard: 20-30
- Maximum queries per panel: 10
- Maximum time series per panel: 50
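Limits like these are easier to keep when they are checked automatically. The sketch below assumes a dashboard represented as a dict with a `panels` list, where each panel has a `targets` list of queries; that structure and the name `check_limits` are assumptions for illustration. The time series limit is omitted because series counts are only known once the queries run.

```python
MAX_PANELS = 30             # Upper end of the 20-30 range above
MAX_QUERIES_PER_PANEL = 10


def check_limits(dashboard: dict) -> list[str]:
    """Return a human-readable problem for every limit the dashboard exceeds."""
    problems = []
    panels = dashboard.get("panels", [])
    if len(panels) > MAX_PANELS:
        problems.append(f"{len(panels)} panels exceeds the limit of {MAX_PANELS}")
    for panel in panels:
        query_count = len(panel.get("targets", []))
        if query_count > MAX_QUERIES_PER_PANEL:
            title = panel.get("title", "<untitled>")
            problems.append(
                f"panel '{title}' has {query_count} queries, limit is {MAX_QUERIES_PER_PANEL}"
            )
    return problems
```

Running a check like this in CI catches dashboards that drift past the limits before they reach users.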
Caching Strategy
# Set one cache timeout per panel, matched to how quickly its data changes
cache_timeout: 30   # Fast-changing panels: cache for 30 seconds
cache_timeout: 300  # Slow-changing panels: cache for 5 minutes
Accessibility
Screen Reader Support
# Provide text alternatives for visual elements
- title: "Service Health Status"
  type: stat
  options:
    text_mode: value_and_name  # Includes both value and description
  field_config:
    mappings:
      - options:
          "1":
            text: "Healthy"
            color: "green"
          "0":
            text: "Unhealthy"
            color: "red"
Keyboard Navigation
- Ensure all interactive elements are keyboard accessible
- Provide logical tab order
- Include skip links for complex dashboards
High Contrast Mode
# Test dashboards work in high contrast mode
theme: high_contrast
colors:
  - "#000000"  # Pure black
  - "#ffffff"  # Pure white
  - "#ffff00"  # Pure yellow
  - "#ff0000"  # Pure red
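One way to test a palette like this is to compute the contrast ratio between a foreground and background color. The sketch below uses the relative luminance and contrast ratio formulas from WCAG 2.x; applying WCAG here is an assumption of this example, since the guide does not name a standard, and the function names are illustrative.

```python
def _linear(channel: int) -> float:
    """Convert an 8-bit sRGB channel to its linear-light value."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a '#rrggbb' color, from 0.0 (black) to 1.0 (white)."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(color_a: str, color_b: str) -> float:
    """Contrast ratio between two colors, from 1.0 (identical) to 21.0."""
    la, lb = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)
```

Black on white gives the maximum ratio of 21, and pure yellow on black scores far higher than pure red on black, which is why yellow is the safer highlight color on dark backgrounds.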
Testing and Validation
Dashboard Testing Checklist
Functional Testing
- All panels load without errors
- Template variables filter correctly
- Time range changes update all panels
- Drill-down links work as expected
- Auto-refresh functions properly
Visual Testing
- Dashboard renders correctly on different screen sizes
- Colors are distinguishable and meaningful
- Text is readable at normal zoom levels
- Legends and labels are clear
Performance Testing
- Dashboard loads in < 5 seconds
- No queries timeout under normal load
- Auto-refresh doesn't cause browser lag
- Memory usage remains reasonable
Usability Testing
- New team members can understand the dashboard
- Action items are clear during incidents
- Key information is quickly discoverable
- Dashboard supports common troubleshooting workflows
Maintenance and Governance
Dashboard Lifecycle
Creation
- Define dashboard purpose and audience
- Identify key metrics and success criteria
- Design layout following established patterns
- Implement with consistent styling
- Test with real data and user scenarios
Maintenance
- Weekly: Check for broken panels or queries
- Monthly: Review dashboard usage analytics
- Quarterly: Gather user feedback and iterate
- Annually: Major review and potential redesign
Retirement
- Archive dashboards that are no longer used
- Migrate users to replacement dashboards
- Document lessons learned
Dashboard Standards
# Organization dashboard standards
standards:
  naming_convention: "[Team] [Service] - [Purpose]"
  tags: [team, service_type, environment, purpose]
  refresh_intervals: [15s, 30s, 1m, 5m, 15m]
  time_ranges: [5m, 15m, 1h, 4h, 1d, 7d, 30d]
  color_scheme: "company_standard"
  max_panels_per_dashboard: 25
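Standards hold up better when a script enforces them. The sketch below checks three of the rules above. It assumes the team name is a single word so the title can be split unambiguously, and `check_standards` is a hypothetical helper rather than a feature of any dashboard tool.

```python
import re

# "[Team] [Service] - [Purpose]", assuming a one-word team name.
TITLE_RE = re.compile(r"^(?P<team>\S+) (?P<service>.+?) - (?P<purpose>.+)$")
ALLOWED_REFRESH = {"15s", "30s", "1m", "5m", "15m"}
MAX_PANELS = 25


def check_standards(title: str, refresh: str, panel_count: int) -> list[str]:
    """Return one message per violated standard; an empty list means compliant."""
    violations = []
    if TITLE_RE.match(title) is None:
        violations.append(f"title {title!r} does not match '[Team] [Service] - [Purpose]'")
    if refresh not in ALLOWED_REFRESH:
        violations.append(f"refresh interval {refresh!r} is not in the approved list")
    if panel_count > MAX_PANELS:
        violations.append(f"{panel_count} panels exceeds the limit of {MAX_PANELS}")
    return violations
```

For example, `check_standards("Payments Checkout API - Operational", "30s", 12)` returns an empty list, while a dashboard titled just `"Checkout"` fails the naming rule.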
Advanced Patterns
Composite Dashboards
# Overview panel that lists and links to related dashboards
- title: "Service Overview"
  type: dashlist
  targets:
    - "service-health"
    - "service-performance"
    - "service-business-metrics"
  options:
    show_headings: true
    max_items: 10
Dynamic Dashboard Generation
# Generate dashboards from service definitions
def generate_service_dashboard(service_config):
    panels = []
    # Always include golden signals
    panels.extend(generate_golden_signals_panels(service_config))
    # Add service-specific panels
    if service_config.type == 'database':
        panels.extend(generate_database_panels(service_config))
    elif service_config.type == 'queue':
        panels.extend(generate_queue_panels(service_config))
    return {
        'title': f"{service_config.name} - Operational Dashboard",
        'panels': panels,
        'variables': generate_variables(service_config)
    }
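As the number of service types grows, the if/elif chain can be swapped for a lookup table. The following is a self-contained sketch of that variant, not the generator above: `ServiceConfig`, the panel helpers, and the panel contents are all placeholder assumptions.

```python
from dataclasses import dataclass


@dataclass
class ServiceConfig:
    name: str
    type: str  # For example "web", "database", or "queue"


def golden_signal_panels(cfg):
    """One placeholder panel per golden signal."""
    return [
        {"title": f"{cfg.name} {signal}", "type": "timeseries"}
        for signal in ("Latency", "Traffic", "Errors", "Saturation")
    ]


def database_panels(cfg):
    return [{"title": f"{cfg.name} Slow Queries", "type": "table"}]


def queue_panels(cfg):
    return [{"title": f"{cfg.name} Queue Depth", "type": "timeseries"}]


# Registry of service-specific panel generators, keyed by service type.
EXTRA_PANELS = {"database": database_panels, "queue": queue_panels}


def build_dashboard(cfg):
    """Golden signals for every service, plus extras registered for its type."""
    panels = golden_signal_panels(cfg)
    extra = EXTRA_PANELS.get(cfg.type)
    if extra is not None:
        panels.extend(extra(cfg))
    return {"title": f"{cfg.name} - Operational Dashboard", "panels": panels}


dashboard = build_dashboard(ServiceConfig(name="orders-db", type="database"))
```

Supporting a new service type then means adding one entry to `EXTRA_PANELS`, with no change to the builder itself.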
A/B Testing for Dashboards
# Test different dashboard designs with different teams
experiment:
  name: "dashboard_layout_test"
  variants:
    - name: "traditional_layout"
      weight: 50
      config: "dashboard_v1.json"
    - name: "f_pattern_layout"
      weight: 50
      config: "dashboard_v2.json"
  success_metrics:
    - "time_to_insight"
    - "user_satisfaction"
    - "troubleshooting_efficiency"
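For an experiment like this to produce usable results, each team should see the same variant every time. One possible approach, sketched below, hashes the team name into a weighted bucket; the hashing scheme and the name `assign_variant` are assumptions, not a prescribed method.

```python
import hashlib

# Variant names and weights from the experiment definition above.
VARIANTS = [("traditional_layout", 50), ("f_pattern_layout", 50)]


def assign_variant(team: str, variants=VARIANTS) -> str:
    """Deterministically map a team to a variant in proportion to the weights."""
    total = sum(weight for _, weight in variants)
    digest = hashlib.sha256(team.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    cumulative = 0
    for name, weight in variants:
        cumulative += weight
        if bucket < cumulative:
            return name
    raise AssertionError("unreachable: bucket is always below the total weight")
```

Because the assignment depends only on the team name and the weights, no per-team state needs to be stored, and repeated calls return the same variant.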
Remember: A dashboard should tell a story about your system's health and guide users toward the right actions. Focus on clarity over complexity, and always optimize for the person who will use it during a stressful incident.