Test Reliability Metrics Every QA Team Should Track (FlakeScore and Beyond)

You cannot improve what you do not measure. This principle applies to test reliability as much as it does to application performance, revenue, or customer satisfaction. Yet most QA teams fly blind when it comes to understanding the health of their test suite. They know that "tests are flaky sometimes" but cannot tell you which tests, how often, or whether things are getting better or worse.

This guide defines the essential metrics for measuring test reliability, explains how to calculate each one, provides industry benchmarks, and shows you how to build dashboards that drive actual improvement. Whether you are a QA lead trying to justify investment in test reliability, or an engineer trying to prioritize which flaky tests to fix first, these metrics give you the data to make informed decisions.

Why Test Reliability Metrics Matter

Before diving into specific metrics, let us establish why measurement matters for test reliability.

The Hidden Cost Problem

Most organizations have no idea how much flaky tests cost them. Without metrics, the cost is invisible -- spread across dozens of developers, thousands of CI runs, and countless hours of lost productivity. Metrics make the cost visible and actionable.

Consider a team that runs their test suite 50 times per day across all branches. If 3% of those runs fail due to flaky tests, that is 1.5 false failures per day. Each false failure costs an average of 30 minutes of developer time (investigating, re-running, waiting). That is 45 minutes per day, or roughly 16 hours per month of engineering time wasted on phantom failures from a single developer's perspective. Multiply by team size, and you are looking at hundreds of hours per month.

The Trust Problem

Test reliability directly affects developer trust in the test suite. Research consistently shows that when test suite reliability drops below 95%, developers start ignoring test failures. Once that happens, real bugs slip through alongside the false failures, and the entire testing investment is undermined.

Metrics create accountability and visibility. When the team can see that test reliability has dropped from 98% to 93%, they can take action before trust erodes.

The Prioritization Problem

A team might have 50 flaky tests. Which ones should they fix first? Without metrics, the answer is usually "whichever one annoyed someone most recently." With metrics, you can prioritize by impact: fix the test that fails most often, blocks the most pipelines, or wastes the most developer time.

Core Metric 1: Flake Rate

The flake rate is the most fundamental test reliability metric. It measures the percentage of test executions that produce non-deterministic results.

Definition

Flake Rate = (Number of non-deterministic test results) / (Total number of test executions) x 100

A "non-deterministic result" means the test produced a different result (pass vs. fail) across multiple runs with no code changes.

How to Calculate It

There are two approaches to calculating flake rate:

Approach 1: Binary detection (simple)

Run each test N times. If it ever produces different results across those N runs, it is flaky. The flake rate is the percentage of tests that are flaky.

# Simple flake rate calculation
def calculate_flake_rate(test_results):
    """
    test_results: dict mapping test_name -> list of booleans (True=pass, False=fail)
    Returns: float between 0 and 1
    """
    flaky_count = 0
    total_tests = len(test_results)

    for test_name, results in test_results.items():
        unique_results = set(results)
        if len(unique_results) > 1:  # Both True and False appear
            flaky_count += 1

    return flaky_count / total_tests if total_tests > 0 else 0

Approach 2: Weighted by frequency (more accurate)

Calculate the flake rate for each test individually, then aggregate:

def calculate_per_test_flake_rate(results):
    """
    results: list of booleans (True=pass, False=fail)
    Returns: float between 0 and 1
    """
    if len(results) < 2:
        return 0.0

    # A test that always passes or always fails is not flaky
    pass_rate = sum(results) / len(results)

    if pass_rate == 0.0 or pass_rate == 1.0:
        return 0.0

    # Flake rate is how far from deterministic the test is
    # A test that fails 50% of the time has the highest flake rate
    return 2 * min(pass_rate, 1 - pass_rate)

def calculate_suite_flake_rate(all_test_results):
    """
    all_test_results: dict mapping test_name -> list of booleans
    Returns: float between 0 and 1
    """
    rates = [calculate_per_test_flake_rate(results)
             for results in all_test_results.values()]
    return sum(rates) / len(rates) if rates else 0

Industry Benchmarks

| Flake Rate | Assessment | Action |

|-----------|-----------|--------|

| < 0.5% | Excellent | Maintain current practices |

| 0.5% - 2% | Acceptable | Monitor trends, fix high-impact flakes |

| 2% - 5% | Concerning | Prioritize reliability improvements |

| 5% - 10% | Critical | Dedicated sprint for reliability |

| > 10% | Emergency | Stop feature work, fix test suite |

Common Pitfalls

Sample size matters: Running a test 3 times is insufficient to detect a 5% flake rate. Use at least 20 runs for reliable detection.

Time dependency: Some flaky tests only fail at certain times of day. Measure across different times.

Environment dependency: A test might be flaky only in CI, not locally. Always measure in the environment that matters most.

Core Metric 2: FlakeScore

FlakeScore is a composite metric that goes beyond simple flake rate to capture the severity and impact of flakiness. While flake rate tells you "how often," FlakeScore tells you "how much does it hurt."

Definition

FlakeScore combines multiple factors into a single 0-100 score:

FlakeScore = w1 FlakeFrequency + w2 FlakeImpact + w3 FlakeRecency + w4 FlakePersistence

Where:

FlakeFrequency (0-100): How often the test flakes, normalized to a 0-100 scale

FlakeImpact (0-100): How much damage a flake causes (blocks deployments, wastes CI time)

FlakeRecency (0-100): How recently the test flaked (recent flakes score higher)

FlakePersistence (0-100): How long the test has been flaky

The weights (w1-w4) are configurable but commonly set to:

w1 (Frequency) = 0.35
w2 (Impact) = 0.30
w3 (Recency) = 0.20
w4 (Persistence) = 0.15

Calculating FlakeScore

from datetime import datetime, timedelta

def calculate_flake_score(test_data):
    """
    test_data: {
        'results': [(timestamp, pass/fail), ...],  # Last 30 days of results
        'blocks_deployment': bool,
        'ci_minutes_per_run': float,
        'first_flake_date': datetime,
        'test_file': str,
    }
    """
    results = test_data['results']

    # 1. Frequency Score (0-100)
    if len(results) < 2:
        return 0

    total_runs = len(results)
    pass_count = sum(1 for _, result in results if result == 'pass')
    fail_count = total_runs - pass_count

    # Flake frequency is highest when pass/fail is 50/50
    if fail_count == 0 or pass_count == 0:
        frequency_score = 0
    else:
        raw_flake_rate = min(pass_count, fail_count) / total_runs
        frequency_score = min(100, raw_flake_rate * 200)  # Scale so 50% = 100

    # 2. Impact Score (0-100)
    impact_score = 0
    if test_data.get('blocks_deployment'):
        impact_score += 50

    ci_minutes = test_data.get('ci_minutes_per_run', 5)
    impact_score += min(50, ci_minutes * 2)  # 25 minutes = max impact

    # 3. Recency Score (0-100)
    # Recent flakes score higher
    now = datetime.now()
    recent_flakes = [ts for ts, result in results
                     if result == 'fail' and (now - ts) < timedelta(days=7)]

    if recent_flakes:
        most_recent = max(recent_flakes)
        days_since = (now - most_recent).days
        recency_score = max(0, 100 - (days_since * 15))  # Decays over ~7 days
    else:
        recency_score = 0

    # 4. Persistence Score (0-100)
    first_flake = test_data.get('first_flake_date')
    if first_flake:
        days_flaky = (now - first_flake).days
        persistence_score = min(100, days_flaky * 3)  # Max at ~33 days
    else:
        persistence_score = 0

    # Weighted composite
    flake_score = (
        0.35 * frequency_score +
        0.30 * impact_score +
        0.20 * recency_score +
        0.15 * persistence_score
    )

    return round(flake_score, 1)

Using FlakeScore for Prioritization

FlakeScore enables data-driven prioritization. Instead of fixing flaky tests based on gut feeling, sort them by FlakeScore and work from the top:

|------|-----------|-----------|--------|--------|

DeFlaky calculates FlakeScore automatically for every test in your suite, updating it after each CI run. This gives you a continuously updated priority list that reflects the current state of your test suite, not a snapshot from the last time someone manually investigated.

Core Metric 3: MTTR for Flaky Tests (Mean Time to Resolve)

MTTR (Mean Time to Resolve) is a well-known DevOps metric, typically applied to production incidents. Applying it to flaky tests gives you insight into how quickly your team responds to test reliability issues.

Definition

MTTR = Average time from when a flaky test is detected to when it is fixed and stable

How to Track It

def calculate_flaky_mttr(flaky_tests):
    """
    flaky_tests: list of dicts with 'detected_at' and 'resolved_at' timestamps
    Returns: average resolution time in hours
    """
    resolved = [t for t in flaky_tests if t.get('resolved_at')]

    if not resolved:
        return None

    total_hours = sum(
        (t['resolved_at'] - t['detected_at']).total_seconds() / 3600
        for t in resolved
    )

    return total_hours / len(resolved)

Breaking Down MTTR

MTTR can be decomposed into sub-metrics that reveal where time is being spent:

MTTD (Mean Time to Detect): How long before a flaky test is identified as flaky (not just a one-time failure)

MTTA (Mean Time to Acknowledge): How long before someone takes ownership of fixing it

MTTF (Mean Time to Fix): How long the actual fix takes

MTTV (Mean Time to Verify): How long to confirm the fix resolved the flakiness

MTTR = MTTD + MTTA + MTTF + MTTV

In most teams, MTTD is the largest component. Developers often do not realize a test is flaky until it has been failing intermittently for weeks. Automated detection tools like DeFlaky dramatically reduce MTTD by continuously monitoring test results and flagging non-deterministic patterns.

Industry Benchmarks

| MTTR | Assessment | Notes |

|------|-----------|-------|

| < 24 hours | Excellent | Rapid response, strong ownership culture |

| 1-3 days | Good | Reasonable for most teams |

| 3-7 days | Acceptable | Might indicate resourcing issues |

| 1-2 weeks | Concerning | Flaky tests are accumulating |

| > 2 weeks | Critical | Test reliability is not a priority, trust is eroding |

Core Metric 4: Test Suite Health Score

The Test Suite Health Score is a single number that represents the overall reliability of your test suite. It is useful for executive reporting and trend tracking.

Definition

Health Score = 100 - (Penalties)

Penalties are calculated from:

def calculate_health_score(suite_data):
    """
    suite_data: {
        'total_tests': int,
        'flaky_tests': int,
        'quarantined_tests': int,
        'disabled_tests': int,
        'avg_flake_rate': float,  # 0-1
        'suite_pass_rate': float,  # 0-1 (across last 30 days of CI runs)
        'avg_execution_time': float,  # seconds
        'flaky_mttr_hours': float,
    }
    """
    score = 100.0

    # Penalty for flaky tests (max -30 points)
    flaky_ratio = suite_data['flaky_tests'] / suite_data['total_tests']
    score -= min(30, flaky_ratio * 500)  # 6% flaky = -30

    # Penalty for quarantined tests (max -15 points)
    quarantine_ratio = suite_data['quarantined_tests'] / suite_data['total_tests']
    score -= min(15, quarantine_ratio * 300)  # 5% quarantined = -15

    # Penalty for disabled tests (max -10 points)
    disabled_ratio = suite_data['disabled_tests'] / suite_data['total_tests']
    score -= min(10, disabled_ratio * 200)  # 5% disabled = -10

    # Penalty for low suite pass rate (max -25 points)
    if suite_data['suite_pass_rate'] < 1.0:
        score -= min(25, (1 - suite_data['suite_pass_rate']) * 500)

    # Penalty for slow MTTR (max -10 points)
    if suite_data['flaky_mttr_hours'] > 24:
        score -= min(10, (suite_data['flaky_mttr_hours'] - 24) / 24 * 5)

    # Penalty for slow execution (max -10 points)
    if suite_data['avg_execution_time'] > 600:  # > 10 minutes
        score -= min(10, (suite_data['avg_execution_time'] - 600) / 120)

    return max(0, round(score, 1))

Health Score Grades

| Score | Grade | Description |

|-------|-------|-------------|

| 90-100 | A | Excellent -- test suite is highly reliable |

| 80-89 | B | Good -- minor issues that should be addressed |

| 70-79 | C | Acceptable -- noticeable reliability issues |

| 60-69 | D | Poor -- significant investment needed |

| < 60 | F | Critical -- test suite is not trustworthy |

Core Metric 5: Pipeline Reliability Rate

This metric measures the percentage of CI pipeline runs that succeed without requiring retries due to test flakiness.

Definition

Pipeline Reliability Rate = (Runs that pass on first attempt) / (Total runs) x 100

Calculation

def calculate_pipeline_reliability(runs):
    """
    runs: list of dicts with 'attempts' (int) and 'final_result' (pass/fail)
    """
    total = len(runs)
    first_attempt_passes = sum(1 for r in runs if r['attempts'] == 1 and r['final_result'] == 'pass')

    return (first_attempt_passes / total * 100) if total > 0 else 100

Why It Matters

Pipeline Reliability Rate is the metric that most directly correlates with developer experience. When this number drops, developers feel the pain in the form of longer wait times, more re-runs, and decreased confidence in the pipeline.

| Rate | Assessment |

|------|-----------|

| > 98% | Excellent -- minimal disruption |

| 95-98% | Good -- occasional retries |

| 90-95% | Concerning -- noticeable friction |

| 80-90% | Poor -- significant productivity impact |

| < 80% | Critical -- pipeline is unreliable |

Advanced Metric 6: Flake Correlation Score

Some flaky tests always fail together, suggesting a shared root cause. The Flake Correlation Score identifies these clusters.

Definition

The Flake Correlation Score measures how strongly two tests' failure patterns are correlated. A score of 1.0 means they always fail together; a score of 0.0 means their failures are independent.

Calculation

import numpy as np

def calculate_flake_correlation(test_a_results, test_b_results):
    """
    Both inputs: list of booleans (True=pass, False=fail) from the same runs
    Returns: correlation coefficient (-1 to 1)
    """
    if len(test_a_results) != len(test_b_results):
        raise ValueError("Both tests must have the same number of runs")

    a = np.array([1 if r else 0 for r in test_a_results])
    b = np.array([1 if r else 0 for r in test_b_results])

    # Handle constant arrays (no variation)
    if np.std(a) == 0 or np.std(b) == 0:
        return 0.0

    correlation = np.corrcoef(a, b)[0, 1]
    return round(correlation, 3)

def find_correlated_flakes(all_results, threshold=0.5):
    """
    all_results: dict mapping test_name -> list of booleans
    Returns: list of correlated test pairs
    """
    test_names = list(all_results.keys())
    correlations = []

    for i in range(len(test_names)):
        for j in range(i + 1, len(test_names)):
            corr = calculate_flake_correlation(
                all_results[test_names[i]],
                all_results[test_names[j]]
            )
            if abs(corr) >= threshold:
                correlations.append({
                    'test_a': test_names[i],
                    'test_b': test_names[j],
                    'correlation': corr,
                })

    return sorted(correlations, key=lambda x: abs(x['correlation']), reverse=True)

Practical Application

Correlated flakes usually share a root cause:

High positive correlation (both fail together): Shared dependency, database state, or service

High negative correlation (one fails when the other passes): Resource contention, ordering dependency

Cluster of correlated tests: Likely a shared infrastructure issue (database connection pool, file system, network)

Fixing one test in a correlated cluster often fixes all of them, making this metric extremely valuable for prioritization.

Advanced Metric 7: Cost per Flake

This metric translates flaky test impact into dollars, making it easy to communicate with non-technical stakeholders.

Calculation

def calculate_cost_per_flake(config):
    """
    config: {
        'avg_developer_hourly_rate': float,  # e.g., 75.0
        'avg_investigation_minutes': float,  # e.g., 30
        'avg_ci_cost_per_minute': float,  # e.g., 0.05
        'avg_pipeline_duration_minutes': float,  # e.g., 20
        'retries_per_flake': float,  # e.g., 1.5
    }
    """
    # Developer time cost
    dev_cost = (
        config['avg_developer_hourly_rate'] / 60 *
        config['avg_investigation_minutes']
    )

    # CI compute cost
    ci_cost = (
        config['avg_ci_cost_per_minute'] *
        config['avg_pipeline_duration_minutes'] *
        config['retries_per_flake']
    )

    # Opportunity cost (blocked deployments)
    # Rough estimate: 15 minutes of blocked team productivity
    opportunity_cost = config['avg_developer_hourly_rate'] / 60 * 15

    total = dev_cost + ci_cost + opportunity_cost

    return {
        'developer_time': round(dev_cost, 2),
        'ci_compute': round(ci_cost, 2),
        'opportunity_cost': round(opportunity_cost, 2),
        'total_per_flake': round(total, 2),
    }

Example calculation
cost = calculate_cost_per_flake({
    'avg_developer_hourly_rate': 75.0,
    'avg_investigation_minutes': 30,
    'avg_ci_cost_per_minute': 0.05,
    'avg_pipeline_duration_minutes': 20,
    'retries_per_flake': 1.5,
})

Result:
developer_time: $37.50
ci_compute: $1.50
opportunity_cost: $18.75
total_per_flake: $57.75

Multiply the cost per flake by the number of daily flakes to get the monthly cost. For a team experiencing 5 flakes per day at $57.75 each, that is roughly $8,663 per month -- money that justifies investment in flaky test detection and resolution.

Building a Test Reliability Dashboard

Metrics are only useful if they are visible. Here is how to build a dashboard that drives action.

What to Include

A good test reliability dashboard has three sections:

1. Executive Summary (for managers and directors)

Test Suite Health Score (single number, trending)
Pipeline Reliability Rate (percentage, trending)
Monthly cost of flaky tests (dollars)
Flaky test count trend (last 90 days)

2. Operational View (for QA leads and team leads)

Top 10 flakiest tests by FlakeScore
Quarantine status (count, age, trend)
MTTR breakdown (MTTD, MTTA, MTTF, MTTV)
Flake rate by test category (unit, integration, E2E)

3. Detail View (for engineers)

Individual test FlakeScores with history
Failure pattern analysis (time-based, ordering-based)
Correlated test clusters
Fix recommendations

Implementation

For teams using DeFlaky, the dashboard is built in and updates automatically after every CI run. For teams building their own, here is a data model:

-- Test results table
CREATE TABLE test_results (
    id SERIAL PRIMARY KEY,
    test_name VARCHAR(500) NOT NULL,
    test_file VARCHAR(500) NOT NULL,
    suite_name VARCHAR(200),
    status VARCHAR(10) NOT NULL,  -- 'pass', 'fail', 'skip'
    duration_ms INTEGER,
    error_message TEXT,
    run_id VARCHAR(100) NOT NULL,
    branch VARCHAR(200),
    commit_sha VARCHAR(40),
    ci_platform VARCHAR(50),
    worker_id VARCHAR(50),
    created_at TIMESTAMP DEFAULT NOW()
);

-- Flaky test tracking
CREATE TABLE flaky_tests (
    id SERIAL PRIMARY KEY,
    test_name VARCHAR(500) NOT NULL,
    test_file VARCHAR(500) NOT NULL,
    flake_score DECIMAL(5,2),
    flake_rate DECIMAL(5,4),
    first_detected TIMESTAMP,
    last_flake TIMESTAMP,
    status VARCHAR(20) DEFAULT 'active',  -- 'active', 'quarantined', 'fixed'
    assigned_to VARCHAR(100),
    resolved_at TIMESTAMP,
    root_cause VARCHAR(50),  -- 'timing', 'isolation', 'environment', 'data', 'unknown'
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Pipeline runs
CREATE TABLE pipeline_runs (
    id SERIAL PRIMARY KEY,
    run_id VARCHAR(100) NOT NULL,
    branch VARCHAR(200),
    commit_sha VARCHAR(40),
    status VARCHAR(10),  -- 'pass', 'fail'
    attempts INTEGER DEFAULT 1,
    total_tests INTEGER,
    passed_tests INTEGER,
    failed_tests INTEGER,
    flaky_tests INTEGER,
    duration_seconds INTEGER,
    ci_platform VARCHAR(50),
    created_at TIMESTAMP DEFAULT NOW()
);

-- Useful views
CREATE VIEW flaky_test_summary AS
SELECT
    ft.test_name,
    ft.test_file,
    ft.flake_score,
    ft.status,
    ft.first_detected,
    ft.last_flake,
    EXTRACT(EPOCH FROM (COALESCE(ft.resolved_at, NOW()) - ft.first_detected)) / 3600 AS hours_open,
    COUNT(tr.id) AS total_runs,
    SUM(CASE WHEN tr.status = 'fail' THEN 1 ELSE 0 END) AS failure_count,
    ROUND(
        SUM(CASE WHEN tr.status = 'fail' THEN 1 ELSE 0 END)::DECIMAL / COUNT(tr.id) * 100,
        2
    ) AS failure_percentage
FROM flaky_tests ft
LEFT JOIN test_results tr ON ft.test_name = tr.test_name
WHERE tr.created_at > NOW() - INTERVAL '30 days'
GROUP BY ft.id
ORDER BY ft.flake_score DESC;

Querying for Dashboard Panels

-- Executive: Health score trend (weekly)
SELECT
    DATE_TRUNC('week', created_at) AS week,
    COUNT(*) AS total_runs,
    SUM(CASE WHEN attempts = 1 AND status = 'pass' THEN 1 ELSE 0 END) AS clean_passes,
    ROUND(
        SUM(CASE WHEN attempts = 1 AND status = 'pass' THEN 1 ELSE 0 END)::DECIMAL / COUNT()  100,
        1
    ) AS reliability_rate
FROM pipeline_runs
WHERE created_at > NOW() - INTERVAL '90 days'
GROUP BY week
ORDER BY week;

-- Operational: Top flaky tests
SELECT
    test_name,
    test_file,
    flake_score,
    flake_rate,
    status,
    first_detected,
    last_flake,
    assigned_to
FROM flaky_tests
WHERE status IN ('active', 'quarantined')
ORDER BY flake_score DESC
LIMIT 20;

-- Detail: Test failure pattern (time-of-day analysis)
SELECT
    test_name,
    EXTRACT(HOUR FROM created_at) AS hour_of_day,
    COUNT(*) AS total_runs,
    SUM(CASE WHEN status = 'fail' THEN 1 ELSE 0 END) AS failures,
    ROUND(
        SUM(CASE WHEN status = 'fail' THEN 1 ELSE 0 END)::DECIMAL / COUNT()  100,
        1
    ) AS failure_rate
FROM test_results
WHERE test_name = 'should process payment successfully'
AND created_at > NOW() - INTERVAL '30 days'
GROUP BY test_name, hour_of_day
ORDER BY hour_of_day;

Setting Targets and SLAs

Metrics without targets are just numbers. Here is how to set realistic targets for your team.

Starting Point Assessment

Before setting targets, measure your current baseline over 2-4 weeks. This gives you a realistic starting point and prevents setting targets that are either too easy or impossibly aggressive.

def assess_baseline(historical_data):
    """
    Determine your team's current test reliability baseline
    """
    current_metrics = {
        'flake_rate': calculate_suite_flake_rate(historical_data['test_results']),
        'pipeline_reliability': calculate_pipeline_reliability(historical_data['pipeline_runs']),
        'flaky_test_count': len([t for t in historical_data['tests'] if t['is_flaky']]),
        'mttr_hours': calculate_flaky_mttr(historical_data['flaky_tests']),
        'health_score': calculate_health_score(historical_data['suite_data']),
    }

    return current_metrics

Progressive Targets

Set targets that improve gradually over quarters:

|--------|----------|----------|----------|----------|

| Flake Rate | < 5% | < 3% | < 1% | < 0.5% |

| Pipeline Reliability | > 90% | > 95% | > 97% | > 99% |

| MTTR (hours) | < 120 | < 72 | < 48 | < 24 |

| Health Score | > 60 | > 70 | > 80 | > 90 |

| Quarantine Size | < 10% | < 5% | < 2% | < 1% |

Team SLAs

Define response SLAs based on FlakeScore severity:

|-----------|---------|-------------|---------------|

| 20-40 | Low | 1 week | 2 weeks |

Reporting and Communication

Different stakeholders need different levels of detail.

Weekly Team Report

## Test Reliability Weekly Report

Week of: April 7 - April 13, 2026

Summary
Health Score: 82/100 (up from 78)
Pipeline Reliability: 96.2% (up from 94.1%)
Flaky Tests: 12 active, 3 quarantined (down from 15 active)

Wins
Fixed 4 flaky tests (checkout, search, auth, notifications)
MTTR improved from 5.2 days to 3.1 days
Zero deployment blockages from flaky tests this week

Action Items
[ ] Investigate payment.test.js cluster (FlakeScore: 73)
[ ] Review quarantined tests for possible unquarantine
[ ] Add DeFlaky monitoring to new microservice repos

Trends (30-day)
Flake rate: 3.2% -> 1.8% (improving)
Pipeline reliability: 93% -> 96.2% (improving)
MTTR: 6.4 days -> 3.1 days (improving)

Monthly Executive Report

## Test Reliability Monthly Report - March 2026

Business Impact
Engineering hours saved: 45 hours (vs. February baseline)
CI cost reduction: $1,200/month from fewer retries
Deployment frequency: Increased from 3/week to 5/week
Production incidents from test gaps: 0 (down from 2)

Key Metrics
| Metric | February | March | Target | Status |
|--------|---------|-------|--------|--------|
| Health Score | 71 | 82 | 80 | Met |
| Pipeline Reliability | 91% | 96% | 95% | Met |
| Flake Rate | 4.1% | 1.8% | 3% | Met |
| MTTR | 6.4 days | 3.1 days | 5 days | Met |

Investment Recommendation
Current trajectory suggests we will reach our Q3 targets ahead of schedule.
Recommend maintaining current investment level.

Tools and Automation

DeFlaky for Automated Metrics

DeFlaky automates the collection, calculation, and visualization of all the metrics described in this guide. After integrating with your CI pipeline, it automatically:

Tracks every test execution result
Calculates FlakeScore, flake rate, and correlation scores
Updates the dashboard in real time
Sends alerts when metrics exceed thresholds
Generates weekly and monthly reports
Recommends which tests to fix based on impact analysis

# Set up DeFlaky metrics tracking
deflaky init --ci github-actions
deflaky config set --metric-targets '{
  "flake_rate": 0.02,
  "pipeline_reliability": 0.95,
  "mttr_hours": 72,
  "health_score": 80
}'
deflaky config set --alerts '{
  "slack_webhook": "https://hooks.slack.com/...",
  "alert_on": ["health_score_drop", "new_flaky_test", "sla_breach"]
}'

Building Your Own Metrics Pipeline

If you prefer to build your own, use this architecture:

Collection: Parse test results from CI (JUnit XML, JSON reports) and store in a time-series database

Calculation: Run metric calculations daily or after each CI run

Storage: PostgreSQL for relational queries, or InfluxDB/TimescaleDB for time-series

Visualization: Grafana, Metabase, or a custom dashboard

Alerting: PagerDuty, Slack, or email based on threshold breaches

Anti-Patterns in Test Reliability Measurement

Anti-Pattern 1: Measuring Only Pass Rate

Suite pass rate (percentage of tests that pass) is not enough. A suite with 100% pass rate might have 20 quarantined tests and 10 disabled tests. The pass rate looks great, but the suite is actually unreliable.

Always measure pass rate alongside quarantine size, disabled test count, and overall coverage.

Anti-Pattern 2: Counting Flaky Tests Without Weighting

Not all flaky tests are equal. A test that fails 1% of the time on a non-critical path is less important than a test that fails 10% of the time and blocks deployments. Use FlakeScore or a similar weighted metric.

Anti-Pattern 3: Ignoring Trends

A flake rate of 2% is concerning, but a flake rate of 2% that was 5% last month is excellent progress. Always show metrics as trends, not point-in-time snapshots.

Anti-Pattern 4: Setting Unrealistic Targets

A team with a 10% flake rate cannot reach 0.5% in one quarter. Set progressive targets that are challenging but achievable. Celebrate progress at each milestone.

Anti-Pattern 5: Measuring Without Acting

Dashboards are not solutions. If you track flaky test metrics but never prioritize fixing them, the dashboard becomes demoralizing rather than motivating. Connect metrics to action: every metric should have a clear owner and response plan.

Conclusion

Test reliability metrics transform flaky tests from an invisible, frustrating problem into a visible, manageable one. By tracking FlakeScore, flake rate, MTTR, pipeline reliability, and test suite health, you give your team the data they need to prioritize effectively and demonstrate progress to stakeholders.

Start with the basics: measure your current flake rate and pipeline reliability. Then add FlakeScore for prioritization and MTTR for tracking your response time. Build a simple dashboard (or use DeFlaky to get one out of the box), set progressive targets, and report regularly.

The teams that take test reliability metrics seriously consistently outperform those that do not. They deploy faster, catch more real bugs, waste less time on false failures, and maintain higher developer satisfaction. The metrics described in this guide are your roadmap to joining them.

Test Reliability Metrics Every QA Team Should Track (FlakeScore and Beyond)

Why Test Reliability Metrics Matter

The Hidden Cost Problem

The Trust Problem

The Prioritization Problem

Core Metric 1: Flake Rate

Definition

How to Calculate It

Industry Benchmarks

Common Pitfalls

Core Metric 2: FlakeScore

Definition

Calculating FlakeScore

Using FlakeScore for Prioritization

Core Metric 3: MTTR for Flaky Tests (Mean Time to Resolve)

Definition

How to Track It

Breaking Down MTTR

Industry Benchmarks

Core Metric 4: Test Suite Health Score

Definition

Health Score Grades

Core Metric 5: Pipeline Reliability Rate

Definition

Calculation

Why It Matters

Advanced Metric 6: Flake Correlation Score

Definition

Calculation

Practical Application

Advanced Metric 7: Cost per Flake

Calculation

Example calculation

Result:

developer_time: $37.50

ci_compute: $1.50

opportunity_cost: $18.75

total_per_flake: $57.75

Building a Test Reliability Dashboard

What to Include

Implementation

Querying for Dashboard Panels

Setting Targets and SLAs

Starting Point Assessment

Progressive Targets

Team SLAs

Reporting and Communication

Weekly Team Report

Summary

Wins

Action Items

Trends (30-day)

Monthly Executive Report

Business Impact

Key Metrics

Investment Recommendation

Tools and Automation

DeFlaky for Automated Metrics

Building Your Own Metrics Pipeline

Anti-Patterns in Test Reliability Measurement

Anti-Pattern 1: Measuring Only Pass Rate

Anti-Pattern 2: Counting Flaky Tests Without Weighting

Anti-Pattern 3: Ignoring Trends

Anti-Pattern 4: Setting Unrealistic Targets

Anti-Pattern 5: Measuring Without Acting

Conclusion

Stop guessing. DeFlaky your tests.