How to Detect Flaky Tests in Your CI Pipeline (Automated Detection Guide)
You cannot fix what you cannot see. The biggest challenge with flaky tests is not fixing them -- it is finding them in the first place. A test that fails once every twenty runs is easy to dismiss. A test that fails once every hundred runs might go unnoticed for months. Meanwhile, these invisible failures erode pipeline trust, waste compute, and train your team to ignore red builds.
This guide focuses specifically on detection -- the systematic methods for identifying flaky tests inside your CI pipeline before they become entrenched problems.
Why Manual Detection Fails
Most teams discover flaky tests the hard way: a developer sees a failed build, investigates, finds nothing wrong, reruns the pipeline, and it passes. They shrug and move on. This reactive approach has three critical problems.
First, it only catches tests that fail during someone's active work. If a flaky test fails at 2 AM during a scheduled build, nobody investigates. The rerun passes and the flakiness is invisible.
Second, it depends on individual memory. Developer A sees a test fail on Monday. Developer B sees the same test fail on Thursday. Neither connects the two events because there is no centralized tracking.
Third, it biases detection toward frequently flaky tests. Tests with a 1% failure rate might never be noticed by any individual developer, but across a team of fifty engineers running the pipeline hundreds of times per day, that 1% failure rate causes multiple wasted investigations per week.
Method 1: Rerun-Based Detection
The simplest detection method runs each test multiple times and checks for inconsistent results. If a test passes on some runs and fails on others with zero code changes, it is flaky by definition.
Implementing Rerun Detection in GitHub Actions
# .github/workflows/flaky-detection.yml
name: Flaky Test Detection
on:
schedule:
- cron: '0 3 1' # Every Monday at 3 AM
workflow_dispatch:
jobs:
detect-flaky:
runs-on: ubuntu-latest
strategy:
matrix:
run_number: [1, 2, 3, 4, 5]
steps:
- uses: actions/checkout@v4
- name: Install dependencies
run: npm ci
- name: Run tests (attempt ${{ matrix.run_number }})
run: npx jest --json --outputFile=results-${{ matrix.run_number }}.json
continue-on-error: true
- name: Upload results
uses: actions/upload-artifact@v4
with:
name: test-results-${{ matrix.run_number }}
path: results-${{ matrix.run_number }}.json
This workflow runs your entire test suite five times in parallel. By comparing the results across runs, you can identify tests that produced different outcomes.
Rerun Detection with pytest
pytest has a plugin specifically designed for this.
# Install pytest-repeat
pip install pytest-repeat
Run every test 10 times
pytest --count=10 -x --tb=short
Run every test 10 times but don't stop on first failure
pytest --count=10 --tb=short
The -x flag stops on the first failure, which is useful for interactive debugging. Without it, you get a complete picture of which tests are flaky across all ten runs.
Rerun Detection with Jest
// jest.config.js
module.exports = {
// Run each test file 5 times
// Note: This is not a built-in Jest feature.
// Use a script to run Jest multiple times instead.
};
#!/bin/bash
detect-flaky.sh - Run Jest multiple times and compare results
RUNS=5
FAILURES=""
for i in $(seq 1 $RUNS); do
echo "=== Run $i of $RUNS ==="
npx jest --json --outputFile="results-run-${i}.json" 2>/dev/null
if [ $? -ne 0 ]; then
# Extract failed test names
FAILED=$(node -e "
const r = require('./results-run-${i}.json');
r.testResults.forEach(suite => {
suite.testResults.filter(t => t.status === 'failed')
.forEach(t => console.log(t.fullName));
});
")
FAILURES="${FAILURES}\nRun ${i}: ${FAILED}"
fi
done
echo ""
echo "=== Flaky Test Analysis ==="
echo -e "$FAILURES" | sort | uniq -c | sort -rn
This script runs Jest five times and aggregates which tests failed in which runs. Tests that failed in some runs but not others are flaky.
Method 2: Historical Result Analysis
Rerun-based detection is thorough but expensive -- it multiplies your CI compute by the number of reruns. Historical analysis achieves the same goal by analyzing results you have already collected.
How It Works
Every CI pipeline run produces test results. By storing these results and analyzing them over time, you can identify tests whose pass/fail status varies without corresponding code changes.
The algorithm is straightforward:
- For each test, collect the last N results (e.g., last 50 runs).
- Group results by the commit hash that was being tested.
- If a test has both pass and fail results for the same commit, it is flaky.
flaky_runs / total_runs.Implementing Historical Analysis
# analyze_history.py
import json
import sys
from collections import defaultdict
def analyze_flakiness(result_files):
"""Analyze test results across multiple runs to detect flakiness."""
test_results = defaultdict(lambda: {"pass": 0, "fail": 0, "commits": set()})
for filepath in result_files:
with open(filepath) as f:
data = json.load(f)
commit = data.get("commit_sha", "unknown")
for suite in data.get("testResults", []):
for test in suite.get("testResults", []):
name = test["fullName"]
status = test["status"]
test_results[name]["commits"].add(commit)
if status == "passed":
test_results[name]["pass"] += 1
elif status == "failed":
test_results[name]["fail"] += 1
# Identify flaky tests
flaky_tests = []
for name, results in test_results.items():
total = results["pass"] + results["fail"]
if results["pass"] > 0 and results["fail"] > 0:
flake_rate = results["fail"] / total
flaky_tests.append({
"name": name,
"flake_rate": flake_rate,
"total_runs": total,
"failures": results["fail"],
"unique_commits": len(results["commits"]),
})
# Sort by flake rate descending
flaky_tests.sort(key=lambda x: x["flake_rate"], reverse=True)
return flaky_tests
if __name__ == "__main__":
flaky = analyze_flakiness(sys.argv[1:])
for test in flaky:
print(f" {test['flake_rate']:.1%} flaky | {test['name']}")
print(f" {test['failures']}/{test['total_runs']} runs failed")
Using DeFlaky for Historical Analysis
DeFlaky automates this entire process. Point it at your test results and it handles the rest.
# Analyze results from the current run
deflaky analyze --input test-results.xml --format junit
Push results to the DeFlaky dashboard for historical tracking
deflaky push --input test-results.xml --project my-app
View flakiness trends
deflaky dashboard --open
The DeFlaky Dashboard tracks every test across every run and computes FlakeScore -- a weighted reliability metric that accounts for failure frequency, recency, and impact. Tests that fail frequently and recently get higher FlakeScores, helping you prioritize fixes.
Method 3: Differential Detection
Differential detection identifies flaky tests by comparing test results between runs that have identical code. This method is particularly effective in CI environments where the same commit is tested multiple times (e.g., when a pipeline is rerun after a flaky failure).
Implementation in CI
# Add to your CI pipeline
- name: Check for flaky failures
if: failure()
run: |
# Compare current failures against known flaky tests
deflaky check \
--input test-results.xml \
--threshold 0.05 \
--exit-code
The deflaky check command compares the current test failures against historical data. If a failed test has previously passed and failed on the same codebase, DeFlaky flags it as likely flaky rather than a genuine regression. This helps your team distinguish between "this test is broken because of my code change" and "this test is flaky and my code change is fine."
Method 4: Parallel Execution Comparison
Running the same tests in parallel on different machines exposes environment-sensitive flakiness. If a test passes on Worker A but fails on Worker B with the same code and configuration, the test depends on something that varies between environments.
# GitHub Actions: Run tests on multiple runners simultaneously
jobs:
test-matrix:
strategy:
fail-fast: false
matrix:
runner: [ubuntu-latest, ubuntu-22.04]
shard: [1, 2, 3]
runs-on: ${{ matrix.runner }}
steps:
- uses: actions/checkout@v4
- run: npm ci
- name: Run shard ${{ matrix.shard }}
run: |
npx jest --shard=${{ matrix.shard }}/3 \
--json --outputFile=results-${{ matrix.runner }}-${{ matrix.shard }}.json
continue-on-error: true
- uses: actions/upload-artifact@v4
with:
name: results-${{ matrix.runner }}-${{ matrix.shard }}
path: results-*.json
Method 5: Smart Alerting on Failure Patterns
Instead of treating every test failure as equal, build alerting that recognizes flaky patterns.
# flaky_alert.py - Alert only on likely-real failures
def should_alert(test_name, current_status, history):
"""Determine if a test failure warrants an alert."""
if current_status == "passed":
return False
# Check historical flakiness
recent_results = history.get_recent(test_name, count=20)
if not recent_results:
return True # New test, always alert on failure
flake_rate = sum(1 for r in recent_results if r == "fail") / len(recent_results)
if flake_rate > 0.3:
# Known highly flaky test -- suppress alert, log for tracking
log_flaky_occurrence(test_name)
return False
elif flake_rate > 0.05:
# Moderately flaky -- alert but tag as possibly flaky
return True # Alert with "[Possibly Flaky]" prefix
else:
# Rarely or never flaky -- this is likely a real failure
return True
Setting Up Continuous Flaky Test Monitoring
The most effective approach combines multiple detection methods into a continuous monitoring pipeline.
Step 1: Instrument Your CI Pipeline
Add test result collection to every CI run, not just special detection runs.
# Add to every CI pipeline run
npx jest --json --outputFile=test-results.json
deflaky push --input test-results.json --project $PROJECT_NAME --commit $GITHUB_SHA
Step 2: Configure Alerting Thresholds
Set thresholds for when flaky tests require attention.
# deflaky.config.yml
thresholds:
flake_rate_warning: 0.05 # 5% failure rate triggers warning
flake_rate_critical: 0.15 # 15% failure rate triggers critical alert
new_flaky_test_alert: true # Alert when a previously stable test becomes flaky
resolution_sla_hours: 48 # SLA for fixing critical flaky tests
Step 3: Review the Dashboard Weekly
Schedule a weekly review of your DeFlaky Dashboard to catch trends before they become crises. Look for:
- Tests whose flake rate is increasing over time
- New tests that were added with high flake rates
- Tests that were fixed but are becoming flaky again
- Clusters of flaky tests that share a common root cause
Step 4: Integrate with Your Workflow
Connect flaky test detection to your team's existing workflow tools.
# Create a Jira ticket for each new flaky test
deflaky report --format jira --threshold 0.05
Post flaky test summary to Slack
deflaky report --format slack --webhook $SLACK_WEBHOOK_URL
Block PR merges if the PR introduces new flaky tests
deflaky check --input test-results.xml --baseline main --exit-code
Detection Metrics to Track
Once your detection system is running, track these metrics to measure its effectiveness.
Mean Time to Detection (MTTD): How long between when a test first becomes flaky and when your system flags it. Target: under 48 hours. Detection Coverage: What percentage of your test suite is monitored for flakiness. Target: 100%. False Positive Rate: How often your system flags a test as flaky when it is actually failing due to a real bug. Target: under 5%. Flaky Test Inventory Size: The total number of known flaky tests at any given time. This should trend downward as you fix them.Conclusion
Detecting flaky tests is the prerequisite to fixing them. Without systematic detection, flaky tests accumulate silently until they reach a critical mass that makes your CI pipeline unreliable.
Start with the method that fits your current infrastructure. If you already store test results, historical analysis gives you immediate value with no additional CI compute. If you are starting fresh, rerun-based detection is simple to implement and highly accurate.
For a complete solution that combines all these methods, try DeFlaky. The CLI integrates with your existing test runner in minutes, and the dashboard gives your team visibility into test reliability across every pipeline run.
The goal is not zero flaky tests overnight. The goal is continuous visibility -- knowing exactly which tests are flaky, how flaky they are, and whether the trend is improving or worsening. With that visibility, your team can make informed decisions about where to invest their fix efforts for maximum impact.