10 Battle-Tested Strategies to Eliminate Flaky Tests from Your Codebase
Flaky tests are the silent productivity killer in software engineering. They are tests that pass sometimes and fail sometimes, without any changes to the code under test. Every engineering team encounters them, but few address them systematically. Instead, teams develop coping mechanisms: re-running pipelines, adding retries, or worse, ignoring test failures entirely.
This guide presents 10 battle-tested strategies for eliminating flaky tests. These are not theoretical suggestions -- they are techniques drawn from production codebases at organizations ranging from startups to Fortune 500 companies. Each strategy includes concrete implementation examples, common pitfalls, and guidance on when to apply it.
The strategies are ordered roughly by impact and ease of implementation. Start with Strategy 1 and work your way through as needed.
Strategy 1: Enforce Strict Test Isolation
Test isolation is the foundation of test reliability. If your tests share state, you will have flaky tests. Period. No amount of retries, clever waits, or debugging will fix a fundamental isolation problem.
What Test Isolation Means
A test is properly isolated when:
- It can run in any order and produce the same result
- It can run concurrently with any other test
- It can run independently without any other test running before or after it
- It leaves no side effects that affect other tests
How to Enforce Isolation
Database isolation:
// BAD: Tests share the same database state
describe('User management', () => {
test('creates a user', async () => {
await db.users.create({ name: 'Alice', email: 'alice@test.com' });
const user = await db.users.findByEmail('alice@test.com');
expect(user.name).toBe('Alice');
});
test('lists all users', async () => {
const users = await db.users.findAll();
expect(users).toHaveLength(0); // FAILS: Alice exists from previous test
});
});
// GOOD: Each test gets a clean state
describe('User management', () => {
beforeEach(async () => {
await db.users.deleteAll();
});
test('creates a user', async () => {
await db.users.create({ name: 'Alice', email: 'alice@test.com' });
const user = await db.users.findByEmail('alice@test.com');
expect(user.name).toBe('Alice');
});
test('lists all users', async () => {
const users = await db.users.findAll();
expect(users).toHaveLength(0); // Passes: database was cleaned
});
});
// BEST: Use transactions for automatic rollback
describe('User management', () => {
let transaction;
beforeEach(async () => {
transaction = await db.beginTransaction();
});
afterEach(async () => {
await transaction.rollback();
});
test('creates a user', async () => {
await db.users.create({ name: 'Alice', email: 'alice@test.com' }, { transaction });
const user = await db.users.findByEmail('alice@test.com', { transaction });
expect(user.name).toBe('Alice');
// Transaction automatically rolls back after test
});
});
File system isolation:
# BAD: Tests use shared file paths
def test_write_config():
write_config("/tmp/config.json", {"key": "value"})
config = read_config("/tmp/config.json")
assert config["key"] == "value"
def test_default_config():
config = read_config("/tmp/config.json")
assert config == {} # FAILS: previous test wrote to this file
# GOOD: Use unique paths per test
import tempfile
import os
def test_write_config():
with tempfile.NamedTemporaryFile(suffix='.json', delete=False) as f:
path = f.name
try:
write_config(path, {"key": "value"})
config = read_config(path)
assert config["key"] == "value"
finally:
os.unlink(path)
# ALSO GOOD: Use pytest's tmp_path fixture
def test_write_config(tmp_path):
config_file = tmp_path / "config.json"
write_config(str(config_file), {"key": "value"})
config = read_config(str(config_file))
assert config["key"] == "value"
In-memory state isolation:
// BAD: Singleton retains state between tests
// cache.js
const cache = new Map();
export function setCache(key, value) { cache.set(key, value); }
export function getCache(key) { return cache.get(key); }
// cache.test.js
test('caches a value', () => {
setCache('user', { name: 'Alice' });
expect(getCache('user')).toEqual({ name: 'Alice' });
});
test('returns undefined for missing key', () => {
expect(getCache('user')).toBeUndefined(); // FAILS: 'user' was set in previous test
});
// GOOD: Reset the module between tests
beforeEach(() => {
jest.resetModules();
});
// OR: Export a clear function
export function clearCache() { cache.clear(); }
beforeEach(() => {
clearCache();
});
Detecting Isolation Violations
Run your tests in random order to surface isolation issues:
# Jest
npx jest --randomize
# pytest (requires the pytest-randomly plugin)
pytest -p randomly
# RSpec
rspec --order random
If tests fail in random order but pass in the default order, you have an isolation problem. Tools like DeFlaky can automatically detect ordering dependencies by running your test suite with different orderings and correlating failures with test sequences.
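If your runner does not support random ordering natively, a custom sequencer can do the same job. Below is a minimal sketch for Jest, assuming the seedrandom package and an illustrative TEST_ORDER_SEED environment variable; it shuffles test files and logs the seed so a failing order can be reproduced:
// jest.sequencer.js -- seeded random test ordering (sketch)
const Sequencer = require('@jest/test-sequencer').default;
const seedrandom = require('seedrandom');

class SeededRandomSequencer extends Sequencer {
  sort(tests) {
    // Log the seed so a failing order can be replayed locally
    const seed = process.env.TEST_ORDER_SEED || String(Date.now());
    console.log(`Test order seed: ${seed}`);
    const rng = seedrandom(seed);
    // Fisher-Yates shuffle over the list of test files
    const shuffled = [...tests];
    for (let i = shuffled.length - 1; i > 0; i--) {
      const j = Math.floor(rng() * (i + 1));
      [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
    }
    return shuffled;
  }
}
module.exports = SeededRandomSequencer;

// jest.config.js: module.exports = { testSequencer: './jest.sequencer.js' };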
Strategy 2: Use Deterministic Test Data
Non-deterministic test data is one of the most common causes of flaky tests. This includes random values, auto-incrementing IDs, timestamps, and data from shared environments.
The Problem with Non-Deterministic Data
// FLAKY: Test depends on auto-increment ID
test('fetches the first user', async () => {
const user = await createUser({ name: 'Test User' });
// Assumes user.id will be 1, but it depends on database state
expect(user.id).toBe(1);
});
// FLAKY: Test depends on current time
test('displays relative time', () => {
const post = { createdAt: new Date('2026-04-06T10:00:00Z') };
render(<RelativeTime date={post.createdAt} />); // <RelativeTime> is a hypothetical component
// "1 day ago" is only correct on April 7th
expect(screen.getByText('1 day ago')).toBeInTheDocument();
});
// FLAKY: Test uses Math.random()
test('generates a valid token', () => {
const token = generateToken();
expect(token).toMatch(/^[a-f0-9]{32}$/);
// What if generateToken() produces a value that doesn't match?
// The assertion might be wrong, not the code
});
Making Data Deterministic
Control time:
// Use fake timers to control the clock
test('displays relative time', () => {
jest.useFakeTimers();
jest.setSystemTime(new Date('2026-04-07T10:00:00Z'));
const post = { createdAt: new Date('2026-04-06T10:00:00Z') };
render(<RelativeTime date={post.createdAt} />);
expect(screen.getByText('1 day ago')).toBeInTheDocument();
jest.useRealTimers();
});
Control randomness:
// Seed your random number generator
test('generates consistent shuffled array', () => {
const rng = seedrandom('test-seed-42');
const shuffled = shuffleArray([1, 2, 3, 4, 5], rng);
expect(shuffled).toEqual([3, 1, 5, 2, 4]); // Deterministic!
});
// Or mock the random function
test('generates a predictable token', () => {
jest.spyOn(Math, 'random').mockReturnValue(0.5);
const token = generateToken();
expect(token).toBe('expected-token-value');
Math.random.mockRestore();
});
Use factory functions for test data:
// test-factories.js
let idCounter = 0;
export function resetFactories() {
idCounter = 0;
}
export function createTestUser(overrides = {}) {
idCounter++;
return {
id: `test-user-${idCounter}`,
name: `Test User ${idCounter}`,
email: `user${idCounter}@test.example.com`,
createdAt: new Date('2026-01-01T00:00:00Z'),
role: 'user',
...overrides,
};
}
export function createTestOrder(overrides = {}) {
idCounter++;
return {
id: `test-order-${idCounter}`,
userId: 'test-user-1',
items: [
{ productId: 'prod-1', quantity: 1, price: 29.99 },
],
total: 29.99,
status: 'pending',
createdAt: new Date('2026-01-01T00:00:00Z'),
...overrides,
};
}
// Usage in tests
beforeEach(() => {
resetFactories();
});
test('calculates order total', () => {
const order = createTestOrder({
items: [
{ productId: 'prod-1', quantity: 2, price: 10.00 },
{ productId: 'prod-2', quantity: 1, price: 25.00 },
],
});
expect(calculateTotal(order)).toBe(45.00);
});
Never use production data in tests:
// BAD: Fetches data from a shared staging environment
test('displays product list', async () => {
const products = await fetch('https://staging-api.example.com/products');
// Product data changes constantly, test is flaky by design
});
// GOOD: Use fixed test data
test('displays product list', async () => {
const products = [
{ id: 1, name: 'Widget', price: 9.99 },
{ id: 2, name: 'Gadget', price: 19.99 },
];
render(<ProductList products={products} />); // hypothetical component
expect(screen.getByText('Widget')).toBeInTheDocument();
expect(screen.getByText('Gadget')).toBeInTheDocument();
});
Strategy 3: Implement Smart Wait Strategies
Timing issues are the most visible form of flakiness. Tests that fail because they check for a result before the system has finished processing are extremely common, especially in E2E and integration tests.
The Wait Strategy Hierarchy
From best to worst:
Event-Based Waits
// BEST: Wait for a specific event
test('processes payment', async () => {
const paymentPromise = waitForEvent(paymentProcessor, 'complete');
paymentProcessor.charge({ amount: 99.99, card: testCard });
const result = await paymentPromise;
expect(result.status).toBe('success');
});
// For DOM: Wait for specific element state
test('submits form', async () => {
const user = userEvent.setup();
render(<ContactForm />); // hypothetical component
await user.type(screen.getByLabelText('Email'), 'test@example.com');
await user.click(screen.getByRole('button', { name: 'Submit' }));
// Wait for the success message to appear
expect(await screen.findByText('Message sent!')).toBeInTheDocument();
});
Polling Waits with Timeout
// Utility: Wait for condition with timeout
async function waitFor(conditionFn, { timeout = 5000, interval = 100 } = {}) {
const startTime = Date.now();
while (Date.now() - startTime < timeout) {
try {
const result = await conditionFn();
if (result) return result;
} catch (e) {
// Condition not met yet, continue polling
}
await new Promise(resolve => setTimeout(resolve, interval));
}
throw new Error(`Condition not met within ${timeout}ms`);
}
// Usage
test('processes background job', async () => {
const jobId = await submitJob({ type: 'report', params: { userId: 1 } });
const result = await waitFor(async () => {
const job = await getJob(jobId);
return job.status === 'completed' ? job : null;
}, { timeout: 10000, interval: 500 });
expect(result.output).toBeDefined();
});
Waiting for Network Requests
// Cypress: Wait for specific API calls
cy.intercept('POST', '/api/orders').as('createOrder');
cy.get('.submit-order').click();
cy.wait('@createOrder').its('response.statusCode').should('eq', 201);
// Playwright: Wait for response
const responsePromise = page.waitForResponse('**/api/orders');
await page.click('.submit-order');
const response = await responsePromise;
expect(response.status()).toBe(201);
// MSW (Mock Service Worker): Wait for request handler
test('creates order', async () => {
let capturedRequest;
server.use(
rest.post('/api/orders', async (req, res, ctx) => {
capturedRequest = await req.json();
return res(ctx.json({ id: 'order-123', status: 'created' }));
})
);
render(<CheckoutForm />); // hypothetical component
await userEvent.click(screen.getByRole('button', { name: 'Place Order' }));
await waitFor(() => {
expect(capturedRequest).toBeDefined();
expect(capturedRequest.items).toHaveLength(2);
});
});
What to Avoid
// NEVER: Fixed sleep in tests
await new Promise(resolve => setTimeout(resolve, 3000));
// NEVER: Arbitrary retry loops without timeout
while (!element.isVisible()) {
await sleep(100); // Could run forever
}
// NEVER: Waiting longer than necessary "just to be safe"
await sleep(10000); // 10 seconds "because sometimes it's slow"
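Each of these anti-patterns can usually be rewritten as a bounded condition wait. For example, using the waitFor utility defined above (element.isVisible() stands in for whatever condition the sleep was papering over):
// BEFORE: hope 3 seconds is enough
// await new Promise(resolve => setTimeout(resolve, 3000));

// AFTER: poll the actual condition and fail fast with a clear error
await waitFor(() => element.isVisible(), { timeout: 5000, interval: 100 });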
Strategy 4: Implement Retry Patterns Correctly
Retries are not a fix for flaky tests -- they are a mitigation strategy that buys you time while you address root causes. But when used correctly, they prevent flaky tests from blocking your pipeline.
Test-Level vs. Suite-Level Retries
Test-level retries rerun only the failed test:
// Jest: use jest.retryTimes() (built into the default jest-circus runner)
// e.g. in a setup file or at the top of a test file
jest.retryTimes(2);
// Cypress: Built-in test retries
// cypress.config.js
const { defineConfig } = require('cypress');
module.exports = defineConfig({
retries: {
runMode: 2, // Retry up to 2 times in CI
openMode: 0, // No retries locally
},
});
// Playwright: Built-in test retries
// playwright.config.ts
import { defineConfig } from '@playwright/test';
export default defineConfig({
retries: process.env.CI ? 2 : 0,
});
# pytest: Using pytest-rerunfailures
# pytest.ini (the rerun options are CLI flags, so pass them via addopts)
[pytest]
addopts = --reruns 2 --reruns-delay 1
Suite-level retries rerun the entire test file. Use these sparingly:
# GitHub Actions: Retry the entire job
jobs:
  test:
    strategy:
      matrix:
        attempt: [1, 2]
    steps:
      - run: npm test
        continue-on-error: ${{ matrix.attempt == 1 }}
Smart Retry with Exponential Backoff
For integration tests that hit real services, use exponential backoff:
async function retryWithBackoff(fn, { maxRetries = 3, baseDelay = 1000 } = {}) {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (error) {
if (attempt === maxRetries) throw error;
const delay = baseDelay * Math.pow(2, attempt - 1);
console.log(`Attempt ${attempt} failed, retrying in ${delay}ms...`);
await new Promise(resolve => setTimeout(resolve, delay));
}
}
}
test('fetches data from external API', async () => {
const data = await retryWithBackoff(async () => {
const response = await fetch('https://api.example.com/data');
if (!response.ok) throw new Error(`HTTP ${response.status}`);
return response.json();
});
expect(data.items).toBeDefined();
});
Tracking Retries
Every retry should be tracked. A test that consistently needs retries is still flaky and should be fixed:
// Wrapper that tracks retry usage
function trackRetries(testFn, testName) {
return async (...args) => {
const startTime = Date.now();
let attempts = 0;
let lastError;
for (let i = 0; i < 3; i++) {
attempts++;
try {
const result = await testFn(...args);
if (attempts > 1) {
// Log that this test needed a retry
console.warn(`[FLAKY] ${testName} passed on attempt ${attempts}`);
// Report to your metrics system
reportFlakyTest(testName, attempts, Date.now() - startTime);
}
return result;
} catch (error) {
lastError = error;
}
}
throw lastError;
};
}
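The reportFlakyTest call above is deliberately left abstract. A minimal sketch, assuming a generic HTTP metrics endpoint (the METRICS_URL variable and payload shape are illustrative):
// Sketch: push retry data to whatever metrics backend you use
async function reportFlakyTest(testName, attempts, durationMs) {
  const payload = {
    test: testName,
    attempts,
    durationMs,
    branch: process.env.GIT_BRANCH || 'unknown',
    timestamp: new Date().toISOString(),
  };
  try {
    // fetch is global in Node 18+; swap in your HTTP client otherwise
    await fetch(process.env.METRICS_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload),
    });
  } catch (err) {
    // Never let metrics reporting fail the test run itself
    console.warn(`Failed to report flaky test: ${err.message}`);
  }
}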
DeFlaky automatically tracks retry patterns across your test suite, identifying tests that rely on retries to pass and calculating the true failure rate without retry masking.
Strategy 5: Achieve Environment Parity
A significant percentage of flaky tests are caused by differences between development, CI, and production environments. When a test passes locally but fails in CI (or vice versa), the environment is usually the culprit.
Common Environment Differences
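Time zone, locale, installed tool versions, CPU and memory, and the CI flag itself are the usual suspects (each is addressed below). One way to make these differences visible is to print an environment fingerprint at the start of every suite run, so a local-vs-CI comparison is one log line away. A minimal sketch (the helper name and the chosen fields are illustrative):
// env-fingerprint.js -- log the variables that most often differ between machines
const os = require('os');

function logEnvironmentFingerprint() {
  console.log(JSON.stringify({
    node: process.version,
    platform: `${os.platform()} ${os.release()}`,
    cpus: os.cpus().length,
    memoryGb: Math.round(os.totalmem() / 1024 ** 3),
    timezone: process.env.TZ || Intl.DateTimeFormat().resolvedOptions().timeZone,
    locale: process.env.LANG || 'unset',
    ci: Boolean(process.env.CI),
  }, null, 2));
}

module.exports = { logEnvironmentFingerprint };
// e.g. call it once from jest.setup.js so it appears in every CI log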
Achieving Parity with Containers
Docker containers are the most reliable way to achieve environment parity:
# Dockerfile.test
FROM node:20-slim
# Install exact browser versions
RUN npx playwright install --with-deps chromium
# Install exact tool versions
RUN apt-get update && apt-get install -y \
    postgresql-client=15.* \
    redis-tools=7:7.* \
    && rm -rf /var/lib/apt/lists/*
# Set consistent locale and timezone
ENV LANG=en_US.UTF-8
ENV TZ=UTC
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
CMD ["npm", "test"]
# docker-compose.test.yml
services:
  test:
    build:
      context: .
      dockerfile: Dockerfile.test
    environment:
      - NODE_ENV=test
      - DATABASE_URL=postgres://test:test@db:5432/testdb
      - REDIS_URL=redis://redis:6379
      - TZ=UTC
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_started
  db:
    image: postgres:15
    environment:
      POSTGRES_USER: test
      POSTGRES_PASSWORD: test
      POSTGRES_DB: testdb
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U test"]
      interval: 5s
      timeout: 5s
      retries: 5
  redis:
    image: redis:7-alpine
# Run tests in a consistent environment
docker compose -f docker-compose.test.yml run --rm test
Handling Resource Constraints
CI runners typically have fewer resources than developer machines. Account for this:
// Increase timeouts in CI
const timeout = process.env.CI ? 30000 : 10000;
// Reduce parallelism in CI
const maxWorkers = process.env.CI ? 2 : os.cpus().length;
// jest.config.js
module.exports = {
testTimeout: process.env.CI ? 30000 : 10000,
maxWorkers: process.env.CI ? '50%' : '75%',
};
Handling Time Zone Issues
// Force UTC in all test environments
// jest.setup.js or test bootstrap
process.env.TZ = 'UTC';
// For browser tests
test('displays correct date', () => {
jest.useFakeTimers();
jest.setSystemTime(new Date('2026-04-07T12:00:00Z'));
render(<CurrentDate />); // hypothetical component
// This will now be consistent regardless of the developer's time zone
expect(screen.getByText('April 7, 2026')).toBeInTheDocument();
jest.useRealTimers();
});
Handling File System Differences
// BAD: Hardcoded path separators
const configPath = 'config/settings.json';
// GOOD: Use path.join for cross-platform compatibility
const configPath = path.join('config', 'settings.json');
// BAD: Assuming case sensitivity
const fileExists = fs.existsSync('README.md');
// On macOS (case-insensitive), 'readme.md' would also match
// GOOD: Normalize file names in tests
const normalizedName = fileName.toLowerCase();
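For temporary files, the Node equivalent of pytest's tmp_path is to create a unique directory under the OS temp root rather than hardcoding /tmp. A short sketch:
const fs = require('fs');
const os = require('os');
const path = require('path');

test('writes config to an isolated temp directory', () => {
  // mkdtemp appends a random suffix, so parallel tests never collide
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'config-test-'));
  const configPath = path.join(dir, 'settings.json');
  fs.writeFileSync(configPath, JSON.stringify({ key: 'value' }));
  expect(JSON.parse(fs.readFileSync(configPath, 'utf8'))).toEqual({ key: 'value' });
  // Clean up so the test leaves no side effects
  fs.rmSync(dir, { recursive: true, force: true });
});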
Strategy 6: Use Contract Testing to Replace Brittle Integration Tests
Integration tests that hit real services are inherently flaky because they depend on the availability, performance, and behavior of those services. Contract testing replaces these brittle tests with reliable, fast alternatives.
What Is Contract Testing?
Contract testing verifies that two services (a consumer and a provider) agree on the interface between them, without requiring both services to be running simultaneously.
Implementing Contract Tests with Pact
// Consumer test (runs without the real provider)
const { PactV3, MatchersV3 } = require('@pact-foundation/pact');
const { like } = MatchersV3;
describe('User API Consumer', () => {
const provider = new PactV3({
consumer: 'WebApp',
provider: 'UserService',
});
test('fetches a user by ID', async () => {
// Define the expected interaction
provider
.given('a user with ID 1 exists')
.uponReceiving('a request for user 1')
.withRequest({
method: 'GET',
path: '/api/users/1',
})
.willRespondWith({
status: 200,
headers: { 'Content-Type': 'application/json' },
body: {
id: 1,
name: like('John Doe'),
email: like('john@example.com'),
},
});
await provider.executeTest(async (mockServer) => {
// Point your API client at the mock server
const client = new UserClient(mockServer.url);
const user = await client.getUser(1);
expect(user.id).toBe(1);
expect(user.name).toBeDefined();
expect(user.email).toContain('@');
});
});
});
// Provider verification test (runs on the provider side)
const { Verifier } = require('@pact-foundation/pact');
describe('User Service Provider', () => {
test('verifies the contract', async () => {
const verifier = new Verifier({
providerBaseUrl: 'http://localhost:3000',
pactUrls: ['./pacts/webapp-userservice.json'],
stateHandlers: {
'a user with ID 1 exists': async () => {
await db.users.create({ id: 1, name: 'John Doe', email: 'john@example.com' });
},
},
});
await verifier.verifyProvider();
});
});
When to Use Contract Tests vs. Integration Tests
| Scenario | Use Contract Tests | Use Integration Tests |
|----------|-------------------|----------------------|
| API request/response format | Yes | No |
| Business logic across services | No | Yes |
| Error handling between services | Yes | Supplementary |
| Performance under load | No | Yes |
| Data consistency | No | Yes |
Contract tests eliminate an entire category of flakiness (external service availability) while still verifying that services can communicate correctly.
Strategy 7: Implement Visual Regression Testing Correctly
Visual regression tests compare screenshots to detect unintended UI changes. They are valuable but notoriously flaky when implemented poorly.
Sources of Visual Test Flakiness
Making Visual Tests Reliable
// Playwright example with visual comparison
test('homepage renders correctly', async ({ page }) => {
await page.goto('/');
// Wait for all content to load
await page.waitForLoadState('networkidle');
// Disable animations
await page.addStyleTag({
content: `
*, *::before, *::after {
animation-duration: 0s !important;
animation-delay: 0s !important;
transition-duration: 0s !important;
transition-delay: 0s !important;
}
`,
});
// Hide dynamic content
await page.evaluate(() => {
// Hide timestamps
document.querySelectorAll('[data-testid="timestamp"]').forEach(el => {
el.textContent = '2026-01-01';
});
// Hide ads
document.querySelectorAll('.ad-container').forEach(el => {
el.style.display = 'none';
});
});
// Use a tolerance threshold
await expect(page).toHaveScreenshot('homepage.png', {
maxDiffPixels: 100, // Allow up to 100 different pixels
maxDiffPixelRatio: 0.01, // Or 1% of total pixels
threshold: 0.2, // Per-pixel color difference threshold
});
});
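Playwright can also mask dynamic regions at comparison time instead of rewriting the DOM, which leaves the page under test untouched. A short sketch (the selectors are illustrative):
await expect(page).toHaveScreenshot('homepage.png', {
  // Masked regions are painted over before the pixel comparison
  mask: [
    page.locator('[data-testid="timestamp"]'),
    page.locator('.ad-container'),
  ],
  maxDiffPixelRatio: 0.01,
});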
Platform-Consistent Screenshots
Always capture screenshots in a consistent environment:
# Use Docker for consistent rendering
visual-tests:
  image: mcr.microsoft.com/playwright:v1.42.0-focal
  script:
    - npx playwright test --project=visual
  artifacts:
    paths:
      - test-results/
    when: always
Component-Level Visual Tests
Test individual components rather than full pages to reduce flakiness surface area:
// Storybook + Chromatic for component-level visual testing
// More reliable than full-page screenshots
test('Button renders correctly in all states', async ({ page }) => {
await page.goto('/storybook/iframe.html?id=button--default');
await expect(page.locator('.storybook-button')).toHaveScreenshot('button-default.png');
await page.goto('/storybook/iframe.html?id=button--disabled');
await expect(page.locator('.storybook-button')).toHaveScreenshot('button-disabled.png');
await page.goto('/storybook/iframe.html?id=button--loading');
// Wait for loading animation to reach a deterministic state
await page.evaluate(() => {
document.querySelectorAll('.spinner').forEach(el => {
el.style.animationPlayState = 'paused';
});
});
await expect(page.locator('.storybook-button')).toHaveScreenshot('button-loading.png');
});
Strategy 8: Build a Test Monitoring and Alerting System
Prevention is better than cure. A monitoring system catches flaky tests early, before they erode team trust.
What to Monitor
Building a Simple Monitoring Pipeline
// test-monitor.js
// Run after each CI build to collect and analyze test data
const fs = require('fs');
const { parseJunitXml } = require('./junit-parser');
async function monitorTestResults(junitPath) {
const results = await parseJunitXml(junitPath);
const alerts = [];
for (const test of results.tests) {
// Check for new failures
const history = await getTestHistory(test.name, { days: 7 });
if (test.status === 'fail' && history.recentPassRate > 0.95) {
alerts.push({
type: 'new_failure',
test: test.name,
message: `${test.name} failed but has 95%+ pass rate -- possible flake`,
});
}
// Check for duration anomalies
if (history.avgDuration > 0 && test.duration > history.avgDuration * 3) {
alerts.push({
type: 'slow_test',
test: test.name,
message: `${test.name} took ${test.duration}s (avg: ${history.avgDuration}s)`,
});
}
// Check for emerging flakiness
if (history.passRate < 0.95 && history.passRate > 0.05) {
const isNew = history.firstFailure > Date.now() - 7 * 24 * 60 * 60 * 1000;
if (isNew) {
alerts.push({
type: 'emerging_flake',
test: test.name,
message: `${test.name} has become flaky (${(history.passRate * 100).toFixed(0)}% pass rate)`,
});
}
}
}
return alerts;
}
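The getTestHistory helper above is where the real work happens. A minimal sketch backed by a local JSON file (the file name and record shape are illustrative; a real setup would use a database or a service like DeFlaky):
// test-history.js -- naive history store keyed by test name
const fs = require('fs');
const HISTORY_FILE = 'test-history.json';

async function getTestHistory(testName, { days = 7 } = {}) {
  const all = fs.existsSync(HISTORY_FILE)
    ? JSON.parse(fs.readFileSync(HISTORY_FILE, 'utf8'))
    : {};
  const cutoff = Date.now() - days * 24 * 60 * 60 * 1000;
  const runs = (all[testName] || []).filter(r => r.timestamp >= cutoff);
  const passes = runs.filter(r => r.status === 'pass');
  const failures = runs.filter(r => r.status === 'fail');
  return {
    passRate: runs.length ? passes.length / runs.length : 1,
    // A real implementation might compute recentPassRate over a shorter window
    recentPassRate: runs.length ? passes.length / runs.length : 1,
    avgDuration: runs.length
      ? runs.reduce((sum, r) => sum + r.duration, 0) / runs.length
      : 0,
    firstFailure: failures.length ? failures[0].timestamp : 0,
  };
}

module.exports = { getTestHistory };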
Integrating with DeFlaky
DeFlaky provides built-in monitoring and alerting that handles all of the above automatically:
# Configure DeFlaky monitoring
deflaky monitor configure \
--watch "test-results/*.xml" \
--alert-channel slack \
--alert-webhook "$SLACK_WEBHOOK" \
--threshold-new-flake 0.05 \
--threshold-duration-increase 3x \
--digest-schedule "daily@9am"
DeFlaky's monitoring goes beyond simple pass/fail tracking. It performs statistical analysis to distinguish between genuine flakiness and one-time failures, reducing alert noise while catching real problems early. When a test first shows signs of flakiness (even before it would be obvious to a developer), DeFlaky alerts the team and provides diagnostic information about the likely root cause.
Strategy 9: Apply the Test Pyramid with Reliability in Mind
The test pyramid (many unit tests, fewer integration tests, fewest E2E tests) is well-known, but few teams consider reliability when designing their pyramid.
Reliability by Test Type
| Test Type | Typical Flake Rate | Root Causes |
|-----------|-------------------|-------------|
| Unit | < 0.1% | Timer mocks, module state, randomness |
| Integration | 1-5% | Database state, service availability, timing |
| E2E | 5-15% | Browser rendering, network timing, animations |
Designing for Reliability
Push tests down the pyramid whenever possible. A flaky E2E test can often be replaced by a reliable unit test plus a contract test:
// INSTEAD OF: Flaky E2E test for form validation
// e2e/checkout.test.js
test('validates credit card number', async ({ page }) => {
await page.goto('/checkout');
await page.fill('#card-number', '1234');
await page.click('#submit');
await expect(page.locator('.error')).toHaveText('Invalid card number');
});
// USE: Reliable unit test for validation logic
// unit/validators.test.js
test('rejects invalid card numbers', () => {
expect(validateCardNumber('1234')).toEqual({
valid: false,
error: 'Invalid card number',
});
expect(validateCardNumber('4111111111111111')).toEqual({ valid: true });
});
// PLUS: Component test for error display
// component/CardInput.test.js
test('displays validation error', async () => {
render(<CardInput />);
await userEvent.type(screen.getByLabelText('Card Number'), '1234');
await userEvent.tab(); // Trigger blur validation
expect(await screen.findByText('Invalid card number')).toBeInTheDocument();
});
// PLUS: Contract test for payment API
// contract/payment.test.js
test('payment API accepts valid card', async () => {
provider
.uponReceiving('a payment with a valid card')
.withRequest({ method: 'POST', path: '/api/payment', body: { cardNumber: '4111111111111111' } })
.willRespondWith({ status: 200 });
// ...
});
This approach gives you better coverage with far less flakiness.
The Reliability-Adjusted Pyramid
Consider adjusting your test distribution based on reliability:
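One way to keep that distribution honest is to measure it in CI and warn when the E2E share creeps up. A minimal sketch with illustrative directory names and thresholds:
// check-test-distribution.js -- warn when the pyramid starts to invert
const { globSync } = require('glob');

const counts = {
  unit: globSync('test/unit/**/*.test.js').length,
  integration: globSync('test/integration/**/*.test.js').length,
  e2e: globSync('test/e2e/**/*.test.js').length,
};
const total = counts.unit + counts.integration + counts.e2e;
const e2eShare = total ? counts.e2e / total : 0;

console.log(counts);
if (e2eShare > 0.15) {
  // Threshold is illustrative -- tune it to your own flake-rate data
  console.warn(`E2E tests are ${(e2eShare * 100).toFixed(0)}% of the suite; consider pushing some down the pyramid`);
  process.exitCode = 1;
}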
Strategy 10: Establish a Flaky Test Culture and Process
Technical strategies are necessary but not sufficient. Without the right culture and process, flaky tests will keep accumulating no matter how many you fix.
The Flaky Test Policy
Every team should have a written policy for handling flaky tests. Here is a template:
## Flaky Test Policy

### Definition
A test is "flaky" if it produces different results (pass/fail) across multiple
runs without any code changes.

### Detection
- DeFlaky runs nightly and flags new flaky tests
- Any developer who encounters a flaky test reports it by adding the `flaky` label
- CI pipeline tracks retry rates and alerts on increases

### Response SLAs
- Critical (blocks deployment): Fix within 24 hours or quarantine
- High (fails >10% of the time): Fix within 3 days
- Medium (fails 2-10% of the time): Fix within 1 sprint
- Low (fails <2% of the time): Fix within 2 sprints

### Quarantine Rules
- Quarantined tests must have a tracking issue with an assignee
- Maximum quarantine duration: 2 weeks
- If not fixed within 2 weeks, escalate to team lead
- Quarantined tests still run nightly (non-blocking) to check if fixed

### Prevention
- All new tests must pass 10 consecutive runs before merging (see the sketch below this policy)
- Code review includes test reliability review
- No `sleep()` or fixed waits in test code without justification
- Test isolation is mandatory (enforced by random ordering in CI)

### Metrics
- Dashboard: [link to DeFlaky dashboard]
- Weekly report: Sent to #qa-channel every Monday
- Monthly review: Test reliability is reviewed in engineering all-hands
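The "10 consecutive runs before merging" rule is easy to automate in CI. A minimal sketch, assuming test files match *.test.js/ts and the PR is compared against an origin/main base branch (both illustrative):
// check-new-tests.js -- rerun any test file added in this PR
const { execSync } = require('child_process');

const newTestFiles = execSync('git diff --name-only --diff-filter=A origin/main...HEAD', {
  encoding: 'utf8',
})
  .split('\n')
  .filter(f => /\.test\.(js|ts)x?$/.test(f));

for (const file of newTestFiles) {
  for (let run = 1; run <= 10; run++) {
    console.log(`Run ${run}/10: ${file}`);
    // Throws (and fails the CI job) on the first non-zero exit code
    execSync(`npx jest ${file}`, { stdio: 'inherit' });
  }
}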
The Flaky Test Retrospective
Hold monthly retrospectives focused on test reliability:
Agenda:
- Review metrics (flake rate trend, MTTR, pipeline reliability)
- Discuss root causes of fixed flaky tests (what patterns emerge?)
- Identify systemic issues (are the same types of flakiness recurring?)
- Update testing guidelines based on lessons learned
- Celebrate improvements (show the before/after metrics)
Code Review for Test Reliability
Train your team to review test code for reliability red flags:
Red Flags Checklist:
- [ ] `sleep()` or fixed waits without justification
- [ ] Missing `await` on async operations
- [ ] Tests that depend on specific ordering
- [ ] Shared mutable state between tests
- [ ] Hardcoded IDs, timestamps, or ports
- [ ] Missing cleanup in `afterEach`/`afterAll`
- [ ] Direct database manipulation without transaction rollback
- [ ] Missing mock cleanup (mock.restore, mock.reset)
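Part of this checklist can be automated as a pre-commit or CI check. A minimal sketch that greps staged test files for fixed waits (the patterns and file globs are illustrative; a custom lint rule would be the more robust option):
// check-red-flags.js -- crude guard against fixed waits in test code
const fs = require('fs');
const { execSync } = require('child_process');

const patterns = [
  /setTimeout\(\s*resolve\s*,\s*\d{3,}\s*\)/, // new Promise(resolve => setTimeout(resolve, 3000))
  /\bsleep\(\s*\d+\s*\)/,                     // sleep(3000)
  /cy\.wait\(\s*\d+\s*\)/,                    // cy.wait(5000) -- waiting on time, not on an alias
];

const stagedTestFiles = execSync('git diff --cached --name-only', { encoding: 'utf8' })
  .split('\n')
  .filter(f => /\.test\.(js|ts)x?$/.test(f) && fs.existsSync(f));

let failed = false;
for (const file of stagedTestFiles) {
  const source = fs.readFileSync(file, 'utf8');
  for (const pattern of patterns) {
    if (pattern.test(source)) {
      console.error(`${file}: contains a fixed wait (${pattern}) -- justify it or use a condition-based wait`);
      failed = true;
    }
  }
}
if (failed) process.exit(1);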
Ownership and Accountability
Assign flaky test ownership:
Track ownership in your flaky test tracking system:
// flaky-test-registry.json
{
"flaky_tests": [
{
"test": "src/__tests__/checkout.test.js::processes payment correctly",
"status": "quarantined",
"flake_score": 73.2,
"owner": "jane@company.com",
"detected": "2026-03-15",
"deadline": "2026-04-01",
"root_cause": "timing",
"notes": "cy.intercept race condition with payment gateway callback"
}
]
}
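The registry above can also enforce the quarantine SLA automatically. A minimal sketch that fails CI when a quarantined test is past its deadline (the registry path matches the example above; failing the build is an illustrative choice):
// check-quarantine-deadlines.js
const fs = require('fs');

const registry = JSON.parse(fs.readFileSync('flaky-test-registry.json', 'utf8'));
const today = new Date();

const overdue = registry.flaky_tests.filter(
  t => t.status === 'quarantined' && new Date(t.deadline) < today
);

for (const t of overdue) {
  console.error(`Quarantine deadline passed for ${t.test} (owner: ${t.owner}, deadline: ${t.deadline})`);
}
if (overdue.length > 0) {
  // Escalate instead of silently letting quarantined tests rot
  process.exit(1);
}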
Preventing Regressions
Once you fix a flaky test, prevent it from regressing:
# Run the fixed test 50 times to verify the fix is solid
for i in {1..50}; do
npx jest path/to/fixed.test.js --forceExit 2>&1 | grep -E 'PASS|FAIL'
done | sort | uniq -c
Expected output:
50 PASS path/to/fixed.test.js
Putting It All Together: A 90-Day Reliability Plan
Here is how to implement all 10 strategies over 90 days:
Days 1-7: Assessment and Foundation
- Set up DeFlaky or manual flaky test tracking
- Measure baseline metrics (flake rate, pipeline reliability)
- Write and communicate the flaky test policy
- Identify the top 10 flakiest tests
Days 8-30: Quick Wins
- Fix the top 10 flakiest tests (Strategies 1-3)
- Implement framework-level retries in CI (Strategy 4)
- Set up a quarantine workflow (Strategy 4)
- Add random test ordering in CI (Strategy 1)
Days 31-60: Infrastructure
- Containerize the test environment (Strategy 5)
- Replace brittle integration tests with contract tests (Strategy 6)
- Set up monitoring and alerting (Strategy 8)
- Build or deploy the test reliability dashboard
Days 61-90: Culture and Prevention
- Conduct the first flaky test retrospective (Strategy 10)
- Add reliability review to code review process (Strategy 10)
- Implement visual regression testing properly (Strategy 7)
- Review test pyramid distribution (Strategy 9)
- Measure improvement against baseline
Expected Results
Teams that follow this plan typically see:
Conclusion
Eliminating flaky tests is not about finding a silver bullet. It requires a combination of technical strategies (isolation, deterministic data, smart waits, environment parity), process improvements (quarantine workflows, retries, monitoring), and cultural changes (policies, retrospectives, code review).
The 10 strategies in this guide address every major category of test flakiness. Start with the highest-impact, easiest-to-implement strategies (isolation and deterministic data), and progressively adopt more advanced techniques (contract testing, visual regression, monitoring) as your team's reliability maturity grows.
Tools like DeFlaky accelerate this journey by automating detection, tracking, and prioritization. But even without specialized tools, a team that commits to these strategies and measures their progress will see dramatic improvements in test reliability, developer productivity, and software quality.
The most important thing is to start. Every flaky test you fix is a step toward a test suite your team can actually trust. And a trusted test suite is the foundation of confident, fast software delivery.