10 Battle-Tested Strategies to Eliminate Flaky Tests from Your Codebase
Flaky tests are the silent productivity killer in software engineering. They are tests that pass sometimes and fail sometimes, without any changes to the code under test. Every engineering team encounters them, but few address them systematically. Instead, teams develop coping mechanisms: re-running pipelines, adding retries, or worse, ignoring test failures entirely.
This guide presents 10 battle-tested strategies for eliminating flaky tests. These are not theoretical suggestions -- they are techniques drawn from production codebases at organizations ranging from startups to Fortune 500 companies. Each strategy includes concrete implementation examples, common pitfalls, and guidance on when to apply it.
The strategies are ordered roughly by impact and ease of implementation. Start with Strategy 1 and work your way through as needed.
Strategy 1: Enforce Strict Test Isolation
Test isolation is the foundation of test reliability. If your tests share state, you will have flaky tests. Period. No amount of retries, clever waits, or debugging will fix a fundamental isolation problem.
What Test Isolation Means
A test is properly isolated when:
- It can run in any order and produce the same result
- It can run concurrently with any other test
- It can run independently without any other test running before or after it
- It leaves no side effects that affect other tests
How to Enforce Isolation
Database isolation:
// BAD: Tests share the same database state
describe('User management', () => {
test('creates a user', async () => {
await db.users.create({ name: 'Alice', email: 'alice@test.com' });
const user = await db.users.findByEmail('alice@test.com');
expect(user.name).toBe('Alice');
});
test('lists all users', async () => {
const users = await db.users.findAll();
expect(users).toHaveLength(0); // FAILS: Alice exists from previous test
});
});
// GOOD: Each test gets a clean state
describe('User management', () => {
beforeEach(async () => {
await db.users.deleteAll();
});
test('creates a user', async () => {
await db.users.create({ name: 'Alice', email: 'alice@test.com' });
const user = await db.users.findByEmail('alice@test.com');
expect(user.name).toBe('Alice');
});
test('lists all users', async () => {
const users = await db.users.findAll();
expect(users).toHaveLength(0); // Passes: database was cleaned
});
});
// BEST: Use transactions for automatic rollback
describe('User management', () => {
let transaction;
beforeEach(async () => {
transaction = await db.beginTransaction();
});
afterEach(async () => {
await transaction.rollback();
});
test('creates a user', async () => {
await db.users.create({ name: 'Alice', email: 'alice@test.com' }, { transaction });
const user = await db.users.findByEmail('alice@test.com', { transaction });
expect(user.name).toBe('Alice');
// Transaction automatically rolls back after test
});
});
File system isolation:
# BAD: Tests use shared file paths
def test_write_config():
write_config("/tmp/config.json", {"key": "value"})
config = read_config("/tmp/config.json")
assert config["key"] == "value"
def test_default_config():
config = read_config("/tmp/config.json")
assert config == {} # FAILS: previous test wrote to this file
# GOOD: Use unique paths per test
import tempfile
import os
def test_write_config():
with tempfile.NamedTemporaryFile(suffix='.json', delete=False) as f:
path = f.name
try:
write_config(path, {"key": "value"})
config = read_config(path)
assert config["key"] == "value"
finally:
os.unlink(path)
# ALSO GOOD: Use pytest's tmp_path fixture
def test_write_config(tmp_path):
config_file = tmp_path / "config.json"
write_config(str(config_file), {"key": "value"})
config = read_config(str(config_file))
assert config["key"] == "value"
In-memory state isolation:
// BAD: Singleton retains state between tests
// cache.js
const cache = new Map();
export function setCache(key, value) { cache.set(key, value); }
export function getCache(key) { return cache.get(key); }
// cache.test.js
test('caches a value', () => {
setCache('user', { name: 'Alice' });
expect(getCache('user')).toEqual({ name: 'Alice' });
});
test('returns undefined for missing key', () => {
expect(getCache('user')).toBeUndefined(); // FAILS: 'user' was set in previous test
});
// GOOD: Reset the module between tests
beforeEach(() => {
jest.resetModules();
});
// OR: Export a clear function
export function clearCache() { cache.clear(); }
beforeEach(() => {
clearCache();
});
Detecting Isolation Violations
Run your tests in random order to surface isolation issues:
# Jest
npx jest --randomize
# pytest (requires the pytest-randomly plugin)
pytest -p randomly
# RSpec
rspec --order random
If tests fail in random order but pass in the default order, you have an isolation problem. Tools like DeFlaky can automatically detect ordering dependencies by running your test suite with different orderings and correlating failures with test sequences.
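If your runner does not support random ordering natively, a custom sequencer can do the same job. Below is a minimal sketch for Jest, assuming the seedrandom package and an illustrative TEST_ORDER_SEED environment variable; it shuffles test files and logs the seed so a failing order can be reproduced:
// jest.sequencer.js -- seeded random test ordering (sketch)
const Sequencer = require('@jest/test-sequencer').default;
const seedrandom = require('seedrandom');

class SeededRandomSequencer extends Sequencer {
  sort(tests) {
    // Log the seed so a failing order can be replayed locally
    const seed = process.env.TEST_ORDER_SEED || String(Date.now());
    console.log(`Test order seed: ${seed}`);
    const rng = seedrandom(seed);
    // Fisher-Yates shuffle over the list of test files
    const shuffled = [...tests];
    for (let i = shuffled.length - 1; i > 0; i--) {
      const j = Math.floor(rng() * (i + 1));
      [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
    }
    return shuffled;
  }
}
module.exports = SeededRandomSequencer;

// jest.config.js: module.exports = { testSequencer: './jest.sequencer.js' };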
Strategy 2: Use Deterministic Test Data
Non-deterministic test data is one of the most common causes of flaky tests. This includes random values, auto-incrementing IDs, timestamps, and data from shared environments.
The Problem with Non-Deterministic Data
// FLAKY: Test depends on auto-increment ID
test('fetches the first user', async () => {
const user = await createUser({ name: 'Test User' });
// Assumes user.id will be 1, but it depends on database state
expect(user.id).toBe(1);
});
// FLAKY: Test depends on current time
test('displays relative time', () => {
const post = { createdAt: new Date('2026-04-06T10:00:00Z') };
render(<RelativeTime date={post.createdAt} />); // <RelativeTime> is a hypothetical component
// "1 day ago" is only correct on April 7th
expect(screen.getByText('1 day ago')).toBeInTheDocument();
});
// FLAKY: Test uses Math.random()
test('generates a valid token', () => {
const token = generateToken();
expect(token).toMatch(/^[a-f0-9]{32}$/);
// What if generateToken() produces a value that doesn't match?
// The assertion might be wrong, not the code
});
Making Data Deterministic
Control time:
// Use fake timers to control the clock
test('displays relative time', () => {
jest.useFakeTimers();
jest.setSystemTime(new Date('2026-04-07T10:00:00Z'));
const post = { createdAt: new Date('2026-04-06T10:00:00Z') };
render(<RelativeTime date={post.createdAt} />);
expect(screen.getByText('1 day ago')).toBeInTheDocument();
jest.useRealTimers();
});
Control randomness:
// Seed your random number generator
test('generates consistent shuffled array', () => {
const rng = seedrandom('test-seed-42');
const shuffled = shuffleArray([1, 2, 3, 4, 5], rng);
expect(shuffled).toEqual([3, 1, 5, 2, 4]); // Deterministic!
});
// Or mock the random function
test('generates a predictable token', () => {
jest.spyOn(Math, 'random').mockReturnValue(0.5);
const token = generateToken();
expect(token).toBe('expected-token-value');
Math.random.mockRestore();
});
Use factory functions for test data:
// test-factories.js
let idCounter = 0;
export function resetFactories() {
idCounter = 0;
}
export function createTestUser(overrides = {}) {
idCounter++;
return {
id: `test-user-${idCounter}`,
name: `Test User ${idCounter}`,
email: `user${idCounter}@test.example.com`,
createdAt: new Date('2026-01-01T00:00:00Z'),
role: 'user',
...overrides,
};
}
export function createTestOrder(overrides = {}) {
idCounter++;
return {
id: `test-order-${idCounter}`,
userId: 'test-user-1',
items: [
{ productId: 'prod-1', quantity: 1, price: 29.99 },
],
total: 29.99,
status: 'pending',
createdAt: new Date('2026-01-01T00:00:00Z'),
...overrides,
};
}
// Usage in tests
beforeEach(() => {
resetFactories();
});
test('calculates order total', () => {
const order = createTestOrder({
items: [
{ productId: 'prod-1', quantity: 2, price: 10.00 },
{ productId: 'prod-2', quantity: 1, price: 25.00 },
],
});
expect(calculateTotal(order)).toBe(45.00);
});
Never use production data in tests:
// BAD: Fetches data from a shared staging environment
test('displays product list', async () => {
const products = await fetch('https://staging-api.example.com/products');
// Product data changes constantly, test is flaky by design
});
// GOOD: Use fixed test data
test('displays product list', async () => {
const products = [
{ id: 1, name: 'Widget', price: 9.99 },
{ id: 2, name: 'Gadget', price: 19.99 },
];
render(<ProductList products={products} />); // hypothetical component
expect(screen.getByText('Widget')).toBeInTheDocument();
expect(screen.getByText('Gadget')).toBeInTheDocument();
});
Strategy 3: Implement Smart Wait Strategies
Timing issues are the most visible form of flakiness. Tests that fail because they check for a result before the system has finished processing are extremely common, especially in E2E and integration tests.
The Wait Strategy Hierarchy
From best to worst:
Event-Based Waits
// BEST: Wait for a specific event
test('processes payment', async () => {
const paymentPromise = waitForEvent(paymentProcessor, 'complete');
paymentProcessor.charge({ amount: 99.99, card: testCard });
const result = await paymentPromise;
expect(result.status).toBe('success');
});
// For DOM: Wait for specific element state
test('submits form', async () => {
const user = userEvent.setup();
render(<ContactForm />); // hypothetical component
await user.type(screen.getByLabelText('Email'), 'test@example.com');
await user.click(screen.getByRole('button', { name: 'Submit' }));
// Wait for the success message to appear
expect(await screen.findByText('Message sent!')).toBeInTheDocument();
});
Polling Waits with Timeout
// Utility: Wait for condition with timeout
async function waitFor(conditionFn, { timeout = 5000, interval = 100 } = {}) {
const startTime = Date.now();
while (Date.now() - startTime < timeout) {
try {
const result = await conditionFn();
if (result) return result;
} catch (e) {
// Condition not met yet, continue polling
}
await new Promise(resolve => setTimeout(resolve, interval));
}
throw new Error(`Condition not met within ${timeout}ms`);
}
// Usage
test('processes background job', async () => {
const jobId = await submitJob({ type: 'report', params: { userId: 1 } });
const result = await waitFor(async () => {
const job = await getJob(jobId);
return job.status === 'completed' ? job : null;
}, { timeout: 10000, interval: 500 });
expect(result.output).toBeDefined();
});
Waiting for Network Requests
// Cypress: Wait for specific API calls
cy.intercept('POST', '/api/orders').as('createOrder');
cy.get('.submit-order').click();
cy.wait('@createOrder').its('response.statusCode').should('eq', 201);
// Playwright: Wait for response
const responsePromise = page.waitForResponse('**/api/orders');
await page.click('.submit-order');
const response = await responsePromise;
expect(response.status()).toBe(201);
// MSW (Mock Service Worker): Wait for request handler
test('creates order', async () => {
let capturedRequest;
server.use(
rest.post('/api/orders', async (req, res, ctx) => {
capturedRequest = await req.json();
return res(ctx.json({ id: 'order-123', status: 'created' }));
})
);
render(<CheckoutForm />); // hypothetical component
await userEvent.click(screen.getByRole('button', { name: 'Place Order' }));
await waitFor(() => {
expect(capturedRequest).toBeDefined();
expect(capturedRequest.items).toHaveLength(2);
});
});
What to Avoid
// NEVER: Fixed sleep in tests
await new Promise(resolve => setTimeout(resolve, 3000));
// NEVER: Arbitrary retry loops without timeout
while (!element.isVisible()) {
await sleep(100); // Could run forever
}
// NEVER: Waiting longer than necessary "just to be safe"
await sleep(10000); // 10 seconds "because sometimes it's slow"
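Each of these anti-patterns can usually be rewritten as a bounded condition wait. For example, using the waitFor utility defined above (element.isVisible() stands in for whatever condition the sleep was papering over):
// BEFORE: hope 3 seconds is enough
// await new Promise(resolve => setTimeout(resolve, 3000));

// AFTER: poll the actual condition and fail fast with a clear error
await waitFor(() => element.isVisible(), { timeout: 5000, interval: 100 });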
Strategy 4: Implement Retry Patterns Correctly
Retries are not a fix for flaky tests -- they are a mitigation strategy that buys you time while you address root causes. But when used correctly, they prevent flaky tests from blocking your pipeline.
Test-Level vs. Suite-Level Retries
Test-level retries rerun only the failed test:
// Jest: use jest.retryTimes() (built into the default jest-circus runner)
// e.g. in a setup file or at the top of a test file
jest.retryTimes(2);
// Cypress: Built-in test retries
// cypress.config.js
const { defineConfig } = require('cypress');
module.exports = defineConfig({
retries: {
runMode: 2, // Retry up to 2 times in CI
openMode: 0, // No retries locally
},
});
// Playwright: Built-in test retries
// playwright.config.ts
import { defineConfig } from '@playwright/test';
export default defineConfig({
retries: process.env.CI ? 2 : 0,
});
# pytest: Using pytest-rerunfailures
# pytest.ini (the rerun options are CLI flags, so pass them via addopts)
[pytest]
addopts = --reruns 2 --reruns-delay 1
Suite-level retries rerun the entire test file. Use these sparingly:
# GitHub Actions: Retry the entire job
jobs:
  test:
    strategy:
      matrix:
        attempt: [1, 2]
    steps:
      - run: npm test
        continue-on-error: ${{ matrix.attempt == 1 }}
Smart Retry with Exponential Backoff
For integration tests that hit real services, use exponential backoff:
async function retryWithBackoff(fn, { maxRetries = 3, baseDelay = 1000 } = {}) {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (error) {
if (attempt === maxRetries) throw error;
const delay = baseDelay * Math.pow(2, attempt - 1);
console.log(`Attempt ${attempt} failed, retrying in ${delay}ms...`);
await new Promise(resolve => setTimeout(resolve, delay));
}
}
}
test('fetches data from external API', async () => {
const data = await retryWithBackoff(async () => {
const response = await fetch('https://api.example.com/data');
if (!response.ok) throw new Error(`HTTP ${response.status}`);
return response.json();
});
expect(data.items).toBeDefined();
});
Tracking Retries
Every retry should be tracked. A test that consistently needs retries is still flaky and should be fixed:
// Wrapper that tracks retry usage
function trackRetries(testFn, testName) {
return async (...args) => {
const startTime = Date.now();
let attempts = 0;
let lastError;
for (let i = 0; i < 3; i++) {
attempts++;
try {
const result = await testFn(...args);
if (attempts > 1) {
// Log that this test needed a retry
console.warn(`[FLAKY] ${testName} passed on attempt ${attempts}`);
// Report to your metrics system
reportFlakyTest(testName, attempts, Date.now() - startTime);
}
return result;
} catch (error) {
lastError = error;
}
}
throw lastError;
};
}
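The reportFlakyTest call above is deliberately left abstract. A minimal sketch, assuming a generic HTTP metrics endpoint (the METRICS_URL variable and payload shape are illustrative):
// Sketch: push retry data to whatever metrics backend you use
async function reportFlakyTest(testName, attempts, durationMs) {
  const payload = {
    test: testName,
    attempts,
    durationMs,
    branch: process.env.GIT_BRANCH || 'unknown',
    timestamp: new Date().toISOString(),
  };
  try {
    // fetch is global in Node 18+; swap in your HTTP client otherwise
    await fetch(process.env.METRICS_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload),
    });
  } catch (err) {
    // Never let metrics reporting fail the test run itself
    console.warn(`Failed to report flaky test: ${err.message}`);
  }
}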
DeFlaky automatically tracks retry patterns across your test suite, identifying tests that rely on retries to pass and calculating the true failure rate without retry masking.
Strategy 5: Achieve Environment Parity
A significant percentage of flaky tests are caused by differences between development, CI, and production environments. When a test passes locally but fails in CI (or vice versa), the environment is usually the culprit.
Common Environment Differences
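Time zone, locale, installed tool versions, CPU and memory, and the CI flag itself are the usual suspects (each is addressed below). One way to make these differences visible is to print an environment fingerprint at the start of every suite run, so a local-vs-CI comparison is one log line away. A minimal sketch (the helper name and the chosen fields are illustrative):
// env-fingerprint.js -- log the variables that most often differ between machines
const os = require('os');

function logEnvironmentFingerprint() {
  console.log(JSON.stringify({
    node: process.version,
    platform: `${os.platform()} ${os.release()}`,
    cpus: os.cpus().length,
    memoryGb: Math.round(os.totalmem() / 1024 ** 3),
    timezone: process.env.TZ || Intl.DateTimeFormat().resolvedOptions().timeZone,
    locale: process.env.LANG || 'unset',
    ci: Boolean(process.env.CI),
  }, null, 2));
}

module.exports = { logEnvironmentFingerprint };
// e.g. call it once from jest.setup.js so it appears in every CI log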
Achieving Parity with Containers
Docker containers are the most reliable way to achieve environment parity:
# Dockerfile.test
FROM node:20-slim
# Install exact browser versions
RUN npx playwright install --with-deps chromium
# Install exact tool versions
RUN apt-get update && apt-get install -y \
    postgresql-client=15.* \
    redis-tools=7:7.* \
    && rm -rf /var/lib/apt/lists/*
# Set consistent locale and timezone
ENV LANG=en_US.UTF-8
ENV TZ=UTC
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
CMD ["npm", "test"]
# docker-compose.test.yml
services:
  test:
    build:
      context: .
      dockerfile: Dockerfile.test
    environment:
      - NODE_ENV=test
      - DATABASE_URL=postgres://test:test@db:5432/testdb
      - REDIS_URL=redis://redis:6379
      - TZ=UTC
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_started
  db:
    image: postgres:15
    environment:
      POSTGRES_USER: test
      POSTGRES_PASSWORD: test
      POSTGRES_DB: testdb
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U test"]
      interval: 5s
      timeout: 5s
      retries: 5
  redis:
    image: redis:7-alpine
# Run tests in a consistent environment
docker compose -f docker-compose.test.yml run --rm test
Handling Resource Constraints
CI runners typically have fewer resources than developer machines. Account for this:
// Increase timeouts in CI
const timeout = process.env.CI ? 30000 : 10000;
// Reduce parallelism in CI
const maxWorkers = process.env.CI ? 2 : os.cpus().length;
// jest.config.js
module.exports = {
testTimeout: process.env.CI ? 30000 : 10000,
maxWorkers: process.env.CI ? '50%' : '75%',
};
Handling Time Zone Issues
// Force UTC in all test environments
// jest.setup.js or test bootstrap
process.env.TZ = 'UTC';
// For browser tests
test('displays correct date', () => {
jest.useFakeTimers();
jest.setSystemTime(new Date('2026-04-07T12:00:00Z'));
render(<CurrentDate />); // hypothetical component
// This will now be consistent regardless of the developer's time zone
expect(screen.getByText('April 7, 2026')).toBeInTheDocument();
jest.useRealTimers();
});
Handling File System Differences
// BAD: Hardcoded path separators
const configPath = 'config/settings.json';
// GOOD: Use path.join for cross-platform compatibility
const configPath = path.join('config', 'settings.json');
// BAD: Assuming case sensitivity
const fileExists = fs.existsSync('README.md');
// On macOS (case-insensitive), 'readme.md' would also match
// GOOD: Normalize file names in tests
const normalizedName = fileName.toLowerCase();
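For temporary files, the Node equivalent of pytest's tmp_path is to create a unique directory under the OS temp root rather than hardcoding /tmp. A short sketch:
const fs = require('fs');
const os = require('os');
const path = require('path');

test('writes config to an isolated temp directory', () => {
  // mkdtemp appends a random suffix, so parallel tests never collide
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'config-test-'));
  const configPath = path.join(dir, 'settings.json');
  fs.writeFileSync(configPath, JSON.stringify({ key: 'value' }));
  expect(JSON.parse(fs.readFileSync(configPath, 'utf8'))).toEqual({ key: 'value' });
  // Clean up so the test leaves no side effects
  fs.rmSync(dir, { recursive: true, force: true });
});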
Strategy 6: Use Contract Testing to Replace Brittle Integration Tests
Integration tests that hit real services are inherently flaky because they depend on the availability, performance, and behavior of those services. Contract testing replaces these brittle tests with reliable, fast alternatives.
What Is Contract Testing?
Contract testing verifies that two services (a consumer and a provider) agree on the interface between them, without requiring both services to be running simultaneously.
Implementing Contract Tests with Pact
// Consumer test (runs without the real provider)
const { PactV3, MatchersV3 } = require('@pact-foundation/pact');
const { like } = MatchersV3;
describe('User API Consumer', () => {
const provider = new PactV3({
consumer: 'WebApp',
provider: 'UserService',
});
test('fetches a user by ID', async () => {
// Define the expected interaction
provider
.given('a user with ID 1 exists')
.uponReceiving('a request for user 1')
.withRequest({
method: 'GET',
path: '/api/users/1',
})
.willRespondWith({
status: 200,
headers: { 'Content-Type': 'application/json' },
body: {
id: 1,
name: like('John Doe'),
email: like('john@example.com'),
},
});
await provider.executeTest(async (mockServer) => {
// Point your API client at the mock server
const client = new UserClient(mockServer.url);
const user = await client.getUser(1);
expect(user.id).toBe(1);
expect(user.name).toBeDefined();
expect(user.email).toContain('@');
});
});
});
// Provider verification test (runs on the provider side)
const { Verifier } = require('@pact-foundation/pact');
describe('User Service Provider', () => {
test('verifies the contract', async () => {
const verifier = new Verifier({
providerBaseUrl: 'http://localhost:3000',
pactUrls: ['./pacts/webapp-userservice.json'],
stateHandlers: {
'a user with ID 1 exists': async () => {
await db.users.create({ id: 1, name: 'John Doe', email: 'john@example.com' });
},
},
});
await verifier.verifyProvider();
});
});
When to Use Contract Tests vs. Integration Tests
| Scenario | Use Contract Tests | Use Integration Tests |
|----------|-------------------|----------------------|
| API request/response format | Yes | No |
| Business logic across services | No | Yes |
| Error handling between services | Yes | Supplementary |
| Performance under load | No | Yes |
| Data consistency | No | Yes |
Contract tests eliminate an entire category of flakiness (external service availability) while still verifying that services can communicate correctly.
Strategy 7: Implement Visual Regression Testing Correctly
Visual regression tests compare screenshots to detect unintended UI changes. They are valuable but notoriously flaky when implemented poorly.
Sources of Visual Test Flakiness
Making Visual Tests Reliable
// Playwright example with visual comparison
test('homepage renders correctly', async ({ page }) => {
await page.goto('/');
// Wait for all content to load
await page.waitForLoadState('networkidle');
// Disable animations
await page.addStyleTag({
content: `
*, *::before, *::after {
animation-duration: 0s !important;
animation-delay: 0s !important;
transition-duration: 0s !important;
transition-delay: 0s !important;
}
`,
});
// Hide dynamic content
await page.evaluate(() => {
// Hide timestamps
document.querySelectorAll('[data-testid="timestamp"]').forEach(el => {
el.textContent = '2026-01-01';
});
// Hide ads
document.querySelectorAll('.ad-container').forEach(el => {
el.style.display = 'none';
});
});
// Use a tolerance threshold
await expect(page).toHaveScreenshot('homepage.png', {
maxDiffPixels: 100, // Allow up to 100 different pixels
maxDiffPixelRatio: 0.01, // Or 1% of total pixels
threshold: 0.2, // Per-pixel color difference threshold
});
});
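Playwright can also mask dynamic regions at comparison time instead of rewriting the DOM, which leaves the page under test untouched. A short sketch (the selectors are illustrative):
await expect(page).toHaveScreenshot('homepage.png', {
  // Masked regions are painted over before the pixel comparison
  mask: [
    page.locator('[data-testid="timestamp"]'),
    page.locator('.ad-container'),
  ],
  maxDiffPixelRatio: 0.01,
});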
Platform-Consistent Screenshots
Always capture screenshots in a consistent environment:
# Use Docker for consistent rendering
visual-tests:
  image: mcr.microsoft.com/playwright:v1.42.0-focal
  script:
    - npx playwright test --project=visual
  artifacts:
    paths:
      - test-results/
    when: always
Component-Level Visual Tests
Test individual components rather than full pages to reduce flakiness surface area:
// Storybook + Chromatic for component-level visual testing
// More reliable than full-page screenshots
test('Button renders correctly in all states', async ({ page }) => {
await page.goto('/storybook/iframe.html?id=button--default');
await expect(page.locator('.storybook-button')).toHaveScreenshot('button-default.png');
await page.goto('/storybook/iframe.html?id=button--disabled');
await expect(page.locator('.storybook-button')).toHaveScreenshot('button-disabled.png');
await page.goto('/storybook/iframe.html?id=button--loading');
// Wait for loading animation to reach a deterministic state
await page.evaluate(() => {
document.querySelectorAll('.spinner').forEach(el => {
el.style.animationPlayState = 'paused';
});
});
await expect(page.locator('.storybook-button')).toHaveScreenshot('button-loading.png');
});
Strategy 8: Build a Test Monitoring and Alerting System
Prevention is better than cure. A monitoring system catches flaky tests early, before they erode team trust.
What to Monitor
Building a Simple Monitoring Pipeline
// test-monitor.js
// Run after each CI build to collect and analyze test data
const fs = require('fs');
const { parseJunitXml } = require('./junit-parser');
async function monitorTestResults(junitPath) {
const results = await parseJunitXml(junitPath);
const alerts = [];
for (const test of results.tests) {
// Check for new failures
const history = await getTestHistory(test.name, { days: 7 });
if (test.status === 'fail' && history.recentPassRate > 0.95) {
alerts.push({
type: 'new_failure',
test: test.name,
message: `${test.name} failed but has 95%+ pass rate -- possible flake`,
});
}
// Check for duration anomalies
if (history.avgDuration > 0 && test.duration > history.avgDuration * 3) {
alerts.push({
type: 'slow_test',
test: test.name,
message: `${test.name} took ${test.duration}s (avg: ${history.avgDuration}s)`,
});
}
// Check for emerging flakiness
if (history.passRate < 0.95 && history.passRate > 0.05) {
const isNew = history.firstFailure > Date.now() - 7 * 24 * 60 * 60 * 1000;
if (isNew) {
alerts.push({
type: 'emerging_flake',
test: test.name,
message: `${test.name} has become flaky (${(history.passRate * 100).toFixed(0)}% pass rate)`,
});
}
}
}
return alerts;
}
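The getTestHistory helper above is where the real work happens. A minimal sketch backed by a local JSON file (the file name and record shape are illustrative; a real setup would use a database or a service like DeFlaky):
// test-history.js -- naive history store keyed by test name
const fs = require('fs');
const HISTORY_FILE = 'test-history.json';

async function getTestHistory(testName, { days = 7 } = {}) {
  const all = fs.existsSync(HISTORY_FILE)
    ? JSON.parse(fs.readFileSync(HISTORY_FILE, 'utf8'))
    : {};
  const cutoff = Date.now() - days * 24 * 60 * 60 * 1000;
  const runs = (all[testName] || []).filter(r => r.timestamp >= cutoff);
  const passes = runs.filter(r => r.status === 'pass');
  const failures = runs.filter(r => r.status === 'fail');
  return {
    passRate: runs.length ? passes.length / runs.length : 1,
    // A real implementation might compute recentPassRate over a shorter window
    recentPassRate: runs.length ? passes.length / runs.length : 1,
    avgDuration: runs.length
      ? runs.reduce((sum, r) => sum + r.duration, 0) / runs.length
      : 0,
    firstFailure: failures.length ? failures[0].timestamp : 0,
  };
}

module.exports = { getTestHistory };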
Integrating with DeFlaky
DeFlaky provides built-in monitoring and alerting that handles all of the above automatically:
# Configure DeFlaky monitoring
deflaky monitor configure \
--watch "test-results/*.xml" \
--alert-channel slack \
--alert-webhook "$SLACK_WEBHOOK" \
--threshold-new-flake 0.05 \
--threshold-duration-increase 3x \
--digest-schedule "daily@9am"
DeFlaky's monitoring goes beyond simple pass/fail tracking. It performs statistical analysis to distinguish between genuine flakiness and one-time failures, reducing alert noise while catching real problems early. When a test first shows signs of flakiness (even before it would be obvious to a developer), DeFlaky alerts the team and provides diagnostic information about the likely root cause.
Strategy 9: Apply the Test Pyramid with Reliability in Mind
The test pyramid (many unit tests, fewer integration tests, fewest E2E tests) is well-known, but few teams consider reliability when designing their pyramid.
Reliability by Test Type
| Test Type | Typical Flake Rate | Root Causes |
|-----------|-------------------|-------------|
| Unit | < 0.1% | Timer mocks, module state, randomness |
| Integration | 1-5% | Database state, service availability, timing |
| E2E | 5-15% | Browser rendering, network timing, animations |
Designing for Reliability
Push tests down the pyramid whenever possible. A flaky E2E test can often be replaced by a reliable unit test plus a contract test:
// INSTEAD OF: Flaky E2E test for form validation
// e2e/checkout.test.js
test('validates credit card number', async ({ page }) => {
await page.goto('/checkout');
await page.fill('#card-number', '1234');
await page.click('#submit');
await expect(page.locator('.error')).toHaveText('Invalid card number');
});
// USE: Reliable unit test for validation logic
// unit/validators.test.js
test('rejects invalid card numbers', () => {
expect(validateCardNumber('1234')).toEqual({
valid: false,
error: 'Invalid card number',
});
expect(validateCardNumber('4111111111111111')).toEqual({ valid: true });
});
// PLUS: Component test for error display
// component/CardInput.test.js
test('displays validation error', async () => {
render(<CardInput />);
await userEvent.type(screen.getByLabelText('Card Number'), '1234');
await userEvent.tab(); // Trigger blur validation
expect(await screen.findByText('Invalid card number')).toBeInTheDocument();
});
// PLUS: Contract test for payment API
// contract/payment.test.js
test('payment API accepts valid card', async () => {
provider
.uponReceiving('a payment with a valid card')
.withRequest({ method: 'POST', path: '/api/payment', body: { cardNumber: '4111111111111111' } })
.willRespondWith({ status: 200 });
// ...
});
This approach gives you better coverage with far less flakiness.
The Reliability-Adjusted Pyramid
Consider adjusting your test distribution based on reliability:
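One way to keep that distribution honest is to measure it in CI and warn when the E2E share creeps up. A minimal sketch with illustrative directory names and thresholds:
// check-test-distribution.js -- warn when the pyramid starts to invert
const { globSync } = require('glob');

const counts = {
  unit: globSync('test/unit/**/*.test.js').length,
  integration: globSync('test/integration/**/*.test.js').length,
  e2e: globSync('test/e2e/**/*.test.js').length,
};
const total = counts.unit + counts.integration + counts.e2e;
const e2eShare = total ? counts.e2e / total : 0;

console.log(counts);
if (e2eShare > 0.15) {
  // Threshold is illustrative -- tune it to your own flake-rate data
  console.warn(`E2E tests are ${(e2eShare * 100).toFixed(0)}% of the suite; consider pushing some down the pyramid`);
  process.exitCode = 1;
}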
Strategy 10: Establish a Flaky Test Culture and Process
Technical strategies are necessary but not sufficient. Without the right culture and process, flaky tests will keep accumulating no matter how many you fix.
The Flaky Test Policy
Every team should have a written policy for handling flaky tests. Here is a template:
## Flaky Test Policy

### Definition
A test is "flaky" if it produces different results (pass/fail) across multiple
runs without any code changes.

### Detection
- DeFlaky runs nightly and flags new flaky tests
- Any developer who encounters a flaky test reports it by adding the `flaky` label
- CI pipeline tracks retry rates and alerts on increases

### Response SLAs
- Critical (blocks deployment): Fix within 24 hours or quarantine
- High (fails >10% of the time): Fix within 3 days
- Medium (fails 2-10% of the time): Fix within 1 sprint
- Low (fails <2% of the time): Fix within 2 sprints

### Quarantine Rules
- Quarantined tests must have a tracking issue with an assignee
- Maximum quarantine duration: 2 weeks
- If not fixed within 2 weeks, escalate to team lead
- Quarantined tests still run nightly (non-blocking) to check if fixed

### Prevention
- All new tests must pass 10 consecutive runs before merging (see the sketch below this policy)
- Code review includes test reliability review
- No `sleep()` or fixed waits in test code without justification
- Test isolation is mandatory (enforced by random ordering in CI)

### Metrics
- Dashboard: [link to DeFlaky dashboard]
- Weekly report: Sent to #qa-channel every Monday
- Monthly review: Test reliability is reviewed in engineering all-hands
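The "10 consecutive runs before merging" rule is easy to automate in CI. A minimal sketch, assuming test files match *.test.js/ts and the PR is compared against an origin/main base branch (both illustrative):
// check-new-tests.js -- rerun any test file added in this PR
const { execSync } = require('child_process');

const newTestFiles = execSync('git diff --name-only --diff-filter=A origin/main...HEAD', {
  encoding: 'utf8',
})
  .split('\n')
  .filter(f => /\.test\.(js|ts)x?$/.test(f));

for (const file of newTestFiles) {
  for (let run = 1; run <= 10; run++) {
    console.log(`Run ${run}/10: ${file}`);
    // Throws (and fails the CI job) on the first non-zero exit code
    execSync(`npx jest ${file}`, { stdio: 'inherit' });
  }
}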
The Flaky Test Retrospective
Hold monthly retrospectives focused on test reliability:
Agenda:
- Review metrics (flake rate trend, MTTR, pipeline reliability)
- Discuss root causes of fixed flaky tests (what patterns emerge?)
- Identify systemic issues (are the same types of flakiness recurring?)
- Update testing guidelines based on lessons learned
- Celebrate improvements (show the before/after metrics)
Code Review for Test Reliability
Train your team to review test code for reliability red flags:
Red Flags Checklist:
- [ ] `sleep()` or fixed waits without justification
- [ ] Missing `await` on async operations
- [ ] Tests that depend on specific ordering
- [ ] Shared mutable state between tests
- [ ] Hardcoded IDs, timestamps, or ports
- [ ] Missing cleanup in `afterEach`/`afterAll`
- [ ] Direct database manipulation without transaction rollback
- [ ] Missing mock cleanup (mock.restore, mock.reset)
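Part of this checklist can be automated as a pre-commit or CI check. A minimal sketch that greps staged test files for fixed waits (the patterns and file globs are illustrative; a custom lint rule would be the more robust option):
// check-red-flags.js -- crude guard against fixed waits in test code
const fs = require('fs');
const { execSync } = require('child_process');

const patterns = [
  /setTimeout\(\s*resolve\s*,\s*\d{3,}\s*\)/, // new Promise(resolve => setTimeout(resolve, 3000))
  /\bsleep\(\s*\d+\s*\)/,                     // sleep(3000)
  /cy\.wait\(\s*\d+\s*\)/,                    // cy.wait(5000) -- waiting on time, not on an alias
];

const stagedTestFiles = execSync('git diff --cached --name-only', { encoding: 'utf8' })
  .split('\n')
  .filter(f => /\.test\.(js|ts)x?$/.test(f) && fs.existsSync(f));

let failed = false;
for (const file of stagedTestFiles) {
  const source = fs.readFileSync(file, 'utf8');
  for (const pattern of patterns) {
    if (pattern.test(source)) {
      console.error(`${file}: contains a fixed wait (${pattern}) -- justify it or use a condition-based wait`);
      failed = true;
    }
  }
}
if (failed) process.exit(1);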
Ownership and Accountability
Assign flaky test ownership:
Track ownership in your flaky test tracking system:
// flaky-test-registry.json
{
"flaky_tests": [
{
"test": "src/__tests__/checkout.test.js::processes payment correctly",
"status": "quarantined",
"flake_score": 73.2,
"owner": "jane@company.com",
"detected": "2026-03-15",
"deadline": "2026-04-01",
"root_cause": "timing",
"notes": "cy.intercept race condition with payment gateway callback"
}
]
}
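The registry above can also enforce the quarantine SLA automatically. A minimal sketch that fails CI when a quarantined test is past its deadline (the registry path matches the example above; failing the build is an illustrative choice):
// check-quarantine-deadlines.js
const fs = require('fs');

const registry = JSON.parse(fs.readFileSync('flaky-test-registry.json', 'utf8'));
const today = new Date();

const overdue = registry.flaky_tests.filter(
  t => t.status === 'quarantined' && new Date(t.deadline) < today
);

for (const t of overdue) {
  console.error(`Quarantine deadline passed for ${t.test} (owner: ${t.owner}, deadline: ${t.deadline})`);
}
if (overdue.length > 0) {
  // Escalate instead of silently letting quarantined tests rot
  process.exit(1);
}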
Preventing Regressions
Once you fix a flaky test, prevent it from regressing:
# Run the fixed test 50 times to verify the fix is solid
for i in {1..50}; do
npx jest path/to/fixed.test.js --forceExit 2>&1 | grep -E 'PASS|FAIL'
done | sort | uniq -c
Expected output:
50 PASS path/to/fixed.test.js
Putting It All Together: A 90-Day Reliability Plan
Here is how to implement all 10 strategies over 90 days:
Days 1-7: Assessment and Foundation
- Set up DeFlaky or manual flaky test tracking
- Measure baseline metrics (flake rate, pipeline reliability)
- Write and communicate the flaky test policy
- Identify the top 10 flakiest tests
Days 8-30: Quick Wins
- Fix the top 10 flakiest tests (Strategies 1-3)
- Implement framework-level retries in CI (Strategy 4)
- Set up a quarantine workflow (Strategy 4)
- Add random test ordering in CI (Strategy 1)
Days 31-60: Infrastructure
- Containerize the test environment (Strategy 5)
- Replace brittle integration tests with contract tests (Strategy 6)
- Set up monitoring and alerting (Strategy 8)
- Build or deploy the test reliability dashboard
Days 61-90: Culture and Prevention
- Conduct the first flaky test retrospective (Strategy 10)
- Add reliability review to code review process (Strategy 10)
- Implement visual regression testing properly (Strategy 7)
- Review test pyramid distribution (Strategy 9)
- Measure improvement against baseline
Expected Results
Teams that follow this plan typically see:
Conclusion
Eliminating flaky tests is not about finding a silver bullet. It requires a combination of technical strategies (isolation, deterministic data, smart waits, environment parity), process improvements (quarantine workflows, retries, monitoring), and cultural changes (policies, retrospectives, code review).
The 10 strategies in this guide address every major category of test flakiness. Start with the highest-impact, easiest-to-implement strategies (isolation and deterministic data), and progressively adopt more advanced techniques (contract testing, visual regression, monitoring) as your team's reliability maturity grows.
Tools like DeFlaky accelerate this journey by automating detection, tracking, and prioritization. But even without specialized tools, a team that commits to these strategies and measures their progress will see dramatic improvements in test reliability, developer productivity, and software quality.
The most important thing is to start. Every flaky test you fix is a step toward a test suite your team can actually trust. And a trusted test suite is the foundation of confident, fast software delivery.