Test Retry vs Test Fix: When to Retry Flaky Tests and When to Fix Them
Your CI pipeline failed. The failing test has failed before -- intermittently. You have two options: add a retry or fix the root cause. Both are valid strategies, but choosing the wrong one at the wrong time either wastes engineering effort or buries a problem that will keep compounding.
This article provides a clear decision framework for when retrying is the right move and when fixing is non-negotiable, along with retry configurations for five widely used test frameworks: Jest, Playwright, pytest, Cypress, and JUnit 5.
The Retry Trap
Retrying is seductive because it is fast. Add two lines of config, and your flaky test stops blocking the pipeline. The build goes green. Everyone moves on.
But retrying does not fix anything. It masks the symptom while the underlying cause persists. Every retry costs compute time, adds latency to your pipeline, and -- most dangerously -- trains your team to tolerate flakiness rather than eliminate it.
Here is what happens when retries become the default strategy:
The Decision Framework
Use this framework to decide between retrying and fixing.
Retry When:
The flakiness is environmental and outside your control. If the flakiness comes from infrastructure you do not own -- a cloud CI runner with variable performance, a third-party service your tests cannot mock, or a browser rendering inconsistency across OS versions -- retrying is a pragmatic response. You cannot fix what you do not control.
The fix requires significant refactoring that is not prioritized. Some flaky tests require substantial changes to fix: rewriting the test from scratch, refactoring the application code to be more testable, or overhauling the test infrastructure. If the fix is a multi-day effort and the test is only mildly flaky (under 5% failure rate), a retry buys time while the fix is scheduled.
The test covers critical functionality that cannot be disabled. If a test guards a critical path -- payment processing, authentication, data integrity -- and disabling it would be riskier than retrying it, use retries as a temporary safety net while the fix is developed.
Fix When:
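The flake rate exceeds 10%. A test that fails more than 10% of the time triggers retries on a large share of runs, and when its failures are correlated across attempts (for example, a shared resource under load), it can still exhaust every attempt. It needs a real fix.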
The root cause is known and straightforward. If the fix is "replace sleep(2) with an explicit wait" or "add a unique ID to test data," the effort is trivial. Choosing to retry instead of spending 15 minutes on a real fix is technical debt by choice.
If multiple tests in the same area are becoming flaky, the root cause is systemic. Retrying each one individually does not address the shared underlying problem.
The test has been retried for more than two weeks. If you added a retry as a "temporary" measure and two weeks have passed, it is no longer temporary. It is the new normal. Fix it or remove it.
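The criteria above can be sketched as a small function. This is only an illustration: the FlakyTest fields, the recommend name, the rule that fix signals take priority over retry signals, and the default of "fix" when nothing applies are all assumptions, not part of the framework as stated.

```python
from dataclasses import dataclass


@dataclass
class FlakyTest:
    flake_rate: float           # fraction of runs that fail, e.g. 0.04 for 4%
    cause_is_external: bool     # infrastructure or service you do not control
    guards_critical_path: bool  # payments, authentication, data integrity
    fix_is_trivial: bool        # root cause known and quick to fix
    neighbors_also_flaky: bool  # other tests in the same area are flaky too
    days_on_retry: int          # how long a retry has already been in place


def recommend(t: FlakyTest) -> str:
    """Return "fix" or "retry". Fix signals are checked first (an assumption)."""
    fix_signals = (
        t.flake_rate > 0.10,
        t.fix_is_trivial,
        t.neighbors_also_flaky,
        t.days_on_retry > 14,
    )
    if any(fix_signals):
        return "fix"
    retry_signals = (
        t.cause_is_external,
        t.guards_critical_path,
        t.flake_rate < 0.05,  # mildly flaky, so an expensive fix can be scheduled
    )
    # Assumed default: with no stated reason to retry, fix.
    return "retry" if any(retry_signals) else "fix"


widget_test = FlakyTest(0.03, True, False, False, False, 3)
print(recommend(widget_test))  # retry
```

A function like this is mostly useful as a checklist in code review; the thresholds should be tuned to your own pipeline.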
Retry Configurations for Major Frameworks
When retrying is the right call, configure it properly.
Jest
Jest has no retry option in its configuration file, but the jest.retryTimes API retries failed tests. It requires the jest-circus runner, which has been the default since Jest 27.
// In your test file
jest.retryTimes(2, { logErrorsBeforeRetry: true });

describe('Payment API', () => {
  test('processes charge successfully', async () => {
    const result = await processCharge({ amount: 1000, currency: 'usd' });
    expect(result.status).toBe('succeeded');
  });
});
For global retries across all tests:
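// jest.setup.js
jest.retryTimes(2, { logErrorsBeforeRetry: true });

// jest.config.js
module.exports = {
  // setupFilesAfterEnv runs after the test framework is installed,
  // so the jest global is available in the setup file
  setupFilesAfterEnv: ['./jest.setup.js'],
};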
Best practice: Only retry in CI, not locally. Developers should see failures immediately during local development.
// jest.setup.js
if (process.env.CI) {
  jest.retryTimes(2, { logErrorsBeforeRetry: true });
}
Playwright
Playwright has first-class retry support in its configuration.
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: process.env.CI ? 2 : 0,
  // Capture trace on first retry for debugging
  use: {
    trace: 'on-first-retry',
    screenshot: 'only-on-failure',
  },
});
Playwright's retry mechanism is sophisticated: it reruns the entire test including beforeEach hooks, captures traces only on retries (to avoid performance overhead on passing tests), and reports which tests needed retries.
import { test, expect } from '@playwright/test';

// Tests in this group get more retries because they depend on a flaky third-party widget.
// Retries are set per group with test.describe.configure; annotations do not change retry behavior.
test.describe('payment widget', () => {
  test.describe.configure({ retries: 3 });

  test('loads payment widget', async ({ page }) => {
    // ... test code
  });
});
pytest
pytest uses the pytest-rerunfailures plugin for retries.
pip install pytest-rerunfailures
# Retry all failed tests up to 2 times
pytest --reruns 2 --reruns-delay 1
# Retry only specific failure types
pytest --reruns 2 --only-rerun "TimeoutError" --only-rerun "ConnectionError"
Per-test retry with decorators:
import pytest
import requests


@pytest.mark.flaky(reruns=3, reruns_delay=2)
def test_external_api_response():
    """This test calls a third-party API that occasionally times out."""
    # An explicit timeout keeps a hung request from blocking the run
    response = requests.get("https://api.external-service.com/status", timeout=10)
    assert response.status_code == 200
Configuration in pytest.ini:
[pytest]
addopts = --reruns 2 --reruns-delay 1
Cypress
Cypress supports test retries natively.
// cypress.config.js
const { defineConfig } = require('cypress');

module.exports = defineConfig({
  retries: {
    runMode: 2,  // Retries when running in CI (cypress run)
    openMode: 0, // No retries in interactive mode (cypress open)
  },
});
Per-test configuration:
describe('Checkout Flow', () => {
  it('completes purchase', { retries: 3 }, () => {
    cy.visit('/checkout');
    cy.get('[data-testid="pay-button"]').click();
    cy.contains('Order confirmed').should('be.visible');
  });
});
JUnit 5
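JUnit 5 has no built-in retry. The @RepeatedTest annotation runs a test a fixed number of times regardless of outcome, which helps reproduce flakiness but is not a retry. For retry-on-failure, use the JUnit Pioneer extension.
// Using JUnit Pioneer
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junitpioneer.jupiter.RetryingTest;

class PaymentTest {
    // Runs up to 3 times in total and passes as soon as one attempt succeeds
    @RetryingTest(3)
    void processPayment() {
        PaymentResult result = paymentService.charge(1000, "usd");
        assertEquals("succeeded", result.getStatus());
    }
}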
The Hybrid Strategy: Retry Now, Track, Fix Later
The most effective teams use a hybrid approach: retry immediately to unblock the pipeline, but track retried tests and schedule fixes based on impact.
Step 1: Enable Retries with Tracking
# Run tests with retries enabled and write results to JUnit XML
PLAYWRIGHT_JUNIT_OUTPUT_NAME=test-results.xml npx playwright test --retries=2 --reporter=junit
# Push results to DeFlaky for tracking
deflaky push --input test-results.xml --project my-app
Step 2: Monitor the Retry Dashboard
The DeFlaky Dashboard shows which tests are being retried, how often, and whether their retry frequency is increasing or decreasing. This gives you a data-driven priority list for fixes.
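To make "how often" and "increasing or decreasing" concrete, here is a minimal sketch of the arithmetic behind such a view. It does not reflect how DeFlaky computes anything; the input format (one boolean per CI run, oldest first, True when the test needed a retry) and the half-versus-half trend comparison are assumptions chosen for illustration.

```python
from typing import Sequence


def retry_rate(retried: Sequence[bool]) -> float:
    """Fraction of runs in which the test needed at least one retry."""
    return sum(retried) / len(retried) if retried else 0.0


def trend(retried: Sequence[bool]) -> str:
    """Compare the older half of the history with the newer half."""
    mid = len(retried) // 2
    older = retry_rate(retried[:mid])
    newer = retry_rate(retried[mid:])
    if newer > older:
        return "increasing"
    if newer < older:
        return "decreasing"
    return "stable"


# Hypothetical history for one test, oldest run first
history = [False, False, False, True, False, True, True, False]
print(f"{retry_rate(history):.1%}", trend(history))  # 37.5% increasing
```

With only a handful of runs the trend is noisy, so a real report would likely want a longer window before drawing conclusions.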
Key metrics to watch:
Step 3: Set Fix SLAs Based on Severity
| Retry Rate | Severity | SLA |
|-----------|----------|-----|
| > 20% | Critical | Fix within 24 hours |
| 10-20% | High | Fix within 1 week |
| 5-10% | Medium | Fix within 2 weeks |
| < 5% | Low | Schedule for next sprint |
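The table translates directly into a lookup. The sketch below is one way to encode it; the function name fix_sla is hypothetical, and the handling of exact boundary values (20% counts as High, 10% as High, 5% as Medium) is an assumption, since the table does not say which side the boundaries fall on.

```python
from typing import Tuple


def fix_sla(retry_rate: float) -> Tuple[str, str]:
    """Map a retry rate (0.0 to 1.0) to (severity, SLA) per the table above."""
    if retry_rate > 0.20:
        return ("Critical", "Fix within 24 hours")
    if retry_rate >= 0.10:  # boundary handling is an assumption
        return ("High", "Fix within 1 week")
    if retry_rate >= 0.05:  # boundary handling is an assumption
        return ("Medium", "Fix within 2 weeks")
    return ("Low", "Schedule for next sprint")


print(fix_sla(0.12))  # ('High', 'Fix within 1 week')
```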
Step 4: Validate Fixes
After fixing a flaky test, remove the retry and monitor for at least one week.
// After fixing the root cause, remove excess retries
test('processes payment', async ({ page }) => {
  // Previously had retries: 3 due to iframe loading race condition
  // Fixed by adding proper frame wait -- retries no longer needed
  await page.goto('/checkout');
  await page.frameLocator('#payment-iframe').getByLabel('Card').fill('4242424242424242');
  await page.getByRole('button', { name: 'Pay' }).click();
  await expect(page.getByText('Payment successful')).toBeVisible();
});
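Before relying on a week of monitoring alone, one option is to run the fixed test many times in a row and count failures. The helper below is a framework-agnostic sketch; count_failures and the stand-in fixed_test are hypothetical names, and in practice you would point the loop at your real test or use your runner's own repeat option.

```python
from typing import Callable


def count_failures(test_fn: Callable[[], None], runs: int) -> int:
    """Run test_fn `runs` times and return how many runs raised an exception."""
    failures = 0
    for _ in range(runs):
        try:
            test_fn()
        except Exception:
            failures += 1
    return failures


def fixed_test() -> None:
    # Stand-in for the repaired test
    assert 1 + 1 == 2


print(count_failures(fixed_test, runs=200))  # 0
```

As a rough guide, the statistical "rule of three" says that zero failures in n independent runs gives about 95% confidence that the true failure rate is below 3/n, so 200 clean runs suggest a flake rate under roughly 1.5%.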
Anti-Patterns to Avoid
Anti-Pattern 1: Infinite Retries
// NEVER DO THIS
jest.retryTimes(10); // If a test needs 10 retries, it needs a fix
More than 3 retries is a red flag. If a test cannot pass within 3 attempts, the problem is too severe for retries.
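A quick calculation shows why a small retry budget is enough for genuinely random flakiness. The sketch below assumes each attempt fails independently with the same probability, which is a simplification, and counts attempts as the first run plus any retries; both function names are illustrative.

```python
def p_exhaust(flake_rate: float, attempts: int) -> float:
    """Probability that every attempt fails, assuming independent attempts."""
    return flake_rate ** attempts


def expected_attempts(flake_rate: float, attempts: int) -> float:
    """Average executions per test when capped at `attempts` tries."""
    # Attempt k+1 only runs if the first k attempts all failed.
    return sum(flake_rate ** k for k in range(attempts))


for rate in (0.05, 0.10, 0.50):
    print(rate, p_exhaust(rate, 3), expected_attempts(rate, 3))
```

Under this model a test with a 10% flake rate exhausts 3 attempts about once in 1,000 runs, while a test with a 50% flake rate does so once in 8. If a test keeps exhausting a small budget, its failures are probably not independent (for example, shared state or a resource under load), and more attempts will not help.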
Anti-Pattern 2: Retries Without Logging
# BAD: Retries happen silently
pytest --reruns 3
# GOOD: Log retries so you know they're happening
pytest --reruns 3 -v # Verbose output shows retry attempts
If retries happen silently, nobody knows the problem exists. Ensure retry events are visible in your CI logs and tracked in your test dashboard.
Anti-Pattern 3: Retries as a Permanent Solution
If a test has had retries enabled for more than 30 days without a fix being scheduled, the retry has become a permanent coping mechanism. Either fix the test or acknowledge that the test is unreliable and consider removing it.
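One way to enforce the 30-day limit is to record when each retry was added and flag the ones that have expired. This is a sketch under assumptions: the registry format (test name mapped to the date its retry was enabled) and the expired_retries name are hypothetical, and where you store the registry is up to you.

```python
from datetime import date, timedelta
from typing import Dict, List

MAX_RETRY_AGE = timedelta(days=30)  # the 30-day limit described above


def expired_retries(retry_added_on: Dict[str, date], today: date) -> List[str]:
    """Return names of tests whose retry has been enabled for more than 30 days."""
    return sorted(
        name
        for name, added in retry_added_on.items()
        if today - added > MAX_RETRY_AGE
    )


# Hypothetical registry of tests that currently have retries enabled
registry = {
    "test_checkout_widget": date(2024, 1, 15),
    "test_external_api_response": date(2024, 2, 20),
}
print(expired_retries(registry, today=date(2024, 3, 1)))  # ['test_checkout_widget']
```

Running a check like this in CI and failing the build on a non-empty result turns the expiration date into something the team cannot quietly ignore.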
Anti-Pattern 4: Retrying Without Cleanup
If a test fails because it created partial state (e.g., half-created database records), retrying without cleaning up will fail again for the same reason.
import pytest


# BAD: Retry without cleanup
@pytest.mark.flaky(reruns=2)
def test_create_order():
    order = create_order(sku="WIDGET-001")
    assert order.status == "confirmed"


# GOOD: Ensure cleanup before retry
@pytest.mark.flaky(reruns=2)
def test_create_order():
    cleanup_pending_orders()  # Clean up any partial state from a previous failed attempt
    order = create_order(sku="WIDGET-001")
    assert order.status == "confirmed"
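The effect of cleanup is easy to see in a toy model. The sketch below is not how pytest-rerunfailures works internally; run_with_retries, the in-memory pending_orders list, and the simulated first-attempt failure are all hypothetical stand-ins used to show why leftover state makes every retry fail.

```python
from typing import Callable, List

pending_orders: List[str] = []  # stands in for half-created database records
attempts_seen = {"n": 0}


def cleanup_pending_orders() -> None:
    pending_orders.clear()


def flaky_create_order() -> None:
    assert not pending_orders, "leftover partial state from a previous attempt"
    pending_orders.append("WIDGET-001")  # partial state is created here
    attempts_seen["n"] += 1
    assert attempts_seen["n"] > 1, "simulated transient failure on first attempt"
    pending_orders.clear()  # order confirmed, nothing left pending


def run_with_retries(test_fn: Callable[[], None],
                     cleanup: Callable[[], None],
                     reruns: int) -> int:
    """Run test_fn with cleanup before each attempt; return the passing attempt number."""
    if reruns < 0:
        raise ValueError("reruns must be >= 0")
    last_error = None
    for attempt in range(1, reruns + 2):  # 1 initial attempt + `reruns` retries
        cleanup()
        try:
            test_fn()
            return attempt
        except AssertionError as exc:
            last_error = exc
    raise last_error


print(run_with_retries(flaky_create_order, cleanup_pending_orders, reruns=2))  # 2
```

Passing a no-op cleanup instead makes the second and third attempts fail on the leftover record, which is the failure mode described above.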
Conclusion
Retrying and fixing are not opposing strategies -- they are tools for different situations. Retries buy time for environmental flakiness and low-impact issues. Fixes are required for high-flake-rate tests, known root causes, and systemic problems.
The key is to never let retries become invisible. Track every retry, measure the trend, and set SLAs for fixes. Use DeFlaky to automate this tracking so your team always knows which tests are being retried, how often, and whether the situation is improving.
The teams with the most reliable test suites are not the ones that never use retries. They are the ones that treat every retry as a temporary measure with an expiration date.