The Test Quarantine Pattern: How to Isolate Flaky Tests Without Ignoring Them

Every engineering team eventually faces the same dilemma with flaky tests. You have a test that fails intermittently. It is not catching real bugs. It is blocking deployments. The temptation is to skip it, delete it, or add @pytest.mark.skip and move on. But skipping a test means losing the coverage it provides. Deleting it means losing the validation it performs. And adding a skip marker means it will never be looked at again.

The test quarantine pattern offers a third option. Instead of deleting flaky tests or letting them block your pipeline, you move them to a quarantine. Quarantined tests still run. Their results are still recorded. But they do not block deployments or mark builds as failed. This gives your team breathing room to investigate and fix the underlying flakiness without sacrificing CI velocity.

This guide covers everything you need to implement the test quarantine pattern: the workflow, the tooling, the CI integration, and the cultural practices that make it work.

Why Traditional Approaches to Flaky Tests Fail

Before diving into quarantine, it is worth understanding why the common alternatives do not work.

Skip and Forget

The most common approach to flaky tests is @skip or @ignore. The test is marked as skipped, the pipeline goes green, and everyone moves on. The problem is that "skip" is permanent in practice. Nobody schedules time to revisit skipped tests. The skip marker accumulates. Six months later, you have 50 skipped tests and no idea which ones are still relevant.

Skipped tests also provide zero information. They do not run, so you do not know if the underlying flakiness has been fixed. You do not know if the code they covered has changed. You are flying blind.

Delete the Test

Deleting a flaky test solves the immediate CI problem but creates a coverage gap. The test existed for a reason. It was testing a specific behavior, a specific integration, or a specific edge case. Deleting it means that behavior is no longer validated. If a regression is introduced in that area, you will not catch it until it reaches production.

Retry Until Green

Some teams configure their CI to retry failed tests automatically. If a test fails, rerun it. If it passes on retry, treat it as passing. This approach masks flakiness rather than addressing it. It also slows down CI because every flaky failure adds a retry cycle. Worse, it normalizes unreliable tests. When retries become routine, teams stop investigating failures, and real bugs slip through.

The Real Problem

All of these approaches share a fundamental flaw: they treat flaky tests as a nuisance to be worked around rather than a signal to be acted on. The quarantine pattern is different because it treats flaky tests as a managed backlog with visibility, ownership, and a resolution process.

What Is the Test Quarantine Pattern?

The test quarantine pattern is a systematic approach to managing flaky tests that separates the "should this test block the pipeline?" question from the "should this test run?" question.

A quarantined test:

Still runs in every CI build

Still reports its pass/fail result

Does not block the pipeline if it fails

Is tracked in a dashboard or backlog

Has an owner responsible for fixing it

Has a deadline for resolution

Is automatically re-qualified when it demonstrates stability

The quarantine is not a graveyard. It is an ICU. Tests go in when they are sick, they are monitored and treated, and they come out when they are healthy.

Implementing the Quarantine Workflow

Step 1: Detection - Identifying Flaky Tests

The first step is identifying which tests are flaky. A test is flaky if it produces inconsistent results across multiple runs without any changes to the test code or the application code.

Manual detection works for small teams. A developer notices a test that keeps failing in CI. They check the recent commits and confirm that no relevant code has changed. They run the test locally and it passes. They conclude the test is flaky.

Automated detection is essential for larger teams. Automated detection analyzes test results across multiple CI runs and flags tests with inconsistent outcomes. This is exactly what DeFlaky does. DeFlaky ingests your test results, tracks pass/fail history for every test, and flags tests that exhibit flakiness above a configurable threshold.

# DeFlaky automated detection
deflaky detect --threshold 0.95 --window 7d

# Output:
# FLAKY: test_api.py::test_create_order (92% pass rate, 12/13 runs)
# FLAKY: test_search.py::test_fuzzy_match (88% pass rate, 7/8 runs)
# FLAKY: test_auth.py::test_token_refresh (95% pass rate, 19/20 runs)

The detection threshold is important. A threshold of 95% means any test with a pass rate below 95% over the detection window is flagged as flaky. Adjust this based on your team's tolerance. A stricter threshold catches more flaky tests but may generate more noise.
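The flagging rule can be sketched in a few lines of Python. This is a minimal illustration of the idea, not DeFlaky's implementation: the function name `find_flaky`, the history format (test ID mapped to a list of pass/fail booleans), and the test `test_cart.py::test_add_item` are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def find_flaky(history: Dict[str, List[bool]], threshold: float = 0.95) -> List[Tuple[str, float]]:
    """Return (test_id, pass_rate) for tests whose pass rate is below `threshold`.

    Tests that never pass are excluded: a test that always fails is
    consistently broken, not flaky.
    """
    flagged = []
    for test_id, results in history.items():
        if not results:
            continue  # no runs recorded in the detection window
        pass_rate = sum(results) / len(results)
        if 0.0 < pass_rate < threshold:
            flagged.append((test_id, pass_rate))
    # Worst offenders first
    return sorted(flagged, key=lambda pair: pair[1])


# Hypothetical pass/fail history collected over the detection window
history = {
    "test_api.py::test_create_order": [True] * 12 + [False],
    "test_search.py::test_fuzzy_match": [True] * 7 + [False],
    "test_cart.py::test_add_item": [True] * 20,
}

flaky = find_flaky(history, threshold=0.95)
for test_id, rate in flaky:
    print(f"FLAKY: {test_id} ({rate:.0%} pass rate)")
```

Raising `threshold` toward 1.0 flags more tests. With only a handful of runs, a single failure moves the pass rate a lot, so a real detector would probably also require a minimum number of runs.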

Step 2: Quarantine - Isolating the Flaky Test

Once a test is identified as flaky, it enters quarantine. The implementation depends on your test framework and CI system, but the core mechanism is marking the test so that its failure does not block the pipeline.

Marker-based quarantine (pytest):

# conftest.py
import pytest

def pytest_configure(config):
    config.addinivalue_line(
        "markers", "quarantine: mark test as quarantined (flaky)"
    )

# Use in tests
@pytest.mark.quarantine(reason="Flaky due to race condition in event handler",
                         owner="alice", deadline="2026-04-21")
def test_event_processing():
    # This test still runs; CI uses the marker to keep its failure from blocking the pipeline
    pass

Configuration-based quarantine:

Instead of marking individual tests, maintain a quarantine list in a configuration file. This keeps quarantine metadata out of the test code and makes it easier to manage centrally.

# quarantine.yml
quarantined_tests:
  - id: "test_api.py::test_create_order"
    reason: "Intermittent timeout in CI, passes locally"
    owner: "alice"
    quarantined_date: "2026-04-01"
    deadline: "2026-04-15"
    jira_ticket: "QA-1234"

  - id: "test_search.py::test_fuzzy_match"
    reason: "Elasticsearch indexing delay causes assertion failure"
    owner: "bob"
    quarantined_date: "2026-04-03"
    deadline: "2026-04-17"
    jira_ticket: "QA-1235"
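Keeping deadlines in a central file makes them easy to check mechanically. The sketch below flags overdue entries. It assumes the entries from quarantine.yml have already been loaded into a list of dictionaries (the loading step needs a YAML parser and is not shown), and the function name `overdue_entries` is hypothetical.

```python
from datetime import date
from typing import Dict, List

# Entries mirror quarantine.yml; in practice they would be loaded from the file.
QUARANTINED_TESTS = [
    {"id": "test_api.py::test_create_order", "owner": "alice", "deadline": "2026-04-15"},
    {"id": "test_search.py::test_fuzzy_match", "owner": "bob", "deadline": "2026-04-17"},
]


def overdue_entries(entries: List[Dict[str, str]], today: date) -> List[Dict[str, str]]:
    """Return the entries whose deadline is strictly before `today`."""
    return [e for e in entries if date.fromisoformat(e["deadline"]) < today]


overdue = overdue_entries(QUARANTINED_TESTS, today=date(2026, 4, 16))
for entry in overdue:
    print(f"OVERDUE: {entry['id']} (owner: {entry['owner']}, deadline: {entry['deadline']})")
```

A scheduled CI job could run a check like this daily and notify the owner, so an expired deadline does not pass unnoticed.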

CI-level quarantine:

Some CI systems allow you to mark specific test failures as non-blocking. This is the simplest approach but provides the least visibility.

# GitHub Actions example
- name: Run tests
  # "|| true" keeps this step green so the check step below always runs
  run: pytest --junitxml=results.xml || true

- name: Check results (excluding quarantined)
  run: |
    deflaky filter-results results.xml \
      --quarantine quarantine.yml \
      --output filtered-results.xml
    # Fail the build only if non-quarantined tests failed
    deflaky check filtered-results.xml --fail-on-error
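For teams without a dedicated tool, the filtering step can be approximated with the standard library. The sketch below illustrates the concept and is not how DeFlaky works internally. The sample report, the tests `test_login` and `test_exact_match`, the function name `blocking_failures`, and the `classname::name` ID format are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from typing import List, Set

# Hypothetical JUnit-style report with two failures, one of them quarantined
SAMPLE_JUNIT_XML = """
<testsuites>
  <testsuite name="pytest" tests="3" failures="2">
    <testcase classname="test_api" name="test_create_order">
      <failure message="timeout">TimeoutError</failure>
    </testcase>
    <testcase classname="test_auth" name="test_login">
      <failure message="assert 401 == 200">AssertionError</failure>
    </testcase>
    <testcase classname="test_search" name="test_exact_match"/>
  </testsuite>
</testsuites>
"""


def blocking_failures(junit_xml: str, quarantined: Set[str]) -> List[str]:
    """Return IDs of failed or errored test cases that are not quarantined."""
    root = ET.fromstring(junit_xml)
    failed = []
    for case in root.iter("testcase"):
        test_id = f"{case.get('classname')}::{case.get('name')}"
        has_failure = case.find("failure") is not None or case.find("error") is not None
        if has_failure and test_id not in quarantined:
            failed.append(test_id)
    return failed


quarantined_ids = {"test_api::test_create_order"}
blocking = blocking_failures(SAMPLE_JUNIT_XML, quarantined_ids)
print("Blocking failures:", blocking)
# A CI script would exit with a non-zero status here if `blocking` is non-empty.
```

The quarantined failure is still present in the report, so it can be recorded and tracked. Only the non-quarantined failure decides the build status.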

Step 3: Investigation - Finding the Root Cause

A quarantined test should be investigated promptly. The quarantine buys time, but it should not buy indefinite procrastination.

Reproduce the failure. Run the test many times locally with pytest-repeat and pytest-randomly. If the test is consistently passing locally but failing in CI, the problem is likely environmental: different timing, different resources, different network conditions.

# Try to reproduce locally
pytest tests/test_api.py::test_create_order --count=100 -x

# Try with random ordering (pytest-randomly shuffles by default and prints its seed;
# pass --randomly-seed=<seed> to replay a failing order)
pytest tests/ --count=10 -x

Check the failure pattern. Look at the historical failures in DeFlaky's dashboard. Is the test always failing with the same error? Does it fail at specific times of day? Does it fail only on certain CI runners? Patterns reveal root causes.

Analyze timing. If the test involves waiting for asynchronous operations, the failure might be a timing issue. Compare the test's execution time between passing and failing runs.

Check for shared state. If the test only fails when other specific tests run before it, there is a state dependency. Use pytest-randomly to shuffle the test order, then replay a failing order with the seed it prints to narrow down which test is polluting the state.
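If failure records can be exported from CI, grouping them by runner, hour, and error type is a quick way to look for such patterns. The records below are hypothetical, and the field names (`runner`, `timestamp`, `error`) are assumptions about what an export might contain.

```python
from collections import Counter
from datetime import datetime

# Hypothetical failure records for a single flaky test
failures = [
    {"runner": "ci-runner-3", "timestamp": "2026-04-02T02:14:00", "error": "TimeoutError"},
    {"runner": "ci-runner-3", "timestamp": "2026-04-03T02:41:00", "error": "TimeoutError"},
    {"runner": "ci-runner-1", "timestamp": "2026-04-04T14:05:00", "error": "AssertionError"},
    {"runner": "ci-runner-3", "timestamp": "2026-04-05T02:09:00", "error": "TimeoutError"},
]

# Count failures along each dimension that might explain them
by_runner = Counter(f["runner"] for f in failures)
by_hour = Counter(datetime.fromisoformat(f["timestamp"]).hour for f in failures)
by_error = Counter(f["error"] for f in failures)

print("By runner:", by_runner.most_common())
print("By hour:  ", by_hour.most_common())
print("By error: ", by_error.most_common())
```

In this made-up data, most failures are timeouts on one runner around 02:00, which would point toward an environmental cause such as a nightly job competing for resources.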

Step 4: Fix - Resolving the Flakiness

Fixing a flaky test means addressing the root cause, not papering over the symptom. Common fixes include:

Adding proper waits for asynchronous operations instead of fixed sleeps

Improving fixture isolation by using function-scoped fixtures instead of shared ones

Adding retry logic for genuine transient failures (network timeouts, rate limits)

Freezing time for tests that depend on the current date or time

Using test containers for tests that depend on external services

Fixing the application code when the flakiness reveals a real concurrency or race condition bug
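The first fix in the list, replacing fixed sleeps with proper waits, can be sketched as a small polling helper. The name `wait_until` and its default timeout and interval are assumptions made for the example. Many test frameworks and client libraries ship their own equivalents, which are usually preferable when available.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 5.0, interval: float = 0.05) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()  # one last check at the deadline


# Simulated asynchronous operation: becomes ready about 0.2 seconds from now
ready_at = time.monotonic() + 0.2


def job_is_done() -> bool:
    return time.monotonic() >= ready_at


# Instead of `time.sleep(2)` followed by an assertion, wait only as long as needed
assert wait_until(job_is_done, timeout=2.0)
```

The test finishes as soon as the condition holds, and a slow CI runner gets the full timeout instead of failing because a fixed sleep was slightly too short.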

Step 5: Re-Qualification - Returning to the Main Suite

A fixed test should not immediately return to the main suite. It needs to demonstrate stability first. This is the re-qualification process.

Manual re-qualification: Run the test 50-100 times with pytest-repeat and verify that it passes every time. Then remove the quarantine marker and monitor it for a week.

Automated re-qualification: DeFlaky can automatically re-qualify tests. Configure a stability threshold (e.g., 100% pass rate over 20 runs) and a stability window (e.g., 7 days). When a quarantined test meets both criteria, DeFlaky marks it for re-qualification.
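The two criteria can be combined into a single check. The sketch below uses one possible interpretation, which is an assumption and not DeFlaky's documented logic: the test must have at least `min_passes` consecutive passes, and that passing streak must have started at least `window_days` ago. The function and parameter names are hypothetical.

```python
from datetime import date, timedelta
from typing import List, Tuple

Run = Tuple[date, bool]  # (day of the run, whether it passed)


def eligible_for_requalification(runs: List[Run], today: date,
                                 min_passes: int = 20, window_days: int = 7) -> bool:
    """Check for enough consecutive passes and a streak old enough to trust."""
    ordered = sorted(runs, key=lambda run: run[0])
    # Count consecutive passes at the end of the history
    streak = 0
    for _, passed in reversed(ordered):
        if not passed:
            break
        streak += 1
    if streak == 0 or streak < min_passes:
        return False
    # The passing streak must have started at least `window_days` ago
    streak_start = ordered[len(ordered) - streak][0]
    return (today - streak_start) >= timedelta(days=window_days)


# One failure on 2026-04-04, then three passing runs per day for eight days
runs: List[Run] = [(date(2026, 4, 4), False)]
for offset in range(8):
    day = date(2026, 4, 5) + timedelta(days=offset)
    runs.extend([(day, True)] * 3)

today = date(2026, 4, 12)
print(eligible_for_requalification(runs, today))
```

Requiring both conditions guards against two failure modes: a burst of quick reruns on a single day, and a long quiet period with too few runs to mean anything.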

# Check quarantine status
deflaky quarantine status

# Output:
# QUARANTINED: test_api.py::test_create_order
#   Pass rate: 100% (last 25 runs)
#   Stable since: 2026-04-05
#   Status: ELIGIBLE FOR RE-QUALIFICATION
#
# QUARANTINED: test_search.py::test_fuzzy_match
#   Pass rate: 94% (last 18 runs)
#   Status: STILL FLAKY - needs investigation

CI Integration Strategies

The quarantine pattern must integrate cleanly with your CI pipeline. There are several approaches, each with different trade-offs.

Approach 1: Two-Phase Test Run

Run tests in two phases: main tests and quarantined tests. The main tests must all pass for the build to succeed. The quarantined tests run separately and their results are recorded but do not affect the build status.

# GitHub Actions
jobs:
  test:
    steps:
      - name: Run main tests
        run: pytest -m "not quarantine" --junitxml=main-results.xml

      - name: Run quarantined tests (non-blocking)
        run: pytest -m "quarantine" --junitxml=quarantine-results.xml || true

      - name: Report quarantine results
        if: always()
        run: deflaky ingest quarantine-results.xml --tag quarantine

Approach 2: Post-Processing Results

Run all tests together but post-process the results to separate quarantined failures from real failures.

# GitLab CI
test:
  script:
    - pytest --junitxml=results.xml || true
    - deflaky evaluate results.xml --quarantine quarantine.yml
  artifacts:
    reports:
      junit: results.xml

Approach 3: DeFlaky as CI Gatekeeper

Use DeFlaky as the arbiter of build success. DeFlaky knows which tests are quarantined and evaluates results accordingly.

# Jenkins pipeline
stage('Test') {
    steps {
        sh 'pytest --junitxml=results.xml || true'
        sh 'deflaky evaluate results.xml --fail-on-non-quarantine-failures'
    }
}

This approach centralizes quarantine management in DeFlaky rather than in CI configuration or test markers. It provides a single source of truth for quarantine status and makes it easy to manage quarantines across multiple CI pipelines.

Building a Quarantine Dashboard

Visibility is critical for the quarantine pattern to work. Without visibility, quarantined tests are forgotten. A quarantine dashboard provides this visibility.

Essential Dashboard Metrics

Quarantine size over time. Track how many tests are quarantined. If the number is growing, the team is quarantining faster than it is fixing. This is a red flag that needs management attention.

Quarantine age distribution. How long have tests been quarantined? Tests that have been quarantined for more than 30 days should trigger an escalation. Either the test should be fixed or it should be formally deleted with documented acceptance of the coverage gap.

Quarantine resolution rate. What percentage of quarantined tests are being fixed and re-qualified each sprint? This measures the team's commitment to test reliability.

Top quarantine contributors. Which areas of the codebase produce the most flaky tests? This identifies systemic problems that need architectural attention.

Re-qualification success rate. What percentage of re-qualified tests stay stable? If tests frequently return to quarantine after re-qualification, the fixes are not addressing the root cause.
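Several of these metrics fall out of a simple list of quarantine records. The sketch below computes quarantine size, ages, the 30-day escalation list, and an overall resolution rate. The record format and test names are hypothetical, and a real dashboard would compute the resolution rate per sprint rather than over all records.

```python
from datetime import date
from statistics import mean

today = date(2026, 4, 7)

# Hypothetical quarantine records; `resolved` is None while still quarantined
records = [
    {"id": "test_a", "quarantined": date(2026, 3, 1), "resolved": None},
    {"id": "test_b", "quarantined": date(2026, 3, 28), "resolved": None},
    {"id": "test_c", "quarantined": date(2026, 3, 20), "resolved": date(2026, 4, 2)},
    {"id": "test_d", "quarantined": date(2026, 4, 1), "resolved": None},
]

open_records = [r for r in records if r["resolved"] is None]
ages = {r["id"]: (today - r["quarantined"]).days for r in open_records}

quarantine_size = len(open_records)
average_age = mean(list(ages.values()))
needs_escalation = [test_id for test_id, age in ages.items() if age > 30]
resolution_rate = (len(records) - len(open_records)) / len(records)

print(f"Quarantine size:  {quarantine_size}")
print(f"Average age:      {average_age:.1f} days")
print(f"Needs escalation: {needs_escalation}")
print(f"Resolution rate:  {resolution_rate:.0%}")
```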

DeFlaky's Quarantine Dashboard

DeFlaky provides a built-in quarantine dashboard that tracks all of these metrics automatically. It integrates with your CI pipeline to ingest test results, tracks quarantine status, and provides visualizations of quarantine health over time.

# Launch the DeFlaky dashboard
deflaky dashboard --port 8080

# Or view quarantine summary in the terminal
deflaky quarantine summary

# Output:
# Quarantine Summary (2026-04-07)
# ================================
# Total quarantined:     8
# Added this week:       2
# Fixed this week:       3
# Avg quarantine age:    9 days
# Oldest quarantine:     21 days (test_legacy.py::test_migration)
# Ready to re-qualify:   2

Team Ownership and Accountability

The quarantine pattern only works if quarantined tests have owners and deadlines. Without accountability, the quarantine becomes a dump.

Assigning Ownership

When a test is quarantined, it must be assigned to someone. The owner is responsible for investigating the flakiness, fixing the root cause, and shepherding the test through re-qualification.

Ownership assignment strategies:

Code owner: The person or team that owns the code being tested

Last modifier: The person who last modified the test or the code under test

On-call rotation: The person currently on quality rotation

Volunteer: A team member who picks up the quarantine during sprint planning

Setting Deadlines

Every quarantined test needs a deadline. A reasonable default is two weeks: one week for investigation and one week for fix verification. If the deadline passes without resolution, the test should be escalated.

Quarantine Budgets

Set a maximum quarantine size for your team. For example, "no more than 5% of tests can be quarantined at any time." This creates pressure to fix quarantined tests and prevents the quarantine from growing unbounded.

Team quarantine budget: 5% of test suite
Total tests:           2,000
Quarantine capacity:   100
Current quarantine:    8
Budget remaining:      92

When the quarantine budget is approaching capacity, the team must prioritize fixing quarantined tests before adding new ones. This prevents the quarantine from becoming the new normal.
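The budget arithmetic is simple enough to enforce in a script. The sketch below reproduces the numbers from the example above. The names `BudgetStatus` and `quarantine_budget` are hypothetical, and the budget is expressed as an integer percentage to avoid floating-point rounding.

```python
from typing import NamedTuple


class BudgetStatus(NamedTuple):
    capacity: int              # maximum number of quarantined tests allowed
    remaining: int             # slots left before the budget is exhausted
    can_quarantine_more: bool


def quarantine_budget(total_tests: int, quarantined: int, budget_percent: int = 5) -> BudgetStatus:
    """Compute quarantine capacity and remaining budget for a test suite."""
    capacity = total_tests * budget_percent // 100
    remaining = capacity - quarantined
    return BudgetStatus(capacity, remaining, remaining > 0)


status = quarantine_budget(total_tests=2000, quarantined=8)
print(status)
```

A pre-merge check could refuse new quarantine entries when `can_quarantine_more` is false, forcing a fix before another test goes in.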

Quarantine Review in Sprint Ceremonies

Include quarantine status in your sprint ceremonies:

Sprint planning: Review the quarantine backlog. Assign owners to unowned quarantined tests. Include quarantine fixes in the sprint scope.

Daily standup: If someone is working on a quarantine fix, mention progress. This keeps quarantine fixes visible and prevents them from being deprioritized.

Sprint retrospective: Review quarantine metrics. Is the quarantine growing or shrinking? Are deadlines being met? What systemic issues are causing the most flakiness?

Advanced Quarantine Patterns

Automatic Quarantine

Instead of manually quarantining tests, configure DeFlaky to automatically quarantine tests that exceed a flakiness threshold. This ensures that newly flaky tests are caught immediately without requiring manual intervention.

# Configure automatic quarantine
deflaky config set auto-quarantine.enabled true
deflaky config set auto-quarantine.threshold 0.90  # 90% pass rate
deflaky config set auto-quarantine.window 7d
deflaky config set auto-quarantine.min-runs 5

With automatic quarantine, the workflow becomes:

1. A test becomes flaky (passes less than 90% of the time over 7 days)
2. DeFlaky automatically quarantines it
3. DeFlaky creates a ticket or notification
4. The team assigns an owner and fixes the test
5. DeFlaky automatically re-qualifies the test when it demonstrates stability

Graduated Quarantine

Not all flaky tests are equally flaky. A test that fails 1% of the time is different from a test that fails 50% of the time. Graduated quarantine assigns different severity levels based on the degree of flakiness.

Level 1 (Warning): Pass rate from 90% up to 99%. The test is monitored but still blocks the pipeline. The team is notified.

Level 2 (Quarantine): Pass rate from 50% up to, but not including, 90%. The test is quarantined. It runs but does not block. The team must fix it within two weeks.

Level 3 (Critical): Pass rate below 50%. The test is quarantined and escalated immediately. It is likely exposing a real problem in the code or the test infrastructure.
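The levels map naturally onto a small classification function. In the sketch below, the handling of the exact boundaries (90% counts as a warning, 50% as quarantine, anything above 99% as healthy) is an assumption, as are the function name and the returned labels.

```python
def quarantine_level(pass_rate: float) -> str:
    """Map a pass rate between 0.0 and 1.0 to a graduated quarantine level."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0.0 and 1.0")
    if pass_rate < 0.50:
        return "critical"    # Level 3: quarantine and escalate immediately
    if pass_rate < 0.90:
        return "quarantine"  # Level 2: runs but does not block
    if pass_rate <= 0.99:
        return "warning"     # Level 1: still blocks, team is notified
    return "healthy"         # above 99%: no action


for rate in (0.995, 0.95, 0.75, 0.30):
    print(f"{rate:.1%} -> {quarantine_level(rate)}")
```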

Quarantine for Different Test Types

The quarantine parameters should vary by test type:

Unit tests should have a very low flakiness tolerance (99.9% pass rate). Flaky unit tests almost always indicate a bug in the test code.

Integration tests can tolerate slightly more flakiness (99% pass rate) because they depend on external systems.

End-to-end tests are the most flaky by nature (95% pass rate threshold). They depend on the full stack and are sensitive to timing, rendering, and environmental differences.

# quarantine-config.yml
thresholds:
  unit:
    flakiness_threshold: 0.999
    max_quarantine_age: 7d
  integration:
    flakiness_threshold: 0.99
    max_quarantine_age: 14d
  e2e:
    flakiness_threshold: 0.95
    max_quarantine_age: 21d

Common Pitfalls and How to Avoid Them

Pitfall 1: Quarantine as Permission to Ignore

The biggest risk is that the quarantine becomes a graveyard. Tests go in and never come out. Counter this with strict deadlines, ownership requirements, and quarantine budgets.

Pitfall 2: Over-Quarantining

Not every test failure is flakiness. Sometimes the test is catching a real, intermittent bug. Before quarantining a test, verify that the failure is not a legitimate defect. Run the test multiple times. Examine the failure message. Check if the failure correlates with specific code changes.

Pitfall 3: Insufficient Re-Qualification

Re-qualifying a test after one successful run is not enough. The test might just be flaky in the other direction, passing most of the time but still failing occasionally. Require a sustained period of stability (20+ consecutive passes over multiple days) before re-qualification.

Pitfall 4: No Root Cause Analysis

Quarantine without root cause analysis is just a more organized way of ignoring flaky tests. Every quarantined test should have a documented root cause or at least a hypothesis. "It's flaky" is not a root cause. "It fails when Elasticsearch indexing takes more than 2 seconds, which happens under high CI load" is a root cause.

Pitfall 5: Quarantine Without Metrics

If you cannot measure your quarantine health, you cannot improve it. Track quarantine size, age, resolution rate, and recidivism (tests that return to quarantine after re-qualification). Without metrics, the quarantine is invisible and therefore unmanageable.

Measuring the Impact of Quarantine

Before and After Metrics

Track these metrics before and after implementing quarantine:

False failure rate: The percentage of CI builds that fail due to flaky tests rather than real bugs. This should decrease significantly after implementing quarantine. Time to deploy: The average time from merge to deployment. Quarantine reduces this by eliminating flaky-test-induced deployment delays. Developer time spent on false failures: The hours per week that developers spend investigating test failures that turn out to be flaky. This should decrease as quarantined tests are fixed. Test suite trust: Survey your team. Do they trust the test suite? Do they investigate failures or dismiss them as flaky? Trust should increase as the quarantine shrinks and false failures decrease.

ROI of the Quarantine Pattern

A concrete example: A team of 10 developers has a test suite with a 10% false failure rate. Each false failure costs an average of 30 minutes of developer time to investigate. With 5 CI runs per day, that is 0.5 false failures per day, costing 15 minutes per day, or about 5 hours per month (assuming 20 working days) of wasted developer time.

Implementing quarantine reduces the false failure rate to 1%, which cuts the waste to about 0.5 hours per month. The time savings is 4.5 hours per month. Over a year, that is 54 hours of developer time saved, roughly 1.35 working weeks at 40 hours per week. And this does not account for the indirect benefits: faster deployments, higher team morale, and fewer incidents caused by developers ignoring test failures.
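The arithmetic can be reproduced and adapted to your own numbers with a short script. The inputs are the ones from the example. The figure of 20 working days per month is an assumption needed to get from 15 minutes per day to about 5 hours per month, and the function name is hypothetical.

```python
CI_RUNS_PER_DAY = 5
MINUTES_PER_FALSE_FAILURE = 30
WORKING_DAYS_PER_MONTH = 20  # assumption: not stated in the example


def wasted_hours_per_month(false_failure_rate: float) -> float:
    """Hours per month spent investigating failures that turn out to be flaky."""
    false_failures_per_day = false_failure_rate * CI_RUNS_PER_DAY
    minutes_per_day = false_failures_per_day * MINUTES_PER_FALSE_FAILURE
    return minutes_per_day * WORKING_DAYS_PER_MONTH / 60


before = wasted_hours_per_month(0.10)  # 10% false failure rate
after = wasted_hours_per_month(0.01)   # 1% false failure rate
saved_per_month = before - after
saved_per_year = saved_per_month * 12

print(f"Before quarantine: {before:.1f} h/month")
print(f"After quarantine:  {after:.1f} h/month")
print(f"Saved:             {saved_per_month:.1f} h/month, {saved_per_year:.0f} h/year")
```

Teams with more CI runs per day will see proportionally larger savings, because the cost scales linearly with the number of runs.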

Conclusion

The test quarantine pattern is not a silver bullet. It does not fix flaky tests. What it does is create a structured, visible, accountable process for managing them. Instead of the chaos of random test failures blocking deployments, skipped tests accumulating silently, and developers losing trust in the test suite, quarantine provides order.

The key principles are:

Quarantined tests still run. They are not ignored; they are monitored.

Every quarantined test has an owner. Someone is responsible for fixing it.

Every quarantined test has a deadline. It will be fixed or formally deleted.

Re-qualification requires demonstrated stability. One passing run is not enough.

Quarantine health is measured. Dashboards and metrics keep the process accountable.

Tools like DeFlaky make quarantine management practical by automating detection, tracking, and re-qualification. But the pattern works even with simple tooling: a YAML file listing quarantined tests, a CI script that filters results, and a team commitment to reviewing the quarantine regularly.

Start small. Quarantine your three most problematic flaky tests. Assign owners. Set deadlines. Fix them. Then expand the process. Within a few sprints, you will have a test suite that your team trusts and a CI pipeline that your team relies on. That is worth more than any number of green builds achieved by ignoring failing tests.
