Preparing for Third‑Party Outages: Testing Patient Access and Telehealth Failovers
Design tabletop exercises and automated synthetic tests to keep patient portals and telehealth working during CDN or DNS outages.
When the CDN or DNS goes dark: why healthcare IT must rehearse and automate failovers now
Recent global CDN and DNS disruptions in late 2025 and early 2026 — impacting major providers and high‑profile platforms — showed how quickly patient access and telehealth can degrade. For hospitals and health systems running Allscripts EHR, patient portals, and telehealth stacks, those outages are not an abstract risk: they threaten care continuity, violate SLAs, and increase regulatory exposure. This guide shows how to design practical tabletop exercises and build automated synthetic testing so patient portals and telehealth degrade gracefully when CDN or DNS providers fail.
Executive takeaways
- Run targeted tabletop exercises that simulate CDN and DNS provider failures, with clear injects, metrics and escalation playbooks.
- Automate synthetic testing (multi‑region, multi‑protocol) to detect and quantify degradations before real users are impacted.
- Design graceful degradation for patient portals and telehealth: cached UI, API fallbacks, WebRTC/PSTN call routing, and offline/queueing modes.
- Prioritize compliance and data safety during tests: use staging, redact PHI, and execute under BAAs with providers.
- Measure the right KPIs: time to detect, time to failover, concurrent session preservation, percent of degraded‑mode transactions.
Why CDN and DNS failures are uniquely dangerous for healthcare
CDNs and DNS are foundational internet services. A CDN outage can strip away cached content, static assets, and edge routing optimizations — slowing or breaking patient portal pages and telehealth signaling. A DNS failure can make the entire domain unreachable even if origin systems are healthy.
Unlike consumer services, healthcare carries HIPAA/SOC2 controls, tightly integrated clinical workflows, and direct patient safety consequences when telehealth appointments are missed. That’s why resilience testing and rehearsals must be built into an operational program.
Designing tabletop exercises for CDN and DNS outages
Tabletops are structured simulations — inexpensive, low‑risk, and high‑impact. They let teams practice decisions, communications, and technical mitigation without touching production systems.
Who participates
- IT Operations/Cloud Engineering (networking, DNS)
- Application Owners (patient portal, telehealth services)
- Clinical Operations and Telehealth Program Managers
- Security & Compliance (HIPAA officer, legal)
- Vendor/BAA contacts (CDN/DNS providers, telephony providers)
- Service Desk and Communications (patient notifications)
Core objectives
- Validate the runbook for CDN failover and DNS secondary resolution.
- Confirm communication templates and notification channels for patients and providers.
- Assess telehealth continuity plans: WebRTC fallback, PSTN bridging, appointment rescheduling cadence.
- Measure decision times and identify knowledge gaps.
Designing the scenario and injects
Build a plausible timeline. Example CDN outage tabletop:
- T+0: Alerts from synthetic monitors show 70% asset load failure in multiple regions.
- T+10: Patient reports of telehealth call failures begin via service desk.
- T+20: External comms show CDN provider status page with degraded performance.
- T+40: Management requests decision: fail over to origin, switch to multi‑CDN, or enable degraded UX?
For a DNS failure, injects should include inconsistent DNS resolution from different regions, inability to update DNS records (e.g., registrar portal failure), and expired zone data on secondary name servers, so that recovery procedures are actually exercised.
Run the exercise: roles, timeline, and outcomes
- Appoint a facilitator and a scribe to capture decisions and times.
- Keep the exercise time‑boxed to 60–120 minutes.
- Force at least one difficult decision: cut over to vendor B, or route directly to origin knowing that long DNS TTLs will delay the change.
- Record metrics: time to detection, time to decision, time to mitigation, number of sessions affected.
Post‑exercise deliverables
- Updated runbooks and precise playbook steps (commands, contacts, escalation chain).
- Gap analysis: missing permissions, tooling shortfalls (e.g., lack of secondary DNS), and training needs.
- Action list with owners and deadlines—tie to compliance evidence for audits.
Automated synthetic testing: the operational backbone
Tabletops teach people to act. Synthetic testing ensures you detect problems automatically and understand their impact before patients report them. In 2026, synthetic monitoring is increasingly expected to be API‑driven, multi‑region, and integrated into incident response automation.
Types of synthetic checks you need
- DNS resolution probes: measure authoritative and recursive resolution from multiple providers and regions; detect NXDOMAIN, SERVFAIL, and long latencies.
- HTTP/asset checks: page load (Lighthouse metrics), 200/404 checks for static assets, and integrity of CSS/JS bundles.
- API transaction tests: simulate patient login, fetch chart summaries, and post appointment bookings via FHIR endpoints where applicable.
- WebRTC signaling tests: simulate call setup, ICE candidate exchange, and media negotiation; validate STUN/TURN connectivity — treat these like low-latency streaming checks in hybrid environments (see low-latency playbooks).
- PSTN fallback tests: trigger call bridging to validate telephony failover if web calls fail.
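A DNS resolution probe of the kind listed above can be sketched with Node's built-in `dns` module. This is a minimal sketch, not a monitoring product: the function names `classifyDnsResult` and `probeDns`, the 300 ms latency bound, and the 2-second timeout are assumptions chosen for illustration.

```javascript
const { Resolver } = require('node:dns').promises;

// Map a probe outcome to a status label. Node reports NXDOMAIN as
// ENOTFOUND and SERVFAIL as ESERVFAIL; the 300 ms bound is an assumed SLO.
function classifyDnsResult({ errorCode, latencyMs }, slowMs = 300) {
  if (errorCode === 'ENOTFOUND') return 'nxdomain';
  if (errorCode === 'ESERVFAIL') return 'servfail';
  if (errorCode === 'ETIMEOUT') return 'timeout';
  if (errorCode) return 'error';
  return latencyMs > slowMs ? 'slow' : 'ok';
}

// Resolve one hostname through one resolver and time the lookup.
async function probeDns(hostname, resolverIp) {
  const resolver = new Resolver({ timeout: 2000, tries: 1 });
  resolver.setServers([resolverIp]);
  const started = process.hrtime.bigint();
  let errorCode = null;
  try {
    await resolver.resolve4(hostname);
  } catch (err) {
    errorCode = err.code || 'UNKNOWN';
  }
  const latencyMs = Number(process.hrtime.bigint() - started) / 1e6;
  const status = classifyDnsResult({ errorCode, latencyMs });
  return { hostname, resolverIp, errorCode, latencyMs, status };
}
```

A probe agent might call `probeDns('portal.examplehealth.org', '1.1.1.1')` against several public and internal resolvers on a schedule and forward the returned objects to the alerting pipeline.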
Multi‑region and multi‑network probing
Run synthetic checks from multiple geographies and ASN vantage points (cloud providers, ISP probes). CDN/DNS issues can be regional; a global outage may not be uniform. Use synthetic providers that offer commercial probes or deploy your own multi-region probes using Kubernetes agents or lightweight edge nodes.
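Because outages are often regional, it helps to label the scope of a failure before paging anyone. The sketch below groups probe results by region and reports whether the problem looks regional or global. The `classifyScope` name, the `{ region, ok }` result shape, and the 50% failure threshold are assumptions for illustration.

```javascript
// Summarize probe results per region and label the incident scope.
// Each result is { region, ok }; the 50% failure threshold is an assumption.
function classifyScope(results, failureThreshold = 0.5) {
  const byRegion = new Map();
  for (const { region, ok } of results) {
    const stats = byRegion.get(region) || { total: 0, failed: 0 };
    stats.total += 1;
    if (!ok) stats.failed += 1;
    byRegion.set(region, stats);
  }
  // A region counts as failing when its failure ratio reaches the threshold.
  const failingRegions = [...byRegion]
    .filter(([, s]) => s.failed / s.total >= failureThreshold)
    .map(([region]) => region);
  let scope = 'none';
  if (byRegion.size > 0 && failingRegions.length === byRegion.size) scope = 'global';
  else if (failingRegions.length > 0) scope = 'regional';
  return { scope, failingRegions };
}
```

Feeding it results from probes in several clouds and ISPs gives the incident channel a one-line summary such as "regional: us-east" instead of a wall of individual alerts.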
How to simulate a CDN outage safely
- Never simulate by disabling production CDN outright. Instead, emulate failure in staging using a cloned stack and the same CDN configuration.
- Use HTTP proxy rules to block CDN edge IPs for a subset of synthetic probes.
- Modify response headers to mimic missing edge caching (e.g., remove Cache‑Control) to test origin load handling and UI degradation.
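One way to emulate a CDN failure for a single synthetic probe, without touching shared infrastructure, is to abort CDN requests inside the probe's own browser session. Note that this swaps the proxy-rule approach above for Playwright's request interception and blocks by hostname rather than by edge IP. The helper names and the `cdn.examplehealth.org` hostname are assumptions; `page.route`, `route.abort`, and `route.continue` are standard Playwright calls.

```javascript
// Decide whether a request targets one of the CDN hostnames under test.
function isCdnRequest(url, cdnHosts) {
  const { hostname } = new URL(url);
  return cdnHosts.some((h) => hostname === h || hostname.endsWith('.' + h));
}

// Abort CDN requests for this page only; everything else passes through.
// `page` is a Playwright Page object supplied by the calling test.
async function blockCdn(page, cdnHosts) {
  await page.route('**/*', (route) => {
    if (isCdnRequest(route.request().url(), cdnHosts)) {
      return route.abort('connectionfailed');
    }
    return route.continue();
  });
}
```

Calling `await blockCdn(page, ['cdn.examplehealth.org'])` before `page.goto(...)` in a staging test lets you assert that the portal still renders a usable, degraded page when every CDN asset fails to load.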
How to simulate DNS failure safely
- Use a sandboxed resolver that returns SERVFAIL or altered responses for your test domain from synthetic agents.
- Validate secondary DNS behavior: intentionally change authoritative records in staging to ensure TTL‑based propagation and recovery steps are correct.
- Test registrar/portal access with prearranged accounts to confirm you can modify records during an incident.
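A sandboxed resolver that always answers SERVFAIL can be very small. The sketch below uses Node's `dgram` module to echo each query back with the response bit set and the response code set to 2 (server failure). The function names and port 15353 are assumptions, and the resolver handles UDP only, which is enough for most synthetic agents but not a complete DNS server.

```javascript
const dgram = require('node:dgram');

// Turn a DNS query packet into a SERVFAIL response: same ID and question,
// QR bit set (this is a response), RCODE = 2 (server failure).
function buildServfail(query) {
  const response = Buffer.from(query); // copy, so the query is not mutated
  response[2] = query[2] | 0x80;       // set QR, keep opcode and RD
  response[3] = 0x82;                  // set RA, RCODE = 2 (SERVFAIL)
  return response;
}

// Start a UDP resolver that fails every query. Port 15353 is an arbitrary
// unprivileged port chosen for this sketch.
function startServfailResolver(port = 15353, host = '127.0.0.1') {
  const socket = dgram.createSocket('udp4');
  socket.on('message', (query, remote) => {
    if (query.length < 12) return; // shorter than a DNS header; ignore
    socket.send(buildServfail(query), remote.port, remote.address);
  });
  socket.bind(port, host);
  return socket;
}
```

A staging probe can then point a Node `Resolver` at it with `resolver.setServers(['127.0.0.1:15353'])` and confirm that alerts fire and the portal's failure handling behaves as the runbook expects. Keep this resolver confined to test agents so it never answers production traffic.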
Sample synthetic tests (playbook snippets)
Use Playwright for browser flows and k6 or Artillery for API checks. Example (conceptual) Playwright flow to test patient portal login and appointment join:
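// Playwright test sketch; selectors and startTelehealthCall() are placeholders for your portal
const { test, expect } = require('@playwright/test');

test('portal login and telehealth join', async ({ page }) => {
  // navigate to portal
  await page.goto('https://portal.examplehealth.org');
  // check a stylesheet link is present (link elements are never visible, so wait for attachment)
  await page.waitForSelector('link[rel=stylesheet]', { state: 'attached', timeout: 5000 });
  // log in with a synthetic account; load the real password from your secret store
  await page.fill('#user', 'synthetic_test_user');
  await page.fill('#pass', 'REDACTED');
  await page.click('#login');
  await page.waitForSelector('#dashboard', { timeout: 10000 });
  // simulate joining a telehealth visit
  await page.click('#next-appointment');
  const joinStatus = await page.evaluate(() => startTelehealthCall());
  // accept a full or a degraded connection ('connected' || 'degraded' would only ever test 'connected')
  expect(['connected', 'degraded']).toContain(joinStatus);
});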
For API checks, use k6 to assert FHIR calls succeed and latency bounds:
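import http from 'k6/http';
import { check } from 'k6';
export default function () {
  const response = http.get('https://api.examplehealth.org/Patient/123/$everything'); // synthetic patient only
  check(response, {'status is 200': r => r.status === 200, 'latency < 500ms': r => r.timings.duration < 500});
}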
Design patterns for graceful degradation
Degradation is not failure — it is a controlled, safe reduction of functionality that preserves core patient and clinician workflows. Build these patterns into web and telehealth architecture.
Patient portals
- Edge caching and stale‑while‑revalidate: allow non‑PHI static content and page fragments to be served from cache during CDN degradation.
- Progressive rendering: render core patient info first (appointments, messages), load heavy assets later.
- Service workers: support offline forms and queueing for appointment requests when the network is poor. Store pending requests in IndexedDB (localStorage is not available inside service workers) and sync when connectivity is restored.
- Read‑only mode: when API writes fail, present a clear read‑only UX and capture user intent to reconcile later.
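The queueing and read-only patterns above come down to capturing a write that cannot be delivered and replaying it later in order. Here is a minimal in-memory sketch of that idea. The `WriteQueue` class and the caller-supplied `send` function are hypothetical, and a real portal would persist the queue (for example in IndexedDB) and attach idempotency keys so replays cannot double-book.

```javascript
// Queue portal writes while the API is unreachable and replay them in
// order once it recovers. `send` is a caller-supplied async function
// that resolves on success and rejects on failure.
class WriteQueue {
  constructor(send) {
    this.send = send;
    this.pending = [];
  }

  // Try the write immediately unless older writes are still waiting;
  // on failure, keep the user's intent for later.
  async submit(request) {
    if (this.pending.length > 0) {
      this.pending.push(request);
      return 'queued';
    }
    try {
      await this.send(request);
      return 'sent';
    } catch {
      this.pending.push(request);
      return 'queued';
    }
  }

  // Replay queued writes oldest-first; stop at the first failure so
  // ordering is preserved for the next attempt. Returns the remaining count.
  async flush() {
    while (this.pending.length > 0) {
      try {
        await this.send(this.pending[0]);
        this.pending.shift();
      } catch {
        break;
      }
    }
    return this.pending.length;
  }
}
```

The UI can show "We saved your request and will submit it when the connection returns" whenever `submit` returns `'queued'`, and call `flush()` when connectivity is restored.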
Telehealth
- WebRTC robustness: maintain multiple STUN/TURN endpoints (not tied to a single CDN) and test ICE candidate diversity.
- PSTN fallback: pre‑configured call bridges so patients can be called on their phones when web calls fail. Test these end‑to‑end.
- Session continuity: persist session metadata server‑side (so users reconnect without losing context) rather than relying solely on edge session affinity.
- Adaptive media: scale down video quality proactively when signaling or packet loss increases to keep audio stable.
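Two of these telehealth patterns can be sketched in a few lines: stepping video down as the connection worsens, and listing STUN/TURN endpoints from more than one provider. The thresholds, bitrates, hostnames, and the `chooseMediaProfile` name are illustrative assumptions, not vendor recommendations. The `iceServers` shape is the one `RTCPeerConnection` accepts.

```javascript
// Pick a media profile from recent connection stats. Thresholds and
// bitrates are assumed values; tune them against real call quality data.
function chooseMediaProfile({ packetLossPct, rttMs }) {
  if (packetLossPct >= 10 || rttMs >= 800) {
    return { mode: 'audio-only', videoKbps: 0 };
  }
  if (packetLossPct >= 3 || rttMs >= 300) {
    return { mode: 'low-video', videoKbps: 300 };
  }
  return { mode: 'full-video', videoKbps: 1500 };
}

// ICE configuration with STUN/TURN endpoints from two independent
// providers (placeholder hostnames), so one provider failing does not
// remove every relay candidate.
const iceConfig = {
  iceServers: [
    { urls: ['stun:stun.provider-a.example:3478'] },
    { urls: ['turn:turn.provider-a.example:3478'], username: 'synthetic', credential: 'REDACTED' },
    { urls: ['turns:turn.provider-b.example:443'], username: 'synthetic', credential: 'REDACTED' },
  ],
};
```

In a browser client, `new RTCPeerConnection(iceConfig)` would consume the configuration, and the profile returned by `chooseMediaProfile` would drive whatever bitrate or track controls the telehealth client exposes.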
Monitoring, KPIs and alerting you must track
Good monitoring converts synthetic tests into early warnings and automated mitigations. Embed synthetic checks as distinct alert sources and tune alerting to reduce noise.
- Time to detect (TTD) — from outage start to generated alert.
- Time to mitigate (TTM) — from alert to executed mitigation (CDN failover, DNS switch).
- Degraded transaction rate — percent of logins/telehealth joins served in degraded mode.
- Session retention — percent of sessions resumed after mitigation.
- Origin load — origin server CPU and error rates during CDN degradation.
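The first four KPIs are simple arithmetic over incident timestamps and counters, which makes them easy to compute the same way after every incident and exercise. The sketch below assumes an `incident` record with the field names shown; those names are illustrative, not a standard schema.

```javascript
// Compute incident KPIs from timestamps (ms since epoch) and counters.
function computeKpis(incident) {
  const {
    outageStart, alertAt, mitigatedAt,
    joins, degradedJoins, sessionsAtStart, sessionsResumed,
  } = incident;
  // Percentage rounded to one decimal place; 0 when there is no denominator.
  const pct = (part, whole) => (whole === 0 ? 0 : Math.round((part / whole) * 1000) / 10);
  return {
    timeToDetectSec: (alertAt - outageStart) / 1000,
    timeToMitigateSec: (mitigatedAt - alertAt) / 1000,
    degradedRatePct: pct(degradedJoins, joins),
    sessionRetentionPct: pct(sessionsResumed, sessionsAtStart),
  };
}
```

For example, an outage detected after 90 seconds and mitigated 300 seconds later, with 50 of 200 joins served in degraded mode and 72 of 80 sessions resumed, yields a degraded transaction rate of 25% and session retention of 90%.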
Compliance and security during simulations
Always avoid exposing PHI in synthetic checks. Use anonymized or synthetic patient records and run heavy testing in pre‑production environments when possible. If production testing is necessary, obtain sign‑off, document scope for audits, and involve the HIPAA privacy officer.
Coordinate with vendors under BAAs: confirm testing windows and validate that synthetic probes won’t trigger DDoS defenses or rate limits that could affect production services — and be mindful of threats like credential stuffing and automated attack patterns when tuning rate limits.
Automate remediation where safe
Beyond alerts, automate low‑risk remediations: switch DNS to a pre‑validated secondary name server, apply a pre‑approved routing or ACL change that sends traffic to an alternative origin, or enable a read‑only flag in the portal. Keep higher‑risk actions human‑mediated but supported by rapid playbooks. Consider safe automation patterns and sandboxed agents when you author runbook automation; desktop LLM agents can help but must be isolated and auditable.
Example automated flow
- Synthetic monitors detect DNS SERVFAIL from multiple regions.
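- Runbook automation queries the DNS provider or registrar API to confirm records, lowers the TTL to 60s (if allowed), and triggers a prepared DNS failover to the secondary provider. A lowered TTL only helps once resolvers' cached answers expire, so keep TTLs on critical records short before an incident.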
- Notify operators and open an incident with prefilled context and suggested next steps.
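The decision at the heart of this flow can be kept as a small pure function, separate from the code that actually calls provider APIs, so it is easy to review and test. The sketch below is one possible shape: the `planDnsRemediation` name, the step names, and the default quorum of two regions are assumptions.

```javascript
// Decide which runbook steps to run from DNS probe results.
// Each result is { region, status }. Automatic failover happens only
// when explicitly allowed; otherwise a human is paged to approve it.
function planDnsRemediation(results, { quorum = 2, autoFailoverAllowed = false } = {}) {
  const failingRegions = [...new Set(
    results.filter((r) => r.status === 'servfail').map((r) => r.region)
  )];
  // Below quorum, treat the signal as noise or a single bad vantage point.
  if (failingRegions.length < quorum) {
    return { failingRegions, steps: [] };
  }
  const steps = ['confirm-records'];
  steps.push(autoFailoverAllowed ? 'failover-to-secondary-dns' : 'page-operator-for-failover');
  steps.push('open-incident');
  return { failingRegions, steps };
}
```

A runbook runner would map each step name to a vetted, audited action and record the returned plan in the incident ticket, which keeps the higher-risk failover behind an explicit flag.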
From tabletop to continuous assurance: a practical roadmap
- Quarter 1: Build synthetic test library (DNS, HTTP, API, WebRTC) and run weekly smoke checks from 5 regions.
- Quarter 2: Run tabletop exercise for CDN outage; update runbooks and fix priority gaps (secondary DNS, registrar credentials).
- Quarter 3: Add automated remediation for low‑risk actions; begin monthly chaos‑style simulations in staging.
- Quarter 4: Integrate synthetic results into SLO reporting and audit artifacts for HIPAA/SOC2.
Case study: how a health system preserved telehealth during a multi‑region CDN incident (anonymized)
In December 2025, a mid‑sized health system experienced CDN edge degradation affecting video asset delivery. Their prebuilt synthetic probes flagged rising signal loss in 3 regions. Because they had a tabletop‑tested runbook, they immediately triggered origin‑direct routing for signaling, enabled PSTN fallback for active appointments, and activated an origin cache policy for static content. The result: 92% of scheduled telehealth visits connected, with only a 10% increase in manual reschedules. After the post‑incident analysis, the team reduced DNS TTLs for critical domains and added a secondary DNS provider, and both changes were validated in a follow‑up tabletop.
Future trends (2026 and beyond) you should watch
- Edge compute for clinical UX: hosting portal fragments at the edge reduces dependence on a single origin and improves resilience when CDNs are partially impaired — see approaches in edge observability playbooks.
- Programmable DNS and API‑driven registrars: faster recovery through automation—if you’ve automated credentials and workflows securely; tie API automation to safe, auditable agents (desktop agent patterns).
- Federated synthetic networks: healthcare consortia sharing synthetic probes to detect regional routing problems faster.
- Regulatory expectations: auditors increasingly expect evidence of resilience testing (tabletops + synthetic checks) as part of HIPAA/SOC2 operational controls — see policy labs on digital resilience for frameworks.
Checklist: immediate actions to improve CDN/DNS resilience
- Implement multi‑region synthetic monitoring (DNS, HTTP, API, WebRTC).
- Run a CDN outage tabletop within 30 days; update runbooks with exact commands and the vault location of registrar credentials (never the credentials themselves).
- Configure secondary DNS and test registrar access (and keep credentials in an approved vault).
- Develop telehealth PSTN fallback plans and test calls end‑to‑end monthly.
- Adopt read‑only and queueing UX patterns for portal write failures.
- Document all tests and outcomes for HIPAA/SOC2 audits.
Resilience is not a one‑time project — it’s a continuous program that combines rehearsals (tabletops), automated detection (synthetic), and safe, rapid remediation.
Final thoughts and call to action
Patient access and telehealth face a higher bar: failures hurt people and increase regulatory risk. The right combination of well‑crafted tabletop exercises and automated synthetic testing gives your teams the confidence to act fast and keep care flowing when DNS or CDN providers falter. Start small — a single scenario, a handful of synthetic probes — then scale the program into continuous assurance tied to compliance reporting.
Ready to harden your patient portal and telehealth for real‑world outages? Schedule a resilience review and tabletop workshop with our Allscripts.cloud engineers. We’ll help you build synthetic suites, author compliance‑ready runbooks, and automate safe remediations so your patients and clinicians stay connected when it matters most.
Related Reading
- Edge Observability for Resilient Login Flows in 2026
- Implementing RCS Fallbacks in Notification Systems
- Policy Labs and Digital Resilience: A 2026 Playbook
- Building Hybrid Low-Latency Event Playbooks (relevant to WebRTC testing)