Service Provider Health Check
Summary
This enhancement proposes a mechanism for the DCM control plane to actively monitor the health of service providers. Instead of providers pushing heartbeats, the DCM control plane will poll a /health endpoint on the service provider to verify liveness and backing provider health.
Motivation
Give the DCM control plane a defined way to determine whether a service provider is accessible. Without an active check, the control plane might attempt to schedule services on providers that are down.
Goals
- Implement a polling mechanism where DCM checks provider health.
- Define a standard /health endpoint for all Service Providers.
Non-Goals
- Status reporting of individual services running on the provider.
- Deep provider diagnostics (out of scope for liveness check).
- Ensuring that DCM excludes “Unhealthy” or “Unavailable” providers from scheduling.
Proposal
Overview
The DCM Control Plane will act as the “prober.” It will maintain a list of registered service provider URLs. At a configurable interval, DCM will perform an HTTP GET request to each provider’s /health endpoint.
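As an illustration of a single probe, here is a minimal Python sketch using only the standard library. The function names `probe` and `parse_body_status`, the 2-second timeout, and the choice to report failures as `None` are assumptions made for this example; the proposal does not specify them.

```python
import json
import urllib.error
import urllib.request
from typing import Optional, Tuple


def parse_body_status(raw: bytes) -> Optional[str]:
    """Extract the "status" field from a /health body, or None if unusable."""
    try:
        body = json.loads(raw)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return None
    if not isinstance(body, dict):
        return None
    status = body.get("status")
    return status if isinstance(status, str) else None


def probe(base_url: str, timeout: float = 2.0) -> Tuple[Optional[int], Optional[str]]:
    """Issue one GET <base_url>/health and return (http_status, body_status).

    http_status is None on timeout or connection failure (assumed convention).
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            code = resp.status
            raw = resp.read()
    except urllib.error.HTTPError as err:
        # Non-2xx responses surface as HTTPError; keep the status code.
        return err.code, None
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return None, None
    return code, parse_body_status(raw)
```

A caller would pass the registered provider URL, for example `probe("http://<provider-ip>:<port>")`, and feed the returned pair into whatever state handling DCM applies.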
Architecture
Health Polling (High Frequency):
- Initiator: DCM Control Plane.
- Target: Service Provider /health endpoint.
- Frequency: Every 10 seconds (default).
- Success Criteria: HTTP 200 OK.
Resource Synchronization (Low Frequency/On-Demand):
- Note: Detailed resource data (CPU/Memory) continues to be handled via the Provider Info API, but the “Ready” state is governed by the Health Check results.
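The high-frequency polling loop could be sketched as follows. This is one possible shape, not the proposed implementation: `poll_forever`, `list_providers`, `check`, and the `stop` event are hypothetical names, and probing providers sequentially is an assumption (a real controller might probe concurrently).

```python
import threading
from typing import Callable, Iterable

DEFAULT_INTERVAL_SECONDS = 10.0  # default frequency stated in the proposal


def poll_forever(
    list_providers: Callable[[], Iterable[str]],
    check: Callable[[str], None],
    stop: threading.Event,
    interval: float = DEFAULT_INTERVAL_SECONDS,
) -> None:
    """Call check(url) for every provider once per interval until stop is set."""
    while not stop.is_set():
        # Re-read the provider list each round so new registrations are seen.
        for url in list_providers():
            if stop.is_set():
                break
            check(url)
        # Event.wait sleeps for the interval but returns early on shutdown.
        stop.wait(interval)
```

Using `Event.wait` instead of `time.sleep` lets the loop exit promptly when the controller shuts down.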
Health Check Flow
- DCM Controller: Iterates through the list of active providers in the database.
- Probing: For each provider, DCM executes GET http://<provider-ip>:<port>/health.
- State Machine:
  - Ready: If the response is 200 OK and the body status is healthy, reset the failure counter and mark the provider as Ready.
  - Unhealthy: If the response is 200 OK and the body status is unhealthy, mark the provider as Unhealthy. The service provider is reachable but the backing provider is unavailable.
  - Failure: On a timeout or non-200 response, increment the failure counter.
  - Threshold: If failures exceed the FailureThreshold (default: 3), transition the provider to Unavailable.
- Recovery: A single 200 OK with status healthy transitions an Unhealthy or Unavailable provider back to Ready.
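The transitions above can be expressed as a small pure function. The sketch below rests on several assumptions the proposal leaves open: "exceed" is read as strictly greater than FailureThreshold, a 200 OK with a missing or unrecognized status counts as a failure, an unhealthy response leaves the failure counter untouched, and a new provider starts as Ready. The names `ProviderHealth` and `apply_probe` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

READY = "Ready"
UNHEALTHY = "Unhealthy"
UNAVAILABLE = "Unavailable"


@dataclass
class ProviderHealth:
    state: str = READY  # assumption: initial state is not defined in the proposal
    failures: int = 0


def apply_probe(
    provider: ProviderHealth,
    http_status: Optional[int],   # None means timeout or connection failure
    body_status: Optional[str],
    failure_threshold: int = 3,   # FailureThreshold default from the proposal
) -> str:
    """Update provider in place from one probe result and return its new state."""
    if http_status == 200 and body_status == "healthy":
        # Ready and Recovery: a single healthy response resets everything.
        provider.failures = 0
        provider.state = READY
    elif http_status == 200 and body_status == "unhealthy":
        # Reachable, but the backing provider is down.
        provider.state = UNHEALTHY
    else:
        # Timeout, non-200, or (assumption) 200 with an unknown status.
        provider.failures += 1
        if provider.failures > failure_threshold:
            provider.state = UNAVAILABLE
    return provider.state
```

Under this reading, with the default threshold of 3 the provider becomes Unavailable on the fourth consecutive failure. If the intent is "on the third failure", the comparison would be `>=` instead.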
Design Details
Service Provider Implementation
The Service Provider must expose a lightweight unauthenticated (or internally secured) endpoint.
Health Endpoint
Endpoint: GET /health
Expected Response:
- Code: 200 OK
- Body:
{
"status": "healthy",
"version": "v1.2.3",
"uptime": 3600
}
The status field indicates the health of the backing provider:
- healthy: The service provider and its backing provider are operational. DCM marks the provider as Ready.
- unhealthy: The service provider is reachable but the backing provider is unavailable. DCM marks the provider as Unhealthy.
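On the provider side, the endpoint could be served along these lines. This is a minimal standard-library sketch: `backing_provider_ok` is a hypothetical placeholder for a real, cheap check of the backing provider, the version string is copied from the example body, port 8080 is arbitrary, and `uptime` is assumed to be whole seconds since process start (the proposal does not state the unit).

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

VERSION = "v1.2.3"              # placeholder taken from the example body
START_TIME = time.monotonic()   # reference point for uptime


def backing_provider_ok() -> bool:
    """Hypothetical check; replace with a real probe of the backing provider."""
    return True


def health_body(backing_ok: bool, uptime_seconds: float) -> dict:
    """Build the JSON body for GET /health."""
    return {
        "status": "healthy" if backing_ok else "unhealthy",
        "version": VERSION,
        "uptime": int(uptime_seconds),  # assumption: seconds, truncated
    }


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        body = health_body(backing_provider_ok(), time.monotonic() - START_TIME)
        payload = json.dumps(body).encode("utf-8")
        # Always 200 when the service provider itself is up; the body
        # carries the backing provider's health.
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(host: str = "0.0.0.0", port: int = 8080) -> None:
    """Run the health endpoint until interrupted."""
    HTTPServer((host, port), HealthHandler).serve_forever()
```

Returning 200 with an unhealthy body, rather than a 5xx, is what lets DCM distinguish Unhealthy from a failed probe.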
Unhealthy Response Example:
{
"status": "unhealthy",
"version": "v1.2.3",
"uptime": 3600
}
Provider State Summary
| HTTP Response | status field | DCM State |
|---|---|---|
| 200 OK | healthy | Ready |
| 200 OK | unhealthy | Unhealthy |
| Non-200 / Timeout | N/A | Unavailable (after exceeding FailureThreshold) |