Service Provider Health Check
Summary
This enhancement proposes a mechanism for the DCM control plane to actively monitor the health of service providers. Instead of providers pushing heartbeats, the DCM control plane will poll a /health endpoint on the service provider to verify liveness.
Motivation
The DCM control plane needs a way to determine whether a service provider is accessible. Without an active check, the control plane might attempt to schedule services on providers that are down.
Goals
- Implement a polling mechanism where DCM checks provider health.
- Define a standard /health endpoint for all Service Providers.
Non-Goals
- Status reporting of individual services running on the provider.
- Deep provider diagnostics (out of scope for liveness check).
- Ensure DCM excludes “Unhealthy” or “Unreachable” providers from scheduling.
Proposal
Overview
The DCM Control Plane will act as the “prober.” It will maintain a list of registered service provider URLs. At a configurable interval, DCM will perform an HTTP GET request to each provider’s /health endpoint.
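A single probe could be sketched in Python as follows. This is illustrative only: the function name probe, the 2-second timeout, and the choice of urllib are assumptions, not part of the proposal. The sketch treats anything other than a 200 response within the timeout as a failed probe.

```python
import http.client
import urllib.error
import urllib.request


def probe(base_url: str, timeout_s: float = 2.0) -> bool:
    """Return True only if GET <base_url>/health answers 200 within the timeout.

    The name and the default timeout are hypothetical; the proposal only
    specifies the /health path and the 200 OK success criterion.
    """
    url = base_url.rstrip("/") + "/health"
    try:
        # Note: urlopen follows redirects, so a redirect that ends in a
        # 200 response would also count as success in this sketch.
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, DNS failure, timeout, 4xx/5xx (HTTPError is a
        # URLError subclass), a broken response, or a malformed URL all
        # count as a failed probe.
        return False
```

Returning a plain boolean keeps the probe independent of whatever failure counting the caller applies to the result.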
Architecture
Health Polling (High Frequency):
- Initiator: DCM Control Plane.
- Target: Service Provider /health endpoint.
- Frequency: Every 10 seconds (default).
- Success Criteria: HTTP 200 OK.
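The polling schedule described above might be sketched as the loop below. All names (run_poller, provider_urls, on_result, max_rounds) are hypothetical, and probing providers one after another is an assumption; a real control plane may well probe concurrently. The probe itself is passed in as a callable so the loop stays independent of the HTTP details.

```python
import time
from typing import Callable, Iterable, Optional

DEFAULT_INTERVAL_S = 10.0  # default polling frequency from the proposal


def run_poller(
    provider_urls: Callable[[], Iterable[str]],
    probe: Callable[[str], bool],
    on_result: Callable[[str, bool], None],
    interval_s: float = DEFAULT_INTERVAL_S,
    max_rounds: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Probe every provider once per interval and report each result."""
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        started = time.monotonic()
        # Re-read the provider list each round so new registrations are seen.
        for url in provider_urls():
            on_result(url, probe(url))
        rounds += 1
        # Sleep only for the rest of the interval so that slow probes
        # do not stretch the polling cycle.
        elapsed = time.monotonic() - started
        sleep(max(0.0, interval_s - elapsed))
```

Leaving max_rounds as None runs the loop indefinitely; a finite value is mainly useful for tests and dry runs.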
Resource Synchronization (Low Frequency/On-Demand):
- Note: Detailed resource data (CPU/Memory) continues to be handled via the Provider Info API, but the “Ready” state is governed by the Health Check results.
Health Check Flow
- DCM Controller: Iterates through the list of active providers in the database.
- Probing: For each provider, DCM executes: GET http://<provider-ip>:<port>/health.
- State Machine:
  - Success: If response is 200 OK, reset failure counter and mark as Ready.
  - Failure: If timeout or non-200 response, increment failure counter.
  - Threshold: If failures exceed the FailureThreshold (default: 3), transition provider to NotReady.
- Recovery: A single successful 200 OK transitions a NotReady provider back to Ready.
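The state machine above can be sketched as a small per-provider tracker. This is a minimal sketch under two assumptions that the proposal does not spell out: a provider starts in Ready, and "exceed" is read literally, so with the default threshold of 3 the fourth consecutive failure triggers the transition. The class and method names are hypothetical.

```python
from dataclasses import dataclass

READY = "Ready"
NOT_READY = "NotReady"


@dataclass
class ProviderHealth:
    """Tracks consecutive probe failures for one provider (hypothetical name)."""

    failure_threshold: int = 3  # FailureThreshold default from the proposal
    state: str = READY          # assumption: providers start out Ready
    failures: int = 0

    def record(self, ok: bool) -> str:
        """Apply one probe result and return the resulting state."""
        if ok:
            # Success and Recovery: one 200 OK resets the counter and
            # returns the provider to Ready, whatever its previous state.
            self.failures = 0
            self.state = READY
        else:
            self.failures += 1
            # Assumption: "exceed" means strictly greater than the threshold.
            if self.failures > self.failure_threshold:
                self.state = NOT_READY
        return self.state
```

For example, three failed probes in a row leave a provider Ready under this reading, a fourth moves it to NotReady, and the next successful probe moves it back to Ready. If the intent is to transition on the third failure, the comparison would be >= instead.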
Design Details
Service Provider Implementation
The Service Provider must expose a lightweight unauthenticated (or internally secured) endpoint.
Health Endpoint
Endpoint: GET /health
Expected Response:
- Code: 200 OK
- Body: (Optional)
{
"status": "pass",
"version": "v1.2.3",
"uptime": 3600
}
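On the provider side, an endpoint of this shape could be sketched with Python's standard http.server module. This is only an illustration: the names health_payload, HealthHandler, and serve, the port 8080, the application/json content type, and the version string (copied from the example body) are all assumptions.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START = time.monotonic()
VERSION = "v1.2.3"  # placeholder taken from the example body above


def health_payload() -> dict:
    """Build the optional response body; uptime is whole seconds since start."""
    return {
        "status": "pass",
        "version": VERSION,
        "uptime": int(time.monotonic() - START),
    }


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        body = json.dumps(health_payload()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request logging so frequent probes do not flood the log.
        pass


def serve(port: int = 8080) -> None:
    """Serve /health until interrupted (the port number is an assumption)."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling serve() starts the listener. The handler does no I/O beyond formatting the body, which keeps the endpoint lightweight enough to be probed every few seconds.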