Skip to content
Service Provider Health Check

Service Provider Health Check

Summary

This enhancement proposes a mechanism for the DCM control plane to actively monitor the health of service providers. Instead of providers pushing heartbeats, the DCM control plane will poll a /health endpoint on the service provider to verify liveness and backing provider health.

Motivation

Define the DCM control plane way to determine if a service provider is accessible. Without an active check, the control plane might attempt to schedule services on providers that are down.

Goals

  • Implement a polling mechanism where DCM checks provider health.
  • Define a standard /health endpoint for all Service Providers.

Non-Goals

  • Status reporting of individual services running on the provider.
  • Deep provider diagnostics (out of scope for liveness check).
  • Ensure DCM excludes “Unhealthy” or “Unreachable” providers from scheduling.

Proposal

Overview

The DCM Control Plane will act as the “prober.” It will maintain a list of registered service providers URLs. At a configurable interval, DCM will perform an HTTP GET request to the provider’s /health endpoint.

Architecture

  1. Health Polling (High Frequency):

    • Initiator: DCM Control Plane.
    • Target: Service Provider /health endpoint.
    • Frequency: Every 10 seconds (default).
    • Success Criteria: HTTP 200 OK.
  2. Resource Synchronization (Low Frequency/On-Demand):

    • Note: Detailed resource data (CPU/Memory) continues to be handled via the Provider Info API, but the “Ready” state is governed by the Health Check results.

Health Check Flow

  1. DCM Controller: Iterates through the list of active providers in the database.
  2. Probing: For each provider, DCM executes: GET http://<provider-ip>:<port>/health.
  3. State Machine:
    • Ready: If response is 200 OK and body status is healthy, reset failure counter and mark as Ready.
    • Unhealthy: If response is 200 OK and body status is unhealthy, mark as Unhealthy. The service provider is reachable but the backing provider is unavailable.
    • Failure: If timeout or non-200 response, increment failure counter.
    • Threshold: If failures exceed the FailureThreshold (default: 3), transition provider to Unavailable.
  4. Recovery: A single 200 OK with status healthy transitions an Unhealthy or Unavailable provider back to Ready.

Design Details

Service Provider Implementation

The Service Provider must expose a lightweight unauthenticated (or internally secured) endpoint.

Health Endpoint

Endpoint: GET /health

Expected Response:

  • Code: 200 OK
  • Body:
{
  "status": "healthy",
  "version": "v1.2.3",
  "uptime": 3600
}

The status field indicates the health of the backing provider:

  • healthy — The service provider and its backing provider are operational. DCM marks the provider as Ready.
  • unhealthy — The service provider is reachable but the backing provider is unavailable. DCM marks the provider as Unhealthy.

Unhealthy Response Example:

{
  "status": "unhealthy",
  "version": "v1.2.3",
  "uptime": 3600
}

Provider State Summary

HTTP Responsestatus fieldDCM State
200 OKhealthyReady
200 OKunhealthyUnhealthy
Non-200 / TimeoutN/AUnavailable (after exceeding FailureThreshold)