Service Provider Status Reporting
Summary
This proposal outlines the event-driven architecture for reporting Service Provider resource states (e.g., VM or Container status). By leveraging the messaging system and CloudEvents as the standard data format, we establish a scalable “fire-and-forget” mechanism. Service Providers will publish status updates to a message bus, which the DCM subscribes to, ensuring the observed state of the system is updated in near real-time without tight coupling or API bottlenecks.
Motivation
As the platform scales to support thousands of instances across multiple
providers, the current synchronous HTTP PUT model poses significant risks:
- Scalability: High-frequency status updates (e.g., during mass provisioning or region recovery) can flood the DCM API.
- Coupling: Providers require knowledge of specific DCM endpoints and must implement complex retry logic for downtime.
- Extensibility: Adding new consumers (e.g., Billing, Auditing) requires modifying the Provider to make additional API calls.
Moving to a messaging system and CloudEvents resolves these issues by decoupling the Producer (Provider) from the Consumer (DCM) and standardizing the event envelope.
Goals
- Define a standardized, event-driven contract for Service Providers to report the “observed state” of resources.
- Decouple the Service Provider implementation from the DCM backend logic.
- Support high-throughput status reporting without degrading DCM API performance.
- Standardize event metadata using the CNCF CloudEvents specification.
Non-Goals
- Defining the internal monitoring logic of specific Service Providers.
- Defining “Provider Health” (heartbeats), which is covered in a separate proposal.
- Authentication between DCM and SPs.
Proposal
We propose adopting a Kubernetes-style Declarative Model where Service Providers act as Publishers and the DCM acts as a Subscriber.
Instead of calling a specific API endpoint, Providers will emit events to a message bus subject whenever an instance’s state changes. The payload will adhere to a strict schema (VmStatus/ContainerStatus) wrapped in a CloudEvent envelope.
User Stories
- As a Service Provider Developer, I want to reliably publish a status update message (“fire and forget”) so that I don’t have to implement complex retry logic if the DCM is briefly unavailable.
- As a Platform Admin, I want to see the status of VMs update in real-time on my dashboard without waiting for a polling interval.
- As a Billing System Maintainer, I want to listen to “Instance Stopped” events to calculate costs without asking the Core Team to build a new API for me.
Risks and Mitigations
| Risk | Mitigation |
|---|---|
| Message Loss | For critical transitions, messaging-system persistence can be enabled to provide at-least-once delivery; routine updates remain fire-and-forget so the database is not overloaded. |
| Flooding/Flapping | Providers must implement debounce logic to avoid publishing rapid status oscillations (e.g., running->error->running within milliseconds); see the sketch below. |
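A minimal sketch of the debounce mitigation (all names and the window size are illustrative): each new observation resets a per-instance timer, so only the state that survives a quiet period is published, and a running->error->running flap collapses into at most one message.

```go
import (
	"sync"
	"time"
)

// debouncer coalesces rapid per-instance status flaps: each new observation
// resets a timer, and only the state that survives the quiet window is published.
type debouncer struct {
	mu      sync.Mutex
	pending map[string]*time.Timer // instance id -> queued publish
	window  time.Duration          // e.g. 500 * time.Millisecond
	publish func(id, status string)
}

func newDebouncer(window time.Duration, publish func(id, status string)) *debouncer {
	return &debouncer{
		pending: make(map[string]*time.Timer),
		window:  window,
		publish: publish,
	}
}

func (d *debouncer) observe(id, status string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if t, ok := d.pending[id]; ok {
		t.Stop() // a newer observation supersedes the queued one
	}
	d.pending[id] = time.AfterFunc(d.window, func() {
		d.publish(id, status)
	})
}
```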
Design Details
1. Flow diagram
```mermaid
sequenceDiagram
    autonumber
    participant Provider as Service Provider
    participant MS as Messaging System
    participant DCM as DCM Core Service
    participant DB as DCM Database
    participant Billing as Billing Service
    Note over Provider: Instance state changes
    rect rgb(240, 248, 255)
        Provider->>MS: PUBLISH CloudEvent Message
    end
    par Fan-Out Process
        MS->>DCM: PUSH Message
        DCM->>DCM: Validate CloudEvent Schema
        alt is valid
            DCM->>DB: UPSERT Instance Status
        else is invalid
            DCM->>DCM: Log Validation Error / Discard
        end
    and Other subscribers
        MS-->>Billing: PUSH Message
    end
```
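To make steps 3-5 of the diagram concrete, below is a minimal sketch of the DCM-side handler, assuming the Go CloudEvents SDK; the function name and table schema are illustrative, and VmStatus is the payload type defined in section 3.

```go
import (
	"database/sql"
	"encoding/json"
	"log"

	cloudevents "github.com/cloudevents/sdk-go/v2"
)

// handleStatusMessage validates an incoming CloudEvent and upserts the
// observed state, mirroring the "Validate" and "UPSERT" steps above.
func handleStatusMessage(raw []byte, db *sql.DB) {
	var event cloudevents.Event
	if err := json.Unmarshal(raw, &event); err != nil {
		log.Printf("discarding malformed message: %v", err)
		return
	}
	if err := event.Validate(); err != nil {
		log.Printf("discarding invalid CloudEvent: %v", err)
		return
	}
	var status VmStatus // payload schema from section 3
	if err := event.DataAs(&status); err != nil {
		log.Printf("discarding undecodable payload: %v", err)
		return
	}
	// Upsert keyed by instance id; the table layout is hypothetical.
	if _, err := db.Exec(
		`INSERT INTO instance_status (id, status, message) VALUES ($1, $2, $3)
		 ON CONFLICT (id) DO UPDATE SET status = EXCLUDED.status, message = EXCLUDED.message`,
		status.Id, status.Status, status.Message,
	); err != nil {
		log.Printf("upsert failed: %v", err)
	}
}
```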
2. Messaging System Subject Hierarchy
Providers must publish messages to a subject based on the service type:
dcm.{serviceType}
serviceType: The type of resource (e.g., vm, container, cluster).
The service type determines the message schema and is the only routing-relevant token. All other context — provider identity, instance identifier, timestamps — is carried in the CloudEvent envelope attributes (see section 3).
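For illustration, the concrete subjects under this scheme are:

```
dcm.vm         # VM status events (VmStatus payload)
dcm.container  # container status events (ContainerStatus payload)
dcm.cluster    # cluster status events
```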
3. CloudEvents Format
All messages must be valid JSON CloudEvents (v1.0). We currently define only a
very simple format for VmStatus and ContainerStatus.
```go
type VmStatus struct {
	Id      string `json:"id"`
	Status  string `json:"status"`
	Message string `json:"message"`
}

type ContainerStatus struct {
	Id      string `json:"id"`
	Status  string `json:"status"`
	Message string `json:"message"`
}
```
Example Go event:
cloudevents "github.com/cloudevents/sdk-go/v2"
type VmStatus struct {
Id string `json:"id"`
Status string `json:"status"`
Message string `json:"message"`
}
event := cloudevents.NewEvent()
event.SetID("event-123-456")
event.SetSource("dcm/providers/{providerName}")
event.SetType("dcm.status.vm")
event.SetSubject("dcm.{serviceType}")
event.SetData(cloudevents.ApplicationJSON, VmStatus{Id, "123-123", Status: "Running", Message: "VM is running."})4. Status mapping
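On the wire, the snippet above serializes to a structured-mode JSON CloudEvent roughly like the following (the SDK may add further attributes such as time):

```json
{
  "specversion": "1.0",
  "id": "event-123-456",
  "source": "dcm/providers/{providerName}",
  "type": "dcm.status.vm",
  "subject": "dcm.{serviceType}",
  "datacontenttype": "application/json",
  "data": {
    "id": "123-123",
    "status": "Running",
    "message": "VM is running."
  }
}
```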
4. Status mapping
To ensure a consistent user experience across different cloud backends (e.g., AWS, Azure, On-Premise), the DCM enforces a strict Generic Status Enum. Service Providers are responsible for normalizing their internal raw state into these generic states before publishing the CloudEvent.
VM Status
Providers must map their hypervisor-specific states to the following DCM
Lifecycle Phases: PROVISIONING, RUNNING, STOPPED, ERROR, DELETED,
DELETING, PAUSED, STOPPING.
| DCM Generic Status | AWS EC2 Equivalent | Azure VM Equivalent | VMWare Equivalent |
|---|---|---|---|
| PROVISIONING | pending | Creating | PoweredOff |
| RUNNING | running | running | PoweredOn |
| STOPPED | stopped | stopped, deallocated | PoweredOff |
| ERROR | terminated, error | Failed | Error |
| DELETED | terminated | Deleted | Ref Not Found |
| DELETING | shutting-down | Deleting | Destroying |
| PAUSED | N/A (AWS does not pause, only stop/hibernate) | paused | Suspended |
| STOPPING | stopping | stopping | GuestOS Shutting Down |
Note: If a provider has a state that is ambiguous, it should default to the
closest “active” state, or to ERROR if functionality is impaired.
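As an illustration of this normalization, an AWS-backed provider might map raw EC2 states like so (a sketch; the function name is hypothetical, and terminated, which the table lists under both ERROR and DELETED, is resolved to DELETED here):

```go
// normalizeEC2State maps raw EC2 instance states to the DCM generic enum,
// following the VM status table above.
func normalizeEC2State(raw string) string {
	switch raw {
	case "pending":
		return "PROVISIONING"
	case "running":
		return "RUNNING"
	case "stopping":
		return "STOPPING"
	case "stopped":
		return "STOPPED"
	case "shutting-down":
		return "DELETING"
	case "terminated":
		// a terminated instance is treated as gone rather than failed
		return "DELETED"
	default:
		// ambiguous or impaired states default to ERROR per the note above
		return "ERROR"
	}
}
```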
Container status
For container providers, we align closely with the Kubernetes Pod phase model,
simplified for general consumption. Target Statuses: PENDING,
RUNNING, SUCCEEDED, FAILED, UNKNOWN.
| DCM Generic Status | Kubernetes / Docker Equivalent |
|---|---|
| PENDING | Pending, ContainerCreating, ImagePullBackOff |
| RUNNING | Running |
| SUCCEEDED | Succeeded, Exited (0) |
| FAILED | Failed, CrashLoopBackOff, Exited (non-zero) |
| UNKNOWN | Unknown (Node lost) |
Cluster status
For managed clusters (e.g., K8s clusters), the status reflects the health of the
control plane and worker nodes as a single unit. Target Statuses:
CREATING, ACTIVE, UPDATING, DEGRADED, DELETED.
| DCM Generic Status | Description |
|---|---|
| CREATING | Control plane is provisioning; API is not yet reachable. |
| ACTIVE | Control plane is healthy and minimum worker nodes are ready. |
| UPDATING | Rolling upgrade in progress (API remains reachable). |
| DEGRADED | Control plane is reachable, but critical components are unhealthy. |
| DELETED | Cluster resources have been de-provisioned. |
Messaging Systems
We evaluated several messaging systems and architectural alternatives.
Apache Kafka is a distributed event streaming platform known for its high durability and strict ordering, which allows for replaying historical events. While excellent for long-term data retention and audit trails, its heavy operational footprint (requiring ZooKeeper or KRaft clusters) and higher end-to-end latency make it less ideal for simple, real-time ephemeral state synchronization.
RabbitMQ offers robust reliability and complex routing capabilities through its “Exchange” architecture, ensuring messages are rarely lost via mature acknowledgement mechanisms. However, its “smart broker” design can become a throughput bottleneck during high-load bursts, and managing queues for thousands of dynamic provider instances adds significant configuration overhead.
REST API (Synchronous HTTP): Sticking with the status quo of synchronous PUT requests offers the highest simplicity and ease of debugging using standard HTTP tools. However, this approach enforces tight coupling between the Provider and DCM; high-frequency bursts (such as a region recovery) can overwhelm the API server and force developers to implement complex retry logic.
gRPC Streaming provides high performance and strong typing via binary Protobuf serialization, ensuring low-latency communication over HTTP/2. The primary downside is its point-to-point nature; it lacks inherent “fan-out” capabilities, requiring a custom dispatcher implementation to forward status updates to multiple downstream consumers like Billing or Auditing.
NATS is a lightweight messaging system designed for high scalability, offering “fire-and-forget” publishing and efficient subject-based fan-out to multiple subscribers. While it drastically reduces operational overhead and latency, it defaults to “at-most-once” delivery, meaning persistence (via JetStream) is required if strict delivery guarantees are needed over raw speed.
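For scale, if NATS were the chosen bus, provider-side publishing reduces to a few lines (a sketch using the nats.go client; connection setup and error handling are elided, and the function name is illustrative):

```go
import (
	"encoding/json"

	cloudevents "github.com/cloudevents/sdk-go/v2"
	"github.com/nats-io/nats.go"
)

// publishStatus sends a CloudEvent to its subject (e.g. "dcm.vm") as a
// fire-and-forget message; stricter delivery guarantees would require JetStream.
func publishStatus(nc *nats.Conn, event cloudevents.Event) error {
	payload, err := json.Marshal(event) // structured-mode JSON encoding
	if err != nil {
		return err
	}
	return nc.Publish(event.Subject(), payload)
}
```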