Vantage Launches Persistent Metrics Recovery for Kubernetes

by Vantage Team


Today, Vantage is launching Persistent Metrics Recovery for Kubernetes: a reliability enhancement that ensures resource utilization metrics collected by the Vantage Kubernetes agent are retained when the Vantage upload endpoint is unreachable. The endpoint can become unreachable because of environment configuration changes that block outbound communication, or because of outages affecting Internet Service Providers, cloud providers, or Vantage itself. With this update, the agent persistently stores metric reports in a specified fallback location and retries uploads later, ensuring reliable reporting even during prolonged connectivity issues.

The Vantage Kubernetes agent collects hourly resource utilization metrics and sends them to the Vantage API for ingestion and analysis in the Vantage console. Historically, when the Vantage API was unreachable for more than 1–2 hours, the agent’s in-memory buffer would lose raw metrics before hourly reports could be generated. These metrics were unrecoverable, resulting in gaps in customers’ Kubernetes cost and usage data.

Now, with Persistent Metrics Recovery, the Vantage Kubernetes agent detects failed uploads of hourly generated metric reports and stores them in a fallback location for up to 96 hours until they can be successfully uploaded. There is no additional configuration required for this feature. The hourly reports are stored in the customer’s existing data persistence location, which can be a Persistent Volume (default storage location) or a specified S3 location. The system periodically retries uploading these hourly utilization reports and clears them once they are successfully delivered. This approach ensures hourly data is preserved for up to 96 hours, even during extended API or S3 outages.

This improvement is available to all Kubernetes customers starting today. To get started, upgrade your Kubernetes agent to v1.0.29. For more details, see the Kubernetes Agent documentation, or check the agent logs for recovery and retry events.

Frequently Asked Questions

1. What is being launched today?

Vantage is introducing Persistent Metrics Recovery for the Vantage Kubernetes agent. This feature stores hourly resource utilization reports for up to 96 hours when uploads to the Vantage API fail. The agent periodically retries these uploads until the reports are successfully delivered.

2. Who is the customer?

Any Vantage customer running the Kubernetes agent in their cluster who relies on consistent utilization metrics.

3. How much does this cost?

There is no additional cost. Persistent Metrics Recovery is included by default for all Vantage Kubernetes agent users.

4. How does it work?

When the agent fails to upload a generated hourly report (due to a Vantage API or S3 outage), it now writes that report to local disk (or a fallback S3 location). The agent keeps retrying the upload until each report is either delivered successfully or ages out after 96 hours. Reports that cannot be sent within that window are purged to prevent storage bloat.

By default, reports are stored in the attached Persistent Volume, but you can configure data persistence to use S3 instead.
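The store, retry, and purge cycle described above can be sketched roughly as follows. This is an illustrative sketch only, not the agent's actual implementation: the file layout, the `store_report` and `retry_pending` function names, and the `upload` callback are all hypothetical. Only the 96-hour retention window comes from this post.

```python
import json
from pathlib import Path
from typing import Callable

# Retention window stated in this post; everything else here is hypothetical.
MAX_AGE_SECONDS = 96 * 3600


def store_report(spool_dir: Path, report: dict, created_at: float) -> Path:
    """Persist a report whose upload failed (hypothetical file layout)."""
    spool_dir.mkdir(parents=True, exist_ok=True)
    path = spool_dir / f"report-{int(created_at)}.json"
    path.write_text(json.dumps({"created_at": created_at, "report": report}))
    return path


def retry_pending(spool_dir: Path, upload: Callable[[dict], bool], now: float) -> dict:
    """Retry each stored report: purge expired ones, delete delivered ones."""
    counts = {"delivered": 0, "purged": 0, "kept": 0}
    for path in sorted(spool_dir.glob("report-*.json")):
        entry = json.loads(path.read_text())
        if now - entry["created_at"] > MAX_AGE_SECONDS:
            # Older than the retention window: drop it to keep storage bounded.
            path.unlink()
            counts["purged"] += 1
        elif upload(entry["report"]):
            # Delivered successfully: the stored copy is no longer needed.
            path.unlink()
            counts["delivered"] += 1
        else:
            # Still failing: leave it on disk for the next retry pass.
            counts["kept"] += 1
    return counts
```

In this sketch, `upload` stands in for whatever call sends a report to the Vantage API and returns `True` on success. Because reports are generated hourly and expire after 96 hours, the number of stored files stays bounded.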

5. What are the reasons that my data might not successfully sync with Vantage?

This could be caused by outages affecting either S3 or the Vantage API, configuration changes in customer environments that prevent external communication, or broader internet outages that prevent communication with the Vantage API.

6. What happens if an outage lasts more than 96 hours?

Data older than 96 hours will be purged to avoid unbounded storage consumption. Any hourly data not successfully uploaded within this window may still be lost.

7. Does this increase the agent’s resource usage?

The mechanism is designed to have minimal impact during normal operations. Disk usage is bounded, and logs are provided to track recovery and retry behavior.

8. Will this consume additional memory for my clusters?

No. Because these reports are written to an external backup location, this will not impact the memory utilization of your running workloads.

9. How can I tell if my agent is in this state or has been in this state?

The agent publishes a metric called vantage_last_report_timestamp_seconds, which lets you compute the time elapsed since the last report. Under normal circumstances, that delta is under an hour. Additionally, you can view the agent logs by running the kubectl logs <pod-name> command.
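As a rough illustration of that check, the sketch below computes the age of the last report from scraped metrics text. It assumes the metric is exposed in the Prometheus text exposition format and that its value is a Unix timestamp in seconds (as the name suggests); neither is confirmed here. The `last_report_age_seconds` helper is hypothetical and not part of the agent.

```python
import time
from typing import Optional

METRIC = "vantage_last_report_timestamp_seconds"


def last_report_age_seconds(metrics_text: str, now: Optional[float] = None) -> Optional[float]:
    """Return seconds since the last report, or None if the metric is absent.

    Assumes Prometheus text format and a Unix-seconds value. Simplistic
    parsing: a '}' inside a quoted label value is not handled.
    """
    now = time.time() if now is None else now
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and unrelated metrics.
        if not line or line.startswith("#") or not line.startswith(METRIC):
            continue
        remainder = line[len(METRIC):]
        if remainder.startswith("{"):
            # Drop the label set, e.g. {cluster="prod"}.
            remainder = remainder[remainder.index("}") + 1:]
        elif not remainder.startswith((" ", "\t")):
            # A different metric that merely shares this prefix.
            continue
        return now - float(remainder.split()[0])
    return None
```

A result well above 3600 seconds would suggest the agent has been unable to deliver reports and may be holding them in its fallback location.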

10. Is any additional configuration required to enable this feature?

This behavior is enabled by default in the latest version of the Kubernetes agent. If you choose to use S3 for your persistent storage, you will need to configure your storage preference.