Improving DirectFileStore for performance and scalability #281

Open
@stanhu

At GitLab, we've been running prometheus-client-mmap to record and report Prometheus metrics. It's an early fork of this project and uses a C extension with mmap() to optimize both the read and write paths.

We'd like to switch to DirectFileStore, but we see a number of issues:

  1. File descriptor usage: I don't believe DirectFileStore's approach of recording one metric per file is going to work at scale. With thousands of metrics, that's a lot of file descriptors that may have to be opened to read and write metrics. In prometheus-client-mmap we use one file per metric type (counter, histogram, gauge). I see there was an attempt to use one file per process in "One file per process instead of per metric/process for DirectFileStore [DOESNT WORK]" (#161).
  2. Read performance (related to "CPU usage growing over time with DirectFileStore" #232, "DirectFileStore creates too many files for long running applications" #143, and "slow response on metrics http endpoint" #194): Aggregating thousands of metrics in Ruby is pretty CPU and memory intensive. We found that this was difficult to do efficiently even with a Go exporter that reads the prometheus-client-mmap metrics, since garbage collection becomes an issue due to memory allocations. We have a prototype Rust port that handles reading metrics and is faster than the C implementation.
  3. Aggregating metrics in a separate process: The metrics stored by prometheus-client-mmap can be aggregated by a separate process since the metric type is in the filename. With DirectFileStore, I believe the types are registered in the registry, so there's no way an outside process can determine the metric type just by scanning the .bin files.
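
To illustrate point 3, here is a rough sketch of how an outside process could group files by metric type when the type is part of the filename. The `<type>_<pid>.bin` naming scheme and both helper names are hypothetical, chosen only for illustration; they are not the actual layout used by prometheus-client-mmap or DirectFileStore.

```ruby
# Hypothetical naming scheme: "<type>_<pid>.bin", e.g. "counter_1234.bin".
METRIC_TYPES = %w[counter gauge histogram].freeze

# Returns the metric type as a symbol, or nil when the filename
# does not start with a known type prefix.
def metric_type_from_filename(path)
  prefix = File.basename(path).split('_', 2).first
  METRIC_TYPES.include?(prefix) ? prefix.to_sym : nil
end

# Groups paths by metric type; files with no recognizable type end up under nil.
def group_files_by_type(paths)
  paths.group_by { |path| metric_type_from_filename(path) }
end
```

With a scheme like this, the aggregator can pick the right merge strategy per group without loading the Ruby registry. A filename that carries only the metric name would fall into the `nil` group, which is the situation described above.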

I would like to propose a path forward:

  1. Switch DirectFileStore to use one file per process, or one file per metric type. The latter is simpler in our experience. If we do the former, we'd probably want to encode the metric type in the file.
  2. Add an optional Rust extension for reading metrics generated by DirectFileStore. We can easily adapt the work in our Rust port for DirectFileStore.
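
As a sketch of what "encode the metric type in the file" could look like for the one-file-per-process option, each entry below carries a one-byte type code ahead of the metric name and value. The record layout, the `TYPE_CODES` table, and the `write_entry`/`read_entries` helpers are all assumptions made for illustration; this is not the current DirectFileStore format.

```ruby
require 'stringio'

# Hypothetical type codes; any stable mapping would do.
TYPE_CODES = { counter: 1, gauge: 2, histogram: 3, summary: 4 }.freeze

# Entry layout (assumed): type code (1 byte), name length (uint32 LE),
# name bytes, value (float64 LE).
def write_entry(io, type, name, value)
  name_bytes = name.b
  io.write([TYPE_CODES.fetch(type), name_bytes.bytesize].pack('CV'))
  io.write(name_bytes)
  io.write([value].pack('E'))
end

# Reads entries back as [type, name, value] triples.
def read_entries(io)
  entries = []
  until io.eof?
    code, length = io.read(5).unpack('CV')
    name = io.read(length).force_encoding('UTF-8')
    value = io.read(8).unpack1('E')
    entries << [TYPE_CODES.key(code), name, value]
  end
  entries
end
```

A reader in any language could then recover the type of every entry from the file alone, for example by calling `read_entries` on a file opened in binary mode.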

A side point: I believe the metrics are always written in the native endian format. I propose that we enforce little endian (which is what x86 and ARM64 use) to avoid cross-platform confusion.
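
For reference, Ruby's `Array#pack` already distinguishes native from fixed byte order for doubles: `D`/`d` is native-endian, `E` is little-endian, and `G` is big-endian. A minimal sketch of enforcing little endian, assuming values are stored as 64-bit floats (the helper names are made up for this example):

```ruby
# Encode a Float as a little-endian IEEE 754 double, regardless of host byte order.
def encode_value(value)
  [value].pack('E')
end

# Decode a little-endian double written by encode_value.
def decode_value(bytes)
  bytes.unpack1('E')
end

# True when the host's native 16-bit layout matches the little-endian layout.
def host_little_endian?
  [1].pack('S') == [1].pack('v')
end
```

On a little-endian host, `[1.0].pack('D')` and `encode_value(1.0)` produce the same bytes, so switching directives would not change existing files on x86 or ARM64.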

@dmagliola What do you think?