Decomposing OpenTelemetry Collector Configuration for Maintainability

When your collector configuration grows beyond a few hundred lines, you start feeling the friction. Pull request reviews become exercises in scrolling. Testing a single processor change means deploying an entire pipeline. Environment-specific variations creep in through copy-paste, and suddenly you have three nearly-identical YAML files that drift apart over time. The monolithic collector configuration that worked well during initial deployment becomes a liability at scale.
The OpenTelemetry Collector provides configuration providers that enable modular configurations, but the documentation tends to focus on individual features rather than composition patterns. This post examines practical strategies for decomposing collector configurations into maintainable, testable units.
The monolith problem
Consider a typical production collector configuration. It starts innocently enough: OTLP receiver, batch processor, OTLP exporter. Then you add Kubernetes metadata enrichment, tail sampling for traces, filtering for noisy health check spans, resource detection for cloud provider attributes, and suddenly you have a 500-line YAML file that handles traces, metrics, and logs across multiple pipelines.
The problems compound in predictable ways. When someone submits a pull request to modify the tail sampling policy, the reviewer must mentally parse the entire configuration to understand context. When a team wants to test a new transform processor statement, they cannot easily isolate that piece from the rest of the pipeline. When you deploy to staging versus production, environment-specific values get mixed with structural configuration, making it difficult to identify what actually differs between deployments.
The collector's configuration merging behavior and provider system offer a path forward, but the patterns for using them effectively are not immediately obvious from the documentation.
Configuration providers and merging
The collector supports multiple configuration sources through providers. The most commonly used are the file provider (file:), environment provider (env:), HTTP provider (http://, https://), and YAML provider (yaml:). Each provider resolves a URI to configuration content, and the collector merges configurations from multiple sources in the order specified.
otelcol --config file:base.yaml --config file:overrides.yaml
When the collector receives multiple configuration sources, it performs a deep merge. Keys from later sources override keys from earlier sources at each level of the hierarchy. This merge behavior is the foundation for decomposition: you can split configuration by concern and let the merge operation assemble the final result.
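The merge rule is easy to state precisely in code. The sketch below is an illustration of the behavior, not the collector's actual implementation (the confmap package implements this in Go): nested maps merge key by key, while any other value, including arrays, is replaced wholesale by the later source.

```python
def deep_merge(base: dict, overlay: dict) -> dict:
    """Merge overlay into base the way the collector merges --config sources:
    maps merge recursively, key by key; any other value, including lists,
    is replaced wholesale by the later source."""
    result = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = value
    return result

base = {"processors": {"batch": {"timeout": "1s", "send_batch_size": 1024}}}
overlay = {"processors": {"batch": {"timeout": "5s"}}}
merged = deep_merge(base, overlay)
# merged["processors"]["batch"] is {"timeout": "5s", "send_batch_size": 1024}
```

Note that the untouched send_batch_size key survives the merge; only the keys the overlay actually sets are overridden.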
The environment provider substitutes environment variable values within configuration files using the ${env:VAR_NAME} syntax; the bare ${VAR_NAME} shorthand is also accepted, though the explicit env: form is the recommended spelling.
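For instance, an exporter endpoint can be deferred to deployment time (OTLP_ENDPOINT and TENANT_ID here are hypothetical variable names):

```yaml
exporters:
  otlp:
    # Substituted from the environment when the configuration is resolved
    endpoint: ${env:OTLP_ENDPOINT}
    headers:
      # Falls back to the default when TENANT_ID is unset
      x-tenant: ${env:TENANT_ID:-default-tenant}
```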
The file provider supports recursive inclusion through the ${file:path} syntax within configuration files. This enables configuration fragments to reference other fragments, building up complex configurations from smaller pieces.
Decomposition strategies
Three primary patterns emerge for organizing collector configurations: splitting by component type, splitting by signal pipeline, and layering environment-specific overlays. Each serves different organizational needs, and they can be combined.
Splitting by component type
The first pattern separates receivers, processors, and exporters into distinct files. This works well when teams own different parts of the telemetry pipeline. The platform team might own receiver configurations, while the observability team owns processor logic, and the SRE team manages exporter destinations.
collector/
  base.yaml        # service section, extensions
  receivers.yaml   # all receiver definitions
  processors.yaml  # all processor definitions
  exporters.yaml   # all exporter definitions
The base configuration defines the service section and references components by name:
# base.yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [otlp]
Component files define the actual configurations:
# receivers.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod

# processors.yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: ${env:MEMORY_LIMIT_MIB:-512}
    spike_limit_mib: ${env:SPIKE_LIMIT_MIB:-128}
  k8sattributes:
    auth_type: serviceAccount
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.deployment.name
  batch:
    timeout: 1s
    send_batch_size: 1024
The collector assembles these with multiple --config flags:
otelcol --config file:base.yaml \
  --config file:receivers.yaml \
  --config file:processors.yaml \
  --config file:exporters.yaml
This pattern makes pull requests smaller and more focused. A change to the batch processor configuration only touches processors.yaml, and reviewers can evaluate it in isolation.
Splitting by signal pipeline
When different teams own different telemetry signals, splitting by pipeline makes more sense. The traces team iterates on sampling policies while the metrics team focuses on aggregation rules. Each signal gets its own configuration file containing receivers, processors, and exporters relevant to that signal.
collector/
  common.yaml   # shared extensions, telemetry settings
  traces.yaml   # trace pipeline: receivers, processors, exporters, service.pipelines.traces
  metrics.yaml  # metrics pipeline: receivers, processors, exporters, service.pipelines.metrics
  logs.yaml     # logs pipeline: receivers, processors, exporters, service.pipelines.logs
The common file contains shared infrastructure:
# common.yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: localhost:1777
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: ${env:MEMORY_LIMIT_MIB:-512}
service:
  extensions: [health_check, pprof]
  telemetry:
    logs:
      level: ${env:LOG_LEVEL:-info}
      encoding: json
    metrics:
      level: detailed
      readers:
        - pull:
            exporter:
              prometheus:
                host: 0.0.0.0
                port: 8888
Each signal file is self-contained for its domain:
# traces.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 50000
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 2000
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
  batch:
    timeout: 1s
    send_batch_size: 512
exporters:
  otlp:
    endpoint: ${env:TRACES_BACKEND_ENDPOINT}
    tls:
      insecure: false
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp]
The merge operation combines the service.pipelines sections from each file, resulting in a complete configuration with all three signal pipelines.
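Assuming metrics.yaml and logs.yaml mirror the shape of traces.yaml, the resolved service section looks roughly like this (a sketch of the merge result, not output copied from the collector):

```yaml
service:
  extensions: [health_check, pprof]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp]
    metrics:                       # assumed from a metrics.yaml like the traces file
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
    logs:                          # assumed from a logs.yaml like the traces file
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```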
Environment-specific overlays
Production, staging, and development environments differ in endpoints, resource limits, and sometimes pipeline structure. The overlay pattern uses a shared base with environment-specific files that override particular values.
collector/
  base.yaml
  env/
    production.yaml
    staging.yaml
    development.yaml
The base file defines the complete structure with placeholders or defaults:
# base.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  batch:
    timeout: 1s
    send_batch_size: 1024
exporters:
  otlp:
    endpoint: localhost:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
Environment files override specific values:
# env/production.yaml
processors:
  memory_limiter:
    limit_mib: 2048
    spike_limit_mib: 512
exporters:
  otlp:
    endpoint: ${env:BACKEND_ENDPOINT}
    tls:
      insecure: false
      ca_file: /etc/ssl/certs/ca-bundle.crt
    retry_on_failure:
      enabled: true
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
service:
  telemetry:
    logs:
      level: info
      encoding: json

# env/development.yaml
processors:
  memory_limiter:
    limit_mib: 256
exporters:
  otlp:
    endpoint: localhost:4317
    tls:
      insecure: true
  debug:
    verbosity: detailed
service:
  pipelines:
    traces:
      exporters: [otlp, debug]
  telemetry:
    logs:
      level: debug
Deployment selects the appropriate overlay:
# Production
otelcol --config file:base.yaml --config file:env/production.yaml
# Development
otelcol --config file:base.yaml --config file:env/development.yaml
Nested file inclusion
For deeply modular configurations, the file provider supports nested inclusion. This is particularly useful for complex processor configurations like tail sampling policies, where individual policies might be maintained by different teams.
# processors/tail_sampling.yaml
processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 50000
    policies:
      - ${file:policies/errors.yaml}
      - ${file:policies/slo-violations.yaml}
      - ${file:policies/baseline.yaml}
Each policy file contains a single policy definition:
# policies/errors.yaml
name: errors
type: status_code
status_code:
  status_codes: [ERROR]

# policies/slo-violations.yaml
name: slo-violations
type: and
and:
  and_sub_policy:
    - name: latency-threshold
      type: latency
      latency:
        threshold_ms: 2000
    - name: high-priority-services
      type: string_attribute
      string_attribute:
        key: service.tier
        values: [critical, high]
This granularity enables teams to own individual policies, submit focused pull requests, and test policies in isolation before integration.
Testing decomposed configurations
The collector's validate command accepts the same configuration sources as runtime, enabling validation of decomposed configurations:
# Validate merged configuration
otelcol validate --config file:base.yaml --config file:env/production.yaml
# Validate with environment variables set
BACKEND_ENDPOINT=backend:4317 otelcol validate --config file:base.yaml --config file:env/production.yaml
For more complex validation, the print-config command outputs the fully resolved configuration after merging and environment variable substitution:
otelcol print-config --config file:base.yaml --config file:env/production.yaml
This output shows exactly what the collector would receive, which is useful for debugging merge issues or unexpected environment variable values.
Individual component files can be validated in isolation by wrapping them in minimal configurations. For a processor file to validate independently, it needs at least one receiver, exporter, and pipeline that uses the processor:
# test-harness.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: localhost:4317
exporters:
  debug:
    verbosity: basic
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]  # processor under test
      exporters: [debug]
otelcol validate --config file:test-harness.yaml --config file:processors/tail_sampling.yaml
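This validation step slots naturally into CI. Below is a sketch of a GitHub Actions job (the checkout action is real; the install script and repository paths are illustrative assumptions) that validates every environment overlay against the shared base on each pull request:

```yaml
# .github/workflows/validate-collector.yml (illustrative)
name: validate-collector-config
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install otelcol
        # Illustrative: install the collector binary however your org prefers
        run: ./scripts/install-otelcol.sh
      - name: Validate every environment overlay
        run: |
          for env in production staging development; do
            BACKEND_ENDPOINT=placeholder:4317 \
              otelcol validate \
                --config file:collector/base.yaml \
                --config "file:collector/env/${env}.yaml"
          done
```

The placeholder endpoint exists only to satisfy required environment variables during validation; it is never dialed.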
Practical considerations
The merge operation has limitations worth understanding. Arrays are not merged; the later source completely replaces the earlier source's array. This affects pipeline definitions: if base.yaml defines processors: [a, b, c] and an overlay defines processors: [a, b], the result is [a, b], not a combination. Plan your decomposition accordingly, keeping arrays that need to vary together in the same file.
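To make that concrete, consider two hypothetical fragments that both set the traces processor list:

```yaml
# base.yaml
service:
  pipelines:
    traces:
      processors: [memory_limiter, k8sattributes, batch]

# overlay.yaml
service:
  pipelines:
    traces:
      processors: [memory_limiter, batch]

# Merged result: the overlay's array replaces the base array wholesale.
# k8sattributes is dropped even though the overlay never mentioned it:
#   processors: [memory_limiter, batch]
```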
File paths in nested inclusions are relative to the collector's working directory, not to the file containing the inclusion. A ${file:policies/errors.yaml} reference resolves relative to where the collector process runs, regardless of which configuration file contains the reference. This can be surprising when configurations are organized in subdirectories; consider using absolute paths or ensuring the working directory is set appropriately.
Environment variable defaults (${env:VAR:-default}) only apply when the variable is unset. An empty string is not the same as unset; if VAR="" is exported, the default is not used. For required variables without reasonable defaults, validate externally before starting the collector:
: ${BACKEND_ENDPOINT:?BACKEND_ENDPOINT must be set}
otelcol --config file:config.yaml
Custom collector distributions built with ocb need the appropriate providers in the manifest. When the manifest omits the providers list entirely, ocb includes the default set (file, env, yaml, http, https); once you specify a providers list, only the listed providers are built in. A missing provider causes cryptic errors when the configuration uses its URI scheme:
# builder.yaml
providers:
  - gomod: go.opentelemetry.io/collector/confmap/provider/fileprovider v1.57.0
  - gomod: go.opentelemetry.io/collector/confmap/provider/envprovider v1.57.0
  - gomod: go.opentelemetry.io/collector/confmap/provider/yamlprovider v1.57.0
When not to decompose
Decomposition adds indirection. A single-file configuration that fits on a screen is easier to understand than multiple files that must be mentally merged. Small teams with straightforward pipelines may find decomposition overhead exceeds its benefits.
The patterns described here target configurations that have grown painful to maintain. If your configuration is not yet painful, keeping it simple might be the right choice. The collector's configuration system supports decomposition when you need it; you are not required to use it from the start.
Summary
The OpenTelemetry Collector's configuration merging and provider system enable modular configurations that scale with organizational complexity. Splitting by component type aligns with team ownership of pipeline stages. Splitting by signal pipeline aligns with team ownership of telemetry domains. Environment overlays separate deployment concerns from structural configuration.
The key insight is that the collector's deep merge behavior lets you compose configurations from independent pieces. Each piece can be reviewed, tested, and modified in isolation. When combined with validation in CI pipelines, decomposed configurations become easier to maintain than their monolithic alternatives.
Start with the decomposition pattern that matches your organizational boundaries. If platform and observability teams have clear ownership, split by component type. If traces, metrics, and logs teams operate independently, split by signal. Layer environment overlays on either approach. The collector's configuration system is flexible enough to support the structure that works for your team.
