OpenTelemetry Cardinality Explosion: Performance Degradation in Production

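OpenTelemetry runs perfectly in development but hits cardinality explosion in production. High-cardinality attributes such as user_id, session_id, and request_uuid create millions of unique time series. For example, a request counter labeled with user_id for 2 million users produces 2 million distinct series for that single metric. Storage, memory, and observability tooling degrade quickly: dashboards slow down, request latency increases, and storage costs soar.

Your OpenTelemetry setup works well in development. Every metric is collected and the dashboards look fine. Then you go to production, the user base grows into the millions, and suddenly dashboards load more and more slowly, Prometheus memory usage spikes, and storage costs multiply. This is not an OpenTelemetry bug. You have run into cardinality explosion.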

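What is cardinality explosion?

Cardinality is the number of unique values a metric label can take. Each distinct combination of label values on a metric creates a new time series.

For example:

http_requests_total{method="GET", status="200"} - low cardinality (a handful of methods, a few dozen status codes)
http_requests_total{user_id="12345"} - cardinality explosion (millions of users)

In the second example, 2 million users means 2 million distinct time series. Every one of them has to be stored, indexed, and queried, which adds up to a large performance cost.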
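To make the arithmetic concrete, here is a minimal sketch that estimates the worst-case series count for one metric as the product of its label cardinalities. The function name and the numbers (5 methods, 50 status codes) are illustrative assumptions, not measurements; the real count is the number of combinations actually observed, which can be lower.

```javascript
// Worst-case number of time series for one metric: the product of the
// number of distinct values each label can take.
function estimateSeriesCount(labelCardinalities) {
  return Object.values(labelCardinalities).reduce((total, n) => total * n, 1)
}

// Hypothetical numbers for illustration only.
const lowCardinality = estimateSeriesCount({ method: 5, status: 50 })
const withUserId = estimateSeriesCount({ method: 5, status: 50, user_id: 2_000_000 })

console.log(lowCardinality) // 250
console.log(withUserId)     // 500000000
```

Adding a single unbounded label multiplies everything that was already there, which is why one user_id label is enough to cause trouble.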

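Why is there no problem in development?

A development environment usually has only a handful of test users and very little data, so high-cardinality attributes never get the chance to cause trouble. Production has real users, and the data volume can be thousands or tens of thousands of times larger, so cardinality explosion appears seemingly out of nowhere.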

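Solutions: control cardinality at the source

1. Use only low-cardinality dimensions

Metric labels should only take a bounded number of unique values:

# ✅ Good practice - low cardinality
http_requests_total{method="GET", status="2xx", endpoint="/api/users"}

# ❌ Bad practice - high cardinality
http_requests_total{user_id="12345", request_id="abc-123", url="/api/users/12345"}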
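One way to get from the bad labels to the good ones is to normalize values before they reach the metric. The sketch below is an assumption about how that could look: the helper names, the `request` shape, and the `:id` placeholder are all hypothetical, and the ID detection (numeric or UUID-like segments) is a simple heuristic. Where your web framework exposes the matched route pattern, that is usually a more reliable source for the endpoint label.

```javascript
// Collapse an HTTP status code into its class, e.g. 404 -> "4xx".
function statusClass(statusCode) {
  return `${Math.floor(statusCode / 100)}xx`
}

// Replace path segments that look like IDs (all digits, or UUID-like)
// with a fixed placeholder so the label has a bounded set of values.
function routeTemplate(path) {
  const looksLikeId = (segment) =>
    /^\d+$/.test(segment) || /^[0-9a-f]{8}-[0-9a-f]{4}-/i.test(segment)

  return path
    .split('?')[0] // drop the query string
    .split('/')
    .map((segment) => (looksLikeId(segment) ? ':id' : segment))
    .join('/')
}

// Build the label set for a request; `request` is a hypothetical plain object.
function lowCardinalityLabels(request) {
  return {
    method: request.method,
    status: statusClass(request.statusCode),
    endpoint: routeTemplate(request.path),
  }
}

console.log(lowCardinalityLabels({ method: 'GET', statusCode: 200, path: '/api/users/12345?page=2' }))
// { method: 'GET', status: '2xx', endpoint: '/api/users/:id' }
```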

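2. Use attribute filtering

Configure the attributes processor in the OpenTelemetry Collector to delete or replace high-cardinality attributes:

processors:
  attributes:
    actions:
      - key: user_id
        action: delete
      - key: request_id
        action: delete
      # Copy the first path segment of url into a new "endpoint" attribute
      - key: url
        action: extract
        pattern: ^/(?P<endpoint>[^/]+)
      # extract keeps the source attribute, so drop the raw url afterwards
      - key: url
        action: delete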
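To see what these actions do to a single attribute set, the following sketch emulates delete and extract in plain JavaScript. It is a local illustration only, not the Collector's implementation: `applyActions` is a hypothetical helper, the regular expression uses JavaScript's `(?<name>...)` syntax instead of the `(?P<name>...)` form in the config, and the sketch assumes that extract leaves the source attribute in place, which is why `url` is deleted explicitly at the end.

```javascript
// Apply a list of delete/extract actions to a copy of the attributes.
function applyActions(attributes, actions) {
  const result = { ...attributes }
  for (const { key, action, pattern } of actions) {
    if (!(key in result)) continue
    if (action === 'extract') {
      // Named capture groups become new attributes.
      const match = pattern.exec(String(result[key]))
      if (match && match.groups) Object.assign(result, match.groups)
    } else if (action === 'delete') {
      delete result[key]
    }
  }
  return result
}

const actions = [
  { key: 'user_id', action: 'delete' },
  { key: 'request_id', action: 'delete' },
  { key: 'url', action: 'extract', pattern: /^\/(?<endpoint>[^\/]+)/ },
  { key: 'url', action: 'delete' },
]

const out = applyActions(
  { method: 'GET', user_id: '12345', request_id: 'abc-123', url: '/api/users/12345' },
  actions
)
console.log(out) // { method: 'GET', endpoint: 'api' }
```

Note that this pattern keeps only the first path segment, so `/api/users/12345` becomes `endpoint="api"`. If you need `/api/users` as the label value, the pattern has to capture more of the path.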

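3. Implement a sampling strategy

Not every request needs a full trace. Sampling controls trace volume rather than the number of metric series, so it complements the label changes above. Use tail sampling to keep only the traces worth keeping: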

processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 1000

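4. Use events instead of metrics

For high-cardinality data, use logs or events instead of metrics:

// ❌ Don't record per-user behavior as a metric
metrics.counter('user_action', { user_id: userId, action: 'click' }).inc();

// ✅ Record it as a log or event instead
logger.info('User action', { user_id: userId, action: 'click', timestamp: Date.now() });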

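Best practices summary

Keep metric dimensions low-cardinality: environment, service name, endpoint pattern, status code class
Send high-cardinality data to logs or events: user_id, request_id, session_id, and so on
Configure attribute filtering: delete high-cardinality attributes at the Collector
Implement a sampling strategy: keep only the traces worth keeping
Monitor cardinality growth: alert when cardinality crosses a threshold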

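Real-world impact of cardinality explosion

How the degradation shows up

Storage:

Time series count: from a few thousand to several million
Storage space: from a few GB to hundreds of GB
Storage cost: thousands of dollars more per month
Query performance: from milliseconds to seconds

Memory:

Prometheus memory: from a few GB to tens of GB
OOM risk: frequent out-of-memory failures
Restart time: from seconds to minutes
Overall stability: the system becomes unstable

Queries:

Dashboard loading: from 1 second to 30 seconds
Aggregation queries: time out and fail
Real-time monitoring: severe lag
User experience: poor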

Case data

Case 1: e-commerce site

| Metric | Development | Production | Increase |
|------|---------|---------|---------|
| Users | 10 | 2 million | 200,000x |
| Time series | 100 | 200 million | 2,000,000x |
| Storage | 100 MB | 200 GB | 2,000x |
| Query latency | 50 ms | 30 s | 600x |

Case 2: API service

| Metric | Before fix | After fix | Improvement |
|------|--------|--------|------|
| Time series | 5 million | 5,000 | 1,000x |
| Prometheus memory | 32 GB | 2 GB | 16x |
| Query latency | 15 s | 200 ms | 75x |
| Storage cost | $5,000/month | $200/month | 25x |

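Identifying high-cardinality attributes

Common high-cardinality attributes

User-related:

user_id: user ID
session_id: session ID
email: email address
ip_address: IP address

Request-related:

request_id: request ID
trace_id: trace ID
url: full URL
query_string: query parameters

Dynamic data:

timestamp: timestamp
random_id: random ID
file_path: file path
error_message: error message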

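Cardinality detection tool

// Count the unique label-value combinations seen for each metric.
// Expects items shaped like { name: string, labels: { [key]: value } }.
function detectCardinality(metrics) {
  const cardinality = new Map()

  for (const metric of metrics) {
    const key = metric.name
    // Include the values, not just the label names: every distinct
    // combination of values is a separate time series.
    const labels = Object.entries(metric.labels)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([k, v]) => `${k}=${v}`)
      .join(',')

    if (!cardinality.has(key)) {
      cardinality.set(key, new Set())
    }

    cardinality.get(key).add(labels)
  }

  // Print the cardinality report
  for (const [name, labels] of cardinality) {
    console.log(`${name}: ${labels.size} unique label combinations`)
  }
}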
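The report above gives one total per metric. To find out which label is responsible, it helps to count unique values per label as well. The sketch below assumes the same hypothetical input shape (`{ name, labels }`); the function name and output format are illustrative.

```javascript
// For every metric/label pair, count how many distinct values were seen,
// and return the pairs sorted from most to fewest unique values.
function labelValueCounts(metrics) {
  const valuesByLabel = new Map()

  for (const metric of metrics) {
    for (const [label, value] of Object.entries(metric.labels)) {
      const id = `${metric.name}.${label}`
      if (!valuesByLabel.has(id)) valuesByLabel.set(id, new Set())
      valuesByLabel.get(id).add(String(value))
    }
  }

  return [...valuesByLabel]
    .map(([id, values]) => ({ label: id, uniqueValues: values.size }))
    .sort((a, b) => b.uniqueValues - a.uniqueValues)
}

// Hypothetical sample data.
const sample = [
  { name: 'http_requests_total', labels: { method: 'GET', user_id: '1' } },
  { name: 'http_requests_total', labels: { method: 'GET', user_id: '2' } },
  { name: 'http_requests_total', labels: { method: 'GET', user_id: '3' } },
]

console.log(labelValueCounts(sample)[0])
// { label: 'http_requests_total.user_id', uniqueValues: 3 }
```

Labels whose unique-value count keeps growing with traffic are the ones to drop, bucket, or move to logs.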

Advanced solutions

Option 1: Cardinality limiter

class CardinalityLimiter {
  constructor(maxCardinality = 10000) {
    this.maxCardinality = maxCardinality
    this.cardinalityMap = new Map()
  }
  
  shouldRecord(metricName, labels) {
    const key = this.createKey(metricName, labels)
    
    if (!this.cardinalityMap.has(key)) {
      if (this.cardinalityMap.size >= this.maxCardinality) {
        console.warn(`Cardinality limit reached for ${metricName}`)
        return false
      }
      this.cardinalityMap.set(key, 0)
    }
    
    this.cardinalityMap.set(key, this.cardinalityMap.get(key) + 1)
    return true
  }
  
  createKey(metricName, labels) {
    const sortedLabels = Object.entries(labels)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([k, v]) => `${k}=${v}`)
      .join(',')
    return `${metricName}{${sortedLabels}}`
  }
}
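The limiter above drops measurements once the limit is reached, which means the totals no longer add up. A variation, sketched here under assumptions, keeps a separate budget per metric and folds anything over the limit into one shared overflow series instead of discarding it. The class name, the `overflow` label, and the default limit are hypothetical choices for illustration.

```javascript
class OverflowLimiter {
  constructor(maxSeriesPerMetric = 2000) {
    this.maxSeriesPerMetric = maxSeriesPerMetric
    this.seen = new Map() // metric name -> Set of label-set keys
  }

  // Returns the labels to record under: the originals while the metric is
  // under its limit, otherwise a single shared overflow label set.
  limit(metricName, labels) {
    if (!this.seen.has(metricName)) this.seen.set(metricName, new Set())
    const known = this.seen.get(metricName)

    // Sort entries so that key order does not create distinct keys.
    const key = JSON.stringify(
      Object.entries(labels).sort(([a], [b]) => a.localeCompare(b))
    )

    if (known.has(key)) return labels
    if (known.size < this.maxSeriesPerMetric) {
      known.add(key)
      return labels
    }
    return { overflow: 'true' }
  }
}

// Usage: with a limit of 2, the third distinct label set is folded away.
const limiter = new OverflowLimiter(2)
console.log(limiter.limit('http_requests_total', { user_id: 'a' })) // { user_id: 'a' }
console.log(limiter.limit('http_requests_total', { user_id: 'b' })) // { user_id: 'b' }
console.log(limiter.limit('http_requests_total', { user_id: 'c' })) // { overflow: 'true' }
```

Because every measurement is still recorded somewhere, sums across all series stay correct, and a non-empty overflow series is itself a useful signal that a metric has exceeded its budget.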

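Option 2: Dynamic attribute aggregation

// Label keys treated as high-cardinality (example list; adjust to your own labels)
const HIGH_CARDINALITY_KEYS = new Set(['user_id', 'request_id', 'ip_address'])

function isHighCardinality(key) {
  return HIGH_CARDINALITY_KEYS.has(key)
}

function aggregateLabels(labels) {
  const aggregated = {}

  for (const [key, value] of Object.entries(labels)) {
    // Aggregate high-cardinality attributes, pass the rest through
    if (isHighCardinality(key)) {
      aggregated[key] = aggregateValue(key, value)
    } else {
      aggregated[key] = value
    }
  }

  return aggregated
}

function aggregateValue(key, value) {
  switch (key) {
    case 'user_id':
      return 'user'  // collapse all users into a single label value
    case 'request_id':
      return 'request'
    case 'ip_address':
      return 'ip'
    default:
      return value
  }
}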
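Collapsing every value into one constant removes the dimension entirely. If you still want some breakdown, a middle ground is to map values onto a small fixed set of buckets. The sketch below is one possible approach, not part of the code above: `bucketOf` uses an FNV-1a-style hash over the string's character codes, `ipv4Subnet` keeps only the /24 prefix, and both names and the bucket count of 16 are assumptions.

```javascript
// Stable hash of a string into one of `bucketCount` buckets
// (FNV-1a-style, 32-bit, computed over UTF-16 code units).
function bucketOf(value, bucketCount = 16) {
  let hash = 0x811c9dc5
  for (let i = 0; i < value.length; i++) {
    hash ^= value.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0 // keep it an unsigned 32-bit value
  }
  return `bucket_${hash % bucketCount}`
}

// Keep only the first three octets of an IPv4 address; anything that does
// not look like dotted IPv4 goes into a single "other" value.
function ipv4Subnet(ip) {
  const octets = ip.split('.')
  return octets.length === 4 ? `${octets.slice(0, 3).join('.')}.0/24` : 'other'
}

console.log(ipv4Subnet('203.0.113.42')) // 203.0.113.0/24
console.log(bucketOf('user-12345') === bucketOf('user-12345')) // true
```

With 16 buckets the label can never contribute more than 16 values, no matter how many users there are. Note that even /24 subnets can be numerous for a public service, so check the resulting cardinality before relying on it.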

Option 3: Tiered sampling

processors:
  tail_sampling:
    policies:
      # Error requests: sample 100%
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Slow requests: sample 100%
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 1000

      # Normal requests: sample 1%
      - name: normal-requests
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
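The decision these three policies express can be sketched in a few lines: a trace is kept if any one policy matches. This is an approximation for illustration, not the Collector's algorithm. The trace shape (`traceId`, `durationMs`, `spans[].status`) is hypothetical, and deriving the 1% decision from the last eight hex digits of the trace ID is an assumption chosen so that the result is deterministic.

```javascript
// Decide whether to keep a finished trace. Any matching rule keeps it.
function shouldKeepTrace(trace, { latencyThresholdMs = 1000, samplingPercentage = 1 } = {}) {
  // Rule 1: keep every trace that contains an error span.
  if (trace.spans.some((span) => span.status === 'ERROR')) return true

  // Rule 2: keep every trace slower than the threshold.
  if (trace.durationMs > latencyThresholdMs) return true

  // Rule 3: keep a fixed percentage of the rest, decided from the trace ID
  // so that the same trace always gets the same answer.
  const bucket = parseInt(trace.traceId.slice(-8), 16) % 10000
  return bucket < samplingPercentage * 100
}

const normalTrace = {
  traceId: '4bf92f3577b34da6a3ce929d0000270f',
  durationMs: 120,
  spans: [{ status: 'OK' }],
}
const failedTrace = { ...normalTrace, spans: [{ status: 'OK' }, { status: 'ERROR' }] }

console.log(shouldKeepTrace(normalTrace)) // false
console.log(shouldKeepTrace(failedTrace)) // true
```

Tail sampling needs the whole trace before it can decide, which is why it runs in the Collector rather than in each service.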

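Option 4: Use Exemplars

// Use exemplars to link metrics to traces
const { metrics, context } = require('@opentelemetry/api')

// 'http-server' is an example meter name
const meter = metrics.getMeter('http-server')

const counter = meter.createCounter('http_requests_total', {
  description: 'Total HTTP requests',
  unit: '1'
})

counter.add(1, {
  method: 'GET',
  status: '200',
  // No user_id or other high-cardinality attributes here;
  // the active context carries the trace an exemplar can point to,
  // provided the SDK and backend in use support exemplars
}, context.active())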

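Monitoring and alerting

Cardinality monitoring metrics

# Prometheus alerting rules
# Note: the {__name__=~".+"} selector touches every series and is expensive to evaluate
groups:
  - name: cardinality_alerts
    rules:
      - alert: HighCardinality
        # count (not sum) gives the number of series per metric name;
        # label_replace copies __name__ into "metric" so each alert keeps a distinct label set
        expr: |
          label_replace(count by (__name__) ({__name__=~".+"}), "metric", "$1", "__name__", "(.+)") > 100000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High cardinality detected for {{ $labels.metric }}"
          description: "Cardinality is {{ $value }}"

      - alert: CardinalityExplosion
        # deriv needs a range, so the per-metric series count is wrapped in a subquery ([5m:]);
        # the result is the growth in series per second
        expr: |
          deriv(label_replace(count by (__name__) ({__name__=~".+"}), "metric", "$1", "__name__", "(.+)")[5m:]) > 1000
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cardinality explosion detected"
          description: "Cardinality is growing rapidly"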

Grafana dashboard

Key metrics:

Total number of time series
Time series per metric
Cardinality growth rate
Storage usage
Query latency
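For the per-metric series count, Prometheus reports TSDB statistics over its HTTP API at `/api/v1/status/tsdb`. The sketch below pulls the metrics with the most series from that endpoint. It assumes Node.js 18 or later for the global `fetch`, a response whose `data.seriesCountByMetricName` field is a list of `{ name, value }` entries, and a Prometheus URL that you supply; the function names are hypothetical.

```javascript
// Pure helper: pick the metrics with the most series from a TSDB status payload.
function topMetricsBySeries(tsdbStatus, limit = 5) {
  const entries = (tsdbStatus.data && tsdbStatus.data.seriesCountByMetricName) || []
  return [...entries].sort((a, b) => b.value - a.value).slice(0, limit)
}

// Fetch the TSDB status from a Prometheus server and summarize it.
// baseUrl is whatever address your Prometheus listens on.
async function fetchTopMetrics(baseUrl, limit = 5) {
  const response = await fetch(`${baseUrl}/api/v1/status/tsdb`)
  if (!response.ok) {
    throw new Error(`Prometheus returned HTTP ${response.status}`)
  }
  return topMetricsBySeries(await response.json(), limit)
}

// Example with a hand-written payload in the assumed shape.
const examplePayload = {
  status: 'success',
  data: {
    seriesCountByMetricName: [
      { name: 'up', value: 12 },
      { name: 'http_requests_total', value: 2000000 },
      { name: 'process_cpu_seconds_total', value: 40 },
    ],
  },
}

console.log(topMetricsBySeries(examplePayload, 1))
// [ { name: 'http_requests_total', value: 2000000 } ]
```

Running this on a schedule and logging the top entries is a lightweight way to spot a metric whose series count is climbing before an alert fires.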

Has your OpenTelemetry setup run into cardinality explosion too? Share your solution in the comments!
