6.6 Monitoring
Kafka uses Yammer Metrics for metrics reporting in both the server and the client. It can be configured to report stats through pluggable stats reporters that hook up to your monitoring system.
The easiest way to see the available metrics is to fire up jconsole and point it at a running Kafka client or server; this will allow browsing all metrics with JMX.
We do graphing and alerting on the following metrics:
Description | Mbean name | Normal value |
---|---|---|
Message in rate | kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec | |
Byte in rate | kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec | |
Request rate | kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce\|FetchConsumer\|FetchFollower} | |
Byte out rate | kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec | |
Log flush rate and time | kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs | |
Number of under-replicated partitions (\|ISR\| < \|all replicas\|) | kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions | 0 |
Is controller active on broker | kafka.controller:type=KafkaController,name=ActiveControllerCount | only one broker in the cluster should have 1 |
Leader election rate | kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs | non-zero when there are broker failures |
Unclean leader election rate | kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec | 0 |
Partition counts | kafka.server:type=ReplicaManager,name=PartitionCount | mostly even across brokers |
Leader replica counts | kafka.server:type=ReplicaManager,name=LeaderCount | mostly even across brokers |
ISR shrink rate | kafka.server:type=ReplicaManager,name=IsrShrinksPerSec | If a broker goes down, ISR for some of the partitions will shrink. When that broker is up again, ISR will be expanded once the replicas are fully caught up. Other than that, the expected value for both ISR shrink rate and expansion rate is 0. |
ISR expansion rate | kafka.server:type=ReplicaManager,name=IsrExpandsPerSec | See above |
Max lag in messages between follower and leader replicas | kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica | < replica.lag.max.messages |
Lag in messages per follower replica | kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+) | < replica.lag.max.messages |
Requests waiting in the producer purgatory | kafka.server:type=ProducerRequestPurgatory,name=PurgatorySize | non-zero if ack=-1 is used |
Requests waiting in the fetch purgatory | kafka.server:type=FetchRequestPurgatory,name=PurgatorySize | size depends on fetch.wait.max.ms in the consumer |
Request total time | kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce\|FetchConsumer\|FetchFollower} | broken into queue, local, remote and response send time |
Time the request waits in the request queue | kafka.network:type=RequestMetrics,name=QueueTimeMs,request={Produce\|FetchConsumer\|FetchFollower} | |
Time the request is processed at the leader | kafka.network:type=RequestMetrics,name=LocalTimeMs,request={Produce\|FetchConsumer\|FetchFollower} | |
Time the request waits for the follower | kafka.network:type=RequestMetrics,name=RemoteTimeMs,request={Produce\|FetchConsumer\|FetchFollower} | non-zero for produce requests when ack=-1 |
Time to send the response | kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request={Produce\|FetchConsumer\|FetchFollower} | |
Number of messages the consumer lags behind the producer by | kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+) | |
The average fraction of time the network processors are idle | kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent | between 0 and 1, ideally > 0.3 |
The average fraction of time the request handler threads are idle | kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent | between 0 and 1, ideally > 0.3 |
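Several of the "normal value" entries in the table translate directly into alert rules. The following is a minimal sketch of such a check, assuming the metric values have already been collected (for example by a JMX reporter) into a plain dictionary per broker. The dictionary layout, the function name `check_cluster` and the alert strings are assumptions made for this example, not part of Kafka.

```python
from typing import Dict, List

# "ideally > 0.3" for both idle-fraction metrics in the table.
IDLE_FLOOR = 0.3


def check_cluster(brokers: Dict[str, Dict[str, float]]) -> List[str]:
    """Return alert strings for a snapshot of per-broker metric values.

    `brokers` maps a broker label to {metric name: sampled value}. This
    layout is hypothetical; it is not a Kafka data structure.
    """
    alerts = []

    # Only one broker in the cluster should report ActiveControllerCount = 1.
    controllers = sum(m.get("ActiveControllerCount", 0) for m in brokers.values())
    if controllers != 1:
        alerts.append(f"cluster: expected 1 active controller, found {controllers}")

    for name, m in sorted(brokers.items()):
        # Normal value is 0 for both of these metrics.
        if m.get("UnderReplicatedPartitions", 0) > 0:
            alerts.append(f"{name}: under-replicated partitions present")
        if m.get("UncleanLeaderElectionsPerSec", 0) > 0:
            alerts.append(f"{name}: unclean leader elections observed")
        # Idle fractions lie between 0 and 1 and should stay above the floor.
        for idle in ("NetworkProcessorAvgIdlePercent", "RequestHandlerAvgIdlePercent"):
            if m.get(idle, 1.0) <= IDLE_FLOOR:
                alerts.append(f"{name}: {idle} at or below {IDLE_FLOOR}")
    return alerts
```

Metrics that are missing from a snapshot are treated as healthy here; a real alerting setup would probably want to flag missing data separately.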
Common monitoring metrics for producer/consumer/connect
The following metrics are available on producer/consumer/connector instances. For specific metrics, please see the following sections.
METRIC/ATTRIBUTE NAME | DESCRIPTION | MBEAN NAME |
---|---|---|
connection-close-rate | Connections closed per second in the window. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
connection-creation-rate | New connections established per second in the window. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
network-io-rate | The average number of network operations (reads or writes) on all connections per second. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
outgoing-byte-rate | The average number of outgoing bytes sent per second to all servers. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
request-rate | The average number of requests sent per second. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
request-size-avg | The average size of all requests in the window. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
request-size-max | The maximum size of any request sent in the window. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
incoming-byte-rate | Bytes/second read off all sockets. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
response-rate | Responses received per second. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
select-rate | Number of times the I/O layer checked for new I/O to perform per second. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
io-wait-time-ns-avg | The average length of time the I/O thread spent waiting for a socket ready for reads or writes in nanoseconds. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
io-wait-ratio | The fraction of time the I/O thread spent waiting. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
io-time-ns-avg | The average length of time for I/O per select call in nanoseconds. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
io-ratio | The fraction of time the I/O thread spent doing I/O. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
connection-count | The current number of active connections. | kafka.[producer\|consumer\|connect]:type=[producer\|consumer\|connect]-metrics,client-id=([-.\w]+) |
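The MBEAN NAME column above is a pattern rather than a literal name: the bracketed alternatives select the client type, and `([-.\w]+)` stands for the client id. As a minimal sketch of how a concrete name might be derived from that pattern, the helper below fills in both parts. The function `client_mbean` is hypothetical, and the sketch assumes the client id appears unquoted in the name, which may not hold for every id.

```python
import re

# Character class taken from the MBEAN NAME column: ([-.\w]+)
CLIENT_ID = re.compile(r"[-.\w]+")
# The three alternatives listed in the pattern.
KINDS = ("producer", "consumer", "connect")


def client_mbean(kind: str, client_id: str) -> str:
    """Build a concrete MBean name from the documented pattern.

    Hypothetical helper for illustration; it only substitutes the client
    type and client id into the pattern shown in the table.
    """
    if kind not in KINDS:
        raise ValueError(f"unknown client type: {kind!r}")
    if not CLIENT_ID.fullmatch(client_id):
        raise ValueError(f"client id does not match [-.\\w]+: {client_id!r}")
    return f"kafka.{kind}:type={kind}-metrics,client-id={client_id}"
```

For example, `client_mbean("producer", "orders-1")` yields `kafka.producer:type=producer-metrics,client-id=orders-1`, which is the kind of name to look for when browsing with jconsole.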
Common per-broker metrics for producer/consumer/connect
The following metrics are available on producer/consumer/connector instances. For specific metrics, please see the following sections.
METRIC/ATTRIBUTE NAME | DESCRIPTION | MBEAN NAME |
---|---|---|
outgoing-byte-rate | The average number of outgoing bytes sent per second for a node. | kafka.producer:type=[consumer\|producer\|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+) |
request-rate | The average number of requests sent per second for a node. | kafka.producer:type=[consumer\|producer\|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+) |
request-size-avg | The average size of all requests in the window for a node. | kafka.producer:type=[consumer\|producer\|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+) |
request-size-max | The maximum size of any request sent in the window for a node. | kafka.producer:type=[consumer\|producer\|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+) |
incoming-byte-rate | The average number of bytes received per second for a node. | kafka.producer:type=[consumer\|producer\|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+) |
request-latency-avg | The average request latency in ms for a node. | kafka.producer:type=[consumer\|producer\|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+) |
request-latency-max | The maximum request latency in ms for a node. | kafka.producer:type=[consumer\|producer\|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+) |
response-rate | Responses received per second for a node. | kafka.producer:type=[consumer\|producer\|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+) |
Producer monitoring
The following metrics are available on producer instances.
METRIC/ATTRIBUTE NAME | DESCRIPTION | MBEAN NAME |
---|---|---|
waiting-threads | The number of user threads blocked waiting for buffer memory to enqueue their records. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
buffer-total-bytes | The maximum amount of buffer memory the client can use (whether or not it is currently used). | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
buffer-available-bytes | The total amount of buffer memory that is not being used (either unallocated or in the free list). | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
bufferpool-wait-time | The fraction of time an appender waits for space allocation. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
batch-size-avg | The average number of bytes sent per partition per-request. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
batch-size-max | The max number of bytes sent per partition per-request. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
compression-rate-avg | The average compression rate of record batches. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
record-queue-time-avg | The average time in ms record batches spent in the record accumulator. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
record-queue-time-max | The maximum time in ms record batches spent in the record accumulator. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
request-latency-avg | The average request latency in ms. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
request-latency-max | The maximum request latency in ms. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
record-send-rate | The average number of records sent per second. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
records-per-request-avg | The average number of records per request. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
record-retry-rate | The average per-second number of retried record sends. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
record-error-rate | The average per-second number of record sends that resulted in errors. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
record-size-max | The maximum record size. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
record-size-avg | The average record size. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
requests-in-flight | The current number of in-flight requests awaiting a response. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
metadata-age | The age in seconds of the current producer metadata being used. | kafka.producer:type=producer-metrics,client-id=([-.\w]+) |
record-send-rate | The average number of records sent per second for a topic. | kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+) |
byte-rate | The average number of bytes sent per second for a topic. | kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+) |
compression-rate | The average compression rate of record batches for a topic. | kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+) |
record-retry-rate | The average per-second number of retried record sends for a topic. | kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+) |
record-error-rate | The average per-second number of record sends that resulted in errors for a topic. | kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+) |
produce-throttle-time-max | The maximum time in ms a request was throttled by a broker. | kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+) |
produce-throttle-time-avg | The average time in ms a request was throttled by a broker. | kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+) |
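The two buffer gauges above are easier to graph as a single ratio. As a minimal sketch, assuming both values were sampled at roughly the same moment, the fraction of producer buffer memory in use can be derived from `buffer-total-bytes` and `buffer-available-bytes`. The function name `buffer_utilization` is an assumption for this example; Kafka does not expose such a metric under that name here.

```python
def buffer_utilization(total_bytes: float, available_bytes: float) -> float:
    """Fraction of producer buffer memory currently in use, between 0 and 1.

    Derived from buffer-total-bytes and buffer-available-bytes; the result
    is clamped because the two gauges may be sampled at slightly different
    times.
    """
    if total_bytes <= 0:
        raise ValueError("buffer-total-bytes must be positive")
    used = total_bytes - available_bytes
    return min(max(used / total_bytes, 0.0), 1.0)
```

A ratio that stays close to 1 together with a non-zero `waiting-threads` value suggests that user threads are blocking on buffer memory.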
New consumer monitoring
The following metrics are available on new consumer instances.
Consumer Group Metrics
METRIC/ATTRIBUTE NAME | DESCRIPTION | MBEAN NAME |
---|---|---|
commit-latency-avg | The average time taken for a commit request | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
commit-latency-max | The max time taken for a commit request | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
commit-rate | The number of commit calls per second | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
assigned-partitions | The number of partitions currently assigned to this consumer | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
heartbeat-response-time-max | The max time taken to receive a response to a heartbeat request | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
heartbeat-rate | The average number of heartbeats per second | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
join-time-avg | The average time taken for a group rejoin | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
join-time-max | The max time taken for a group rejoin | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
join-rate | The number of group joins per second | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
sync-time-avg | The average time taken for a group sync | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
sync-time-max | The max time taken for a group sync | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
sync-rate | The number of group syncs per second | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
last-heartbeat-seconds-ago | The number of seconds since the last controller heartbeat | kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+) |
Consumer Fetch Metrics
METRIC/ATTRIBUTE NAME | DESCRIPTION | MBEAN NAME |
---|---|---|
fetch-size-avg | The average number of bytes fetched per request | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) |
fetch-size-max | The maximum number of bytes fetched per request | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) |
bytes-consumed-rate | The average number of bytes consumed per second | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) |
records-per-request-avg | The average number of records in each request | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) |
records-consumed-rate | The average number of records consumed per second | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) |
fetch-latency-avg | The average time taken for a fetch request | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) |
fetch-latency-max | The max time taken for a fetch request | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) |
fetch-rate | The number of fetch requests per second | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) |
records-lag-max | The maximum lag in terms of number of records for any partition in this window | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) |
fetch-throttle-time-avg | The average throttle time in ms | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) |
fetch-throttle-time-max | The maximum throttle time in ms | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+) |
Topic-level Fetch Metrics
METRIC/ATTRIBUTE NAME | DESCRIPTION | MBEAN NAME |
---|---|---|
fetch-size-avg | The average number of bytes fetched per request for a specific topic. | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+) |
fetch-size-max | The maximum number of bytes fetched per request for a specific topic. | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+) |
bytes-consumed-rate | The average number of bytes consumed per second for a specific topic. | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+) |
records-per-request-avg | The average number of records in each request for a specific topic. | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+) |
records-consumed-rate | The average number of records consumed per second for a specific topic. | kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+) |
Others
We recommend monitoring GC time and other stats, along with various server stats such as CPU utilization, I/O service time, etc. On the client side, we recommend monitoring the message/byte rate (global and per topic) and request rate/size/time; on the consumer side, also monitor the max lag in messages among all partitions and the min fetch request rate. For a consumer to keep up, max lag needs to be less than a threshold and min fetch rate needs to be larger than 0.
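The keep-up condition for a consumer can be written as a single predicate. This is a minimal sketch; the function name and the idea of passing the lag threshold as a parameter are assumptions for the example, since the text does not state a specific threshold.

```python
def consumer_keeping_up(max_lag: float, min_fetch_rate: float, lag_threshold: float) -> bool:
    """True when max lag is below the threshold and min fetch rate is above 0.

    `lag_threshold` is chosen by the operator; no default is implied by the
    documentation.
    """
    return max_lag < lag_threshold and min_fetch_rate > 0
```

A zero min fetch rate means the consumer has stopped issuing fetch requests, so the check fails even if the reported lag still looks small.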
Audit
The final alerting we do is on the correctness of the data delivery. We audit that every message that is sent is consumed by all consumers and measure the lag for this to occur. For important topics we alert if a certain completeness is not achieved in a certain time period. The details of this are discussed in KAFKA-260.
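To make the notion of completeness concrete, here is a minimal sketch that compares produced and consumed message counts per time window and lists the windows that fall short of a target. It does not reproduce the audit design discussed in KAFKA-260; the window labels, the count dictionaries, the function names and the `target` default are all assumptions for illustration.

```python
from typing import Dict, List


def completeness(produced: int, consumed: int) -> float:
    """Fraction of produced messages that were consumed in one window."""
    if produced == 0:
        return 1.0  # nothing was sent, so nothing is missing
    return consumed / produced


def incomplete_windows(
    produced_by_window: Dict[str, int],
    consumed_by_window: Dict[str, int],
    target: float = 0.999,  # hypothetical completeness target
) -> List[str]:
    """Return the window labels whose completeness is below the target."""
    return [
        window
        for window in sorted(produced_by_window)
        if completeness(produced_by_window[window], consumed_by_window.get(window, 0)) < target
    ]
```

An alert for an important topic would then fire when this list is non-empty for windows older than the allowed time period.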