feat: add opentelemetry counters for sent and acked messages #2532
agrawal-siddharth merged 1 commit into googleapis:main from
Conversation
| };
| static AttributeKey<String> telemetryKeyErrorCode = AttributeKey.stringKey("error_code");
| private Attributes telemetryAttributes;
| private long incomingRequestCountBuffered;
nit, I would group all of them under telemetryMetrics.
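For illustration, one way to do that grouping (class and field names in this sketch are hypothetical; the PR ends up introducing a nested OpenTelemetryMetrics class):

```java
// Class-body fragment of the writer/connection class (sketch only).
// All instrument handles live in one holder, so the outer class keeps a single
// telemetryMetrics field instead of many loose instrument fields.
private static final class TelemetryMetrics {
  io.opentelemetry.api.metrics.LongCounter ackedRequestCount;
  io.opentelemetry.api.metrics.LongCounter ackedRequestBytes;
  io.opentelemetry.api.metrics.LongCounter ackedRequestRows;
  io.opentelemetry.api.metrics.LongHistogram networkResponseLatency;
}

private final TelemetryMetrics telemetryMetrics = new TelemetryMetrics();
```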
| "Reports time taken in milliseconds for a response to arrive once a message has been sent over the network.") | ||
| .setExplicitBucketBoundariesAdvice(METRICS_LATENCY_BUCKETS) | ||
| .build(); | ||
| instrumentConnectionEstablishCount = |
I believe this can be derived from network_response_latency if you put connection_id as a metric field.
I have added writer_id as an attribute. I still have this metric, however, as it directly provides information about establishing a connection.
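A sketch of the recording path with that attribute (the helper method name is illustrative; telemetryKeyWriterId is the attribute key added in the diff, using io.opentelemetry.api.common types):

```java
// Record the round-trip latency tagged with writer_id, so per-writer (and
// effectively per-connection) latency series can be derived downstream.
static final AttributeKey<String> telemetryKeyWriterId = AttributeKey.stringKey("writer_id");

private void recordNetworkResponseLatency(long latencyMillis) {
  // networkResponseLatency is the LongHistogram built from
  // histogramBuilder("network_response_latency").ofLongs() during setup.
  networkResponseLatency.record(latencyMillis, Attributes.of(telemetryKeyWriterId, writerId));
}
```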
| measurement.record(length, getTelemetryAttributes());
| });
| writeMeter
| .gaugeBuilder("inflight_queue_length")
It will need connection_id as a metric field, otherwise it doesn't make much sense?
I have added writer_id as an attribute.
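A sketch of the callback-based gauge registration (inflightRequests stands in for however the worker tracks its queue; getTelemetryAttributes() is the shared attribute set, which now carries writer_id):

```java
// Registered once when the instruments are created; the SDK invokes the callback
// at collection time, so the queue length is sampled rather than pushed per event.
writeMeter
    .gaugeBuilder("inflight_queue_length")
    .ofLongs()
    .buildWithCallback(
        measurement -> measurement.record(inflightRequests.size(), getTelemetryAttributes()));
```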
| .build();
| instrumentSentRequestRows =
| writeMeter
| .counterBuilder("append_rows_sent")
I think we should maintain fewer metrics. Can we just add a "result" field to append_requests/rows/bytes?
Done. I have removed the following metrics: append_requests, append_request_bytes, append_rows, waiting_queue_length, connection_retry_count, append_requests_error, append_request_bytes_error, append_rows_error.
I now use the "error_code" attribute on each of the following metrics: append_requests_acked, append_request_bytes_acked, append_rows_acked, connection_end_count.
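As a rough illustration of that shape (the method and counter names below are stand-ins, not necessarily the PR's; telemetryKeyErrorCode is the key declared in the diff):

```java
// One set of "acked" counters, with error_code as an attribute, replaces the
// separate *_error counters. "OK" marks a successful append.
private void recordAppendAcked(long rows, long bytes, io.grpc.Status status) {
  Attributes attrs =
      getTelemetryAttributes().toBuilder()
          .put(telemetryKeyErrorCode, status.isOk() ? "OK" : status.getCode().name())
          .build();
  instrumentAckedRequestCount.add(1, attrs);
  instrumentAckedRequestRows.add(rows, attrs);
  instrumentAckedRequestBytes.add(bytes, attrs);
}
```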
| private LongCounter instrumentErrorRequestCount;
| private LongCounter instrumentErrorRequestSize;
| private LongCounter instrumentErrorRequestRows;
| private static final List<Long> METRICS_LATENCY_BUCKETS =
Are these millis/micros/nanos?
Renamed this to METRICS_MILLISECONDS_LATENCY_BUCKETS.
|
| private void periodicallyReportOpenTelemetryMetrics() {
| Duration durationSinceLastRefresh = Duration.between(instantLastSentMetrics, Instant.now());
| if (durationSinceLastRefresh.compareTo(METRICS_UPDATE_INTERVAL) > 0) {
Are metrics updates really that costly on the producer side that you don't just update metrics at time of event? In OpenCensus, flushing/updates were mostly an exporter concern.
I am testing with an exporter to Google Cloud Monitoring and encountered "exceeded max frequency" errors. To resolve this, I have switched to updating the instruments only once every second, which I believe should be sufficient for our needs.
Upon further inspection, I narrowed the issue down to the frequency of the exporter. I have restored all metrics to be instrumented in real time.
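For reference, that rate limit is usually handled on the SDK/exporter side rather than in the instrumentation, so instruments can keep recording at event time. A minimal sketch, assuming the google-cloud-opentelemetry metric exporter and an illustrative 60-second interval:

```java
// The PeriodicMetricReader controls how often points are pushed to Cloud Monitoring;
// tuning its interval avoids "exceeded max frequency" errors without throttling
// the instrumentation itself.
SdkMeterProvider meterProvider =
    SdkMeterProvider.builder()
        .registerMetricReader(
            PeriodicMetricReader.builder(
                    GoogleCloudMetricExporter.createWithDefaultConfiguration())
                .setInterval(Duration.ofSeconds(60))
                .build())
        .build();
```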
Force-pushed from 3caa290 to b6b30cf
Force-pushed from b6b30cf to 796ae3e
Force-pushed from 0b87d97 to ad164fb
| private LongCounter instrumentIncomingRequestSize;
| private LongCounter instrumentIncomingRequestRows;
| private static final List<Long> METRICS_MILLISECONDS_LATENCY_BUCKETS =
| ImmutableList.of(0L, 50L, 100L, 500L, 1000L, 5000L, 10000L, 20000L, 30000L, 60000L, 120000L);
@GaoleMeng Do these buckets look good to you? Do we need a bucket at 50L? Maybe add a 2000L?
This is too sparse. In the backend we use power-of-1.5 buckets, i.e. 1, 1.5, 1.5^2, 1.5^3, ... milliseconds. We once used powers of 4, but found that too sparse, so we reduced it to powers of 1.5. Could we do similar bucketing here?
The power of 1.5 sequence looks like this:
1, 2, 3, 5, 8, 11, 17, 26, 38, 58, 86, 130, 195, 292, 438, 657, 985, 1478, 2217, 3325, 4988, 7482, 11223, 16834, 25251, 37877, 56815, 85223, 127834, 191751, 287627, 431440, 647160, 970740, 1456110
Would it be useful to provide all of these buckets? Alternatively, we could just provide every other bucket, so the list looks like this:
1, 3, 8, 17, 38, 86, 195, 438, 985, 2217, 4988, 11223, 25251, 56815, 127834, 287627, 647160, 1456110
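A list like that does not have to be maintained by hand; a small sketch (using Guava's ImmutableList and java.util.LinkedHashSet) that produces the thinned sequence above, i.e. powers of 1.5 rounded to whole milliseconds, de-duplicated, with every other value kept:

```java
// Not the PR's code, just an illustration of how the boundaries above are derived.
static ImmutableList<Long> powerOf15Buckets(long maxMillis) {
  Set<Long> rounded = new LinkedHashSet<>();
  for (double boundary = 1; boundary <= maxMillis; boundary *= 1.5) {
    rounded.add(Math.round(boundary)); // 1, 2, 3, 5, 8, 11, 17, ...
  }
  ImmutableList.Builder<Long> thinned = ImmutableList.builder();
  int i = 0;
  for (Long value : rounded) {
    if (i++ % 2 == 0) { // keep every other boundary to halve the list
      thinned.add(value);
    }
  }
  return thinned.build();
}
```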
| if (!tableName.isEmpty()) {
| builder.put(telemetryKeyTableId, tableName);
| }
| builder.put(telemetryKeyWriterId, writerId);
Add some comment to buildOpenTelemetryAttributes, what kind of attributes it is building and does this apply to all metrics?
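Something along these lines would cover it (the Javadoc text is a suggested wording; the method body mirrors the quoted diff):

```java
/**
 * Builds the attribute set attached to every metric recorded by this writer:
 * the destination table id (when the writer was created with a table name) and
 * the writer_id identifying this stream connection.
 */
private Attributes buildOpenTelemetryAttributes() {
  AttributesBuilder builder = Attributes.builder();
  if (!tableName.isEmpty()) {
    builder.put(telemetryKeyTableId, tableName);
  }
  builder.put(telemetryKeyWriterId, writerId);
  return builder.build();
}
```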
| ImmutableList.of(0L, 50L, 100L, 500L, 1000L, 5000L, 10000L, 20000L, 30000L, 60000L, 120000L);
|
| private static final class OpenTelemetryMetrics {
| private LongCounter instrumentSentRequestCount;
Discussed with Gaole: we think the separate Sent and Ack counts may not make a significant difference. Let's just record Ack for now, for simplicity.
| writeMeter
| .counterBuilder("append_requests")
| .setDescription("Counts number of incoming requests")
| .counterBuilder("append_requests_acked")
This can be a TODO: I am wondering if it is possible to add a Retry attribute to the metric.
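A purely hypothetical sketch of what that could look like if it is added later (the is_retry key and helper below are not part of this PR):

```java
// Tag acked requests with whether they were retries, alongside error_code, so
// retried appends can be filtered out or compared later.
static final AttributeKey<Boolean> telemetryKeyIsRetry = AttributeKey.booleanKey("is_retry");

private void recordAppendAcked(boolean wasRetry, String errorCode) {
  instrumentAckedRequestCount.add(
      1,
      getTelemetryAttributes().toBuilder()
          .put(telemetryKeyErrorCode, errorCode)
          .put(telemetryKeyIsRetry, wasRetry)
          .build());
}
```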
Force-pushed from ad164fb to 73c9193
yirutang left a comment:
LGTM, please address the bucket length issue.
Force-pushed from 73c9193 to 0572a47
| // Buckets are based on a list of 1.5 ^ n
| private static final List<Long> METRICS_MILLISECONDS_LATENCY_BUCKETS =
| ImmutableList.of(
| 1L, 3L, 8L, 17L, 38L, 86L, 195L, 438L, 985L, 2217L, 4988L, 11223L, 25251L, 56815L,
Do we need to have 1L, 3L and 8L?
I have removed these. I now start at 0 (as the lowest bucket boundary) and end at 647160, which represents about 10 minutes.
Also add network latency, queue length and error counts. The metrics (other than error counts) are now reported periodically, every second.
Force-pushed from 0572a47 to 8ac0ed2
…pis#2532) Also add network latency, queue length and error counts. The metrics (other than error counts) are now reported periodically, every second.