otel: Export rate metrics as a single counter #5164
Conversation
09cd6ff to 130844d
```diff
@@ -53,12 +54,12 @@ func (o *Output) StopWithTestError(_ error) error {
 	o.logger.Debug("Stopping...")
 	defer o.logger.Debug("Stopped!")

 	o.periodicFlusher.Stop()
```
We have to stop the ingestion before stopping the flushing.
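For context, a minimal sketch of the ordering being described; the struct and field names are hypothetical and simplified, not the actual k6 output code:

```go
package sketch

import "github.com/sirupsen/logrus"

// stopper is a minimal stand-in for the components being shut down.
type stopper interface{ Stop() }

// Output is a simplified, hypothetical model of the otel output; the real
// struct in k6 has more fields and different names.
type Output struct {
	logger          logrus.FieldLogger
	sampleIngester  stopper // hypothetical: whatever consumes incoming samples
	periodicFlusher stopper // flushes buffered samples on a ticker
}

// StopWithTestError illustrates the ordering discussed above: stop ingesting
// new samples first, then stop the flusher, so the final flush can drain the
// buffer without racing against new writes.
func (o *Output) StopWithTestError(_ error) error {
	o.logger.Debug("Stopping...")
	defer o.logger.Debug("Stopped!")

	o.sampleIngester.Stop()  // 1. no new samples enter the buffer
	o.periodicFlusher.Stop() // 2. the final flush drains what is left
	return nil
}
```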
(Note this IS NOT blocking!)
I'm not sure whether it's related to this or not, but from time to time I see the following error:
```
INFO[0010] 2025/09/16 15:39:10 failed to upload metrics: context canceled: rpc error: code = Canceled desc = context canceled
```
This happens especially when using the old method, which I guess is more prone to fail because it needs to send two metrics instead of one.
Have you seen this before? Was that what made you change the order of these calls?
This is definitely NOT an issue introduced by your changes, because it happens with master as well, but I'm wondering whether there's something within this shutdown process we could do to completely avoid it.
Perhaps we should create an issue and investigate it later, and consider if we should include it as part of #5173.
Honestly, I'm not aware of it. Did you see it on local tests or across networks? I've run multiple local tests and never hit it.
I discovered the current bug mostly by writing the unit tests, because they are sensitive to whether the last batch has been correctly flushed or not.
The reported error smells like a bug related to bad management of the context in the recurring flush operation. It's a good idea to open a bug report, and we should tackle it as part of the graduation process.
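To make that hypothesis concrete, here is a minimal sketch (not the actual k6 code) of a recurring flusher where the final flush is detached from the already-canceled run context, which is the kind of context handling that could avoid the error above:

```go
package sketch

import (
	"context"
	"time"
)

// flusher is a hypothetical periodic flusher used only to illustrate the
// suspected failure mode: if the final flush reuses the run context, which is
// already canceled at shutdown, the exporter reports "context canceled" for
// the last upload.
type flusher struct {
	flush func(ctx context.Context) error
}

func (f *flusher) run(ctx context.Context, every time.Duration) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			_ = f.flush(ctx)
		case <-ctx.Done():
			// Final flush: detach from the canceled run context so the last
			// batch can still be uploaded (context.WithoutCancel, Go 1.21+).
			_ = f.flush(context.WithoutCancel(ctx))
			return
		}
	}
}
```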
```go
	default:
		o.logger.Warnf("metric %q has unsupported metric type", entry.Metric.Name)
		return fmt.Errorf("metric %q has unsupported metric type", entry.Metric.Name)
```
If this happens, it's an error. There isn't a way to create an untyped metric.
Good catch! 👏🏻
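For illustration, a sketch of the kind of dispatch under discussion; the helper name and branch bodies are hypothetical, and only the default-as-error behaviour mirrors the diff above:

```go
package sketch

import (
	"fmt"

	"go.k6.io/k6/metrics"
)

// dispatchSample is a hypothetical helper showing the shape of the switch
// being discussed: every concrete k6 metric type gets its own branch, so the
// default branch is only reachable for malformed data and is treated as an
// error rather than a warning.
func dispatchSample(m *metrics.Metric) error {
	switch m.Type {
	case metrics.Counter:
		// emit as an OTel counter
	case metrics.Gauge:
		// emit as an OTel gauge
	case metrics.Trend:
		// emit as an OTel histogram
	case metrics.Rate:
		// emit as a single counter with the zero/nonzero condition attribute
	default:
		return fmt.Errorf("metric %q has unsupported metric type", m.Name)
	}
	return nil
}
```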
@oleiade @inancgumus I've decided to be less brutal and make this change a bit more gradual. We will have a feature flag for the next release, so if we overlooked any topic, or if users need more time, they can still use the previous method without breaking their workflow.
8f35f70 to ae9506c
joanlopez left a comment
I haven't carefully looked at all the changes in this PR, and I haven't had time to test it end-to-end yet either, but I'm wondering how this approach may help with one of the problems mentioned in issue #4573 (the one I personally reported), which is the lack of initialization, also discussed within k6 and around thresholds: #4787.
Beyond that, and thinking about the practicality of this approach, I'm not quite sure it really brings much value. Although it's true that using labels is likely a better approach than having two "different" counters, I'm not quite sure how, generally speaking, having a statically defined attribute with a pair of values is going to improve the user experience overall.
Let's suppose we have a metric that tracks (using Prometheus notation for simplicity):
- the "rate" of HTTP errors, like: http_requests_failed{type="zero|non-zero"}
- the "rate" of hit/miss in cache, like: cache_reads_miss{type="zero|non-zero"}

Do we really think that this really improves the experience much, versus:
- HTTP requests failed: http_requests_errors_occurred{} 0 and http_requests_errors_total{} 0
- Cache hits/misses: cache_reads_miss_occurred{} 0 and cache_reads_miss_total{} 0
? 🤔
Because, despite that not being perfect (I don't like it a lot either), semantically the former (the current one) looks a bit nicer to me than the new one. And note that this is NOT only a "naming" thing, because if you want to track "all" vs "some" (as "occurred" vs "total" aims to), I think you shouldn't use OTEL attributes, whose purpose I think is very close to Prometheus labels, usually used to categorize rather than to track rates (i.e. in what % of the total amount of events did X happen?).
As said, I don't like either option, and I don't have a better proposal, but I have doubts about changing what already exists in favor of something that doesn't really look much better. Instead, perhaps we could try to gather feedback and knowledge from more experienced users who are also more familiar with OTEL.
```diff
@@ -5,7 +5,7 @@ networks:

 services:
   prometheus:
-    image: prom/prometheus:v3.2.0
+    image: prom/prometheus:v3.5.0
```
Thanks for keeping this file up-to-date. Do you think we can also get rid of the version attribute, as per https://docs.docker.com/reference/compose-file/version-and-name/?
Done
Rate metrics are now exported as a single counter with a label that can have two values: zero and nonzero.
3441fe5 to edea45f
examples/docker-compose/opentelemetry/grafana/dashboards/k6-prometheus.json
Outdated
edea45f to 1766f00
| "editorMode": "code", | ||
| "exemplar": false, | ||
| "expr": "k6_checks_occurred{testid=~\"$testid\"} / k6_checks_total{testid=~\"$testid\"}", | ||
| "expr": "rate(k6_checks_total{testid=~\"$testid\", condition=\"nonzero\"}[$__rate_interval]) / rate(k6_checks_total{testid=~\"$testid\"}[$__rate_interval])", |
I think this is wrong, as I have tested it with successful and unsuccessful checks and it still displays 100%.
It might be something along the lines of:
```promql
sum by (check) (k6_checks_total{testid=~"$testid", condition="nonzero"}) / sum by (check) (k6_checks_total{testid=~"$testid"})
```
Do we have a test that verifies this behavior (whether it's incorrect or not)?
@inancgumus Do you mean an automated or a manual test? In terms of automation, we don't have one. If you want to test it manually, you can go to examples/docker-compose/opentelemetry/*, which is the path where this dashboard lives.
Yeah, I basically tested this manually.
I'm not sure if there's an easy way to automate that, or if it's worth the effort of automating such a thing (probably not?).
Co-authored-by: Joan López de la Franca Beltran <5459617+joanlopez@users.noreply.github.com>
Co-authored-by: Joan López de la Franca Beltran <5459617+joanlopez@users.noreply.github.com>
Generally looks good, great job @codebien! 👏🏻
I left multiple (minor) comments, which aren't really blocking, except for the query in the dashboard, which I think is wrong and could be misleading.
In general, I'm quite satisfied with the approach, and it really solves the reported issue 👍🏻 Although I have to admit that condition="zero|nonzero" still resonates in my head as unlikely to be the best/most intuitive candidate.
In other words, I'm not super convinced that, if you asked a large population of k6 users whether k6_http_req_failed{condition="zero"} reflects the number of HTTP requests that failed or the number that did not, a vast majority would answer correctly.
But overall it's true that this fixes the underlying issue, so I'm 👍🏻 for merging it, as it improves the current status 🙌🏻
After an internal chat, we decided to stick with it. If we ever find a better candidate, we can always add the new attribute and eventually get rid of the old one 👍🏻
mstoykov left a comment
LGTM 👍
Co-authored-by: Mihail Stoykov <312246+mstoykov@users.noreply.github.com>
inancgumus left a comment
LGTM
What?
Rate metrics are now exported as a single counter with a label that can have two values: zero and nonzero.
Why?
Most of the time, a single metric should be enough to query the main stats for this kind of metric. We are aware that it sometimes requires a bit more complexity in the case of aggregation, but that seems acceptable.
It automatically solves the issue with the cold start of the metric, because we always expect a value for at least one of nonzero and zero.
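To make the new export shape concrete, here is a rough sketch of recording a rate sample as a single counter with a condition attribute using the OpenTelemetry Go API; the helper and the metric-creation flow are illustrative, not the exact k6 implementation:

```go
package sketch

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// recordRate is a hypothetical helper: each k6 rate sample becomes a single
// counter increment whose "condition" attribute is "nonzero" when the sample
// value is non-zero and "zero" otherwise. The rate is then the ratio of the
// condition="nonzero" series over the sum of both series.
// (In a real exporter the counter would be created once and reused.)
func recordRate(ctx context.Context, name string, value float64) error {
	counter, err := otel.Meter("k6").Int64Counter(name)
	if err != nil {
		return err
	}
	condition := "zero"
	if value != 0 {
		condition = "nonzero"
	}
	counter.Add(ctx, 1, metric.WithAttributes(attribute.String("condition", condition)))
	return nil
}
```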
Checklist
I have run the linter locally (make check) and all pass.
Checklist: Documentation (only for k6 maintainers and if relevant)
Please do not merge this PR until the following items are filled out.
Related PR(s)/Issue(s)
Closes #4573