Conversation

@codebien (Contributor) commented Sep 10, 2025

What?

Rate metrics are now exported as a single counter with a label that can have two values: zero and nonzero.

Why?

Most of the time, a single metric should be enough to query the main stats for this kind of metric. We are aware that aggregations may sometimes require a bit more complexity, but that seems acceptable.

It also automatically solves the cold-start issue for the metric, because we always expect at least one value across the zero and nonzero series.
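As a rough illustration (a hypothetical sketch, not the actual k6 output code), the mapping from Rate samples onto a single counter with a condition label could look like this:

```go
package main

import "fmt"

// condition returns the label value attached to each counter increment.
// A Rate sample maps to "nonzero" when its value is non-zero, "zero" otherwise.
func condition(sample float64) string {
	if sample != 0 {
		return "nonzero"
	}
	return "zero"
}

// rateCounter is a toy stand-in for a condition-labeled counter:
// it accumulates increments keyed by the condition label value.
type rateCounter map[string]int64

func (c rateCounter) record(sample float64) {
	c[condition(sample)]++
}

func main() {
	c := rateCounter{}
	// e.g. three passing samples and one failing one
	for _, s := range []float64{1, 0, 1, 1} {
		c.record(s)
	}
	total := c["zero"] + c["nonzero"]
	fmt.Printf("nonzero=%d zero=%d rate=%.2f\n",
		c["nonzero"], c["zero"], float64(c["nonzero"])/float64(total))
	// prints: nonzero=3 zero=1 rate=0.75
}
```

Because both label values share one counter, the "rate" is always recoverable as nonzero/(zero+nonzero), and at least one of the two series exists as soon as a single sample is recorded.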

Checklist

  • I have performed a self-review of my code.
  • I have commented on my code, particularly in hard-to-understand areas.
  • I have added tests for my changes.
  • I have run linter and tests locally (make check) and all pass.

Checklist: Documentation (only for k6 maintainers and if relevant)

Please do not merge this PR until the following items are filled out.

  • I have added the correct milestone and labels to the PR.
  • I have updated the release notes: link
  • I have updated or added an issue to the k6-documentation: grafana/k6-docs#NUMBER if applicable
  • I have updated or added an issue to the TypeScript definitions: grafana/k6-DefinitelyTyped#NUMBER if applicable

Related PR(s)/Issue(s)

Closes #4573

@codebien codebien self-assigned this Sep 10, 2025
@codebien codebien force-pushed the otel-rate-single-counter branch from 09cd6ff to 130844d on September 10, 2025 13:26
@@ -53,12 +54,12 @@ func (o *Output) StopWithTestError(_ error) error {
o.logger.Debug("Stopping...")
defer o.logger.Debug("Stopped!")

o.periodicFlusher.Stop()
Contributor Author:

We have to stop the ingestion before stopping the flushing.

Contributor:

(Note this IS NOT blocking! ⚠️ )

I'm not sure whether it's related to this or not, but from time to time I see the following error:

INFO[0010] 2025/09/16 15:39:10 failed to upload metrics: context canceled: rpc error: code = Canceled desc = context canceled 

especially when using the old method, which I guess is more prone to fail because it needs to send two metrics instead of one.

Have you seen this before? Was that what made you change the order of these calls?

This is definitely NOT an issue introduced by your changes, because it happens with master as well, but I'm wondering whether there's something within this shutdown process we could do to completely avoid it.

Perhaps we should create an issue and investigate it later, and consider if we should include it as part of #5173.

Contributor Author:

Honestly, I'm not aware of it. Did you see it in local tests or across networks? I've run multiple local tests and never hit it.

I discovered the current bug mostly by writing the unit tests, because they are sensitive to whether the last batch has been correctly flushed or not.

The error reported smells like a bug related to bad management of the context in the recurring flush operation. It's a good idea to open a bug report, and we should tackle it as part of the graduation process.

default:
o.logger.Warnf("metric %q has unsupported metric type", entry.Metric.Name)
return fmt.Errorf("metric %q has unsupported metric type", entry.Metric.Name)
Contributor Author:

If this happens, it's an error. There isn't a way to create an untyped metric.

Contributor:

Good catch! 👏🏻

@codebien codebien marked this pull request as ready for review September 10, 2025 16:12
@codebien codebien requested a review from a team as a code owner September 10, 2025 16:12
@codebien codebien requested review from inancgumus and oleiade and removed request for a team September 10, 2025 16:12
@codebien codebien added this to the v1.3.0 milestone Sep 10, 2025
@codebien codebien mentioned this pull request Sep 11, 2025
@codebien (Contributor Author):

@oleiade @inancgumus I've decided to be less brutal and make this change a bit more gradually. We will have a feature flag for the next release, so if we overlooked anything, or if users need more time, they can still use the previous method without breaking their workflows.

@codebien codebien force-pushed the otel-rate-single-counter branch from 8f35f70 to ae9506c on September 11, 2025 12:12
@codebien codebien temporarily deployed to azure-trusted-signing September 11, 2025 12:18 via GitHub Actions
@codebien codebien temporarily deployed to azure-trusted-signing September 11, 2025 12:19 via GitHub Actions
@joanlopez joanlopez self-requested a review September 12, 2025 08:03
@joanlopez (Contributor) left a comment:

I haven't carefully looked at all the changes in this PR, and I haven't had time to test it end-to-end yet either, but I'm wondering how this approach may help with one of the problems mentioned in issue #4573 (the one I personally reported), which is the lack of initialization, also discussed within k6 and around thresholds: #4787.

Beyond that, and thinking about the practicality of this approach, I'm not quite sure it really brings much value. Although it's true that using labels is likely a better approach than having two "different" counters, I'm not quite sure how, generally speaking, having a statically defined attribute with a pair of values is going to improve users' experience overall.

Let's suppose we have a metric that tracks (using Prometheus notation for simplicity):

  • the "rate" of HTTP errors, like: http_requests_failed{type="zero|non-zero"}
  • the "rate" of hit/miss in cache, like: cache_reads_miss{type="zero|non-zero"}

Do we really think this improves the experience that much, versus:

  • HTTP requests failed:
  • http_requests_errors_occurred{} 0
  • http_requests_errors_total{} 0
  • Cache hits/misses:
  • cache_reads_miss_occurred{} 0
  • cache_reads_miss_total{} 0

? 🤔

Because, despite that not being perfect (I don't like it a lot either), semantically the former (current) approach looks a bit nicer to me than the new one. And note that this is NOT only a "naming" thing, because if you want to track "all" vs "some" (as "occurred" vs "total" pursues), I think you shouldn't use OTEL attributes, whose purpose I think is very close to that of Prometheus labels, usually used to categorize rather than to track rates (i.e., in what % of the total amount of events did X happen?).

As said, I don't like either option, and I don't have a better proposal, but I have doubts about changing what already exists in favor of something that doesn't really look much better. Instead, perhaps we could try to gather feedback and knowledge from more experienced users who are also more familiar with OTEL.

@@ -5,7 +5,7 @@ networks:

services:
prometheus:
image: prom/prometheus:v3.2.0
image: prom/prometheus:v3.5.0
Contributor:

Thanks for keeping this file up-to-date. Do you think we can also get rid of the version attribute, as per https://docs.docker.com/reference/compose-file/version-and-name/?

Contributor Author:

Done

Rate metrics are now exported as a single counter with a label that can
have two values: zero and nonzero.
@codebien codebien force-pushed the otel-rate-single-counter branch 2 times, most recently from 3441fe5 to edea45f on September 15, 2025 17:24
@codebien codebien temporarily deployed to azure-trusted-signing September 15, 2025 17:30 via GitHub Actions
@codebien codebien temporarily deployed to azure-trusted-signing September 15, 2025 17:32 via GitHub Actions
@codebien codebien force-pushed the otel-rate-single-counter branch from edea45f to 1766f00 on September 16, 2025 12:39
@codebien codebien requested a review from joanlopez September 16, 2025 12:41
@codebien codebien temporarily deployed to azure-trusted-signing September 16, 2025 12:45 via GitHub Actions
@codebien codebien temporarily deployed to azure-trusted-signing September 16, 2025 12:46 via GitHub Actions
"editorMode": "code",
"exemplar": false,
"expr": "k6_checks_occurred{testid=~\"$testid\"} / k6_checks_total{testid=~\"$testid\"}",
"expr": "rate(k6_checks_total{testid=~\"$testid\", condition=\"nonzero\"}[$__rate_interval]) / rate(k6_checks_total{testid=~\"$testid\"}[$__rate_interval])",
Contributor:

I think this is wrong, as I have tested it with successful and unsuccessful checks and it still displays 100%.

Contributor:

It might be something along the lines of:

sum by (check) (k6_checks_total{testid=~"$testid", condition="nonzero"}) / sum by (check) (k6_checks_total{testid=~"$testid"})
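For intuition, the per-check aggregation suggested above can be sketched as follows (a hypothetical Go model over condition-labeled samples, not k6 code):

```go
package main

import "fmt"

// sample models one condition-labeled data point of k6_checks_total.
type sample struct {
	check     string
	condition string // "zero" or "nonzero"
	value     float64
}

// nonzeroShare mirrors the suggested PromQL:
// sum by (check) (…{condition="nonzero"}) / sum by (check) (…)
func nonzeroShare(samples []sample) map[string]float64 {
	nonzero := map[string]float64{}
	total := map[string]float64{}
	for _, s := range samples {
		total[s.check] += s.value
		if s.condition == "nonzero" {
			nonzero[s.check] += s.value
		}
	}
	out := map[string]float64{}
	for check, t := range total {
		out[check] = nonzero[check] / t
	}
	return out
}

func main() {
	samples := []sample{
		{"status_200", "nonzero", 90},
		{"status_200", "zero", 10},
	}
	fmt.Println(nonzeroShare(samples)["status_200"]) // prints 0.9
}
```

The key point is that the denominator sums both label values of the same counter, so the ratio stays well-defined per check even when one of the two series dominates.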

Contributor:

Do we have a test that verifies this behavior (whether it's incorrect or not)?

@codebien (Contributor Author) commented Sep 16, 2025:

@inancgumus Do you mean an automated or a manual test? In terms of automation, we don't have one. If you want to test it manually, you can go to examples/docker-compose/opentelemetry/*, which is the path where this dashboard lives.

Contributor:

Yeah, I basically tested this manually.

I'm not sure if there's an easy way to automate that, or if it's worth the effort of automating such a thing (probably not?).

codebien and others added 2 commits September 16, 2025 15:57
Co-authored-by: Joan López de la Franca Beltran <5459617+joanlopez@users.noreply.github.com>
Co-authored-by: Joan López de la Franca Beltran <5459617+joanlopez@users.noreply.github.com>
@joanlopez (Contributor) left a comment:

Generally looks good, great job @codebien! 👏🏻

I left multiple (minor) comments, which aren't really blocking, except for the query in the dashboard, which I think is wrong and could be misleading.

In general, I'm quite satisfied with the approach, and it really solves the issue reported 👍🏻 Although, I have to admit that condition="zero|nonzero" still resonates in my head as unlikely to be the best/most intuitive candidate.

In other words, I'm not super convinced that, if you asked a large population of k6 users whether k6_http_req_failed{condition="zero"} reflects the number of HTTP requests that failed, or the number that did not, a vast majority would answer correctly.

But overall it's true that this fixes the underlying issue, so I'm 👍🏻 for merging it, as it improves the current status 🙌🏻

@codebien codebien temporarily deployed to azure-trusted-signing September 16, 2025 14:03 via GitHub Actions
@codebien codebien temporarily deployed to azure-trusted-signing September 16, 2025 14:06 via GitHub Actions
@joanlopez (Contributor) commented Sep 16, 2025

In general, I'm quite satisfied with the approach, and it really solves the issue reported 👍🏻 Although, I have to admit that condition="zero|nonzero" still resonates in my head as unlikely to be the best/most intuitive candidate.

After an internal chat, we decided to stick with condition="zero|nonzero", as any other alternative evaluated so far is also confusing in certain conditions (note here that signs of positivity like positive|negative and success|failure would suffer the same problems as discussed in #2306 and in some other related issues/pull requests, so we stick to a boolean representation).

If we ever find a better candidate, we can always add the new attribute and eventually get rid of the old one 👍🏻

@oleiade oleiade modified the milestones: v1.3.0, v1.4.0 Sep 19, 2025
@mstoykov mstoykov self-requested a review September 24, 2025 08:10
mstoykov previously approved these changes Sep 25, 2025

@mstoykov (Contributor) left a comment:

LGTM 👍

Co-authored-by: Mihail Stoykov <312246+mstoykov@users.noreply.github.com>
@codebien codebien temporarily deployed to azure-trusted-signing September 25, 2025 08:59 via GitHub Actions
@codebien codebien temporarily deployed to azure-trusted-signing September 25, 2025 09:02 via GitHub Actions
@inancgumus (Contributor) left a comment:

LGTM

@codebien codebien merged commit 621b71e into master Sep 25, 2025
45 of 46 checks passed
@codebien codebien deleted the otel-rate-single-counter branch September 25, 2025 14:43
@codebien codebien added the breaking change for PRs that need to be mentioned in the breaking changes section of the release notes label Sep 25, 2025

Labels

breaking change for PRs that need to be mentioned in the breaking changes section of the release notes


Development

Successfully merging this pull request may close these issues.

Better handling k6's Rate metric for opentelemetry output

5 participants