Skip to content

Simplify gateway programming#50321

Merged
robmry merged 3 commits into
moby:masterfrom
robmry:simplify_gateway_programming
Jul 11, 2025
Merged

Simplify gateway programming#50321
robmry merged 3 commits into
moby:masterfrom
robmry:simplify_gateway_programming

Conversation

@robmry

@robmry robmry commented Jul 3, 2025

Copy link
Copy Markdown
Contributor

- What I did

For a long time, libnetwork's sbLeave/sbJoin have had logic to work out whether a container's default gateway changed because of an endpoint joining/leaving ... when the gateway changed, libnet notified its network drivers via calls to ProgramExternalConnectivity/RevokeExternalConnectivity, so the driver could add or remove host port mappings - revoking from old gateway endpoints, then programming new ones.

In that original code, only one endpoint could be the container's gateway - if there was a dual-stack endpoint, it'd be selected, otherwise it'd be IPv4-only. When a bridge driver's IPv4-only endpoint is acting as a container's gateway, it can proxy from host IPv6 addresses to container IPv4 addresses using the userland-proxy.

Now we have IPv6-only endpoints as well. So, when an IPv4-only endpoint with dual-stack port mappings is running, and an IPv6-only endpoint is added - the IPv4 endpoint has to stop its 6-to-4 proxying in order to let the IPv6 endpoint set up its own mappings. Tracking that in sbJoin/sbLeave, to work out when to program/revoke connectivity got complicated, that's 1832774.

(Extending that logic to understand that routed-mode networks should always have programmed-external-connectivity would further break the libnet/driver layering and make the libnet code even more convoluted. So, as well as simplifying the logic by putting it in the right place, this change enables that modification to routed-mode.)

- How I did it

Remove the complexity from sbJoin/sbLeave ...

  • libnet now selects IPv4 and IPv6 gateway endpoints, and tells affected network drivers about its decision (providing endpoint-ids for its selected gateways).
    • It's up to the driver to work out what it needs to do about it, if anything.
    • The bridge driver can now make incremental changes ...
  • When a proxying to from host IPv6 to an IPv4-only endpoint, and an IPv6 endpoint is added - rather than libnet telling the driver to drop and re-create its IPv4 mappings as single-stack, the bridge driver drops the 6-to-4 bindings (before the IPv6 endpoint is added in the same way as before.
    • In that case, when the IPv4 host port is from a range, adding or removing the IPv6 endpoint doesn't change the IPv4 host port binding (because it's not torn down and re-allocated).
  • RevokeExternalConnectivity is no longer needed and has been removed, it's the same as call to ProgramExternalConnectivity where the endpoint is not a gateway. So, libnet doesn't need to work out whether to Program/Revoke, it can just Program any endpoint affected by a join/leave.

For follow-up ...

Since 4f09af6, the bridge driver allocates the same host port for IPv4 and IPv6 for dual-stack bindings, when the endpoint is dual stack. It can do that because there's a call to ProgramExternalConnectivity that adds both bindings to a single endpoint (so, it sees the request and the same time, and can combine them into a single request to the port allocator). But, since IPv6-only endpoints were added, it hasn't been able to do that for containers that have two single-stack endpoints ... the bridge driver learns about them in unrelated calls to ProgramExternalConnectivity. Now, because ProgramExternalConnectivity supplies the endpoint ids of both gateways - when asked to set up bindings for a single-stack endpoint, the bridge driver can check whether the "other" endpoint already has a port allocated for dual-stack bindings. So, it can prefer that port. (If the second port is unavailable, maybe the user would want a new binding for both ports, but equally that might break things. They probably wouldn't want the network connect to fail. So, rather than make it too complicated, probably just make it a preference that'll work a lot of the time - it's possible to specify a host port if that's not good enough.)

- How to verify it

Existing tests, particularly the ones in integration/networking/port_mapping_linux_test.go.

- Human readable description for the release notes

@robmry robmry added this to the 29.0.0 milestone Jul 3, 2025
@robmry robmry self-assigned this Jul 3, 2025
@robmry robmry added area/networking Networking kind/refactor PR's that refactor, or clean-up code area/networking/d/bridge Networking area/networking/portmapping Networking labels Jul 3, 2025
@robmry robmry force-pushed the simplify_gateway_programming branch 4 times, most recently from cd8fab4 to 2276936 Compare July 3, 2025 16:27
@robmry robmry marked this pull request as ready for review July 3, 2025 17:10
return fmt.Errorf("failed to update bridge endpoint %.7s to store: %v", endpoint.id, err)
undo := func() []portBinding {
pbReq := make([]types.PortBinding, 0, len(ep.portMapping))
for _, pb := range ep.portMapping {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The outer scope doesn't capture ep, nor the original ep.portMapping. This is currently okay because ep.portMapping isn't overwritten until addPortMappings succeeds, but I think it's not reasonable to make trimPortBindings depend on the exact implementation of its caller.

Could you capture ep.portMapping at the start of trimPortBindings instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eek, yes - it should have been restoring toDrop, not the whole set of original mappings ... so I've made that change.

Comment thread libnetwork/endpoint.go
func epShortId(ep *Endpoint) string {
return stringid.TruncateID(epId(ep))
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make this an *Endpoint method?

(And it'd be great to suppress epId() but I'm not sure it's 100% safe until we fix data persistence issues.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it this way because it (and epId()) are there to cope with nil Endpoints, by just returning no id. Calling a method on a nil pointer can be a bit nerve wracking for the casual reader! So, could change it and handle the nil receiver ... but I think it's probably better not to?

Comment thread libnetwork/endpoint.go
}
}()
}
if !n.internal {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So now it's driver's responsibility to not map ports for an internal network? But I don't see any n.internal check in the bridge driver (neither existing or added by this PR) 🤔 What's preventing an internal bridge network from having port bindings set?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An endpoint in an internal network can't be selected as a gateway, so the bridge driver won't set up port bindings. (And it probably should be up to the driver to decide that.)

Comment thread libnetwork/driverapi/driverapi.go Outdated
Comment on lines +88 to +94
// Ids of the endpoints currently acting as the container's default gateway for
// IPv4 and IPv6 are passed as gw4Id/gw6Id. (Those endpoints may be managed by
// different network drivers. If there is no gateway, the id will be the
// empty string.)
//
// This method is called after Driver.Join, and when eid is or was equal to
// gw4Id or gw6Id, and there's a change.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's called after Driver.Join, and during Driver.Leave.

Maybe it'd be helpful to indicate that a sandbox with no gateway is a transient state that appears when the current gateway endpoint leaves the sandbox — and empty gwId can only occur in that case. WDYT?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's called after Driver.Join, and during Driver.Leave.

Before Driver.Leave (rather than during) ... but yes, updated.

Maybe it'd be helpful to indicate that a sandbox with no gateway is a transient state that appears when the current gateway endpoint leaves the sandbox — and empty gwId can only occur in that case. WDYT?

It's not a transient state for a container that's only connected to an internal network (or when there's no remaining network) ... and, just to note, this PR doesn't change that - before, there wouldn't have been a gateway between calls to RevokeExternalConnnectivity for the leaving endpoint and ProgramExternalConnectivity for the new one. So, I've added this - hopefully it captures the main point? ...

	// When an endpoint acting as a gateway is deleted, this function is called
	// with the leaving endpoint's id, and empty gateway ids (even if another
	// endpoint is present and will shortly be selected as the gateway).

Comment thread libnetwork/endpoint.go Outdated
log.G(ctx).WithError(err).Warn("driver failed revoking external connectivity on endpoint")
}
if ecd, ok := d.(driverapi.ExtConner); ok {
if err := ecd.ProgramExternalConnectivity(context.WithoutCancel(ctx), n.ID(), ep.ID(), nil, "", ""); err != nil {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd make more sense to always pass the sandbox labels.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed - the nil param disappears anyway in the next commit in this PR, because labels are only passed to Join now, not ProgramExternalConnectivity (the labels can't change after Join).

@robmry robmry force-pushed the simplify_gateway_programming branch 2 times, most recently from d7412ca to 7be5d13 Compare July 8, 2025 17:56

@corhere corhere left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused about the exact semantics of ProgramExternalConnectivity. Isn't the default gateway a property of the sandbox -- the iface of the next-hop for outgoing packets in the netns which do not match a more specific route? Does that mean that the (network id, endpoint id) tuple being passed into ProgramExternalConnectivity is semantically an indirect way of referencing the sandbox the endpoint is joined to, and that libnetwork is expected to call ProgramExternalConnectivity on all endpoints attached to a sandbox when the default gateway changes?

It would make more intuitive sense to me (if my assumptions above are correct) to have an API shaped more like

interface {
    HandleGatewayChange(ctx context.Context, sboxKey, gw4Id, gw6Id string) error
}

which libnetwork calls once per driver when the sandbox's gateway changes. Since the drivers all know which sandbox each endpoint is joined to they would have no problem iterating through all of their endpoints that are joined to some sandbox. And it makes the drivers more loosely coupled to the implementation details of libnetwork by being descriptive rather than prescriptive (this event happened btw, do whatever is necessary vs. do this specific thing because the framework said so.) It looks rather suspiciously like ProgramExternalConnectivity is only called for the gateway endpoints, simply because that's what the bridge driver required.

Comment thread libnetwork/driverapi/driverapi.go
Comment thread libnetwork/driverapi/driverapi.go
Comment on lines +93 to +95
// When an endpoint acting as a gateway is deleted, this function is called
// with that endpoint's id, and empty gateway ids (even if another endpoint
// is present and will shortly be selected as the gateway).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that the gateway IDs may be for endpoints belonging to other drivers makes this statement ambiguous. What happens if the endpoint for gw4Id, belonging to another driver, is deleted? I would intuitively expect libnetwork to call ProgramExternalConnectivity(ctx, nid, eid, "", "") but the interface contract reads like libnetwork would be expected to pass gw4Id as the eid parameter!

How helpful are these semantics in the case where the endpoint being deleted is the gateway: eid == gw4Id || eid == gw6Id? DeleteEndpoint(nid, eid) will be invoked in the driver regardless. What are drivers expected to do if DeleteEndpoint is called on an endpoint that is the default gateway for a network of the driver's type?

Are atomic swaps of the gateways permitted? E.g.

ProgramExternalConnectivity(ctx, nid, eid, "gateway1", "")
ProgramExternalConnectivity(ctx, nid, eid, "gateway2", "")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that the gateway IDs may be for endpoints belonging to other drivers makes this statement ambiguous. What happens if the endpoint for gw4Id, belonging to another driver, is deleted? I would intuitively expect libnetwork to call ProgramExternalConnectivity(ctx, nid, eid, "", "") but the interface contract reads like libnetwork would be expected to pass gw4Id as the eid parameter!

Added "in eid" to "with that endpoint's id".

How helpful are these semantics in the case where the endpoint being deleted is the gateway: eid == gw4Id || eid == gw6Id? DeleteEndpoint(nid, eid) will be invoked in the driver regardless. What are drivers expected to do if DeleteEndpoint is called on an endpoint that is the default gateway for a network of the driver's type?

RevokeExternalConnectivity would have been called for the leaving endpoint before, now it's ProgramExternalConnectivity (with no gateway ids), there's no other change.

A follow-up could take it further and get rid of that call by making sure the driver cleans up port mappings during DeleteEndpoint.

Are atomic swaps of the gateways permitted? E.g.

ProgramExternalConnectivity(ctx, nid, eid, "gateway1", "")
ProgramExternalConnectivity(ctx, nid, eid, "gateway2", "")

Not currently, without a call to revoke the gateway-ness of endpoint "gateway1" - just as a call RevokeExternalConnectivity would have been needed before this change. (Its godoc comment says "This method is called after Driver.Join, before Driver.Leave, and when eid is or was equal to gw4Id or gw6Id, and there's a change.")

Again, a follow-up could take this change further, continuing with cleanup of driverapi.

return err
// trimPortBindings compares pbmReq with the current port bindings, and removes
// port bindings that are no longer required. If anything goes wrong, an error will
// be logged, but there's no error return - try to continue, because failure to

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But there is an error return?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment was wrong (old), I've removed it.

(releasePortBindings should probably roll-back its changes on error, but that's not changed in this PR.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like your edit to the comment has made it into your PR branch yet.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh! Thank you - it's in now.

Comment thread libnetwork/endpoint.go Outdated
@robmry

robmry commented Jul 9, 2025

Copy link
Copy Markdown
Contributor Author

I'm a little confused about the exact semantics of ProgramExternalConnectivity. Isn't the default gateway a property of the sandbox -- the iface of the next-hop for outgoing packets in the netns which do not match a more specific route? Does that mean that the (network id, endpoint id) tuple being passed into ProgramExternalConnectivity is semantically an indirect way of referencing the sandbox the endpoint is joined to, and that libnetwork is expected to call ProgramExternalConnectivity on all endpoints attached to a sandbox when the default gateway changes?

Yes and no, mostly no ... a container might be connected to several bridge networks, meaning the Sandbox will have several bridge driver Endpoints - and one of those will be selected as the gateway for each address family.

So, for the bridge driver to identify the gateway, it needs the endpoint id (and which network that endpoint it in). If we only told the bridge driver the sandbox has a gateway, it still wouldn't know which endpoint to use.

Oh, I misread - you'd still pass in the endpoint ids, but you'd just replace endpoint/network ids with sandbox id. (I don't think it would make any difference, except maybe the driver having to track the relationship between sandbox/network/endpoint slightly differently) ...

(The driver could be told sandbox id and network instead of endpoint id, that wouldn't give it any more or less information. Or, rather than give Endpoints unique ids, they could be identified by sandbox id plus network id - I've not looked, but I think that would be quite a wide ranging change with little benefit.)

It would make more intuitive sense to me (if my assumptions above are correct) to have an API shaped more like

interface {
    HandleGatewayChange(ctx context.Context, sboxKey, gw4Id, gw6Id string) error
}

which libnetwork calls once per driver when the sandbox's gateway changes. Since the drivers all know which sandbox each endpoint is joined to they would have no problem iterating through all of their endpoints that are joined to some sandbox. And it makes the drivers more loosely coupled to the implementation details of libnetwork by being descriptive rather than prescriptive (this event happened btw, do whatever is necessary vs. do this specific thing because the framework said so.) It looks rather suspiciously like ProgramExternalConnectivity is only called for the gateway endpoints, simply because that's what the bridge driver required.

You're suggesting taking this change slightly further - it already does a lot of that decoupling ... libnetwork no longer needs to care what the driver does when the gateway-ness of one of its Endpoints changes, it just tells the driver when there's a change in its affected Endpoints.

Having the driver work out which of its endpoints are affected (by pushing down the gateway-ness tracking into each of the drivers) probably would be an improvement. But, there's no pressing need ... with this change, the layering is already better, and this PR along with #50226 achieve what they set out to achieve - which is enabling the change to make routed-mode endpoints open, even when the endpoint is not selected as a gateway (these PRs are split out of #50140, which was stuck in the review queue for a month or so).

So, let's leave that for a follow-up.

@robmry robmry force-pushed the simplify_gateway_programming branch from 7be5d13 to f73b550 Compare July 9, 2025 09:32

@corhere corhere left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for confirming my understanding! I wholeheartedly agree with leaving further improvements to follow-ups.

return err
// trimPortBindings compares pbmReq with the current port bindings, and removes
// port bindings that are no longer required. If anything goes wrong, an error will
// be logged, but there's no error return - try to continue, because failure to

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like your edit to the comment has made it into your PR branch yet.

robmry added 3 commits July 9, 2025 19:42
Change the semantics of ProgramExternalConnectivity, libnet
can now call it whenever an endpoint is selected or deselected
as a container's gateway endpoint.

It's the driver's responsibility to remember what bindings it's
set up, and to work out what needs to change.

So, calling ProgramExternalConnectivity to tell the driver
an endpoint is no longer a gateway has the same effect as
RevokeExternalConnectivity - bindings need to be removed.

That means libnet no longer needs to work out whether to
Program/Revoke, it can just call ProgramExternalConnectivity.
RevokeExternalConnectivity has been removed.

Signed-off-by: Rob Murray <rob.murray@docker.com>
The same sandbox options are passed to Join.

Signed-off-by: Rob Murray <rob.murray@docker.com>
Legacy links were set up by ProgramExternalConnectivity, but
removed by Leave (rather than RevokeExternalConnectivity). The
options needed by legacy links are all available in Join, and
Join will only be called once per Endpoint. So, create legacy
links there.

Signed-off-by: Rob Murray <rob.murray@docker.com>
@robmry robmry force-pushed the simplify_gateway_programming branch from f73b550 to 30b9480 Compare July 9, 2025 18:42
@robmry

robmry commented Jul 11, 2025

Copy link
Copy Markdown
Contributor Author

windows-2025 / run (graphdriver) / integration-test-report (graphdriver, builtin) (pull_request) is failing - but, unrelated ... I'll get this merged.

@robmry robmry merged commit 0d189dd into moby:master Jul 11, 2025
251 of 256 checks passed
@robmry robmry deleted the simplify_gateway_programming branch July 21, 2025 17:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/networking/d/bridge Networking area/networking/portmapping Networking area/networking Networking kind/refactor PR's that refactor, or clean-up code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants