-
Notifications
You must be signed in to change notification settings - Fork 4k
Description
What version of gRPC-Java are you using?
1.63.0
What is your environment?
Linux Ubuntu 20.04
Openjdk version "21.0.2" 2024-01-16 LTS
What did you expect to see?
No Uncaught exception in the SynchronizationContext. Panic! java.lang.NullPointerException: Cannot invoke "io.grpc.internal.PickFirstLeafLoadBalancer$SubchannelData.getSubchannel()" because "subchannelData" is null errors in client requests.
What did you see instead?
In logs I see NPE errors in PickFirstLeafLoadBalancer, causing client requests to fail.
These errors are intermittent and happen on some apparently random VMs.
After some investigation I found temporary DNS look up failures that happened before these NPE started to appear.
DNS issues is something happening on our side, but I believe client should be able to handle temporary resolution failures.
Logs are provided below:
2024-05-20 17:28:13,099 WARN [grpc-default-executor-8397] i.g.i.ManagedChannelImpl: [Channel<17>: (some.internal.host.net:1234)] Failed to resolve name. status=Status{code=UNAVAILABLE, description=Unable to resolve host some.internal.host.net, cause=java.lang.RuntimeException: java.net.UnknownHostException: some.internal.host.net
at io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:223)
at io.grpc.internal.DnsNameResolver.doResolve(DnsNameResolver.java:282)
at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:318)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: java.net.UnknownHostException: some.internal.host.net.net
at java.base/java.net.InetAddress$CachedLookup.get(InetAddress.java:988)
at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1818)
at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1688)
at io.grpc.internal.DnsNameResolver$JdkAddressResolver.resolveAddress(DnsNameResolver.java:632)
at io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:219)
... 5 more
}
2024-05-20 17:28:18,481 ERROR [grpc-default-executor-8395] i.g.i.ManagedChannelImpl: [Channel<17>: (some.internal.host.net:1234)] Uncaught exception in the SynchronizationContext. Panic!
java.lang.NullPointerException: Cannot invoke "io.grpc.internal.PickFirstLeafLoadBalancer$SubchannelData.getSubchannel()" because "subchannelData" is null
at io.grpc.internal.PickFirstLeafLoadBalancer.acceptResolvedAddresses(PickFirstLeafLoadBalancer.java:138)
at io.grpc.internal.AutoConfiguredLoadBalancerFactory$AutoConfiguredLoadBalancer.tryAcceptResolvedAddresses(AutoConfiguredLoadBalancerFactory.java:142)
at io.grpc.internal.ManagedChannelImpl$NameResolverListener$1NamesResolved.run(ManagedChannelImpl.java:1877)
at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:94)
at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:126)
at io.grpc.internal.ManagedChannelImpl$NameResolverListener.onResult(ManagedChannelImpl.java:1891)
at io.grpc.internal.RetryingNameResolver$RetryingListener.onResult(RetryingNameResolver.java:98)
at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:333)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
Steps to reproduce the bug
I don't have simple steps to reproduce unfortunately, but I think I found the call sequence causing the issue:
io.grpc.internal.DnsNameResolver#resolveAddressesthrowsRuntimeExceptionwithUnknownHostExceptionas a causeio.grpc.internal.DnsNameResolver#doResolvecatches it and returnsresult.error = Status.UNAVAILABLEio.grpc.internal.DnsNameResolver.Resolve#runthen callssavedListener.onError(result.error)
wheresavedListenerisio.grpc.internal.ManagedChannelImpl.NameResolverListenerio.grpc.internal.ManagedChannelImpl.NameResolverListener#onErroreffectively callshelper.lb.handleNameResolutionError(error)
wherelbisio.grpc.internal.PickFirstLeafLoadBalancerio.grpc.internal.PickFirstLeafLoadBalancer#handleNameResolutionErrorfinally callssubchannels.clear(), but does nothing withaddressIndex- as a result we have
subchannelscleared, butaddressIndexstays intact
so we end up with inconsistent state betweensubchannelsandaddressIndex:addressIndex.seekTo(previousAddress)is true, butsubchannels.get(previousAddress)is false
this inconsistency is causing the above NPE