
Fix NullPointerException in transport trace logger #132243


Open: howardhuanghua wants to merge 5 commits into main
Conversation

@howardhuanghua (Contributor) commented Jul 31, 2025

When trace-level logging is enabled, a node might be disconnected from the cluster because an NPE causes the transport connection between the data node and the master node to be closed.

An InboundMessage printed by TransportLogger might throw an NPE in the format function, because the message content can be null if another node sends an abnormal exception response.

There is also no good reason to close the connection because of a logging exception, so with this commit we catch all exceptions (rather than just IOException).
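
For illustration, here is a minimal, self-contained sketch of the catch-all pattern this commit applies. The class, logger, and method names below are placeholders rather than the actual Elasticsearch code; the point is that any exception thrown while formatting the trace line is caught and downgraded to a warning, so a logging failure can never close the transport connection.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative sketch only; the real logic lives in org.elasticsearch.transport.TransportLogger.
final class TraceLoggingSketch {
    private static final Logger logger = Logger.getLogger("transport.tracer");

    static void logInbound(Object message) {
        if (logger.isLoggable(Level.FINEST)) {
            try {
                logger.finest(format(message));
            } catch (Exception e) { // previously only IOException was caught here
                logger.log(Level.WARNING, "an exception occurred formatting an inbound trace message", e);
            }
        }
    }

    private static String format(Object message) throws IOException {
        // In the real code this inspected the message content, which can be null
        // for an abnormal exception response; that is what triggered the NPE.
        return "READ: " + message.toString();
    }
}
```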

@elasticsearchmachine added the v9.2.0, needs:triage (Requires assignment of a team area label), and external-contributor (Pull request authored by a developer outside the Elasticsearch team) labels on Jul 31, 2025
@DaveCTurner (Contributor) left a comment

I think we should also fix the NPE you found. I see no reason to even call openOrGetStreamInput() within this format method.
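
As a rough illustration of that suggestion (the method and parameter names here are invented, not the actual TransportLogger.format signature), the trace line can be built from fields that are always present, with an explicit guard for a null content, so there is no need to open a stream input at all:

```java
// Hypothetical sketch, not the real Elasticsearch implementation.
final class FormatSketch {
    static String format(String channelDescription, long requestId, boolean isRequest, Object content) {
        return channelDescription
            + " [request id: " + requestId
            + ", type: " + (isRequest ? "request" : "response")
            + ", content: " + (content == null ? "<none>" : content.getClass().getSimpleName())
            + ']';
    }
}
```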

@DaveCTurner added the >bug and :Distributed Coordination/Network (Http and internode communication implementations) labels and removed the needs:triage (Requires assignment of a team area label) label on Jul 31, 2025
@elasticsearchmachine added the Team:Distributed Coordination (Meta label for Distributed Coordination team) label on Jul 31, 2025
@elasticsearchmachine (Collaborator) commented

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@howardhuanghua (Contributor, PR author) commented

> I think we should also fix the NPE you found. I see no reason to even call openOrGetStreamInput() within this format method.

Yes, I have also removed the now-unused openOrGetStreamInput() call and the CompressorFactory usage.

@DaveCTurner (Contributor) left a comment

Looks good, could we also have a test covering this case in org.elasticsearch.transport.TransportLoggerTests?

@howardhuanghua (Contributor, PR author) commented

> TransportLoggerTests

Sure, I am going to add a test case.
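
For reference, here is a minimal JUnit-style sketch of the property such a test would exercise; the real coverage belongs in org.elasticsearch.transport.TransportLoggerTests and would drive the actual TransportLogger, whereas the describe helper below is just a stand-in for the guarded formatting logic. The property is simply that formatting a message whose content is null must not throw.

```java
import static org.junit.Assert.assertEquals;

import org.junit.Test;

// Stand-in test sketch; the real coverage should live in TransportLoggerTests.
public class NullContentFormattingTests {

    // Placeholder for the guarded formatting logic sketched earlier.
    private static String describe(Object content) {
        return "READ: " + (content == null ? "<no content>" : content.toString());
    }

    @Test
    public void testNullContentDoesNotThrow() {
        assertEquals("READ: <no content>", describe(null));
    }
}
```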

@DaveCTurner self-assigned this on Jul 31, 2025
@DaveCTurner (Contributor) left a comment

LGTM

@DaveCTurner added the auto-merge-without-approval (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!) label on Jul 31, 2025
@DaveCTurner (Contributor) commented

Original report:

Issue

When trace-level logging is enabled, a node might be disconnected from the cluster because an NPE causes the transport connection between the data node and the master node to be closed.

An InboundMessage printed by TransportLogger might throw an NPE in the format function, because the message content can be null if another node sends an abnormal exception response.

https://github.com/elastic/elasticsearch/blob/fe4a5237edd7503029390f0cb81c9b40ee44fea3/server/src/main/java/org/elasticsearch/transport/InboundMessage.java#L103C27-L103C34

[2025-07-28T16:40:17,554] [WARN ] [o.e.t.TcpTransport] [node] exception caught on transport layer [Netty4TcpChannel{localAddress=/10.0.0.1:9300, remoteAddress=/10.0.0.2:36424}], closing connection
java.lang.NullPointerException: null
at org.elasticsearch.transport.InboundMessage.openOrGetStreamInput(InboundMessage.java:100) ~[elasticsearch-7.10.1.jar:7.10.1]
at org.elasticsearch.transport.TransportLogger.format(TransportLogger.java:140) ~[elasticsearch-7.10.1.jar:7.10.1]
at org.elasticsearch.transport.TransportLogger.logInboundMessage(TransportLogger.java:52) ~[elasticsearch-7.10.1.jar:7.10.1]
at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:90) ~[elasticsearch-7.10.1.jar:7.10.1]
at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:700) ~[elasticsearch-7.10.1.jar:7.10.1]
at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:142) ~[elasticsearch-7.10.1.jar:7.10.1]
at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:117) ~[elasticsearch-7.10.1.jar:7.10.1]
at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:82) ~[elasticsearch-7.10.1.jar:7.10.1]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:271) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1518) ~[?:?]
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1267) ~[?:?]
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1314) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:501) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:440) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]

Resolve

We should only print a warning log instead of letting the NPE propagate and cause the connection between the nodes to be closed, so we catch all exceptions instead of just IOException.

@DaveCTurner changed the title from "Fix transport logger trace level log NPE cause node-left issue." to "Fix NullPointerException in transport trace logger" on Jul 31, 2025