Conversation

@yomipq yomipq commented Apr 27, 2025

This PR fixes #1873

In Amazon Keyspaces, system.local returns the localhost address, and system.peers returns a host with the same hostID as the local host. This causes a connection issue in refreshRing(): the host address is overwritten with the localhost address, and the duplicate hostID results in the "cannot find host" error.

This commit fixes the issue by building a map, keyed by host ID, from the slice of hosts. The same approach is used in func (s *Session) init(): https://github.com/apache/cassandra-gocql-driver/blob/trunk/session.go#L272-L275

host_source.go Outdated
Comment on lines 755 to 759
hostMap := make(map[string]*HostInfo, len(hosts))
for _, host := range hosts {
	hostMap[host.HostID()] = host
}

Contributor

@dkropachev dkropachev Jun 2, 2025

I don't think this solution is straightforward: it relies on r.GetHosts() returning hosts in a certain order, and if that order changes it stops working.
It would be better to update GetHosts to ignore hosts that are already in the list.
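
The ordering concern can be illustrated with a minimal sketch. The `host` type and `byHostID` function here are hypothetical stand-ins, not gocql's `HostInfo` or the PR's code; the point is only that when two entries share a host ID, whichever comes last in the slice wins.

```go
package main

import "fmt"

// host is a hypothetical, simplified stand-in for gocql's HostInfo.
type host struct {
	ID   string
	Addr string
}

// byHostID builds a map keyed by host ID. When two entries share an ID,
// the one that appears later in the slice overwrites the earlier one.
func byHostID(hosts []host) map[string]host {
	m := make(map[string]host, len(hosts))
	for _, h := range hosts {
		m[h.ID] = h
	}
	return m
}

func main() {
	local := host{ID: "a", Addr: "127.0.0.1"} // as reported by system.local
	peer := host{ID: "a", Addr: "10.0.0.5"}   // same host ID, from system.peers

	// Local first: the peer entry wins.
	fmt.Println(byHostID([]host{local, peer})["a"].Addr) // 10.0.0.5
	// Peer first: the localhost entry wins.
	fmt.Println(byHostID([]host{peer, local})["a"].Addr) // 127.0.0.1
}
```

The result depends entirely on slice order, which is why filtering duplicates at the source is less fragile.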

Author

Thank you for reviewing. I will try to fix GetHosts().

Author

I updated GetHosts() to ignore localHost if a peer with the same host ID exists.
Could you please confirm?

Contributor

Looks good. Can you please add a comment explaining why this code is there, mentioning the issue number and Amazon Keyspaces?

Author

Yes, I added the comment.

In Amazon Keyspaces, system.local returns the localhost address and
system.peers returns a host that contains the same hostID as localhost.
This causes the connection issue in refreshRing(), where host address is
overwritten with the localhost address and a host with the same hostID
results in the "cannot find host" error.

This commit fixes GetHosts() so that it ignores localhost if the same
hostID is included in peerHosts.
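
A rough sketch of the behavior this commit message describes, under stated assumptions: `host` and `mergeHosts` are hypothetical names used for illustration and are not the actual change to GetHosts().

```go
package main

import "fmt"

// host is a hypothetical, simplified stand-in for gocql's HostInfo.
type host struct {
	ID   string
	Addr string
}

// mergeHosts returns the peers alone when one of them already carries the
// local host's ID; otherwise it prepends the local host to the peers.
func mergeHosts(local *host, peers []*host) []*host {
	for _, p := range peers {
		if p.ID == local.ID {
			// A peer row already represents the local node, so the
			// localhost entry from system.local is dropped.
			return peers
		}
	}
	return append([]*host{local}, peers...)
}

func main() {
	local := &host{ID: "a", Addr: "127.0.0.1"}
	peers := []*host{{ID: "a", Addr: "10.0.0.5"}, {ID: "b", Addr: "10.0.0.6"}}

	for _, h := range mergeHosts(local, peers) {
		fmt.Println(h.ID, h.Addr)
	}
}
```

Unlike the map-based version, the outcome here does not depend on the order in which hosts are returned.
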
@jameshartig
Contributor

This needs a test.

Also, this addresses one of the issues but doesn't address the issue with the control connection querying the local table and getting back the wrong information. See #1873 (comment) for more details.

I was going to suggest that we do what that comment suggests and call refreshRing from within setupConn, but that would cause duplicate host lookups on initialization and would not respect DisableInitialHostLookup if that's set.

The problem with setupConn is that it'll end up overwriting the host information with incorrect info since the local table has wrong information in it. It sounds like other drivers just do a full ring refresh but I don't think we can do that in this case. Should we just be ignoring some information in addOrUpdate? @joao-r-reis thoughts?

	} else {
		hosts = append([]*HostInfo{localHost}, peerHosts...)
	}

	var partitioner string
	if len(hosts) > 0 {
		partitioner = hosts[0].Partitioner()
Contributor

I think this needs to look at localHost instead, because I don't think the peers table contains partitioner information.
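
One way to read this suggestion, as a hedged sketch: take the partitioner from the local host even when the local host has been dropped from the merged list. The `host` type and `partitionerFor` are hypothetical, and the fallback to the first non-empty value is an added assumption rather than anything proposed in this thread.

```go
package main

import "fmt"

// host is a hypothetical, simplified stand-in for gocql's HostInfo.
type host struct {
	ID          string
	Partitioner string
}

// partitionerFor prefers the local host's partitioner. As an assumed
// fallback, it returns the first non-empty partitioner among hosts.
func partitionerFor(local *host, hosts []*host) string {
	if local != nil && local.Partitioner != "" {
		return local.Partitioner
	}
	for _, h := range hosts {
		if h.Partitioner != "" {
			return h.Partitioner
		}
	}
	return ""
}

func main() {
	local := &host{ID: "a", Partitioner: "Murmur3Partitioner"}
	// Peer rows modeled without partitioner information.
	hosts := []*host{{ID: "a"}, {ID: "b"}}

	fmt.Println(partitionerFor(local, hosts)) // Murmur3Partitioner
}
```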

@joao-r-reis
Contributor

> The problem with setupConn is that it'll end up overwriting the host information with incorrect info since the local table has wrong information in it. It sounds like other drivers just do a full ring refresh but I don't think we can do that in this case. Should we just be ignoring some information in addOrUpdate? @joao-r-reis thoughts?

Yeah, the fix is to make sure we do a full ring refresh instead of just updating the local host, so that we can overwrite the local host info with the info from system.peers. Doing a full refresh, even without any overwriting, would also mitigate the issue, because the current behavior of every host eventually turning into 127.0.0.1 after enough reconnects would no longer exist.
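
A toy model of the difference, not the driver's code: `ring`, `updateLocalOnly`, and `fullRefresh` are hypothetical names, and the model assumes (as described for Amazon Keyspaces above) that system.local reports 127.0.0.1 while system.peers also lists the connected host with its real address.

```go
package main

import "fmt"

// ring maps host ID to address; a hypothetical model of the host table.
type ring map[string]string

const localAddr = "127.0.0.1" // what system.local is assumed to report

// updateLocalOnly models refreshing only the connected host from system.local.
func (r ring) updateLocalOnly(connectedID string) {
	r[connectedID] = localAddr
}

// fullRefresh models reading system.local first and then system.peers,
// letting peer rows overwrite the local entry when host IDs match.
func (r ring) fullRefresh(connectedID string, peers map[string]string) {
	r[connectedID] = localAddr
	for id, addr := range peers {
		r[id] = addr
	}
}

func main() {
	peers := map[string]string{"a": "10.0.0.1", "b": "10.0.0.2"}
	r := ring{"a": "10.0.0.1", "b": "10.0.0.2"}

	// Reconnecting to each host in turn degrades every address.
	r.updateLocalOnly("a")
	r.updateLocalOnly("b")
	fmt.Println(r["a"], r["b"]) // 127.0.0.1 127.0.0.1

	// A full refresh restores the addresses from the peer rows.
	r.fullRefresh("a", peers)
	fmt.Println(r["a"], r["b"]) // 10.0.0.1 10.0.0.2
}
```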

@joao-r-reis
Contributor

The problem is that (d *refreshDebouncer) refreshNow() will attempt to get the control connection and, if it's not up, will try to reconnect it, so calling refreshNow() inside the control connection itself might not be ideal. This needs some more exploration, to be honest. The easy solution would be to call the full refresh method directly, bypassing the debouncer, but at that point we might as well remove the refreshNow() method from the debouncer and use the debouncer only for topology events.

@yomipq
Author

yomipq commented Jun 17, 2025

Thank you for the comments. I'm afraid I don't understand the details yet. I will look into your comments and the code.

Development

Successfully merging this pull request may close these issues.

CASSGO-72 Connection trouble with Amazon Keyspaces
4 participants