prov/verbs: Missing FI_RECV flag in fi_cq_err_entry for RECV operations. #10847

Open
dsciebu opened this issue Mar 4, 2025 · 9 comments
@dsciebu
Contributor

dsciebu commented Mar 4, 2025

Describe the bug
I am experiencing an issue with fi_cq_readfrom in my libfabric-based program. When running fi_cq_readfrom on the completion queue bound to RECV operations, the check for the FI_RECV bit in the completion queue entry (entry.comp.flags & FI_RECV) always fails. This issue started occurring after upgrading from libfabric v1.22.0 to v2.0.0.

To Reproduce
Steps to reproduce the behavior:

1. Bind a completion queue to RECV operations.
2. Run fi_cq_readfrom on the completion queue.
3. Check the completion queue entry flag masked with FI_RECV (i.e., entry.comp.flags & FI_RECV).
4. Observe that the masking operation fails in libfabric v2.0.0.

Expected behavior
The masking operation with FI_RECV should succeed, as it did in libfabric v1.22.0, allowing the correct identification of RECV completions.
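
For readers following along, the check described in the steps above can be sketched roughly as follows. This is only an illustration, not the reporter's code: `is_recv_completion` is a hypothetical helper name, and the fallback flag values are an assumption intended to mirror `<rdma/fabric.h>` so that the snippet compiles without libfabric installed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fallback definitions so the sketch compiles without libfabric.
 * ASSUMPTION: these values mirror <rdma/fabric.h>. In a real program,
 * include that header before this code and the fallbacks are skipped. */
#ifndef FI_RECV
#define FI_MSG    (1ULL << 1)
#define FI_TAGGED (1ULL << 3)
#define FI_RECV   (1ULL << 10)
#define FI_SEND   (1ULL << 11)
#endif

/* Hypothetical helper: true if the completion flags mark a receive.
 * Tests only the FI_RECV bit and ignores all other bits. */
static bool is_recv_completion(uint64_t flags)
{
    return (flags & FI_RECV) != 0;
}
```

In a real program the argument would presumably be the `flags` field of the completion entry filled in by fi_cq_readfrom (the `entry.comp.flags` value in the report). With the behavior described for v2.0.0, this helper would return false for receive completions because the bit is never set.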

@dsciebu dsciebu added the bug label Mar 4, 2025
@j-xiong
Contributor

j-xiong commented Mar 6, 2025

@dsciebu Is your program using FI_EP_RDM or FI_EP_MSG?

@aingerson
Contributor

aingerson commented Mar 6, 2025

There was a bug with rxm in 2.0 regarding improper initialization of the recv entry flags so if you're using 2.0 and verbs with RDM, then my guess is that it's the same issue. Can you retest with upstream?
This is the commit that fixes the issue if you want to cherry-pick instead

@j-xiong
Contributor

j-xiong commented Mar 6, 2025

@aingerson
Contributor

Oh weird, not sure how that commit id linked to the issue instead... I'll fix it. Thanks!

@dsciebu
Contributor Author

dsciebu commented Mar 7, 2025

Thank you for your answers - @aingerson it indeed seems to be the reason for my finding!
@j-xiong I do use RDM endpoint.
Are you planning to publish updates to the v2.0.0 release, or is the policy to leave it as is and put every bug fix in v2.1.0?

@dsciebu
Contributor Author

dsciebu commented Mar 7, 2025

BTW, there is another interesting observation: the patch you pointed to, besides fixing the missing FI_RECV flag, should also fix the missing FI_TAGGED flag. However, from my observation, FI_TAGGED does show up in my CQ entries, but not consistently. At this stage I cannot describe precisely when this happens, but I noticed that sometimes the flag is absent from a CQ entry, while in other entries it is present. Any hints?
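
One way to narrow down a sporadic case like this might be to log every completion whose flags lack the bits the application expects. The sketch below is only a debugging aid under stated assumptions: `missing_flags` and `report_if_incomplete` are hypothetical helper names, the sequence number is whatever counter the caller keeps, and the fallback flag values are assumed to mirror `<rdma/fabric.h>`.

```c
#include <stdint.h>
#include <stdio.h>

/* Fallback definitions so the sketch compiles without libfabric.
 * ASSUMPTION: these values mirror <rdma/fabric.h>. In a real program,
 * include that header before this code and the fallbacks are skipped. */
#ifndef FI_RECV
#define FI_MSG    (1ULL << 1)
#define FI_TAGGED (1ULL << 3)
#define FI_RECV   (1ULL << 10)
#endif

/* Returns the bits of `expected` that are absent from `flags` (0 if none). */
static uint64_t missing_flags(uint64_t flags, uint64_t expected)
{
    return expected & ~flags;
}

/* Hypothetical helper: logs an entry whose flags lack expected bits.
 * Returns 1 if something was logged, 0 if the entry was complete. */
static int report_if_incomplete(uint64_t flags, uint64_t expected,
                                unsigned long seq)
{
    uint64_t missing = missing_flags(flags, expected);

    if (missing == 0)
        return 0;
    fprintf(stderr, "cq entry %lu: flags=0x%llx missing=0x%llx\n",
            seq, (unsigned long long)flags, (unsigned long long)missing);
    return 1;
}
```

Calling `report_if_incomplete(flags, FI_RECV | FI_TAGGED, n)` for each tagged receive completion would show which entries drop FI_TAGGED, which could help when putting together a smaller reproducer.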

@j-xiong
Contributor

j-xiong commented Mar 7, 2025

We should backport that fix to the v2.0.x branch but there is no immediate plan for v2.0.1 release. v2.1.0 is the next official release that contains all the bug fixes.

@aingerson
Contributor

Hmm, I have no idea what could be happening there but I can take a look. Could you share a reproducer if you have it?

@dsciebu
Contributor Author

dsciebu commented Mar 8, 2025

> Hmm, I have no idea what could be happening there but I can take a look. Could you share a reproducer if you have it?

I cannot share the piece of code that reveals the problem, and the problem is sporadic. Maybe I can figure out something simpler on Monday...
