Skip to content

NIFI-14426: Add support for HSSF format in ExcelReader processor #9874

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 10, 2025

Conversation

zhtk
Copy link
Contributor

@zhtk zhtk commented Apr 11, 2025

Summary

NIFI-14426: Add support for HSSF format in ExcelReader processor

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, as such NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 21

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

@exceptionfactory
Copy link
Contributor

Thanks @zhtk, can you rebase this pull request from the updated main branch? That will address an unrelated unit test failure.

@zhtk
Copy link
Contributor Author

zhtk commented Apr 11, 2025

Yes, rebased @exceptionfactory

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @zhtk. The general approach looks straightforward. Although there are potential memory implications for the older XLS format, this seems to be the best implementation strategy for now.

I noted one particular issue related to password handling for XLS files, and the lack of thread safety. As mentioned, I think the best option for now is to not support encrypted XLS, and let the read operation fail.

@zhtk
Copy link
Contributor Author

zhtk commented May 7, 2025

After reading whole discussion I decided to remove type detection and to add Input File Type property. There are 2 reasons for that:

  1. In case of passworded Excel file, to determine file type the whole Flow File stream must be read. At the moment the stream is resettable only up to 1024 * 1024 mark (i.e. in createRecordReader there is a call in.mark(1024 * 1024);). Reseting stream on files larger than 1 MB could be problematic. Working around this problem would require contributions to both POI and StreamingReader libraries.
  2. Performance characteristics of reading XLS files is different than XLSX.

@dan-s1
Copy link
Contributor

dan-s1 commented May 7, 2025

@zhtk From message below it seems you must rebase in order to resolve conflicts.

Copy link
Contributor

@dan-s1 dan-s1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhtk great work so far. I had a few more suggestions.

@dan-s1
Copy link
Contributor

dan-s1 commented May 8, 2025

@zhtk Can you please fix the conflicts you have locally and then push?

@zhtk
Copy link
Contributor Author

zhtk commented May 8, 2025

TBH I don't see conflicts, github interface says: "No conflicts with base branch. Changes can be cleanly merged.", and local merge also was performed cleanly. To make sure everything is ok, I have rebased and squashed all commits on the branch, hope now it's okay

@dan-s1
Copy link
Contributor

dan-s1 commented May 8, 2025

TBH I don't see conflicts, github interface says: "No conflicts with base branch. Changes can be cleanly merged.", and local merge also was performed cleanly. To make sure everything is ok, I have rebased and squashed all commits on the branch, hope now it's okay

Things look okay now with the force push. Thanks!

Copy link
Contributor

@dan-s1 dan-s1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed one more thing (hopefully the last :) )

…ain/java/org/apache/nifi/excel/InputFileType.java

Co-authored-by: dan-s1 <[email protected]>
@dan-s1
Copy link
Contributor

dan-s1 commented May 9, 2025

@exceptionfactory I approve these changes. Is there anything else you see which needs changing?

@exceptionfactory
Copy link
Contributor

@exceptionfactory I approve these changes. Is there anything else you see which needs changing?

Thanks, I will follow up soon.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing this to completion @zhtk, and thanks for the reviews along the way @dan-s1, the latest version looks good.

I was going to note that the displayName() element on the property is not necessary, but I see that follows the other properties for now, so removing the displayName() across the board can be handled separately.

+1 merging

@exceptionfactory exceptionfactory merged commit 27522cf into apache:main May 10, 2025
6 checks passed
@zhtk
Copy link
Contributor Author

zhtk commented May 12, 2025

I also thank you for reviews and for patience in getting these changes merged 🙇

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants