
Add MLJ compliant docstrings #130


Merged: 11 commits merged into IQVIA-ML:master on Mar 18, 2025

Conversation

@josephsdavid (Contributor) commented Oct 26, 2022

In service of #913, as documented here!

@yaxxie (Contributor) commented Oct 27, 2022

Hi @josephsdavid
Thanks for the contribution!

A couple of remarks:

  1. Please avoid simply reformatting code. This makes diffs harder to read and muddies the purpose of the contribution.
  2. Please fill in the description of the PR. For example, a link to the documentation about "MLJ docstrings" would be useful to the reader.
  3. Please also don't forget to add yourself to the contributors list, CONTRIBUTORS.md 🙂

@yaxxie changed the title from "Add MLJ compliant docstrings!" to "Add MLJ compliant docstrings" on Oct 27, 2022
    weights=true,
    descr="Microsoft LightGBM FFI wrapper: Classifier",
    weights=true
    # descr="Microsoft LightGBM FFI wrapper: Classifier",
Contributor

Specifically, how come you're commenting these ones out? And if there's a good reason for it, I'd expect them to be deleted rather than commented out.

Contributor Author

Oh whoops! I meant to delete them! There is a good reason: the existence of a docstring after the model metadata is created overwrites the descr field, I believe, making it no longer needed (paging @ablaom to confirm; there is a reason, but I may have mixed it up :) )

Contributor

The MLJ model trait docstring (alias descr) used to be for a short summary string, which was not that useful, in retrospect. Now it is not to be overloaded but instead falls back to the full docstring (the one @josephsdavid has worked on here).
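As a minimal illustrative sketch of that fallback (not taken from the PR, and assuming the standard MLJModelInterface trait API):

    using MLJModelInterface
    import LightGBM.MLJInterface: LGBMClassifier

    # With no `descr` override, the `docstring` trait should fall back to the
    # type's own docstring (the full one being added in this PR).
    MLJModelInterface.docstring(LGBMClassifier)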

Contributor

So, yes, these should be deleted.

Contributor

Removed and replaced with human_name="LightGBM classifier" (and the regressor equivalent, respectively). @ablaom, I saw this was suggested in previous comments; let me know if this is fine, or whether I should remove this entry completely.
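For illustration, a rough sketch of what that replacement could look like, assuming the usual MLJModelInterface.metadata_model keyword form (the scitypes shown are placeholders, not values taken from this PR):

    import MLJModelInterface as MMI

    # `descr` dropped; `human_name` now supplies the short display name.
    MMI.metadata_model(
        LGBMClassifier;
        input_scitype = MMI.Table(MMI.Continuous),      # placeholder
        target_scitype = AbstractVector{<:MMI.Finite},  # placeholder
        supports_weights = true,
        human_name = "LightGBM classifier",
    )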

@josephsdavid (Contributor, Author)

  2. Please fill in the description of the PR. For example, a link to the documentation about "MLJ docstrings" would be useful to the reader.

Hah, I was so excited to have all the parameters documented that I missed the other pieces of work 😓

@ablaom (Contributor) left a comment

@josephsdavid Thanks for this mammoth effort. 🦣

I don't see sections "Fitted parameters" or "Report", which are required.

Given the fact that all the models have a lot of hyper-parameters in common, I wonder if you would consider, for easier maintenance, interpolating a string constant for the common ones?
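One possible shape for that, as a sketch only (the constant name and parameter descriptions below are illustrative, not the PR's actual text):

    # Shared fragment for hyper-parameters common to all the models:
    const COMMON_HYPERPARAMS_DOC = """
    - `num_iterations = 100`: number of boosting iterations
    - `learning_rate = 0.1`: shrinkage applied at each boosting step
    """

    # Interpolated into each model's docstring, together with the required
    # "Fitted parameters" and "Report" sections:
    @doc """
        LGBMClassifier

    # Hyper-parameters

    $COMMON_HYPERPARAMS_DOC

    # Fitted parameters

    The fields of `fitted_params(mach)` are: ...

    # Report

    The fields of `report(mach)` are: ...
    """ LGBMClassifier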

I've looked over the first docstring for now. Please ping me when you've addressed my comments and I'll review the others too.

@josephsdavid (Contributor, Author)


Will do! Going to go over it more closely over the weekend :)

@ablaom (Contributor) left a comment

Thanks @josephsdavid for the progress! We're getting there.

Particular attention is still needed in the examples. If you could please check they run, that will save me some review time.

@kainkad (Contributor) commented Jan 14, 2025

Hi @josephsdavid, it's been quite a while since this PR was submitted and you've put significant effort into adding the MLJ compliant docstrings. I was wondering if it would be possible for you to update your branch with the latest LightGBM.jl master and push it again, so hopefully the PR can be re-reviewed and merged :) Also, do you happen to know the best way to test that these docs render correctly?

@ablaom (Contributor) commented Feb 11, 2025

I think we can safely say @josephsdavid has moved on. This work was initiated as part of a Google Season of Docs project that finished more than a year ago.

I can have a go at rebasing, and adding my own suggestions above, but it would be good to have some reassurance that a maintainer is available to make a final review in the next few weeks, before I invest the time, thanks!

@kainkad (Contributor) commented Feb 12, 2025


That's amazing. Thank you so much @ablaom for offering to help with rebasing and for your suggestions. I'll make sure we can conduct a final review within the next few weeks so this work on MLJ compliant docs can be merged in. Just to double check, these changes would apply to the MLJInterface only and won't be required in the rest of the code? This is to get an idea of future maintenance in case further changes are made; for example, there's a piece of work to bring in all parameters (which are currently not available on master/released LightGBM.jl), so we can then document them in line with what's been done in this PR.

@kainkad (Contributor) commented Mar 4, 2025


Hi @ablaom. I have rebased with the latest LightGBM master and made updates in this initial PR to address the previous comments, so just wanted to let you know the rebase won't be required on your end. I updated all the examples, as they were not working before, and tested them locally, so they're fine now. I noticed the parameters don't have constraints in the docs, e.g. num_leaves has a default = 31 but the constraint is 1 < _ <= 131072. Would MLJ users benefit from including these constraints as well? If so, I'll go through the remainder of the params and update the docs where relevant. It would be great if you could then review the final version and see if there's anything from the MLJ point of view that should be included. Also, if you know of any good practice to follow for rendering/testing these docs to make sure they are displayed correctly when used with MLJ, that would be helpful. All CI tests passed except for the Documentation builder, so I'm looking into this as well.
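If constraints were added, a parameter entry could look something like this (format is only a suggestion, reusing the example above):

    - `num_leaves = 31`: maximum number of leaves in one tree (constraint: 1 < num_leaves <= 131072)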

@ablaom (Contributor) commented Mar 4, 2025

@kainkad Thanks for moving this along, which will make my job easier. And thanks for your patience.

I don't have a good suggestion for testing the docstring rendering, but I'll do my best in my review. The current state can always be checked by going here (for the regressor). This entry will only be updated after you have tagged a new release and the MLJ model registry has been updated. Ideally, post an issue at MLJModels.jl requesting the update, although this happens from time to time anyway, and the latest versions of all models are revisited in that update.

You can wrap example code to make it execute as part of doc generation, but we generally haven't done this, for the following reason: the MLJ idiomatic way of loading model code uses @load, but that function won't work until after the model is registered. In this case it wouldn't be a problem, because the models are already registered. However, unless you have some experience doing this sort of thing with Documenter.jl, you may want to leave the examples as static text.
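For reference, the idiomatic loading pattern looks roughly like this (a sketch; the hyper-parameter value is just an example):

    using MLJ

    # `@load` resolves the model through the MLJ registry, so no direct
    # `using LightGBM` or new exports are needed.
    LGBMRegressor = @load LGBMRegressor pkg=LightGBM verbosity=0
    model = LGBMRegressor(num_leaves = 31)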

Let me know if you have further questions, and when you are ready for a final review. You can find the full official docstring spec here.

@kainkad (Contributor) commented Mar 10, 2025


Thank you @ablaom for all the information and the links. I have updated the docs accordingly, so it's ready for the final review. Just a comment on the parameters: the currently released LightGBM.jl has about 60% of all available parameters, so the documentation in this PR includes only those available (both in the core lgbm wrapper and via the MLJInterface) and not all parameters available in the underlying C code. I have been working on implementing the remainder, which is currently on a separate branch. The only question I have, once that work is ready to be released and more params are available, is whether the current approach is to include all available parameters in the MLJ documentation, or whether, for example, an external reference to the lgbm parameters could be made instead? The reason I'm asking is that there are over 130, so the parameters section can be quite lengthy, but as long as that's not an issue I don't mind updating the remainder when it's fully implemented.

@ablaom (Contributor) commented Mar 10, 2025

No, you raise a good point. It's not good to have all this parameter documentation duplicated. I think it is fine to have an external link for parameters that correspond to parameters in the core implementation. We have allowed this for our XGBoost wrapper as well. Perhaps just list the parameters that are provided, especially if this differs from the full lightgbm set.

@kainkad (Contributor) commented Mar 12, 2025


I've updated the link to the parameters like it's done for xgboost, and given that the current version doesn't support all params and some of the defaults are different, I just listed the available params and their defaults instead of full descriptions and their interactions, which can be checked by following the link to the official docs. I also moved the docs to a separate file, which I think keeps the MLJInterface neat. When the release for all parameters is ready, a link alone should be fine, because that work also includes aligning the defaults with the official docs so there are no discrepancies.

@ablaom (Contributor) commented Mar 12, 2025

Thanks @kainkad, I'll try to review by the end of next week.

@ablaom (Contributor) commented Mar 16, 2025

Somehow, when I try to inspect the docstring, I'm not getting the new docstring. @kainkad Any idea what's going on here?

help?> LGBMRegression
search: LGBMRegression make_regression

  LGBMRegression(; [
      objective = "regression",
      boosting = "gbdt",
      num_iterations = 100,
      learning_rate = .1,
      num_leaves = 31,
      max_depth = -1,
      tree_learner = "serial",
      num_threads = 0,
      histogram_pool_size = -1.,
      min_data_in_leaf = 20,
      min_sum_hessian_in_leaf = 1e-3,
      max_delta_step = 0.,
      lambda_l1 = 0.,
      lambda_l2 = 0.,
      min_gain_to_split = 0.,
      feature_fraction = 1.,
      feature_fraction_bynode = 1.,
      feature_fraction_seed = 2,
      bagging_fraction = 1.,
      bagging_freq = 0,
      bagging_seed = 3,
      early_stopping_round = 0,
      extra_trees = false
      extra_seed = 6,
      max_bin = 255,
      bin_construct_sample_cnt = 200000,
      data_random_seed = 1,
      is_enable_sparse = true,
      save_binary = false,
      categorical_feature = Int[],
      use_missing = true,
      linear_tree = false,
      feature_pre_filter = true,
      is_unbalance = false,
      boost_from_average = true,
      alpha = 0.9,
      drop_rate = 0.1,
      max_drop = 50,
      skip_drop = 0.5,
      xgboost_dart_mode = false,
      uniform_drop = false,
      drop_seed = 4,
      top_rate = 0.2,
      other_rate = 0.1,
      min_data_per_group = 100,
      max_cat_threshold = 32,
      cat_l2 = 10.0,
      cat_smooth = 10.0,
      metric = [""],
      metric_freq = 1,
      is_provide_training_metric = false,
      eval_at = Int[1, 2, 3, 4, 5],
      num_machines = 1,
      local_listen_port = 12400,
      time_out = 120,
      machine_list_filename = "",
      device_type="cpu",
      gpu_use_dp = false,
      gpu_platform_id = -1,
      gpu_device_id = -1,
      num_gpu = 1,
      force_col_wise = false
      force_row_wise = false
  ])

  Return a LGBMRegression estimator.

@kainkad (Contributor) commented Mar 17, 2025


Thank you for sending this and for checking. So LGBMRegression is the name of the model in core LightGBM.jl, and the docs for it are pulled from the struct docstrings in estimators.jl. In the MLJInterface the models are named slightly differently: LGBMRegressor and LGBMClassifier. For the help info in Julia they need to be called as ?LightGBM.MLJInterface.LGBMRegressor, as shown below:

[screenshot: REPL output of help?> LightGBM.MLJInterface.LGBMRegressor showing the new MLJ docstring]

It's not very user-friendly. Exporting LGBMRegressor and LGBMClassifier could make it easier, as long as it wouldn't introduce any unnecessary breaking change on the MLJ side or in users' namespaces. Calling the full path shows where they come from, but I can imagine the user wouldn't necessarily know the provenance of the model types.

@ablaom (Contributor) commented Mar 17, 2025

Ah, I get it, thank you!

I don't think it's necessary to make any new exports, unless you want to for some other reason. In idiomatic MLJ, you use @load to load model code, and there is no need for these names to be public for that purpose.
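So, as a sketch, a user could reach the new docstring either way (assuming the models are registered, as noted above):

    using MLJ

    LGBMClassifier = @load LGBMClassifier pkg=LightGBM verbosity=0
    @doc LGBMClassifier   # displays the new MLJ-compliant docstring

    # Alternatively, after `import LightGBM`, the full path works in the REPL:
    #   help?> LightGBM.MLJInterface.LGBMClassifier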

@ablaom (Contributor) left a comment

Looks good to me. Just the one suggestion to improve readability.

@ablaom (Contributor) commented Mar 17, 2025

When you have tagged a new release, open an issue here (https://github.com/JuliaAI/MLJModels.jl/issues) to update the MLJ model registry.

@kainkad merged commit 1443672 into IQVIA-ML:master on Mar 18, 2025
25 checks passed
@kainkad (Contributor) commented Mar 18, 2025


Thank you for reviewing this PR. These changes have been released and tagged. I created an issue: JuliaAI/MLJModels.jl#586
