Skip to content

Failed parsing of V1beta1ParameterSpec in python 3.11.9 & kubeflow-katib: 0.18.0 #2542

Open
@david-thrower

Description

@david-thrower

What happened?

Issue described:

Environment details:

  • Python 3.11.9 (Canonical managed Kubeflow Jupyter environment)
  • Kubeflow SDK versions
    • kubeflow: 0.0.1rc0
    • kubeflow-katib: 0.18.0
    • kubeflow-training: 1.9.0

TLDR:

  • I tried to include the param "activation": katib.search.categorical(["relu","gelu", "elu"]) in a parameters argument for KatibClient.tune().
  • this returned the error: HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"admission webhook \"validator.experiment.katib.kubeflow.org\" denied the request: feasibleSpace must be specified in spec.parameters[1]: {activation categorical { [] }}","code":400}

For reference, the relevant block of code:

def objective(parameters):
    # This is rather long, hence omitted.
    print f"val_binary_accuracy={result}"

# Import the SDK

import kubeflow.katib as katib

from kubeflow.katib import V1beta1FeasibleSpace
from kubeflow.katib import V1beta1ParameterSpec


# Workaround for failed parameter spec parsing ... katib.search.categorical(["relu","gelu", "elu"]),

foo = V1beta1ParameterSpec(
        parameter_type="categorical",
        feasible_space=V1beta1FeasibleSpace(distribution = ["relu","gelu", "elu"], list = ["relu","gelu", "elu"]),
    )
print(type(foo))
print(foo)

parameters = {
    "embedding_n": katib.search.int(min=10, max=18, step=1),

    # Workaround that worked
   "activation": foo,
    
    # Failing code that should have worked
    # "activation": katib.search.categorical(["relu","gelu", "elu"]),
    
    "predecessor_level_connection_affinity_factor_first": katib.search.double(min=0.1, max=50, step=0.1),
    "predecessor_level_connection_affinity_factor_main": katib.search.double(min=0.1, max=50, step=0.1),
    "max_consecutive_lateral_connections": katib.search.int(min=1, max=50, step=1),
    "p_lateral_connection":  katib.search.double(min=0.1, max=50, step=0.1),
    "num_lateral_connection_tries_per_unit": katib.search.int(min=1, max=50, step=1),
    "learning_rate": katib.search.double(min=10 ** -5, max=0.3, step=10 ** -5),
    "epochs": katib.search.int(min=1, max=25, step=1),
    "batch_size": katib.search.int(min=1, max=35, step=1),
    "dropout":  katib.search.double(min=0.05, max=0.95, step=0.05),
    "maximum_units_per_level": katib.search.int(min=5, max=10, step=1),
    "maximum_neurons_per_unit": katib.search.int(min=1, max=9, step=1),
    "temperature": katib.search.int(min=10**4, max=10**6, step=1000),
}

katib_client = katib.KatibClient(namespace=NAMESPACE)

algorithm_config = {
    "algorithm_name": "multivariate-tpe",
    "algorithm_settings": [
        {"name": "n_startup_trials", "value": "7"},
        {"name": "n_ei_candidates", "value": "24"},
        {"name": "random_state", "value": "42"}
    ]
}


katib_client.tune(
    name=JOB_NAME,
    objective=objective,
    parameters=parameters,
    objective_metric_name="val_binary_accuracy",
    algorithm_name = "multivariate-tpe",
    algorithm_settings= algorithm_config,
    max_trial_count=25,
    parallel_trial_count=2,
    max_failed_trial_count=15,
    resources_per_trial={"cpu": "7", "memory": "21Gi"}
)

I did a little troubleshooting:

  • I did some investigating by browsing the SDK code: https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/kubeflow/katib/api/search.py
  • I found that the function search.categorical() has a parameter named list.
  • I suspect that this could be the first potential problem, as this is a naming collision with the default Python keyword list in the global scope.
  • Digging deeper, however, I found that there is a problem in line 65 on https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/kubeflow/katib/api/search.py. search.categorical() passes the argument for the parameter list to V1beta1FeasibleSpace without specifying which arg it pertains to.
  • When I manually reproduce what search.categorical() is doing internally as seen below and print this out, you see the issue becomes obvious ...
  • The property distribution is successfully set to the list which search.categorical() was configured with, but the property list is not being getting set, as you see below V1beta1ParameterSpec.feasible_space.list is set to None.

I do this:

foo = V1beta1ParameterSpec(
        parameter_type="categorical",
        feasible_space=V1beta1FeasibleSpace(["relu","gelu", "elu"]),
    )
print(type(foo))
print(foo)

This prints out:

<class 'kubeflow.katib.models.v1beta1_parameter_spec.V1beta1ParameterSpec'>
{'feasible_space': {'distribution': ['relu', 'gelu', 'elu'],
                    'list': None,  #   <------------------------------------------------------------------------------------<<<<   # I think this is the problem
                    'max': None,
                    'min': None,
                    'step': None},
 'name': None,
 'parameter_type': 'categorical'}

If I repeat this, but I explicitly reference the parameters*: distribution and list when instantiating the V1beta1FeasibleSpace which is used to instantiate V1beta1ParameterSpec, the properties ..distribution and ..list are both properly set:

my_list = ["relu","gelu", "elu"]
foo = V1beta1ParameterSpec(
        parameter_type="categorical",
        feasible_space=V1beta1FeasibleSpace(distribution = my_list, 
                                                                        list = my_list),
    )

print(type(foo))
print(foo)

Now this looks correct:

<class 'kubeflow.katib.models.v1beta1_parameter_spec.V1beta1ParameterSpec'>
{'feasible_space': {'distribution': ['relu', 'gelu', 'elu'],
                    'list': ['relu', 'gelu', 'elu'],
                    'max': None,
                    'min': None,
                    'step': None},
 'name': None,
 'parameter_type': 'categorical'}

When I replaced the call search.categorical(['relu', 'gelu', 'elu']]) with the correctly, manually created V1beta1ParameterSpec ('activations': foo), in the parameters and submit the job using CatibClient.tune(), the API accepted and created the job without the error.

Suggested revisions:

  1. In line 65 on https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/kubeflow/katib/api/search.py: Reference the parameters distribution and list explicitly OR in V1beta1FeasibleSpace and V1beta1ParameterSpec, make sure that init is setting both self.distribution and self.list.
  2. Rename the parameter 'list', as this is a naming collision with a Python keyword referencing the function list(), which may be causing an empty list to be instantiated and assigned when V1beta1FeasibleSpace.list() sets self._list = list, which the empty list may be getting cast to a null type in searialization, explaining the None value.

What did you expect to happen?

  1. I would have expected the parameter "activation": katib.search.categorical(["relu","gelu", "elu"]) to have properly instantiated a V1beta1ParameterSpec object with both list and distribution set to ["relu","gelu", "elu"].
  2. I would have expected the submission as such to have been accepted and Katib to have run the training task.

Environment

Kubernetes version:

WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.14", GitCommit:"f678fbc2f5af87ec6552b7ac6ebbe26258b0d120", GitTreeState:"clean", BuildDate:"2024-05-14T10:53:32Z", GoVersion:"go1.21.9", Compiler:"gc", Platform:"linux/arm64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"29", GitVersion:"v1.29.10", GitCommit:"f0c1ea863533246b6d3fda3e6addb7c13c8a6359", GitTreeState:"clean", BuildDate:"2024-10-23T15:29:12Z", GoVersion:"go1.22.8", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.27) and server (1.29) exceeds the supported minor version skew of +/-1

Katib controller version:


# Managed instance. I don't have access to this. I am on the Azure deployment of Charmed Kubeflow 1.9



Katib Python SDK version:

Version: 0.18.0

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions