diff --git a/23.2/tutorial/Model Retraining.ipynb b/23.2/tutorial/Model Retraining.ipynb index 6c56ba0..b0808a2 100644 --- a/23.2/tutorial/Model Retraining.ipynb +++ b/23.2/tutorial/Model Retraining.ipynb @@ -1,615 +1,615 @@ { - "cells": [ - { - "attachments": {}, - "cell_type": "markdown", - "id": "267a5abb-312f-436f-99ee-43b9bba7a743", - "metadata": { - "id": "267a5abb-312f-436f-99ee-43b9bba7a743" - }, - "source": [ - "# Before you start with this Model Retraining Notebook\n", - "\n", - "This notebook is the primary notebook used as part of the Vectice Tutorial. It illustrates how to use Vectice with a realistic Business scenario. As part of this Notebook we will retrain a model by droping a column from our dataset. This Notebook maps to the \"Model Retraining\" phase of the **\"Tutorial: Forecast in store-unit sales\"** project you can find in your personal Vectice workspace.\n", - "\n", - "### Pre-requisites:\n", - "Before using this notebook you will need:\n", - "* An account in Vectice\n", - "* An API token to connect to Vectice through the APIs\n", - "* Copy the your Model Retraining Phase Id\n", - "\n", - "Refer to Vectice Tutorial Guide for more detailed instructions: https://docs.vectice.com/getting-started/tutorial\n", - "\n", - "\n", - "### Other Resources\n", - "* Vectice Documentation: https://docs.vectice.com/
\n", - "* Vectice API documentation: https://api-docs.vectice.com/" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "cc1036be", - "metadata": { - "id": "cc1036be" - }, - "source": [ - "## We assume in this exercise that we want to retrain a Ridge model because the variable 'postal code' was left accidentally inside our initial modeling dataset" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8b0ced23-cc10-4fd0-b1d3-b032efb3fcfc", - "metadata": { - "id": "8b0ced23-cc10-4fd0-b1d3-b032efb3fcfc" - }, - "outputs": [], - "source": [ - "import warnings\n", - "warnings.filterwarnings(\"ignore\", category=DeprecationWarning)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "9c94ff6a-2eb2-4a27-8ee4-2b9712b47b17", - "metadata": { - "id": "9c94ff6a-2eb2-4a27-8ee4-2b9712b47b17" - }, - "source": [ - "## Install the latest Vectice Python client library" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8fd7daee", - "metadata": { - "id": "8fd7daee" - }, - "outputs": [], - "source": [ - "%pip install -q vectice -U" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9925c2b4-1873-42ba-a487-6ba08c71df12", - "metadata": { - "id": "9925c2b4-1873-42ba-a487-6ba08c71df12" - }, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import numpy as np\n", - "import matplotlib.pyplot as plt\n", - "from sklearn.model_selection import train_test_split\n", - "from sklearn.metrics import mean_absolute_error\n", - "from sklearn.linear_model import Ridge\n", - "from sklearn.pipeline import make_pipeline\n", - "from sklearn.compose import ColumnTransformer\n", - "from sklearn.preprocessing import OneHotEncoder, StandardScaler" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "e4b94319-e1d0-4a0b-baf6-0eb59ed5e06e", - "metadata": { - "id": "e4b94319-e1d0-4a0b-baf6-0eb59ed5e06e" - }, - "source": [ - "## Get started by connecting to Vectice\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "ae4e31a6", - "metadata": { - "id": "ae4e31a6" - }, - "source": [ - "
\n", - "Automated code lineage: The code lineage functionalities are not covered as part of this Tutorial as they require to first setting up a Git repository.\n", - "
" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "9e94fb33", - "metadata": { - "id": "9e94fb33" - }, - "source": [ - "**First, we need to authenticate to the Vectice server. Before proceeding further:**\n", - "\n", - "- Visit the Vectice app to create and copy an API token (cf. https://docs.vectice.com/getting-started/create-an-api-token)\n", - "\n", - "- Paste the API token in the code below" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f15b5e87-ceea-41cf-90f2-11fa6c7b4633", - "metadata": { - "id": "f15b5e87-ceea-41cf-90f2-11fa6c7b4633" - }, - "outputs": [], - "source": [ - "import vectice\n", - "\n", - "vec = vectice.connect(api_token=\"my-api-token\") #Paste your API token" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "14681ed0-a7c5-4175-b95c-87186fb28bc2", - "metadata": { - "id": "14681ed0-a7c5-4175-b95c-87186fb28bc2" - }, - "source": [ - "## Specify which project phase you want to document\n", - "\n", - "- In Vectice UI, navigate to your personal workspace, navigate to your Tutorial: Forecast in store-unit sales project.\n", - "\n", - "- Go to the Model Retraining Phase and copy and paste your Phase Id below. (cf. https://docs.vectice.com/getting-started/tutorial#duplicate-phase)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "552f544a-8b61-45dc-b2c5-d7bd1972e659", - "metadata": { - "id": "552f544a-8b61-45dc-b2c5-d7bd1972e659" - }, - "outputs": [], - "source": [ - "phase = vec.phase(\"PHA-xxxx\") # Paste your own Model Retraining phase ID" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "36522f13-0925-4f14-80a9-072890887f68", - "metadata": { - "id": "36522f13-0925-4f14-80a9-072890887f68" - }, - "source": [ - "### Next we are going to create an iteration\n", - "An iteration allows you to organize your work in repeatable sequences of steps. You can have multiple iteration within a phase." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "16d37d91-8f63-4fbc-b6eb-8aaff588e84c", - "metadata": { - "id": "16d37d91-8f63-4fbc-b6eb-8aaff588e84c" - }, - "outputs": [], - "source": [ - "model_iteration = phase.create_iteration()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "ee0aba94-baf2-49bb-a9bc-9b2c16852dc2", - "metadata": { - "id": "ee0aba94-baf2-49bb-a9bc-9b2c16852dc2" - }, - "source": [ - "## Re-create the modeling Dataset without the postal code\n", - "\n", - "Load the data from GitHub. This DataFrame has already been cleaned as part of the Data Preparation Phase." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "02c4e722-a008-4989-84bb-500cc5f0534c", - "metadata": { - "id": "02c4e722-a008-4989-84bb-500cc5f0534c" - }, - "outputs": [], - "source": [ - "df = pd.read_csv(\"https://raw.githubusercontent.com/vectice/GettingStarted/main/23.2/tutorial/ProductSales%20Cleaned.csv\", converters = {'Postal Code': str})\n", - "df.head()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "eb3c8bf4", - "metadata": { - "id": "eb3c8bf4" - }, - "source": [ - "### Remove Postal code" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "99534610-4ed8-4672-8ab3-d8b2df33664b", - "metadata": { - "id": "99534610-4ed8-4672-8ab3-d8b2df33664b" - }, - "outputs": [], - "source": [ - "X = df.drop([\"Sales\", \"Postal Code\"],axis=1)\n", - "y = df[\"Sales\"]\n", - "print(X.shape)\n", - "print(y.shape)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3423b731-454e-4909-850a-d503ca9f2bf5", - "metadata": { - "id": "3423b731-454e-4909-850a-d503ca9f2bf5" - }, - "outputs": [], - "source": [ - "X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.2, random_state=42)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0d074b66-37a5-4882-b600-b630a0288b71", - "metadata": { - "id": "0d074b66-37a5-4882-b600-b630a0288b71" - }, - "outputs": [], - "source": [ - "# Save the modeling train test split datasets as csv files\n", - "train_df = X_train.copy()\n", - "test_df = X_test.copy()\n", - "\n", - "train_df[\"Sales\"] = y_train\n", - "test_df[\"Sales\"] = y_test\n", - "\n", - "train_df.to_csv(\"train dataset.csv\", index=False)\n", - "test_df.to_csv(\"test dataset.csv\", index=False)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "bb81eaf7-f880-4c77-923a-f5deb7832cf3", - "metadata": { - "id": "bb81eaf7-f880-4c77-923a-f5deb7832cf3" - }, - "source": [ - "### Log the recreated modeling Dataset in Vectice\n", - "The Vectice resource will automatically extract pertinent metadata from the local dataset file and collect statistics (optional) from the pandas dataframe. This information will be documented within the iteration as part of a Dataset version." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "016c1281-3880-498e-8a8c-ab45113f2688", - "metadata": { - "id": "016c1281-3880-498e-8a8c-ab45113f2688" - }, - "outputs": [], - "source": [ - "train_ds = vectice.FileResource(paths=\"train dataset.csv\", dataframes=train_df)\n", - "test_ds = vectice.FileResource(paths=\"test dataset.csv\", dataframes=test_df)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0416ab44-16c4-4905-a582-306cb87b6566", - "metadata": { - "id": "0416ab44-16c4-4905-a582-306cb87b6566" - }, - "outputs": [], - "source": [ - "#Declare the modeling Dataset\n", - "modeling_dataset = vectice.Dataset.modeling(\n", - " name=\"ProductSales Modeling\",\n", - " training_resource=train_ds,\n", - " testing_resource=test_ds, \n", - " )" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f7d80585-cbfe-40a0-b360-ef7798d1cfed", - "metadata": { - "id": "f7d80585-cbfe-40a0-b360-ef7798d1cfed" - }, - "outputs": [], - "source": [ - "# Log the new modeling Dataset to the model input data step to document it inside the iteration\n", - "model_iteration.step_model_input_data = modeling_dataset" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "BD3YdscrnPHb", - "metadata": { - "id": "BD3YdscrnPHb" - }, - "source": [ - "## Log a comment to indicate you removed the \"postal code\" column\n", - "\n", - "Passing a `string` to a step will log a comment. We will use += operator to document more than one item in the same step." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "JqqLmQK-ll9q", - "metadata": { - "id": "JqqLmQK-ll9q" - }, - "outputs": [], - "source": [ - "model_iteration.step_model_input_data += \"The postal code column was removed from the modeling dataset\"" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "377d7923", - "metadata": { - "id": "377d7923" - }, - "source": [ - "## Retrain a Ridge regressor model" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ff62ae55-87c8-41ea-a02b-0ec109ab7438", - "metadata": { - "id": "ff62ae55-87c8-41ea-a02b-0ec109ab7438" - }, - "outputs": [], - "source": [ - "OHE = OneHotEncoder(handle_unknown='infrequent_if_exist')\n", - "scaler = StandardScaler()\n", - "\n", - "cat_cols = ['Ship Mode', 'Segment', 'Country', 'City', 'State','Region', 'Category', 'Sub-Category']\n", - "num_cols = ['Quantity', 'Discount', 'Profit']\n", - "\n", - "transformer = ColumnTransformer([('cat_cols', OHE, cat_cols),\n", - " ('num_cols', scaler, num_cols)])\n", - "\n", - "model = make_pipeline(transformer,Ridge())\n", - "model.fit(X,y)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cfaa69c9-83d5-4fa8-82c6-37628ad258b6", - "metadata": { - "id": "cfaa69c9-83d5-4fa8-82c6-37628ad258b6" - }, - "outputs": [], - "source": [ - "# Making Prediction with the training data\n", - "y_train_pred = model.predict(X_train)\n", - "#Evaluating the model \n", - "mae_train=mean_absolute_error(y_train, y_train_pred)\n", - "print(round(mae_train,2))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8e16f056-b366-4f5f-b1a3-a031b8316074", - "metadata": { - "id": "8e16f056-b366-4f5f-b1a3-a031b8316074" - }, - "outputs": [], - "source": [ - "# Making Prediction with the testing data\n", - "y_test_pred = model.predict(X_test)\n", - "#Evaluating the model \n", - "mae_test = mean_absolute_error(y_test, y_test_pred)\n", - "print(round(mae_test,2))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "de507cdf-0576-4ef0-ba5e-10fd1f69c176", - "metadata": { - "id": "de507cdf-0576-4ef0-ba5e-10fd1f69c176" - }, - "outputs": [], - "source": [ - "#Generate feature importance\n", - "feature_names = transformer.get_feature_names_out()\n", - "feature_importances = model.named_steps['ridge'].coef_\n", - "\n", - "feat_imf = pd.Series(feature_importances, index=feature_names).sort_values()\n", - "\n", - "feat_imf.tail(10).plot(kind=\"barh\")\n", - "plt.ylabel(\"Features\")\n", - "plt.xlabel(\"Importance\")\n", - "plt.title(\"Feature Importance\")\n", - "plt.tight_layout()\n", - "plt.savefig(\"Feature Importance.png\")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "f53c5948-d301-4bfe-82a4-83758c155f99", - "metadata": { - "id": "f53c5948-d301-4bfe-82a4-83758c155f99" - }, - "source": [ - "## Log the retrained Ridge model we created with the feature importance graph as attachment" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ce8afa6c-ac94-4846-9032-80457024d40f", - "metadata": { - "id": "ce8afa6c-ac94-4846-9032-80457024d40f" - }, - "outputs": [], - "source": [ - "#Declare the Ridge model with the information you want to document\n", - "vect_model = vectice.Model(library=\"scikit-learn\", \n", - " technique=\"Ridge Regression\",\n", - " metrics={\"mae_train\": round(mae_train,2), \"mae_test\": round(mae_test,2)}, \n", - " properties=model.named_steps, \n", - " predictor=model, # Pass your model as a predictor to save it as a pickle file\n", - " derived_from=modeling_dataset, # Pass your modeling dataset to document the lineage\n", - " attachments=\"Feature Importance.png\") # Pass your Feature Important graph as an attachment" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d4e6bc65-158a-48ea-9dfd-8f2643a3443c", - "metadata": { - "id": "d4e6bc65-158a-48ea-9dfd-8f2643a3443c" - }, - "outputs": [], - "source": [ - "# Log the Ridge model to the build model step to document it inside the iteration\n", - "model_iteration.step_build_model += vect_model" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "3ddfad85", - "metadata": { - "id": "3ddfad85" - }, - "source": [ - "You can add multiple models to a single step by using the `+=` operator." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "3aa5e8d5-154d-4420-ae95-bf7a8bf7e9bb", - "metadata": { - "id": "3aa5e8d5-154d-4420-ae95-bf7a8bf7e9bb" - }, - "source": [ - "## Log a comment \n", - "\n", - "Similarly to the **model input data** step, passing a `string` to a step will log a comment." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3b2560ad-6cab-4978-98fb-06a8a2cbf0a3", - "metadata": { - "id": "3b2560ad-6cab-4978-98fb-06a8a2cbf0a3" - }, - "outputs": [], - "source": [ - "# Specify some validation metrics that will be used for reviewing the model\n", - "model_iteration.step_model_validation = f\"Model passed acceptance criteria\\nMAE Train: {round(mae_train,2)}\\nMAE Test: {round(mae_test,2)}\"" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "6556088b", - "metadata": { - "id": "6556088b" - }, - "source": [ - "### Once you are satisfied with your iteration you can complete it." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ab7554b8", - "metadata": { - "id": "ab7554b8" - }, - "outputs": [], - "source": [ - "model_iteration.complete()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "15437520", - "metadata": { - "id": "15437520" - }, - "source": [ - "### Completed iterations can't be modified anymore. This can be useful as part of the review process.
\n", - "
" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "2b88cde9", - "metadata": { - "id": "2b88cde9" - }, - "source": [ - "## 🥇 Congrats! You learn how to succesfully use Vectice to auto-document the Model Retraining phase of the Tutorial Project.
\n", - "### Next we encourage you to follow [part 2](https://docs.vectice.com/getting-started/tutorial#part-2) of the Tutorial guide to continue learning about Vectice." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "481b0162-f5de-4137-8271-9524090202f7", - "metadata": { - "id": "481b0162-f5de-4137-8271-9524090202f7" - }, - "source": [ - "✴ You can view your registered assets and comments in the UI by clicking the links in the output messages.." - ] - } - ], - "metadata": { - "colab": { - "provenance": [] - }, - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.6" - } - }, - "nbformat": 4, - "nbformat_minor": 5 + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "id": "267a5abb-312f-436f-99ee-43b9bba7a743", + "metadata": { + "id": "267a5abb-312f-436f-99ee-43b9bba7a743" + }, + "source": [ + "# Before you start with this Model Retraining Notebook\n", + "\n", + "This notebook is the primary notebook used as part of the Vectice Tutorial. It illustrates how to use Vectice with a realistic Business scenario. As part of this Notebook we will retrain a model by droping a column from our dataset. This Notebook maps to the \"Model Retraining\" phase of the **\"Tutorial: Forecast in store-unit sales\"** project you can find in your personal Vectice workspace.\n", + "\n", + "### Pre-requisites:\n", + "Before using this notebook you will need:\n", + "* An account in Vectice\n", + "* An API token to connect to Vectice through the APIs\n", + "* Copy the your Model Retraining Phase Id\n", + "\n", + "Refer to Vectice Tutorial Guide for more detailed instructions: https://docs.vectice.com/getting-started/tutorial\n", + "\n", + "\n", + "### Other Resources\n", + "* Vectice Documentation: https://docs.vectice.com/
\n", + "* Vectice API documentation: https://api-docs.vectice.com/" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "cc1036be", + "metadata": { + "id": "cc1036be" + }, + "source": [ + "## We assume in this exercise that we want to retrain a Ridge model because the variable 'postal code' was left accidentally inside our initial modeling dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8b0ced23-cc10-4fd0-b1d3-b032efb3fcfc", + "metadata": { + "id": "8b0ced23-cc10-4fd0-b1d3-b032efb3fcfc" + }, + "outputs": [], + "source": [ + "import warnings\n", + "warnings.filterwarnings(\"ignore\", category=DeprecationWarning)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "9c94ff6a-2eb2-4a27-8ee4-2b9712b47b17", + "metadata": { + "id": "9c94ff6a-2eb2-4a27-8ee4-2b9712b47b17" + }, + "source": [ + "## Install the latest Vectice Python client library" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8fd7daee", + "metadata": { + "id": "8fd7daee" + }, + "outputs": [], + "source": [ + "%pip install -q vectice -U" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9925c2b4-1873-42ba-a487-6ba08c71df12", + "metadata": { + "id": "9925c2b4-1873-42ba-a487-6ba08c71df12" + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.metrics import mean_absolute_error\n", + "from sklearn.linear_model import Ridge\n", + "from sklearn.pipeline import make_pipeline\n", + "from sklearn.compose import ColumnTransformer\n", + "from sklearn.preprocessing import OneHotEncoder, StandardScaler" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "e4b94319-e1d0-4a0b-baf6-0eb59ed5e06e", + "metadata": { + "id": "e4b94319-e1d0-4a0b-baf6-0eb59ed5e06e" + }, + "source": [ + "## Get started by connecting to Vectice\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "ae4e31a6", + "metadata": { + "id": "ae4e31a6" + }, + "source": [ + "
\n", + "Automated code lineage: The code lineage functionalities are not covered as part of this Tutorial as they require to first setting up a Git repository.\n", + "
" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "9e94fb33", + "metadata": { + "id": "9e94fb33" + }, + "source": [ + "**First, we need to authenticate to the Vectice server. Before proceeding further:**\n", + "\n", + "- Visit the Vectice app to create and copy an API token (cf. https://docs.vectice.com/getting-started/create-an-api-token)\n", + "\n", + "- Paste the API token in the code below" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f15b5e87-ceea-41cf-90f2-11fa6c7b4633", + "metadata": { + "id": "f15b5e87-ceea-41cf-90f2-11fa6c7b4633" + }, + "outputs": [], + "source": [ + "import vectice\n", + "\n", + "vec = vectice.connect(api_token=\"my-api-token\") #Paste your API token" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "14681ed0-a7c5-4175-b95c-87186fb28bc2", + "metadata": { + "id": "14681ed0-a7c5-4175-b95c-87186fb28bc2" + }, + "source": [ + "## Specify which project phase you want to document\n", + "\n", + "- In Vectice UI, navigate to your personal workspace, navigate to your Tutorial: Forecast in store-unit sales project.\n", + "\n", + "- Go to the Model Retraining Phase and copy and paste your Phase Id below. (cf. https://docs.vectice.com/getting-started/tutorial#duplicate-phase)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "552f544a-8b61-45dc-b2c5-d7bd1972e659", + "metadata": { + "id": "552f544a-8b61-45dc-b2c5-d7bd1972e659" + }, + "outputs": [], + "source": [ + "phase = vec.phase(\"PHA-xxxx\") # Paste your own Model Retraining phase ID" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "36522f13-0925-4f14-80a9-072890887f68", + "metadata": { + "id": "36522f13-0925-4f14-80a9-072890887f68" + }, + "source": [ + "### Next we are going to create an iteration\n", + "An iteration allows you to organize your work in repeatable sequences of steps. You can have multiple iteration within a phase." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "16d37d91-8f63-4fbc-b6eb-8aaff588e84c", + "metadata": { + "id": "16d37d91-8f63-4fbc-b6eb-8aaff588e84c" + }, + "outputs": [], + "source": [ + "model_iteration = phase.create_or_get_current_iteration()" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "ee0aba94-baf2-49bb-a9bc-9b2c16852dc2", + "metadata": { + "id": "ee0aba94-baf2-49bb-a9bc-9b2c16852dc2" + }, + "source": [ + "## Re-create the modeling Dataset without the postal code\n", + "\n", + "Load the data from GitHub. This DataFrame has already been cleaned as part of the Data Preparation Phase." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "02c4e722-a008-4989-84bb-500cc5f0534c", + "metadata": { + "id": "02c4e722-a008-4989-84bb-500cc5f0534c" + }, + "outputs": [], + "source": [ + "df = pd.read_csv(\"https://raw.githubusercontent.com/vectice/GettingStarted/main/23.2/tutorial/ProductSales%20Cleaned.csv\", converters = {'Postal Code': str})\n", + "df.head()" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "eb3c8bf4", + "metadata": { + "id": "eb3c8bf4" + }, + "source": [ + "### Remove Postal code" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "99534610-4ed8-4672-8ab3-d8b2df33664b", + "metadata": { + "id": "99534610-4ed8-4672-8ab3-d8b2df33664b" + }, + "outputs": [], + "source": [ + "X = df.drop([\"Sales\", \"Postal Code\"],axis=1)\n", + "y = df[\"Sales\"]\n", + "print(X.shape)\n", + "print(y.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3423b731-454e-4909-850a-d503ca9f2bf5", + "metadata": { + "id": "3423b731-454e-4909-850a-d503ca9f2bf5" + }, + "outputs": [], + "source": [ + "X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.2, random_state=42)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0d074b66-37a5-4882-b600-b630a0288b71", + "metadata": { + "id": "0d074b66-37a5-4882-b600-b630a0288b71" + }, + "outputs": [], + "source": [ + "# Save the modeling train test split datasets as csv files\n", + "train_df = X_train.copy()\n", + "test_df = X_test.copy()\n", + "\n", + "train_df[\"Sales\"] = y_train\n", + "test_df[\"Sales\"] = y_test\n", + "\n", + "train_df.to_csv(\"train dataset.csv\", index=False)\n", + "test_df.to_csv(\"test dataset.csv\", index=False)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "bb81eaf7-f880-4c77-923a-f5deb7832cf3", + "metadata": { + "id": "bb81eaf7-f880-4c77-923a-f5deb7832cf3" + }, + "source": [ + "### Log the recreated modeling Dataset in Vectice\n", + "The Vectice resource will automatically extract pertinent metadata from the local dataset file and collect statistics (optional) from the pandas dataframe. This information will be documented within the iteration as part of a Dataset version." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "016c1281-3880-498e-8a8c-ab45113f2688", + "metadata": { + "id": "016c1281-3880-498e-8a8c-ab45113f2688" + }, + "outputs": [], + "source": [ + "train_ds = vectice.FileResource(paths=\"train dataset.csv\", dataframes=train_df)\n", + "test_ds = vectice.FileResource(paths=\"test dataset.csv\", dataframes=test_df)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0416ab44-16c4-4905-a582-306cb87b6566", + "metadata": { + "id": "0416ab44-16c4-4905-a582-306cb87b6566" + }, + "outputs": [], + "source": [ + "#Declare the modeling Dataset\n", + "modeling_dataset = vectice.Dataset.modeling(\n", + " name=\"ProductSales Modeling\",\n", + " training_resource=train_ds,\n", + " testing_resource=test_ds, \n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f7d80585-cbfe-40a0-b360-ef7798d1cfed", + "metadata": { + "id": "f7d80585-cbfe-40a0-b360-ef7798d1cfed" + }, + "outputs": [], + "source": [ + "# Log the new modeling Dataset to the model input data step to document it inside the iteration\n", + "model_iteration.step_model_input_data = modeling_dataset" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "BD3YdscrnPHb", + "metadata": { + "id": "BD3YdscrnPHb" + }, + "source": [ + "## Log a comment to indicate you removed the \"postal code\" column\n", + "\n", + "Passing a `string` to a step will log a comment. We will use += operator to document more than one item in the same step." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "JqqLmQK-ll9q", + "metadata": { + "id": "JqqLmQK-ll9q" + }, + "outputs": [], + "source": [ + "model_iteration.step_model_input_data += \"The postal code column was removed from the modeling dataset\"" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "377d7923", + "metadata": { + "id": "377d7923" + }, + "source": [ + "## Retrain a Ridge regressor model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ff62ae55-87c8-41ea-a02b-0ec109ab7438", + "metadata": { + "id": "ff62ae55-87c8-41ea-a02b-0ec109ab7438" + }, + "outputs": [], + "source": [ + "OHE = OneHotEncoder(handle_unknown='infrequent_if_exist')\n", + "scaler = StandardScaler()\n", + "\n", + "cat_cols = ['Ship Mode', 'Segment', 'Country', 'City', 'State','Region', 'Category', 'Sub-Category']\n", + "num_cols = ['Quantity', 'Discount', 'Profit']\n", + "\n", + "transformer = ColumnTransformer([('cat_cols', OHE, cat_cols),\n", + " ('num_cols', scaler, num_cols)])\n", + "\n", + "model = make_pipeline(transformer,Ridge())\n", + "model.fit(X_train,y_train)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cfaa69c9-83d5-4fa8-82c6-37628ad258b6", + "metadata": { + "id": "cfaa69c9-83d5-4fa8-82c6-37628ad258b6" + }, + "outputs": [], + "source": [ + "# Making Prediction with the training data\n", + "y_train_pred = model.predict(X_train)\n", + "#Evaluating the model \n", + "mae_train=mean_absolute_error(y_train, y_train_pred)\n", + "print(round(mae_train,2))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8e16f056-b366-4f5f-b1a3-a031b8316074", + "metadata": { + "id": "8e16f056-b366-4f5f-b1a3-a031b8316074" + }, + "outputs": [], + "source": [ + "# Making Prediction with the testing data\n", + "y_test_pred = model.predict(X_test)\n", + "#Evaluating the model \n", + "mae_test = mean_absolute_error(y_test, y_test_pred)\n", + "print(round(mae_test,2))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "de507cdf-0576-4ef0-ba5e-10fd1f69c176", + "metadata": { + "id": "de507cdf-0576-4ef0-ba5e-10fd1f69c176" + }, + "outputs": [], + "source": [ + "#Generate feature importance\n", + "feature_names = transformer.get_feature_names_out()\n", + "feature_importances = model.named_steps['ridge'].coef_\n", + "\n", + "feat_imf = pd.Series(feature_importances, index=feature_names).sort_values()\n", + "\n", + "feat_imf.tail(10).plot(kind=\"barh\")\n", + "plt.ylabel(\"Features\")\n", + "plt.xlabel(\"Importance\")\n", + "plt.title(\"Feature Importance\")\n", + "plt.tight_layout()\n", + "plt.savefig(\"Feature Importance.png\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "f53c5948-d301-4bfe-82a4-83758c155f99", + "metadata": { + "id": "f53c5948-d301-4bfe-82a4-83758c155f99" + }, + "source": [ + "## Log the retrained Ridge model we created with the feature importance graph as attachment" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ce8afa6c-ac94-4846-9032-80457024d40f", + "metadata": { + "id": "ce8afa6c-ac94-4846-9032-80457024d40f" + }, + "outputs": [], + "source": [ + "#Declare the Ridge model with the information you want to document\n", + "vect_model = vectice.Model(library=\"scikit-learn\", \n", + " technique=\"Ridge Regression\",\n", + " metrics={\"mae_train\": round(mae_train,2), \"mae_test\": round(mae_test,2)}, \n", + " properties=model.named_steps, \n", + " predictor=model, # Pass your model as a predictor to save it as a pickle file\n", + " derived_from=modeling_dataset, # Pass your modeling dataset to document the lineage\n", + " attachments=\"Feature Importance.png\") # Pass your Feature Important graph as an attachment" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d4e6bc65-158a-48ea-9dfd-8f2643a3443c", + "metadata": { + "id": "d4e6bc65-158a-48ea-9dfd-8f2643a3443c" + }, + "outputs": [], + "source": [ + "# Log the Ridge model to the build model step to document it inside the iteration\n", + "model_iteration.step_build_model += vect_model" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "3ddfad85", + "metadata": { + "id": "3ddfad85" + }, + "source": [ + "You can add multiple models to a single step by using the `+=` operator." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "3aa5e8d5-154d-4420-ae95-bf7a8bf7e9bb", + "metadata": { + "id": "3aa5e8d5-154d-4420-ae95-bf7a8bf7e9bb" + }, + "source": [ + "## Log a comment \n", + "\n", + "Similarly to the **model input data** step, passing a `string` to a step will log a comment." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3b2560ad-6cab-4978-98fb-06a8a2cbf0a3", + "metadata": { + "id": "3b2560ad-6cab-4978-98fb-06a8a2cbf0a3" + }, + "outputs": [], + "source": [ + "# Specify some validation metrics that will be used for reviewing the model\n", + "model_iteration.step_model_validation = f\"Model passed acceptance criteria\\nMAE Train: {round(mae_train,2)}\\nMAE Test: {round(mae_test,2)}\"" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "6556088b", + "metadata": { + "id": "6556088b" + }, + "source": [ + "### Once you are satisfied with your iteration you can complete it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ab7554b8", + "metadata": { + "id": "ab7554b8" + }, + "outputs": [], + "source": [ + "model_iteration.complete()" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "15437520", + "metadata": { + "id": "15437520" + }, + "source": [ + "### Completed iterations can't be modified anymore. This can be useful as part of the review process.
\n", + "
" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "2b88cde9", + "metadata": { + "id": "2b88cde9" + }, + "source": [ + "## 🥇 Congrats! You learn how to succesfully use Vectice to auto-document the Model Retraining phase of the Tutorial Project.
\n", + "### Next we encourage you to follow [part 2](https://docs.vectice.com/getting-started/tutorial#part-2) of the Tutorial guide to continue learning about Vectice." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "481b0162-f5de-4137-8271-9524090202f7", + "metadata": { + "id": "481b0162-f5de-4137-8271-9524090202f7" + }, + "source": [ + "✴ You can view your registered assets and comments in the UI by clicking the links in the output messages.." + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.5" + } + }, + "nbformat": 4, + "nbformat_minor": 5 } diff --git a/23.2/tutorial/Modeling.ipynb b/23.2/tutorial/Modeling.ipynb index fc79cab..a3dafd0 100644 --- a/23.2/tutorial/Modeling.ipynb +++ b/23.2/tutorial/Modeling.ipynb @@ -361,7 +361,7 @@ " ('num_cols', scaler, num_cols)])\n", "\n", "model = make_pipeline(transformer,Ridge())\n", - "model.fit(X,y)" + "model.fit(X_train,y_train)" ] }, { @@ -545,7 +545,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.6" + "version": "3.9.5" } }, "nbformat": 4,