|
44 | 44 | "source": [
|
45 | 45 | "## Why using Azure blobstorage is necessary\n",
|
46 | 46 | "\n",
|
47 |
| - "In this tutorial we copy the data from an Exasol Saas database into an Azure Blobstorage Container. This is necessary because while AzureML has functionality to import directly from SQL databases, the Exasol SQL dialect is not supported by AzureML at the moment of writing.\n" |
| 47 | + "In this tutorial we copy the data from an Exasol Saas Database into an Azure Blob Storage Container. This is necessary because while AzureML has functionality to import directly from SQL databases, the Exasol SQL dialect is not supported by AzureML at the time of writing.\n" |
48 | 48 | ],
|
49 | 49 | "metadata": {
|
50 | 50 | "collapsed": false
|
|
129 | 129 | {
|
130 | 130 | "cell_type": "markdown",
|
131 | 131 | "source": [
|
132 |
| - "## Data Preprocessing\n", |
133 |
| - "explanation\n", |
134 |
| - " - why here not in azure\n", |
135 |
| - " - what gets done\n", |
| 132 | + "We will also need access to the Azure Storage Account, which we will use later to transfer the data. For that, you need to insert your Azure Storage Account Name and Access Key. To find your Access Key, navigate to your Storage Account in the Azure portal, click on \"Access Keys\" under \"Security + networking\", and copy one of your Access Keys.\n", |
136 | 133 | "\n",
|
137 |
| - "\"There are two things we need to do:\n", |
138 |
| - "\n", |
139 |
| - " Split into train and validation data\n", |
140 |
| - " Replace CLASS column by a column with boolean values\n", |
141 |
| - "\n", |
142 |
| - "For the split we add a column SPLIT that has a random value between 0 and 1, so we can partition the data by a condition on that column.\n", |
143 |
| - "\n", |
144 |
| - "In addition, we replace the CLASS with the text values pos and neg by a new column CLASS_POS with boolean values.\"\n", |
145 |
| - " - mention test table to" |
| 134 | + "\n" |
146 | 135 | ],
|
147 | 136 | "metadata": {
|
148 | 137 | "collapsed": false
|
|
153 | 142 | "execution_count": null,
|
154 | 143 | "outputs": [],
|
155 | 144 | "source": [
|
156 |
| - "all_columns = exasol.export_to_pandas(\"SELECT * FROM IDA.TRAIN LIMIT 1;\")\n", |
157 |
| - "column_names = list(all_columns)\n", |
158 |
| - "column_names.remove(\"CLASS\")\n", |
159 |
| - "exasol.execute(\"\"\"CREATE OR REPLACE TABLE IDA.TRAIN_PREPARED AS (\n", |
160 |
| - " SELECT RANDOM() AS SPLIT,\n", |
161 |
| - " (CLASS = 'pos') as CLASS_POS, {all_columns_except_class!q} FROM IDA.TRAIN)\"\"\",\n", |
162 |
| - " {\"all_columns_except_class\": column_names})\n", |
163 |
| - "\n", |
| 145 | + "from azure.ai.ml.entities import AccountKeyConfiguration\n", |
164 | 146 | "\n",
|
| 147 | + "my_storage_account_name = \"your_storage_account_name\" # change\n", |
| 148 | + "account_key=\"your_storage_account_key\" # change\n", |
165 | 149 | "\n",
|
166 |
| - "exasol.export_to_pandas(\"SELECT * FROM IDA.TRAIN_PREPARED LIMIT 4\")" |
| 150 | + "credentials = AccountKeyConfiguration(account_key)" |
167 | 151 | ],
|
168 | 152 | "metadata": {
|
169 | 153 | "collapsed": false,
|
|
172 | 156 | }
|
173 | 157 | }
|
174 | 158 | },
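Before moving on, it can help to sanity-check these values. The following is a minimal sketch (not part of the notebook itself): it derives the Blob service endpoint from the account name set above; the optional connectivity check assumes the `azure-storage-blob` package and real credentials, so it is left commented out.

```python
# Sketch only: derive the Blob service endpoint from the placeholders above.
my_storage_account_name = "your_storage_account_name"  # change
account_key = "your_storage_account_key"               # change

account_url = f"https://{my_storage_account_name}.blob.core.windows.net"
print(account_url)

# Optional connectivity check, assuming azure-storage-blob is installed and
# real credentials were filled in above:
# from azure.storage.blob import BlobServiceClient
# client = BlobServiceClient(account_url, credential=account_key)
# print([c.name for c in client.list_containers()])
```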
|
| 159 | + { |
| 160 | + "cell_type": "markdown", |
| 161 | + "source": [ |
| 162 | + "### Data Preprocessing\n", |
| 163 | + "\n", |
| 164 | + "Now that we are set up for the data transfer, we are first going to preprocess the data in the Exasol Database before pulling the data into Azure. We want to replace the text-based \"CLASS\" column in all data tables with a boolean column called \"CLASS_POS\", which will make classification easier.\n", |
| 165 | + "\n", |
| 166 | + "For your own project, you need to evaluate which preprocessing steps to run in the efficient Exasol Database and which might be easier to accomplish later on the CSV files in Azure Blob Storage.\n", |
| 167 | + "\n", |
| 168 | + "First, we create a new table \"TRAIN_PREPARED\", which is a copy of the \"TRAIN\" table with the \"CLASS\" column replaced by \"CLASS_POS\"." |
| 169 | + ], |
| 170 | + "metadata": { |
| 171 | + "collapsed": false |
| 172 | + } |
| 173 | + }, |
175 | 174 | {
|
176 | 175 | "cell_type": "code",
|
177 | 176 | "execution_count": null,
|
178 | 177 | "outputs": [],
|
179 | 178 | "source": [
|
180 |
| - "exasol.execute(\"\"\"CREATE OR REPLACE TABLE IDA.TEST_PREPARED AS (\n", |
181 |
| - " SELECT\n", |
182 |
| - " (CLASS = 'pos') as CLASS_POS, {all_columns_except_class!q} FROM IDA.TEST)\"\"\",\n", |
| 179 | + "all_columns = exasol.export_to_pandas(\"SELECT * FROM IDA.TRAIN LIMIT 1;\")\n", |
| 180 | + "column_names = list(all_columns)\n", |
| 181 | + "column_names.remove(\"CLASS\")\n", |
| 182 | + "exasol.execute(\"\"\"CREATE OR REPLACE TABLE IDA.TRAIN_PREPARED AS (\n", |
| 183 | + " SELECT\n", |
| 184 | + " (CLASS = 'pos') as CLASS_POS, {all_columns_except_class!q} FROM IDA.TRAIN)\"\"\",\n", |
183 | 185 | " {\"all_columns_except_class\": column_names})\n",
|
184 | 186 | "\n",
|
185 | 187 | "\n",
|
186 | 188 | "\n",
|
187 |
| - "exasol.export_to_pandas(\"SELECT * FROM IDA.TEST_PREPARED LIMIT 4\")" |
| 189 | + "exasol.export_to_pandas(\"SELECT * FROM IDA.TRAIN_PREPARED LIMIT 4\")" |
188 | 190 | ],
|
189 | 191 | "metadata": {
|
190 | 192 | "collapsed": false,
|
|
196 | 198 | {
|
197 | 199 | "cell_type": "markdown",
|
198 | 200 | "source": [
|
199 |
| - "\n", |
200 |
| - "### Load data into AzureML Blobstore\n", |
201 |
| - "\n", |
202 |
| - "\n", |
203 |
| - "For this step, we need to access the Azure Storage Account. For that you need to insert your Azure storage account name and access key. To find your access key, in the Azure portal navigate to your storage account, and click on \"Access Keys\" under \"Security + networking\" and copy one of your access Keys.\n", |
204 |
| - "\n", |
205 |
| - "\n" |
| 201 | + "Then we create a new \"TEST_PREPARED\" table as a copy of the \"TEST\" table, again with \"CLASS\" replaced by \"CLASS_POS\"." |
206 | 202 | ],
|
207 | 203 | "metadata": {
|
208 | 204 | "collapsed": false
|
|
213 | 209 | "execution_count": null,
|
214 | 210 | "outputs": [],
|
215 | 211 | "source": [
|
216 |
| - "from azure.ai.ml.entities import AccountKeyConfiguration\n", |
| 212 | + "exasol.execute(\"\"\"CREATE OR REPLACE TABLE IDA.TEST_PREPARED AS (\n", |
| 213 | + " SELECT\n", |
| 214 | + " (CLASS = 'pos') as CLASS_POS, {all_columns_except_class!q} FROM IDA.TEST)\"\"\",\n", |
| 215 | + " {\"all_columns_except_class\": column_names})\n", |
217 | 216 | "\n",
|
218 |
| - "my_storage_account_name = \"your_storage_account_name\" # change\n", |
219 |
| - "account_key=\"your_storage_account_key\" # change\n", |
220 | 217 | "\n",
|
221 |
| - "credentials= AccountKeyConfiguration(account_key)" |
| 218 | + "\n", |
| 219 | + "exasol.export_to_pandas(\"SELECT * FROM IDA.TEST_PREPARED LIMIT 4\")" |
222 | 220 | ],
|
223 | 221 | "metadata": {
|
224 | 222 | "collapsed": false,
|
|
230 | 228 | {
|
231 | 229 | "cell_type": "markdown",
|
232 | 230 | "source": [
|
233 |
| - "Lastly, we use an \"EXPORT TABLE\" command for each of our data tables to export them into a CSV file in our Blobstorage using \"INTO CSV AT CLOUD AZURE BLOBSTORAGE\". You can find [the domumentation for this export command](https://docs.exasol.com/db/latest/sql/export.htm) in the Exasol documentation.\n", |
234 |
| - "If you choose an existing \"azure_storage_container_name\", this command will save your files in this container. Otherwise, an azure storage container with that name will be created automatically.\n", |
235 |
| - "When you created your AzureML workspace, an Azure blob container was [created automatically](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-access-data) and added as a Datastore named \"workspaceblobstore\" to your workspace. You can use it here and then scip the \"Create a Datastore\" step below if you want. For this you would need to find its name (\"azureml-blobstore-some-ID\") in the datastore info and insert it here." |
| 231 | + "### Load Data into AzureML Blob Storage\n", |
| 232 | + "\n", |
| 233 | + "Now that our data is prepared and we have access to our Azure Storage Account and our Exasol Saas Cluster, we use an \"EXPORT TABLE\" command for each of our data tables to export them into a CSV file in our Blob Storage using \"INTO CSV AT CLOUD AZURE BLOBSTORAGE\". You can find [the documentation for this export command](https://docs.exasol.com/db/latest/sql/export.htm) in the Exasol documentation.\n", |
| 234 | + "If you choose an existing Azure Blob Storage container, this command will save your files in this container. Otherwise, a new container with the given name will be created automatically.\n", |
| 235 | + "When you created your AzureML Workspace, an Azure Blob Container was [created automatically](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-access-data) and added as a Datastore named \"workspaceblobstore\" to your workspace. You can use it here and then skip the \"Create a Datastore\" step below if you want. For this you would need to find its name (\"azureml-blobstore-some-ID\") in the Datastore Information and insert it here." |
236 | 236 | ],
|
237 | 237 | "metadata": {
|
238 | 238 | "collapsed": false
|
|
241 | 241 | {
|
242 | 242 | "cell_type": "markdown",
|
243 | 243 | "source": [
|
244 |
| - "## todo\n", |
245 |
| - "- change and add explanation to preprocessing\n", |
246 |
| - "- update image of loaded tables (reload without split column beforehand)\n", |
247 |
| - "- add notze about selecing columns\n", |
248 |
| - "- add note about importing more than once -> appends not make new file!" |
| 244 | + "Some of the 170 features of the Scania Trucks dataset do not have a notable influence on the classification or contain a large number of empty values. Because of this, we select only a subset of the columns for training and import only these features into Azure.\n", |
| 245 | + "\n", |
| 246 | + "Once we have selected the column names we want to use, we transfer the \"TEST_PREPARED\" table using the Exasol EXPORT command." |
249 | 247 | ],
|
250 | 248 | "metadata": {
|
251 | 249 | "collapsed": false
|
|
264 | 262 | " 'CS_001', 'DD_000', 'DE_000', 'DN_000', 'DS_000', 'DU_000', 'DV_000', 'EB_000', 'EE_005']\n",
|
265 | 263 | "\n",
|
266 | 264 | "blobstorage_name = \"azureml-tutorial\" # change; note: you might need to remove the \"_datastore\" suffix\n",
|
| 265 | + "\n", |
267 | 266 | "save_path = f'{blobstorage_name}/ida/{table}'\n",
|
268 |
| - "sql_export = \"EXPORT (SELECT {column_names!q}\" + f\" FROM IDA.{table}) INTO CSV AT CLOUD AZURE BLOBSTORAGE 'DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net'\"\\\n", |
269 |
| - " f\"USER '{my_storage_account_name}' IDENTIFIED BY '{credentials.account_key}' FILE '{save_path}' WITH COLUMN NAMES\"\n", |
270 |
| - "exasol.execute(sql_export, {\"column_names\": column_names})\n", |
| 267 | + "sql_export = \"\"\"EXPORT (SELECT {column_names!q} FROM IDA.{table!q})\n", |
| 268 | + " INTO CSV AT CLOUD AZURE BLOBSTORAGE 'DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net'\n", |
| 269 | + " USER '{my_storage_account_name!q}' IDENTIFIED BY '{account_key!q}'\n", |
| 270 | + " FILE '{save_path!q}' WITH COLUMN NAMES REPLACE\"\"\"\n", |
| 271 | + "\n", |
| 272 | + "\n", |
| 273 | + "exasol.execute(sql_export, {\"column_names\": column_names,\n", |
| 274 | + " \"table\": table,\n", |
| 275 | + " \"my_storage_account_name\": my_storage_account_name,\n", |
| 276 | + " \"account_key\": credentials.account_key,\n", |
| 277 | + " \"save_path\": save_path})\n", |
271 | 278 | "print(f\"saved {table} in file {save_path}\")"
|
272 | 279 | ],
|
273 | 280 | "metadata": {
|
|
277 | 284 | }
|
278 | 285 | }
|
279 | 286 | },
|
| 287 | + { |
| 288 | + "cell_type": "markdown", |
| 289 | + "source": [ |
| 290 | + "Then we do the same with the TRAIN_PREPARED table:" |
| 291 | + ], |
| 292 | + "metadata": { |
| 293 | + "collapsed": false |
| 294 | + } |
| 295 | + }, |
280 | 296 | {
|
281 | 297 | "cell_type": "code",
|
282 | 298 | "execution_count": null,
|
283 | 299 | "outputs": [],
|
284 | 300 | "source": [
|
285 |
| - "\n", |
286 | 301 | "table = \"TRAIN_PREPARED\"\n",
|
287 | 302 | "save_path = f'{blobstorage_name}/ida/{table}'\n",
|
288 |
| - "sql_export = \"EXPORT (SELECT {column_names!q}\" + f\" FROM IDA.{table} WHERE SPLIT <= 0.8) INTO CSV AT CLOUD AZURE BLOBSTORAGE 'DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net'\"\\\n", |
289 |
| - " f\"USER '{my_storage_account_name}' IDENTIFIED BY '{credentials.account_key}' FILE '{save_path}' WITH COLUMN NAMES\"\n", |
290 |
| - "exasol.execute(sql_export, {\"column_names\": column_names})\n", |
291 |
| - "print(f\"saved {table} in file {save_path}\")\n", |
292 |
| - "\n", |
293 |
| - "save_path = f'{blobstorage_name}/ida/VALIDATE_PREPARED'\n", |
294 |
| - "sql_export = \"EXPORT (SELECT {column_names!q}\" + f\" FROM IDA.{table} WHERE SPLIT > 0.8) INTO CSV AT CLOUD AZURE BLOBSTORAGE 'DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net'\"\\\n", |
295 |
| - " f\"USER '{my_storage_account_name}' IDENTIFIED BY '{credentials.account_key}' FILE '{save_path}' WITH COLUMN NAMES\"\n", |
296 |
| - "exasol.execute(sql_export, {\"column_names\": column_names})\n", |
| 303 | + "\n", |
| 304 | + "exasol.execute(sql_export, {\"column_names\": column_names,\n", |
| 305 | + " \"table\": table,\n", |
| 306 | + " \"my_storage_account_name\": my_storage_account_name,\n", |
| 307 | + " \"account_key\": credentials.account_key,\n", |
| 308 | + " \"save_path\": save_path})\n", |
297 | 309 | "print(f\"saved {table} in file {save_path}\")"
|
298 | 310 | ],
|
299 | 311 | "metadata": {
|
|
303 | 315 | }
|
304 | 316 | }
|
305 | 317 | },
|
| 318 | + { |
| 319 | + "cell_type": "markdown", |
| 320 | + "source": [ |
| 321 | + "Delete the temporary tables from the Exasol Saas Database so that they do not clutter the database." |
| 322 | + ], |
| 323 | + "metadata": { |
| 324 | + "collapsed": false |
| 325 | + } |
| 326 | + }, |
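The cleanup cell follows below; as a minimal sketch of what it could contain (assuming the same `exasol` connection and the table names used above), the DROP statements can be built as strings and inspected before running:

```python
# Sketch only: build cleanup statements for the temporary tables created above.
# "exasol" is the pyexasol connection from the earlier cells.
tables_to_drop = ["TRAIN_PREPARED", "TEST_PREPARED"]
drop_statements = [f"DROP TABLE IF EXISTS IDA.{t};" for t in tables_to_drop]

for stmt in drop_statements:
    print(stmt)             # inspect before running
    # exasol.execute(stmt)  # uncomment to actually drop the table
```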
306 | 327 | {
|
307 | 328 | "cell_type": "code",
|
308 | 329 | "execution_count": null,
|
|
351 | 372 | "You can now see your data directly in AzureML by navigating to \"Datastores\" and clicking on <your_datastore_name>. If you then switch to the \"Browse\" view, you can open your files and have a look at them if you want.\n",
|
352 | 373 | "\n",
|
353 | 374 | "\n",
|
354 |
| - "Great, we successfully connected to our Exasol Saas instance and loaded data from there into our Azure Blobstorage!\n", |
| 375 | + "Great, we successfully connected to our Exasol Saas instance and loaded data from there into our Azure Blob Storage!\n", |
355 | 376 | "\n",
|
356 | 377 | "Now we move on to [working with the data in AzureML and training a model on it](TrainModelInAzureML.ipynb)."
|
357 | 378 | ],
|
|