|
174 | 174 | "\n",
|
175 | 175 | "- `SelectKBest` removes all but the `k` highest scoring features\n",
|
176 | 176 | "\n",
|
177 |
| - "- `SelectPercentile` removes all but a user-specified highest scoring percentage of features using common univariate statistical tests for each feature: false positive rate `SelectFpr`, false discovery rate `SelectFdr`, or family wise error `SelectFwe`.\n", |
| 177 | + "- `SelectPercentile` removes all but a user-specified highest scoring percentage of features \n", |
178 | 178 | "\n",
|
179 |
| - "- `GenericUnivariateSelect` allows to perform univariate feature selection with a configurable strategy. This allows to select the best univariate selection strategy with hyper-parameter search estimator.\n", |
| 179 | + "- `GenericUnivariateSelect` performs univariate feature selection with a configurable strategy\n", |
180 | 180 | "\n",
|
181 | 181 | "These objects take as input a scoring function that returns univariate scores and p-values (or only scores for `SelectKBest` and `SelectPercentile`):\n",
|
182 | 182 | "\n",
|
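A minimal sketch of the three selectors listed in this hunk, for reference. The iris dataset and the `f_classif` scoring function are assumptions made for illustration; the notebook may use different data and a different scoring function.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import (
    GenericUnivariateSelect,
    SelectKBest,
    SelectPercentile,
    f_classif,
)

# Assumed example data: 150 samples, 4 numeric features.
X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest ANOVA F-score.
X_kbest = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Keep the top 50% of features by score.
X_pct = SelectPercentile(score_func=f_classif, percentile=50).fit_transform(X, y)

# The same k-best selection, expressed through the configurable wrapper.
generic = GenericUnivariateSelect(score_func=f_classif, mode="k_best", param=2)
X_generic = generic.fit_transform(X, y)

print(X_kbest.shape, X_pct.shape, X_generic.shape)
```

Because `mode` and `param` are ordinary estimator parameters, `GenericUnivariateSelect` can be placed in a pipeline and tuned with a hyper-parameter search.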
|
210 | 210 | "id": "7e76a9cc",
|
211 | 211 | "metadata": {},
|
212 | 212 | "source": [
|
213 |
| - "### Pearson Correlation Coefficient\n", |
| 213 | + "### Correlation Coefficient\n", |
214 | 214 | "Correlation measures the strength of the linear relationship between two variables. We assume that **good variables** are **highly correlated** with the target. When two features are highly correlated with each other, we may also want to drop one of them because it is redundant. \n",
|
215 | 215 | "<br><br>\n",
|
216 | 216 | "<div align=\"center\">\n",
|
217 | 217 | " <img alt=\"Several sets of (x, y) points, with the correlation coefficient of x and y for each set.\" src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Correlation_examples2.svg/1920px-Correlation_examples2.svg.png\" width=\"400\" height=\"200\"><br>\n",
|
218 | 218 | "  <sup>Sample datasets and their Pearson correlation coefficients.</sup>\n",
|
219 | 219 | "</div>\n",
|
220 | 220 | " \n",
|
221 |
| - "We will show an example that drop the variable which has a lower correlation coefficient value with the target variable. We need to set an absolute value, for example, 0.4 as the threshold for selecting the variables." |
| 221 | + "We will show an example that drops the variables whose *Pearson* correlation coefficient with the target variable is low. We need to set a threshold on the absolute value of the coefficient, for example 0.4, for selecting the variables.\n", |
| 222 | + " \n", |
| 223 | + "You may also use other correlation coefficients, such as *Spearman* or *Kendall*." |
222 | 224 | ]
|
223 | 225 | },
|
224 | 226 | {
|
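The thresholding step described in the correlation cell can be sketched as follows. The synthetic DataFrame and the column names `strong`, `noise`, and `target` are hypothetical; only the 0.4 threshold comes from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one feature tied to the target, one pure-noise feature.
rng = np.random.default_rng(0)
n = 200
strong = rng.normal(size=n)
noise = rng.normal(size=n)
target = 2.0 * strong + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"strong": strong, "noise": noise, "target": target})

threshold = 0.4  # threshold on the absolute correlation, as in the text

# Pearson correlation of every feature with the target.
corr_with_target = df.corr(method="pearson")["target"].drop("target")

# Keep only features whose absolute correlation reaches the threshold.
selected = corr_with_target[corr_with_target.abs() >= threshold].index.tolist()
print(selected)
```

`DataFrame.corr` also accepts `method="spearman"` and `method="kendall"`, so switching coefficients is a one-argument change.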
|
248 | 250 | "metadata": {},
|
249 | 251 | "source": [
|
250 | 252 | "### Variance Threshold\n",
|
251 |
| - "`VarianceThreshold` is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples." |
| 253 | + "`VarianceThreshold` is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.\n", |
| 254 | + "\n", |
| 255 | + "The estimator only works with numeric data and it will raise an error if there are categorical features present in the dataframe." |
252 | 256 | ]
|
253 | 257 | },
|
254 | 258 | {
|
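A small sketch of `VarianceThreshold` with its default threshold. The toy DataFrame is hypothetical, and filtering to numeric columns with `select_dtypes` before fitting is one possible way to avoid the error on categorical features, not necessarily what the notebook does.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical frame: a constant column, a varying column, a categorical one.
df = pd.DataFrame(
    {
        "constant": [1, 1, 1, 1],
        "varying": [0, 1, 2, 3],
        "colour": ["red", "blue", "red", "green"],
    }
)

# The estimator needs numeric input, so set the categorical column aside.
numeric = df.select_dtypes(include="number")

selector = VarianceThreshold()  # default threshold=0.0 drops constant columns
selector.fit(numeric)

# Map the boolean support mask back to column names.
kept = numeric.columns[selector.get_support()].tolist()
print(kept)
```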
|
365 | 369 | "metadata": {},
|
366 | 370 | "source": [
|
367 | 371 | "### Exhaustive Feature Selection\n",
|
368 |
| - "This is a brute-force evaluation of each feature subset. It tries every possible combination of the variables and returns the best performing subset but also take longer time." |
| 372 | + "This is a brute-force evaluation of each feature subset. It tries every possible combination of the variables and returns the best-performing subset, but it takes much longer to run." |
369 | 373 | ]
|
370 | 374 | },
|
371 | 375 | {
|
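To make the cost of the brute-force search concrete, here is a hand-rolled sketch that scores every non-empty feature subset with cross-validation. It is not the notebook's implementation; the iris data, the `LogisticRegression` model, and 5-fold accuracy are all assumptions chosen for illustration.

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumed example data and model.
X, y = load_iris(return_X_y=True)
n_features = X.shape[1]
model = LogisticRegression(max_iter=1000)

best_score, best_subset, n_evaluated = -1.0, None, 0

# Enumerate every non-empty subset of feature indices: 2**n_features - 1 of them.
for size in range(1, n_features + 1):
    for subset in combinations(range(n_features), size):
        # Mean 5-fold cross-validated accuracy on the chosen columns.
        score = cross_val_score(model, X[:, list(subset)], y, cv=5).mean()
        n_evaluated += 1
        if score > best_score:
            best_score, best_subset = score, subset

print(best_subset, round(best_score, 3), n_evaluated)
```

With 4 features the loop fits 15 subsets; the count doubles with every added feature, which is why the exhaustive approach is only practical for small feature sets.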
|