In classes like `KendallTauCorrelation`, series of datetime type are converted to integers, and this conversion is expensive.
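A minimal sketch of what a cheaper, fully vectorised conversion could look like (the helper name and the dtype handling are assumptions, not the current implementation):

```python
import numpy as np
import pandas as pd

def datetime_to_int(sr: pd.Series) -> pd.Series:
    """Hypothetical helper: vectorised datetime64[ns] -> int64 (nanoseconds since epoch)."""
    if pd.api.types.is_datetime64_dtype(sr):
        # Casting the underlying numpy buffer is a single vectorised operation,
        # unlike converting each element through Python objects.
        return pd.Series(sr.to_numpy().astype(np.int64), index=sr.index, name=sr.name)
    return sr
```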
Conversion to categorical variables in `KendallTauCorrelation` is repeated for `sr_a` and `sr_b`, which could be abstracted away.
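Something like the following hypothetical helper would let both call sites share one conversion:

```python
import pandas as pd

def to_category_codes(sr: pd.Series) -> pd.Series:
    """Hypothetical shared helper: one place for the categorical conversion."""
    return sr.astype("category").cat.codes

# Applied once per series instead of duplicating the conversion inline:
# sr_a, sr_b = to_category_codes(sr_a), to_category_codes(sr_b)
```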
In `Mean._compute_metric`, subtracting and then adding an array of zeros is redundant. It could be simplified to `np.nanmean(sr.values)`.
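For reference, the simplified computation reduces to a single call (toy data, not the class's actual signature):

```python
import numpy as np
import pandas as pd

sr = pd.Series([1.0, 2.0, np.nan, 4.0])
# np.nanmean already ignores NaNs, so no zero-array arithmetic is needed.
mean = np.nanmean(sr.values)  # 2.333...
```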
The `StandardDeviation` class performs a full sort before trimming outliers, which may not be necessary, especially when a significant number of values are removed.
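One possible direction, assuming the goal is to drop a fixed fraction from each tail (whether this matches the class's current trimming semantics is an assumption):

```python
import numpy as np
import pandas as pd

def trimmed_std(sr: pd.Series, trim: float = 0.05) -> float:
    """Sketch: pick the cut-offs with quantiles and mask,
    rather than sorting the whole series before trimming."""
    values = sr.dropna().to_numpy()
    lo, hi = np.quantile(values, [trim, 1.0 - trim])
    kept = values[(values >= lo) & (values <= hi)]
    return float(np.std(kept))
```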
In `EarthMoversDistance`, the operations we perform on pandas series could be done more efficiently. In particular, a `dict` is used to count unique values, but pandas has built-in functions for that (e.g. `value_counts`), so we are likely losing efficiency there as well.
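For the counting step, something along these lines could replace the dict (the helper name and how the aligned histograms are consumed afterwards are assumptions):

```python
import pandas as pd

def aligned_counts(sr_a: pd.Series, sr_b: pd.Series):
    """Sketch: count values with value_counts and align both histograms
    on the union of observed categories, instead of building Python dicts."""
    counts_a = sr_a.value_counts()
    counts_b = sr_b.value_counts()
    space = counts_a.index.union(counts_b.index)
    return (
        counts_a.reindex(space, fill_value=0).to_numpy(),
        counts_b.reindex(space, fill_value=0).to_numpy(),
    )
```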
We can use `np.nan_to_num` in `EarthMoversDistanceBinned` to handle NaNs more explicitly.
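Roughly like this (the histogram variables below are placeholders, not the class's attributes):

```python
import numpy as np

# Replace NaNs in the binned histograms explicitly before computing the
# distance, instead of special-casing them downstream.
hist_a = np.nan_to_num(np.array([0.2, np.nan, 0.5]))
hist_b = np.nan_to_num(np.array([0.1, 0.4, np.nan]))
```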
In several metrics, there are checks for empty series or other special conditions that return a default value. These checks should be moved to the top of the method so we return early, before any expensive work (not critical).
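The guards would simply move to the top of the method, roughly like this (the signature and the default return value are assumptions and vary per metric):

```python
from typing import Optional

import pandas as pd

def _compute_metric(self, sr_a: pd.Series, sr_b: pd.Series) -> Optional[float]:
    # Bail out before any dtype conversion or other expensive work.
    if sr_a.empty or sr_b.empty:
        return None
    ...  # actual metric computation follows
```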
The `check_column_types` method is quite similar across the different classes -- we might want to abstract that (not critical)
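One option is a single implementation on a shared base class, parameterised by the dtype kinds each metric accepts (the class and attribute names here are assumptions, not the existing hierarchy):

```python
import pandas as pd

class TwoColumnMetric:
    """Sketch of a shared base: one check_column_types implementation,
    configured per subclass instead of copy-pasted."""

    # numpy kind codes: "i"/"u"/"f" = signed/unsigned int and float.
    allowed_kinds: tuple = ("i", "u", "f")

    @classmethod
    def check_column_types(cls, sr_a: pd.Series, sr_b: pd.Series) -> bool:
        return all(sr.dtype.kind in cls.allowed_kinds for sr in (sr_a, sr_b))

class KendallTauCorrelation(TwoColumnMetric):
    # Additionally accepts datetime columns ("M"), which get converted to integers.
    allowed_kinds = ("i", "u", "f", "M")
```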
Overall, it seems like there could be less code duplication (not critical)
### Tasks
- [ ] Refactor datetime conversion in `KendallTauCorrelation`
- [ ] Implement a utility function for categorical conversion in `KendallTauCorrelation`
- [ ] Update `Mean._compute_metric` to use `np.nanmean(sr.values)`
- [ ] Update `StandardDeviation` to decide if sorting is necessary
- [ ] Use pandas ops in `EarthMoversDistance`
- [ ] Use `np.nan_to_num` in `EarthMoversDistanceBinned` to handle NaNs
- [ ] Move up the checks for empty series / special conditions
- [ ] Abstract away the similar `check_column_types` methods across all classes