outliers
- class pdcleaner.detection.multivariate.outliers(obj, detector=None, eps=None, min_samples=2)
Bases: _QuantiDataFramesDetector
Detects outliers in a numeric DataFrame using the DBSCAN clustering algorithm.
This detection method flags outliers in N-dimensional numerical datasets. Detection relies on the density-based clustering method DBSCAN, using scikit-learn's implementation.
DBSCAN is run on the column-scaled values of the initial dataset. A default set of rules is used for the DBSCAN parameters: eps is set to the maximum standard deviation of the scaled columns and min_samples is set to 2. Both values can be overridden to fit particular purposes.
The samples that are not part of a cluster are flagged as potential errors.
Rows with missing values are not considered and not flagged as errors.
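The steps above can be sketched directly with scikit-learn. This is a minimal illustration, not the pdcleaner implementation: the function name `flag_outliers` is hypothetical, and `StandardScaler` is an assumption, since the text does not say which column scaling is used.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler


def flag_outliers(df, eps=None, min_samples=2):
    """Hypothetical helper: flag rows that DBSCAN labels as noise."""
    # Rows with missing values are left out of the clustering entirely.
    complete = df.dropna()
    # Assumption: columns are standardized (zero mean, unit variance).
    scaled = StandardScaler().fit_transform(complete)
    if eps is None:
        # Default rule from the text: max standard deviation of the scaled columns.
        eps = scaled.std(axis=0).max()
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(scaled)
    # DBSCAN marks samples that belong to no cluster with the label -1.
    flags = pd.Series(False, index=df.index)
    flags.loc[complete.index] = labels == -1
    return flags


df = pd.DataFrame({'x': [1, 1.1, 4], 'y': [1.1, 1, 4], 'z': [1, 1.1, 4]})
flags = flag_outliers(df)
print(flags)
```

On this three-row frame the first two rows fall in one cluster and the third is flagged. Rows dropped for missing values keep their initial `False` flag, in line with the rule stated above.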
- Parameters:
eps (float) – The maximum Euclidean distance between two samples in the normalized dataset for one to be considered in the neighborhood of the other. By default, it is set to the maximum standard deviation among all normalized variables.
min_samples (int, default 2) – The number of samples to form a cluster.
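To show how the two parameters interact, here is a small sketch that calls scikit-learn's `DBSCAN` directly on an already-scaled toy array. The data and the chosen values are illustrative assumptions, not pdcleaner defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two close points and one distant point (treated here as already scaled).
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])

# min_samples=2: the two close points form a cluster, the third is noise (-1).
labels_default = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)

# min_samples=3: no point has three samples within eps, so everything is noise.
labels_strict = DBSCAN(eps=1.0, min_samples=3).fit_predict(X)

# A large eps puts every point within reach of the others: no noise at all.
labels_wide = DBSCAN(eps=10.0, min_samples=2).fit_predict(X)

print(labels_default, labels_strict, labels_wide)
```

In scikit-learn, the `min_samples` count includes the point itself, which is why a pair of neighbors is enough to form a cluster when `min_samples` is 2.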
- Raises:
TypeError – when eps is not a number, or when min_samples is not an integer
ValueError – when sklearn’s DBSCAN throws an exception for the given dataset and set of parameters
References
[1] https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({'x': [1, 1.1, 4], 'y': [1.1, 1, 4], 'z': [1, 1.1, 4]})
>>> detector = df.cleaner.detect.outliers()
>>> print(detector.is_error())
0    False
1    False
2     True
dtype: bool
Rows with missing values are ignored and not flagged as errors:
>>> import numpy as np
>>> df = pd.DataFrame({'x': [1, 1.1, 4, np.nan], 'y': [1.1, 1, 4, 5], 'z': [1, 1.1, 4, 5]})
>>> detector = df.cleaner.detect.outliers()
>>> print(detector.is_error())
0    False
1    False
2     True
3    False
dtype: bool
Attributes Summary
eps – epsilon value, see [1]
index – Indices of the rows detected as errors
min_samples – min_samples value, see [1]
n_errors – Number of rows detected as errors
obj – The object (Series or DataFrame) containing the data to which the detection is applied
Methods Summary
detected() – Series or DataFrame containing only the detected errors
has_errors() – Returns True if any error has been detected, False otherwise
is_error() – Returns a boolean same-sized object indicating if the values are flagged as errors
not_error() – Returns a boolean same-sized object indicating if the values are NOT flagged as errors
report() – Prints a detection report
valid() – Series or DataFrame containing only the valid values
Attributes Documentation
- eps
epsilon value, see [1]
- index
Indices of the rows detected as errors
- min_samples
min_samples value, see [1]
- n_errors
Number of rows detected as errors
- name = 'outliers'
- obj
The object (Series or DataFrame) containing the data to which the detection is applied
Methods Documentation
- detected()
Series or DataFrame containing only the detected errors
- has_errors() → bool
Returns True if any error has been detected, False otherwise
- is_error() → Series
Returns a boolean same-sized object indicating if the values are flagged as errors
- not_error() → Series
Returns a boolean same-sized object indicating if the values are NOT flagged as errors
- report()
Prints a detection report
- valid()
Series or DataFrame containing only the valid values