outliers

class pdcleaner.detection.multivariate.outliers(obj, detector=None, eps=None, min_samples=2)

Bases: _QuantiDataFramesDetector

Detects outliers in a numeric DataFrame using the DBSCAN clustering algorithm.

This detection method flags outliers in N-dimensional numerical datasets. The detection is performed with the density-based clustering method DBSCAN, using the scikit-learn implementation [1].

DBSCAN is run on column-scaled values of the initial dataset. A default set of rules is used for the DBSCAN parameters: eps is set to the maximum standard deviation of the scaled columns and min_samples is set to 2. These values can be changed to suit a particular dataset.

The samples that are not part of a cluster are flagged as potential errors.

Rows with missing values are not considered and not flagged as errors.
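The procedure described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's code: pdcleaner delegates the clustering to scikit-learn's DBSCAN, whereas the hypothetical helper `flag_noise` below reproduces only the DBSCAN noise criterion (a sample is noise when it is neither a core point nor within eps of a core point). The min-max scaling and the population standard deviation are assumptions, since the text does not say which scaling is applied.

```python
import math
from statistics import pstdev


def flag_noise(rows, eps=None, min_samples=2):
    """Return one boolean per row: True where the row is DBSCAN noise."""
    # Rows with a missing value (None or NaN) are left out of the clustering.
    complete = [i for i, row in enumerate(rows)
                if not any(v is None or v != v for v in row)]
    columns = list(zip(*(rows[i] for i in complete)))

    # Assumption: each column is min-max scaled to [0, 1].
    scaled_columns = []
    for col in columns:
        low = min(col)
        span = (max(col) - low) or 1.0  # avoid dividing by zero
        scaled_columns.append([(v - low) / span for v in col])
    points = list(zip(*scaled_columns))

    # Default rule from the text: eps is the largest standard deviation
    # among the scaled columns (population standard deviation assumed).
    if eps is None:
        eps = max(pstdev(col) for col in scaled_columns)

    # A neighbourhood includes the point itself, as in scikit-learn.
    neighbours = [{j for j, q in enumerate(points) if math.dist(p, q) <= eps}
                  for p in points]
    core = {i for i, nb in enumerate(neighbours) if len(nb) >= min_samples}

    # Noise: not a core point and not within eps of any core point.
    flags = [False] * len(rows)
    for pos, i in enumerate(complete):
        if pos not in core and not (core & neighbours[pos]):
            flags[i] = True
    return flags


rows = [(1, 1.1, 1), (1.1, 1, 1.1), (4, 4, 4), (float("nan"), 5, 5)]
print(flag_noise(rows))  # [False, False, True, False]
```

With min_samples set to 2, any sample that has at least one other sample within eps is a core point, so only isolated samples end up flagged. The last row contains a missing value and is therefore never flagged.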

Parameters:

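eps (float) – The maximum Euclidean distance between two samples in the normalized dataset for one to be considered as in the neighborhood of the other. By default, it is set to the maximum standard deviation among all normalized variables.
min_samples (int, default 2) – The minimum number of samples in a neighborhood (the point itself included) for a point to count as a core point, see [1]. With the default of 2, any two samples within eps of each other form a cluster.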

Raises:

TypeError – when eps is not a number or when min_samples is not an integer
ValueError – when sklearn’s DBSCAN raises an exception for the given dataset and set of parameters

References

[1] https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

Examples

>>> import pandas as pd

>>> df = pd.DataFrame({'x': [1, 1.1, 4],
...                    'y': [1.1, 1, 4],
...                    'z': [1, 1.1, 4]})
>>> detector = df.cleaner.detect.outliers()
>>> print(detector.is_error())
    0    False
    1    False
    2     True
    dtype: bool

Rows with missing values are ignored and not flagged as errors:

>>> import numpy as np
>>> df = pd.DataFrame({'x': [1, 1.1, 4, np.nan],
...                    'y': [1.1, 1, 4, 5],
...                    'z': [1, 1.1, 4, 5]})
>>> detector = df.cleaner.detect.outliers()
>>> print(detector.is_error())
    0    False
    1    False
    2     True
    3    False
    dtype: bool

Attributes Summary

`eps`	epsilon value, see [1]
`index`	Indices of the rows detected as errors
`min_samples`	min_samples value, see [1]
`n_errors`	Number of rows detected as errors
`name`
`obj`	The object (Series or DataFrame) containing the data to which the detection is applied

Methods Summary

`detected`()	Series or DataFrame containing only the detected errors
`has_errors`()	Returns True if any error has been detected, False otherwise
`is_error`()	Return a boolean same-sized object indicating if the values are flagged as errors
`not_error`()	Return a boolean same-sized object indicating if the values are NOT flagged as errors
`report`()	Prints a detection report
`valid`()	Series or DataFrame containing only the valid values

Attributes Documentation

eps: epsilon value, see [1]

index: Indices of the rows detected as errors

min_samples: min_samples value, see [1]

n_errors: Number of rows detected as errors

name = 'outliers'

obj: The object (Series or DataFrame) containing the data to which the detection is applied

Methods Documentation

detected(): Series or DataFrame containing only the detected errors

has_errors() → bool: Returns True if any error has been detected, False otherwise

is_error() → Series: Return a boolean same-sized object indicating if the values are flagged as errors

not_error() → Series: Return a boolean same-sized object indicating if the values are NOT flagged as errors

report(): Prints a detection report

valid(): Series or DataFrame containing only the valid values
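
The attributes and methods above all derive from a single row mask. The sketch below uses plain pandas to show how they plausibly relate to one another. It is an assumption based on the descriptions above, not pdcleaner's actual implementation, and the mask is written out by hand to match the output of is_error() in the first example.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 1.1, 4], "y": [1.1, 1, 4], "z": [1, 1.1, 4]})

# Hand-written row mask, as printed by is_error() in the first example.
is_error = pd.Series([False, False, True], index=df.index)

not_error = ~is_error                          # counterpart of not_error()
detected = df[is_error]                        # flagged rows, like detected()
valid = df[not_error]                          # remaining rows, like valid()
n_errors = int(is_error.sum())                 # like the n_errors attribute
error_index = is_error[is_error].index.tolist()  # like the index attribute
has_errors = bool(is_error.any())              # like has_errors()

print(n_errors, error_index, has_errors)
```

Here detected holds the single flagged row (label 2) and valid holds the two others, so the two frames partition the original DataFrame.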