modzscore

class pdcleaner.detection.gaussian.modzscore(obj, detector=None, threshold=3.5, inclusive='both', sided='both', normaltest='ignore', pvalue=0.001, transform=None)

Bases: _GaussianSeriesDetector

Detect outliers as potential errors in a Series using the modified Z-score.

Intended to be used through the detect method with the keyword ‘modzscore’:

>>> series.cleaner.detect.modzscore(...)
>>> series.cleaner.detect('modzscore',...)

This detection method flags values as errors wherever the corresponding Series element has a modified Z-score beyond a given threshold.

The modified Z-score quantifies how unusual an observation is when the data follow a normal distribution. It is defined as:

modified Z score = 0.6745 * (value - median) / (median absolute deviation)

The further away an observation’s modified Z-score is from zero, the more unusual it is.

A modified Z-score is more robust than a Z-score because it relies on the median and the median absolute deviation rather than the mean and the standard deviation, both of which are known to be influenced by outliers.

The standard cut-off values (threshold) for finding outliers are modified Z-scores of +/- 3.5 (default here).
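
As a rough illustration of the formula above, the following sketch computes modified Z-scores with plain pandas. It is not pdcleaner's implementation: the helper name `modified_zscores` is hypothetical, and the sketch only flags values, without trying to reproduce the `lower` and `upper` bounds the library reports.

```python
import pandas as pd


def modified_zscores(series: pd.Series) -> pd.Series:
    """Hypothetical helper: 0.6745 * (value - median) / MAD."""
    median = series.median()
    # Median absolute deviation: median of the absolute deviations from the median.
    # Note: a MAD of zero would make the division below ill-defined.
    mad = (series - median).abs().median()
    return 0.6745 * (series - median) / mad


s = pd.Series([0, 0, 0, 0, -1, 1, -1, 1, -6, 6])
scores = modified_zscores(s)

threshold = 3.5
# Flag values whose modified Z-score lies beyond +/- threshold.
is_outlier = scores.abs() > threshold
print(s[is_outlier].tolist())  # [-6, 6]
```

Here the median is 0 and the MAD is 1, so each score is simply 0.6745 times the value, and only -6 and 6 exceed the cut-off.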

Note

NA values are not treated as errors.

Warning

A normality test is performed [see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html].

If the series does not follow a Gaussian/normal distribution:

This is ignored if normaltest=‘ignore’ (default)
A warning is raised if normaltest=‘warn’
An exception is raised if normaltest=‘error’

If the series has 8 elements or fewer, it is considered not normal.
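
The behavior described in this warning could be sketched as follows with scipy.stats.normaltest. The helper `check_normality` is hypothetical, and two points are assumptions rather than documented facts: that the series counts as normal when the test's p-value is at least `pvalue`, and that a RuntimeError stands in for the unspecified exception type.

```python
import warnings

import numpy as np
from scipy import stats


def check_normality(values, normaltest="ignore", pvalue=1e-3):
    """Hypothetical helper: return True if `values` pass scipy's normality test."""
    if normaltest not in ("ignore", "warn", "error"):
        raise ValueError("normaltest must be 'ignore', 'warn' or 'error'")

    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # drop NA values before testing

    if len(values) <= 8:
        # Too few points: treated as not normal, as stated above.
        is_normal = False
    else:
        # Null hypothesis: the sample comes from a normal distribution.
        _, p = stats.normaltest(values)
        is_normal = bool(p >= pvalue)  # assumption: failing to reject means "normal"

    if not is_normal:
        message = "Series distribution is not normal/gaussian"
        if normaltest == "warn":
            warnings.warn(message, UserWarning)
        elif normaltest == "error":
            raise RuntimeError(message)  # assumption: exact exception type unknown
    return is_normal


# A strongly right-skewed sample fails the test.
skewed = np.exp(np.linspace(0, 10, 200))
print(check_normality(skewed))  # False
```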

Tip

The series can be “normalized” before applying the detector, using a power transformation:

Box-Cox, with a shift so that all values are strictly positive, as Box-Cox requires [see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html]
Yeo-Johnson [see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.yeojohnson.html]

Using the scipy.stats implementations, the optimal parameter lambda is calculated and used for the transformation and its inverse.

When a transformation is applied, some parameters are expressed for the transformed series, and the details are made available via report().
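
To make the two transformations concrete, here is a small sketch using the scipy functions directly. The shift used for Box-Cox is an illustrative choice, not necessarily the one pdcleaner applies, and the sample data are made up.

```python
import numpy as np
from scipy import special, stats

x = np.array([-6.0, -1.0, -1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 6.0, 40.0])

# Box-Cox needs strictly positive input, so shift the data first.
# Assumption: this particular shift is illustrative, not pdcleaner's choice.
shift = 1.0 - x.min()
bc_values, bc_lmbda = stats.boxcox(x + shift)  # lmbda fitted by maximum likelihood

# Invert the transformation to map values back to the original scale.
x_back = special.inv_boxcox(bc_values, bc_lmbda) - shift

# Yeo-Johnson accepts zero and negative values directly, so no shift is needed.
yj_values, yj_lmbda = stats.yeojohnson(x)

print(np.allclose(x_back, x))  # True
```

The inverse step matters because bounds computed on the transformed series have to be mapped back before they can be compared with the original values.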

Parameters:

threshold (float, default 3.5) – Threshold on the modified Z-score used to detect outliers
inclusive ({“both”, “neither”, “left”, “right”}, default "both") – Include boundaries. Whether to set each bound as closed or open.
sided ({“both”, “left”, “right”}, default "both") – Specifies which limits should be applied. If “left”, only apply the lower limit. If “right”, only apply the upper limit. If “both”, apply both upper and lower limits.
normaltest ({'ignore', 'warn', 'error'}, default 'ignore') – Whether to ignore, raise a warning or raise an exception if the normality test fails
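pvalue (float, default 1e-3) – p-value associated with the normality test
transform ({“boxcox”, “yeojohnson”}, default None) – Power transformation applied to the series before detection (see the Tip above)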
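
One way to picture how `inclusive` and `sided` interact is the pandas sketch below. The helper `flag_outside` is hypothetical, and it assumes that `inclusive` states which bounds still belong to the valid range; the library may interpret it differently.

```python
import numpy as np
import pandas as pd


def flag_outside(series, lower, upper, inclusive="both", sided="both"):
    """Hypothetical helper: True where a value falls outside the allowed range."""
    if sided not in ("both", "left", "right"):
        raise ValueError("invalid value for sided")
    # Disable the limit that `sided` says not to apply.
    if sided == "left":
        upper = np.inf
    elif sided == "right":
        lower = -np.inf
    # Assumption: `inclusive` says which bounds belong to the valid range.
    valid = series.between(lower, upper, inclusive=inclusive)
    # NA values are never flagged as errors.
    return ~valid & series.notna()


s = pd.Series([-6, -1, 0, 1, 6, np.nan])
print(flag_outside(s, -6, 6, inclusive="both").sum())                   # 0
print(flag_outside(s, -6, 6, inclusive="neither").sum())                # 2
print(flag_outside(s, -6, 6, inclusive="neither", sided="left").sum())  # 1
```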

Raises:

TypeError – when threshold is not a number
TypeError – when pvalue is not a number
ValueError – when threshold is negative
ValueError – when pvalue is negative
ValueError – if sided or inclusive has an invalid value
ValueError – if normaltest is not ‘ignore’, ‘warn’ or ‘error’
UserWarning – if the series is not normal and normaltest = ‘warn’
Exception – if the series is not normal and normaltest = ‘error’

Examples

>>> s = pd.Series([0, 0, 0, 0, -1, 1, -1, 1, -6, 6])
>>> modzscore_detector = s.cleaner.detect.modzscore()
>>> modzscore_detector.n_errors
2

>>> modzscore_detector.lower, modzscore_detector.upper
(-5.405405405405405, 5.405405405405405)

>>> s_test = pd.Series([1, 100])
>>> s_test.cleaner.detect(modzscore_detector).is_error()
0    False
1     True
dtype: bool
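
The output above is consistent with the bounds fitted on `s` being reused for `s_test`: recomputing from `s_test` alone would give a median of 50.5 and a MAD of 49.5, so neither value would exceed the threshold. A minimal pandas sketch of that reading, assuming the detector simply reapplies its stored `lower` and `upper`:

```python
import pandas as pd

# Bounds reported for the first series in the example above.
lower, upper = -5.405405405405405, 5.405405405405405

s_test = pd.Series([1, 100])
# Assumption: the reused detector keeps its bounds instead of recomputing them.
is_error = ~s_test.between(lower, upper, inclusive="both")
print(is_error.tolist())  # [False, True]
```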

Using a transformation

>>> s = pd.Series([0, 0, 0, 0, -100, 1, -1, 1, -6, 6])
>>> modzscore_detector = s.cleaner.detect.modzscore(transform='boxcox')
>>> modzscore_detector.report()
                            Detection report
==============================================================================
Method:                    modzscore      Nb samples:                       10
Date:                  March 23,2022      Nb errors:                         3
Time:                       10:19:07      Nb rows with NaN:                  0
------------------------------------------------------------------------------
lower             -5.571754332925835      upper              5.256895307993972
inclusive                       both      sided                           both
------------------------------------------------------------------------------
            modzscore parameters after boxcox transformation
median             7213.695043599123      mad               148.85820508550978
threshold                        3.5      transform                     boxcox
lmbda             2.0840437865755472
------------------------------------------------------------------------------
Series distribution is not normal/gaussian (A boxcox transformation has been
applied)
==============================================================================

If the series is tested as normal, the transformation is not needed and is therefore not applied:

>>> s = pd.Series([0, 0, 0, 0, -1, 1, -1, 1, -6, 6])
>>> modzscore_detector = s.cleaner.detect.modzscore(transform='yeojohnson')
>>> modzscore_detector.report()
                            Detection report
==============================================================================
Method:                    modzscore      Nb samples:                       10
Date:                  March 23,2022      Nb errors:                         2
Time:                       10:16:45      Nb rows with NaN:                  0
------------------------------------------------------------------------------
lower             -5.405405405405405      upper              5.405405405405405
inclusive                       both      sided                           both
------------------------------------------------------------------------------
                            modzscore parameters
median                           0.0      mad                              1.0
threshold                        3.5
------------------------------------------------------------------------------
Series distribution has been tested as normal with p=0.001(The transformation
has not been applied)
==============================================================================

Attributes Summary

`inclusive`	Keyword to indicate if boundaries are included {“both”, “neither”, “left”, “right”}
`index`	Indices of the rows detected as errors
`isnormal`	Whether the series is normal/Gaussian according to the test
`lmbda`	The lambda that maximizes the log-likelihood function of the transformation
`lower`	Lower bound
`mad`	Median absolute deviation of the distribution used to calculate modified Z-scores
`median`	Median value used to calculate modified Z-scores
`n_errors`	Number of rows detected as errors
`name`
`normaltest`	Normality test result behavior
`obj`	The object (Series or DataFrame) containing the data to which the detection is applied
`pvalue`	pvalue for normality test
`sided`	Keyword to indicate if detection is one side or both {"both", "right", "left"}
`threshold`	Threshold value used to detect outliers
`transform`	Distribution transformation
`upper`	Upper bound

Methods Summary

`detected`()	Series or DataFrame containing only the detected errors
`has_errors`()	Returns True if any error has been detected, False otherwise
`is_error`()	Return a boolean same-sized object indicating if the values are flagged as errors
`not_error`()	Return a boolean same-sized object indicating if the values are NOT flagged as errors
`plot`([color, errors_color, compact, limits, ...])	Plot an overview of the treated data, colored according to the validity of the values
`report`()	Prints a detection report
`valid`()	Series or DataFrame containing only the valid values
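
Judging from their one-line descriptions, these methods and attributes all derive from a single boolean error mask. The sketch below shows that relationship in plain pandas. The mask itself is made up for illustration, and the correspondence to each pdcleaner member is an assumption.

```python
import pandas as pd

s = pd.Series([0, 0, 0, 0, -1, 1, -1, 1, -6, 6])
# Hypothetical error mask; in practice it would come from the detector's is_error().
is_error = s.abs() > 5

not_error = ~is_error              # counterpart of not_error()
detected = s[is_error]             # counterpart of detected()
valid = s[not_error]               # counterpart of valid()
has_errors = bool(is_error.any())  # counterpart of has_errors()
n_errors = int(is_error.sum())     # counterpart of the n_errors attribute
index = s.index[is_error]          # counterpart of the index attribute

print(detected.tolist(), n_errors, list(index))  # [-6, 6] 2 [8, 9]
```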

Attributes Documentation

inclusive: Keyword to indicate if boundaries are included {“both”, “neither”, “left”, “right”}

index: Indices of the rows detected as errors

isnormal: Whether the series is normal/Gaussian according to the test

lmbda: The lambda that maximizes the log-likelihood function of the transformation

lower: Lower bound

mad: Median absolute deviation of the distribution used to calculate modified Z-scores

median: Median value used to calculate modified Z-scores

n_errors: Number of rows detected as errors

name = 'modzscore'

normaltest: Normality test result behavior

obj: The object (Series or DataFrame) containing the data to which the detection is applied

pvalue: p-value for the normality test

sided: Keyword to indicate if detection is one-sided or two-sided {“both”, “right”, “left”}

threshold: Threshold value used to detect outliers

transform: Distribution transformation

upper: Upper bound

Methods Documentation

detected(): Series or DataFrame containing only the detected errors

has_errors() → bool: Returns True if any error has been detected, False otherwise

is_error() → Series: Return a boolean same-sized object indicating if the values are flagged as errors

not_error() → Series: Return a boolean same-sized object indicating if the values are NOT flagged as errors

plot(color='green', errors_color='red', compact=False, limits=True, figsize=None)

Plot an overview of the treated data, colored according to the validity of the values:

a scatter plot representing the values in the treated series.
a histogram representing the distribution of values.
a kernel density estimate plot visualizing the distribution of values.
a boxplot showing the distribution of values.

Parameters:

color (palette name (Default: "green")) – Color associated with legitimate values. Should be something that can be interpreted by seaborn’s color_palette()
errors_color (palette name (Default: "red")) – Color associated with erroneous values. Should be something that can be interpreted by seaborn’s color_palette()
compact (bool (Default: False)) – If True, compact the plots around valid values and show the number of erroneous values on the scatter plot
limits (bool (Default: True)) – If True, draw horizontal lines showing the lower and upper values delimiting the allowed values
figsize ((float, float) (Default: None)) – Width and height of the figure.

Returns:

axs – an array of length 4 containing the matplotlib axes representing the plots

Return type:

array of matplotlib.axes._subplots.AxesSubplot

Examples

>>> series = pd.Series([-5, 1, 2, 3, 8, 12])
>>> detector = series.cleaner.detect.bounded(lower=0, upper=10)
>>> detector.plot()
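
For readers without pdcleaner at hand, the four panels listed above can be approximated with matplotlib and scipy. This is only a sketch of the layout: the styling, the vertical arrangement and the direct use of matplotlib instead of seaborn are assumptions, not a description of what plot() actually draws.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

series = pd.Series([-5, 1, 2, 3, 8, 12])
lower, upper = 0, 10
is_error = ~series.between(lower, upper)
values = series.to_numpy(dtype=float)

fig, axs = plt.subplots(4, 1, figsize=(6, 10))

# 1. Scatter plot of the values, colored by validity, with the limits drawn.
colors = np.where(is_error, "red", "green").tolist()
axs[0].scatter(series.index, values, c=colors)
axs[0].axhline(lower, linestyle="--")
axs[0].axhline(upper, linestyle="--")

# 2. Histogram of the values.
axs[1].hist(values, bins=5, color="green")

# 3. Kernel density estimate of the values.
grid = np.linspace(values.min(), values.max(), 200)
axs[2].plot(grid, stats.gaussian_kde(values)(grid), color="green")

# 4. Boxplot of the values.
axs[3].boxplot(values)

print(len(axs))  # 4
```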

report(): Prints a detection report

valid(): Series or DataFrame containing only the valid values