zscore#

class pdcleaner.detection.gaussian.zscore(obj, detector=None, threshold=1.96, inclusive='both', sided='both', normaltest='ignore', pvalue=0.001, transform=None)[source]#

Bases: _GaussianSeriesDetector

Detect outliers as potential errors in a Series using the Z-score method.

Intended to be used through the detect method with the keyword ‘zscore’:

>>> series.cleaner.detect.zscore(...)
>>> series.cleaner.detect('zscore',...)

This detection method flags values as errors wherever the Z-score of the corresponding Series element lies beyond a given threshold.

A Z-score is the number of standard deviations by which a value lies above or below the mean.

Z = (value - mean) / (standard deviation)

Z-scores are used to quantify the unusualness of an observation when data follow a normal distribution. The further away an observation’s Z-score is from zero, the more unusual it is.
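As a rough illustration of the formula above, the following sketch computes Z-scores with plain pandas rather than with pdcleaner itself. Using the sample standard deviation (pandas' default, ddof=1) is an assumption, although it reproduces the std of about 2.906 shown for this series in the report example further down.

```python
import pandas as pd

s = pd.Series([0, 0, 0, 0, -1, 1, -1, 1, -6, 6])

mean = s.mean()
std = s.std()  # sample standard deviation (ddof=1): an assumption, see above

# Z = (value - mean) / (standard deviation)
z = (s - mean) / std

threshold = 1.96
is_outlier = z.abs() > threshold

# Equivalent bounds expressed in the units of the original series
lower = mean - threshold * std
upper = mean + threshold * std

print(s[is_outlier].tolist())
print(lower, upper)
```

Only the two extreme values, -6 and 6, have an absolute Z-score above 1.96 in this series.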

Standard cut-off values (thresholds) for finding outliers are Z-scores of

  • +/- 1.96, beyond which only about 5% of the values of a normally distributed series are expected to fall (default here)

  • +/-3 often used in practice
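To put numbers on these cut-offs, the sketch below uses the standard library's statistics.NormalDist to compute the share of a perfectly normal population that lies beyond a given threshold. The helper name fraction_outside is hypothetical and only serves the illustration.

```python
from statistics import NormalDist


def fraction_outside(threshold):
    """Two-sided tail probability P(|Z| > threshold) for a standard normal."""
    return 2 * (1 - NormalDist().cdf(threshold))


p_196 = fraction_outside(1.96)
p_3 = fraction_outside(3)

print(f"beyond +/-1.96: {p_196:.2%}")
print(f"beyond +/-3:    {p_3:.2%}")
```

With the default threshold, roughly one value in twenty of a truly normal series is flagged even when nothing is wrong with it, which is why the stricter +/-3 cut-off (about 0.27%) is common in practice.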

Note

NA values are not treated as errors.

Warning

A normality test is performed (see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html).

If the distribution does not follow a gaussian/normal distribution:

  • This is ignored if normaltest=‘ignore’ (default)

  • A warning is raised if normaltest=‘warn’

  • An exception is raised if normaltest=‘error’

A series with no more than 8 values is considered not normal.
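The behaviour described in this warning can be sketched as follows. The function check_normality is hypothetical and is not the pdcleaner implementation; in particular, treating the series as normal when the p-value returned by scipy.stats.normaltest is at least pvalue is an assumption.

```python
import warnings
from statistics import NormalDist

from scipy import stats


def check_normality(values, pvalue=0.001, normaltest="ignore"):
    """Return True if `values` look normal, reacting as `normaltest` asks."""
    if normaltest not in ("ignore", "warn", "error"):
        raise ValueError("normaltest must be 'ignore', 'warn' or 'error'")

    if len(values) <= 8:
        # Too few values: considered not normal without running the test
        is_normal = False
    else:
        _, p = stats.normaltest(values)
        is_normal = bool(p >= pvalue)  # assumed decision rule

    if not is_normal:
        if normaltest == "warn":
            warnings.warn("series is not normally distributed", UserWarning)
        elif normaltest == "error":
            raise Exception("series is not normally distributed")
    return is_normal


# Evenly spaced normal quantiles: a sample that is normal by construction
normal_like = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]
# One huge value among identical ones: strongly skewed
skewed = [1.0] * 99 + [1000.0]

print(check_normality(normal_like))
print(check_normality(skewed))
```

With the default normaltest='ignore' the result is only recorded, so detection proceeds even on clearly non-normal data.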

Tip

The series can be “normalized” before applying the detector, using a power transformation selected with the transform parameter (for instance transform='boxcox').

The optimal parameter lambda is computed with the scipy.stats implementation and is used for both the transformation and its inverse function.

When a transformation is applied, some parameters are expressed in terms of the transformed series, and further information is made available via report().
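The idea behind this tip can be sketched directly with scipy, outside of pdcleaner. The data below are made up for the illustration, Box-Cox requires strictly positive values, and the way the Z-score is applied to the transformed values is an assumption about the general approach, not a description of the library's internals.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Made-up, strictly positive and right-skewed data
data = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 5.0, 8.0, 40.0])

# With no lmbda given, scipy estimates the lambda that maximizes the
# log-likelihood and returns it together with the transformed values
transformed, lmbda = stats.boxcox(data)

# Z-score detection carried out in the transformed space
mean = transformed.mean()
std = transformed.std(ddof=1)
threshold = 1.96
z = (transformed - mean) / std
is_outlier = np.abs(z) > threshold

# The inverse transformation maps transformed values back to original units
restored = inv_boxcox(transformed, lmbda)

print(lmbda)
print(data[is_outlier])
```

Because mean and std are computed on the transformed values, they are not directly comparable with the original series, which is why a report would list them separately from the bounds.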

Parameters:
  • threshold (float, default 1.96) – Z-score cut-off beyond which a value is flagged as an error.

  • inclusive ({“both”, “neither”, “left”, “right”}, default "both") – Include boundaries. Whether to set each bound as closed or open.

  • sided ({“both”, “left”, “right”}, default "both") – Specifies which limits should be applied. If “left”, only apply the lower limit. If “right”, only apply the upper limit. If “both”, apply both the upper and lower limits.

  • normaltest ({'ignore', 'warn', 'error'}, default 'ignore') – Whether to ignore, raise a warning, or raise an exception if the normality test fails.

  • pvalue (float, default 1e-3) – pvalue associated with the normality test
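How inclusive and sided could combine is sketched below with plain pandas. The helper flag_outside is hypothetical and not part of pdcleaner; it assumes that inclusive has the same meaning as in pandas' Series.between (which bounds belong to the valid interval) and that NA values are never flagged, as stated in the note above.

```python
import numpy as np
import pandas as pd


def flag_outside(series, lower, upper, inclusive="both", sided="both"):
    """Flag values lying outside [lower, upper] (hypothetical helper)."""
    if sided == "left":
        upper = np.inf    # only the lower limit applies
    elif sided == "right":
        lower = -np.inf   # only the upper limit applies
    elif sided != "both":
        raise ValueError("sided must be 'both', 'left' or 'right'")

    # `inclusive` decides whether values equal to a bound count as valid
    valid = series.between(lower, upper, inclusive=inclusive)
    # NA values are neither valid nor flagged as errors
    return ~valid & series.notna()


s = pd.Series([-3.0, -2.0, 0.0, 2.0, 3.0, np.nan])

print(flag_outside(s, -2, 2).tolist())
print(flag_outside(s, -2, 2, inclusive="neither").tolist())
print(flag_outside(s, -2, 2, sided="left").tolist())
```

Under these assumptions, inclusive only matters for values sitting exactly on a bound, while sided switches one of the two bounds off entirely.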

Raises:
  • TypeError – when threshold is not a number

  • TypeError – when pvalue is not a number

  • ValueError – when threshold is negative

  • ValueError – when pvalue is negative

  • ValueError – if sided or inclusive has an invalid value

  • ValueError – if normaltest is not ‘ignore’, ‘warn’ or ‘error’

  • UserWarning – if the series is not normal and normaltest = ‘warn’

  • Exception – if the series is not normal and normaltest = ‘error’

Examples

>>> s = pd.Series([0, 0, 0, 0, -1, 1, -1, 1, -6, 6])
>>> zscore_detector = s.cleaner.detect.zscore()
>>> zscore_detector.n_errors
2
>>> zscore_detector.lower, zscore_detector.upper
(-5.695627952893147, 5.695627952893147)
>>> s_test = pd.Series([1, 100])
>>> s_test.cleaner.detect(zscore_detector).is_error()
0    False
1     True
dtype: bool

Using a transformation

>>> s = pd.Series([0, 0, 0, 0, -100, 1, -1, 1, -6, 6])
>>> zscore_detector = s.cleaner.detect.zscore(transform='boxcox')
>>> zscore_detector.report()
                            Detection report
==============================================================================
Method:                       zscore      Nb samples:                       10
Date:                  March 23,2022      Nb errors:                         1
Time:                       10:22:20      Nb rows with NaN:                  0
------------------------------------------------------------------------------
lower             -47.08667824994567      upper             23.077126492888723
inclusive                       both      sided                           both
------------------------------------------------------------------------------
                zscore parameters after boxcox transformation
mean               6513.202636676919      std               2328.4345612025554
threshold                       1.96      transform                     boxcox
lmbda             2.0840437865755472
------------------------------------------------------------------------------
Series distribution is not normal/gaussian (A boxcox transformation has been
applied)
==============================================================================

If the series is tested as normal, the transformation is not useful and is therefore not applied:

>>> s = pd.Series([0, 0, 0, 0, -1, 1, -1, 1, -6, 6])
>>> zscore_detector = s.cleaner.detect.zscore(transform='boxcox')
>>> zscore_detector.report()
                            Detection report
==============================================================================
Method:                       zscore      Nb samples:                       10
Date:                  March 23,2022      Nb errors:                         2
Time:                       10:15:30      Nb rows with NaN:                  0
------------------------------------------------------------------------------
lower             -5.695627952893147      upper              5.695627952893147
inclusive                       both      sided                           both
------------------------------------------------------------------------------
                            zscore parameters
mean                             0.0      std               2.9059326290271157
threshold                       1.96
------------------------------------------------------------------------------
Series distribution has been tested as normal with p=0.001(The transformation
has not been applied)
==============================================================================

Attributes Summary

inclusive

Keyword to indicate if boundaries are included {“both”, “neither”, “left”, “right”}

index

Indices of the rows detected as errors

isnormal

Whether the series is normal/Gaussian according to the test

lmbda

The lambda that maximizes the log-likelihood function of the transformation

lower

Lower bound

mean

Mean value used to calculate Z-scores

n_errors

Number of rows detected as errors

name

normaltest

Normality test result behavior

obj

The object (Series or DataFrame) containing the data to which the detection is applied

pvalue

pvalue for normality test

sided

Keyword to indicate if detection is one-sided or two-sided {"both", "right", "left"}

std

Standard deviation used to calculate Z-scores

threshold

Threshold value used to detect outliers

transform

Distribution transformation

upper

Upper bound

Methods Summary

detected()

Series or DataFrame containing only the detected errors

has_errors()

Returns True if any error has been detected, False otherwise

is_error()

Return a boolean same-sized object indicating if the values are flagged as errors

not_error()

Return a boolean same-sized object indicating if the values are NOT flagged as errors

plot([color, errors_color, compact, limits, ...])

Plot an overview of the treated data, colored according to the validity of the values.

report()

Print a detection report

valid()

Series or DataFrame containing only the valid values

Attributes Documentation

inclusive#

Keyword to indicate if boundaries are included {“both”, “neither”, “left”, “right”}

index#

Indices of the rows detected as errors

isnormal#

Whether the series is normal/Gaussian according to the test

lmbda#

The lambda that maximizes the log-likelihood function of the transformation

lower#

Lower bound

mean#

Mean value used to calculate Z-scores

n_errors#

Number of rows detected as errors

name = 'zscore'#

normaltest#

Normality test result behavior

obj#

The object (Series or DataFrame) containing the data to which the detection is applied

pvalue#

pvalue for normality test

sided#

Keyword to indicate if detection is one-sided or two-sided {“both”, “right”, “left”}

std#

Standard deviation used to calculate Z-scores

threshold#

Threshold value used to detect outliers

transform#

Distribution transformation

upper#

Upper bound

Methods Documentation

detected()#

Series or DataFrame containing only the detected errors

has_errors() bool#

Returns True if any error has been detected, False otherwise

is_error() Series#

Return a boolean same-sized object indicating if the values are flagged as errors

not_error() Series#

Return a boolean same-sized object indicating if the values are NOT flagged as errors
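The relationship between these two masks and the detected(), valid() and has_errors() results can be pictured with plain pandas. In this sketch the mask is built by hand with bounds of 0 and 10 as a stand-in for detector.is_error(); it illustrates the expected semantics and is not the library's code.

```python
import pandas as pd

series = pd.Series([-5, 1, 2, 3, 8, 12])

# Hand-made stand-in for detector.is_error(): values outside [0, 10]
is_error = (series < 0) | (series > 10)
not_error = ~is_error             # counterpart of detector.not_error()

detected = series[is_error]       # counterpart of detector.detected()
valid = series[not_error]         # counterpart of detector.valid()

has_errors = bool(is_error.any())
n_errors = int(is_error.sum())

print(detected.tolist(), valid.tolist(), has_errors, n_errors)
```

Both masks keep the index of the original series, so they can be used for boolean indexing on the data or on any aligned object.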

plot(color='green', errors_color='red', compact=False, limits=True, figsize=None)#

Plot an overview of the treated data, colored according to the validity of the values. The figure contains:

  • a scatter plot representing the values in the treated series.

  • a histogram representing the distribution of values.

  • a kernel density estimate plot visualizing the distribution of values.

  • a boxplot showing the distribution of values.

Parameters:
  • color (palette name (Default: "green")) – Color associated to legitimate values. Should be something that can be interpreted by seaborn’s color_palette()

  • errors_color (palette name (Default: "red")) – Color associated to erroneous values. Should be something that can be interpreted by seaborn’s color_palette()

  • compact (Bool (Default: False)) – If True, compact the plots around valid values and show the number of erroneous values on the scatter plot

  • limits (Bool (Default: True)) – If True, draw horizontal lines showing the lower and upper values delimiting the allowed values

  • figsize ((float, float) (Default: None)) – width and height of the figure.

Returns:

axs – an array of length 4 containing the matplotlib axes representing the plots

Return type:

array of matplotlib.axes._subplots.AxesSubplot

Examples

>>> series = pd.Series([-5, 1, 2 , 3, 8, 12])
>>> detector = series.cleaner.detect.bounded(lower=0, upper=10)
>>> detector.plot()
[Figure: ../_static/plot_numseries.png]

report()#

Print a detection report

valid()#

Series or DataFrame containing only the valid values