modzscore#
- class pdcleaner.detection.gaussian.modzscore(obj, detector=None, threshold=3.5, inclusive='both', sided='both', normaltest='ignore', pvalue=0.001, transform=None)[source]#
Bases: _GaussianSeriesDetector
Detect outliers as potential errors in a Series using the modified Z-score.
Intended to be used by the detect method with the keyword ‘modzscore’:
>>> series.cleaner.detect.modzscore(...)
>>> series.cleaner.detect('modzscore',...)
This detection method flags values as errors wherever the corresponding Series element has a modified Z-score whose magnitude exceeds a given threshold.
The modified Z-score quantifies how unusual an observation is when the data follow a normal distribution. It is defined as:
modified Z score = 0.6745 * (value - median) / (median absolute deviation)
The further away an observation’s modified Z-score is from zero, the more unusual it is.
A modified Z-score is more robust than a Z-score because it uses the median as opposed to the mean, which is known to be influenced by outliers.
The standard cut-off values (threshold) for finding outliers are modified Z-scores of +/- 3.5 (default here).
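As a rough illustration of the formula above, the following sketch computes modified Z-scores using only the Python standard library. It is not pdcleaner's implementation: the helper name `modified_zscores` and the zero-MAD guard are assumptions made for this example, and the bounds reported by the library need not coincide with values derived from this sketch.

```python
from statistics import median


def modified_zscores(values):
    """Return 0.6745 * (value - median) / MAD for each value (hypothetical helper)."""
    values = list(values)
    med = median(values)
    # Median absolute deviation: median of the absolute deviations from the median.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Assumption for this sketch: refuse to divide by a zero MAD.
        raise ValueError("MAD is zero; modified Z-scores are undefined")
    return [0.6745 * (v - med) / mad for v in values]


values = [0, 0, 0, 0, -1, 1, -1, 1, -6, 6]
scores = modified_zscores(values)

# Flag values whose score magnitude exceeds the usual 3.5 cut-off.
flagged = [v for v, z in zip(values, scores) if abs(z) > 3.5]
print(flagged)
```

Here the median is 0 and the MAD is 1, so only the two extreme values have a score magnitude above 3.5.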
Note
NA values are not treated as errors.
Warning
A normality test is performed [see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html].
If the distribution does not follow a Gaussian/normal distribution:
This is ignored if normaltest=‘ignore’ (default)
A warning is raised if normaltest=‘warn’
An exception is raised if normaltest=‘error’
If the series length is no more than 8, the series is considered not normal.
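The three normaltest modes described above can be sketched as a small dispatch function. Everything here is hypothetical and only mirrors the documented behavior: `looks_normal`, `handle_normality` and `NotNormalError` are names invented for this example, and the comparison direction against `pvalue` is an assumption.

```python
import warnings


class NotNormalError(Exception):
    """Hypothetical exception type used only in this sketch."""


def looks_normal(n_samples, test_pvalue, pvalue=0.001):
    """Decide whether a series is treated as normal (illustrative only)."""
    # Series of length 8 or less are treated as not normal, as stated above.
    if n_samples <= 8:
        return False
    # Assumption: normality is rejected when the test p-value is below `pvalue`.
    return test_pvalue >= pvalue


def handle_normality(is_normal, normaltest="ignore"):
    """React to the normality test outcome according to the chosen mode."""
    if normaltest not in ("ignore", "warn", "error"):
        raise ValueError("normaltest must be 'ignore', 'warn' or 'error'")
    if is_normal or normaltest == "ignore":
        return
    message = "Series distribution is not normal/gaussian"
    if normaltest == "warn":
        warnings.warn(message, UserWarning)
    else:
        raise NotNormalError(message)
```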
Tip
The series can be “normalized” before applying the detector, using a power transformation:
Box-Cox, with a shift to handle non-positive values [see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html]
Yeo-Johnson [see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.yeojohnson.html]
Using the scipy.stats implementations, the optimal parameter lambda is calculated and used for the transformation and its inverse.
When a transformation is applied, some parameters are expressed for the transformed series, and the corresponding information is made available via report().
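To make the idea of a shifted Box-Cox transformation and its inverse concrete, here is a minimal standard-library sketch. It is not how pdcleaner works internally: the shift rule (`1 - min(values)`) and the fixed `lmbda = 0.5` are assumptions chosen for illustration, whereas the text above says the library obtains the optimal lambda from scipy.stats.

```python
import math


def boxcox(x, lmbda):
    """Box-Cox transform of a single strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    return math.log(x) if lmbda == 0 else (x ** lmbda - 1) / lmbda


def inv_boxcox(y, lmbda):
    """Inverse of the Box-Cox transform for the same lambda."""
    return math.exp(y) if lmbda == 0 else (lmbda * y + 1) ** (1 / lmbda)


values = [0, 0, 0, 0, -100, 1, -1, 1, -6, 6]

# Hypothetical shift choice: move the smallest value to 1 so every value is positive.
shift = 1 - min(values)
# Fixed lambda for illustration only; it is not estimated from the data here.
lmbda = 0.5

transformed = [boxcox(v + shift, lmbda) for v in values]
# Undo the transformation, then undo the shift, to return to the original scale.
restored = [inv_boxcox(t, lmbda) - shift for t in transformed]
```

Detection bounds computed on the transformed values can be mapped back to the original scale with the same inverse and shift.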
- Parameters:
threshold (float, default 3.5) – Threshold value used to detect outliers: the cut-off applied to the modified Z-score
inclusive ({“both”, “neither”, “left”, “right”}, default "both") – Include boundaries. Whether to set each bound as closed or open.
sided ({“both”, “left”, “right”}, default "both") – Specifies which limits should be applied. If “left”, only apply lower limit If “right”, only apply upper limit If “both”, apply both upper and lower limits
normaltest ({'ignore', 'warn', 'error'}, default 'ignore') – whether to ignore, raise a warning or raise an exception if the normality test fails
pvalue (float, default 1e-3) – pvalue associated with the normality test
- Raises:
TypeError – when threshold is not a number
TypeError – when pvalue is not a number
ValueError – when threshold is negative
ValueError – when pvalue is negative
ValueError – if sided or inclusive has an invalid value
ValueError – if normaltest is not ‘ignore’, ‘warn’ or ‘error’
UserWarning – if the series is not normal and normaltest = ‘warn’
Exception – if the series is not normal and normaltest = ‘error’
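The interaction between inclusive and sided can be sketched as follows. This is only an interpretation of the parameter descriptions above: the function `is_error` is a stand-alone helper written for this example (not the detector's method), and it assumes that a closed bound means a value exactly equal to that bound is still valid.

```python
def is_error(value, lower, upper, inclusive="both", sided="both"):
    """Return True if `value` falls outside the allowed interval (illustrative)."""
    if inclusive not in ("both", "neither", "left", "right"):
        raise ValueError("invalid value for inclusive")
    if sided not in ("both", "left", "right"):
        raise ValueError("invalid value for sided")

    # Assumption: a closed bound keeps the boundary value itself valid.
    lower_closed = inclusive in ("both", "left")
    upper_closed = inclusive in ("both", "right")

    below = value < lower if lower_closed else value <= lower
    above = value > upper if upper_closed else value >= upper

    if sided == "left":    # only the lower limit applies
        return below
    if sided == "right":   # only the upper limit applies
        return above
    return below or above
```

For instance, under these assumptions a value equal to the upper bound is valid with inclusive='both' but flagged with inclusive='neither'.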
Examples
>>> s = pd.Series([0, 0, 0, 0, -1, 1, -1, 1, -6, 6])
>>> modzscore_detector = s.cleaner.detect.modzscore()
>>> modzscore_detector.n_errors
2
>>> modzscore_detector.lower, modzscore_detector.upper
(-5.405405405405405, 5.405405405405405)
>>> s_test = pd.Series([1, 100])
>>> s_test.cleaner.detect(modzscore_detector).is_error()
0    False
1     True
dtype: bool
Using a transformation
>>> s = pd.Series([0, 0, 0, 0, -100, 1, -1, 1, -6, 6])
>>> modzscore_detector = s.cleaner.detect.modzscore(transform='boxcox')
>>> modzscore_detector.report()
Detection report
==============================================================================
Method:     modzscore                   Nb samples:          10
Date:       March 23,2022               Nb errors:           3
Time:       10:19:07                    Nb rows with NaN:    0
------------------------------------------------------------------------------
lower                                   -5.571754332925835
upper                                   5.256895307993972
inclusive                               both
sided                                   both
------------------------------------------------------------------------------
modzscore parameters after boxcox transformation
median                                  7213.695043599123
mad                                     148.85820508550978
threshold                               3.5
transform                               boxcox
lmbda                                   2.0840437865755472
------------------------------------------------------------------------------
Series distribution is not normal/gaussian
(A boxcox transformation has been applied)
==============================================================================
If the series tests as normal, the transformation is unnecessary and is therefore not applied:
>>> s = pd.Series([0, 0, 0, 0, -1, 1, -1, 1, -6, 6])
>>> modzscore_detector = s.cleaner.detect.modzscore(transform='yeojohnson')
>>> modzscore_detector.report()
Detection report
==============================================================================
Method:     modzscore                   Nb samples:          10
Date:       March 23,2022               Nb errors:           2
Time:       10:16:45                    Nb rows with NaN:    0
------------------------------------------------------------------------------
lower                                   -5.405405405405405
upper                                   5.405405405405405
inclusive                               both
sided                                   both
------------------------------------------------------------------------------
modzscore parameters
median                                  0.0
mad                                     1.0
threshold                               3.5
------------------------------------------------------------------------------
Series distribution has been tested as normal with p=0.001
(The transformation has not been applied)
==============================================================================
Attributes Summary
inclusive – Keyword to indicate if boundaries are included {“both”, “neither”, “left”, “right”}
index – Indices of the rows detected as errors
isnormal – Whether the series is normal/Gaussian according to the test
lmbda – The lambda that maximizes the log-likelihood function of the transformation
lower – Lower bound
mad – Median absolute deviation of the distribution used to calculate modified Z-scores
median – Median value used to calculate modified Z-scores
n_errors – Number of rows detected as errors
normaltest – Normality test result behavior
obj – The object (Series or DataFrame) containing the data to which the detection is applied
pvalue – pvalue for normality test
sided – Keyword to indicate if detection is one side or both {“both”, “right”, “left”}
threshold – Threshold value used to detect outliers
transform – Distribution transformation
upper – Upper bound
Methods Summary
detected() – Series or DataFrame containing only the detected errors
has_errors() – Returns True if any error has been detected, False otherwise
is_error() – Return a boolean same-sized object indicating if the values are flagged as errors
not_error() – Return a boolean same-sized object indicating if the values are NOT flagged as errors
plot([color, errors_color, compact, limits, ...]) – Plot a visualization giving an overview of the treated data, colored according to the validity of the values
report() – Prints a detection report
valid() – Series or DataFrame containing only the valid values
Attributes Documentation
- inclusive#
Keyword to indicate if boundaries are included {“both”, “neither”, “left”, “right”}
- index#
Indices of the rows detected as errors
- isnormal#
Whether the series is normal/Gaussian according to the test
- lmbda#
The lambda that maximizes the log-likelihood function of the transformation
- lower#
Lower bound
- mad#
Median absolute deviation of the distribution used to calculate modified Z-scores
- median#
Median value used to calculate modified Z-scores
- n_errors#
Number of rows detected as errors
- name = 'modzscore'#
- normaltest#
Normality test result behavior
- obj#
The object (Series or DataFrame) containing the data to which the detection is applied
- pvalue#
pvalue for normality test
- sided#
Keyword to indicate if detection is one side or both {“both”, “right”, “left”}
- threshold#
Threshold value used to detect outliers
- transform#
Distribution transformation
- upper#
Upper bound
Methods Documentation
- detected()#
Series or DataFrame containing only the detected errors
- has_errors() bool#
Returns True if any error has been detected, False otherwise
- is_error() Series#
Return a boolean same-sized object indicating if the values are flagged as errors
- not_error() Series#
Return a boolean same-sized object indicating if the values are NOT flagged as errors
- plot(color='green', errors_color='red', compact=False, limits=True, figsize=None)#
Plot a visualization giving an overview of the treated data, colored according to the validity of the values. It contains:
a scatter plot representing the values in the treated series.
a histogram representing the distribution of values.
a kernel density estimate plot visualizing the distribution of values.
a boxplot showing the distribution of values.
- Parameters:
color (palette name (Default: "green")) – Color associated to legitimate values. Should be something that can be interpreted by seaborn’s color_palette()
errors_color (palette name (Default: "red")) – Color associated to erroneous values. Should be something that can be interpreted by seaborn’s color_palette()
compact (Bool (Default: False)) – If True, compact the plots around valid values and show the number of erroneous values on the scatter plot
limits (Bool (Default: True)) – If True, draw horizontal lines showing the lower and upper values delimiting the allowed values
figsize ((float, float) (Default: None)) – width and height of the figure.
- Returns:
axs – an array of length 4 containing the matplotlib axes representing the plots
- Return type:
array of matplotlib.axes._subplots.AxesSubplot
Examples
>>> series = pd.Series([-5, 1, 2, 3, 8, 12])
>>> detector = series.cleaner.detect.bounded(lower=0, upper=10)
>>> detector.plot()
- report()#
prints a detection report
- valid()#
Series or DataFrame containing only the valid values