10 minutes to pandas-cleaner#
This is a short overview, for new users, of what you can do with pandas-cleaner. Much more detailed information can be found in the rest of the documentation.
Once installed, pandas-cleaner is normally imported as follows:
import pandas as pd
import pdcleaner
The objective of the package is to extend pandas capabilities by introducing methods to detect, analyze and clean potential errors in a dataset.
Let us see that in practice with a very simple series of numbers.
series = pd.Series([0, 1, 0.8, 1e3, 1.3, -1])
series
0 0.0
1 1.0
2 0.8
3 1000.0
4 1.3
5 -1.0
dtype: float64
The value 1000 is obviously an outlier! The negative value may or may not be an error, depending on the case.
With such a short and simple series, potentially problematic values are easy to spot. But with 1 million rows, this cannot be done manually. Pandas-cleaner offers functionality to do so automatically.
Detection#
Let us say we know that every value must be non-negative and the maximum
possible value is 2. In order to detect errors, we can create a
so-called detector, based on our series and using the detection method
called bounded as follows:
detector = series.cleaner.detect('bounded', lower=0., upper=2.)
Note
In pandas-cleaner, all detection methods are called through the same API,
.cleaner.detect(), applied to pandas series or dataframes.
An equivalent way of creating the detector uses the same keyword, but as a method:
detector = series.cleaner.detect.bounded(lower=0., upper=2.)
The second syntax gives access to the documentation and examples
related to the bounded method (with Shift+Tab in a notebook, for
example).
Details on the arguments and examples are accessible via help(series.cleaner.detect.bounded) or in the API reference section of this doc.
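For intuition, the check behind a bounded detector can be approximated in plain pandas. The sketch below does not use pdcleaner and is not its implementation. It assumes that both bounds are inclusive, which is consistent with 0.0 being accepted in this guide but is not confirmed for the upper bound.

```python
import pandas as pd

series = pd.Series([0, 1, 0.8, 1e3, 1.3, -1])

# Bounds from the example above; both limits are assumed to be valid values.
lower, upper = 0.0, 2.0

# A value is flagged when it lies strictly outside [lower, upper].
is_error = (series < lower) | (series > upper)

# Show only the flagged rows, with their original index labels.
print(series[is_error])
```

The boolean mask has the same index as the series, so it can be reused later to filter, count, or overwrite the flagged rows.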
Note
Pandas-cleaner offers a set of detection methods for different kinds of quantitative and/or qualitative data:
numerical series
categorical series
numerical series attached to different categories
set of categories/sub-categories
multivariate numerical dataframes
A comprehensive list, along with the associated keywords, can be found in the API reference section of this documentation.
Examples of use for the different types of data are given in the next section of this user guide.
Analysis#
Once the detector is created, a lot of information can be retrieved about the detected errors.
Are there any errors?
detector.has_errors()
True
How many?
detector.n_errors
2
Which ones?
detector.detected()
3 1000.0
5 -1.0
dtype: float64
Each row can easily be tagged as an error or not:
detector.is_error()
0 False
1 False
2 False
3 True
4 False
5 True
dtype: bool
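The same four questions can be answered from a plain boolean mask, which may help when reading the detector outputs above. This is a sketch in plain pandas rather than the pdcleaner API. The variable names are illustrative, and inclusive bounds are assumed.

```python
import pandas as pd

series = pd.Series([0, 1, 0.8, 1e3, 1.3, -1])

# Series.between is inclusive on both ends by default, so negate it to flag errors.
is_error = ~series.between(0.0, 2.0)

has_errors = bool(is_error.any())  # are there any errors?
n_errors = int(is_error.sum())     # how many? (True counts as 1)
detected = series[is_error]        # which ones?

print(has_errors, n_errors)
print(detected)
```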
One can plot the valid values, the limits, and the number of errors detected above and below the limits:
detector.plot(compact=True)
Note
Pandas-cleaner provides plotting utilities to visualize:
Outliers in numerical series,
Inconsistencies in categories/sub-categories associations,
Multiple formatting for the classes in categorical series.
Check the rest of the documentation, the user guide and the API reference for examples and details.
Cleaning#
Say we want to get rid of the detected errors to work with a clean dataset (as an input of a machine learning model or for a BI dashboard…).
There is more than one way to clean the dataset, depending on the
objective. With pandas-cleaner, different methods can be used through the
same interface function, cleaner.clean(). Let us see two examples.
One can simply get rid of the rows with errors. To do so, the drop
method is appropriate and can be used as follows:
clean_series = series.cleaner.clean('drop', detector)
clean_series
0 0.0
1 1.0
2 0.8
4 1.3
dtype: float64
or, alternatively, just as for detect:
clean_series = series.cleaner.clean.drop(detector)
clean_series
0 0.0
1 1.0
2 0.8
4 1.3
dtype: float64
The documentation is accessible via help(series.cleaner.clean.drop)
or in the API reference section of this doc.
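Note that the dropped result keeps the original index labels (0, 1, 2, 4). The plain-pandas sketch below reproduces that behaviour with a boolean mask and shows an optional renumbering step. It is an illustration, not the pdcleaner implementation, and the renumbering step is an addition that the drop method above does not perform.

```python
import pandas as pd

series = pd.Series([0, 1, 0.8, 1e3, 1.3, -1])
is_error = (series < 0.0) | (series > 2.0)

# Keep only the valid rows; index labels of the surviving rows are unchanged.
clean_series = series[~is_error]

# Optional: renumber rows if downstream code expects a contiguous index.
renumbered = clean_series.reset_index(drop=True)

print(clean_series)
print(renumbered)
```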
Consider now that you want to “clip” the errors, meaning all negative values should be set to zero and all values above the maximum capped at the maximum value. This is simply done with the clip method, which uses the lower and upper values stored as attributes of the detector.
print(f"min : {detector.lower}, max: {detector.upper}")
min : 0.0, max: 2.0
series.cleaner.clean('clip', detector)
0 0.0
1 1.0
2 0.8
3 2.0
4 1.3
5 0.0
dtype: float64
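For comparison, pandas itself offers Series.clip, which yields the same values here when given the same bounds. The sketch below hard-codes the bounds instead of reading them from a detector. Whether pdcleaner delegates to this method internally is not stated in this guide.

```python
import pandas as pd

series = pd.Series([0, 1, 0.8, 1e3, 1.3, -1])
lower, upper = 0.0, 2.0

# Values below `lower` become `lower`, values above `upper` become `upper`.
clipped = series.clip(lower=lower, upper=upper)

print(clipped)
```

Unlike dropping, clipping keeps every row, which matters when the series must stay aligned with other columns.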
Note
The cleaning methods have an inplace option to overwrite the original data.
series.cleaner.clean('clip', detector, inplace=True)
series
0 0.0
1 1.0
2 0.8
3 2.0
4 1.3
5 0.0
dtype: float64
Note
Other cleaning methods are available:
replace, to replace problematic cells by a value, or using a dict or a callable function
to_na, to empty problematic cells and then use any missing-value imputation method
some other methods are specific to categorical detection methods, such as bykeys, which is used along with keycollision detectors to identify typos or alternative formulations, e.g. when Linus Torvald and torvald, linus should be the same.
See details along with examples in the user guide or the API reference.
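The "empty, then impute" workflow described for to_na can be sketched in plain pandas as follows. This does not call pdcleaner. The choice of median imputation is an assumption made for illustration, and any other imputation method could be substituted.

```python
import pandas as pd

series = pd.Series([0, 1, 0.8, 1e3, 1.3, -1])
is_error = (series < 0.0) | (series > 2.0)

# Step 1: blank out flagged cells. Series.mask puts NaN where the condition is True.
with_na = series.mask(is_error)

# Step 2: impute. The median is computed on the remaining valid values only,
# because pandas skips NaN by default.
imputed = with_na.fillna(with_na.median())

print(with_na)
print(imputed)
```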
Reapply to fresh data#
Say you have determined the min and max valid values on a set of 1M rows. If the dataset is updated, you may not want or need to recalculate the detector parameters, but simply apply them to clean a smaller set of new rows.
This can, for example, be the case if the detector is part of an ETL cleaning pipeline followed by the fitting of a machine learning model. In this case, the detector has to be “fitted” on the train set, and then applied as a transformation step on the test set or, later, on new data during the inference/prediction step.
To do so, the pandas-cleaner detect API method can be called with an
already “fitted” detector. For example, if we want to apply our bounded
detector to a new series:
series2 = pd.Series([-1, 1, 235])
this can be done as follows:
detector2 = series2.cleaner.detect(detector)
This detects the following problematic rows:
detector2.detected()
0 -1
2 235
dtype: int64
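The fit-once, apply-many pattern can be illustrated with a small stand-in object that only stores the bounds. The class BoundedSketch below is hypothetical and is not part of pdcleaner. It is a sketch of the idea under the assumption of inclusive bounds, using plain pandas.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class BoundedSketch:
    """Hypothetical stand-in for a fitted bounded detector (not a pdcleaner class)."""

    lower: float
    upper: float

    def is_error(self, series: pd.Series) -> pd.Series:
        # Flag values strictly outside [lower, upper].
        return (series < self.lower) | (series > self.upper)


# "Fit" once: here the bounds are simply given, as in the example above.
fitted = BoundedSketch(lower=0.0, upper=2.0)

# Reuse the stored parameters on fresh data without recomputing anything.
series2 = pd.Series([-1, 1, 235])
flagged = series2[fitted.is_error(series2)]

print(flagged)
```

Freezing the dataclass keeps the stored bounds from being changed by accident between the training and inference steps.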
Going further#
The following sections of this user guide give more detailed examples of using pandas-cleaner with different kinds of data.