10 minutes to pandas-cleaner#

This is a short overview, for new users, of what you can do with pandas-cleaner. Much more detailed information can be found in the rest of the documentation.

Once installed, pandas-cleaner is normally imported as follows:

import pandas as pd
import pdcleaner

The objective of the package is to extend pandas capabilities by introducing methods to detect, analyze and clean potential errors in a dataset.

Let us see that in practice with a very simple series of numbers.

series = pd.Series([0, 1, 0.8, 1e3, 1.3, -1])
series
0       0.0
1       1.0
2       0.8
3    1000.0
4       1.3
5      -1.0
dtype: float64

The value 1000 is obviously an outlier! The negative value may or may not be an error, depending on the case.

With such a simple and short series, it is very easy to spot potentially problematic values. But imagine you have to deal with 1 million rows: this cannot be done manually. Pandas-cleaner offers functionalities to do so automatically.

Detection#

Let us say we know that every value must be non-negative and that the maximum possible value is 2. In order to detect errors, we can create a so-called detector, based on our series and using the detection method called bounded, as follows:

detector = series.cleaner.detect('bounded', lower=0., upper=2.)
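For intuition, the check behind such a bounded rule can be approximated in plain pandas. This is only a sketch, not pandas-cleaner's implementation: it assumes that a value is flagged when it lies strictly outside the interval [lower, upper], which is consistent with 0 being accepted in the example of this page.

```python
import pandas as pd

series = pd.Series([0, 1, 0.8, 1e3, 1.3, -1])

# Assumption: the limits themselves are valid values (strict comparisons).
lower, upper = 0.0, 2.0
is_out_of_bounds = (series < lower) | (series > upper)

# Boolean indexing keeps only the flagged rows, with their original index labels.
flagged = series[is_out_of_bounds]
print(flagged)
```

A detector object adds value on top of this one-liner by keeping the limits and the results together, so that they can be analyzed and reused later.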

Note

In pandas-cleaner, all detection methods are called with the same API, .cleaner.detect(), applied to pandas series or dataframes.

An equivalent way of creating the detector is as follows, with the same keyword but as a method.

detector = series.cleaner.detect.bounded(lower=0., upper=2.)

The second syntax gives access to the documentation and examples related to the bounded method (with Shift+Tab in a notebook, for example).

Details on the arguments and examples are accessible via help(series.cleaner.detect.bounded) or in the API reference section of this doc.
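To see how a single entry point can support both the keyword form and the method form, here is a minimal sketch built on pandas' public accessor-registration API. The names demo_cleaner, DemoCleanerAccessor and _Detect are hypothetical and chosen for illustration; whether pandas-cleaner is organized this way internally is an assumption.

```python
import pandas as pd


class _Detect:
    """Callable namespace: detect('bounded', ...) forwards to detect.bounded(...)."""

    def __init__(self, series: pd.Series):
        self._series = series

    def __call__(self, method: str, **kwargs):
        # Refuse private names so that only detection methods can be dispatched.
        if method.startswith("_") or not hasattr(self, method):
            raise ValueError(f"Unknown detection method: {method!r}")
        return getattr(self, method)(**kwargs)

    def bounded(self, lower: float, upper: float) -> pd.Series:
        """Return the values lying strictly outside [lower, upper]."""
        s = self._series
        return s[(s < lower) | (s > upper)]


# Hypothetical accessor name, to avoid any clash with the real .cleaner namespace.
@pd.api.extensions.register_series_accessor("demo_cleaner")
class DemoCleanerAccessor:
    def __init__(self, series: pd.Series):
        self.detect = _Detect(series)


series = pd.Series([0, 1, 0.8, 1e3, 1.3, -1])
by_keyword = series.demo_cleaner.detect("bounded", lower=0.0, upper=2.0)
by_method = series.demo_cleaner.detect.bounded(lower=0.0, upper=2.0)
```

With this layout, the method form exposes an ordinary Python method, so help(series.demo_cleaner.detect.bounded) displays its docstring, while the keyword form remains convenient when the method name is stored in a configuration.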

Note

Pandas-cleaner offers a set of detection methods for different kinds of quantitative and/or qualitative data:

  • numerical series

  • categorical series

  • numerical series attached to different categories

  • set of categories/sub-categories

  • multivariate numerical dataframes

A comprehensive list, along with the associated keywords, can be found in the API reference section of this documentation.

Examples of use for the different types of data are given in the next section of this user guide.

Analysis#

Once the detector is created, a lot of information can be retrieved about the detected errors.

  • Are there any errors?

detector.has_errors()
True
  • How many?

detector.n_errors
2
  • Which are they?

detector.detected()
3    1000.0
5      -1.0
dtype: float64
  • Each row can easily be tagged as an error or not:

detector.is_error()
0    False
1    False
2    False
3     True
4    False
5     True
dtype: bool
  • One can plot the valid values, the limits and the number of errors detected above and below the limits:

detector.plot(compact=True)
[Figure: detector plot (detector_plot_10min.png)]
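The quantities listed above (presence of errors, their number, a per-row tag, and the counts above and below the limits) can be cross-checked by hand. The following plain-pandas sketch does not use pandas-cleaner; the variable names and the report layout are illustrative assumptions.

```python
import pandas as pd

series = pd.Series([0, 1, 0.8, 1e3, 1.3, -1])
lower, upper = 0.0, 2.0

below = series < lower   # rows under the lower limit
above = series > upper   # rows over the upper limit

# One row per value, tagged in the same spirit as is_error().
report = pd.DataFrame({"value": series, "below": below, "above": above})
report["is_error"] = report["below"] | report["above"]

has_errors = bool(report["is_error"].any())
n_errors = int(report["is_error"].sum())
counts = {"below": int(below.sum()), "above": int(above.sum())}
print(report)
print(has_errors, n_errors, counts)
```

Such a manual check is mainly useful to validate the limits on a small sample before relying on the detector for a large dataset.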

Note

Pandas-cleaner provides plotting utilities to visualize:

  • Outliers in numerical series,

  • Inconsistencies in categories/sub-categories associations,

  • Multiple formatting for the classes in categorical series.

Check the rest of the documentation, the user guide and the API reference for examples and details.

Cleaning#

Say we want to get rid of the detected errors to work with a clean dataset (as an input of a machine learning model or for a BI dashboard…).

There is more than one way to clean the dataset, depending on the objective. With pandas-cleaner, different methods can be used through the same interface function, cleaner.clean(). Let us see two examples.

One can simply get rid of the lines with errors. To do so, the drop method is appropriate and can be used as follows:

clean_series = series.cleaner.clean('drop', detector)
clean_series
0    0.0
1    1.0
2    0.8
4    1.3
dtype: float64

or, alternatively, just as for detect:

clean_series = series.cleaner.clean.drop(detector)
clean_series
0    0.0
1    1.0
2    0.8
4    1.3
dtype: float64

The documentation is accessible via help(series.cleaner.clean.drop) or in the API reference section of this doc.
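As a point of comparison, dropping the flagged rows amounts to boolean indexing in plain pandas. This sketch is not the drop method itself; it only illustrates that the surviving rows keep their original index labels, as in the output shown above, and how to renumber them if needed.

```python
import pandas as pd

series = pd.Series([0, 1, 0.8, 1e3, 1.3, -1])
is_error = (series < 0.0) | (series > 2.0)

# Keep the rows that are not flagged; index labels 3 and 5 disappear.
kept = series[~is_error]

# Optional: renumber the rows from 0 when the original labels are not needed.
kept_renumbered = kept.reset_index(drop=True)
print(kept)
print(kept_renumbered)
```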

Consider now that you want to “clip” the errors, meaning that all negative values should be set to zero and all values above the maximum capped at the maximum value. This is simply done with the clip method, which uses the lower and upper values stored as attributes of the detector.

print(f"min : {detector.lower}, max: {detector.upper}")
min : 0.0, max: 2.0
series.cleaner.clean('clip', detector)
0    0.0
1    1.0
2    0.8
3    2.0
4    1.3
5    0.0
dtype: float64
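For reference, pandas itself provides Series.clip, which performs the same kind of capping when the limits are known. The sketch below uses it directly with the limits of this example; it is shown for comparison and is not a description of how the clip cleaning method is implemented.

```python
import pandas as pd

series = pd.Series([0, 1, 0.8, 1e3, 1.3, -1])
lower, upper = 0.0, 2.0

# Values below `lower` become `lower`, values above `upper` become `upper`.
clipped = series.clip(lower=lower, upper=upper)
print(clipped)
```

Series.clip returns a new series and leaves the original untouched.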

Note

The cleaning methods have an inplace option to overwrite the original data.

series.cleaner.clean('clip', detector, inplace=True)
series
0    0.0
1    1.0
2    0.8
3    2.0
4    1.3
5    0.0
dtype: float64

Note

Other cleaning methods are available:

  • replace to replace problematic cells by a value, or using a dict or a callable function

  • to_na to empty problematic cells and then use any missing-value imputation method

  • some other methods are specific to categorical detection methods, such as bykeys, which is used along with keycollision detectors to identify typos or alternative formulations, e.g. when Linus Torvald and torvald, linus should be treated as the same.

See details along with examples in the user guide or the API reference.
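Two of the ideas listed above can be illustrated without pandas-cleaner. The sketch below first blanks the flagged cells and imputes them with the median of the valid values (one possible imputation choice among many), then computes a "fingerprint" key, a generic key-collision technique in which strings sharing the same key are candidates for merging. The fingerprint function and the sample names other than the two quoted above are assumptions made for illustration, not the library's algorithm.

```python
import re

import pandas as pd

# Part 1: blank the flagged cells, then impute (plain pandas).
series = pd.Series([0, 1, 0.8, 1e3, 1.3, -1])
is_error = (series < 0.0) | (series > 2.0)

with_na = series.mask(is_error)               # flagged cells become NaN
imputed = with_na.fillna(with_na.median())    # median of the remaining valid values


# Part 2: key collision with a fingerprint key (generic technique).
def fingerprint(text: str) -> str:
    """Lowercase, replace punctuation by spaces, then sort the unique tokens."""
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return " ".join(sorted(set(tokens)))


names = pd.Series(["Linus Torvald", "torvald, linus", "Ada Lovelace"])
keys = names.map(fingerprint)

# Names sharing a key are grouped together as probable variants of one entity.
groups = names.groupby(keys).apply(list)
print(imputed)
print(groups)
```

The fingerprint deliberately ignores case, punctuation and word order, so it catches reordered or differently punctuated variants but not spelling mistakes inside a word.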

Reapply to fresh data#

Say you have determined the min and max valid values on a set of 1M rows. If the dataset is updated, you may not want/need to recalculate the detector parameters, but simply apply them to clean a smaller set of new rows.

This can for example be the case if the detector is part of an ETL cleaning pipeline followed by the fitting of a machine learning model. In this case, the detector has to be “fitted” on the training set, and then applied as a transformation step on the test set or, later, to new data during the inference/prediction step.

To do so, the pandas-cleaner detect API method can be called with an already “fitted” detector. For example, if we want to apply our bounded detector to a new series:

series2 = pd.Series([-1, 1, 235])

this can be done as follows:

detector2 = series2.cleaner.detect(detector)

This detects the following problematic lines:

detector2.detected()
0     -1
2    235
dtype: int64
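The "fit once, apply many times" idea can be sketched in plain Python as a small object that stores the limits and is then reused on any new series. The BoundedRule class and its fit and is_error methods are hypothetical names introduced for this illustration, and taking the min and max of a trusted sample as limits is an assumption about how the limits were determined.

```python
import pandas as pd


class BoundedRule:
    """Hypothetical stand-in for a fitted detector: it only stores the limits."""

    def __init__(self, lower: float, upper: float):
        self.lower = lower
        self.upper = upper

    @classmethod
    def fit(cls, trusted: pd.Series) -> "BoundedRule":
        # Assumption: the trusted sample contains valid values only.
        return cls(lower=float(trusted.min()), upper=float(trusted.max()))

    def is_error(self, series: pd.Series) -> pd.Series:
        # The stored limits are reused as-is; nothing is recomputed from `series`.
        return (series < self.lower) | (series > self.upper)


train = pd.Series([0, 1, 0.8, 1.3])      # e.g. the training set
rule = BoundedRule.fit(train)            # limits determined once

series2 = pd.Series([-1, 1, 235])        # fresh data arriving later
flagged2 = series2[rule.is_error(series2)]
print(rule.lower, rule.upper)
print(flagged2)
```

Keeping the limits fixed between the training and prediction steps avoids leaking information from new data into the cleaning rule.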

Going further#

The following sections of this user guide give more detailed examples of using pandas-cleaner with different kinds of data.