10 minutes to pandas-cleaner
============================

This is a short overview, for new users, of what you can do with pandas cleaner.
Much more detailed information can be found in the rest of the
documentation.

Once installed, pandas-cleaner is normally imported as follows:

.. code:: ipython3

    import pandas as pd
    import pdcleaner

The objective of the package is to extend pandas capabilities by
introducing methods to detect, analyze and clean potential errors in a dataset.

Let us see that in practice with a very simple series of numbers.

.. code:: ipython3

    series = pd.Series([0, 1, 0.8, 1e3, 1.3, -1])
    series


.. parsed-literal::

    0       0.0
    1       1.0
    2       0.8
    3    1000.0
    4       1.3
    5      -1.0
    dtype: float64


The value 1000 is obviously an outlier ! The negative value may be an
error or not depending the case.

With such simple and short series, it is very easy to spot potentially
problematic values. But imagine you have to deal with 1 millions rows,
this can not be done manually. Pandas clear offers functionnalities to do so
automatically.

Detection
---------

Let us say we know that every value must be positive and the maximum
possible value is 2. In order to detect errors, we can create a
so-called detector, based on our series and using the detection method
called ``bounded`` as follows:

.. code:: ipython3

    detector = series.cleaner.detect('bounded', lower=0., upper=2.)

.. note::

    In pandas cleaner, all detection method are called with the same API
    ``.cleaner.detect()`` applied to pandas series or dataframes.


An equivalent way of creating the detector is as follows, with the same keyword but as a method.

.. code:: ipython3

    detector = series.cleaner.detect.bounded(lower=0., upper=2.)

The second syntax allows to access the documentation and examples
related to the ``bounded`` method (with ``Shift+Tab`` in a notebook for
example).

Details on the arguments and examples are accessible via `help(series.cleaner.detect.bounded)` or in the API reference section of this doc.

.. note::
    Pandas cleaner offers a set of detection methods for different kind of
    quantitative and/or qualitative data:

    -  numerical series

    -  categorical series

    -  numerical series attached to different categories

    -  set of categories/sub-categories

    -  multivariate numerical dataframes

    A comprehensive list, along with the associated keywords, can be found
    in the API reference section of this documentation. 
    
    Examples of use for     the different types of data are given the next section of this user guide.

Analysis
--------

Once the detector created, many information can be retrevied about the
detected errors.

-  Are the any errors ?

.. code:: ipython3

    detector.has_errors()


.. parsed-literal::

    True


-  How many ?

.. code:: ipython3

    detector.n_errors


.. parsed-literal::

    2


-  Which are they ?

.. code:: ipython3

    detector.detected()


.. parsed-literal::

    3    1000.0
    5      -1.0
    dtype: float64


-  Each row can easily be tagged as an error or not:

.. code:: ipython3

    detector.is_error()


.. parsed-literal::

    0    False
    1    False
    2    False
    3     True
    4    False
    5     True
    dtype: bool

- One can plot valid values, limits and number of errors detected above and below the limits

.. code:: ipython3

    detector.plot(compact=True)

.. image:: detector_plot_10min.png

.. note::
    Pandas-cleaner provides plotting utilities to visualize:

    -  Outliers in numerical series,

    -  Inconsistencies in categories/sub-categories associations,

    -  Multiple formatting for the classes in categorical series.

    Check the rest of the documentation, the user guide and the API
    reference for examples and details.

Cleaning
--------

Say we want to get rid of the detected errors to work with a clean dataset
(as an input of a machine learning model or for a BI dashboard…).

There are more than one way to clean the dataset, depending of the
objective. With pandas-cleaner, different methods can be used with the
same interface function ``cleaner.clean()``. Let us see two examples.

One can simply get rid of the lines with errors. To do so, the ``drop``
method is appropriate and  usable as follows :

.. code:: ipython3

    clean_series = series.cleaner.clean('drop', detector)
    clean_series


.. parsed-literal::

    0    0.0
    1    1.0
    2    0.8
    4    1.3
    dtype: float64


or, alternatively, just as for ``detect`` :

.. code:: ipython3

    clean_series = series.cleaner.clean.drop(detector)
    clean_series


.. parsed-literal::

    0    0.0
    1    1.0
    2    0.8
    4    1.3
    dtype: float64


The documentation is accessible via ``help(series.cleaner.clean.drop)``
or in the API reference section of this doc.

Consider now that you want to “clip” the errors, meaning all negative
values should be set to zero and all above above the maximum capped at
the maximum value. This is simply done with the clip method, that uses
lower and upper values that are attributes of the detector.

.. code:: ipython3

    print(f"min : {detector.lower}, max: {detector.upper}")


.. parsed-literal::

    min : 0.0, max: 2.0
    

.. code:: ipython3

    series.cleaner.clean('clip', detector)


.. parsed-literal::

    0    0.0
    1    1.0
    2    0.8
    3    2.0
    4    1.3
    5    0.0
    dtype: float64


.. note::
    The cleaning methods have an ``inplace`` option to overwrite the original data.

.. code:: ipython3

    series.cleaner.clean('clip', detector, inplace='True')
    series


.. parsed-literal::

    0    0.0
    1    1.0
    2    0.8
    3    2.0
    4    1.3
    5    0.0
    dtype: float64


.. note::

    Other cleaning methods are available:

    * ``replace`` to replace problematic cells by a value, or using a dict or a callable function
  
    * ``to_na`` to empty problematic cells and then use any missing-value imputation method
  
    * some other methods are specific to categorical detection methods, such as ``bykeys`` that is used along with ``keycollision`` detectors to identify typos or alternative formulations, e.g when `Linus Torvald` and `torvald, linus` should be same.

    See details along with examples in the user guide or the API reference.

Reapply to fresh data
---------------------

Say you have determined the min and max valid values on a set of 1M
rows. If the dataset is updated, you may not want/need to recalculate
the detector parameters, but simply apply them to clean a samaller set of
new rows.

This can for example be the case if the detector is part of an ETL
cleaning pipeline followed by the fitting of a machine learning model.
In this case, the detector has to be “fitted” on the train set, and then
applied as a tranformation step on the test set or, later, to new data during the
inference/prediction step.

To do so, pandas cleaner ``detect`` API method can be called with an
already "fitted" detector. For example, if we want to apply our bounded
detector to a new series:

.. code:: ipython3

    series2 = pd.Series([-1, 1, 235])

this can be done as follows:

.. code:: ipython3

    detector2 = series2.cleaner.detect(detector)

This detects the following problematic lines :

.. code:: ipython3

    detector2.detected()


.. parsed-literal::

    0     -1
    2    235
    dtype: int64


Going further
-------------

The following sections in this user guide give more detailed examples to use pandas-cleaner with different kind of data.