duplicated

class pdcleaner.detection.basic.duplicated(obj, detector=None, subset=None, keep='first')

Bases: _Detector

Detect duplicated elements

Intended to be called through the detect method with the keyword ‘duplicated’. Works with a Series or a DataFrame.

>>> df.cleaner.detect.duplicated(...)
>>> df.cleaner.detect('duplicated',...)
Parameters:
  • subset (list of strings, optional) – Columns to use for identifying duplicates

  • keep (string or bool, default = 'first') –

    • ‘first’ : flag duplicated elements as errors, except for the first occurrence.

    • ‘last’ : flag duplicated elements as errors, except for the last occurrence.

    • False : flag all duplicated elements as errors.

Raises:
  • NameError – When an unknown value is given for the keep parameter.

  • KeyError – When a nonexistent column name is given in subset.
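
The three keep options read like the convention used by pandas' own DataFrame.duplicated. That correspondence is an assumption here, not something this page states. A minimal sketch in plain pandas (no pdcleaner calls) showing which rows each option would flag:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['Alice', 'Bob', 'Alice', 'Bob', 'Alice'],
                   'col2': [15, 13, 15, 10, 13]})

# Plain pandas, used only to illustrate the three `keep` options.
# Assumption: the detector flags the same rows as DataFrame.duplicated.
masks = {keep: df.duplicated(subset=['col1'], keep=keep)
         for keep in ('first', 'last', False)}

for keep, mask in masks.items():
    print(f"keep={keep!r}: flagged rows {mask[mask].index.tolist()}")
# keep='first': flagged rows [2, 3, 4]
# keep='last': flagged rows [0, 1, 2]
# keep=False: flagged rows [0, 1, 2, 3, 4]
```

With keep=False every row is flagged because each value of col1 occurs more than once in this sample.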

Examples

>>> import pandas as pd
>>> import pdcleaner
>>> df = pd.DataFrame({'col1': ['Alice', 'Bob', 'Alice', 'Bob', 'Alice'],
...                    'col2': [15, 13, 15, 10, 13]})
>>> detector = df.cleaner.detect.duplicated(subset=['col1', 'col2'], keep='first')
>>> print(detector.is_error())
0    False
1    False
2     True
3    False
4    False
dtype: bool
>>> detector = df.cleaner.detect.duplicated(subset=['col1'], keep='last')
>>> print(detector.is_error())
0     True
1     True
2     True
3    False
4    False
dtype: bool

Attributes Summary

index

Indices of the rows detected as errors

keep

Which occurrence to treat as not duplicated

n_errors

Number of rows detected as errors

name

obj

The object (Series or DataFrame) containing the data to which the detection is applied

subset

List of columns used to identify duplicates

Methods Summary

detected()

Series or DataFrame containing only the detected errors

has_errors()

Returns True if any error has been detected, False otherwise

is_error()

Return a boolean same-sized object indicating if the values are flagged as errors

not_error()

Return a boolean same-sized object indicating if the values are NOT flagged as errors

report()

Prints a detection report

valid()

Series or DataFrame containing only the valid values

Attributes Documentation

index

Indices of the rows detected as errors

keep

Which occurrence to treat as not duplicated

n_errors

Number of rows detected as errors

name = 'duplicated'

obj

The object (Series or DataFrame) containing the data to which the detection is applied

subset

List of columns used to identify duplicates

Methods Documentation

detected()

Series or DataFrame containing only the detected errors

has_errors() → bool

Returns True if any error has been detected, False otherwise

is_error() → Series

Return a boolean same-sized object indicating if the values are flagged as errors

not_error() → Series

Return a boolean same-sized object indicating if the values are NOT flagged as errors

report()

Prints a detection report

valid()

Series or DataFrame containing only the valid values
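
How the documented methods and attributes relate to one another can be sketched with a plain boolean mask. This is an illustration in pandas only, not pdcleaner's implementation: the variable names borrow the documented names, and the mask is assumed to match what is_error() returns.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['Alice', 'Bob', 'Alice', 'Bob', 'Alice'],
                   'col2': [15, 13, 15, 10, 13]})

# Assumed stand-in for is_error(): True where a row repeats an earlier one.
is_error = df.duplicated(subset=['col1', 'col2'], keep='first')

not_error = ~is_error              # complement of the error mask
detected = df[is_error]            # only the rows flagged as errors
valid = df[not_error]              # only the rows not flagged
index = df.index[is_error]         # labels of the flagged rows
n_errors = int(is_error.sum())     # number of flagged rows
has_errors = bool(is_error.any())  # True if at least one row is flagged

print(detected)
print(n_errors, has_errors)
```

Under these assumptions only row 2 is flagged, so detected holds one row and valid holds the remaining four.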