duplicated

class pdcleaner.detection.basic.duplicated(obj, detector=None, subset=None, keep='first')

Bases: _Detector

Detect duplicated elements

Intended to be used through the detect method with the keyword ‘duplicated’. Can be applied to a Series or a DataFrame.

>>> df.cleaner.detect.duplicated(...)
>>> df.cleaner.detect('duplicated',...)

Parameters:

subset (list of str, optional) – Columns to use for identifying duplicates
keep (str or bool, default 'first') – Which occurrence, if any, is considered non-duplicated:
- ‘first’: flag duplicated elements as errors, except for the first occurrence.
- ‘last’: flag duplicated elements as errors, except for the last occurrence.
- False: flag all duplicated elements as errors.
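
The three keep modes can be illustrated with a minimal plain-Python sketch. This is not pdcleaner's implementation: the helper name flag_duplicates is hypothetical, and it works on a simple list rather than a Series or DataFrame. It only mirrors the flagging rules described above, including the NameError raised for an unknown keep value.

```python
from collections import Counter


def flag_duplicates(values, keep="first"):
    """Hypothetical helper: return True for each value flagged as a duplicate."""
    if keep not in ("first", "last", False):
        raise NameError(f"unknown value for keep: {keep!r}")

    if keep is False:
        # Every value that occurs more than once is flagged.
        counts = Counter(values)
        return [counts[v] > 1 for v in values]

    # For 'last', scan from the end so the last occurrence is met first.
    if keep == "first":
        order = range(len(values))
    else:
        order = range(len(values) - 1, -1, -1)

    seen = set()
    flags = [False] * len(values)
    for i in order:
        if values[i] in seen:
            flags[i] = True  # already met: this occurrence is a duplicate
        else:
            seen.add(values[i])  # first time met: this occurrence is kept
    return flags


names = ["Alice", "Bob", "Alice", "Bob", "Alice"]
first_flags = flag_duplicates(names, keep="first")
last_flags = flag_duplicates(names, keep="last")
all_flags = flag_duplicates(names, keep=False)
print(first_flags)  # [False, False, True, True, True]
print(last_flags)   # [True, True, True, False, False]
print(all_flags)    # [True, True, True, True, True]
```

The keep='last' result matches the second doctest in the Examples section, which uses the same names as col1.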

Raises:

NameError – When an unknown value is given to the keep parameter.
KeyError – When a nonexistent column name is given in subset.

Examples

>>> import pandas as pd
>>> import pdcleaner

>>> df = pd.DataFrame({'col1': ['Alice', 'Bob', 'Alice', 'Bob', 'Alice'],
...                    'col2': [15, 13, 15, 10, 13]})
>>> detector = df.cleaner.detect.duplicated(subset=['col1', 'col2'], keep='first')
>>> print(detector.is_error())
0    False
1    False
2     True
3    False
4    False
dtype: bool

>>> detector = df.cleaner.detect.duplicated(subset=['col1'], keep='last')
>>> print(detector.is_error())
0     True
1     True
2     True
3    False
4    False
dtype: bool

Attributes Summary

`index`	Indices of the rows detected as errors
`keep`	Which occurrence to consider as non-duplicated
`n_errors`	Number of rows detected as errors
`name`
`obj`	The object (Series or DataFrame) containing the data to which the detection is applied
`subset`	List of columns used to identify duplicates

Methods Summary

`detected`()	Series or DataFrame containing only the detected errors
`has_errors`()	Returns True if any error has been detected, False otherwise
`is_error`()	Return a boolean same-sized object indicating if the values are flagged as errors
`not_error`()	Return a boolean same-sized object indicating if the values are NOT flagged as errors
`report`()	Prints a detection report
`valid`()	Series or DataFrame containing only the valid values
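
How these attributes and methods relate to one another can be sketched in plain Python. This is an assumption drawn only from the one-line descriptions above, not from pdcleaner's source: the variables below are plain lists standing in for the Series or DataFrame objects the real detector returns.

```python
# Stand-in data: a single column and the boolean mask that is_error() would
# return for keep='first' (assumed here, not computed by pdcleaner).
values = ["Alice", "Bob", "Alice", "Bob", "Alice"]
is_error = [False, False, True, True, True]

# not_error() is described as the complement of is_error().
not_error = [not flag for flag in is_error]

# detected() keeps only the flagged values; valid() keeps only the others.
detected = [v for v, flag in zip(values, is_error) if flag]
valid = [v for v, flag in zip(values, is_error) if not flag]

# index lists the positions of the flagged rows; n_errors counts them.
index = [i for i, flag in enumerate(is_error) if flag]
n_errors = sum(is_error)

# has_errors() is True as soon as one value is flagged.
has_errors = any(is_error)

print(detected)    # ['Alice', 'Bob', 'Alice']
print(valid)       # ['Alice', 'Bob']
print(index)       # [2, 3, 4]
print(n_errors)    # 3
print(has_errors)  # True
```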

Attributes Documentation

index: Indices of the rows detected as errors

keep: Which occurrence to consider as non-duplicated

n_errors: Number of rows detected as errors

name = 'duplicated'

obj: The object (Series or DataFrame) containing the data to which the detection is applied

subset: List of columns used to identify duplicates

Methods Documentation

detected(): Series or DataFrame containing only the detected errors

has_errors() → bool: Returns True if any error has been detected, False otherwise

is_error() → Series: Return a boolean same-sized object indicating if the values are flagged as errors

not_error() → Series: Return a boolean same-sized object indicating if the values are NOT flagged as errors

report(): Prints a detection report

valid(): Series or DataFrame containing only the valid values