duplicated
- class pdcleaner.detection.basic.duplicated(obj, detector=None, subset=None, keep='first')
Bases: _Detector
Detect duplicated elements.
Intended to be used through the detect method with the keyword 'duplicated'. Can be applied to a Series or a DataFrame.
>>> df.cleaner.detect.duplicated(...)
>>> df.cleaner.detect('duplicated',...)
- Parameters:
subset (list of strings, optional) – Columns to be used for identifying duplicates
keep (string or bool, default = 'first') –
'first': flag duplicated elements as errors, except for the first occurrence.
'last': flag duplicated elements as errors, except for the last occurrence.
False: flag all duplicated elements as errors.
- Raises:
NameError – When an unknown value is given to the keep parameter.
KeyError – When a nonexistent column name is given in subset.
Examples
>>> import pandas as pd
>>> import pdcleaner
>>> df = pd.DataFrame({'col1': ['Alice', 'Bob', 'Alice', 'Bob', 'Alice'],
...                    'col2': [15, 13, 15, 10, 13]})
>>> detector = df.cleaner.detect.duplicated(subset=['col1', 'col2'], keep='first')
>>> print(detector.is_error())
0    False
1    False
2     True
3    False
4    False
dtype: bool

>>> detector = df.cleaner.detect.duplicated(subset=['col1'], keep='last')
>>> print(detector.is_error())
0     True
1     True
2     True
3    False
4    False
dtype: bool
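The three keep values can be compared side by side with plain pandas. The sketch below assumes the detector flags rows the same way pandas.DataFrame.duplicated does. The two examples above are consistent with that, but this page does not state it, and the code does not call pdcleaner at all.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['Alice', 'Bob', 'Alice', 'Bob', 'Alice'],
    'col2': [15, 13, 15, 10, 13],
})

# Assumption: pandas' own duplicated() stands in for the detector here.
cols = ['col1', 'col2']
first = df.duplicated(subset=cols, keep='first')  # spare the first occurrence
last = df.duplicated(subset=cols, keep='last')    # spare the last occurrence
both = df.duplicated(subset=cols, keep=False)     # flag every occurrence

print(first.tolist())  # [False, False, True, False, False]
print(last.tolist())   # [True, False, False, False, False]
print(both.tolist())   # [True, False, True, False, False]
```

Only rows 0 and 2 share the same (col1, col2) pair, so keep decides which of the two is flagged.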
Attributes Summary
index – Indices of the rows detected as errors
keep – Which occurrence to consider as non-duplicated
n_errors – Number of rows detected as errors
obj – The object (Series or DataFrame) containing the data to which the detection is applied
subset – List of subset columns
Methods Summary
detected() – Series or DataFrame containing only the detected errors
has_errors() – Returns True if any error has been detected, False otherwise
is_error() – Return a boolean same-sized object indicating if the values are flagged as errors
not_error() – Return a boolean same-sized object indicating if the values are NOT flagged as errors
report() – Prints a detection report
valid() – Series or DataFrame containing only the valid values
Attributes Documentation
- index
Indices of the rows detected as errors
- keep
Which occurrence to consider as non-duplicated
- n_errors
Number of rows detected as errors
- name = 'duplicated'
- obj
The object (Series or DataFrame) containing the data to which the detection is applied
- subset
List of subset columns
Methods Documentation
- detected()
Series or DataFrame containing only the detected errors
- has_errors() → bool
Returns True if any error has been detected, False otherwise
- is_error() → Series
Return a boolean same-sized object indicating if the values are flagged as errors
- not_error() → Series
Return a boolean same-sized object indicating if the values are NOT flagged as errors
- report()
Prints a detection report
- valid()
Series or DataFrame containing only the valid values
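How these attributes and methods relate to one another can be sketched with a plain boolean mask. This is an illustration only: it assumes is_error() returns such a mask, all variable names are hypothetical, and the code uses pandas rather than pdcleaner's internals.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c', 'b'])

# Hypothetical stand-in for detector.is_error(): True where a value is flagged.
is_error = s.duplicated(keep='first')

not_error = ~is_error              # complement, analogous to not_error()
detected = s[is_error]             # flagged values only, analogous to detected()
valid = s[not_error]               # unflagged values only, analogous to valid()
n_errors = int(is_error.sum())     # analogous to the n_errors attribute
has_errors = bool(is_error.any())  # analogous to has_errors()
error_index = s.index[is_error]    # analogous to the index attribute

print(detected.tolist())     # ['a', 'b']
print(valid.tolist())        # ['a', 'b', 'c']
print(n_errors, has_errors)  # 2 True
```

Under this reading, detected() and valid() partition the original object, and n_errors equals the number of True entries in is_error().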