pattern
- class pdcleaner.detection.strings.pattern(obj, detector=None, pattern='', mode='match', case=True, flags=0)
Bases: _ObjectTypeSeriesDetector
Detect strings that do not match a given pattern.
This detection method flags values as potential errors wherever the corresponding Series element does not match a given character sequence or regular expression.
Three matching modes are available: ‘match’, ‘fullmatch’ and ‘contains’ (the last one is similar to Python’s re.search).
- Parameters:
pattern (string) – Character sequence or regular expression.
mode (string (Default = 'match')) –
Test whether:
‘match’: there is a match that begins at the first character of the string
‘fullmatch’: the entire string matches the regular expression
‘contains’: there is a match at any position within the string
case (bool (Default = True)) – If True, the search is case sensitive.
flags (int (Default = 0 = no flags)) – Regex module flags, e.g. re.IGNORECASE.
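The three modes can be illustrated with Python's standard re module alone. The sketch below is not pdcleaner's implementation; the helper name `matches` is hypothetical, and it assumes that the modes correspond to re.match, re.fullmatch and re.search and that case=False behaves like adding re.IGNORECASE.

```python
import re

# Assumed mapping from the documented mode names to stdlib functions.
_TESTERS = {"match": re.match, "fullmatch": re.fullmatch, "contains": re.search}


def matches(value, pattern, mode="match", case=True, flags=0):
    """Return True if `value` satisfies `pattern` under the given mode (hypothetical helper)."""
    if not case:
        flags |= re.IGNORECASE  # assumption: case=False acts like re.IGNORECASE
    return _TESTERS[mode](pattern, value, flags) is not None


for mode in ("match", "fullmatch", "contains"):
    print(mode, [matches(s, r"[a-z]+", mode) for s in ("cat", "cat7", "7cat")])
# match [True, True, False]
# fullmatch [True, False, False]
# contains [True, True, True]
```

Only ‘fullmatch’ rejects "cat7", and only ‘contains’ accepts "7cat"; a detector would flag the values for which the test fails.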
- Raises:
ValueError – when pattern is empty, or when mode is not one of ‘match’, ‘fullmatch’ or ‘contains’
Note
Missing values (NaN) are not treated as errors
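The Raises and Note entries above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: `flag_errors` and `_is_missing` are hypothetical names, and None is treated like NaN here for convenience.

```python
import math
import re

_TESTERS = {"match": re.match, "fullmatch": re.fullmatch, "contains": re.search}


def _is_missing(value):
    """True for None or a float NaN (hypothetical helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def flag_errors(values, pattern, mode="match"):
    """Return one boolean per value: True where the value does not match."""
    if pattern == "":
        raise ValueError("pattern must not be empty")
    if mode not in _TESTERS:
        raise ValueError("mode must be 'match', 'fullmatch' or 'contains'")
    tester = _TESTERS[mode]
    # Missing values are never flagged; everything else is flagged when the test fails.
    return [False if _is_missing(v) else tester(pattern, v) is None for v in values]


values = ["Cat", "cat", "dog", "bird", "14", float("nan"), ""]
print(flag_errors(values, r"[a-z]*", mode="fullmatch"))
# [True, False, False, False, True, False, False]
```

The NaN at position 5 is left unflagged, and the empty string is accepted because `[a-z]*` also matches zero characters.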
Examples
Strings must consist of lowercase letters only
>>> series = pd.Series(['Cat','cat','dog','bird','14',np.nan,""])
>>> detector = series.cleaner.detect.pattern(pattern=r"[a-z]*", mode='fullmatch')
>>> print(detector.detected())
0    Cat
4     14
dtype: object
Strings must contain a “d”
>>> series = pd.Series(['Cat','cat','dog','bird','14',np.nan,""])
>>> detector = series.cleaner.detect.pattern(pattern=r"d", mode='contains')
>>> print(detector.detected())
0    Cat
1    cat
4     14
6       
dtype: object
Strings should start with ‘cat’ or ‘dog’, regardless of case
>>> series = pd.Series(['Cat','cat','dog','bird','14',np.nan,""])
>>> detector = series.cleaner.detect.pattern(pattern=r"cat|dog", mode='match', case=False)
>>> print(detector.detected())
3    bird
4      14
6        
dtype: object
One can also use a compiled regex. In this case, the arguments case and flags are ignored
>>> series = pd.Series(['Cat','cat','dog','bird','14',np.nan,""])
>>> import re
>>> regex = re.compile(r"[a-z]*")
>>> detector = series.cleaner.detect.pattern(pattern=regex, mode='fullmatch', case=True)
... UserWarning: case and flag are ignored with a compiled regex
>>> print(detector.detected())
0    Cat
4     14
dtype: object
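Because case and flags are ignored for a compiled regex, case-insensitive matching presumably has to be built into the compiled pattern itself. The following standard-library sketch shows how flags travel with a compiled pattern; it does not call pdcleaner, and the list comprehension merely stands in for the detector.

```python
import re

# The flag is attached at compile time and stays with the pattern object.
regex = re.compile(r"cat|dog", re.IGNORECASE)

values = ["Cat", "cat", "dog", "bird", "14", ""]

# Stand-in for mode='match': keep the values with no match at the start.
not_matching = [v for v in values if regex.match(v) is None]
print(not_matching)  # ['bird', '14', '']
```

Passing such a pattern as pattern=regex should then give case-insensitive detection without relying on the case argument, assuming the detector uses the compiled object as-is.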
Attributes Summary
case – Case sensitivity
flags – Usage of Regex module flags
index – Indices of the rows detected as errors
mode – 'match', 'fullmatch' or 'contains'
n_errors – Number of rows detected as errors
obj – The object (Series or DataFrame) containing the data to which the detection is applied
pattern – Character sequence or regular expression used to detect errors
Methods Summary
detected() – Series or DataFrame containing only the detected errors
has_errors() – Returns True if any error has been detected, False otherwise
is_error() – Return a boolean same-sized object indicating if the values are flagged as errors
not_error() – Return a boolean same-sized object indicating if the values are NOT flagged as errors
report() – Prints a detection report
valid() – Series or DataFrame containing only the valid values
Attributes Documentation
- case
Case sensitivity
- flags
Usage of Regex module flags
- index
Indices of the rows detected as errors
- mode
‘match’, ‘fullmatch’ or ‘contains’
- n_errors
Number of rows detected as errors
- name = 'pattern'
- obj
The object (Series or DataFrame) containing the data to which the detection is applied
- pattern
Character sequence or regular expression used to detect errors
Methods Documentation
- detected()
Series or DataFrame containing only the detected errors
- has_errors() → bool
Returns True if any error has been detected, False otherwise
- is_error() → Series
Return a boolean same-sized object indicating if the values are flagged as errors
- not_error() → Series
Return a boolean same-sized object indicating if the values are NOT flagged as errors
- report()
Prints a detection report
- valid()
Series or DataFrame containing only the valid values
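The descriptions above suggest that every accessor derives from a single boolean mask. The plain-Python sketch below illustrates that reading with dictionaries keyed by row index. It is an assumption about how the pieces relate, not the library's implementation, and the mask is written by hand rather than computed.

```python
# Row index -> value, and row index -> "is this value flagged?" (hand-written mask).
values = {0: "Cat", 1: "cat", 2: "dog", 3: "bird", 4: "14"}
is_error = {0: True, 1: False, 2: False, 3: False, 4: True}

# not_error is the element-wise negation of is_error.
not_error = {i: not flagged for i, flagged in is_error.items()}

# detected and valid split the data according to the mask.
detected = {i: v for i, v in values.items() if is_error[i]}
valid = {i: v for i, v in values.items() if not_error[i]}

# index, n_errors and has_errors summarise the flagged rows.
index = list(detected)
n_errors = len(index)
has_errors = n_errors > 0

print(detected)    # {0: 'Cat', 4: '14'}
print(valid)       # {1: 'cat', 2: 'dog', 3: 'bird'}
print(index, n_errors, has_errors)  # [0, 4] 2 True
```

Under this reading, detected() and valid() partition the original data, so their lengths add up to the length of obj.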