pattern#

class pdcleaner.detection.strings.pattern(obj, detector=None, pattern='', mode='match', case=True, flags=0)[source]#

Bases: _ObjectTypeSeriesDetector

Detect strings that do not match a given pattern.

This detection method flags values as potential errors wherever the corresponding Series element does not match a given character sequence or regular expression.

Three matching modes are available: ‘match’, ‘fullmatch’ and ‘contains’ (the last is similar to Python’s re.search).

Parameters:
  • pattern (string) – Character sequence or regular expression.

  • mode (string (Default = 'match')) –

    Test whether:

    • ’match’: there is a match that begins at the first character of the string

    • ’fullmatch’: the entire string matches the regular expression

    • ’contains’: there is a match at any position within the string

  • case (bool (Default = True)) – If True, the search is case sensitive.

  • flags (int (Default = 0 = no flags)) – Regex module flags, e.g. re.IGNORECASE.
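As a rough illustration of the three modes, the sketch below maps them onto functions from Python's standard re module. The helper `matches` is hypothetical and not part of pdcleaner; it only assumes that ‘match’, ‘fullmatch’ and ‘contains’ behave like re.match, re.fullmatch and re.search, and that case=False is equivalent to adding re.IGNORECASE to the flags.

```python
import re

# Hypothetical helper, not part of pdcleaner: maps each mode onto the
# standard-library function it resembles.
_MODES = {"match": re.match, "fullmatch": re.fullmatch, "contains": re.search}


def matches(value, pattern, mode="match", case=True, flags=0):
    """Return True if `value` satisfies `pattern` under the given mode."""
    if mode not in _MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if not case:
        flags |= re.IGNORECASE  # assumption: case=False adds IGNORECASE
    return _MODES[mode](pattern, value, flags) is not None


print(matches("bird", r"d", mode="match"))        # False: 'd' is not at the start
print(matches("bird", r"d", mode="contains"))     # True: 'd' occurs somewhere
print(matches("bird", r"bir", mode="match"))      # True: match at the first character
print(matches("bird", r"bir", mode="fullmatch"))  # False: trailing 'd' is unmatched
print(matches("Cat", r"cat", case=False))         # True: case is ignored
```

A value for which such a test returns False is what the detector would flag as a potential error.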

Raises:

ValueError – when pattern is empty, or when mode is neither ‘match’, ‘fullmatch’ nor ‘contains’

Note

Missing values (NaN) are not treated as errors.
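The handling of missing values can be sketched in plain Python. The function `flag_errors` below is a hypothetical stand-in, not the pdcleaner implementation: it assumes that None and float NaN play the role of pandas missing values and simply skips them, flagging only the strings that fail the pattern.

```python
import math
import re


def flag_errors(values, pattern, mode="match"):
    """Return positions of values that fail the pattern; missing values are skipped."""
    test = {"match": re.match, "fullmatch": re.fullmatch, "contains": re.search}[mode]
    flagged = []
    for position, value in enumerate(values):
        # Assumption: None and float NaN stand in for pandas missing values.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        if test(pattern, value) is None:
            flagged.append(position)
    return flagged


values = ["Cat", "cat", "dog", "bird", "14", float("nan"), ""]
print(flag_errors(values, r"[a-z]*", mode="fullmatch"))  # [0, 4]
print(flag_errors(values, r"d", mode="contains"))        # [0, 1, 4, 6]
```

Position 5 (the NaN) never appears in the result. The empty string at position 6 is flagged only in the second call, because `[a-z]*` accepts zero characters while `d` requires at least one.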

Examples

Strings must consist of lowercase letters only

>>> series = pd.Series(['Cat','cat','dog','bird','14',np.nan,""])
>>> detector = series.cleaner.detect.pattern(pattern=r"[a-z]*", mode='fullmatch')
>>> print(detector.detected())
0    Cat
4     14
dtype: object

Strings must contain a “d”

>>> series = pd.Series(['Cat','cat','dog','bird','14',np.nan,""])
>>> detector = series.cleaner.detect.pattern(pattern=r"d", mode='contains')
>>> print(detector.detected())
0    Cat
1    cat
4     14
6
dtype: object

Strings should start with ‘cat’ or ‘dog’, regardless of case

>>> series = pd.Series(['Cat','cat','dog','bird','14',np.nan,""])
>>> detector = series.cleaner.detect.pattern(pattern=r"cat|dog", mode='match', case=False)
>>> print(detector.detected())
3    bird
4      14
6
dtype: object

One can also use a compiled regex. In this case, the arguments case and flags are ignored.

>>> series = pd.Series(['Cat','cat','dog','bird','14',np.nan,""])
>>> import re
>>> regex = re.compile(r"[a-z]*")
>>> detector = series.cleaner.detect.pattern(pattern=regex, mode='fullmatch', case=True)
... UserWarning: case and flag are ignored with a compiled regex
>>> print(detector.detected())
0    Cat
4     14
dtype: object
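A compiled pattern already carries its own flags, which is one plausible reason for ignoring case and flags in that situation. The sketch below uses only the standard re module (not pdcleaner) to show that case handling has to be chosen at compile time, and that re itself rejects extra flags passed alongside a compiled pattern.

```python
import re

# A compiled pattern carries its own flags, so case handling must be
# decided at compile time.
strict = re.compile(r"[a-z]*")
relaxed = re.compile(r"[a-z]*", re.IGNORECASE)

print(strict.fullmatch("Cat") is None)       # True: 'C' is not in [a-z]
print(relaxed.fullmatch("Cat") is not None)  # True: IGNORECASE was compiled in

# The standard library rejects extra flags passed alongside a compiled pattern.
try:
    re.fullmatch(strict, "Cat", re.IGNORECASE)
    extra_flags_rejected = False
except ValueError:
    extra_flags_rejected = True
print(extra_flags_rejected)  # True
```

To get case-insensitive detection with a compiled regex, compile it with re.IGNORECASE instead of relying on case=False.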

Attributes Summary

case

Case sensitivity

flags

Usage of Regex module flags

index

Indices of the rows detected as errors

mode

'match', 'fullmatch' or 'contains'

n_errors

Number of rows detected as errors

name

obj

The object (Series or DataFrame) containing the data to which the detection is applied

pattern

Character sequence or regular expression used to detect errors

Methods Summary

detected()

Series or DataFrame containing only the detected errors

has_errors()

Returns True if any error has been detected, False otherwise

is_error()

Return a boolean same-sized object indicating if the values are flagged as errors

not_error()

Return a boolean same-sized object indicating if the values are NOT flagged as errors

report()

Prints a detection report

valid()

Series or DataFrame containing only the valid values

Attributes Documentation

case#

Case sensitivity

flags#

Usage of Regex module flags

index#

Indices of the rows detected as errors

mode#

‘match’, ‘fullmatch’ or ‘contains’

n_errors#

Number of rows detected as errors

name = 'pattern'#

obj#

The object (Series or DataFrame) containing the data to which the detection is applied

pattern#

Character sequence or regular expression used to detect errors

Methods Documentation

detected()#

Series or DataFrame containing only the detected errors

has_errors() bool#

Returns True if any error has been detected, False otherwise

is_error() Series#

Return a boolean same-sized object indicating if the values are flagged as errors

not_error() Series#

Return a boolean same-sized object indicating if the values are NOT flagged as errors

report()#

Prints a detection report

valid()#

Series or DataFrame containing only the valid values