url

class pdcleaner.detection.web.url(obj, detector=None, check_protocol=True)

Bases: pattern

Detect strings that do not match a URL.

Intended to be used through the detect method with the keyword ‘url’:

>>> series.cleaner.detect.url(...)
>>> series.cleaner.detect('url',...)

This detection method flags a value as a potential error when the corresponding Series element does not match a URL. A valid URL can be a regular internet address, an IP address, or localhost.

Parameters:

check_protocol (bool, default True) – If True, the ‘http://’ or ‘https://’ protocol is mandatory in a regular URL. If False, the protocol is optional.

Note

Missing values (NaN) are not treated as errors.
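
The behaviour described above can be sketched with the standard library alone. The pattern below is a deliberately simplified assumption and not the regular expression pdcleaner actually uses. The names `build_url_regex` and `flag_errors` are hypothetical, and `None` stands in for a missing value.

```python
import re

# Hypothetical, simplified host pattern (NOT pdcleaner's own regex):
# localhost, a dotted IPv4-like address, or a dotted domain name.
_HOST = r"(?:localhost|\d{1,3}(?:\.\d{1,3}){3}|(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,})"
# Optional port followed by an optional path.
_REST = r"(?::\d+)?(?:/\S*)?"


def build_url_regex(check_protocol=True):
    """Compile the pattern; the scheme is optional when check_protocol is False."""
    scheme = r"https?://" if check_protocol else r"(?:https?://)?"
    return re.compile(scheme + _HOST + _REST)


def flag_errors(values, check_protocol=True):
    """Return one boolean per value, True where the value is not URL-like.

    None plays the role of a missing value and is never flagged.
    """
    regex = build_url_regex(check_protocol)
    return [v is not None and regex.fullmatch(v) is None for v in values]


values = ["google.com", "https://www.google.com/", "https://127.0.0.1:80", "dummy", None]
strict = flag_errors(values)                          # protocol mandatory
lenient = flag_errors(values, check_protocol=False)   # protocol optional
print(strict)
print(lenient)
```

With the protocol mandatory, this sketch flags 'google.com' and 'dummy'. With the protocol optional, it flags only 'dummy'. The missing value is flagged in neither case.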

Examples

>>> import pandas as pd
>>> import pdcleaner  # registers the .cleaner accessor
>>> series = pd.Series([
...     'google.com', 'https://www.google.com/', 'https://127.0.0.1:80', 'dummy'])
>>> detector = series.cleaner.detect.url()
>>> print(detector.detected())
0    google.com
3         dummy
dtype: object

If the protocol is not mandatory:

>>> series = pd.Series(['google.com','https://www.google.com/'])
>>> detector = series.cleaner.detect('url', check_protocol=False)
>>> print(detector.is_error())
0    False
1    False
dtype: bool

Attributes Summary

case – Case sensitivity

check_protocol – If True, checks if the http or https protocol is present.

flags – Usage of Regex module flags

index – Indices of the rows detected as errors

mode – 'match', 'fullmatch' or 'contains'

n_errors – Number of rows detected as errors

name

obj – The object (Series or DataFrame) containing the data to which the detection is applied

pattern – Character sequence or regular expression used to detect errors

Methods Summary

detected() – Series or DataFrame containing only the detected errors

has_errors() – Returns True if any error has been detected, False otherwise

is_error() – Return a boolean same-sized object indicating if the values are flagged as errors

not_error() – Return a boolean same-sized object indicating if the values are NOT flagged as errors

report() – Print a detection report

valid() – Series or DataFrame containing only the valid values

Attributes Documentation

case

Case sensitivity

check_protocol

If True, checks if the http or https protocol is present. Otherwise, the protocol is optional.

flags

Usage of Regex module flags

index

Indices of the rows detected as errors

mode

‘match’, ‘fullmatch’ or ‘contains’

n_errors

Number of rows detected as errors

name = 'url'

obj

The object (Series or DataFrame) containing the data to which the detection is applied

pattern

Character sequence or regular expression used to detect errors
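
To illustrate how the three mode values listed above can differ, here is a small sketch based on Python's `re` module. Mapping 'contains' to `re.search` is an assumption made for illustration, and the helper `matches` is hypothetical. The pattern is a toy one, not the one used by this detector.

```python
import re

# Assumed mapping from a mode name to a function of the re module.
_MATCHERS = {
    "match": re.match,          # pattern must match at the start of the string
    "fullmatch": re.fullmatch,  # pattern must match the whole string
    "contains": re.search,      # pattern may match anywhere in the string
}


def matches(pattern, text, mode):
    """Return True if text satisfies pattern under the given mode."""
    return _MATCHERS[mode](pattern, text) is not None


pattern = r"https?://\S+"  # toy pattern, for illustration only
samples = [
    "https://example.com",
    "https://example.com now",
    "see https://example.com",
]
table = {
    text: [matches(pattern, text, mode) for mode in ("match", "fullmatch", "contains")]
    for text in samples
}
for text, row in table.items():
    print(f"{text!r}: {row}")
```

In this sketch only the first sample passes in all three modes. Trailing text breaks 'fullmatch', and leading text breaks both 'match' and 'fullmatch'.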

Methods Documentation

detected()

Series or DataFrame containing only the detected errors

has_errors() → bool

Returns True if any error has been detected, False otherwise

is_error() → Series

Return a boolean same-sized object indicating if the values are flagged as errors

not_error() → Series

Return a boolean same-sized object indicating if the values are NOT flagged as errors

report()

Print a detection report

valid()

Series or DataFrame containing only the valid values
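
The relationship between the methods and attributes documented above can be sketched with a small stand-in class. `ToyDetector` is hypothetical and written in plain Python. It is not pdcleaner's implementation, it uses lists and dicts where the library returns pandas objects, and its error rule (a missing 'http://' or 'https://' prefix) is a simplifying assumption.

```python
class ToyDetector:
    """Hypothetical stand-in mirroring the documented method names."""

    def __init__(self, values, is_bad):
        self._values = list(values)
        # One flag per value: True where the value counts as an error.
        self._mask = [is_bad(v) for v in self._values]

    def is_error(self):
        return list(self._mask)

    def not_error(self):
        # Element-wise negation of is_error().
        return [not flag for flag in self._mask]

    def detected(self):
        # Only the flagged values, keyed by their position.
        return {i: v for i, v in enumerate(self._values) if self._mask[i]}

    def valid(self):
        # Only the values that were not flagged.
        return {i: v for i, v in enumerate(self._values) if not self._mask[i]}

    def has_errors(self):
        return any(self._mask)

    @property
    def n_errors(self):
        return sum(self._mask)

    @property
    def index(self):
        return [i for i, flag in enumerate(self._mask) if flag]


urls = ["google.com", "https://www.google.com/", "https://127.0.0.1:80", "dummy"]
# Simplified rule for this sketch: an error is a missing http(s) prefix.
toy = ToyDetector(urls, lambda v: not v.startswith(("http://", "https://")))
print(toy.detected())
print(toy.n_errors, toy.index, toy.has_errors())
```

In this sketch, `detected()` and `valid()` split the input into two disjoint parts, and `n_errors` equals the length of `index`.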