8.2. Text Parsers
8.2.1. Text
class easydata.parsers.text.Text(*args, normalize: bool = True, capitalize: bool = False, title: bool = False, uppercase: bool = False, lowercase: bool = False, replace_keys: Optional[list] = None, remove_keys: Optional[list] = None, split_key: Optional[Union[str, tuple]] = None, split_keys: Optional[List[Union[str, tuple]]] = None, take: Optional[int] = None, skip: Optional[int] = None, text_num_to_numeric: bool = False, language: Optional[str] = None, fix_spaces: bool = True, escape_new_lines: bool = True, new_line_replacement: str = ' ', add_stop: Optional[Union[bool, str]] = None, separator: str = ' ', index: Optional[int] = None, strip: bool = False, **kwargs)
Text is a parser that normalizes and manipulates short pieces of text
such as titles.
Getting Started
Let's first import the easydata module.
>>> import easydata as ed
Text supports query selectors for fetching data.
>>> test_dict = {'title': 'Easybook pro 13'}
>>> ed.Text(ed.key('title')).parse(test_dict)
'Easybook pro 13'
In this example, let's process text with bad encoding and multiple spaces between characters.
>>> test_text = 'Easybook Pro 13 <3 ünicode'
>>> ed.Text().parse(test_text)
Easybook Pro 13 <3 ünicode
Floats and integers are converted to strings automatically.
>>> test_int = 123
>>> ed.Text().parse(test_int)
'123'
>>> test_float = 123.12
>>> ed.Text().parse(test_float)
'123.12'
Parameters
normalize
As the example above shows, text normalization (fixing bad encoding) is
enabled by default through the normalize parameter. Set normalize to
False to disable it.
>>> test_text = 'Easybook Pro 13 <3 ünicode'
>>> ed.Text(normalize=False).parse(test_text)
Easybook Pro 13 <3 ünicode
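To get a feel for what normalization of this kind can involve, here is a minimal standard-library sketch. It is not easydata's implementation: the helper name normalize_text and the choice of HTML-entity decoding plus Unicode NFC composition are assumptions made purely for illustration.

```python
import html
import unicodedata


def normalize_text(text: str) -> str:
    """Hypothetical helper, not part of easydata."""
    # Turn entities such as '&lt;' back into the characters they stand for.
    text = html.unescape(text)
    # Compose decomposed sequences ('u' + combining diaeresis) into one 'ü'.
    return unicodedata.normalize("NFC", text)


print(normalize_text("Easybook Pro 13 &lt;3 u\u0308nicode"))
```

The input here is an invented example of "bad" text; the library may handle a wider or different set of problems.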
capitalize
We can capitalize the first character of our string by setting the
capitalize parameter to True. It is set to False by default.
>>> test_text = 'easybook PRO 15'
>>> ed.Text(capitalize=True).parse(test_text)
Easybook PRO 15
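Note that the output above keeps PRO in uppercase, which differs from Python's built-in str.capitalize(), as that method also lowercases the rest of the string. The plain-Python comparison below illustrates the difference; capitalize_first is a hypothetical helper written for this sketch and is not part of easydata.

```python
def capitalize_first(text: str) -> str:
    """Hypothetical helper: uppercase only the first character."""
    return text[:1].upper() + text[1:]


text = "easybook PRO 15"
first_only = capitalize_first(text)  # rest of the string is left untouched
builtin = text.capitalize()          # built-in also lowercases the rest

print(first_only)  # Easybook PRO 15
print(builtin)     # Easybook pro 15
```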
title
With the title parameter set to True, the first character of every word
becomes uppercase and the remaining characters become lowercase.
>>> test_text = 'easybook PRO 15'
>>> ed.Text(title=True).parse(test_text)
Easybook Pro 15
uppercase
We can convert all characters in our string to uppercase by setting the
uppercase parameter to True.
>>> test_text = 'easybook PRO 15'
>>> ed.Text(uppercase=True).parse(test_text)
EASYBOOK PRO 15
lowercase
We can convert all characters in our string to lowercase by setting the
lowercase parameter to True.
>>> test_text = 'easybook PRO 15'
>>> ed.Text(lowercase=True).parse(test_text)
easybook pro 15
replace_keys
We can replace characters or words in a string through the replace_keys
parameter. Each lookup key in replace_keys can be a regex pattern and is
not case sensitive.
>>> test_text = 'Easybook Pro 15'
>>> ed.Text(replace_keys=[('pro', 'Air'), ('15', '13')]).parse(test_text)
Easybook Air 13
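The case-insensitive, regex-based replacement described above can be approximated with the standard re module. This is only a sketch of the idea under that assumption, not easydata's code; the function name replace_keys_sketch is hypothetical.

```python
import re
from typing import List, Tuple


def replace_keys_sketch(text: str, pairs: List[Tuple[str, str]]) -> str:
    """Hypothetical helper: apply each (pattern, replacement) pair in order."""
    for pattern, replacement in pairs:
        # re.IGNORECASE lets the key 'pro' match 'Pro', 'PRO', and so on.
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text


print(replace_keys_sketch("Easybook Pro 15", [("pro", "Air"), ("15", "13")]))
```

Because the keys are treated as patterns, characters with a special meaning in regular expressions (such as . or +) would need escaping, for example with re.escape, if they should match literally.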
remove_keys
We can remove characters or words from a string through the remove_keys
parameter. Each lookup key in remove_keys can be a regex pattern and is
not case sensitive.
>>> test_text = 'Easybook Pro 15'
>>> ed.Text(remove_keys=['easy', 'pro']).parse(test_text)
book 15
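Removing a word from the middle of a string leaves two adjacent spaces behind, yet the output above contains only one. The sketch below reproduces that result in plain Python by collapsing spaces after the removal; remove_keys_sketch is a hypothetical helper, and the assumption that cleanup happens after removal is inferred from the displayed output, not taken from easydata's source.

```python
import re
from typing import List


def remove_keys_sketch(text: str, patterns: List[str]) -> str:
    """Hypothetical helper: delete every case-insensitive pattern match."""
    for pattern in patterns:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse runs of spaces left behind by the removals, then trim the ends.
    return re.sub(r" {2,}", " ", text).strip()


print(remove_keys_sketch("Easybook Pro 15", ["easy", "pro"]))
```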
split_key
Text can be split by split_key. By default, the part at index 0 is returned.
>>> test_text = 'easybook-pro_13'
>>> ed.Text(split_key='-').parse(test_text)
easybook
Let's specify the split index through a tuple.
>>> test_text = 'easybook-pro_13'
>>> ed.Text(split_key=('-', -1)).parse(test_text)
pro_13
split_keys
split_keys works in the same way as split_key, but instead of a single
split key it accepts a list of keys that are applied in order.
>>> test_text = 'easybook-pro_13'
>>> ed.Text(split_keys=[('-', -1), '_']).parse(test_text)
pro
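The chained splitting shown above can be pictured as repeated str.split calls, where each key is either a separator (index 0 implied) or a (separator, index) tuple. The following is a minimal sketch under that reading; split_by_keys and the SplitKey alias are hypothetical names, not part of easydata.

```python
from typing import List, Tuple, Union

SplitKey = Union[str, Tuple[str, int]]


def split_by_keys(text: str, keys: List[SplitKey]) -> str:
    """Hypothetical helper: split by each key in turn and keep one part."""
    for key in keys:
        # A bare string means "split and keep the part at index 0".
        sep, index = (key, 0) if isinstance(key, str) else key
        text = text.split(sep)[index]
    return text


print(split_by_keys("easybook-pro_13", ["-"]))             # easybook
print(split_by_keys("easybook-pro_13", [("-", -1)]))       # pro_13
print(split_by_keys("easybook-pro_13", [("-", -1), "_"]))  # pro
```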
take
With the take parameter we can limit the maximum number of characters kept
in the end result. Let's see how it works in the example below.
>>> test_text = 'Easybook Pro 13'
>>> ed.Text(take=8).parse(test_text)
Easybook
skip
With the skip parameter we can skip a given number of characters from the
start. Let's see how it works in the example below.
>>> test_text = 'Easybook Pro 13'
>>> ed.Text(skip=8).parse(test_text)
Pro 13
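In plain Python terms, take and skip behave like slices from the start of the string. The sketch below mirrors the two examples; it is an analogy, not easydata's implementation. Skipping 8 characters leaves a leading space, which the displayed output does not contain, so the sketch strips it. How easydata removes that space is not stated here and is an assumption.

```python
text = "Easybook Pro 13"

taken = text[:8]            # keep at most the first 8 characters
skipped = text[8:].strip()  # drop the first 8 characters, then trim

print(taken)    # Easybook
print(skipped)  # Pro 13
```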
text_num_to_numeric
We can convert words that spell out numeric values into actual numbers
by setting the text_num_to_numeric parameter to True.
>>> test_text = 'two thousand and three words for the first time'
>>> ed.Text(text_num_to_numeric=True).parse(test_text)
2003 words for the 1 time
If our text is in a different language, we need to set the language
parameter accordingly. Currently the only supported languages are
en, es, hi and ru.
fix_spaces
By default, runs of multiple spaces are collapsed so that only a single space is left between characters. Let's test it in the example below:
>>> test_text = 'Easybook  Pro  15'
>>> ed.Text().parse(test_text)
Easybook Pro 15
Now let's set the fix_spaces parameter to False and see what happens.
>>> test_text = 'Easybook  Pro  15'
>>> ed.Text(fix_spaces=False).parse(test_text)
Easybook  Pro  15
escape_new_lines
By default, all newline characters are converted to a single space, as we can see in the example below:
>>> test_text = 'Easybook\nPro\n15'
>>> ed.Text().parse(test_text)
Easybook Pro 15
Now let's set the escape_new_lines parameter to False and see what happens.
>>> test_text = 'Easybook\nPro\n15'
>>> ed.Text(escape_new_lines=False).parse(test_text)
Easybook\nPro\n15
new_line_replacement
If escape_new_lines is set to True, all newline characters are replaced
by ' ' by default, as seen in the example above. We can change this
default by setting the new_line_replacement parameter.
>>> test_text = 'Easybook\nPro\n15'
>>> ed.Text(new_line_replacement='<br>').parse(test_text)
Easybook<br>Pro<br>15
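The interplay of escape_new_lines and new_line_replacement can be pictured with a small plain-Python helper. This is a sketch only; escape_new_lines_sketch is a hypothetical name, and using str.splitlines is a design choice of this example, not something taken from easydata.

```python
def escape_new_lines_sketch(text: str, replacement: str = " ") -> str:
    """Hypothetical helper: join the lines of text with a replacement string."""
    # splitlines() breaks on '\n', '\r\n' and '\r' (among other line
    # boundaries) and drops a single trailing line break.
    return replacement.join(text.splitlines())


print(escape_new_lines_sketch("Easybook\nPro\n15"))          # Easybook Pro 15
print(escape_new_lines_sketch("Easybook\nPro\n15", "<br>"))  # Easybook<br>Pro<br>15
```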
add_stop
We can add a stop character at the end of the string by setting the
add_stop parameter to True.
>>> test_text = 'Easybook Pro 15'
>>> ed.Text(add_stop=True).parse(test_text)
Easybook Pro 15.
By default . is added, but we can provide a custom character if needed.
Instead of the boolean value True, we pass the character itself to
add_stop, as we can see in the example below.
>>> test_text = 'Easybook Pro 15'
>>> ed.Text(add_stop='!').parse(test_text)
Easybook Pro 15!
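The two accepted forms of add_stop (a boolean or a custom character) can be modelled as follows. This is a hedged sketch, not easydata's code: add_stop_sketch is a hypothetical helper, and the check that avoids appending a second stop character is an assumption of this example that the documentation above does not confirm.

```python
from typing import Optional, Union


def add_stop_sketch(text: str, stop: Optional[Union[bool, str]] = None) -> str:
    """Hypothetical helper: append a stop character to text."""
    if not stop:
        # None, False or an empty string: leave the text unchanged.
        return text
    char = "." if stop is True else stop
    # Assumption: do not double up when the text already ends with the char.
    return text if text.endswith(char) else text + char


print(add_stop_sketch("Easybook Pro 15", True))  # Easybook Pro 15.
print(add_stop_sketch("Easybook Pro 15", "!"))   # Easybook Pro 15!
```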