8.2. Text Parsers¶

8.2.1. Text¶

class easydata.parsers.text.Text(*args, normalize: bool = True, capitalize: bool = False, title: bool = False, uppercase: bool = False, lowercase: bool = False, replace_keys: Optional[list] = None, remove_keys: Optional[list] = None, split_key: Optional[Union[str, tuple]] = None, split_keys: Optional[List[Union[str, tuple]]] = None, take: Optional[int] = None, skip: Optional[int] = None, text_num_to_numeric: bool = False, language: Optional[str] = None, fix_spaces: bool = True, escape_new_lines: bool = True, new_line_replacement: str = ' ', add_stop: Optional[Union[bool, str]] = None, separator: str = ' ', index: Optional[int] = None, strip: bool = False, **kwargs)

Bases: easydata.parsers.base.BaseData

Text is a parser that normalizes and manipulates short pieces of text, such as titles.

Getting Started¶

First, let's import the easydata module.

>>> import easydata as ed

Text supports query selectors for fetching data.

>>> test_dict = {'title': 'Easybook pro 13'}
>>> ed.Text(ed.key('title')).parse(test_dict)
'Easybook pro 13'

In this example, let's process text with bad encoding and multiple spaces between characters.

>>> test_text = 'Easybook    Pro 13 &lt;3 uÌˆnicode'
>>> ed.Text().parse(test_text)
Easybook Pro 13 <3 ünicode

Floats and integers are converted to strings automatically.

>>> test_int = 123
>>> ed.Text().parse(test_int)
'123'

>>> test_float = 123.12
>>> ed.Text().parse(test_float)
'123.12'

Parameters¶

normalize¶

As the example above shows, text normalization (repairing bad encoding) is enabled by default through the normalize parameter. Let's set normalize to False to disable it.

>>> test_text = 'Easybook Pro 13 &lt;3 uÌˆnicode'
>>> ed.Text(normalize=False).parse(test_text)
Easybook Pro 13 &lt;3 uÌˆnicode
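
To make the effect of normalization more concrete, here is a minimal standard-library sketch of the kind of repairs seen in the examples above. It is not easydata's implementation; the helper name normalize_sketch and the choice of steps (reversing a Windows-1252 mis-decoding, decoding HTML entities, composing combining marks) are assumptions made for illustration.

```python
import html
import unicodedata


def normalize_sketch(text: str) -> str:
    """Hypothetical helper: repair the damage shown in the example above."""
    # Undo UTF-8 bytes that were mis-decoded as Windows-1252 (mojibake).
    try:
        text = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass  # not this kind of mojibake; leave the text unchanged
    # Decode HTML entities such as &lt; and compose combining marks (NFC).
    return unicodedata.normalize("NFC", html.unescape(text))


# Same input as above, with the garbled characters written as escapes.
broken = "Easybook Pro 13 &lt;3 u\u00cc\u02c6nicode"
fixed = normalize_sketch(broken)
print(fixed)
```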

capitalize¶

We can capitalize the first character of our string by setting the capitalize parameter to True. It defaults to False.

>>> test_text = 'easybook PRO 15'
>>> ed.Text(capitalize=True).parse(test_text)
Easybook PRO 15
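
Note that the output above keeps PRO in uppercase, which differs from Python's built-in str.capitalize(). The plain-Python sketch below shows the difference; capitalize_first is a hypothetical helper that mirrors the documented output, not easydata's own code.

```python
def capitalize_first(text: str) -> str:
    """Hypothetical helper: uppercase only the first character."""
    return text[:1].upper() + text[1:]


text = "easybook PRO 15"
kept_case = capitalize_first(text)  # rest of the string is left untouched
builtin = text.capitalize()         # built-in also lowercases the rest

print(kept_case)
print(builtin)
```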

title¶

With the title parameter set to True, the first character of every word becomes uppercase and the remaining characters become lowercase.

>>> test_text = 'easybook PRO 15'
>>> ed.Text(title=True).parse(test_text)
Easybook Pro 15

uppercase¶

We can convert all characters in our string to uppercase by setting the uppercase parameter to True.

>>> test_text = 'easybook PRO 15'
>>> ed.Text(uppercase=True).parse(test_text)
EASYBOOK PRO 15

lowercase¶

We can convert all characters in our string to lowercase by setting the lowercase parameter to True.

>>> test_text = 'easybook PRO 15'
>>> ed.Text(lowercase=True).parse(test_text)
easybook pro 15

replace_keys¶

We can replace characters or words in a string through the replace_keys parameter. Each entry is a (lookup key, replacement) pair; the lookup key can be a regex pattern and is not case sensitive.

>>> test_text = 'Easybook Pro 15'
>>> ed.Text(replace_keys=[('pro', 'Air'), ('15', '13')]).parse(test_text)
Easybook Air 13
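
The case-insensitive, regex-based replacement described above can be approximated with the standard re module. This is only a sketch under the assumption that the pairs are applied in order; replace_keys_sketch is a hypothetical name and not part of easydata.

```python
import re


def replace_keys_sketch(text, replace_keys):
    """Hypothetical helper: apply (pattern, replacement) pairs in order."""
    for pattern, replacement in replace_keys:
        # IGNORECASE makes 'pro' match 'Pro', as in the example above.
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text


result = replace_keys_sketch("Easybook Pro 15", [("pro", "Air"), ("15", "13")])
print(result)
```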

remove_keys¶

We can remove characters or words from a string through the remove_keys parameter. Each lookup key can be a regex pattern and is not case sensitive.

>>> test_text = 'Easybook Pro 15'
>>> ed.Text(remove_keys=['easy', 'pro']).parse(test_text)
book 15
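
Removing 'pro' from the middle of the string leaves two adjacent spaces, yet the output above contains only one. A plausible explanation is that space fixing runs after removal. The sketch below reproduces that behaviour with the standard library; remove_keys_sketch and the final trimming step are assumptions, not easydata's implementation.

```python
import re


def remove_keys_sketch(text, remove_keys):
    """Hypothetical helper: delete every match, then tidy the spacing."""
    for pattern in remove_keys:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse leftover runs of spaces and trim the ends.
    return re.sub(r" {2,}", " ", text).strip()


result = remove_keys_sketch("Easybook Pro 15", ["easy", "pro"])
no_digits = remove_keys_sketch("Easybook Pro 15", [r"\d+"])  # regex key
print(result)
print(no_digits)
```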

split_key¶

Text can be split by split_key. By default, the part at index 0 of the split result is returned.

>>> test_text = 'easybook-pro_13'
>>> ed.Text(split_key='-').parse(test_text)
easybook

Let's specify the split index through a tuple.

>>> test_text = 'easybook-pro_13'
>>> ed.Text(split_key=('-', -1)).parse(test_text)
pro_13

split_keys¶

split_keys works in the same way as split_key, but instead of a single split key it accepts a list of keys.

>>> test_text = 'easybook-pro_13'
>>> ed.Text(split_keys=[('-', -1), '_']).parse(test_text)
pro
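
The result above is consistent with the keys being applied one after another: the first split keeps 'pro_13', the second keeps 'pro'. The plain-Python sketch below illustrates that chaining; split_keys_sketch is a hypothetical helper, and the sequential order is an assumption drawn from the example.

```python
def split_keys_sketch(text, split_keys):
    """Hypothetical helper: apply each split key to the previous result."""
    for key in split_keys:
        # A bare string means "split and take index 0".
        sep, index = key if isinstance(key, tuple) else (key, 0)
        text = text.split(sep)[index]
    return text


result = split_keys_sketch("easybook-pro_13", [("-", -1), "_"])
print(result)
```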

take¶

With the take parameter we can limit the maximum number of characters shown in the end result. Let's see how it works in the example below.

>>> test_text = 'Easybook Pro 13'
>>> ed.Text(take=8).parse(test_text)
Easybook

skip¶

With the skip parameter we can skip a given number of characters from the start. Let's see how it works in the example below.

>>> test_text = 'Easybook Pro 13'
>>> ed.Text(skip=8).parse(test_text)
Pro 13
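
In plain Python terms, take and skip behave like string slices. The sketch below is an analogy only; the trailing strip() is an assumption made to match the output above, which has no leading space.

```python
text = "Easybook Pro 13"

taken = text[:8]            # keep the first 8 characters
skipped = text[8:].strip()  # drop the first 8 characters, then trim

print(taken)
print(skipped)
```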

text_num_to_numeric¶

We can convert words that spell out numeric values into actual numbers by setting the text_num_to_numeric parameter to True.

>>> test_text = 'two thousand and three words for the first time'
>>> ed.Text(text_num_to_numeric=True).parse(test_text)
2003 words for the 1 time

If our text is in a different language, we need to set the language parameter accordingly. Currently only en, es, hi and ru are supported.

fix_spaces¶

By default, runs of multiple spaces are collapsed into a single space. Let's test it in the example below:

>>> test_text = 'Easybook   Pro  15'
>>> ed.Text().parse(test_text)
Easybook Pro 15

Now let's set the fix_spaces parameter to False and see what happens.

>>> test_text = 'Easybook   Pro  15'
>>> ed.Text(fix_spaces=False).parse(test_text)
Easybook   Pro  15

escape_new_lines¶

By default, all newline characters are replaced with a space, as we can see in the example below:

>>> test_text = 'Easybook\nPro\n15'
>>> ed.Text().parse(test_text)
Easybook Pro 15

Now let's set the escape_new_lines parameter to False and see what happens.

>>> test_text = 'Easybook\nPro\n15'
>>> ed.Text(escape_new_lines=False).parse(test_text)
Easybook\nPro\n15

new_line_replacement¶

If escape_new_lines is set to True, all newline characters are replaced by ' ' by default, as seen in the example above. We can change this by setting the new_line_replacement parameter.

>>> test_text = 'Easybook\nPro\n15'
>>> ed.Text(new_line_replacement='<br>').parse(test_text)
Easybook<br>Pro<br>15
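
The interplay of fix_spaces, escape_new_lines and new_line_replacement can be summarized in a short standard-library sketch. The helper whitespace_sketch is hypothetical, and the order of the two steps is an assumption; only the parameter names and defaults are taken from the signature above.

```python
import re


def whitespace_sketch(text, fix_spaces=True, escape_new_lines=True,
                      new_line_replacement=" "):
    """Hypothetical helper mirroring the three whitespace parameters."""
    if escape_new_lines:
        # Swap every newline for the configured replacement.
        text = text.replace("\n", new_line_replacement)
    if fix_spaces:
        # Collapse runs of two or more spaces into one.
        text = re.sub(r" {2,}", " ", text)
    return text


default_result = whitespace_sketch("Easybook\nPro\n15")
br_result = whitespace_sketch("Easybook\nPro\n15", new_line_replacement="<br>")
print(default_result)
print(br_result)
```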

add_stop¶

We can add a stop character at the end of the string by setting the add_stop parameter to True.

>>> test_text = 'Easybook Pro  15'
>>> ed.Text(add_stop=True).parse(test_text)
Easybook Pro 15.

By default, . is added, but we can provide a custom character if needed: instead of setting add_stop to True, we pass the character itself, as in the example below.

>>> test_text = 'Easybook Pro  15'
>>> ed.Text(add_stop='!').parse(test_text)
Easybook Pro 15!
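
The way add_stop accepts either a boolean or a string can be sketched in a few lines of plain Python. add_stop_sketch is a hypothetical helper; in particular, skipping the stop character when the text already ends with it is an assumption and is not confirmed by the examples above.

```python
def add_stop_sketch(text, add_stop=None):
    """Hypothetical helper: append a stop character when requested."""
    if not add_stop:
        return text
    # True selects the default '.', a string is used as given.
    stop = "." if add_stop is True else add_stop
    # Assumption: do not double the stop character if it is already there.
    return text if text.endswith(stop) else text + stop


with_dot = add_stop_sketch("Easybook Pro 15", add_stop=True)
with_bang = add_stop_sketch("Easybook Pro 15", add_stop="!")
print(with_dot)
print(with_bang)
```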