6.2. Text

6.2.1. Text

class easydata.parsers.text.Text(*args, normalize: bool = True, capitalize: bool = False, title: bool = False, uppercase: bool = False, lowercase: bool = False, replace_keys: Optional[list] = None, remove_keys: Optional[list] = None, split_key: Optional[Union[str, tuple]] = None, split_keys: Optional[List[Union[str, tuple]]] = None, take: Optional[int] = None, skip: Optional[int] = None, text_num_to_numeric: bool = False, language: Optional[str] = None, fix_spaces: bool = True, escape_new_lines: bool = True, new_line_replacement: str = ' ', add_stop: Optional[Union[bool, str]] = None, separator: str = ' ', index: Optional[int] = None, strip: bool = False, **kwargs)

Bases: easydata.parsers.base.BaseData

Text is a parser that normalizes and manipulates short text values such as titles.

Getting Started

Text supports query selectors for fetching data.

>>> test_dict = {'title': 'Easybook pro 13'}
>>> ed.Text(ed.key('title')).parse(test_dict)
'Easybook pro 13'

In this example, let's process text with bad encoding and multiple spaces between characters.

>>> ed.Text().parse('Easybook    Pro 13 <3 ünicode')
'Easybook Pro 13 <3 ünicode'

Floats and integers are converted to strings automatically.

>>> ed.Text().parse(123)
'123'

>>> ed.Text().parse(123.12)
'123.12'

Parameters

normalize

As seen in the example above, text normalization (fixing bad encoding) is enabled by default through the normalize parameter. Let's set normalize to False to disable it.

>>> ed.Text(normalize=False).parse('Easybook Pro 13 &lt;3 ünicode')
'Easybook Pro 13 &lt;3 ünicode'
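
To make concrete what kind of cleanup normalization refers to, here is a minimal standard-library sketch. It is not easydata's implementation: the helper name normalize_sketch and the two steps it performs (decoding HTML entities, then undoing UTF-8 text that was wrongly decoded as Latin-1) are assumptions chosen for illustration.

```python
import html


def normalize_sketch(text: str) -> str:
    """Illustrative cleanup only; the library's real normalization may differ."""
    # Decode HTML entities such as &lt; or &amp;.
    text = html.unescape(text)
    # Try to undo a common kind of mojibake: UTF-8 bytes decoded as Latin-1.
    try:
        text = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass  # not mojibake of this kind, keep the text unchanged
    return text


# "\u00c3\u00bc" is what "ü" looks like after being decoded with the wrong codec.
cleaned = normalize_sketch("Easybook Pro 13 &lt;3 \u00c3\u00bcnicode")
```

With normalize=False no cleanup of this kind is applied, which is why the example above returns its input unchanged.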

capitalize

Setting capitalize to True uppercases the first character of the string and leaves the rest unchanged. The default is False.

>>> ed.Text(capitalize=True).parse('easybook PRO 15')
'Easybook PRO 15'
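
Note that the documented result keeps PRO in uppercase. Python's built-in str.capitalize() would not, because it also lowercases the rest of the string. The sketch below reproduces the documented behavior in plain Python; capitalize_first is a hypothetical helper name, not part of easydata.

```python
def capitalize_first(text: str) -> str:
    """Uppercase only the first character and keep everything else as-is."""
    # Slicing with [:1] also works for an empty string.
    return text[:1].upper() + text[1:]


documented = capitalize_first("easybook PRO 15")  # keeps "PRO"
builtin = "easybook PRO 15".capitalize()          # lowercases "PRO"
```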

title

With title set to True, the first character of every word becomes uppercase and the remaining characters of each word become lowercase.

>>> ed.Text(title=True).parse('easybook PRO 15')
'Easybook Pro 15'
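
For this input the documented result is the same as what Python's built-in str.title() returns. Whether easydata calls str.title() internally is an assumption, so the sketch below only illustrates the built-in and one of its known caveats.

```python
import string

titled = "easybook PRO 15".title()

# Caveat of the built-in: an apostrophe starts a new "word".
apostrophe_title = "it's easybook".title()
# string.capwords splits on whitespace only, so the apostrophe is left alone.
apostrophe_capwords = string.capwords("it's easybook")
```

If titles may contain apostrophes, it is worth checking the parser's output on such input before relying on it.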

uppercase

We can convert all characters in the string to uppercase by setting the uppercase parameter to True.

>>> ed.Text(uppercase=True).parse('easybook PRO 15')
'EASYBOOK PRO 15'

lowercase

We can convert all characters in the string to lowercase by setting the lowercase parameter to True.

>>> ed.Text(lowercase=True).parse('easybook PRO 15')
'easybook pro 15'

replace_keys

We can replace characters or words in a string through the replace_keys parameter. Each lookup key in replace_keys can be a regex pattern, and matching is not case sensitive.

>>> test_text = 'Easybook Pro 15'
>>> ed.Text(replace_keys=[('pro', 'Air'), ('15', '13')]).parse(test_text)
'Easybook Air 13'
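
The case-insensitive regex replacement described above can be sketched with the standard re module. This is an illustration under assumptions, not the library's code: replace_keys_sketch is a hypothetical name, and applying the pairs one after another in list order is assumed.

```python
import re
from typing import List, Tuple


def replace_keys_sketch(text: str, replace_keys: List[Tuple[str, str]]) -> str:
    """Apply each (pattern, replacement) pair in order, ignoring case."""
    for pattern, replacement in replace_keys:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text


result = replace_keys_sketch("Easybook Pro 15", [("pro", "Air"), ("15", "13")])
```

Because the keys are treated as patterns, characters such as . or ( would need escaping (for example with re.escape) when a literal match is intended.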

remove_keys

We can remove characters or words from a string through the remove_keys parameter. Each lookup key in remove_keys can be a regex pattern, and matching is not case sensitive.

>>> test_text = 'Easybook Pro 15'
>>> ed.Text(remove_keys=['easy', 'pro']).parse(test_text)
'book 15'
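
Removing 'pro' from the middle of the string leaves two adjacent spaces behind, yet the documented result contains a single one. The sketch below assumes this comes from whitespace cleanup applied after the removal; remove_keys_sketch is a hypothetical helper, not the library's implementation.

```python
import re
from typing import List


def remove_keys_sketch(text: str, remove_keys: List[str]) -> str:
    """Delete every case-insensitive match, then tidy up leftover spaces."""
    for pattern in remove_keys:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Assumption: collapse whitespace runs and trim the ends afterwards.
    return " ".join(text.split())


result = remove_keys_sketch("Easybook Pro 15", ["easy", "pro"])
```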

split_key

Text can be split by split_key. By default, the part at index 0 is returned.

>>> ed.Text(split_key='-').parse('easybook-pro_13')
'easybook'

Let's specify the split index through a tuple.

>>> ed.Text(split_key=('-', -1)).parse('easybook-pro_13')
'pro_13'

split_keys

split_keys works the same way as split_key, but instead of a single split key it accepts a list of keys.

>>> test_text = 'easybook-pro_13'
>>> ed.Text(split_keys=[('-', -1), '_']).parse(test_text)
'pro'
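
The split examples above can be reproduced with str.split. The sketch assumes that a bare string key means index 0 and that the keys in split_keys are applied in order, each to the result of the previous split. The helper names are hypothetical.

```python
from typing import List, Tuple, Union

SplitKey = Union[str, Tuple[str, int]]


def split_by_key(text: str, split_key: SplitKey) -> str:
    """Split on a key and keep one part (index 0 unless a tuple says otherwise)."""
    if isinstance(split_key, tuple):
        key, index = split_key
    else:
        key, index = split_key, 0
    return text.split(key)[index]


def split_by_keys(text: str, split_keys: List[SplitKey]) -> str:
    """Apply several split keys one after another."""
    for split_key in split_keys:
        text = split_by_key(text, split_key)
    return text


first = split_by_key("easybook-pro_13", "-")
last = split_by_key("easybook-pro_13", ("-", -1))
chained = split_by_keys("easybook-pro_13", [("-", -1), "_"])
```

In the chained case the first key keeps 'pro_13' and the second key then keeps 'pro'.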

take

With the take parameter we can limit the maximum number of characters in the end result. Let's see how it works in the example below.

>>> ed.Text(take=8).parse('Easybook Pro 13')
'Easybook'

skip

With the skip parameter we can skip a defined number of characters from the start. Let's see how it works in the example below.

>>> ed.Text(skip=8).parse('Easybook Pro 13')
'Pro 13'
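
Both take and skip behave like string slicing. One detail worth noticing: skipping 8 characters with a plain slice leaves a leading space, so the documented result implies that surrounding whitespace is trimmed as well. The sketch below makes that assumption explicit; take_skip_sketch is a hypothetical helper, and applying skip before take is also an assumption.

```python
from typing import Optional


def take_skip_sketch(text: str, take: Optional[int] = None,
                     skip: Optional[int] = None) -> str:
    """Slice the text, then trim whitespace (trimming is an assumption)."""
    if skip is not None:
        text = text[skip:]
    if take is not None:
        text = text[:take]
    return text.strip()


taken = take_skip_sketch("Easybook Pro 13", take=8)
skipped = take_skip_sketch("Easybook Pro 13", skip=8)
raw_slice = "Easybook Pro 13"[8:]  # plain slicing keeps the leading space
```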

text_num_to_numeric

We can convert words that describe numeric values into actual numbers by setting the text_num_to_numeric parameter to True.

>>> test_text = 'two thousand and three words for the first time'
>>> ed.Text(text_num_to_numeric=True).parse(test_text)
'2003 words for the 1 time'

language

If the text is in a different language, we need to set the language parameter accordingly. Currently, the only supported languages are en, es, hi and ru.

fix_spaces

By default, runs of multiple spaces are collapsed into a single space. Let's test it in the example below:

>>> ed.Text().parse('Easybook   Pro  15')
'Easybook Pro 15'

Now let's change the fix_spaces parameter to False and see what happens.

>>> ed.Text(fix_spaces=False).parse('Easybook   Pro  15')
'Easybook   Pro  15'
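
Collapsing runs of spaces is a one-line regex in plain Python. The sketch below shows the idea only; fix_spaces_sketch is a hypothetical name, and whether the library also collapses tabs or other whitespace is not stated here, so only the space character is handled.

```python
import re


def fix_spaces_sketch(text: str) -> str:
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", text)


fixed = fix_spaces_sketch("Easybook   Pro  15")
```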

escape_new_lines

By default, all newline characters are converted to a space, as we can see in the example below:

>>> ed.Text().parse('Easybook\nPro\n15')
'Easybook Pro 15'

Now let's change the escape_new_lines parameter to False and see what happens.

>>> ed.Text(escape_new_lines=False).parse('Easybook\nPro\n15')
'Easybook\nPro\n15'

new_line_replacement

If escape_new_lines is set to True, all newline characters are replaced by ' ' by default, as seen in the example above. We can change this by setting the new_line_replacement parameter to a different value.

>>> ed.Text(new_line_replacement='<br>').parse('Easybook\nPro\n15')
'Easybook<br>Pro<br>15'
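
The interaction between escape_new_lines and new_line_replacement can be sketched as a single substitution. This is an illustration, not the library's code: the helper name is hypothetical, and treating \r\n and \r as newlines too is an assumption.

```python
import re


def escape_new_lines_sketch(text: str, new_line_replacement: str = " ") -> str:
    """Replace each newline sequence with the given replacement string."""
    # \r\n is listed first so a Windows line ending counts as one newline.
    return re.sub(r"\r\n|\r|\n", new_line_replacement, text)


default_result = escape_new_lines_sketch("Easybook\nPro\n15")
html_result = escape_new_lines_sketch("Easybook\nPro\n15", "<br>")
```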

add_stop

We can add a stop character at the end of the string by setting the add_stop parameter to True.

>>> ed.Text(add_stop=True).parse('Easybook Pro  15')
'Easybook Pro 15.'

By default, . is added, but we can provide a custom character if needed. Instead of setting add_stop to True, we pass the character itself, as we can see in the example below.

>>> ed.Text(add_stop='!').parse('Easybook Pro  15')
'Easybook Pro 15!'
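
The two accepted forms of add_stop (True or a custom character) can be sketched as follows. The helper name is hypothetical, and not appending a second stop character when the text already ends with one is an assumption rather than documented behavior.

```python
from typing import Optional, Union


def add_stop_sketch(text: str, add_stop: Optional[Union[bool, str]] = None) -> str:
    """Append '.' when add_stop is True, or the given string otherwise."""
    if not add_stop:
        return text  # None or False: leave the text alone
    stop = "." if add_stop is True else add_stop
    # Assumption: avoid doubling a stop character that is already there.
    if text.endswith(stop):
        return text
    return text + stop


with_dot = add_stop_sketch("Easybook Pro 15", add_stop=True)
with_bang = add_stop_sketch("Easybook Pro 15", add_stop="!")
```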

separator

index

strip

6.2.2. Str

class easydata.parsers.text.Str(*args, normalize: bool = False, escape_new_lines: bool = False, **kwargs)

Bases: easydata.parsers.text.Text

The Str parser is the same as the Text parser, but with normalize and escape_new_lines set to False by default. Because of that, Str with its default parameters is also much faster than Text. Use the Str parser when performance is critical.

>>> ed.Str().parse('Easybook\nPro\n15 &lt;3 ünicode')
'Easybook\nPro\n15 &lt;3 ünicode'
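
The relationship between the two classes, a subclass that only changes default values, can be sketched with two toy classes. MiniText and MiniStr are hypothetical stand-ins written for this illustration; they model just two processing steps and are not how easydata is implemented.

```python
import html


class MiniText:
    """Toy parser with two optional steps, both enabled by default."""

    def __init__(self, normalize: bool = True, escape_new_lines: bool = True):
        self.normalize = normalize
        self.escape_new_lines = escape_new_lines

    def parse(self, text: str) -> str:
        if self.normalize:
            text = html.unescape(text)  # stand-in for real normalization
        if self.escape_new_lines:
            text = text.replace("\n", " ")
        return text


class MiniStr(MiniText):
    """Same parser, but both steps are switched off unless requested."""

    def __init__(self, normalize: bool = False, escape_new_lines: bool = False):
        super().__init__(normalize=normalize, escape_new_lines=escape_new_lines)


raw = "Easybook\nPro\n15 &lt;3"
as_text = MiniText().parse(raw)
as_str = MiniStr().parse(raw)
```

Skipping both steps is what makes the defaults cheaper, and either step can still be enabled per instance, for example MiniStr(escape_new_lines=True).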