6.2. Text

6.2.1. Text

class easydata.parsers.text.Text(*args, normalize: bool = True, capitalize: bool = False, title: bool = False, uppercase: bool = False, lowercase: bool = False, replace_keys: Optional[list] = None, remove_keys: Optional[list] = None, split_key: Optional[Union[str, tuple]] = None, split_keys: Optional[List[Union[str, tuple]]] = None, take: Optional[int] = None, skip: Optional[int] = None, text_num_to_numeric: bool = False, language: Optional[str] = None, fix_spaces: bool = True, escape_new_lines: bool = True, new_line_replacement: str = ' ', add_stop: Optional[Union[bool, str]] = None, separator: str = ' ', index: Optional[int] = None, strip: bool = False, **kwargs)[source]

Bases: easydata.parsers.base.BaseData

Text is a parser that normalizes and manipulates short pieces of text such as titles.

Getting Started

Text supports query selectors for fetching data.

>>> test_dict = {'title': 'Easybook pro 13'}
>>> ed.Text(ed.key('title')).parse(test_dict)
'Easybook pro 13'

In this example, let's process text with bad encoding and multiple spaces between characters.

>>> ed.Text().parse('Easybook    Pro 13 <3 ünicode')
Easybook Pro 13 <3 ünicode

Floats and integers are converted to strings automatically.

>>> ed.Text().parse(123)
'123'
>>> ed.Text().parse(123.12)
'123.12'

As seen in the example above, text normalization (fixing bad encoding) is enabled by default through the normalize parameter. Set normalize to False to disable it.

>>> ed.Text(normalize=False).parse('Easybook Pro 13 &lt;3 ünicode')
Easybook Pro 13 &lt;3 ünicode
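
What normalization covers is not spelled out here, but the examples suggest it at least decodes HTML entities. As a rough illustration only, the standard-library sketch below unescapes entities and applies Unicode NFC normalization. The helper name normalize_text is hypothetical, and easydata's real implementation may do more or work differently.

```python
import html
import unicodedata


def normalize_text(text: str) -> str:
    """Hypothetical helper for illustration; not part of easydata."""
    # Turn HTML entities such as &lt; back into the characters they encode.
    unescaped = html.unescape(text)
    # Compose decomposed sequences (u + combining diaeresis) into one code point.
    return unicodedata.normalize("NFC", unescaped)


print(normalize_text("Easybook Pro 13 &lt;3 u\u0308nicode"))
```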

We can capitalize the first character of the string by setting the capitalize parameter to True. It is False by default.

>>> ed.Text(capitalize=True).parse('easybook PRO 15')
Easybook PRO 15

With the title parameter set to True, the first character of every word becomes uppercase and the remaining characters of the word become lowercase.

>>> ed.Text(title=True).parse('easybook PRO 15')
Easybook Pro 15
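
The difference between capitalize and title is easy to miss, so here is a plain-Python comparison. It uses only built-in string operations and assumes nothing about how easydata implements these options internally. Note that Python's own str.capitalize() lowercases the rest of the string, which is not what the capitalize=True output above shows.

```python
text = "easybook PRO 15"

# Upper-case only the first character and leave the rest untouched.
first_upper = text[:1].upper() + text[1:]

# str.capitalize() also lower-cases everything after the first character.
builtin_capitalize = text.capitalize()

# str.title() upper-cases the first letter of each word, lower-cases the rest.
titled = text.title()

print(first_upper)         # Easybook PRO 15
print(builtin_capitalize)  # Easybook pro 15
print(titled)              # Easybook Pro 15
```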

Setting the uppercase parameter to True converts all characters in the string to uppercase.

>>> ed.Text(uppercase=True).parse('easybook PRO 15')
EASYBOOK PRO 15

Setting the lowercase parameter to True converts all characters in the string to lowercase.

>>> ed.Text(lowercase=True).parse('easybook PRO 15')
easybook pro 15

We can replace characters or words in a string through the replace_keys parameter. replace_keys accepts a regex pattern as a lookup key and is not case sensitive.

>>> test_text = 'Easybook Pro 15'
>>> ed.Text(replace_keys=[('pro', 'Air'), ('15', '13')]).parse(test_text)
Easybook Air 13
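
To make "regex pattern, not case sensitive" concrete, here is a minimal sketch of that kind of replacement using the standard re module. The function name replace_all is hypothetical and this is not easydata's code; it only mirrors the behaviour described above.

```python
import re
from typing import List, Tuple


def replace_all(text: str, pairs: List[Tuple[str, str]]) -> str:
    """Hypothetical helper for illustration; not part of easydata."""
    for pattern, replacement in pairs:
        # Each lookup key is treated as a regex and matched case-insensitively.
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text


print(replace_all("Easybook Pro 15", [("pro", "Air"), ("15", "13")]))
# Easybook Air 13
```

Because the keys are regex patterns, characters such as . or ( in a key would need escaping (for example with re.escape) to be matched literally in this sketch.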

We can remove characters or words from a string through the remove_keys parameter. remove_keys accepts a regex pattern as a lookup key and is not case sensitive.

>>> test_text = 'Easybook Pro 15'
>>> ed.Text(remove_keys=['easy', 'pro']).parse(test_text)
book 15
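
Removing words tends to leave doubled or leading spaces behind, which is why the result above reads "book 15" rather than "book  15". The sketch below shows one way to get that effect with the standard library. The name remove_all is hypothetical, and the assumption that cleanup happens after removal is inferred from the output, not from easydata's source.

```python
import re
from typing import List


def remove_all(text: str, keys: List[str]) -> str:
    """Hypothetical helper for illustration; not part of easydata."""
    for pattern in keys:
        # Delete every case-insensitive match of the regex key.
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse the whitespace runs left behind and trim the ends.
    return re.sub(r"\s{2,}", " ", text).strip()


print(remove_all("Easybook Pro 15", ["easy", "pro"]))
# book 15
```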

Text can be split by split_key. By default, the part at split index 0 is returned.

>>> ed.Text(split_key='-').parse('easybook-pro_13')
easybook

Let's specify the split index through a tuple.

>>> ed.Text(split_key=('-', -1)).parse('easybook-pro_13')
pro_13

split_keys works in the same way as split_key, but instead of a single split key it accepts a list of keys, which are applied in order.

>>> test_text = 'easybook-pro_13'
>>> ed.Text(split_keys=[('-', -1), '_']).parse(test_text)
pro
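
The split behaviour described above can be sketched in a few lines of plain Python. The helpers apply_split and apply_splits are hypothetical names, and the sketch assumes each key is applied to the result of the previous one, which is what the description of split_keys implies.

```python
from typing import List, Tuple, Union

SplitKey = Union[str, Tuple[str, int]]


def apply_split(text: str, key: SplitKey) -> str:
    """Hypothetical helper for illustration; not part of easydata."""
    # A bare string means "split on it and keep index 0".
    sep, index = (key, 0) if isinstance(key, str) else key
    return text.split(sep)[index]


def apply_splits(text: str, keys: List[SplitKey]) -> str:
    # Apply each split key to the output of the previous one.
    for key in keys:
        text = apply_split(text, key)
    return text


print(apply_split("easybook-pro_13", "-"))                 # easybook
print(apply_split("easybook-pro_13", ("-", -1)))           # pro_13
print(apply_splits("easybook-pro_13", [("-", -1), "_"]))   # pro
```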

With the take parameter we can limit the maximum number of characters kept in the result. Let's see how it works in the example below.

>>> ed.Text(take=8).parse('Easybook Pro 13')
Easybook

With the skip parameter we can skip a given number of characters from the start. Let's see how it works in the example below.

>>> ed.Text(skip=8).parse('Easybook Pro 13')
Pro 13
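
In plain Python, take and skip correspond to string slicing. The sketch below assumes the result is also stripped of surrounding whitespace, since the skip=8 output above has no leading space; that stripping step is an inference from the example, not a documented guarantee.

```python
text = "Easybook Pro 13"

# take=8: keep at most the first 8 characters.
taken = text[:8]

# skip=8: drop the first 8 characters, then trim the leftover leading space.
skipped = text[8:].strip()

print(taken)    # Easybook
print(skipped)  # Pro 13
```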

We can convert words that describe numeric values into actual numbers by setting the text_num_to_numeric parameter to True.

>>> test_text = 'two thousand and three words for the first time'
>>> ed.Text(text_num_to_numeric=True).parse(test_text)
2003 words for the 1 time

If our text is in a different language, we need to set the language parameter accordingly. Currently the only supported languages are en, es, hi and ru.

By default, runs of multiple spaces are collapsed so that only a single space remains between characters. Let's test it in the example below:

>>> ed.Text().parse('Easybook   Pro  15')
Easybook Pro 15

Now let's set the fix_spaces parameter to False and see what happens.

>>> ed.Text(fix_spaces=False).parse('Easybook   Pro  15')
Easybook   Pro  15
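
For readers who want the same effect outside the parser, collapsing space runs is a one-line regex. The function name collapse_spaces is hypothetical, and whether easydata also touches tabs or other whitespace is not stated here, so this sketch handles plain spaces only.

```python
import re


def collapse_spaces(text: str) -> str:
    """Hypothetical helper for illustration; not part of easydata."""
    # Replace any run of two or more spaces with a single space.
    return re.sub(r" {2,}", " ", text)


print(collapse_spaces("Easybook   Pro  15"))  # Easybook Pro 15
```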

By default, all newline characters are converted to a space, as we can see in the example below:

>>> ed.Text().parse('Easybook\nPro\n15')
Easybook Pro 15

Now let's set the escape_new_lines parameter to False and see what happens.

>>> ed.Text(escape_new_lines=False).parse('Easybook\nPro\n15')
Easybook\nPro\n15

If escape_new_lines is set to True, all newline characters are replaced by ' ' by default, as seen in the earlier example. We can change this default by setting the new_line_replacement parameter.

>>> ed.Text(new_line_replacement='<br>').parse('Easybook\nPro\n15')
Easybook<br>Pro<br>15
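
The interaction between escape_new_lines and new_line_replacement amounts to a conditional string replacement. The sketch below uses a hypothetical function, replace_new_lines, whose parameter names mirror the two options; it illustrates the described behaviour and is not easydata's implementation.

```python
def replace_new_lines(
    text: str,
    escape_new_lines: bool = True,
    new_line_replacement: str = " ",
) -> str:
    """Hypothetical helper for illustration; not part of easydata."""
    if not escape_new_lines:
        # Leave newline characters exactly as they are.
        return text
    return text.replace("\n", new_line_replacement)


print(replace_new_lines("Easybook\nPro\n15"))                               # Easybook Pro 15
print(replace_new_lines("Easybook\nPro\n15", new_line_replacement="<br>"))  # Easybook<br>Pro<br>15
```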

We can add a stop character at the end of the string by setting the add_stop parameter to True.

>>> ed.Text(add_stop=True).parse('Easybook Pro  15')
Easybook Pro 15.

By default . is added, but we can provide a custom character if needed: instead of setting add_stop to True, pass the character itself, as in the example below.

>>> ed.Text(add_stop='!').parse('Easybook Pro  15')
Easybook Pro 15!
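
Since add_stop accepts either a boolean or a string, the logic behind it can be sketched as below. The function with_stop is hypothetical, and its choice not to append a second stop when the text already ends with one is an assumption of this sketch; the documentation above does not say how easydata handles that case.

```python
from typing import Optional, Union


def with_stop(text: str, add_stop: Optional[Union[bool, str]] = None) -> str:
    """Hypothetical helper for illustration; not part of easydata."""
    if not add_stop:
        # None or False: leave the text unchanged.
        return text
    # True selects the default stop character; a string is used as given.
    stop = "." if add_stop is True else add_stop
    # Assumption of this sketch: avoid doubling an existing stop.
    return text if text.endswith(stop) else text + stop


print(with_stop("Easybook Pro 15", True))  # Easybook Pro 15.
print(with_stop("Easybook Pro 15", "!"))   # Easybook Pro 15!
```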

6.2.2. Str

class easydata.parsers.text.Str(*args, normalize: bool = False, escape_new_lines: bool = False, **kwargs)[source]

Bases: easydata.parsers.text.Text

The Str parser is the same as the Text parser, but with normalize and escape_new_lines set to False by default. Because of that, Str is also faster than Text with its default parameters. Use Str when performance is critical.

>>> ed.Str().parse('Easybook\nPro\n15 &lt;3 ünicode')
Easybook\nPro\n15 &lt;3 ünicode