6.2. Text
6.2.1. Text
class easydata.parsers.text.Text(*args, normalize: bool = True, capitalize: bool = False, title: bool = False, uppercase: bool = False, lowercase: bool = False, replace_keys: Optional[list] = None, remove_keys: Optional[list] = None, split_key: Optional[Union[str, tuple]] = None, split_keys: Optional[List[Union[str, tuple]]] = None, take: Optional[int] = None, skip: Optional[int] = None, text_num_to_numeric: bool = False, language: Optional[str] = None, fix_spaces: bool = True, escape_new_lines: bool = True, new_line_replacement: str = ' ', add_stop: Optional[Union[bool, str]] = None, separator: str = ' ', index: Optional[int] = None, strip: bool = False, **kwargs)
Text is a parser that normalizes and manipulates short pieces of text such as titles.
Getting Started
Text supports query selectors for fetching data.
>>> test_dict = {'title': 'Easybook pro 13'}
>>> ed.Text(ed.key('title')).parse(test_dict)
'Easybook pro 13'
In this example, let's process text with bad encoding and multiple spaces between characters.
>>> ed.Text().parse('Easybook Pro 13 <3 ünicode')
Easybook Pro 13 <3 ünicode
Floats and integers are converted to strings automatically.
>>> ed.Text().parse(123)
'123'
>>> ed.Text().parse(123.12)
'123.12'
Parameters
-
normalize
As the example above shows, text normalization (repairing badly encoded characters) is enabled by default through the normalize parameter. Let's set normalize to False to disable it.
>>> ed.Text(normalize=False).parse('Easybook Pro 13 <3 ünicode')
Easybook Pro 13 <3 ünicode
-
capitalize
We can capitalize the first character of our string by setting the capitalize parameter to True. The remaining characters are left unchanged. The default is False.
>>> ed.Text(capitalize=True).parse('easybook PRO 15')
Easybook PRO 15
-
title
With the title parameter set to True, the first character of every word becomes uppercase and the remaining characters of the word become lowercase.
>>> ed.Text(title=True).parse('easybook PRO 15')
Easybook Pro 15
-
uppercase
We can convert all characters in our string to uppercase by setting the uppercase parameter to True.
>>> ed.Text(uppercase=True).parse('easybook PRO 15')
EASYBOOK PRO 15
-
lowercase
We can convert all characters in our string to lowercase by setting the lowercase parameter to True.
>>> ed.Text(lowercase=True).parse('easybook PRO 15')
easybook pro 15
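For comparison, the four case options behave much like Python's built-in string methods, with one difference worth noting: str.capitalize() also lowercases the rest of the string, while the capitalize example above leaves 'PRO' intact. The sketch below is plain Python, not easydata's implementation, and capitalize_first is a hypothetical helper named only for illustration.

```python
def capitalize_first(text: str) -> str:
    """Uppercase only the first character and leave the rest untouched."""
    return text[:1].upper() + text[1:]


sample = 'easybook PRO 15'

print(capitalize_first(sample))  # Easybook PRO 15  (matches capitalize=True above)
print(sample.capitalize())       # Easybook pro 15  (built-in also lowercases the rest)
print(sample.title())            # Easybook Pro 15
print(sample.upper())            # EASYBOOK PRO 15
print(sample.lower())            # easybook pro 15
```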
-
replace_keys
We can replace characters or words in a string through the replace_keys parameter, which takes a list of (lookup key, replacement) tuples. A lookup key can be a regex pattern, and matching is not case sensitive.
>>> test_text = 'Easybook Pro 15'
>>> ed.Text(replace_keys=[('pro', 'Air'), ('15', '13')]).parse(test_text)
Easybook Air 13
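To make the regex and case-insensitivity behavior concrete, here is a minimal plain-Python sketch of the same idea using re.sub with re.IGNORECASE. It is an approximation for illustration, not easydata's actual code; replace_patterns is a hypothetical helper, and applying the pairs in list order is an assumption.

```python
import re
from typing import List, Tuple


def replace_patterns(text: str, pairs: List[Tuple[str, str]]) -> str:
    """Apply each (pattern, replacement) pair in turn, ignoring case."""
    for pattern, replacement in pairs:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text


# 'pro' matches 'Pro' because matching ignores case.
print(replace_patterns('Easybook Pro 15', [('pro', 'Air'), ('15', '13')]))  # Easybook Air 13

# A lookup key can also be a regex pattern.
print(replace_patterns('Easybook Pro 15', [(r'\d+', 'X')]))  # Easybook Pro X
```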
-
remove_keys
We can remove characters or words from a string through the remove_keys parameter. Each entry in remove_keys can be a regex pattern, and matching is not case sensitive.
>>> test_text = 'Easybook Pro 15'
>>> ed.Text(remove_keys=['easy', 'pro']).parse(test_text)
book 15
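Removing 'pro' from the middle of the text leaves two adjacent spaces behind, so the single space in the result above presumably comes from whitespace cleanup such as fix_spaces. The plain-Python sketch below reproduces that result under this assumption; remove_patterns is a hypothetical helper, not part of easydata.

```python
import re
from typing import List


def remove_patterns(text: str, patterns: List[str]) -> str:
    """Delete every case-insensitive match, then tidy up leftover whitespace."""
    for pattern in patterns:
        text = re.sub(pattern, '', text, flags=re.IGNORECASE)
    # Assumption: collapse whitespace runs and trim the ends afterwards.
    return ' '.join(text.split())


print(remove_patterns('Easybook Pro 15', ['easy', 'pro']))  # book 15
```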
-
split_key
Text can be split on split_key. By default, the part at split index 0 is returned.
>>> ed.Text(split_key='-').parse('easybook-pro_13')
easybook
Let's specify the split index by passing a (key, index) tuple.
>>> ed.Text(split_key=('-', -1)).parse('easybook-pro_13')
pro_13
-
split_keys
split_keys works the same way as split_key, but instead of a single split key it accepts a list of keys.
>>> test_text = 'easybook-pro_13'
>>> ed.Text(split_keys=[('-', -1), '_']).parse(test_text)
pro
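The splitting rules can be sketched in plain Python with str.split and an index. This is an illustration under stated assumptions, not easydata's implementation: split_once and split_many are hypothetical helpers, and applying the keys in list order is an assumption (the example above gives the same result in either order).

```python
from typing import List, Tuple, Union

SplitKey = Union[str, Tuple[str, int]]


def split_once(text: str, key: SplitKey) -> str:
    """Split on the key and return one part; a bare string means index 0."""
    separator, index = key if isinstance(key, tuple) else (key, 0)
    return text.split(separator)[index]


def split_many(text: str, keys: List[SplitKey]) -> str:
    """Apply several split keys one after another (assumed: in list order)."""
    for key in keys:
        text = split_once(text, key)
    return text


print(split_once('easybook-pro_13', '-'))                # easybook
print(split_once('easybook-pro_13', ('-', -1)))          # pro_13
print(split_many('easybook-pro_13', [('-', -1), '_']))   # pro
```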
-
take
With the take parameter we can limit the result to a maximum number of characters, counted from the start. Let's see how it works in the example below.
>>> ed.Text(take=8).parse('Easybook Pro 13')
Easybook
-
skip
With the skip parameter we can skip a given number of characters from the start. Let's see how it works in the example below.
>>> ed.Text(skip=8).parse('Easybook Pro 13')
Pro 13
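Both options correspond to ordinary string slicing. The sketch below is plain Python rather than easydata code; take_skip is a hypothetical helper, and two details are assumptions: that skip is applied before take when both are given, and that surrounding whitespace is stripped afterwards (slicing off 8 characters leaves a leading space, which the result above does not show).

```python
from typing import Optional


def take_skip(text: str, take: Optional[int] = None, skip: Optional[int] = None) -> str:
    """Drop the first `skip` characters, then keep at most `take` characters."""
    if skip is not None:
        text = text[skip:]
    if take is not None:
        text = text[:take]
    # Assumption: leading and trailing whitespace is trimmed from the result.
    return text.strip()


print(take_skip('Easybook Pro 13', take=8))          # Easybook
print(take_skip('Easybook Pro 13', skip=8))          # Pro 13
print(take_skip('Easybook Pro 13', skip=9, take=3))  # Pro
```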
-
text_num_to_numeric
We can convert words that spell out numeric values into actual numbers by setting the text_num_to_numeric parameter to True.
>>> test_text = 'two thousand and three words for the first time'
>>> ed.Text(text_num_to_numeric=True).parse(test_text)
2003 words for the 1 time
If our text is in a different language, we need to set the language parameter accordingly. Currently the only supported languages are en, es, hi and ru.
-
language
-
fix_spaces
By default, every run of multiple spaces is collapsed into a single space. Let's test it in the example below:
>>> ed.Text().parse('Easybook Pro 15')
Easybook Pro 15
Now let's set the fix_spaces parameter to False and see what happens.
>>> ed.Text(fix_spaces=False).parse('Easybook Pro 15')
Easybook Pro 15
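The effect of fix_spaces can be approximated in plain Python with a single regex substitution. The sketch below is only an illustration, not easydata's implementation; collapse_spaces is a hypothetical helper, and the sample input with extra spaces is made up for this example.

```python
import re


def collapse_spaces(text: str) -> str:
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r' {2,}', ' ', text)


sample = 'Easybook   Pro  15'  # hypothetical input with repeated spaces

print(collapse_spaces(sample))  # Easybook Pro 15
```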
-
escape_new_lines
By default, all newline characters are replaced with a single space, as we can see in the example below:
>>> ed.Text().parse('Easybook\nPro\n15')
Easybook Pro 15
Now let's set the escape_new_lines parameter to False and see what happens.
>>> ed.Text(escape_new_lines=False).parse('Easybook\nPro\n15')
Easybook\nPro\n15
-
new_line_replacement
If escape_new_lines is set to True, all newline characters are replaced by ' ' by default, as seen in the example above. We can change this default by setting the new_line_replacement parameter.
>>> ed.Text(new_line_replacement='<br>').parse('Easybook\nPro\n15')
Easybook<br>Pro<br>15
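In plain Python, the combination of escape_new_lines and new_line_replacement amounts to a string replacement. The sketch below is an approximation for illustration, not easydata code; replace_new_lines is a hypothetical helper, and it only handles '\n' because the source does not say how other line endings are treated.

```python
def replace_new_lines(text: str, replacement: str = ' ') -> str:
    """Swap every newline character for the given replacement string."""
    return text.replace('\n', replacement)


print(replace_new_lines('Easybook\nPro\n15'))          # Easybook Pro 15
print(replace_new_lines('Easybook\nPro\n15', '<br>'))  # Easybook<br>Pro<br>15
```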
-
add_stop
We can add a stop character at the end of the string by setting the add_stop parameter to True.
>>> ed.Text(add_stop=True).parse('Easybook Pro 15')
Easybook Pro 15.
By default . is added, but we can provide a custom character if needed. Instead of setting add_stop to True, we pass the character itself, as we can see in the example below.
>>> ed.Text(add_stop='!').parse('Easybook Pro 15')
Easybook Pro 15!
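The boolean-or-string behavior of add_stop can be sketched in a few lines of plain Python. This is a hedged illustration, not easydata's implementation: append_stop is a hypothetical helper, and the source does not say what happens when the text already ends with the stop character, so the sketch simply appends.

```python
from typing import Optional, Union


def append_stop(text: str, stop: Optional[Union[bool, str]] = None) -> str:
    """Append '.' when stop is True, or the given string when one is passed."""
    if stop is None or stop is False:
        return text
    stop_char = '.' if stop is True else stop
    return text + stop_char


print(append_stop('Easybook Pro 15', True))  # Easybook Pro 15.
print(append_stop('Easybook Pro 15', '!'))   # Easybook Pro 15!
print(append_stop('Easybook Pro 15'))        # Easybook Pro 15
```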
-
separator
-
index
-
strip
6.2.2. Str
class easydata.parsers.text.Str(*args, normalize: bool = False, escape_new_lines: bool = False, **kwargs)
Bases: easydata.parsers.text.Text
The Str parser is the same as the Text parser, but with normalize and escape_new_lines set to False by default. Because of that, Str is also faster with its default parameters than Text. Use the Str parser when performance is critical.
>>> ed.Str().parse('Easybook\nPro\n15 <3 ünicode')
Easybook\nPro\n15 <3 ünicode