6.8. Description¶
By default, description parsers remove redundant spaces, capitalize sentences, fix bad encoding and add stop keys when they are missing from a sentence. They can also parse HTML tables into readable sentences and offer many options to manipulate the outcome of parsed sentences.
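Conceptually, the per-sentence cleanup can be pictured with a small standalone sketch (`normalize_sentence` is a hypothetical helper written for illustration, not part of the easydata API):

```python
import re

def normalize_sentence(sentence, stop_key="."):
    """Collapse redundant whitespace, capitalize the first letter and
    append a stop key when one is missing (illustrative only)."""
    sentence = re.sub(r"\s+", " ", sentence).strip()
    if sentence and sentence[0].islower():
        sentence = sentence[0].upper() + sentence[1:]
    if sentence and sentence[-1] not in ".!?":
        sentence += stop_key
    return sentence

print(normalize_sentence("  first   sentence"))  # First sentence.
```

The real parsers additionally fix bad encoding and handle HTML, which this sketch does not attempt.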
6.8.1. Sentences¶
-
class
easydata.parsers.desc.
Sentences
(*args, language: Optional[str] = None, allow: Optional[Union[List[str], str]] = None, callow: Optional[Union[List[str], str]] = None, from_allow: Optional[Union[List[str], str]] = None, from_callow: Optional[Union[List[str], str]] = None, to_allow: Optional[Union[List[str], str]] = None, to_callow: Optional[Union[List[str], str]] = None, deny: Optional[Union[List[str], str]] = None, cdeny: Optional[Union[List[str], str]] = None, normalize: bool = True, capitalize: bool = True, title: bool = False, uppercase: bool = False, lowercase: bool = False, min_chars: int = 5, replace_keys: Optional[list] = None, remove_keys: Optional[list] = None, replace_keys_raw_text: Optional[list] = None, remove_keys_raw_text: Optional[list] = None, split_inline_breaks: bool = True, inline_breaks: Optional[List[str]] = None, merge_sentences: bool = True, stop_key: str = '.', stop_keys_split: Optional[List[str]] = None, stop_keys_ignore: Optional[List[str]] = None, sentence_separator: str = ' ', feature_split_keys: Optional[List[str]] = None, text_num_to_numeric: bool = False, autodetect_html: bool = True, html_text_to_sentences: bool = True, css_query: Optional[str] = None, exclude_css: Optional[Union[List[str], str]] = None, **kwargs)[source]¶ Bases:
easydata.parsers.desc.BaseDescription
The Sentences parser will extract and split sentences from a given data source.
Getting Started¶
Let's first import the easydata module and use the pq selector module.
pq is a CSS query selector that uses the PyQuery library under the hood.
>>> import easydata as ed
In our first example we will show how to parse badly structured text.
>>> test_text = ' first sentence... Bad ünicode. HTML entities <3!'
>>> ed.Sentences().parse(test_text)
['First sentence...', 'Bad ünicode.', 'HTML entities <3!']
Now let's try with simple HTML text.
<div class="description">
<p><b>this</b> is description.</p>
<ul id="features">
<li>* Next-generation Thunderbolt.</li>
<li>* FaceTime HD camera </li>
</ul>
</div>
Let's assume that we loaded the HTML above into a test_html variable.
>>> ed.Sentences().parse(test_html)
['This is description.', 'Next-generation Thunderbolt.', 'FaceTime HD camera.']
Now let's use the pq selector to select specific HTML nodes to be processed.
>>> ed.Sentences(ed.pq('#features').html()).parse(test_html)
['Next-generation Thunderbolt.', 'FaceTime HD camera.']
Another example, with the pq selector ignoring specific parts of the HTML nodes:
>>> ed.Sentences(ed.pq('.description').rm('#features').html()).parse(test_html)
['This is description.']
Description parsers can also process HTML tables.
Example without a header:
<div class="description">
<p><b>this</b> is description.</p>
<table>
<tr>
<td scope="row">Type</td>
<td>Easybook Pro</td>
</tr>
<tr>
<td scope="row">Operating system</td>
<td>etOS</td>
</tr>
</table>
</div>
>>> ed.Sentences(ed.pq('.description').html()).parse(test_html)
['This is description.', 'Type: Easybook Pro.', 'Operating system: etOS.']
Example with a header:
<div class="description">
<p><b>this</b> is description.</p>
<table>
<tr>
<th>Height</th><th>Width</th><th>Depth</th>
</tr>
<tr>
<td>10</td><td>12</td><td>5</td>
</tr>
<tr>
<td>2</td><td>3</td><td>5</td>
</tr>
</table>
</div>
>>> ed.Sentences(ed.pq('.description').html()).parse(test_html)
['This is description.', 'Height/Width/Depth: 10/12/5.', 'Height/Width/Depth: 2/3/5.']
Note
When using the pq selector we must always call the .html() method so that
raw HTML is passed down to the description parser. If we call .text()
instead, all HTML tags are stripped and sentences won't be processed
correctly, because description parsers rely on HTML nodes when extracting
and structuring sentences.
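Why keeping the markup matters can be illustrated with plain Python; this is a simplified sketch of the idea, not easydata internals:

```python
import re

html = "<li>Next-generation Thunderbolt.</li><li>FaceTime HD camera</li>"

# .text()-style: stripping the tags first loses the node boundaries,
# so both list items fuse into a single run of text.
flat = re.sub(r"<[^>]+>", "", html)
print(flat)  # Next-generation Thunderbolt.FaceTime HD camera

# .html()-style: keeping the markup lets a parser split on closing
# tags before stripping them, preserving one sentence per node.
parts = [re.sub(r"<[^>]+>", "", part).strip()
         for part in html.split("</li>") if part.strip()]
print(parts)  # ['Next-generation Thunderbolt.', 'FaceTime HD camera']
```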
Parameters¶
-
language
¶
If we are parsing text in a language other than English, we need to set the language parameter so that the parser knows which language the text belongs to and splits sentences properly around abbreviations.
>>> test_text = 'primera oracion? Segunda oración. tercera oración'
>>> ed.Sentences(language='es').parse(test_text)
['Primera oracion?', 'Segunda oración.', 'Tercera oración.']
Please note that currently only the en and es language parameter values
are supported. Support for more is under way.
Note
The default value of the language parameter can be defined through the ED_LANGUAGE config variable in a config file or a model.
-
allow
¶
We can control which sentences get extracted by providing a list of
keywords in the allow parameter. Provided keys are not case sensitive.
>>> test_text = 'first sentence? Second sentence. Third sentence'
>>> ed.Sentences(allow=['first', 'third']).parse(test_text)
['First sentence?', 'Third sentence.']
A regex pattern is also supported as a parameter value:
>>> ed.Sentences(allow=[r'\bfirst']).parse(test_text)
-
callow
¶
callow is similar to allow, except that the provided keys
are case sensitive. A regex pattern as a key is also supported.
>>> test_text = 'first sentence? Second sentence. Third sentence'
>>> ed.Sentences(callow=['First', 'Third']).parse(test_text)
['Third sentence.']
-
from_allow
¶
We can skip all sentences before a provided key is matched by using the
from_allow parameter.
Keys are not case sensitive and a regex pattern is also supported.
>>> test_text = 'First txt. Second txt. Third Txt. FOUR txt.'
>>> ed.Sentences(from_allow=['second']).parse(test_text)
['Second txt.', 'Third Txt.', 'FOUR txt.']
-
from_callow
¶
from_callow is similar to from_allow, except that the
provided keys are case sensitive. A regex pattern as a key is also supported.
>>> test_text = 'First txt. Second txt. Third Txt. FOUR txt.'
>>> ed.Sentences(from_callow=['Second']).parse(test_text)
['Second txt.', 'Third Txt.', 'FOUR txt.']
Let's recreate the same example as before, but with a lowercase key.
>>> test_text = 'First txt. Second txt. Third Txt. FOUR txt.'
>>> ed.Sentences(from_callow=['second']).parse(test_text)
[]
-
to_allow
¶
to_allow is similar to from_allow, but in reverse order: sentences
are skipped after the provided key is found. Keys are not case
sensitive and a regex pattern is also supported.
>>> test_text = 'First txt. Second txt. Third Txt. FOUR txt.'
>>> ed.Sentences(to_allow=['four']).parse(test_text)
['First txt.', 'Second txt.', 'Third Txt.']
-
to_callow
¶
to_callow is similar to to_allow, except that the
provided keys are case sensitive. A regex pattern is also supported.
>>> test_text = 'First txt. Second txt. Third Txt. FOUR txt.'
>>> ed.Sentences(to_callow=['FOUR']).parse(test_text)
['First txt.', 'Second txt.', 'Third Txt.']
Let's recreate the same example as before, but with a lowercase key.
>>> test_text = 'First txt. Second txt. Third Txt. FOUR txt.'
>>> ed.Sentences(to_callow=['four']).parse(test_text)
['First txt.', 'Second txt.', 'Third Txt.', 'FOUR txt.']
-
deny
¶
We can control which sentences we don't want extracted by providing
a list of keywords in the deny parameter. Keys are not case sensitive and
a regex pattern is also supported.
>>> test_text = 'first sentence? Second sentence. Third sentence'
>>> ed.Sentences(deny=['first', 'third']).parse(test_text)
['Second sentence.']
-
cdeny
¶
cdeny is similar to deny, except that the provided keys
are case sensitive. A regex pattern as a key is also supported.
>>> test_text = 'first sentence? Second sentence. Third sentence'
>>> ed.Sentences(cdeny=['First', 'Third']).parse(test_text)
['First sentence?', 'Second sentence.']
-
normalize
¶
By default, the normalize parameter is set to True. This means that any
bad encoding will be automatically fixed, missing stop keys added and
line breaks split into sentences.
>>> test_text = ' first sentence... Bad ünicode. HTML entities <3!'
>>> ed.Sentences().parse(test_text)
['First sentence...', 'Bad ünicode.', 'HTML entities <3!']
Let's set the normalize parameter to False and see what happens.
>>> test_text = ' first sentence... Bad ünicode. HTML entities <3!'
>>> ed.Sentences(normalize=False).parse(test_text)
['First sentence...', 'Bad ünicode.', 'HTML entities <3!']
-
capitalize
¶
By default all sentences get capitalized, as we can see below.
>>> test_text = 'first sentence? Second sentence. third sentence'
>>> ed.Sentences().parse(test_text)
['First sentence?', 'Second sentence.', 'Third sentence.']
We can disable this behaviour by setting the capitalize parameter to False.
>>> test_text = 'first sentence? Second sentence. third sentence'
>>> ed.Sentences(capitalize=False).parse(test_text)
['first sentence?', 'Second sentence.', 'third sentence.']
-
title
¶
We can title-case our text output by setting the title parameter to True.
>>> test_text = 'first sentence? Second sentence. third sentence'
>>> ed.Sentences(title=True).parse(test_text)
['First Sentence?', 'Second Sentence.', 'Third Sentence.']
-
uppercase
¶
We can set our text output to uppercase by setting the uppercase parameter to True.
>>> test_text = 'first sentence? Second sentence. third sentence'
>>> ed.Sentences(uppercase=True).parse(test_text)
['FIRST SENTENCE?', 'SECOND SENTENCE.', 'THIRD SENTENCE.']
-
lowercase
¶
We can set our text output to lowercase by setting the lowercase parameter to True.
>>> test_text = 'first sentence? Second sentence. third sentence'
>>> ed.Sentences(lowercase=True).parse(test_text)
['first sentence?', 'second sentence.', 'third sentence.']
-
min_chars
¶
By default min_chars has a value of 5. This means that any sentence with
fewer than 5 characters is filtered out of the end result. This is done to
remove ambiguous sentences, especially when extracting text from HTML.
We can raise or lower this limit by changing the value of min_chars.
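As a rough sketch of the filtering (`filter_short` is a hypothetical helper, not the actual implementation):

```python
def filter_short(sentences, min_chars=5):
    """Drop sentences with fewer than min_chars characters
    (illustrative approximation of the min_chars behaviour)."""
    return [s for s in sentences if len(s) >= min_chars]

sentences = ["Ok.", "This sentence is long enough."]
print(filter_short(sentences))               # ['This sentence is long enough.']
print(filter_short(sentences, min_chars=3))  # both sentences kept
```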
-
replace_keys
¶
We can replace characters in sentences by providing tuples of a search key
and a replacement in the replace_keys parameter. A regex pattern as a key is
also supported, and search keys are not case sensitive.
>>> test_text = 'first sentence! - second sentence. Third'
>>> ed.Sentences(replace_keys=[('third', 'Last'), ('nce!', 'nce?')]).parse(test_text)
['First sentence?', 'Second sentence.', 'Last.']
-
remove_keys
¶
We can remove characters from sentences by providing a list of search keys
in the remove_keys parameter. A regex pattern as a key is also supported
and keys are not case sensitive.
>>> test_text = 'first sentence! - second sentence. Third'
>>> ed.Sentences(remove_keys=['sentence', '!']).parse(test_text)
['First.', 'Second.', 'Third.']
-
replace_keys_raw_text
¶
We can replace character values before the text is split into sentences.
This is especially useful if we want to fix text before it's parsed, so
that it is split into sentences correctly. It accepts tuples of a search
key and a replacement. Please note that keys are not case sensitive and
a regex pattern as a key is also accepted.
Let's first show the default result on badly structured text, without
setting keys in replace_keys_raw_text.
>>> test_text = 'Easybook pro 15 Color: Gray Material: Aluminium'
>>> ed.Sentences().parse(test_text)
['Easybook pro 15 Color: Gray Material: Aluminium.']
As we can see, the result is returned as a single sentence
due to missing stop keys (.) between sentences. Let's fix this by
adding stop keys into the unprocessed text before sentence splitting
happens.
>>> test_text = 'Easybook pro 15 Color: Gray Material: Aluminium'
>>> replace_keys = [('Color:', '. Color:'), ('Material:', '. Material:')]
>>> ed.Sentences(replace_keys_raw_text=replace_keys).parse(test_text)
['Easybook pro 15.', 'Color: Gray.', 'Material: Aluminium.']
-
remove_keys_raw_text
¶
Works similarly to replace_keys_raw_text, but instead of providing a list
of tuples to replace characters, here we provide a list of keys to remove.
Let's first try a sentence without setting keys in remove_keys_raw_text.
Please note that keys are not case sensitive and a regex pattern as a key
is also accepted.
>>> test_text = 'Easybook pro 15. Color: Gray'
>>> ed.Sentences().parse(test_text)
['Easybook pro 15.', 'Color: Gray.']
The text above was split into two sentences due to the stop key (.). Let's
prevent this by removing the color key and the stop key at the same time,
so that we get one sentence instead.
>>> test_text = 'Easybook pro 15. Color: Gray'
>>> ed.Sentences(remove_keys_raw_text=['. color:']).parse(test_text)
['Easybook pro 15 Gray.']
-
split_inline_breaks
¶
By default, text with characters like *, - and bullet points gets split
into sentences.
Example:
>>> test_text = '- first param - second param'
>>> ed.Sentences().parse(test_text)
['First param.', 'Second param.']
In cases where we want to disable this behaviour, we can set the
split_inline_breaks parameter to False.
>>> test_text = '- first param - second param'
>>> ed.Sentences(split_inline_breaks=False).parse(test_text)
['- first param - second param.']
Please note that characters like ., :, ? and ! are not considered
inline breaks.
-
inline_breaks
¶
In the above example we saw how the default inline breaks work. When we want
to split sentences by a different character than the default ones, we can
provide a list of characters in the inline_breaks parameter.
>>> test_text = '> first param > second param'
>>> ed.Sentences(inline_breaks=['>']).parse(test_text)
['First param.', 'Second param.']
Regex pattern is also supported as a parameter value:
>>> ed.Sentences(inline_breaks=[r'\b>']).parse(test_text)
-
stop_key
¶
If a sentence is missing a stop key at the end, a . will be appended
automatically by default. Let's see this in the example below:
>>> test_text = 'First feature <br> second feature?'
>>> ed.Sentences().parse(test_text)
['First feature.', 'Second feature?']
We can change the default . to a custom character by setting our
desired character in the stop_key parameter.
>>> test_text = 'First feature <br> second feature?'
>>> ed.Sentences(stop_key='!').parse(test_text)
['First feature!', 'Second feature?']
-
text_num_to_numeric
¶
We can convert alpha characters that describe numeric values into actual
numbers by setting the text_num_to_numeric parameter to True.
>>> test_text = 'First Sentence. Two thousand and three has it. Three Sentences.'
>>> ed.Sentences(text_num_to_numeric=True).parse(test_text)
['1 Sentence.', '2003 has it.', '3 Sentences.']
If our text is in a different language, we need to change the value of the
language parameter. Currently, the only languages supported for
text_num_to_numeric are en, es, hi and ru.
-
merge_sentences
¶
-
stop_keys_split
¶
-
stop_keys_ignore
¶
-
sentence_separator
¶
-
feature_split_keys
¶
-
autodetect_html
¶
-
html_text_to_sentences
¶
-
css_query
¶
-
exclude_css
¶
6.8.2. Description¶
-
class
easydata.parsers.desc.
Description
(*args, language: Optional[str] = None, allow: Optional[Union[List[str], str]] = None, callow: Optional[Union[List[str], str]] = None, from_allow: Optional[Union[List[str], str]] = None, from_callow: Optional[Union[List[str], str]] = None, to_allow: Optional[Union[List[str], str]] = None, to_callow: Optional[Union[List[str], str]] = None, deny: Optional[Union[List[str], str]] = None, cdeny: Optional[Union[List[str], str]] = None, normalize: bool = True, capitalize: bool = True, title: bool = False, uppercase: bool = False, lowercase: bool = False, min_chars: int = 5, replace_keys: Optional[list] = None, remove_keys: Optional[list] = None, replace_keys_raw_text: Optional[list] = None, remove_keys_raw_text: Optional[list] = None, split_inline_breaks: bool = True, inline_breaks: Optional[List[str]] = None, merge_sentences: bool = True, stop_key: str = '.', stop_keys_split: Optional[List[str]] = None, stop_keys_ignore: Optional[List[str]] = None, sentence_separator: str = ' ', feature_split_keys: Optional[List[str]] = None, text_num_to_numeric: bool = False, autodetect_html: bool = True, html_text_to_sentences: bool = True, css_query: Optional[str] = None, exclude_css: Optional[Union[List[str], str]] = None, **kwargs)[source]¶ Bases:
easydata.parsers.desc.BaseDescription
The Description parser accepts the same parameters as the Sentences parser
and works in exactly the same way, with the only difference that the
returned value is a string rather than a list of sentences.
Parameters¶
-
sentence_separator
¶
Behind the scenes, the text is always broken into a list of sentences,
which are later joined together on the final output by a separator with
a default value of ' '.
Let's see the default output in the example below:
>>> test_text = 'first sentence? Second sentence. Third sentence'
>>> ed.Description().parse(test_text)
'First sentence? Second sentence. Third sentence.'
Behind the scenes a simple join is performed on the list of sentences.
Now let's change the default ' ' value of sentence_separator to a
custom one.
>>> test_text = 'first sentence? Second sentence. Third sentence'
>>> ed.Description(sentence_separator=' > ').parse(test_text)
'First sentence? > Second sentence. > Third sentence.'
6.8.3. Features¶
-
class
easydata.parsers.desc.
Features
(*args, language: Optional[str] = None, allow: Optional[Union[List[str], str]] = None, callow: Optional[Union[List[str], str]] = None, from_allow: Optional[Union[List[str], str]] = None, from_callow: Optional[Union[List[str], str]] = None, to_allow: Optional[Union[List[str], str]] = None, to_callow: Optional[Union[List[str], str]] = None, deny: Optional[Union[List[str], str]] = None, cdeny: Optional[Union[List[str], str]] = None, normalize: bool = True, capitalize: bool = True, title: bool = False, uppercase: bool = False, lowercase: bool = False, min_chars: int = 5, replace_keys: Optional[list] = None, remove_keys: Optional[list] = None, replace_keys_raw_text: Optional[list] = None, remove_keys_raw_text: Optional[list] = None, split_inline_breaks: bool = True, inline_breaks: Optional[List[str]] = None, merge_sentences: bool = True, stop_key: str = '.', stop_keys_split: Optional[List[str]] = None, stop_keys_ignore: Optional[List[str]] = None, sentence_separator: str = ' ', feature_split_keys: Optional[List[str]] = None, text_num_to_numeric: bool = False, autodetect_html: bool = True, html_text_to_sentences: bool = True, css_query: Optional[str] = None, exclude_css: Optional[Union[List[str], str]] = None, **kwargs)[source]¶ Bases:
easydata.parsers.desc.BaseDescription
The Features parser accepts the same parameters as the Sentences parser
and works in exactly the same way, with the only difference that a list
of features is returned. Features are basically sentences that contain
a key - value pair.
Example:
>>> test_text = '- color: Black - material: Aluminium. Last Sentence'
- color: Black and - material: Aluminium. are feature sentences since they
contain a key and a value, while Last Sentence is a regular sentence.
The Features parser will try to automatically detect which are regular
sentences and which are features, and will show only a list of features
in the final output. Regular sentences are ignored.
>>> ed.Features().parse(test_text)
[('Color', 'Black'), ('Material', 'Aluminium')]
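The detection can be approximated in plain Python (a simplified sketch written for illustration; the actual Features parser is more involved):

```python
def extract_features(sentences):
    """Keep only 'key: value' sentences and split them into tuples
    (simplified approximation of the Features parser)."""
    features = []
    for sentence in sentences:
        if ":" in sentence:
            key, value = sentence.split(":", 1)
            features.append((key.strip().title(), value.strip().rstrip(".")))
    return features

print(extract_features(["Color: Black.", "Material: Aluminium.", "Last sentence."]))
# [('Color', 'Black'), ('Material', 'Aluminium')]
```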
6.8.4. FeaturesDict¶
-
class
easydata.parsers.desc.
FeaturesDict
(*args, language: Optional[str] = None, allow: Optional[Union[List[str], str]] = None, callow: Optional[Union[List[str], str]] = None, from_allow: Optional[Union[List[str], str]] = None, from_callow: Optional[Union[List[str], str]] = None, to_allow: Optional[Union[List[str], str]] = None, to_callow: Optional[Union[List[str], str]] = None, deny: Optional[Union[List[str], str]] = None, cdeny: Optional[Union[List[str], str]] = None, normalize: bool = True, capitalize: bool = True, title: bool = False, uppercase: bool = False, lowercase: bool = False, min_chars: int = 5, replace_keys: Optional[list] = None, remove_keys: Optional[list] = None, replace_keys_raw_text: Optional[list] = None, remove_keys_raw_text: Optional[list] = None, split_inline_breaks: bool = True, inline_breaks: Optional[List[str]] = None, merge_sentences: bool = True, stop_key: str = '.', stop_keys_split: Optional[List[str]] = None, stop_keys_ignore: Optional[List[str]] = None, sentence_separator: str = ' ', feature_split_keys: Optional[List[str]] = None, text_num_to_numeric: bool = False, autodetect_html: bool = True, html_text_to_sentences: bool = True, css_query: Optional[str] = None, exclude_css: Optional[Union[List[str], str]] = None, **kwargs)[source]¶ Bases:
easydata.parsers.desc.BaseDescription
The FeaturesDict parser accepts the same parameters as the Features parser
and works in exactly the same way, with the only difference that a
dictionary of features is returned instead of a list of tuples.
Example:
>>> test_text = '- color: Black - material: Aluminium. Last Sentence'
>>> ed.FeaturesDict().parse(test_text)
{'Color': 'Black', 'Material': 'Aluminium'}
6.8.5. Feature¶
-
class
easydata.parsers.desc.
Feature
(*args, key: Optional[str] = None, key_exact: Optional[str] = None, **kwargs)[source]¶ Bases:
easydata.parsers.desc.BaseDescription
examples coming soon …