6.4. List¶
6.4.1. List¶
-
class
easydata.parsers.list.List(query: Optional[easydata.queries.base.QuerySearchBase] = None, parser: Optional[easydata.parsers.base.Base] = None, unique: bool = True, max_num: Optional[int] = None, split_key: Optional[Union[List[str], str]] = None, allow_parser: Optional[easydata.parsers.base.Base] = None, deny_parser: Optional[easydata.parsers.base.Base] = None, preprocess_allow: Optional[Callable] = None, process_allow: Optional[Callable] = None, **kwargs)[source]¶
The List parser returns a value of list type. Its main advantage is that each
value in the list can be processed by another parser, which is initialized together
with the List parser. The examples below explain this in more detail.
Getting Started¶
EXAMPLE WITH JSON DATA SOURCE:
Let's first try to parse a simple JSON text.
test_json_text = {
'images': [
{'src': 'https://demo.com/imgs/1.jpg'},
{'src': 'https://demo.com/imgs/2.jpg'},
{'src': 'https://demo.com/imgs/3.jpg'}
]
}
List supports any query object for fetching data. In the example below we
use jp to query a dict object. jp also automatically converts our
JSON text into a Python dictionary or list if it is not already a Python object.
list_parser = ed.List(
ed.jp('images[].src'),
parser=ed.Url()
)
print(list_parser.parse(test_json_text))
This prints the following output:
[
'https://demo.com/imgs/1.jpg',
'https://demo.com/imgs/2.jpg',
'https://demo.com/imgs/3.jpg'
]
We can also use a selector in our Url parser if needed. The example below
demonstrates this.
list_parser = ed.List(
ed.jp('images'),
parser=ed.Url(
ed.jp('src')
)
)
print(list_parser.parse(test_json_text))
The printed result is the same as before.
[
'https://demo.com/imgs/1.jpg',
'https://demo.com/imgs/2.jpg',
'https://demo.com/imgs/3.jpg'
]
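For orientation, both list_parser definitions above select the same values as the plain-Python sketch below. It does not use easydata: extract_image_urls is a hypothetical helper, and json.loads merely stands in for the automatic conversion that jp is described as performing.

```python
import json


def extract_image_urls(data):
    """Return the 'src' of every entry under 'images' (hypothetical helper)."""
    # Assumption: JSON text is converted to Python objects first,
    # mirroring what the text says jp does automatically.
    if isinstance(data, str):
        data = json.loads(data)
    # Select each image entry, then pick its 'src' value.
    return [image["src"] for image in data["images"]]


test_json_text = {
    "images": [
        {"src": "https://demo.com/imgs/1.jpg"},
        {"src": "https://demo.com/imgs/2.jpg"},
        {"src": "https://demo.com/imgs/3.jpg"},
    ]
}

urls = extract_image_urls(test_json_text)
print(urls)
```

The sketch only covers selection; any validation or normalisation done by the Url parser is not reproduced here.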
EXAMPLE WITH HTML DATA SOURCE:
Now let's try to parse a simple HTML text.
<div id="image-container">
<img id="image" src="https://demo.com/imgs/1.jpg">
<div id="images">
<img class="image" src="https://demo.com/imgs/1.jpg">
<img class="image" src="https://demo.com/imgs/2.jpg">
<img class="image" src="https://demo.com/imgs/3.jpg">
</div>
</div>
Let's assume that we loaded the HTML above into the test_html_text variable.
In the example below we use pq to query HTML nodes. pq also
automatically converts our HTML text into a PyQuery object,
on which we can use CSS selectors.
list_parser = ed.List(
ed.pq('#images img::items'),
parser=ed.Url(ed.pq('::src'))
)
Note that the List parser iterates over the nodes selected by
pq('#images img::items') and passes each img node to the Url parser, where a
pq query can be applied again to produce the final value. Because the List
query already selects the img nodes, the Url parser only needs the ::src
pseudo element to read the src attribute of each node.
Now let's parse the test_html_text data and print the result.
print(list_parser.parse(test_html_text))
[
'https://demo.com/imgs/1.jpg',
'https://demo.com/imgs/2.jpg',
'https://demo.com/imgs/3.jpg'
]
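To make the selection logic concrete, here is a rough standard-library sketch that collects the src of every img nested inside the element with id "images". It is not how easydata or PyQuery work internally; ImageSrcCollector is a hypothetical class built on html.parser, used only to show why the first img (outside #images) is not part of the result.

```python
from html.parser import HTMLParser

TEST_HTML = """
<div id="image-container">
  <img id="image" src="https://demo.com/imgs/1.jpg">
  <div id="images">
    <img class="image" src="https://demo.com/imgs/1.jpg">
    <img class="image" src="https://demo.com/imgs/2.jpg">
    <img class="image" src="https://demo.com/imgs/3.jpg">
  </div>
</div>
"""


class ImageSrcCollector(HTMLParser):
    """Collect src attributes of <img> tags inside <div id="images">."""

    def __init__(self):
        super().__init__()
        self.depth_inside = 0  # > 0 while we are inside <div id="images">
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth_inside:
            if tag == "div":
                self.depth_inside += 1  # track nested divs
            elif tag == "img" and "src" in attrs:
                self.sources.append(attrs["src"])
        elif tag == "div" and attrs.get("id") == "images":
            self.depth_inside = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth_inside:
            self.depth_inside -= 1


collector = ImageSrcCollector()
collector.feed(TEST_HTML)
collector.close()
print(collector.sources)
```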
Parameters¶
-
unique¶
By default, the List parser ensures that all values in the returned list
are unique and that there are no duplicates.
Let's first parse JSON data that contains duplicate image URLs.
We start with the default behaviour, where the unique parameter
is set to True.
test_json_text = {
'images': [
'https://demo.com/imgs/1.jpg',
'https://demo.com/imgs/2.jpg',
'https://demo.com/imgs/3.jpg',
'https://demo.com/imgs/3.jpg'
]
}
list_parser = ed.List(
ed.jp('images'),
parser=ed.Url()
)
Now let's parse the test_json_text data and print the result.
print(list_parser.parse(test_json_text))
[
'https://demo.com/imgs/1.jpg',
'https://demo.com/imgs/2.jpg',
'https://demo.com/imgs/3.jpg'
]
As we can see, all printed list values are unique. Now let's set the unique
parameter to False and see what happens.
list_parser = ed.List(
ed.jp('images'),
parser=ed.Url(),
unique=False
)
Now let's parse the test_json_text data and print the result.
print(list_parser.parse(test_json_text))
[
'https://demo.com/imgs/1.jpg',
'https://demo.com/imgs/2.jpg',
'https://demo.com/imgs/3.jpg',
'https://demo.com/imgs/3.jpg'
]
As we can see, our list now contains two https://demo.com/imgs/3.jpg values.
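The effect of unique can be pictured with a small plain-Python sketch. It assumes, based on the outputs shown above, that the first occurrence of each value is kept and the original order is preserved; unique_in_order is a hypothetical helper, not part of easydata.

```python
def unique_in_order(values):
    """Drop duplicates while keeping the first occurrence of each value."""
    # dict preserves insertion order, so this keeps the original ordering.
    return list(dict.fromkeys(values))


images = [
    "https://demo.com/imgs/1.jpg",
    "https://demo.com/imgs/2.jpg",
    "https://demo.com/imgs/3.jpg",
    "https://demo.com/imgs/3.jpg",
]

deduplicated = unique_in_order(images)
print(deduplicated)
```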
-
max_num¶
Setting max_num to an int value limits the number of values
in the resulting list.
test_image_list = [
'https://demo.com/imgs/1.jpg',
'https://demo.com/imgs/2.jpg',
'https://demo.com/imgs/3.jpg'
]
list_parser = ed.List(
parser=ed.Url(),
max_num=2
)
Now let's parse the test_image_list data and print the result.
print(list_parser.parse(test_image_list))
[
'https://demo.com/imgs/1.jpg',
'https://demo.com/imgs/2.jpg'
]
As we can see, our original list had 3 image URLs in it. Because we set
max_num to 2, the resulting list consists of only 2
image URLs.
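In plain Python, the behaviour shown for max_num corresponds to keeping the leading values of the list. The sketch below assumes that the first max_num values are the ones kept, as in the output above; limit is a hypothetical helper and not easydata code.

```python
def limit(values, max_num=None):
    """Keep at most max_num leading values; None means no limit."""
    if max_num is None:
        return list(values)
    return list(values[:max_num])


test_image_list = [
    "https://demo.com/imgs/1.jpg",
    "https://demo.com/imgs/2.jpg",
    "https://demo.com/imgs/3.jpg",
]

limited = limit(test_image_list, max_num=2)
print(limited)
```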
-
split_key¶
Through split_key we can break a text into a list, which is then processed by
the List parser.
Example:
test_text = 'https://demo.com/imgs/1.jpg,https://demo.com/imgs/2.jpg'
list_parser = ed.List(
parser=ed.Url(),
split_key=','
)
Now let's parse the test_text data and print the result.
print(list_parser.parse(test_text))
[
'https://demo.com/imgs/1.jpg',
'https://demo.com/imgs/2.jpg'
]
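The splitting step can be sketched in plain Python as follows. The signature allows split_key to be a str or a list of str; how easydata treats a list is not described here, so splitting on any of the keys and dropping empty parts are assumptions of this sketch. split_text is a hypothetical helper.

```python
import re


def split_text(text, split_key):
    """Split text on one key or on any of several keys."""
    keys = [split_key] if isinstance(split_key, str) else list(split_key)
    # Escape the keys so they are treated as literal separators.
    pattern = "|".join(re.escape(key) for key in keys)
    # Assumption: surrounding whitespace and empty parts are discarded.
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


test_text = "https://demo.com/imgs/1.jpg,https://demo.com/imgs/2.jpg"
parts = split_text(test_text, ",")
print(parts)
```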
-
allow_parser¶
-
deny_parser¶
6.4.2. TextList¶
-
class
easydata.parsers.list.TextList(*args, normalize: bool = True, capitalize: bool = False, title: bool = False, uppercase: bool = False, lowercase: bool = False, replace_keys: Optional[list] = None, remove_keys: Optional[list] = None, split_text_key: Optional[Union[str, tuple]] = None, split_text_keys: Optional[List[Union[str, tuple]]] = None, take: Optional[int] = None, skip: Optional[int] = None, text_num_to_numeric: bool = False, language: Optional[str] = None, fix_spaces: bool = True, escape_new_lines: bool = True, new_line_replacement: str = ' ', add_stop: Optional[Union[bool, str]] = None, allow: Optional[Union[List[str], str]] = None, callow: Optional[Union[List[str], str]] = None, from_allow: Optional[Union[List[str], str]] = None, from_callow: Optional[Union[List[str], str]] = None, to_allow: Optional[Union[List[str], str]] = None, to_callow: Optional[Union[List[str], str]] = None, deny: Optional[Union[List[str], str]] = None, cdeny: Optional[Union[List[str], str]] = None, multiply_keys: Optional[Union[list, tuple]] = None, **kwargs)[source]¶ Bases:
easydata.parsers.list.List
TextList extends the List parser, so all List parameters are also
available in TextList. TextList outputs a list of str.
Parameters¶
-
allow¶
We can control which list values are extracted by providing a list of
keywords in the allow parameter. Keys are not case sensitive, and a regex
pattern is also supported as a key.
test_list = ['http://demo.com', 'http://demo.net', 'http://demo.eu']
list_parser = ed.List(
parser=ed.Url(),
allow=['.com', '.eu']
)
Now let's parse the test_list data and print the result.
print(list_parser.parse(test_list))
[
'http://demo.com',
'http://demo.eu'
]
-
callow¶
callow is similar to allow, except that the provided keys
are case sensitive. A regex pattern is also supported as a key.
test_list = ['http://demo.com', 'http://demo.net', 'http://demo.eu']
list_parser = ed.List(
parser=ed.Url(),
callow=['.COM', '.eu']
)
Now let's parse the test_list data and print the result.
print(list_parser.parse(test_list))
[
'http://demo.eu'
]
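The difference between allow and callow can be illustrated with a plain-Python filter. This sketch assumes every key is used as a regex pattern searched anywhere in the value, which is one reading of "regex pattern as a key is also supported"; filter_allowed and its case_sensitive flag are hypothetical and not part of easydata.

```python
import re


def filter_allowed(values, keys, case_sensitive=False):
    """Keep only values that match at least one key."""
    flags = 0 if case_sensitive else re.IGNORECASE
    patterns = [re.compile(key, flags) for key in keys]
    return [v for v in values if any(p.search(v) for p in patterns)]


test_list = ["http://demo.com", "http://demo.net", "http://demo.eu"]

# allow-like behaviour: keys are matched without regard to case.
allowed = filter_allowed(test_list, [".com", ".eu"])
# callow-like behaviour: '.COM' no longer matches 'http://demo.com'.
case_allowed = filter_allowed(test_list, [".COM", ".eu"], case_sensitive=True)

print(allowed)
print(case_allowed)
```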
-
from_allow¶
With from_allow, list values are skipped until a value matching one of the
provided keys is found; the matching value and all values after it are kept.
Keys are not case sensitive, and a regex pattern is also supported.
test_list = ['http://demo.com', 'http://demo.net', 'http://demo.eu']
list_parser = ed.List(
parser=ed.Url(),
from_allow=['.net']
)
Now let's parse the test_list data and print the result.
print(list_parser.parse(test_list))
[
'http://demo.net',
'http://demo.eu'
]
-
from_callow¶
from_callow is similar to from_allow, except that the
provided keys are case sensitive. A regex pattern is also supported as a key.
test_list = ['http://demo.com', 'http://demo.net', 'http://demo.eu']
list_parser = ed.List(
parser=ed.Url(),
from_callow=['.net']
)
Now let's parse the test_list data and print the result.
print(list_parser.parse(test_list))
[
'http://demo.net',
'http://demo.eu'
]
Let's recreate the same example as before, but with an uppercase key.
test_list = ['http://demo.com', 'http://demo.net', 'http://demo.eu']
list_parser = ed.List(
parser=ed.Url(),
from_callow=['.NET']
)
Now let's parse the test_list data and print the result.
print(list_parser.parse(test_list))
[]
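Judging by the outputs above, from_allow and from_callow behave like "drop everything before the first match", with an empty result when no key matches. The sketch below reproduces that reading in plain Python; from_first_match is a hypothetical helper, and treating keys as regex patterns is an assumption.

```python
import re


def from_first_match(values, keys, case_sensitive=False):
    """Return values starting at the first one that matches any key."""
    flags = 0 if case_sensitive else re.IGNORECASE
    patterns = [re.compile(key, flags) for key in keys]
    for index, value in enumerate(values):
        if any(p.search(value) for p in patterns):
            return list(values[index:])
    # Assumption: no match means nothing is kept.
    return []


test_list = ["http://demo.com", "http://demo.net", "http://demo.eu"]

kept = from_first_match(test_list, [".net"])
kept_case = from_first_match(test_list, [".NET"], case_sensitive=True)

print(kept)
print(kept_case)
```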
-
to_allow¶
to_allow is similar to from_allow but works in the reverse direction: the
value matching a provided key and all values after it are skipped. Keys are
not case sensitive, and a regex pattern is also supported.
test_list = ['http://demo.com', 'http://demo.net', 'http://demo.eu']
list_parser = ed.List(
parser=ed.Url(),
to_allow=['.eu']
)
Now let's parse the test_list data and print the result.
print(list_parser.parse(test_list))
[
'http://demo.com',
'http://demo.net'
]
-
to_callow¶
to_callow is similar to to_allow, except that the
provided keys are case sensitive. A regex pattern is also supported.
test_list = ['http://demo.com', 'http://demo.net', 'http://demo.eu']
list_parser = ed.List(
parser=ed.Url(),
to_callow=['.eu']
)
Now let's parse the test_list data and print the result.
print(list_parser.parse(test_list))
[
'http://demo.com',
'http://demo.net'
]
Let's recreate the same example as before, but with an uppercase key.
test_list = ['http://demo.com', 'http://demo.net', 'http://demo.eu']
list_parser = ed.List(
parser=ed.Url(),
to_callow=['.EU']
)
Now let's parse the test_list data and print the result.
print(list_parser.parse(test_list))
[
'http://demo.com',
'http://demo.net',
'http://demo.eu'
]
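The to_allow and to_callow outputs above are consistent with "keep everything before the first match, and keep the whole list when nothing matches". A plain-Python sketch of that reading follows; up_to_first_match is a hypothetical helper, and treating keys as regex patterns is an assumption.

```python
import re


def up_to_first_match(values, keys, case_sensitive=False):
    """Return the values that come before the first one matching any key."""
    flags = 0 if case_sensitive else re.IGNORECASE
    patterns = [re.compile(key, flags) for key in keys]
    for index, value in enumerate(values):
        if any(p.search(value) for p in patterns):
            return list(values[:index])
    # No key matched, so nothing is cut off.
    return list(values)


test_list = ["http://demo.com", "http://demo.net", "http://demo.eu"]

before = up_to_first_match(test_list, [".eu"])
before_case = up_to_first_match(test_list, [".EU"], case_sensitive=True)

print(before)
print(before_case)
```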
-
multiply_keys¶
Setting multiply_keys lets you expand a str, or the first
value of a list, into multiple values. The example below illustrates
this.
test_url = 'https://demo.com/imgs/1.jpg'
list_parser = ed.List(
parser=ed.Url(),
multiply_keys=[('1.jpg', ['1.jpg', '2.jpg', '3.jpg', '4.jpg'])]
)
Now let's parse the test_url data and print the result.
print(list_parser.parse(test_url))
[
'https://demo.com/imgs/1.jpg',
'https://demo.com/imgs/2.jpg',
'https://demo.com/imgs/3.jpg',
'https://demo.com/imgs/4.jpg'
]
If, instead of
test_url = 'https://demo.com/imgs/1.jpg'
we provided
test_url = ['https://demo.com/imgs/1.jpg']
or
test_url = ['https://demo.com/imgs/1.jpg', 'https://demo.com/imgs/no-image.jpg']
we would still get the same result as in the example above.
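The expansion described for multiply_keys can be sketched in plain Python as a string replacement applied to the first value. multiply is a hypothetical helper; returning the input unchanged when no key is found is an assumption of this sketch, since that case is not covered above.

```python
def multiply(value, multiply_keys):
    """Expand the first value into several by swapping a key for replacements."""
    values = [value] if isinstance(value, str) else list(value)
    first = values[0]
    for key, replacements in multiply_keys:
        if key in first:
            # Build one new value per replacement; other list items are ignored.
            return [first.replace(key, replacement) for replacement in replacements]
    # Assumption: without a matching key the input is returned as a list.
    return values


keys = [("1.jpg", ["1.jpg", "2.jpg", "3.jpg", "4.jpg"])]

from_str = multiply("https://demo.com/imgs/1.jpg", keys)
from_list = multiply(
    ["https://demo.com/imgs/1.jpg", "https://demo.com/imgs/no-image.jpg"], keys
)

print(from_str)
print(from_list)
```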
-
normalize¶
-
capitalize¶
-
title¶
-
uppercase¶
-
lowercase¶
-
replace_keys¶
-
remove_keys¶
-
split_text_key¶
-
split_text_keys¶
-
take¶
-
skip¶
-
text_num_to_numeric¶
-
language¶
-
fix_spaces¶
-
escape_new_lines¶
-
new_line_replacement¶
-
add_stop¶
-
deny¶
-
cdeny¶
6.4.3. UrlList¶
-
class
easydata.parsers.list.UrlList(*args, from_text: bool = False, remove_qs: Optional[Union[str, list, bool]] = None, qs: Optional[dict] = None, domain: Optional[str] = None, protocol: Optional[str] = None, **kwargs)[source]¶
examples coming soon …
-
from_text¶
-
remove_qs¶
-
qs¶
-
domain¶
-
protocol¶
6.4.4. EmailSearchList¶
-
class
easydata.parsers.list.EmailSearchList(*args, normalize: bool = True, capitalize: bool = False, title: bool = False, uppercase: bool = False, lowercase: bool = False, replace_keys: Optional[list] = None, remove_keys: Optional[list] = None, split_text_key: Optional[Union[str, tuple]] = None, split_text_keys: Optional[List[Union[str, tuple]]] = None, take: Optional[int] = None, skip: Optional[int] = None, text_num_to_numeric: bool = False, language: Optional[str] = None, fix_spaces: bool = True, escape_new_lines: bool = True, new_line_replacement: str = ' ', add_stop: Optional[Union[bool, str]] = None, allow: Optional[Union[List[str], str]] = None, callow: Optional[Union[List[str], str]] = None, from_allow: Optional[Union[List[str], str]] = None, from_callow: Optional[Union[List[str], str]] = None, to_allow: Optional[Union[List[str], str]] = None, to_callow: Optional[Union[List[str], str]] = None, deny: Optional[Union[List[str], str]] = None, cdeny: Optional[Union[List[str], str]] = None, multiply_keys: Optional[Union[list, tuple]] = None, **kwargs)[source]¶
EmailSearchList searches for emails in a text (HTML, XML, JSON, YAML, etc.) and
returns a list of validated email addresses.
examples coming soon …