6.4. List¶
6.4.1. List¶
-
class
easydata.parsers.list.
List
(query: Optional[easydata.queries.base.QuerySearchBase] = None, parser: Optional[easydata.parsers.base.Base] = None, unique: bool = True, max_num: Optional[int] = None, split_key: Optional[Union[List[str], str]] = None, allow_parser: Optional[easydata.parsers.base.Base] = None, deny_parser: Optional[easydata.parsers.base.Base] = None, preprocess_allow: Optional[Callable] = None, process_allow: Optional[Callable] = None, **kwargs)[source]¶
List
parser returns a value of list
type. It’s main advantage is that each
value from list can be processed by other parser which is initialized together with
List
parser. For better explanation regarding this, please check further through
examples.
Getting Started¶
EXAMPLE WITH JSON DATA SOURCE:
Lets first try to parse simple json text.
test_json_text = {
'images': [
{'src': 'https://demo.com/imgs/1.jpg'},
{'src': 'https://demo.com/imgs/2.jpg'},
{'src': 'https://demo.com/imgs/3.jpg'}
]
}
List
supports any query object for fetching data. In example bellow we will
use jp
to query dict object. jp
will also automatically convert our
json text into python dictionary
or list
if it’s not already python object.
list_parser = ed.List(
ed.jp('images[].src'),
parser=ed.Url()
)
print(list_parser.parse(test_json_text))
This would print output like:
test_json_text = [
'https://demo.com/imgs/1.jpg',
'https://demo.com/imgs/2.jpg',
'https://demo.com/imgs/3.jpg'
]
We can also use selector in our Url
parser if needed. Lets demonstrate this in
example bellow.
list_parser = ed.List(
ed.jp('images'),
parser=ed.Url(
ed.jp('src')
)
)
print(list_parser.parse(test_json_text))
Printed results is also same as before.
[
'https://demo.com/imgs/1.jpg',
'https://demo.com/imgs/2.jpg',
'https://demo.com/imgs/3.jpg'
]
EXAMPLE WITH HTML DATA SOURCE:
Now lets try to parse simple HTML
text.
<div id="image-container">
<img id="image" src="https://demo.com/imgs/1.jpg">
<div id="images">
<img class="image" src="https://demo.com/imgs/1.jpg">
<img class="image" src="https://demo.com/imgs/2.jpg">
<img class="image" src="https://demo.com/imgs/3.jpg">
</div>
</div>
Lets assume that we loaded HTML
above into test_html_text
variable.
In example bellow we will use pq
to query through html nodes. pq
will also automatically convert our HTML
text into python PyQuery
object through which we can use css selectors.
list_parser = ed.List(
ed.pq('#images img::items'),
parser=ed.Url(ed.pq('::src'))
)
Please note that pq('#images img::items')
will be iterated through our List
parser and that img html node object will be passed to Url
parser upon which
pq
query selector can be used again to output final result. Since in example
above in our List
parser, we already selected with css img html node, so in
Url
parser we just add into query selector ::src
pseudo element in order
to get data from src
attribute in HTML element.
Now lets parse test_html_text
data and print our result.
print(list_parser.parse(test_html_text))
[
'https://demo.com/imgs/1.jpg',
'https://demo.com/imgs/2.jpg',
'https://demo.com/imgs/3.jpg'
]
Parameters¶
-
unique
¶
By default List
parser will ensure that all values in a returned list
are unique and that there are no duplicate values.
Lets first try to parse json text that contains duplicate image urls.
First we will demonstrate default behaviour which has by default unique
parameter set to True
.
{
'images': [
'https://demo.com/imgs/1.jpg'
'https://demo.com/imgs/2.jpg',
'https://demo.com/imgs/3.jpg',
'https://demo.com/imgs/3.jpg'
]
}
list_parser = ed.List(
ed.jp('images'),
parser=ed.Url()
)
Now lets parse test_json_text
data and print our result.
print(list_parser.parse(test_json_text))
[
'https://demo.com/imgs/1.jpg',
'https://demo.com/imgs/2.jpg',
'https://demo.com/imgs/3.jpg'
]
As we can see, all our printed list
values are unique. Now lets set unique
parameter to False
and see what happens.
list_parser = ed.List(
ed.jp('images'),
parser=ed.Url(),
unique=False
)
Now lets parse test_json_text
data and print our result.
print(list_parser.parse(test_json_text))
[
'https://demo.com/imgs/1.jpg',
'https://demo.com/imgs/2.jpg',
'https://demo.com/imgs/3.jpg',
'https://demo.com/imgs/3.jpg'
]
As we can see our list contains now two https://demo.com/imgs/3.jpg
values.
-
max_num
¶
Setting a int
value to max_num
parameter will basically ensure how many
values we want in our end list
result.
test_image_list = [
'https://demo.com/imgs/1.jpg',
'https://demo.com/imgs/2.jpg',
'https://demo.com/imgs/3.jpg'
]
list_parser = ed.List(
parser=ed.Url(),
max_num=2
)
Now lets parse test_image_list
data and print our result.
print(list_parser.parse(test_image_list))
[
'https://demo.com/imgs/1.jpg',
'https://demo.com/imgs/2.jpg'
]
As we can see, our original list
had 3 image urls in it, and now because we have
set to our parameter max_num
value of 2, we get only list
consisted of 2
image urls.
-
split_key
¶
Through split_key
we can break a text into list which be processed by List
parser.
Example:
test_text = 'https://demo.com/imgs/1.jpg,https://demo.com/imgs/2.jpg'
list_parser = ed.List(
parser=ed.Url(),
split_key=','
)
Now lets parse test_text
data and print our result.
print(list_parser.parse(test_image_list))
[
'https://demo.com/imgs/1.jpg',
'https://demo.com/imgs/2.jpg'
]
-
allow_parser
¶
-
deny_parser
¶
6.4.2. TextList¶
-
class
easydata.parsers.list.
TextList
(*args, normalize: bool = True, capitalize: bool = False, title: bool = False, uppercase: bool = False, lowercase: bool = False, replace_keys: Optional[list] = None, remove_keys: Optional[list] = None, split_text_key: Optional[Union[str, tuple]] = None, split_text_keys: Optional[List[Union[str, tuple]]] = None, take: Optional[int] = None, skip: Optional[int] = None, text_num_to_numeric: bool = False, language: Optional[str] = None, fix_spaces: bool = True, escape_new_lines: bool = True, new_line_replacement: str = ' ', add_stop: Optional[Union[bool, str]] = None, allow: Optional[Union[List[str], str]] = None, callow: Optional[Union[List[str], str]] = None, from_allow: Optional[Union[List[str], str]] = None, from_callow: Optional[Union[List[str], str]] = None, to_allow: Optional[Union[List[str], str]] = None, to_callow: Optional[Union[List[str], str]] = None, deny: Optional[Union[List[str], str]] = None, cdeny: Optional[Union[List[str], str]] = None, multiply_keys: Optional[Union[list, tuple]] = None, **kwargs)[source]¶ Bases:
easydata.parsers.list.List
TextList
extends List
parsers and therefore all parameters from it, are also
available in TextList
. TextList
output is a list
of str
.
Parameters¶
-
allow
¶
We can control which list values we want to get extracted by providing list of
keywords into allow
parameter. Provided keys are not case sensitive and regex
pattern as a key is also supported.
test_list = ['http://demo.com', 'http://demo.net', 'http://demo.eu']
list_parser = ed.List(
parser=ed.Url(),
allow=['.com', '.eu']
)
Now lets parse test_list
data and print our result.
print(list_parser.parse(test_image_list))
[
'http://demo.com',
'http://demo.eu'
]
-
callow
¶
callow
is similar to allow
but with exception that provided keys
are case sensitive. Regex pattern as a key is also supported.
test_list = ['http://demo.com', 'http://demo.net', 'http://demo.eu']
list_parser = ed.List(
parser=ed.Url(),
callow=['.COM', '.eu']
)
Now lets parse test_list
data and print our result.
print(list_parser.parse(test_image_list))
[
'http://demo.eu'
]
-
from_allow
¶
We can skip list values by providing keys in from_allow
parameter.
Keys are not case sensitive and regex pattern is also supported.
test_list = ['http://demo.com', 'http://demo.net', 'http://demo.eu']
list_parser = ed.List(
parser=ed.Url(),
from_allow=['.net']
)
Now lets parse test_list
data and print our result.
print(list_parser.parse(test_image_list))
[
'http://demo.net',
'http://demo.eu'
]
-
from_callow
¶
from_callow
is similar to from_allow
but with exception that
provided keys are case sensitive. Regex pattern as a key is also supported.
test_list = ['http://demo.com', 'http://demo.net', 'http://demo.eu']
list_parser = ed.List(
parser=ed.Url(),
from_callow=['.net']
)
Now lets parse test_list
data and print our result.
print(list_parser.parse(test_image_list))
[
'http://demo.net',
'http://demo.eu'
]
Lets recreate same example as before but with uppercase key.
test_list = ['http://demo.com', 'http://demo.net', 'http://demo.eu']
list_parser = ed.List(
parser=ed.Url(),
from_callow=['.net']
)
Now lets parse test_list
data and print our result.
print(list_parser.parse(test_image_list))
[]
-
to_allow
¶
to_allow
is similar to from_allow
but in reverse order. Here
are list values skipped after provided key is found. Keys are not case
sensitive and regex pattern is also supported.
test_list = ['http://demo.com', 'http://demo.net', 'http://demo.eu']
list_parser = ed.List(
parser=ed.Url(),
to_allow=['.eu']
)
Now lets parse test_list
data and print our result.
print(list_parser.parse(test_image_list))
[
'http://demo.com',
'http://demo.net'
]
-
to_callow
¶
to_callow
is similar to to_allow
but with exception that
provided keys are case sensitive. Regex pattern is also supported.
test_list = ['http://demo.com', 'http://demo.net', 'http://demo.eu']
list_parser = ed.List(
parser=ed.Url(),
to_callow=['.eu']
)
Now lets parse test_list
data and print our result.
print(list_parser.parse(test_image_list))
[
'http://demo.com',
'http://demo.net'
]
Lets recreate same example as before but with a uppercase key.
test_list = ['http://demo.com', 'http://demo.net', 'http://demo.eu']
list_parser = ed.List(
parser=ed.Url(),
to_callow=['.EU']
)
Now lets parse test_list
data and print our result.
print(list_parser.parse(test_list))
[
'http://demo.com',
'http://demo.net',
'http://demo.eu'
]
-
multiply_keys
¶
Setting values into multiply_keys
enables you to parse str
or a first
value from a list
into multiple values. Lets check bellow example for a
better understanding.
test_url = 'https://demo.com/imgs/1.jpg'
list_parser = ed.List(
parser=ed.Url(),
multiply_keys=[('1.jpg', ['1.jpg', '2.jpg', '3.jpg', '4.jpg'])]
)
Now lets parse test_url
data and print our result.
print(list_parser.parse(test_url))
[
'https://demo.com/imgs/1.jpg',
'https://demo.com/imgs/2.jpg',
'https://demo.com/imgs/3.jpg',
'https://demo.com/imgs/4.jpg'
]
If instead of
test_url = 'https://demo.com/imgs/1.jpg'
we would provide
test_url = ['https://demo.com/imgs/1.jpg']
or
test_url = ['https://demo.com/imgs/1.jpg', 'https://demo.com/imgs/no-image.jpg']
We would still get same result as in example above.
-
normalize
¶
-
capitalize
¶
-
title
¶
-
uppercase
¶
-
lowercase
¶
-
replace_keys
¶
-
remove_keys
¶
-
split_text_key
¶
-
split_text_keys
¶
-
take
¶
-
skip
¶
-
text_num_to_numeric
¶
-
language
¶
-
fix_spaces
¶
-
escape_new_lines
¶
-
new_line_replacement
¶
-
add_stop
¶
-
deny
¶
-
cdeny
¶
6.4.3. UrlList¶
-
class
easydata.parsers.list.
UrlList
(*args, from_text: bool = False, remove_qs: Optional[Union[str, list, bool]] = None, qs: Optional[dict] = None, domain: Optional[str] = None, protocol: Optional[str] = None, **kwargs)[source]¶
examples coming soon …
-
from_text
¶
-
remove_qs
¶
-
qs
¶
-
domain
¶
-
protocol
¶
6.4.4. EmailSearchList¶
-
class
easydata.parsers.list.
EmailSearchList
(*args, normalize: bool = True, capitalize: bool = False, title: bool = False, uppercase: bool = False, lowercase: bool = False, replace_keys: Optional[list] = None, remove_keys: Optional[list] = None, split_text_key: Optional[Union[str, tuple]] = None, split_text_keys: Optional[List[Union[str, tuple]]] = None, take: Optional[int] = None, skip: Optional[int] = None, text_num_to_numeric: bool = False, language: Optional[str] = None, fix_spaces: bool = True, escape_new_lines: bool = True, new_line_replacement: str = ' ', add_stop: Optional[Union[bool, str]] = None, allow: Optional[Union[List[str], str]] = None, callow: Optional[Union[List[str], str]] = None, from_allow: Optional[Union[List[str], str]] = None, from_callow: Optional[Union[List[str], str]] = None, to_allow: Optional[Union[List[str], str]] = None, to_callow: Optional[Union[List[str], str]] = None, deny: Optional[Union[List[str], str]] = None, cdeny: Optional[Union[List[str], str]] = None, multiply_keys: Optional[Union[list, tuple]] = None, **kwargs)[source]¶
EmailSearchList
will search for emails in a text (html,xml,json,yaml,etc.) and
return a list of validated email addresses.
examples coming soon …