6.4. List¶

6.4.1. List¶

class easydata.parsers.list.List(query: Optional[easydata.queries.base.QuerySearchBase] = None, parser: Optional[easydata.parsers.base.Base] = None, unique: bool = True, max_num: Optional[int] = None, split_key: Optional[Union[List[str], str]] = None, allow_parser: Optional[easydata.parsers.base.Base] = None, deny_parser: Optional[easydata.parsers.base.Base] = None, preprocess_allow: Optional[Callable] = None, process_allow: Optional[Callable] = None, **kwargs)

Bases: easydata.parsers.base.BaseData

The List parser returns a value of type list. Its main advantage is that each value in the list can be processed by another parser, which is passed to the List parser when it is initialized. The examples below show how this works.

Getting Started¶

EXAMPLE WITH JSON DATA SOURCE:

Let's first parse some simple JSON data, here already loaded as a Python dictionary.

test_json_text = {
    'images': [
        {'src': 'https://demo.com/imgs/1.jpg'},
        {'src': 'https://demo.com/imgs/2.jpg'},
        {'src': 'https://demo.com/imgs/3.jpg'}
    ]
}

List supports any query object for fetching data. In the example below we use jp to query the dict object. jp also automatically converts JSON text into a Python dictionary or list if it is not already a Python object.

import easydata as ed

list_parser = ed.List(
    ed.jp('images[].src'),
    parser=ed.Url()
)

print(list_parser.parse(test_json_text))

This prints output like:

[
    'https://demo.com/imgs/1.jpg',
    'https://demo.com/imgs/2.jpg',
    'https://demo.com/imgs/3.jpg'
]

We can also use a selector in our Url parser if needed, as the example below demonstrates.

list_parser = ed.List(
    ed.jp('images'),
    parser=ed.Url(
        ed.jp('src')
    )
)

print(list_parser.parse(test_json_text))

The printed result is the same as before.

[
    'https://demo.com/imgs/1.jpg',
    'https://demo.com/imgs/2.jpg',
    'https://demo.com/imgs/3.jpg'
]
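As a rough mental model of the two examples above, the following plain-Python sketch shows the idea of first selecting the items and then running a parser on each of them. It does not use easydata, and the names parse_list, select_items and item_parser are hypothetical helpers introduced only for illustration; it is not how List is implemented.

```python
from typing import Any, Callable, Iterable, List


def parse_list(
    data: Any,
    select_items: Callable[[Any], Iterable[Any]],
    item_parser: Callable[[Any], Any],
) -> List[Any]:
    """Select items from data, then run item_parser on each selected item."""
    return [item_parser(item) for item in select_items(data)]


test_json_text = {
    "images": [
        {"src": "https://demo.com/imgs/1.jpg"},
        {"src": "https://demo.com/imgs/2.jpg"},
        {"src": "https://demo.com/imgs/3.jpg"},
    ]
}

# Variant 1: the selection already yields the final values,
# similar in spirit to querying 'images[].src'.
urls_a = parse_list(
    test_json_text,
    select_items=lambda d: [img["src"] for img in d["images"]],
    item_parser=str,
)

# Variant 2: the selection yields dicts and the item parser picks 'src',
# similar in spirit to querying 'images' and then 'src' per item.
urls_b = parse_list(
    test_json_text,
    select_items=lambda d: d["images"],
    item_parser=lambda img: img["src"],
)

print(urls_a)
print(urls_a == urls_b)
```

Both variants produce the same list, which mirrors why the two List configurations above print identical results.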

EXAMPLE WITH HTML DATA SOURCE:

Now let's parse some simple HTML text.

<div id="image-container">
    <img id="image" src="https://demo.com/imgs/1.jpg">
    <div id="images">
        <img class="image" src="https://demo.com/imgs/1.jpg">
        <img class="image" src="https://demo.com/imgs/2.jpg">
        <img class="image" src="https://demo.com/imgs/3.jpg">
    </div>
</div>

Let's assume that we loaded the HTML above into the test_html_text variable.

In the example below we use pq to query through the HTML nodes. pq also automatically converts our HTML text into a PyQuery object, on which we can use CSS selectors.

list_parser = ed.List(
    ed.pq('#images img::items'),
    parser=ed.Url(ed.pq('::src'))
)

Note that the List parser iterates over the nodes selected by pq('#images img::items') and passes each img HTML node object to the Url parser, where a pq query selector can be applied again to produce the final result. Because the CSS selector in the List parser already selects the img nodes, the query in the Url parser only needs the ::src pseudo-element to read the src attribute of each element.

Now let's parse the test_html_text data and print the result.

print(list_parser.parse(test_html_text))

[
    'https://demo.com/imgs/1.jpg',
    'https://demo.com/imgs/2.jpg',
    'https://demo.com/imgs/3.jpg'
]
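To make the two-step idea above concrete (select the img nodes inside #images, then read src from each node), here is a minimal sketch that uses only the Python standard library. It does not use easydata or PyQuery; the class name ImageSrcCollector is hypothetical, and the approach is only an illustration of the selection logic, not of how pq works internally.

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Collect src attributes of <img> tags nested inside <div id="images">."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # div nesting depth inside #images; 0 means outside
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "div":
            # Start counting at the #images div and keep counting nested divs.
            if self.depth or attributes.get("id") == "images":
                self.depth += 1
        elif tag == "img" and self.depth:
            self.sources.append(attributes.get("src"))

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


test_html_text = """
<div id="image-container">
    <img id="image" src="https://demo.com/imgs/1.jpg">
    <div id="images">
        <img class="image" src="https://demo.com/imgs/1.jpg">
        <img class="image" src="https://demo.com/imgs/2.jpg">
        <img class="image" src="https://demo.com/imgs/3.jpg">
    </div>
</div>
"""

collector = ImageSrcCollector()
collector.feed(test_html_text)
print(collector.sources)
```

The img with id="image" sits outside #images, so it is skipped, just as the '#images img' selector skips it in the List example.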

Parameters¶

unique¶

By default, the List parser ensures that all values in the returned list are unique.

Let's parse JSON data that contains duplicate image URLs.

First we demonstrate the default behaviour, in which the unique parameter is set to True.

test_json_text = {
    'images': [
        'https://demo.com/imgs/1.jpg',
        'https://demo.com/imgs/2.jpg',
        'https://demo.com/imgs/3.jpg',
        'https://demo.com/imgs/3.jpg'
    ]
}

list_parser = ed.List(
    ed.jp('images'),
    parser=ed.Url()
)

Now let's parse the test_json_text data and print the result.

print(list_parser.parse(test_json_text))

[
    'https://demo.com/imgs/1.jpg',
    'https://demo.com/imgs/2.jpg',
    'https://demo.com/imgs/3.jpg'
]

As we can see, all printed list values are unique. Now let's set the unique parameter to False and see what happens.

list_parser = ed.List(
    ed.jp('images'),
    parser=ed.Url(),
    unique=False
)

Now let's parse the test_json_text data and print the result.

print(list_parser.parse(test_json_text))

[
    'https://demo.com/imgs/1.jpg',
    'https://demo.com/imgs/2.jpg',
    'https://demo.com/imgs/3.jpg',
    'https://demo.com/imgs/3.jpg'
]

As we can see, our list now contains two https://demo.com/imgs/3.jpg values.
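The effect of unique=True on the output shown above can be pictured as an order-preserving deduplication. The sketch below is plain Python and does not use easydata; the helper name unique_values is hypothetical, and keeping the first occurrence of each value is an assumption that happens to match the printed results.

```python
from typing import Hashable, Iterable, List


def unique_values(values: Iterable[Hashable]) -> List[Hashable]:
    """Drop duplicates while keeping the first occurrence of each value."""
    # dict preserves insertion order, so the original ordering is kept.
    return list(dict.fromkeys(values))


images = [
    "https://demo.com/imgs/1.jpg",
    "https://demo.com/imgs/2.jpg",
    "https://demo.com/imgs/3.jpg",
    "https://demo.com/imgs/3.jpg",
]

deduplicated = unique_values(images)
print(deduplicated)
```

Using dict.fromkeys rather than set() keeps the values in their original order, which matters when the list order carries meaning (for example, the first image being the main one).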

max_num¶

Setting the max_num parameter to an int value limits the number of values in the resulting list.

test_image_list = [
    'https://demo.com/imgs/1.jpg',
    'https://demo.com/imgs/2.jpg',
    'https://demo.com/imgs/3.jpg'
]

list_parser = ed.List(
    parser=ed.Url(),
    max_num=2
)

Now let's parse the test_image_list data and print the result.

print(list_parser.parse(test_image_list))

[
    'https://demo.com/imgs/1.jpg',
    'https://demo.com/imgs/2.jpg'
]

As we can see, our original list had 3 image URLs in it, and because we set the max_num parameter to 2, we get a list consisting of only 2 image URLs.
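Conceptually, the result above corresponds to truncating the list to its first max_num values. The following plain-Python sketch illustrates that; it does not use easydata, the helper name limit_values is hypothetical, and treating None as "no limit" is an assumption based on the parameter's default in the signature.

```python
from typing import Any, List, Optional


def limit_values(values: List[Any], max_num: Optional[int] = None) -> List[Any]:
    """Return at most max_num values from the start of the list."""
    if max_num is None:
        # Assumption: no limit is applied when max_num is not set.
        return list(values)
    return values[:max_num]


test_image_list = [
    "https://demo.com/imgs/1.jpg",
    "https://demo.com/imgs/2.jpg",
    "https://demo.com/imgs/3.jpg",
]

limited = limit_values(test_image_list, max_num=2)
print(limited)
```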

split_key¶

With split_key we can split a text into a list, which is then processed by the List parser.

Example:

test_text = 'https://demo.com/imgs/1.jpg,https://demo.com/imgs/2.jpg'

list_parser = ed.List(
    parser=ed.Url(),
    split_key=','
)

Now let's parse the test_text data and print the result.

print(list_parser.parse(test_text))

[
    'https://demo.com/imgs/1.jpg',
    'https://demo.com/imgs/2.jpg'
]
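The signature allows split_key to be either a single string or a list of strings. The sketch below shows one plausible way to split a text on one or several keys in plain Python. It does not use easydata; the helper name split_text is hypothetical, and both the handling of a list of keys and the stripping of empty parts are assumptions, not documented behaviour.

```python
import re
from typing import List, Union


def split_text(text: str, split_key: Union[str, List[str]]) -> List[str]:
    """Split text on one or more literal keys and drop empty parts."""
    keys = [split_key] if isinstance(split_key, str) else list(split_key)
    # Escape each key so characters like '.' or '|' are matched literally.
    pattern = "|".join(re.escape(key) for key in keys)
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


test_text = "https://demo.com/imgs/1.jpg,https://demo.com/imgs/2.jpg"

parts = split_text(test_text, ",")
print(parts)

# A list of keys splits on any of them.
mixed_parts = split_text("a.jpg, b.jpg;c.jpg", [",", ";"])
print(mixed_parts)
```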

allow_parser¶

deny_parser¶

6.4.2. TextList¶

class easydata.parsers.list.TextList(*args, normalize: bool = True, capitalize: bool = False, title: bool = False, uppercase: bool = False, lowercase: bool = False, replace_keys: Optional[list] = None, remove_keys: Optional[list] = None, split_text_key: Optional[Union[str, tuple]] = None, split_text_keys: Optional[List[Union[str, tuple]]] = None, take: Optional[int] = None, skip: Optional[int] = None, text_num_to_numeric: bool = False, language: Optional[str] = None, fix_spaces: bool = True, escape_new_lines: bool = True, new_line_replacement: str = ' ', add_stop: Optional[Union[bool, str]] = None, allow: Optional[Union[List[str], str]] = None, callow: Optional[Union[List[str], str]] = None, from_allow: Optional[Union[List[str], str]] = None, from_callow: Optional[Union[List[str], str]] = None, to_allow: Optional[Union[List[str], str]] = None, to_callow: Optional[Union[List[str], str]] = None, deny: Optional[Union[List[str], str]] = None, cdeny: Optional[Union[List[str], str]] = None, multiply_keys: Optional[Union[list, tuple]] = None, **kwargs)

Bases: easydata.parsers.list.List

TextList extends the List parser, so all of List's parameters are also available in TextList. TextList outputs a list of str.

Parameters¶

allow¶

We can control which list values are extracted by passing a list of keywords to the allow parameter. The provided keys are case-insensitive, and a regex pattern can also be used as a key.

test_list = ['http://demo.com', 'http://demo.net', 'http://demo.eu']

list_parser = ed.TextList(
    parser=ed.Url(),
    allow=['.com', '.eu']
)
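The filtering described above (case-insensitive keys that may also be regex patterns) can be pictured with the following plain-Python sketch. It does not use easydata; the helper name filter_allowed is hypothetical, and treating every key as a regex searched anywhere in the value is an assumption about the matching rules.

```python
import re
from typing import Iterable, List, Union


def filter_allowed(values: Iterable[str], allow: Union[str, List[str]]) -> List[str]:
    """Keep only values that match at least one allow key, ignoring case."""
    keys = [allow] if isinstance(allow, str) else list(allow)
    # Each key is compiled as a case-insensitive regex pattern.
    patterns = [re.compile(key, re.IGNORECASE) for key in keys]
    return [
        value
        for value in values
        if any(pattern.search(value) for pattern in patterns)
    ]


test_list = ["http://demo.com", "http://demo.net", "http://demo.eu"]

allowed = filter_allowed(test_list, [".com", ".eu"])
print(allowed)
```

In this sketch the '.' in a key is a regex wildcard rather than a literal dot, so a key such as r'\.com' would be the stricter choice when only a literal ".com" should match.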