6.1. Data

6.1.1. BaseData

class easydata.parsers.base.BaseData(query: Optional[Union[easydata.queries.base.QuerySearchBase, easydata.parsers.base.BaseData]] = None, from_item: Optional[str] = None, default: Optional[Any] = None, default_from_item: Optional[str] = None, source: Optional[str] = None, process_raw_value: Optional[Union[Callable, easydata.parsers.base.Base]] = None, process_value: Optional[Union[Callable, easydata.parsers.base.Base]] = None, empty_as_none: bool = False, debug: bool = False, debug_source: bool = False)[source]

Bases: easydata.parsers.base.Base, abc.ABC

Note

BaseData parser can not be instantiated, since it’s an abstract class. It’s purpose is only to be a basis for other parsers that are dependent on querying from a provided data.

BaseData parser is most basic parser that accepts only parameters that are related to selecting data from a provided data source or from other items in a model. All other parsers except clause parsers inherit directly BaseData since it provides logic to select data with queries from a provided data source and some other features that will be explained further bellow.

Hint

When you are creating you own parser, you should always inherit from BaseData. Best reference is to check other parsers how they are build upon BaseData and which methods are needed to be used in order to process selected value.

Parameters

In bellow examples we will use Data parser which inherits BaseData and is exactly the same as BaseData since BaseData can not be instantiated and it’s an abstract class.

query

Currently EasyData has 4 query components, which should cover most of situations.

  • PyQuerySearch (pq) - is a css selector based on package pyquery, which offers jquery-like syntax.

  • JMESPathSearch (jp) - is a json selector based on package jmespath, which helps you to select deeply nested data with ease. Please note that on a simple 1 level dictionaries, it’s preferred to use key query instead due to performance reasons

  • KeySearch (key) - is a simple key based selector to be used on a 1 level dictionary.

  • ReSearch (re) - is a regex based selector with regex pattern as a query.

Example:

>>> import easydata as ed

Lets parse test data dict with Data parser.

>>> test_dict = {'info': {'name': 'EasyBook pro 15'}}
>>> ed.Data(query=ed.jp('info.name')).parse(test_dict)
'EasyBook pro 15'

Since query is first parameter (also in other parsers), we can skip query key as we can see bellow.

>>> ed.Data(ed.jp('info.name')).parse(test_dict)
'EasyBook pro 15'

We can also specify multiple queries with ed.cor where data is selected from the the first matching query.

Now lets create a Data parser with multiple queries and use re query which can select content with regex pattern.

test_dict = {'info': {'name': 'EasyBook pro 15'}}

data_parser = ed.Data(
    ed.cor(
        ed.jp('info.name'),
        ed.re(r'\bpro .+'),
    ),
)

Now lets parse result.

>>> data_parser.parse(test_dict)
pro 15
from_item

from_item parameter accepts a value of another item parsers name and will get it’s value from there instead in a DataBag.

Note

from_item cannot be used in a standalone parser and it will work only if it’s used in a model.

Lets create a model which will parse HTML data. We will use a Has parser as example in this case since it inherits Data parser.

test_html = """
    <html>
        <body>
            <h2 class="name">
                John Doe autographed baseball.
            </h2>
        </body>
    </html>
"""

Now our model:

import easydata as ed


class ProductItemModel(ed.ItemModel):
    item_name = ed.Text(
        ed.pq('.name::text'),
    )

    item_signed = ed.Has(
        from_item='name',
        contains=['autographed', 'signed']
    )

Result:

>>> ProductItemModel().parse_item(test_html)
{'name': 'John Doe autographed baseball.', 'signed': True}
default

Parser can also return default value if specified, if data cannot be extracted or found by selectors.

>>> test_dict = {'info': {'brand': None}}
>>> ed.Data(query=ed.jp('info.brand'), default='EasyData').parse(test_dict)
'EasyData'
default_from_item

default_from_item works similar to default, but instead of specifying return value, we specify name of other item parser in a model, from which value will be taken.

Note

default_from_item in a similar way as from_item cannot be used in a standalone parser and it will work only if it’s used in a model.

Now as example, lets create a model which will parse data from a dict.

First dict with data.

>>> test_dict = {'info': {'name': 'EasyBook pro 15', 'description': None}}

Now model:

import easydata as ed


class ProductItemModel(ed.ItemModel):
    item_name = ed.Text(
        ed.jp('info.name'),
    )

    item_description = ed.Data(
        ed.jp('info.description'),
        default_from_item='name'
    )

Result:

>>> ProductItemModel().parse_item(test_dict)
{'name': 'EasyBook pro 15', 'description': 'EasyBook pro 15'}
source

source value by default is data of used in a model. This means that by default parser will always look into DataBag for a data key and it’s content. If we need to modify content from a different source in a DataBag, then we just need to change source value.

Note

source cannot be used in a standalone parser and it will work only if it’s used in a model.

Example:

First let create some variables, which will hold different kind of data, that we will pass later in this tutorial to a parse method in a model instance.

>>> test_dict = {"brand": "EasyData"}
>>> test_html = '<p class="name">EasyBook pro 15<p>'

Now we will create a simple ItemModel.

import easydata as ed

class ProductItemModel(ed.ItemModel):
    item_brand = ed.Data(ed.jp('brand'))

    item_name = ed.Data(ed.pq('.name'), source="html")

Now lets pass our variables, that we created before, with different kind of data to parse method and see the result.

>>> ProductItemModel().parse_item(data=test_dict, html=test_html)
{'brand': 'EasyData', 'name': 'EasyBook pro 15'}
process_raw_value

process_raw_value accepts a callable function. Provided function is fired just after value is extracted and before value is processed. It’s purpose is mostly to prepare value for processing if needed. Function will receive raw value and data bag parameter. Data bag parameter will only pass DataBag object if parser is used in a model, otherwise it’s value will be None.

test_dict = {'info': {'name': 'EasyBook pro 15'}}
data_parser = ed.Data(
    ed.jp('info.name'),
    process_raw_value=lambda value, db: "EasyData " + db
)

Lets parse test_dict and get our result.

>>> data_parser.parse(test_dict)
'EasyData EasyBook pro 15'
process_value

process_value accepts a callable function and works in a similar way as process_raw_value. Provided function is fired just before value is outputted and it’s purpose is to add final editing to a value if needed.

empty_as_none
debug
debug_source