2. Getting started¶

This guide covers getting started with the package easydata. After working through the guide you should know:

how to use ItemModel

how to assign parsers to ItemModel

how to use query selectors with parsers

basic usage of data processors

basic usage of item processors

2.1. Guide Assumptions¶

This guide is designed for beginners that haven’t worked with easydata before. There are some prerequisites for the tutorial that have to be followed:

python 3.6 and above

installing easydata package, which can be followed under Installation

2.2. Creating Model¶

We will use following html in examples below:

test_html = """
    <html>
        <body>
            <h2 class="name">
                <div class="brand">EasyData</div>
                Test Product Item
            </h2>
            <div id="description">
                <p>Basic product info. EasyData product is newest
                addition to python <b>world</b></p>
                <ul>
                    <li>Color: Black</li>
                    <li>Material: Aluminium</li>
                </ul>
            </div>
            <div id="price">Was 99.9</div>
            <div id="sale-price">49.9</div>
            <div class="images">
                <img src="http://demo.com/img1.jpg" />
                <img src="http://demo.com/img2.jpg" />
                <img src="http://demo.com/img2.jpg" />
            </div>
            <div class="stock" available="Yes">In Stock</div>
        </body>
    </html>
"""

Now lets create an ItemModel which will process html above and parse it to item dict. To select data in a text parser we will use pq, which is based on a PyQuery library with custom pseudo elements to handle output (::text, ::href, ::attr(<attr-name>), etc.).

Note

EasyData currently ships with 4 query selectors to handle various data formats:

PyQuerySearch (pq) - is a css selector which can handle HTML and XML data formats.
JMESPathSearch (jp) - is advanced json selector.
KeySearch (key) - is a simple key based selector to be used on a python dict.
ReSearch (re) - is a regex based selector with a regex pattern as a query selector.

import easydata as ed


class ProductItemModel(ed.ItemModel):
    item_name = ed.Text(
        ed.pq('.name::text'),
    )

    item_brand = ed.Text(
        ed.pq('.brand::text')
    )

    item_description = ed.Description(
        ed.pq('#description::text')
    )

    item_price = ed.PriceFloat(
        ed.pq('#price::text')
    )

    item_sale_price = ed.PriceFloat(
        ed.pq('#sale-price::text')
    )

    item_color = ed.Feature(
        ed.pq('#description::text'),
        key='color'
    )

    item_stock = ed.Has(
        ed.pq('.stock::attr(available)'),
        contains=['yes']
    )

    item_images = ed.List(
        ed.pq('.images img::items'),
        parser=ed.Url(
            ed.pq('::src')
        )
    )

    """
    Alternative shortcut to get list of image urls:

        item_images = ed.List(
            ed.pq('.images img::src-items'),
            parser=ed.Url()
        )
    """

2.3. Parsing data with Model¶

2.3.1. Calling parse to get item dict¶

In the example below we can see how the newly created ProductItemModel will parse provided HTML data into a dict object.

>>> item_model = ProductItemModel()

>>> item_model.parse_item(test_html)

Output:

{
    'brand': 'EasyData',
    'description': 'Basic product info. EasyData product is newest and greatest \
                    addition to python world. Color: Black. Material: Aluminium.',
    'color': 'Black',
    'images': [
        'http://demo.com/img1.jpg',
        'http://demo.com/img2.jpg',
        'http://demo.com/img3.jpg'
    ],
    'name': 'EasyData Test Product Item',
    'price': 99.9,
    'sale_price': 49.9,
    'stock': True
}

2.3.2. Calling parse from a method inside model¶

Advantages of calling parse from a method inside a model, is that you can put all extraction logic (making a request, reading feed file, etc.) inside item model and have better code organization.

...
import json
import requests


class ProductItemModel(ed.ItemModel):
    ...
    def store_item_from_url(product_url = None):
        if product_url:
            response = requests.get(product_url)
        else:
            # default url
            response = requests.get('http://demo.com/item-page-123')

        item_data = item_model.parse_item(response.text)

        with open("test_item.txt", "w") as text_file:
            text_file.write(json.dumps(item_data))

Now we can just use our model like this:

>>> ProductItemModel().store_item_from_url('http://demo.com/item-page-124')

with default url attribute:

>>> ProductItemModel().store_item_from_url()

and there is no need to call parse on item model object.

2.4. Adding Data Processor¶

Data processors are extensions to models which help to prepare/convert data for parser in the cases where data is more complex and with regular query selectors it cannot be selected in it’s raw form.

Tip

The greatest power of data processor usage is to build your own as a reusable piece of data converter in order to be used between different models when needed.

2.4.1. Example¶

In this example we will use following html with json info:

test_html = """
    <html>
        <body>
            <h2 class="name">
                <div class="brand">EasyData</div>
                Test Product Item
            </h2>
            <script type="text/javascript">
                var json_data = {
                    "brand": {"name": "EasyData"},
                    "name": "Test Product Item"
                };
            </script>
        </body>
    </html>
"""

Lets create our item model with data_processors included.

import easydata as ed


class ProductItemModel(ed.ItemModel):
    data_processors = [
        ed.DataJsonFromReToDictProcessor(
            query=r'var json_data = (.*?);',
            new_source='json_info'
        )
    ]

    item_name = ed.Text(
        ed.jp('name'),
        source='json_info'
    )

    item_brand = ed.Text(
        ed.jp('brand.name'),
        source='json_info'
    )

    item_css_name = ed.Text(
        ed.pq('.name::text'),
    )

>>> item_model = ProductItemModel()

>>> item_model.parse_item(test_html)

Output:

{
    'brand': 'EasyData',
    'css_name': 'EasyData Test Product Item',
    'name': 'Test Product Item'
}

2.4.2. How it works¶

Lets check how DataJsonFromReToDictProcessor in our example works in more detail.

data_processors = [
    ed.DataJsonFromReToDictProcessor(
        query=r'var json_data = (.*?);',
        new_source='json_info'
    )
]

The first parameter in DataJsonFromReToDictProcessor is our regex pattern which will extract json data from our HTML sample above.

The second parameter is new_source. This will tell our processor to store the extracted json data as a separate source and not to overwrite our HTML source. We can see in our example that the item parsers item_name and item_brand, which are selecting data from the json source, also need the source parameter specified, so that the query selectors know which source they need to select/query data from.

Example:

item_name = ed.Text(
    ed.key('name'),
    source='json_info'
)

If we didn’t set the new_source parameter in DataJsonFromReToDictProcessor, then the extracted json data would override default HTML source and the below case would throw an error because there wouldn’t be any HTML data to extract info from.

item_css_name = ed.Text(
    ed.pq('.name::text'),
)

We can also specify multiple data processors if needed:

data_processors = [
    ed.DataJsonFromReToDictProcessor(...),
    ed.DataFromQueryProcessor(...),
]

2.4.3. Default data processors¶

EasyData ships with multiple data processors to handle different case scenarios:

2.5. Adding Item Processor¶

Item processors are similar to data processor but instead of transforming data for a parser, their purpose is to modify already parsed item dictionary.

Tip

Similar to data processors, the greatest benefit is to create your own item processors and reuse them across different models. For example, you could implement validation for an item dictionary.

2.5.1. Example¶

In this example we will use following html:

test_html = """
    <html>
        <body>
            <h2 class="name">
                <div class="brand">EasyData</div>
                Test Product Item
            </h2>
            <div id="price">Was 99.9</div>
            <div id="sale-price">49.9</div>
        </body>
    </html>
"""

Lets create our item model with item_processors

import easydata as ed


class ProductItemModel(ed.ItemModel):
    item_name = ed.Text(
        ed.pq('#name::text', rm='.brand')
    )

    item_brand = ed.Text(
        ed.pq('.brand::text')
    )

    item_price = ed.PriceFloat(
        ed.pq('#price::text')
    )

    item_sale_price = ed.PriceFloat(
        ed.pq('#sale-price::text')
    )

    item_processors = [
        ed.ItemDiscountProcessor()
    ]

>>> item_model = ProductItemModel()

>>> item_model.parse_item(test_html)

Output:

{
    'brand': 'EasyData',
    'name': 'Test Product Item',
    'price': 99.9,
    'sale_price': 49.9,
    'discount': 50.05
}

2.5.2. How it works¶

Lets see how ItemDiscountProcessor works in more detail.

...
item_processors = [
    ed.ItemDiscountProcessor()
]

ItemDiscountProcessor looks for the parsed price and sale_price values in the item dictionary and calculates the discount between these two values. Finally it creates a new discount key in the item dictionary and adds the discount value to it. If our price and sale price values live under different keys under the item dictionary then the default ones are price and sale_price. All of the parameters that ItemDiscountProcessor accepts are item_price_key, item_sale_price_key, item_discount_key, decimals, no_decimals, remove_item_sale_price_key.

We can also specify multiple items processors if needed:

item_processors = [
    ed.ItemDiscountProcessor(),
    ed.ItemKeysMergeIntoDictProcessor(
        new_item_key='price_info',
        item_keys=['price', 'sale_price', 'discount'],
        preserve_original=False  # will delete keys in item dict
    )
]

item_processors in above example would produce following output:

{
    'brand': 'EasyData',
    'name': 'Test Product Item',
    'price_info': {
        'price': 99.9,
        'sale_price': 49.9,
        'discount': 50.05
    }
}

2.5.3. Default item processors¶

EasyData ships with multiple items processors to handle different case scenarios:

2.6. Next Steps¶

It’s great to have an understanding of how the data is shared between components, especially if you are planing to build custom parsers or processors. For a brief explanation to see how everything works underneath, please refer to the Architecture section.

For more advanced features please go to the Advanced section.