4. Advanced
4.1. Guide Assumptions
This guide is intended for readers who have already gone through the Getting started and Architecture sections.
4.2. Creating block models
Block models are ItemModel objects with one difference: they are meant to be used
as reusable extensions that contain predefined item parsers and processors.
This functionality is easiest to explain through examples, which begin below.
4.2.1. Basic block model
Let’s first create sample HTML text stored in a test_html variable.
test_html = """
<html>
<body>
<h2 class="name">
<div class="brand">EasyData</div>
Test Product Item
</h2>
<div id="price">Was 99.9</div>
<div id="sale-price">49.9</div>
<div class="stock" available="Yes">In Stock</div>
</body>
</html>
"""
Now let’s create a block model class, which will be responsible for extracting price data from the HTML above.
import easydata as ed
class PricingBlockModel(ed.ItemModel):
    item_price = ed.PriceFloat(
        ed.pq('#price::text')
    )
    item_sale_price = ed.PriceFloat(
        ed.pq('#sale-price::text')
    )
    item_processors = [
        ('discount', ed.ItemDiscountProcessor())
    ]
As mentioned before, block models like the one above are meant to be used within an ItemModel.
Now let’s create an ItemModel that uses the block_models property, with a
PricingBlockModel instance in its list.
import easydata as ed
class ProductItemModel(ed.ItemModel):
    block_models = [
        PricingBlockModel()
    ]
    item_name = ed.Text(
        ed.pq('.name::text'),
    )
    item_brand = ed.Text(
        ed.pq('.brand::text')
    )
    item_stock = ed.Has(
        ed.pq('.stock::attr(available)'),
        contains=['yes']
    )
Now let’s parse HTML with ProductItemModel and print its output.
>>> item_model = ProductItemModel()
>>> item_model.parse(test_html)
Output:
{
'brand': 'EasyData',
'discount': 50.05,
'name': 'EasyData Test Product Item',
'price': 99.9,
'sale_price': 49.9,
'stock': True
}
As we can see from the result, the discount was calculated by the ItemDiscountProcessor,
which was added in PricingBlockModel.
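The discount value of 50.05 in the output above is consistent with a percentage discount rounded to two decimals. The following is a minimal sketch of that arithmetic in plain Python. The helper name discount_percent is hypothetical, and this is only an assumption about what the processor computes, not the ItemDiscountProcessor implementation.

```python
def discount_percent(price, sale_price, decimals=2):
    """Hypothetical helper: percentage saved relative to the original price."""
    if not price or sale_price is None:
        # Without a usable price there is nothing to compare against.
        return 0
    return round((price - sale_price) / price * 100, decimals)


# Values taken from the sample HTML above.
result = discount_percent(99.9, 49.9)
print(result)
```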
If needed, we can easily disable ItemDiscountProcessor within our ProductItemModel.
class ProductItemModel(ed.ItemModel):
    block_models = [
        PricingBlockModel()
    ]
    item_processors = [
        ('discount', None)
    ]
    ...
We can also override item_price from the PricingBlockModel in our ProductItemModel.
class ProductItemModel(ed.ItemModel):
    block_models = [
        PricingBlockModel()
    ]
    item_price = ed.PriceFloat(
        ed.pq('#price::text')
    )
    ...
4.2.2. Block models with custom parameters
We can also create reusable block models with __init__ parameters, which modify
or create parsers based on the values passed in. In most cases this is the preferred
way to create and use block models.
Example:
import easydata as ed
class PricingCssBlockModel(ed.ItemModel):
    def __init__(self,
        price_css,
        sale_price_css,
        calculate_discount=True
    ):
        self.item_price = ed.PriceFloat(
            ed.pq(price_css)
        )
        self.item_sale_price = ed.PriceFloat(
            ed.pq(sale_price_css)
        )
        if calculate_discount:
            self.item_processors.append(
                ('discount', ed.ItemDiscountProcessor())
            )
Now let’s use PricingCssBlockModel in our ProductItemModel.
class ProductItemModel(ed.ItemModel):
    block_models = [
        PricingCssBlockModel(
            price_css='#price::text',
            sale_price_css='#sale-price::text'
        )
    ]
    ...
Now let’s parse HTML with ProductItemModel and print its output.
>>> item_model = ProductItemModel()
>>> item_model.parse(test_html)
Output:
{
'brand': 'EasyData',
'discount': 50.05,
'name': 'EasyData Test Product Item',
'price': 99.9,
'sale_price': 49.9,
'stock': True
}
4.3. Model as item property
An item property in a model can hold an ItemModel object instead of a parser object.
The nested model then produces a dictionary value for that property.
In the example below we will reuse PricingCssBlockModel from the previous section.
import easydata as ed
class ProductItemModel(ed.ItemModel):
    item_name = ed.Text(
        ed.pq('.name::text'),
    )
    item_brand = ed.Text(
        ed.pq('.brand::text')
    )
    item_pricing = PricingCssBlockModel(
        price_css='#price::text',
        sale_price_css='#sale-price::text'
    )
    item_stock = ed.Has(
        ed.pq('.stock::attr(available)'),
        contains=['yes']
    )
Now let’s parse HTML with ProductItemModel and print its output.
>>> item_model = ProductItemModel()
>>> item_model.parse(test_html) # test_html from previous section
Output:
{
'brand': 'EasyData',
'name': 'EasyData Test Product Item',
'pricing': {
'discount': 50.05,
'price': 99.9,
'sale_price': 49.9,
},
'stock': True
}
4.4. Advanced processor utilization
4.4.1. Named processors
We are already familiar with item and data processors from the Getting started section. Here we explain how to use named item and data processors, so that processors coming from block models, or added dynamically through a custom model initialization, can be overridden or removed.
For starters, let’s create a block model without any named item processors.
class PricingBlockModel(ed.ItemModel):
    item_price = ed.PriceFloat(
        ed.pq('#price::text')
    )
    item_sale_price = ed.PriceFloat(
        ed.pq('#sale-price::text')
    )
    item_processors = [
        ed.ItemDiscountProcessor()
    ]
Now if we wanted to override ItemDiscountProcessor in our item model, that
wouldn’t be possible. Let’s see what happens if we add another ItemDiscountProcessor
with custom parameters to our model.
class ProductItemModel(ed.ItemModel):
    block_models = [
        PricingBlockModel()
    ]
    item_processors = [
        ed.ItemDiscountProcessor(no_decimals=True)
    ]
    ...
In this case the ItemDiscountProcessor from our ProductItemModel is joined
together with the same processor from the PricingBlockModel. For a better understanding,
here is how the item_processors list looks behind the scenes.
[
    ed.ItemDiscountProcessor(),
    ed.ItemDiscountProcessor(no_decimals=True)
]
As we can see, there are two ItemDiscountProcessor instances, while we want only the
one from our model and want to ignore the one from PricingBlockModel.
Named processors solve this issue. Let’s recreate our
PricingBlockModel, but this time we will give the ItemDiscountProcessor a name.
class PricingBlockModel(ed.ItemModel):
    item_price = ed.PriceFloat(
        ed.pq('#price::text')
    )
    item_sale_price = ed.PriceFloat(
        ed.pq('#sale-price::text')
    )
    item_processors = [
        ('discount', ed.ItemDiscountProcessor())
    ]
Now if we want to override the discount item processor from the PricingBlockModel in our model,
we just need to give our ItemDiscountProcessor the same name it has in PricingBlockModel.
class ProductItemModel(ed.ItemModel):
    block_models = [
        PricingBlockModel()
    ]
    item_processors = [
        ('discount', ed.ItemDiscountProcessor(no_decimals=True))
    ]
    ...
Now only the ItemDiscountProcessor from our model gets processed.
We can even remove the ItemDiscountProcessor that comes from the PricingBlockModel by
pairing its name with None in the tuple, as we can see in the example below.
class ProductItemModel(ed.ItemModel):
    block_models = [
        PricingBlockModel()
    ]
    item_processors = [
        ('discount', None)
    ]
    ...
Now the discount won’t be calculated at all.
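The override and removal rules described in this section can be illustrated with a small plain-Python sketch. The function merge_processors and the string stand-ins for processors are hypothetical. This is a model of the documented behaviour under those assumptions, not easydata's internal code.

```python
def merge_processors(block_processors, model_processors):
    """Combine two processor lists; named entries override or remove by name."""
    merged = list(block_processors)
    for entry in model_processors:
        if isinstance(entry, tuple):
            name, processor = entry
            # Look for an existing named entry with the same name.
            index = next(
                (i for i, existing in enumerate(merged)
                 if isinstance(existing, tuple) and existing[0] == name),
                None,
            )
            if index is not None:
                if processor is None:
                    del merged[index]      # ('discount', None) removes it
                else:
                    merged[index] = entry  # same name replaces it
                continue
            if processor is None:
                continue                   # nothing to remove
        merged.append(entry)               # unnamed entries are simply joined
    return merged


# Strings stand in for processor instances.
unnamed = merge_processors(['discount()'], ['discount(no_decimals=True)'])
overridden = merge_processors(
    [('discount', 'discount()')],
    [('discount', 'discount(no_decimals=True)')],
)
removed = merge_processors([('discount', 'discount()')], [('discount', None)])
print(unnamed, overridden, removed)
```

Unnamed entries end up side by side, while a named entry replaces or removes the entry with the same name.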
4.5. Protected items
Sometimes we don’t want values from item attributes to appear in the final
result, but we still need them because item processors or other item parsers
rely on them. To solve this elegantly, we can start our item property names
with _item instead of item; items declared with that prefix are deleted from the final output.
Let’s demonstrate this in the example below.
class ProductItemModel(ed.ItemModel):
    _item_price = ed.PriceFloat(
        ed.pq('#price::text')
    )
    _item_sale_price = ed.PriceFloat(
        ed.pq('#sale-price::text')
    )
    item_processors = [
        ed.ItemDiscountProcessor()
    ]
Now let’s parse our ProductItemModel and print its output.
>>> item_model = ProductItemModel()
>>> item_model.parse(test_html) # test_html from previous section
Output:
{
'discount': 50.05
}
As we can see in the result above, only 'discount' and its value are returned.
The 'price' and 'sale_price' key/value pairs were deleted, but only after
they had been processed by the item processors.
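The following plain-Python sketch illustrates the idea of protected items: values stay available while processors run and are dropped only at the very end. The function finalize_item and the inline discount calculation are hypothetical stand-ins, not easydata's implementation.

```python
def finalize_item(parsed):
    """Turn parsed property values into an item, dropping protected keys last."""
    item = {}
    protected = set()
    for prop, value in parsed.items():
        if prop.startswith('_item_'):
            key = prop[len('_item_'):]
            protected.add(key)             # remember which keys to drop later
        else:
            key = prop[len('item_'):]
        item[key] = value

    # Stand-in for an item processor: it can still read the protected values.
    price, sale_price = item['price'], item['sale_price']
    item['discount'] = round((price - sale_price) / price * 100, 2)

    # Protected keys are removed only after processing has finished.
    return {key: value for key, value in item.items() if key not in protected}


result = finalize_item({'_item_price': 99.9, '_item_sale_price': 49.9})
print(result)
```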
4.6. Item method
In some cases the data is too complex for our item parsers to extract a value properly. In those cases we can define item methods instead of assigning a parser instance to the model property.
For clarity, let’s first look at a parser instance on a model property.
class ProductItemModel(ed.ItemModel):
    item_brand = ed.Text(ed.jp('brand'))
Now, instead of defining a Text parser on an item property, we
will create an item method that produces exactly the same end result.
class ProductItemModel(ed.ItemModel):
    def item_brand(self, data: DataBag):
        return data['data']['brand']
4.7. Data processing
It’s encouraged that you create your own data processors to modify data, so that
custom processors can be reused between models. For specific edge cases, which
hopefully do not occur often, we can instead override the preprocess_data or
process_data methods of the ItemModel class. The example below shows both methods.
In the example below we have badly structured JSON text with a missing closing brace,
so it cannot be converted to a dict. With preprocess_data we can fix the text before
it is processed by data_processors. Later, once the JSON has been parsed into a
dictionary by DataJsonToDictProcessor, we modify this dictionary in the process_data
method so that the item parsers can use it.
test_json_text = '{"brand": "EasyData"'
Now let’s create our model, which will process test_json_text. Its
preprocess_data method fixes the bad JSON so that a processor can convert it
into a dictionary, and its process_data method creates a new data source
called brand_type.
class ProductItemModel(ed.ItemModel):
    item_brand = ed.Text(ed.jp('brand'))
    item_brand_type = ed.Text(source='brand_type')
    data_processors = [
        ed.DataJsonToDictProcessor()
    ]
    def preprocess_data(self, data):
        data['data'] = data['data'] + '}'
        return data
    def process_data(self, data):
        if 'easydata' in data['data']['brand'].lower():
            data['brand_type'] = 'local'
        else:
            data['brand_type'] = 'other'
        return data
Now let’s parse our test_json_text with ProductItemModel and show its output.
>>> item_model = ProductItemModel()
>>> item_model.parse(test_json_text)
Output:
{
'brand': 'EasyData',
'brand_type': 'local'
}
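To make the order of the steps above concrete, here is a minimal plain-Python sketch: preprocess the raw text, convert it with a JSON step, then post-process the resulting dictionary. The functions json_to_dict and run_pipeline are hypothetical stand-ins for DataJsonToDictProcessor and the model's internal flow, which is assumed rather than taken from the easydata source.

```python
import json


def preprocess_data(data):
    # Repair the truncated JSON text before any processor sees it.
    data['data'] = data['data'] + '}'
    return data


def json_to_dict(data):
    # Stand-in for a JSON-to-dict data processor.
    data['data'] = json.loads(data['data'])
    return data


def process_data(data):
    # Add a new data source derived from the parsed dictionary.
    is_local = 'easydata' in data['data']['brand'].lower()
    data['brand_type'] = 'local' if is_local else 'other'
    return data


def run_pipeline(raw_text):
    data = {'data': raw_text}
    for step in (preprocess_data, json_to_dict, process_data):
        data = step(data)
    return data


result = run_pipeline('{"brand": "EasyData"')
print(result)
```

If the preprocess step were skipped, json.loads would raise an error on the unterminated text, which is why the fix has to happen before the processors run.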
4.8. Item processing
As with data_processors, it’s encouraged that you create your
own item processors to modify the item dictionary, so that custom processors can be
reused between models. For specific edge cases, which hopefully do not occur often,
we can instead override the preprocess_item or process_item methods of the
ItemModel class.
The example below shows both methods.
test_dict = {
'price': 999.9,
'sale_price': 1
}
Now let’s create our model, which will process our test_dict. With preprocess_item
we modify the item dictionary before the item_processors are fired, so that the
items are ready to be used by them. In the example below we fix a wrong sale
price so that ItemDiscountProcessor can calculate the discount properly. Later, in the
process_item method, we create a new dictionary item, final_sale, with a bool
value that indicates whether the price is discounted.
class ProductItemModel(ed.ItemModel):
    item_price = ed.PriceFloat(ed.jp('price'))
    _item_sale_price = ed.PriceFloat(ed.jp('sale_price'))
    item_processors = [
        ed.ItemDiscountProcessor()
    ]
    def preprocess_item(self, item):
        if item['sale_price'] <= 1:
            item['sale_price'] = 0
        return item
    def process_item(self, item):
        item['final_sale'] = bool(item['discount'])
        return item
Now let’s parse our test_dict with ProductItemModel and show its output.
>>> item_model = ProductItemModel()
>>> item_model.parse(test_dict)
Output:
{
'discount': 0,
'final_sale': False,
'price': 999.9
}
Note
Please note that sale_price is missing from the final output because we declared our sale price property as protected in the model. Protected items get deleted at the end, but they are still accessible in preprocess_item, item_processors and process_item.
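The order of the item hooks can be sketched in plain Python as follows. The discount_processor function is a hypothetical stand-in for ItemDiscountProcessor, and treating a sale price of 0 as "no sale" is an assumption chosen to match the documented output, not confirmed library behaviour.

```python
def preprocess_item(item):
    # Fix an obviously wrong sale price before processors use it.
    if item['sale_price'] <= 1:
        item['sale_price'] = 0
    return item


def discount_processor(item):
    price, sale_price = item['price'], item['sale_price']
    # Assumption: a sale price of 0 means there is no sale, so no discount.
    if sale_price:
        item['discount'] = round((price - sale_price) / price * 100, 2)
    else:
        item['discount'] = 0
    return item


def process_item(item):
    item['final_sale'] = bool(item['discount'])
    return item


def run_item_pipeline(item, protected=('sale_price',)):
    item = dict(item)  # work on a copy so the input stays untouched
    for step in (preprocess_item, discount_processor, process_item):
        item = step(item)
    # Protected keys are dropped only after every hook has run.
    return {key: value for key, value in item.items() if key not in protected}


result = run_item_pipeline({'price': 999.9, 'sale_price': 1})
print(result)
```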
4.9. With items
ItemModel has an option to retrieve multiple items from a provided data source.
4.10. Item Validation
easydata does not come with a validation solution, since its main purpose is to
transform data. It’s easy to create your own, either via a custom item processor
that handles validation or by validating after the model returns a dict item.
Some validation libraries that we recommend:
Schematics: validation library based on ORM-like models.
JSON Schema: validation library based on JSON schema.
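As a minimal sketch of validating after the model has returned a dict item, the plain-Python check below verifies that required keys exist and have the expected types. The function validate_item and the schema dictionary are hypothetical; a real project would more likely use one of the libraries listed above.

```python
def validate_item(item, schema):
    """Return a list of error messages; an empty list means the item is valid."""
    errors = []
    for key, expected_type in schema.items():
        if key not in item:
            errors.append(f"missing key: {key}")
        elif not isinstance(item[key], expected_type):
            actual = type(item[key]).__name__
            errors.append(f"{key}: expected {expected_type.__name__}, got {actual}")
    return errors


# Hypothetical schema for the product item used throughout this guide.
schema = {'brand': str, 'price': float, 'stock': bool}

valid_errors = validate_item(
    {'brand': 'EasyData', 'price': 99.9, 'stock': True}, schema
)
invalid_errors = validate_item({'brand': 'EasyData', 'price': '99.9'}, schema)
print(valid_errors, invalid_errors)
```

The same function could be called from a custom item processor instead, if validation should happen inside the model.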