4. Advanced
4.1. Guide Assumptions
This guide is intended for readers who have already worked through the Getting started and Architecture sections.
4.2. Creating block models
Block models are ItemModel objects with one difference: they are meant to be used as reusable extensions that contain predefined item parsers and processors.
This functionality is easiest to explain through examples, which begin below.
4.2.1. Basic block model
Let’s first create sample HTML text stored in a test_html variable.
test_html = """
<html>
<body>
<h2 class="name">
<div class="brand">EasyData</div>
Test Product Item
</h2>
<div id="price">Was 99.9</div>
<div id="sale-price">49.9</div>
<div class="stock" available="Yes">In Stock</div>
</body>
</html>
"""
Now let’s create a block model class, which will be responsible for extracting price data from the HTML above.
import easydata as ed

class PricingBlockModel(ed.ItemModel):
    item_price = ed.PriceFloat(
        ed.pq('#price::text')
    )

    item_sale_price = ed.PriceFloat(
        ed.pq('#sale-price::text')
    )

    item_processors = [
        ('discount', ed.ItemDiscountProcessor())
    ]
As mentioned before, block models such as the one above are meant to be used within an ItemModel.
Now let’s create an ItemModel that uses the block_models property, with a PricingBlockModel instance as a value in the list.
import easydata as ed

class ProductItemModel(ed.ItemModel):
    block_models = [
        PricingBlockModel()
    ]

    item_name = ed.Text(
        ed.pq('.name::text'),
    )

    item_brand = ed.Text(
        ed.pq('.brand::text')
    )

    item_stock = ed.Has(
        ed.pq('.stock::attr(available)'),
        contains=['yes']
    )
Now let’s parse the HTML with ProductItemModel and print its output.
>>> item_model = ProductItemModel()
>>> item_model.parse_item(test_html)
Output:
{
    'brand': 'EasyData',
    'discount': 50.05,
    'name': 'EasyData Test Product Item',
    'price': 99.9,
    'sale_price': 49.9,
    'stock': True
}
As we can see from the result, discount was calculated by an ItemDiscountProcessor, which was added in PricingBlockModel.
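To see where the 50.05 in the output comes from, the arithmetic can be reproduced in plain Python. This is only a sketch: it assumes the discount is the percentage reduction from the price to the sale price, rounded to two decimals, which matches the output above. The discount_percent helper is hypothetical and is not part of easydata.

```python
def discount_percent(price: float, sale_price: float, decimals: int = 2) -> float:
    """Percentage by which sale_price is lower than price (assumed formula)."""
    if not price:
        return 0.0  # avoid division by zero when the price is missing
    return round((price - sale_price) / price * 100, decimals)


# Values taken from the test_html sample: price 99.9, sale price 49.9.
print(discount_percent(99.9, 49.9))
```

With the sample values this gives 50.05, the same number as the discount key in the parsed item.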
If needed, we can easily disable ItemDiscountProcessor within our ProductItemModel by assigning None to its discount name.
class ProductItemModel(ed.ItemModel):
    block_models = [
        PricingBlockModel()
    ]

    item_processors = [
        ('discount', None)
    ]

    ...
We can also override item_price from the PricingBlockModel in our ProductItemModel.
class ProductItemModel(ed.ItemModel):
    block_models = [
        PricingBlockModel()
    ]

    item_price = ed.PriceFloat(
        ed.pq('#price::text')
    )

    ...
4.2.2. Block models with custom parameters
We can also create reusable block models with __init__ parameters, which modify or create parsers based on the values passed in. In most cases this is the preferred way to create and use block models.
Example:
import easydata as ed

class PricingCssBlockModel(ed.ItemModel):
    def __init__(
        self,
        price_css: str,
        sale_price_css: str,
        calculate_discount: bool = True
    ):
        self.item_processors = []

        self.item_price = ed.PriceFloat(
            ed.pq(price_css)
        )

        # The sale price parser must use sale_price_css, not price_css.
        self.item_sale_price = ed.PriceFloat(
            ed.pq(sale_price_css)
        )

        if calculate_discount:
            self.item_processors.append(
                ('discount', ed.ItemDiscountProcessor())
            )
Now let’s use PricingCssBlockModel in our ProductItemModel.
class ProductItemModel(ed.ItemModel):
    block_models = [
        PricingCssBlockModel(
            price_css='#price::text',
            sale_price_css='#sale-price::text'
        )
    ]

    ...
Now let’s parse the HTML with ProductItemModel and print its output.
>>> item_model = ProductItemModel()
>>> item_model.parse_item(test_html)
Output:
{
    'brand': 'EasyData',
    'discount': 50.05,
    'name': 'EasyData Test Product Item',
    'price': 99.9,
    'sale_price': 49.9,
    'stock': True
}
4.3. Model as item property
An item property in a model can hold an ItemModel object instead of a parser object. Such a nested model produces a dictionary value for that property.
In the example below we will reuse PricingCssBlockModel from the previous section.
import easydata as ed

class ProductItemModel(ed.ItemModel):
    item_name = ed.Text(
        ed.pq('.name::text'),
    )

    item_brand = ed.Text(
        ed.pq('.brand::text')
    )

    item_pricing = PricingCssBlockModel(
        price_css='#price::text',
        sale_price_css='#sale-price::text'
    )

    item_stock = ed.Has(
        ed.pq('.stock::attr(available)'),
        contains=['yes']
    )
Now let’s parse the HTML with ProductItemModel and print its output.
>>> item_model = ProductItemModel()
>>> item_model.parse_item(test_html)  # test_html from previous section
Output:
{
    'brand': 'EasyData',
    'name': 'EasyData Test Product Item',
    'pricing': {
        'discount': 50.05,
        'price': 99.9,
        'sale_price': 49.9,
    },
    'stock': True
}
4.4. Advanced processor utilization
4.4.1. Named processors
We are already familiar with item and data processors from the Getting started section. Here we explain how to use named item and data processors to override processors that come from block models or that were added dynamically through a custom model initialization.
To start, let’s create a block model without any named item processors.
class PricingBlockModel(ed.ItemModel):
    item_price = ed.PriceFloat(
        ed.pq('#price::text')
    )

    item_sale_price = ed.PriceFloat(
        ed.pq('#sale-price::text')
    )

    item_processors = [
        ed.ItemDiscountProcessor()
    ]
Now if we wanted to override ItemDiscountProcessor in our item model, that wouldn’t be possible. Let’s see what happens if we add another ItemDiscountProcessor with custom parameters to our model.
class ProductItemModel(ed.ItemModel):
    block_models = [
        PricingBlockModel()
    ]

    item_processors = [
        ed.ItemDiscountProcessor(no_decimals=True)
    ]

    ...
In this case, the ItemDiscountProcessor from our ProductItemModel would be joined together with the same processor from the PricingBlockModel. For a better understanding, here is how the item_processors list looks behind the scenes.
[
    ed.ItemDiscountProcessor(),
    ed.ItemDiscountProcessor(no_decimals=True)
]
As we can see, there are two ItemDiscountProcessor instances, while we want to keep only the one from our model and ignore the one from PricingBlockModel.
Named processors solve this issue. Let’s recreate our PricingBlockModel, but this time we will add a name to ItemDiscountProcessor.
class PricingBlockModel(ed.ItemModel):
    item_price = ed.PriceFloat(
        ed.pq('#price::text')
    )

    item_sale_price = ed.PriceFloat(
        ed.pq('#sale-price::text')
    )

    item_processors = [
        ('discount', ed.ItemDiscountProcessor())
    ]
Now if we want to override the discount item processor from the PricingBlockModel in our model, we just need to give our ItemDiscountProcessor the same name it has in PricingBlockModel.
class ProductItemModel(ed.ItemModel):
    block_models = [
        PricingBlockModel()
    ]

    item_processors = [
        ('discount', ed.ItemDiscountProcessor(no_decimals=True))
    ]

    ...
Now only the ItemDiscountProcessor from our model will be processed.
We can even remove the ItemDiscountProcessor that comes from the PricingBlockModel by using None as the value for our named key in the tuple, as shown in the example below.
class ProductItemModel(ed.ItemModel):
    block_models = [
        PricingBlockModel()
    ]

    item_processors = [
        ('discount', None)
    ]

    ...
Now the discount won’t be calculated at all.
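The override rules described in this section can be sketched in plain Python. This is an illustration of the described behaviour only, not easydata's actual implementation: merge_processors is a hypothetical helper, and processors are represented by plain strings so that the sketch runs without the library.

```python
from typing import Any, List, Optional, Tuple, Union

# An entry is either an unnamed processor or a (name, processor) tuple.
Entry = Union[Any, Tuple[str, Any]]


def merge_processors(block_entries: List[Entry], model_entries: List[Entry]) -> List[Any]:
    """Join block and model processors; named entries override by name."""
    merged: List[Tuple[Optional[str], Any]] = []
    for entry in list(block_entries) + list(model_entries):
        # A tuple is a named entry; anything else is treated as unnamed.
        name, processor = entry if isinstance(entry, tuple) else (None, entry)
        if name is not None:
            index = next((i for i, (n, _) in enumerate(merged) if n == name), None)
            if index is not None:
                merged[index] = (name, processor)  # same name: replace in place
                continue
        merged.append((name, processor))
    # Entries whose processor was set to None are dropped entirely.
    return [processor for _, processor in merged if processor is not None]


unnamed = merge_processors(["block discount"], ["model discount"])
overridden = merge_processors([("discount", "block discount")], [("discount", "model discount")])
removed = merge_processors([("discount", "block discount")], [("discount", None)])
print(unnamed, overridden, removed)
```

Unnamed processors are simply joined, so both remain in the list. With a shared name, only the model's processor survives, and a None value removes the processor altogether.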
4.5. Protected items
Sometimes we don’t want values from item attributes to appear in the final result, but we still need them because item processors or other item parsers rely on them. To solve this elegantly, we can prefix our item properties with _item instead of item, and items declared with that prefix will be deleted from the final output.
Let’s demonstrate this in the example below.
class ProductItemModel(ed.ItemModel):
    _item_price = ed.PriceFloat(
        ed.pq('#price::text')
    )

    _item_sale_price = ed.PriceFloat(
        ed.pq('#sale-price::text')
    )

    item_processors = [
        ed.ItemDiscountProcessor()
    ]
Now let’s parse the HTML with our ProductItemModel and print its output.
>>> item_model = ProductItemModel()
>>> item_model.parse_item(test_html)  # test_html from previous section
Output:
{
    'discount': 50.05
}
As we can see in the result above, only 'discount' and its value are returned. Both the 'price' and 'sale_price' item key/value pairs were deleted, but only after they had been processed by the item processors.
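The naming convention behind protected items can be sketched without the library. The helpers below (item_key, finalize, add_discount) are hypothetical and only mimic the behaviour described above: the _item prefix marks a key as protected, processors still see it, and it is stripped at the very end. The discount formula is an assumption chosen to match the output above.

```python
def item_key(attr_name: str):
    """Return (output key, is_protected) for a model attribute name."""
    if attr_name.startswith("_item_"):
        return attr_name[len("_item_"):], True
    return attr_name[len("item_"):], False


def add_discount(item: dict) -> dict:
    """Stand-in for a discount processor (assumed percentage formula)."""
    price, sale = item["price"], item["sale_price"]
    item["discount"] = round((price - sale) / price * 100, 2)
    return item


def finalize(parsed: dict, processors: list) -> dict:
    """Run processors on all items, then drop the protected ones."""
    item, protected = {}, set()
    for attr_name, value in parsed.items():
        key, is_protected = item_key(attr_name)
        item[key] = value
        if is_protected:
            protected.add(key)
    for processor in processors:
        item = processor(item)  # protected keys are still visible here
    return {k: v for k, v in item.items() if k not in protected}


result = finalize({"_item_price": 99.9, "_item_sale_price": 49.9}, [add_discount])
print(result)
```

Only the discount key remains in the result, even though the processor read both price values.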
4.6. Item method
In some cases the built-in item parsers cannot extract a value properly because the data is too complex. In those cases we can write an item method instead of assigning a parser instance to the model property.
For clarity, let’s first look at a parser instance on a model property.
class ProductItemModel(ed.ItemModel):
    item_brand = ed.Text(ed.jp('brand'))
Now, instead of defining a Text parser on an item property, we will create an item method which produces the exact same end result.
from easydata.data import DataBag

class ProductItemModel(ed.ItemModel):
    def item_brand(self, data: DataBag):
        return data['data']['brand']
4.7. Data processing
We encourage you to create your own data processors to modify data, so that they can be reused across models. For the rare edge cases where that is impractical, we can override the preprocess_data or process_data methods of the ItemModel class. The example below shows both methods.
In the example below we have badly structured JSON text with a missing closing brace, which therefore cannot be converted to a dict. With preprocess_data we can fix it before the data is processed by data_processors. Later, once the JSON has been parsed into a dictionary by DataJsonToDictProcessor, we will modify the data in the process_data method so that the item parsers can use it.
test_json_text = '{"brand": "EasyData"'
Now let’s create our model, which will process test_json_text and use the preprocess_data method to fix the bad JSON so that a processor can convert it into a dictionary. We will also use process_data to create a new data source called brand_type.
import easydata as ed
from easydata.data import DataBag

class ProductItemModel(ed.ItemModel):
    item_brand = ed.Text(ed.jp('brand'))

    item_brand_type = ed.Text(source='brand_type')

    data_processors = [
        ed.DataJsonToDictProcessor()
    ]

    def preprocess_data(self, db: DataBag):
        db['main'] = db['main'] + '}'
        return db

    def process_data(self, db: DataBag):
        if 'easydata' in db.get('brand').lower():
            db['brand_type'] = 'local'
        else:
            db['brand_type'] = 'other'

        return db
Now let’s parse our test_json_text with ProductItemModel and show its output.
>>> test_json_text = '{"brand": "EasyData"'
>>> item_model = ProductItemModel()
>>> item_model.parse_item(test_json_text)
Output:
{
    'brand': 'EasyData',
    'brand_type': 'local'
}
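The order of the data steps can be sketched with the standard library alone. In this sketch a plain dict stands in for DataBag and json.loads stands in for DataJsonToDictProcessor; both substitutions, and the run_data_pipeline helper, are assumptions made for illustration and not easydata's internals.

```python
import json


def preprocess_data(db: dict) -> dict:
    # Repair the raw text before any data processor sees it.
    db["main"] = db["main"] + "}"
    return db


def json_to_dict(db: dict) -> dict:
    # Stand-in for a JSON-to-dict data processor.
    db["main"] = json.loads(db["main"])
    return db


def process_data(db: dict) -> dict:
    # Runs after the data processors, so db["main"] is already a dict.
    brand = db["main"]["brand"]
    db["brand_type"] = "local" if "easydata" in brand.lower() else "other"
    return db


def run_data_pipeline(raw: str) -> dict:
    db = {"main": raw}
    db = preprocess_data(db)
    for processor in [json_to_dict]:
        db = processor(db)
    return process_data(db)


db = run_data_pipeline('{"brand": "EasyData"')
print(db)
```

Without the preprocess step, json.loads would raise an error on the truncated text, which is why the repair has to happen first.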
4.8. Item processing
As with data_processors, we encourage you to create your own item processors to modify the item dictionary, so that they can be reused across models. For the rare edge cases where that is impractical, we can override the preprocess_item or process_item methods of the ItemModel class.
The example below shows both methods.
test_dict = {
    'price': 999.9,
    'sale_price': 1
}
Now let’s create our model, which will process test_dict. With preprocess_item we modify the item dictionary before item_processors are fired, so that the items are ready to be used by them. In the example below we fix a wrong sale price so that ItemDiscountProcessor can calculate the discount properly. Later we use the process_item method to create a new dictionary item, final_sale, holding a bool value that tells whether the price is discounted or not.
class ProductItemModel(ed.ItemModel):
    item_price = ed.PriceFloat(ed.jp('price'))

    _item_sale_price = ed.PriceFloat(ed.jp('sale_price'))

    item_processors = [
        ed.ItemDiscountProcessor()
    ]

    def preprocess_item(self, item: dict):
        if item['sale_price'] <= 1:
            item['sale_price'] = 0

        return item

    def process_item(self, item: dict):
        item['final_sale'] = bool(item['discount'])

        return item
Now let’s parse our test_dict with ProductItemModel and show its output.
>>> item_model = ProductItemModel()
>>> item_model.parse_item(test_dict)
Output:
{
    'discount': 0,
    'final_sale': False,
    'price': 999.9
}
Note
Note that sale_price is missing from the final output because we declared the sale price property as protected in the model. Protected items are deleted at the end, but they are still accessible in preprocess_item, item_processors and process_item.
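The order in which the item hooks fire can also be sketched in plain Python. Everything here is a hypothetical stand-in: discount_processor assumes that a zero sale price means no discount (which matches the output above), and run_item_pipeline only mimics the described sequence of preprocess_item, item_processors, process_item, and the final removal of protected items.

```python
def preprocess_item(item: dict) -> dict:
    # Runs first: treat an implausible sale price as "no sale price".
    if item["sale_price"] <= 1:
        item["sale_price"] = 0
    return item


def discount_processor(item: dict) -> dict:
    # Assumed behaviour: a zero sale price yields a discount of 0.
    price, sale = item["price"], item["sale_price"]
    item["discount"] = round((price - sale) / price * 100, 2) if sale else 0
    return item


def process_item(item: dict) -> dict:
    # Runs after the item processors, so 'discount' already exists.
    item["final_sale"] = bool(item["discount"])
    return item


def run_item_pipeline(item: dict, protected: set) -> dict:
    item = preprocess_item(dict(item))  # copy so the input stays untouched
    for processor in [discount_processor]:
        item = processor(item)
    item = process_item(item)
    # Protected items are removed only at the very end.
    return {k: v for k, v in item.items() if k not in protected}


result = run_item_pipeline({"price": 999.9, "sale_price": 1}, protected={"sale_price"})
print(result)
```

Because the protected key is removed last, all three earlier steps can still read and modify sale_price.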
4.9. With items
ItemModel has an option to retrieve multiple items from a provided data source.
4.10. Item Validation
easydata does not come with a validation solution, since its main purpose is to transform data. It is easy, however, to add your own, either through a custom item processor that handles validation or by validating the dict item after the model returns it.
Some validation libraries that we recommend:
Schematics: validation library based on ORM-like models.
JSON Schema: validation library based on JSON schema.
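As a minimal sketch of the second approach, validating the dict after the model returns it, the check below uses only plain Python rather than either of the libraries above. The REQUIRED_FIELDS rules and the validate_item helper are hypothetical and would need to be adapted to your own items.

```python
from typing import Any, Dict, List

# Hypothetical rules: expected key -> accepted type(s).
REQUIRED_FIELDS = {
    "name": str,
    "price": (int, float),
    "stock": bool,
}


def validate_item(item: Dict[str, Any]) -> List[str]:
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    for key, expected_type in REQUIRED_FIELDS.items():
        if key not in item:
            errors.append(f"missing field: {key}")
        elif not isinstance(item[key], expected_type):
            errors.append(f"wrong type for field: {key}")
    price = item.get("price")
    if isinstance(price, (int, float)) and price < 0:
        errors.append("price must not be negative")
    return errors


good = validate_item({"name": "Test Product Item", "price": 99.9, "stock": True})
bad = validate_item({"name": "Test Product Item", "price": "99.9"})
print(good)
print(bad)
```

Returning a list of errors instead of raising on the first one makes it easier to log every problem with an item in a single pass.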