7.2. PyQuerySearch (pq)
class easydata.queries.pq.PyQuerySearch(query: str, remove_query: Optional[str] = None, **kwargs)
Bases: easydata.queries.base.QuerySearch
PyQuerySearch, or its pq shortcut, is a CSS selector query. It uses the PyQuery library
underneath and adds custom pseudo keys on top, which act as commands that
determine how the selected data is output.
Note
The pq query selector also works with XML in most cases.
Throughout this tutorial we will use the following HTML:
test_html = """
<html>
<body>
<div id="breadcrumbs">
<div class="breadcrumb">Home > </div>
<div class="breadcrumb">Items</div>
</div>
<h2 class="name">
<div class="brand" content="EasyData">EasyData</div>
Test Product Item
</h2>
<div class="images">
<img src="http://demo.com/img1.jpg" />
<img src="http://demo.com/img2.jpg" />
</div>
<div class="stock" available="Yes">In Stock</div>
<input id="stock-quantity" name="quantity" value="12" />
<a href="https://demo.com" class="link">Home page</a>
</body>
</html>
"""
Let's import the easydata module first.
>>> import easydata as ed
Now let's select the brand name by passing test_html to our pq instance.
>>> ed.pq('.brand::text').get(test_html)
'EasyData'
If we did not add the pseudo key ::text at the end of our CSS selector, we would get
a PyQuery instance instead of the brand value.
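To make the role of the pseudo key concrete, here is a minimal sketch of how a query
string could be separated into its CSS selector and its pseudo key. The helper name
split_query is hypothetical and this is not EasyData's actual parsing code; it only
illustrates the "<selector>::<pseudo key>" convention used in this section.

```python
from typing import Optional, Tuple


def split_query(query: str) -> Tuple[str, Optional[str]]:
    """Split '<selector>::<pseudo key>' into its parts (hypothetical helper)."""
    # rpartition splits on the LAST '::', so the selector itself stays intact.
    selector, separator, pseudo_key = query.rpartition("::")
    if not separator:
        # No pseudo key present: the whole string is the selector.
        return query, None
    return selector, pseudo_key


print(split_query(".brand::text"))           # ('.brand', 'text')
print(split_query(".brand::attr(content)"))  # ('.brand', 'attr(content)')
print(split_query(".brand"))                 # ('.brand', None)
```

A query without a pseudo key yields None as the second value, which corresponds to the
case above where a PyQuery instance is returned instead of a plain value.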
7.2.1. Pseudo keys
::text
The pseudo key ::text ensures that we always get text output. Any child HTML elements
are stripped away and line breaks are converted to spaces.
In the example below, we select the h2 element, which has a div child node.
>>> ed.pq('h2::text').get(test_html)
'EasyData Test Product Item'
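The flattening described above can be approximated with the standard library alone.
The following is a rough sketch, not the code that ::text actually runs: it collects
every text node of an HTML fragment and collapses line breaks and repeated whitespace
into single spaces. The names TextCollector and flatten_text are made up for this
illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes and ignores all tags (illustrative helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def flatten_text(fragment: str) -> str:
    collector = TextCollector()
    collector.feed(fragment)
    collector.close()
    # str.split() without arguments splits on any whitespace, including
    # newlines, so joining with one space collapses the line breaks.
    return " ".join("".join(collector.chunks).split())


h2_fragment = """<h2 class="name">
<div class="brand" content="EasyData">EasyData</div>
Test Product Item
</h2>"""

print(flatten_text(h2_fragment))  # EasyData Test Product Item
```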
::ntext
The pseudo key ::ntext works the same as ::text, except that it also performs
string normalization. This means that bad unicode will be fixed in most
cases.
>>> bad_html = "<div>ünicode</div>"
>>> ed.pq('div::ntext').get(bad_html)
'ünicode'
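The exact normalization rules are not spelled out here, so the following is only an
assumed illustration of one common kind of "bad unicode": UTF-8 text that was
mistakenly decoded as Latin-1 (mojibake). The helper name fix_mojibake is hypothetical,
and the sketch relies only on the standard library rather than on whatever ::ntext
uses internally.

```python
def fix_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-Latin-1 mix-up (assumed example, not EasyData code)."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text is not this kind of mojibake, so leave it unchanged.
        return text


garbled = "\u00c3\u00bcnicode"  # 'ünicode' after the encoding mix-up
print(fix_mojibake(garbled))    # ünicode
```

Text that is already correct fails the round trip and is returned unchanged.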
::attr(<attr-name>)
With the pseudo key ::attr we can select attributes of HTML elements.
>>> ed.pq('.brand::attr(content)').get(test_html)
'EasyData'
::content
The pseudo key ::content is a shortcut for ::attr(content).
>>> ed.pq('.brand::content').get(test_html)
'EasyData'
::href
The pseudo key ::href is a shortcut for ::attr(href).
>>> ed.pq('.link::href').get(test_html)
'https://demo.com'
::src
The pseudo key ::src is a shortcut for ::attr(src).
>>> ed.pq('img::src').get(test_html)
'http://demo.com/img1.jpg'
::val
The pseudo key ::val is a shortcut for ::attr(value).
>>> ed.pq('#stock-quantity::val').get(test_html)
'12'
::name
The pseudo key ::name is a shortcut for ::attr(name).
>>> ed.pq('#stock-quantity::name').get(test_html)
'quantity'
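The attribute shortcuts listed above can be summarized as a lookup table. The sketch
below is an illustration under that assumption, not the library's source: the names
ATTR_SHORTCUTS and attribute_for are hypothetical, and only the mapping itself is taken
from this section.

```python
import re
from typing import Optional

# Shortcut pseudo key -> HTML attribute name, as described in this section.
ATTR_SHORTCUTS = {
    "content": "content",
    "href": "href",
    "src": "src",
    "val": "value",
    "name": "name",
}

# Matches the explicit form, e.g. 'attr(content)'.
_ATTR_CALL = re.compile(r"^attr\((?P<name>[^()]+)\)$")


def attribute_for(pseudo_key: str) -> Optional[str]:
    """Return the attribute a pseudo key refers to, or None if it is not one."""
    match = _ATTR_CALL.match(pseudo_key)
    if match:
        return match.group("name").strip()
    return ATTR_SHORTCUTS.get(pseudo_key)


print(attribute_for("val"))            # value
print(attribute_for("attr(content)"))  # content
print(attribute_for("text"))           # None
```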
7.2.2. Pseudo keys “-all” extension
As we can see in our test_html above, we have multiple elements with the class
value breadcrumb.
Let's try to select them and output their text with the pseudo key ::text.
>>> ed.pq('.breadcrumb::text').get(test_html)
'Home > '
By default, pseudo keys output only the first of the selected HTML elements.
To get all elements that match the specified selector, we need to add the -all extension
to our ::text pseudo key. Let's try that in the example below.
>>> ed.pq('.breadcrumb::text-all').get(test_html)
'Home > Items'
The -all extension currently works only with the ::text and ::ntext pseudo keys.
7.2.3. Pseudo keys “-items” extension
The purpose of the -items extension is to return a list with one value for every
HTML element matched by the CSS selector.
>>> ed.pq('.images img::src-items').get(test_html)
['http://demo.com/img1.jpg', 'http://demo.com/img2.jpg']
The -items extension works with all other pseudo keys, such as ::text, ::ntext, ::src,
::val, ::attr(<attr-name>), ::href, etc.
We can also use ::items on its own as a pseudo key, and it will return a list of
PyQuery objects. This is especially useful inside List or Dict parsers, where the
result needs further processing by child parsers.
>>> ed.pq('img::items').get(test_html)
[[<img>], [<img>]]
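To contrast the default behaviour with the -all and -items extensions, the sketch below
applies each mode to the per-element values already extracted for the breadcrumb
example. The function apply_extension is hypothetical, and joining the texts without a
separator for -all is an assumption chosen to match the output shown earlier, not a
documented rule.

```python
from typing import List, Optional, Union


def apply_extension(
    values: List[str], extension: Optional[str] = None
) -> Union[str, List[str], None]:
    """Pick first, joined, or listed values (illustrative, not EasyData code)."""
    if extension == "items":
        # -items: one list entry per matched element.
        return list(values)
    if extension == "all":
        # -all: text of every matched element combined into one string.
        return "".join(values)
    # Default: only the first matched element, or None if nothing matched.
    return values[0] if values else None


breadcrumbs = ["Home > ", "Items"]
print(repr(apply_extension(breadcrumbs)))           # 'Home > '
print(repr(apply_extension(breadcrumbs, "all")))    # 'Home > Items'
print(repr(apply_extension(breadcrumbs, "items")))  # ['Home > ', 'Items']
```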
7.2.4. Removing HTML elements from result
Let's say we have the following HTML:
test_html = """
<h2>
<span>EasyData</span>
Test Product Item
</h2>
"""
If we want to select the h2 element and its content but exclude the content of the span
element, we need to specify the remove_query parameter with a CSS selector that points
to the element we want excluded from the end result.
>>> ed.pq('h2::text', remove_query='span').get(test_html)
'Test Product Item'
We can also exclude multiple nested HTML elements by separating their selectors with commas.
>>> ed.pq('.made-up-class::text', remove_query='span,#some-id,.some-class').get(test_html)
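The effect of remove_query on the h2 example can be reproduced with the standard
library as a rough sketch. It is not how EasyData implements removal: the helper
text_without is hypothetical, it matches plain tag names instead of CSS selectors, and
it assumes the fragment is well-formed enough to be parsed as XML.

```python
import xml.etree.ElementTree as ET
from typing import List


def text_without(element: ET.Element, removed_tag: str) -> str:
    """Return the element's text while skipping every <removed_tag> subtree."""
    parts: List[str] = []

    def walk(node: ET.Element) -> None:
        if node.tag == removed_tag:
            return  # skip the excluded element and everything inside it
        if node.text:
            parts.append(node.text)
        for child in node:
            walk(child)
            # Tail text follows the child but belongs to the parent, so keep it.
            if child.tail:
                parts.append(child.tail)

    walk(element)
    return " ".join("".join(parts).split())


fragment = """
<h2>
<span>EasyData</span>
Test Product Item
</h2>
"""
h2 = ET.fromstring(fragment.strip())
print(text_without(h2, "span"))  # Test Product Item
```

Keeping the tail text matters here: "Test Product Item" follows the span element, so
dropping the span together with its tail would remove the text we want to keep.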