7.2. PyQuerySearch (pq)
class easydata.queries.pq.PyQuerySearch(query: str, remove_query: Optional[str] = None, **kwargs)
Bases: easydata.queries.base.QuerySearch
PyQuerySearch, or its pq shortcut, is a CSS selector query. It uses the PyQuery library
underneath and, on top of that, adds custom pseudo keys which serve as commands that
determine how the selected data is output.
Note
In most cases the pq query selector also works with XML formats.
Throughout this tutorial we will use the following HTML:
test_html = """
<html>
<body>
<div id="breadcrumbs">
<div class="breadcrumb">Home > </div>
<div class="breadcrumb">Items</div>
</div>
<h2 class="name">
<div class="brand" content="EasyData">EasyData</div>
Test Product Item
</h2>
<div class="images">
<img src="http://demo.com/img1.jpg" />
<img src="http://demo.com/img2.jpg" />
</div>
<div class="stock" available="Yes">In Stock</div>
<input id="stock-quantity" name="quantity" value="12" />
<a href="https://demo.com" class="link">Home page</a>
</body>
</html>
"""
Let's import the easydata module first.
>>> import easydata as ed
Now let's select the brand name from our HTML by passing test_html to our pq instance.
>>> ed.pq('.brand::text').get(test_html)
'EasyData'
If we didn't add the pseudo key ::text at the end of our CSS selector, we would get
a PyQuery instance instead of the brand value.
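To make the idea of a selector followed by a pseudo key more concrete, here is a minimal sketch of how such a query string could be split into its two parts. The helper split_query is hypothetical and written only for illustration; it is not part of easydata, and the library's own parsing may work differently.

```python
from typing import Optional, Tuple


def split_query(query: str) -> Tuple[str, Optional[str]]:
    """Split 'selector::pseudo' into (selector, pseudo).

    Hypothetical helper for illustration only. The pseudo part is
    None when the query contains no '::' separator.
    """
    selector, separator, pseudo = query.partition("::")
    return selector.strip(), (pseudo.strip() if separator else None)


print(split_query(".brand::text"))  # ('.brand', 'text')
print(split_query(".brand"))        # ('.brand', None)
```

A query without a pseudo key yields None for the second part, which mirrors the case above where a PyQuery instance is returned instead of a value.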
7.2.1. Pseudo keys
::text
The pseudo key ::text ensures that we always get text output. Any child HTML elements
are stripped away and line breaks are converted to spaces.
In the example below we select the h2 element, which has a child div node.
>>> ed.pq('h2::text').get(test_html)
'EasyData Test Product Item'
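The whitespace handling described above can be sketched in plain Python. This is only an approximation under stated assumptions: collapse_whitespace is a hypothetical name, and the sketch also trims leading and trailing whitespace, which the example output suggests but the text does not state explicitly.

```python
def collapse_whitespace(raw: str) -> str:
    """Replace every run of whitespace (including line breaks) with one space.

    Illustrative helper only, not easydata's implementation.
    Leading and trailing whitespace is dropped as a side effect.
    """
    return " ".join(raw.split())


# Raw text of the h2 element once the child tags are stripped away.
raw_h2_text = "\n  EasyData\n  Test Product Item\n"
print(collapse_whitespace(raw_h2_text))  # EasyData Test Product Item
```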
::ntext
The pseudo key ::ntext works the same as ::text, except that it also performs
string normalization. This means that bad Unicode will be fixed, at least in most
cases.
>>> bad_html = "<div>ünicode</div>"
>>> ed.pq('div::ntext').get(bad_html)
'ünicode'
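The text above does not say which normalization ::ntext applies. As one assumed illustration of what string normalization can mean, the sketch below uses the standard library to compose a decomposed character sequence into a single code point. The function normalize_text is hypothetical, and NFC composition is an assumption made here, not a statement about easydata's internals.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Compose decomposed characters into their canonical form (NFC).

    Illustrative only; easydata may use a different technique.
    """
    return unicodedata.normalize("NFC", text)


decomposed = "u\u0308nicode"  # 'u' followed by COMBINING DIAERESIS
composed = normalize_text(decomposed)
print(len(decomposed), len(composed))  # 8 7
```

Both strings render the same on screen, but only the composed one compares equal to a string typed with a precomposed "ü".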
::attr(<attr-name>)
With the pseudo key ::attr we can select attributes of HTML elements.
>>> ed.pq('.brand::attr(content)').get(test_html)
'EasyData'
::content
The pseudo key ::content is a shortcut for ::attr(content).
>>> ed.pq('.brand::content').get(test_html)
'EasyData'
::href
The pseudo key ::href is a shortcut for ::attr(href).
>>> ed.pq('.link::href').get(test_html)
'https://demo.com'
::src
The pseudo key ::src is a shortcut for ::attr(src).
>>> ed.pq('img::src').get(test_html)
'http://demo.com/img1.jpg'
::val
The pseudo key ::val is a shortcut for ::attr(value).
>>> ed.pq('#stock-quantity::val').get(test_html)
'12'
::name
The pseudo key ::name is a shortcut for ::attr(name).
>>> ed.pq('#stock-quantity::name').get(test_html)
'quantity'
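The shortcuts listed above all reduce to an attribute lookup. The sketch below summarizes that mapping in one place. The names ATTR_SHORTCUTS and attribute_for are hypothetical and exist only for this illustration; the pseudo key is passed without its leading "::".

```python
import re
from typing import Optional

# Shortcut pseudo keys and the attribute each one stands for,
# as described in this section.
ATTR_SHORTCUTS = {
    "content": "content",
    "href": "href",
    "src": "src",
    "val": "value",
    "name": "name",
}


def attribute_for(pseudo_key: str) -> Optional[str]:
    """Return the attribute name a pseudo key refers to, or None.

    Hypothetical helper, not part of easydata.
    """
    match = re.fullmatch(r"attr\((?P<name>[^)]+)\)", pseudo_key)
    if match:
        return match.group("name").strip()
    return ATTR_SHORTCUTS.get(pseudo_key)


print(attribute_for("val"))            # value
print(attribute_for("attr(content)"))  # content
print(attribute_for("text"))           # None
```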
7.2.2. Pseudo keys “-all” extension
As we can see in our test_html above, we have multiple elements with the class
value breadcrumb.
Let's try to select them and output their values with the pseudo key ::text.
>>> ed.pq('.breadcrumb::text').get(test_html)
'Home > '
By default, pseudo keys output only the first of the selected HTML elements.
To get all elements that match the specified selector, we need to add the -all
extension to our ::text pseudo key. Let's try that in the example below.
>>> ed.pq('.breadcrumb::text-all').get(test_html)
'Home > Items'
The -all extension currently works only with the ::text and ::ntext pseudo keys.
7.2.3. Pseudo keys “-items” extension
The purpose of the -items extension is to return a list with one entry for every
HTML element matched by a CSS selector.
>>> ed.pq('.images img::src-items').get(test_html)
['http://demo.com/img1.jpg', 'http://demo.com/img2.jpg']
-items works with all other pseudo keys such as ::text, ::ntext, ::src, ::val,
::attr(<attr-name>), ::href, etc.
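The difference between the default output, the -all extension, and the -items extension can be sketched with a plain list of already extracted values. The function apply_extension is hypothetical, and the joining rule used for -all (join with a space, then collapse whitespace) is an assumption chosen to match the 'Home > Items' output shown earlier.

```python
from typing import List, Union


def apply_extension(values: List[str], extension: str = "") -> Union[str, List[str], None]:
    """Pick the output shape for a list of matched values.

    Illustrative only, not easydata's implementation:
    - no extension: the first value, or None if nothing matched
    - "all": every value joined into one string (assumed joining rule)
    - "items": every value kept as a separate list entry
    """
    if extension == "items":
        return list(values)
    if extension == "all":
        return " ".join(" ".join(values).split())
    return values[0] if values else None


breadcrumbs = ["Home > ", "Items"]
print(repr(apply_extension(breadcrumbs)))     # 'Home > '
print(apply_extension(breadcrumbs, "all"))    # Home > Items
print(apply_extension(breadcrumbs, "items"))  # ['Home > ', 'Items']
```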
We can also use ::items as a pseudo key on its own, and it will return a list of PyQuery
objects.
This is especially useful inside List or Dict parsers, where the result needs
further processing by child parsers.
>>> ed.pq('img::items').get(test_html)
[[<img>], [<img>]]
7.2.4. Removing HTML elements from result
Let's say we have the following HTML:
test_html = """
<h2>
<span>EasyData</span>
Test Product Item
</h2>
"""
If we want to select the h2 element and its content but exclude the content of the span
element, we need to specify the remove_query parameter with a CSS selector that points to
the element we want excluded from the end result.
>>> ed.pq('h2::text', remove_query='span').get(test_html)
'Test Product Item'
We can also exclude multiple nested HTML elements by separating their selectors with commas.
>>> ed.pq('.made-up-class::text', remove_query='span,#some-id,.some-class').get(test_html)
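To illustrate what excluding an element means for the resulting text, here is a minimal standard-library sketch. It is not how easydata or PyQuery implement remove_query: it matches plain tag names rather than CSS selectors, it needs well-formed markup, and text_without is a hypothetical helper. The point it shows is that the text following the removed element still belongs to the parent and is kept.

```python
import xml.etree.ElementTree as ET
from typing import Set


def _raw_text(element: ET.Element, removed_tags: Set[str]) -> str:
    """Concatenate an element's text, skipping the content of removed tags."""
    parts = [element.text or ""]
    for child in element:
        if child.tag not in removed_tags:
            parts.append(_raw_text(child, removed_tags))
        # The tail is the text after the child's closing tag; it belongs
        # to the parent, so it is kept even when the child is skipped.
        parts.append(child.tail or "")
    return "".join(parts)


def text_without(element: ET.Element, removed_tags: Set[str]) -> str:
    """Return whitespace-collapsed text with the given tags excluded."""
    return " ".join(_raw_text(element, removed_tags).split())


h2 = ET.fromstring("<h2><span>EasyData</span> Test Product Item</h2>")
print(text_without(h2, {"span"}))  # Test Product Item
print(text_without(h2, set()))     # EasyData Test Product Item
```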