7.2. PyQuerySearch (pq)

class easydata.queries.pq.PyQuerySearch(query: str, remove_query: Optional[str] = None, **kwargs)

Bases: easydata.queries.base.QuerySearch

PyQuerySearch, or its pq shortcut, is a CSS selector query. It uses the PyQuery library underneath and adds custom pseudo keys on top, which act as commands that determine how the selected data is output.

Note

The pq query selector also works with XML formats in most cases.

Throughout this tutorial we will use the following HTML:

test_html = """
    <html>
        <body>
            <div id="breadcrumbs">
                <div class="breadcrumb">Home > </div>
                <div class="breadcrumb">Items</div>
            </div>
            <h2 class="name">
                <div class="brand" content="EasyData">EasyData</div>
                Test Product Item
            </h2>
            <div class="images">
                <img src="http://demo.com/img1.jpg" />
                <img src="http://demo.com/img2.jpg" />
            </div>
            <div class="stock" available="Yes">In Stock</div>
            <input id="stock-quantity" name="quantity" value="12" />
            <a href="https://demo.com" class="link">Home page</a>
        </body>
    </html>
"""

Let's import the easydata module first.

>>> import easydata as ed

Now let's select the brand name from our HTML by passing test_html to our pq instance.

>>> ed.pq('.brand::text').get(test_html)
'EasyData'

If we had not added the pseudo key ::text at the end of our CSS selector, we would get a PyQuery instance instead of the brand value.
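
To make the role of the pseudo key concrete, here is a minimal sketch of how a query string could be split into its CSS selector and its pseudo key. The helper split_query is hypothetical and written only for illustration. It is not part of easydata, and the library may parse queries differently.

```python
from typing import Optional, Tuple


def split_query(query: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: split 'selector::pseudo' into (selector, pseudo)."""
    # rpartition splits on the LAST '::', so the selector part stays intact.
    selector, separator, pseudo = query.rpartition("::")
    if not separator:
        # No pseudo key present: the whole string is the selector.
        return query, None
    return selector, pseudo


print(split_query(".brand::text"))           # ('.brand', 'text')
print(split_query(".brand::attr(content)"))  # ('.brand', 'attr(content)')
print(split_query(".brand"))                 # ('.brand', None)
```

The last call mirrors the case described above: with no pseudo key there is nothing that says how to turn the selected element into a value.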

7.2.1. Pseudo keys

::text

The pseudo key ::text ensures that we always get text output. Any child HTML elements are stripped away and line breaks are converted to spaces.

In the example below, let's select the h2 element, which has a child div node.

>>> ed.pq('h2::text').get(test_html)
'EasyData Test Product Item'

::ntext

The pseudo key ::ntext works the same as ::text, except that it also performs string normalization. This means that bad unicode will be fixed, at least in most cases.

>>> bad_html = "<div>ünicode</div>"
>>> ed.pq('div::ntext').get(bad_html)
'ünicode'

::attr(<attr-name>)

With the pseudo key ::attr we can select attributes of HTML elements.

>>> ed.pq('.brand::attr(content)').get(test_html)
'EasyData'

::content

The pseudo key ::content is a shortcut for ::attr(content).

>>> ed.pq('.brand::content').get(test_html)
'EasyData'

::href

The pseudo key ::href is a shortcut for ::attr(href).

>>> ed.pq('.link::href').get(test_html)
'https://demo.com'

::src

The pseudo key ::src is a shortcut for ::attr(src).

>>> ed.pq('img::src').get(test_html)
'http://demo.com/img1.jpg'

::val

The pseudo key ::val is a shortcut for ::attr(value).

>>> ed.pq('#stock-quantity::val').get(test_html)
'12'

::name

The pseudo key ::name is a shortcut for ::attr(name).

>>> ed.pq('#stock-quantity::name').get(test_html)
'quantity'
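
The attribute shortcuts listed above can be summarized as a small lookup table. The sketch below is only an illustration of that mapping. The names SHORTCUTS and attribute_for are hypothetical and do not come from easydata.

```python
import re
from typing import Optional

# Shortcut pseudo keys and the HTML attribute each one stands for,
# as described in this section.
SHORTCUTS = {
    "content": "content",
    "href": "href",
    "src": "src",
    "val": "value",
    "name": "name",
}


def attribute_for(pseudo_key: str) -> Optional[str]:
    """Hypothetical helper: return the attribute a pseudo key reads, if any."""
    key = pseudo_key.lstrip(":")
    # Explicit form: attr(<attr-name>)
    match = re.fullmatch(r"attr\((.+)\)", key)
    if match:
        return match.group(1)
    # Shortcut form; text-like keys such as 'text' are not attributes.
    return SHORTCUTS.get(key)


print(attribute_for("::val"))              # value
print(attribute_for("::attr(available)"))  # available
print(attribute_for("::text"))             # None
```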

7.2.2. Pseudo keys “-all” extension

As we can see in our test_html above, we have multiple elements with the class breadcrumb.

Let's try to select them and output their value with the pseudo key ::text.

>>> ed.pq('.breadcrumb::text').get(test_html)
'Home > '

By default, pseudo keys output only the first of the selected HTML elements.

To get all elements that match the specified selector, we need to add the -all extension to our ::text pseudo key. Let's try that in the example below.

>>> ed.pq('.breadcrumb::text-all').get(test_html)
'Home > Items'

The -all extension currently works only with the ::text and ::ntext pseudo keys.

7.2.3. Pseudo keys “-items” extension

The purpose of the -items extension is to return a list with one value for each HTML element matched by a CSS selector.

>>> ed.pq('.images img::src-items').get(test_html)
['http://demo.com/img1.jpg', 'http://demo.com/img2.jpg']

-items works with all other pseudo keys, such as ::text, ::ntext, ::src, ::val, ::attr(<attr-name>), ::href, etc.

We can also use ::items as a pseudo key on its own, and it will return a list of PyQuery objects. This is especially useful inside List or Dict parsers, where the result needs further processing by child parsers.

>>> ed.pq('img::items').get(test_html)
[[<img>], [<img>]]
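
The difference between the default behaviour, -all and -items can be sketched with plain Python values. The function shape_output below is hypothetical and only illustrates the three output shapes. In particular, stripping each piece and joining with a single space for -all is an assumption, not a description of how easydata does it.

```python
from typing import List, Optional, Union


def shape_output(
    values: List[str], extension: Optional[str] = None
) -> Union[str, List[str], None]:
    """Hypothetical helper: shape already-extracted values by extension."""
    if extension == "items":
        # -items: one list entry per matched element.
        return list(values)
    if extension == "all":
        # -all: a single string built from every matched element.
        # Assumption: pieces are stripped and joined with one space.
        return " ".join(value.strip() for value in values)
    # Default: only the first matched element (None if nothing matched).
    return values[0] if values else None


breadcrumbs = ["Home > ", "Items"]
print(repr(shape_output(breadcrumbs)))           # 'Home > '
print(repr(shape_output(breadcrumbs, "all")))    # 'Home > Items'
print(repr(shape_output(breadcrumbs, "items")))  # ['Home > ', 'Items']
```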

7.2.4. Removing HTML elements from result

Let's say we have the following HTML:

test_html = """
    <h2>
        <span>EasyData</span>
        Test Product Item
    </h2>
"""

If we want to select the h2 element and its content but exclude the content of the span element, we need to specify the remove_query parameter with a CSS selector that points to the element we want excluded from the end result.

>>> ed.pq('h2::text', remove_query='span').get(test_html)
'Test Product Item'

We can also exclude multiple nested HTML elements by separating their selectors with commas.

>>> ed.pq('.made-up-class::text', remove_query='span,#some-id,.some-class').get(test_html)
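
To illustrate the idea behind remove_query without relying on easydata itself, here is a minimal standard-library sketch that collects text while skipping everything inside certain elements. The class TextWithoutTags and the function text_without are hypothetical. Unlike remove_query, this sketch matches plain tag names only, not CSS selectors, and it is not how the library is implemented.

```python
from html.parser import HTMLParser
from typing import Iterable, List


class TextWithoutTags(HTMLParser):
    """Collect text, ignoring everything inside the given tag names."""

    def __init__(self, skip_tags: Iterable[str]) -> None:
        super().__init__()
        self.skip_tags = set(skip_tags)
        self.depth = 0  # greater than 0 while inside a skipped element
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Start skipping at a listed tag, and track nesting while skipping.
        # Limitation: unclosed void tags (e.g. <br>) inside a skipped
        # element would throw the nesting counter off.
        if tag in self.skip_tags or self.depth:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth:
            self.parts.append(data)


def text_without(html: str, skip_tags: Iterable[str]) -> str:
    parser = TextWithoutTags(skip_tags)
    parser.feed(html)
    parser.close()
    # Collapse line breaks and repeated whitespace into single spaces.
    return " ".join("".join(parser.parts).split())


snippet = """
    <h2>
        <span>EasyData</span>
        Test Product Item
    </h2>
"""
print(text_without(snippet, ["span"]))  # Test Product Item
print(text_without(snippet, []))        # EasyData Test Product Item
```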