6.3. Url

6.3.1. Url

class easydata.parsers.url.Url(*args, from_text: bool = False, from_qs: Optional[str] = None, from_qs_unquote: Optional[str] = None, remove_qs: Optional[Union[str, list, bool]] = None, qs: Optional[dict] = None, domain: Optional[str] = None, protocol: Optional[str] = None, normalize: bool = True, **kwargs)

Bases: easydata.parsers.text.Text

The Url parser is based on the Text parser and therefore inherits all of its parameters and usage. One difference is that the normalize parameter is set to False, while in the Text parser it is set to True by default.

For parameters other than the ones described here, please see the Text documentation.

Getting Started

>>> import easydata as ed
>>> test_dict = {'url': 'demo.com/home'}
>>> ed.Url(ed.jp('url')).parse(test_dict)
'https://demo.com/home'

In this case the url in test_dict is partial: it has no protocol. The Url parser always tries to construct and output a full url.

Parameters

qs

With the qs parameter we can manipulate a url's query string. We can change existing values or add new ones.

Let's first try to change an existing one.

>>> ed.Url(qs={'home': 'false'}).parse('https://demo.com/?home=true')
'https://demo.com/?home=false'

Now let's change an existing value and, at the same time, add a new query string value.

>>> test_url = 'https://demo.com/?home=true'
>>> ed.Url(qs={'home': 'false', 'country': 'SI'}).parse(test_url)
'https://demo.com/?home=false&country=SI'
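
To make the merge behaviour concrete, here is a minimal standard-library sketch of the same idea. It is not easydata's implementation; set_query_params is a hypothetical helper name, and the sketch assumes each query key appears only once.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_query_params(url: str, params: dict) -> str:
    """Hypothetical helper: merge params into the query string of url."""
    parts = urlsplit(url)
    # A dict keeps insertion order, so existing keys stay in place and
    # new keys are appended. Repeated keys in the url collapse to the last one.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.update(params)
    return urlunsplit(parts._replace(query=urlencode(query)))


print(set_query_params('https://demo.com/?home=true', {'home': 'false', 'country': 'SI'}))
```

Existing keys are overwritten in place and unknown keys are appended, which matches the output order shown in the examples above.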

remove_qs

With remove_qs we can remove query string keys and their values.

If we pass a single str key to remove_qs, only that query string key and its value are removed, as we can see below.

>>> ed.Url(remove_qs='home').parse('https://demo.com/?home=false&country=SI')
'https://demo.com/?country=SI'

We can also delete multiple query string keys and their values at the same time by passing a list of str keys to the remove_qs parameter.

>>> test_url = 'https://demo.com/?home=false&country=SI&currency=EUR'
>>> ed.Url(remove_qs=['home', 'country']).parse(test_url)
'https://demo.com/?currency=EUR'

If we set remove_qs to True, all query string keys and values are removed.

>>> ed.Url(remove_qs=True).parse('https://demo.com/?home=false&country=SI')
'https://demo.com/'
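
The three accepted forms (a str, a list, or True) can be sketched with the standard library as follows. This is only an illustration under our own assumptions, not easydata's code; remove_query_params is a hypothetical helper name.

```python
from typing import Union
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def remove_query_params(url: str, keys: Union[str, list, bool]) -> str:
    """Hypothetical helper: drop one key, several keys, or the whole query string."""
    parts = urlsplit(url)
    if keys is True:
        kept = []  # True removes every key/value pair
    elif not keys:
        return url  # False, None, or an empty value: nothing to remove
    else:
        drop = {keys} if isinstance(keys, str) else set(keys)
        pairs = parse_qsl(parts.query, keep_blank_values=True)
        kept = [(k, v) for k, v in pairs if k not in drop]
    return urlunsplit(parts._replace(query=urlencode(kept)))


test_url = 'https://demo.com/?home=false&country=SI&currency=EUR'
print(remove_query_params(test_url, 'home'))
print(remove_query_params(test_url, ['home', 'country']))
print(remove_query_params(test_url, True))
```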

from_text

The Url parser can extract a url from surrounding text, as we can see in the example below.

>>> ed.Url(from_text=True).parse('Home url is:  https://demo.com/home  !!!')
'https://demo.com/home'
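
As a rough picture of what such an extraction involves, the sketch below uses a simple regular expression from the standard library. The pattern and the extract_url helper are our own assumptions for illustration; easydata may use a different and more robust matching strategy.

```python
import re
from typing import Optional

# Assumed pattern: an http(s) scheme followed by any run of non-whitespace.
URL_PATTERN = re.compile(r'https?://\S+')


def extract_url(text: str) -> Optional[str]:
    """Hypothetical helper: return the first url found in text, or None."""
    match = URL_PATTERN.search(text)
    if match is None:
        return None
    # Trim punctuation that commonly trails a url at the end of a sentence.
    return match.group(0).rstrip('.,;:!?)')


print(extract_url('Home url is:  https://demo.com/home  !!!'))
```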

domain

In some cases, especially when scraping websites, we get only partial url links without a domain. For cases like this, setting the domain parameter to a domain name allows the full url to be constructed.

>>> ed.Url(domain='http://demo.com').parse('/product/1122')
'http://demo.com/product/1122'

The domain parameter value can also be provided without a protocol such as http or https. In that case the default protocol https is used to construct the full url.

>>> ed.Url(domain='demo.com').parse('/product/1122')
'https://demo.com/product/1122'

Note

The default value of the domain parameter can be defined through the config variable ED_URL_DOMAIN in a model.

protocol

As we saw in the example above, the default protocol https is used when the domain name given in the domain parameter has no protocol. We can change the default protocol by passing a new value to the protocol parameter.

>>> ed.Url(domain='demo.com', protocol='ftp').parse('/product/1122')
'ftp://demo.com/product/1122'

Note

The default value of the protocol parameter can be defined through the config variable ED_URL_PROTOCOL in a config file or a model.
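
The way domain and protocol interact can be summarised in a few lines of plain Python. This is a simplified sketch, not easydata's implementation; build_full_url is a hypothetical helper, and it assumes the input is a path relative to the site root.

```python
def build_full_url(path: str, domain: str, protocol: str = 'https') -> str:
    """Hypothetical helper: join a partial path with a domain."""
    # The protocol argument only applies when the domain carries none itself.
    if '://' not in domain:
        domain = f'{protocol}://{domain}'
    # Normalise slashes so exactly one separates domain and path.
    return domain.rstrip('/') + '/' + path.lstrip('/')


print(build_full_url('/product/1122', 'http://demo.com'))
print(build_full_url('/product/1122', 'demo.com'))
print(build_full_url('/product/1122', 'demo.com', protocol='ftp'))
```

A protocol already present in the domain takes precedence, which mirrors the first domain example above where http is preserved.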

from_qs

from_qs_unquote