Introduction to Urljoin in Python

Migel Hewage Nimesha Dec 11, 2023
This tutorial describes Python's urljoin() function and how it behaves, and demonstrates its use with several code examples.

Introduction to urljoin in Python

URLs often carry useful information, such as which site is being visited, what a user searched for, or how the content of a site is organized.

Although URLs can look quite complex, Python comes with several useful libraries that let you parse and join URLs and retrieve their constituent parts.

The urllib package in Python 3 lets a script access websites and contains several modules for working with URLs; urljoin() lives in urllib.parse.

The urllib library is the standard way to work with URLs in Python; it allows programs to open and interact with websites through their Uniform Resource Locator (URL).

The package is made up of the modules urllib.request, urllib.error, urllib.parse, and urllib.robotparser.

Use of the urljoin() Function

The urljoin() function is helpful when many related URLs are needed, for instance when generating the URLs for a set of pages on a website by adding new values to a base URL.

Syntax:

urljoin(base, url, allow_fragments=True)

urljoin() constructs a full (absolute) URL by combining a base URL (base, called baseurl in this tutorial) with another URL (url, called newurl here). Informally, it uses components of the base URL, in particular the addressing scheme, the network location, and (part of) the path, to provide the components missing from the relative URL.

As an example:

>>> from urllib.parse import urljoin
>>> urljoin('http://www.cwi.nl:50/%7Eguido/Python.html', 'FAQ.html')

Output:

'http://www.cwi.nl:50/%7Eguido/FAQ.html'

The allow_fragments argument (the third parameter) has the same meaning and default as for urlparse(). If newurl is an absolute URL (that is, it starts with // or scheme://), the host name and/or scheme of newurl will be present in the result. As an example:

>>> from urllib.parse import urljoin
>>> urljoin('http://www.cwi.nl:50/%7Eguido/Python.html', '//www.python.org/%7Eguido')

Output:

'http://www.python.org/%7Eguido'

If this is not the output you expect, preprocess newurl with urlsplit() and urlunsplit(), removing any scheme and network location parts.
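The preprocessing step described above can be sketched as follows. This is only an illustration: join_same_host is a hypothetical helper name, not part of urllib, and it assumes you want the result to stay on the scheme and host of the base URL.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def join_same_host(base, newurl):
    """Hypothetical helper: drop scheme and host from newurl before joining."""
    parts = urlsplit(newurl)
    # Rebuild newurl with an empty scheme and an empty network location,
    # keeping only its path, query, and fragment.
    stripped = urlunsplit(("", "", parts.path, parts.query, parts.fragment))
    return urljoin(base, stripped)


base = "http://www.cwi.nl:50/%7Eguido/Python.html"
print(join_same_host(base, "//www.python.org/%7Eguido"))
# http://www.cwi.nl:50/%7Eguido
```

Because the stripped URL still begins with a single slash, it is resolved against the root of the base host rather than against the directory of Python.html.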

If you are unfamiliar with these functions, urlparse(), urlsplit(), and urlunsplit() are briefly described below:

  • urlparse() - This function splits a URL into six components (scheme, netloc, path, params, query, fragment), so any individual part can be read on its own.
  • urlsplit() - This function is similar to urlparse() but does not split the params from the URL. It is helpful for URLs following the syntax of RFC 2396, which allows parameters on each path segment.
  • urlunsplit() - This function combines the elements of a tuple as returned by urlsplit() into a complete URL string.
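A short sketch may make the difference between these three functions concrete. The example.com URL is made up for illustration; the last lines also show what allow_fragments=False does in urlparse(), which is the meaning the same argument has in urljoin().

```python
from urllib.parse import urlparse, urlsplit, urlunsplit

# Made-up URL with params (;v=2), a query, and a fragment.
url = "https://example.com/docs/page;v=2?lang=en#intro"

parsed = urlparse(url)  # six components, params split off the path
split = urlsplit(url)   # five components, params stay in the path

print(parsed.path, parsed.params)  # /docs/page v=2
print(split.path)                  # /docs/page;v=2

# urlunsplit() reverses urlsplit().
print(urlunsplit(split) == url)    # True

# With allow_fragments=False the "#intro" part is not recognized as a
# fragment and stays attached to the preceding component (here the query).
no_frag = urlparse(url, allow_fragments=False)
print(repr(no_frag.fragment), no_frag.query)  # '' lang=en#intro
```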

Use the urljoin() Function to Build URLs

The requests library re-exports urljoin() as requests.compat.urljoin (the same function as urllib.parse.urljoin), so it can be used to build URLs and change parts of a URL dynamically. Programmatically, you can reach any sub-directory of a URL and substitute parts of it with new values to build new URLs.

The following code uses urljoin() to reach different levels of a URL path and to add new values, such as a query string and a fragment, to the base URL.

from requests.compat import urljoin

base = "https://stackoverflow.com/questions/10893374"
print(urljoin(base, "."))
print(urljoin(base, ".."))
print(urljoin(base, "..."))
print(urljoin(base, "/10893374/"))

url_query = urljoin(base, "?vers=1.0")
print(url_query)
url_sec = urljoin(url_query, "#section-5.4")
print(url_sec)

Output:

https://stackoverflow.com/questions/
https://stackoverflow.com/
https://stackoverflow.com/questions/...
https://stackoverflow.com/10893374/
https://stackoverflow.com/questions/10893374?vers=1.0
https://stackoverflow.com/questions/10893374?vers=1.0#section-5.4
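The same idea can be used to generate a set of related page URLs from one base, as mentioned earlier. The following is a minimal sketch that uses only the standard library; the example.com base and the page names are assumptions made up for illustration.

```python
from urllib.parse import urljoin

# Made-up base URL; the trailing slash marks "docs" as a directory.
base = "https://example.com/docs/"
pages = ["index.html", "install.html", "api/reference.html"]

# Resolve every relative page name against the same base.
urls = [urljoin(base, page) for page in pages]
for url in urls:
    print(url)
# https://example.com/docs/index.html
# https://example.com/docs/install.html
# https://example.com/docs/api/reference.html
```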

Is there a way to split URLs in Python? Of course, yes!

We can split a URL into its components beyond the primary address. The query parameters and the fragment attached to the URL are separated out by the urlparse() function, as shown below.

from requests.compat import urlparse

url_01 = (
    "https://docs.python.org/3/library/__main__.html?highlight=python%20hello%20world"
)
url_02 = "https://docs.python.org/2/py-modindex.html#cap-f"
print(urlparse(url_01))
print(urlparse(url_02))

Output:

ParseResult(scheme='https', netloc='docs.python.org', path='/3/library/__main__.html', params='', query='highlight=python%20hello%20world', fragment='')
ParseResult(scheme='https', netloc='docs.python.org', path='/2/py-modindex.html', params='', query='', fragment='cap-f')
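The fields of a ParseResult can also be read by name, and the query string can be decoded further. The sketch below reuses the first URL from the example above and imports directly from urllib.parse; parse_qs() is an additional standard-library function not used elsewhere in this tutorial.

```python
from urllib.parse import parse_qs, urlparse

url = "https://docs.python.org/3/library/__main__.html?highlight=python%20hello%20world"
result = urlparse(url)

# Individual components are available as attributes.
print(result.netloc)  # docs.python.org
print(result.path)    # /3/library/__main__.html

# parse_qs() decodes the query string into a dict of lists.
query = parse_qs(result.query)
print(query["highlight"][0])  # python hello world
```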

Use urljoin() to Form URLs

The examples below form URLs from different parts to show the behavior of the urljoin() function imported from urllib.parse.

Example Code:

>>> from urllib.parse import urljoin
>>> urljoin('test', 'task')

Output:

'task'

Example Code:

>>> from urllib.parse import urljoin
>>> urljoin('http://test', 'task')

Output:

'http://test/task'

Example Code:

>>> from urllib.parse import urljoin
>>> urljoin('http://test/add', 'task')

Output:

'http://test/task'

Example Code:

>>> from urllib.parse import urljoin
>>> urljoin('http://test/add/', 'task')

Output:

'http://test/add/task'

Example Code:

>>> from urllib.parse import urljoin
>>> urljoin('http://test/add/', '/task')

Output:

'http://test/task'

In the snippets above, the first argument can be thought of as the baseurl (following the urljoin() syntax), equivalent to the address of the page currently displayed in the browser.

The second argument, newurl, can be thought of as the href of an anchor on that page. The result is the URL of the page the browser opens when the user clicks that link.

The remaining examples use a baseurl that includes a scheme and a domain.

Example Code:

>>> from urllib.parse import urljoin
>>> urljoin('http://test', 'task')

Output:

'http://test/task'

Here the page is served from the host test (think of a virtual host), and an anchor such as <a href='task'>Baz</a> on that page directs the user to the URL shown above.

>>> from urllib.parse import urljoin
>>> urljoin('http://test/add', 'task')

Output:

'http://test/task'

When the page is test/add (no trailing slash), add is treated as a file name rather than a directory, so the relative link task replaces it and directs the user to the URL above.

>>> from urllib.parse import urljoin
>>> urljoin('http://test/add/', 'task')

Output:

'http://test/add/task'

Here the page is test/add/ with a trailing slash, so add is treated as a directory and the relative link resolves to a different URL: test/add/task.

>>> from urllib.parse import urljoin
>>> urljoin('http://test/add/', '/task')

Output:

'http://test/task'

If the user is on test/add/ and the href is /task, the leading slash makes the link relative to the root of the site, so it takes the user to test/task. In short, urljoin() in Python is a handy function for resolving URLs as needed.
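Since the trailing slash on the base and the leading slash on the link change the result, it can be convenient to normalize both when the link should always end up below the base. The following is a small sketch under that assumption: join_under is a hypothetical helper, not part of urllib, and it assumes the base has no query string or fragment.

```python
from urllib.parse import urljoin


def join_under(base, relative):
    """Hypothetical helper: always resolve `relative` below `base`."""
    # Treat the last path segment of base as a directory.
    if not base.endswith("/"):
        base += "/"
    # Drop leading slashes so the link is not resolved from the site root.
    return urljoin(base, relative.lstrip("/"))


print(urljoin("http://test/add", "task"))       # http://test/task
print(join_under("http://test/add", "task"))    # http://test/add/task
print(join_under("http://test/add/", "/task"))  # http://test/add/task
```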
