
Decode HTML entities into Python String

In this article, we will learn to decode HTML entities into a Python string. We will use one function from Python's standard library and two third-party libraries.

Decoding HTML entities into a Python string makes scraped or stored markup easier to read, even for a programmer who does not know HTML. An HTML entity such as &pound; or &copy; is an ASCII-only escape sequence that stands for a special character (here £ and ©). Each of the three methods below replaces these entities with the characters they represent.

Example: Use HTML Parser to decode HTML Entities

This example imports Python's built-in html module. Its html.unescape() function converts the named and numeric character references in a string (for example &pound;, &#163;, or &#xA3;) to the corresponding Unicode characters and returns the result as a Python string.

import html

print(html.unescape('&pound;682m'))
print(html.unescape('&copy; 2010'))


£682m
© 2010
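As a small sketch of what html.unescape() handles beyond the named entities shown above, the snippet below decodes decimal and hexadecimal references and shows that decoding happens in a single pass. The sample strings and variable names are made up for illustration; html.escape() is included only as the rough inverse operation.

```python
import html

# Named, decimal, and hexadecimal references decode to the same character.
named = html.unescape("&pound;682m")
decimal = html.unescape("&#163;682m")
hexadecimal = html.unescape("&#xA3;682m")

# unescape() makes a single pass, so a double-escaped entity
# is decoded by one level only.
double_escaped = html.unescape("&amp;pound;682m")

# html.escape() goes the other way for the markup-significant characters.
escaped = html.escape("<b>£682m & rising</b>")

print(named, decimal, hexadecimal)
print(double_escaped)
print(escaped)
```

If the input may have been escaped twice, calling html.unescape() a second time on the result decodes the remaining level.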

Example: Use Beautiful Soup to decode HTML Entities

This example uses Beautiful Soup 4 (the bs4 package), which works with Python 3.x. Beautiful Soup 4 decodes entities automatically when it parses the markup. The older Beautiful Soup 3, used with Python 2.x, required passing the convertEntities argument to the BeautifulSoup constructor. Passing "html.parser" as the second argument selects Python's built-in parser, which does not wrap the input in extraneous HTML that wasn't part of the original string (i.e. <html> and <body>), as some other parsers do.

# Beautiful Soup 4

from bs4 import BeautifulSoup
print(BeautifulSoup("&pound;682m", "html.parser"))


£682m
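Printing the BeautifulSoup object works for a quick check, but the object is a parse tree, not a string. A minimal sketch, assuming Beautiful Soup 4 is installed (pip install beautifulsoup4), is to call get_text() when a plain Python string is needed. The HTML fragment and variable names here are made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative fragment that mixes a tag with two entities.
fragment = "<p>&pound;682m &amp; rising</p>"

soup = BeautifulSoup(fragment, "html.parser")

# get_text() returns a plain str with the tags dropped and the entities decoded.
text = soup.get_text()

print(text)
```

Unlike html.unescape(), this approach also strips the surrounding tags, so it fits best when the input is an HTML fragment and only its text is wanted.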

Example: Use w3lib.html Library to decode HTML Entities

This method uses the w3lib.html module. w3lib is a third-party package, so install it with the pip command below to avoid a ModuleNotFoundError. Its replace_entities() function replaces the HTML entities in a string with the characters they represent and returns a Python string.

pip install w3lib

from w3lib.html import replace_entities
print(replace_entities("&pound;682m"))


£682m

Conclusion

In this article, we learned to decode HTML entities into a Python string using three libraries: html, which is part of Python's standard library, and the third-party packages BeautifulSoup and w3lib. We saw how each one replaces HTML entities with the characters they represent. If you get a "ModuleNotFoundError", check that the third-party packages are installed correctly.