Decode HTML entities into Python String
In this article, we will learn how to decode HTML entities into a Python string. We will use one function from the standard library and two third-party libraries.
HTML entities are escape sequences such as &pound; or &copy; that stand in for characters in HTML markup. Decoding them replaces each entity with the character it represents, which makes the text easier to read and process, even for a programmer who is not familiar with HTML. Each of the three methods below takes a string containing entities and returns a Python string with the actual characters.
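To make the mapping between an entity and its character concrete, here is a minimal sketch that looks up two entity names in the tables shipped with the standard library's html.entities module. The variable names are only illustrative.

```python
import html
from html.entities import html5, name2codepoint

# name2codepoint maps an HTML 4 entity name (without "&" and ";")
# to its Unicode code point.
pound_codepoint = name2codepoint["pound"]
print(pound_codepoint, chr(pound_codepoint))

# html5 maps HTML5 named character references (keys include the
# trailing ";") directly to the characters they represent.
copyright_sign = html5["copy;"]
print(copyright_sign)

# A numeric reference spells out the same code point, so it decodes
# to the same character as the named form.
print(html.unescape("&#163;") == html.unescape("&pound;"))
```

These tables are what a decoder consults when it replaces a named entity with a character.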
Example: Use HTML Parser to decode HTML Entities
This method uses the html module from Python's standard library (Python 3.4 and later). Its html.unescape() function converts the named and numeric character references in a string to the corresponding Unicode characters and returns the result as a Python string.
import html

print(html.unescape('&pound;682m'))
print(html.unescape('&copy; 2010'))
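As a further sketch, the snippet below shows that html.unescape() accepts named, decimal, and hexadecimal references alike, that it decodes only one level of escaping per call, and that it reverses html.escape(). The sample strings are made up for illustration.

```python
import html

samples = [
    "&pound;682m",   # named reference
    "&#163;682m",    # decimal numeric reference
    "&#xA3;682m",    # hexadecimal numeric reference
]
decoded = [html.unescape(s) for s in samples]
print(decoded)

# Double-escaped text needs one call per level of escaping.
once = html.unescape("&amp;pound;682m")
twice = html.unescape(once)
print(once, twice)

# html.escape() goes the other way for &, <, > and quotes,
# and html.unescape() restores the original string.
original = 'a < b & "c"'
escaped = html.escape(original)
restored = html.unescape(escaped)
print(escaped, restored)
```

If the output of a scraper still contains text like &pound; after one pass, the source was probably escaped twice, and a second call is the usual remedy.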
Example: Use Beautiful Soup to decode HTML Entities
This method uses BeautifulSoup to decode HTML entities. The example uses Beautiful Soup 4, which works with Python 3.x and decodes entities automatically. With the older Beautiful Soup 3 (used with Python 2.x), you need to pass the convertEntities argument to the BeautifulSoup constructor.
html.parser is passed as the second argument, along with the HTML string, because this parser does not wrap the result in extra markup that wasn't part of the original string (i.e. <html> and <body>), as some other parsers do.
# Beautiful Soup 4
from bs4 import BeautifulSoup

print(BeautifulSoup("&pound;682m", "html.parser"))
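The "html.parser" option refers to the parser in Python's standard library, so a rough equivalent can be sketched without Beautiful Soup when the input also contains tags. This is only an illustration, not how Beautiful Soup is implemented; TextExtractor and extract_text are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags (hypothetical helper class)."""

    def __init__(self):
        # convert_charrefs=True (the default) makes the parser decode
        # entities before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text; tags themselves are skipped.
        self.parts.append(data)


def extract_text(markup):
    """Return the text of markup with tags dropped and entities decoded."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


text = extract_text("<p>Revenue: <b>&pound;682m</b> &copy; 2010</p>")
print(text)
```

Unlike html.unescape(), this approach also discards the markup, which is closer to what one usually wants when cleaning scraped pages.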
Example: Use w3lib.html Library to decode HTML Entities
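This method uses the w3lib.html module. w3lib is a third-party package, so to avoid a "ModuleNotFoundError", install it with pip using the command below. The module provides replace_entities(), which replaces the HTML entities in a string with their corresponding characters and returns a Python string.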
pip install w3lib
from w3lib.html import replace_entities

print(replace_entities("&pound;682m"))
In this article, we learned to decode HTML entities into a Python string using three libraries: the built-in html module and the third-party packages BeautifulSoup and w3lib. We saw how each entity in the input is replaced with the character it represents. If you get a "ModuleNotFoundError", make sure the third-party packages are installed correctly.