Why remove HTML tags?
So why would you ever want to remove HTML tags from text? There are many reasons. For instance, you might want to extract the text content from a web page for analysis, or you might want to sanitize user input to help prevent XSS (Cross-Site Scripting) attacks. Removing HTML tags can help in both of these scenarios, and many others.
Note: XSS is a type of security vulnerability where an attacker injects malicious scripts into webpages viewed by other users. By sanitizing user input and stripping HTML tags, we can help mitigate this risk.
In the following sections we'll show a few ways to strip HTML tags from a string. You'll probably notice that, when using plain JS, the common denominator is regular expressions, which are a powerful tool for complex string manipulations like this.
The replace() Method
The following example shows how you can use the replace() method to remove all HTML tags from a given string:
let stringWithHtml = "<p>Hello, World!</p> <a href='#'>Click Me</a>";
let strippedString = stringWithHtml.replace(/<\/?[^>]+(>|$)/g, "");
console.log(strippedString); // Outputs: Hello, World! Click Me
In this example, the regular expression /<\/?[^>]+(>|$)/g matches any substring that starts with a less-than symbol (<), followed by an optional forward slash (/), then one or more characters that are not a greater-than symbol (>), and ends with either a greater-than symbol (>) or the end of the string. The g flag applies the replacement to every match rather than only the first one.
By replacing these matches with an empty string, we effectively strip all HTML tags from the original string, leaving us with just the text content.
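To make the behavior of this pattern easier to see, here is a minimal sketch that wraps it in a helper and runs it on a few inputs. The function name stripTags is hypothetical and not part of any library. The last two cases illustrate a limitation: because the pattern only looks for a < and whatever follows it, a literal less-than sign in ordinary text is treated as the start of a tag.

```javascript
// Hypothetical helper: wraps the regular expression discussed above.
function stripTags(input) {
  // "<", optional "/", one or more non-">" characters, then ">" or end of string
  return input.replace(/<\/?[^>]+(>|$)/g, "");
}

// Well-formed markup: only the text content is left.
console.log(stripTags("<p>Hello, World!</p> <a href='#'>Click Me</a>"));
// Hello, World! Click Me

// An unclosed tag at the end of the string is removed too (the "$" branch).
console.log(JSON.stringify(stripTags("Hello <b")));
// "Hello "

// Limitation: a bare "<" in plain text is mistaken for a tag.
console.log(JSON.stringify(stripTags("a < b and c > d")));
// "a  d"
```

If your input can contain comparison operators or other stray angle brackets, a real HTML parser is usually a safer choice than a regular expression.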
Here's how you can use Cheerio to strip HTML tags:
const cheerio = require('cheerio');
let str = "<p>Hello, World!</p>";
let $ = cheerio.load(str);
console.log($.text());
This will output Hello, World! with the surrounding tags removed.
Stripping HTML Entities
HTML entities are a different beast altogether. These are special characters that are written using specific codes to be displayed in an HTML document. For example, &amp; is the HTML entity for the ampersand (&).
Stripping HTML entities is a bit trickier, but can be done using the he library. Here's how:
const he = require('he');
let str = "Hello, World &amp; everyone else!";
let decodedStr = he.decode(str);
console.log(decodedStr);
This will output: "Hello, World & everyone else!".
The he.decode() function will decode any HTML entities in your string, converting them back into their original characters.
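If you can't add a dependency and only need to handle a few common entities, the same idea can be sketched in plain JS. The example below is an illustration only, not how he works internally: the decodeBasicEntities function and the NAMED table are hypothetical names, and the table covers just six named entities plus numeric references. Anything it doesn't recognize is left unchanged.

```javascript
// Hypothetical lookup table: a tiny subset of the named entities HTML defines.
const NAMED = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'", nbsp: "\u00A0" };

// Hypothetical helper: decodes numeric references and the names listed above.
function decodeBasicEntities(input) {
  return input.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (match, body) => {
    if (body[0] === "#") {
      // Numeric reference: hexadecimal (&#x41;) or decimal (&#65;)
      const isHex = body[1] === "x" || body[1] === "X";
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range code points untouched instead of throwing
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Named reference: decode it only if it is in the table
    return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : match;
  });
}

console.log(decodeBasicEntities("Hello, World &amp; everyone else!"));
// Hello, World & everyone else!
console.log(decodeBasicEntities("&lt;p&gt; &#65;&#x42; &unknown;"));
// <p> AB &unknown;
```

The replacement runs in a single pass, so an input like &amp;lt; decodes to &lt; rather than being decoded twice. For full coverage of the entity list, a maintained library such as he remains the more reliable option.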
Handling Nested HTML Tags
Before we conclude, one thing we should look at is whether our technique works on nested HTML tags. Nesting can present a bit of a challenge when trying to strip tags out. Let's say we have a string like this:
let str = "<div><p>Hello <strong>World</strong></p></div>";
The replace() method, combined with a well-crafted regular expression, can handle this scenario quite well. Here's how:
let str = "<div><p>Hello <strong>World</strong></p></div>";
let stripped = str.replace(/<[^>]+>/g, '');
console.log(stripped); // "Hello World"
Here, the regular expression <[^>]+> matches any sequence that starts with <, followed by one or more characters that are not >, and ends with >. This matches all HTML tags, nested or not, and replaces them with an empty string.
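One thing worth knowing about nesting is that this pattern removes the tags themselves but keeps everything between them, including the contents of script and style elements. The sketch below, using the hypothetical helper names stripTagsOnly and stripTagsAndScripts, compares the plain pattern with a variant that first drops whole script and style blocks. It is an illustration under those assumptions and should not be relied on as a security sanitizer.

```javascript
// Hypothetical helper: the tag-only pattern from the example above.
function stripTagsOnly(input) {
  return input.replace(/<[^>]+>/g, "");
}

// Hypothetical helper: remove <script> and <style> blocks (tags plus contents)
// before stripping the remaining tags.
function stripTagsAndScripts(input) {
  return input
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, "")
    .replace(/<[^>]+>/g, "");
}

const nested = "<div><p>Hello <strong>World</strong></p></div>";
console.log(stripTagsOnly(nested)); // Hello World

const withScript = "<div><script>alert(1)</script>Hi</div>";
console.log(stripTagsOnly(withScript)); // alert(1)Hi
console.log(stripTagsAndScripts(withScript)); // Hi
```

Whether the script text should be kept or dropped depends on what you plan to do with the result, so pick the variant that matches your use case.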