Stripping HTML Tags from Text Using Plain JavaScript

Introduction

Web scraping has become more and more prevalent over the years, which means more developers are having to figure out how to work with HTML markup from the pages they're scraping. But what if you just want the text? Given the complexity of HTML, this might seem like a daunting task, but thankfully, there are some ways to do it with JavaScript.

Why remove HTML tags?

So why would you ever want to remove HTML tags from text? Well, there are many reasons. For instance, you might want to extract the text content from a web page for analysis, or you might want to sanitize user input to prevent XSS (Cross Site Scripting) attacks. Removing HTML tags can help in both these scenarios, and many others.

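Note: XSS is a type of security vulnerability where an attacker injects malicious scripts into web pages viewed by other users. Stripping HTML tags from user input can help reduce this risk, but a regular expression alone is not a complete defense, so untrusted text should still be escaped on output or passed through a dedicated sanitizer.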

How to Strip HTML Tags with JavaScript

In the following sections we'll show a few ways to strip HTML tags from a string. You'll probably notice that, when using plain JS, the common approach is to use regular expressions, which are a powerful tool for complex string manipulations like this.

The replace() Method

The replace() method is a frequently used tool for manipulating strings in JavaScript, and it can also strip HTML tags from a string. It searches the string for a specified pattern, which in our case is an HTML tag, and replaces each match with an empty string.

The following example shows how you can use the replace() method to remove all HTML tags from a given string:

let stringWithHtml = "<p>Hello, World!</p> <a href='#'>Click Me</a>";
let strippedString = stringWithHtml.replace(/<\/?[^>]+(>|$)/g, "");
console.log(strippedString);

// Outputs: Hello, World! Click Me

In this example, the regular expression /<\/?[^>]+(>|$)/g matches any substring that starts with a less-than symbol (<), followed by an optional forward slash (/), then one or more characters that are not a greater-than symbol (>), and ends with either a greater-than symbol (>) or the end of the string.
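
To make the pieces of the pattern concrete, the short sketch below lists what it matches in a sample string (the variable names are ours). It also shows the (>|$) alternative at work: a tag that is never closed is matched through to the end of the string.

```javascript
const tagPattern = /<\/?[^>]+(>|$)/g;

// match() with a global pattern returns every matched substring.
const matches = "<p>Hello, World!</p> <a href='#'>Click Me</a>".match(tagPattern);
console.log(matches); // ["<p>", "</p>", "<a href='#'>", "</a>"]

// An unterminated tag has no closing ">", so the "$" branch matches instead
// and everything from "<" to the end of the string is removed.
const truncated = "Hello <b";
const truncatedResult = truncated.replace(tagPattern, "");
console.log(JSON.stringify(truncatedResult)); // "Hello "
```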

The g at the end of the regular expression is a flag that tells JavaScript to replace all occurrences, not just the first one.

By replacing these matches with an empty string, we effectively strip all HTML tags from the original string, leaving us with just the text content.
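
To see the effect of the g flag and to make the pattern reusable, the snippet below wraps it in a small helper. The name stripTags is our own, chosen for illustration. It is a minimal sketch built on the same regular expression, not a full HTML parser.

```javascript
// Hypothetical helper name; wraps the same pattern used above.
function stripTags(html) {
  // String() lets non-string input such as numbers pass through replace().
  return String(html).replace(/<\/?[^>]+(>|$)/g, "");
}

const sample = "<p>Hello, World!</p> <a href='#'>Click Me</a>";

// Without the g flag, only the first match (the opening <p>) is removed.
const firstOnly = sample.replace(/<\/?[^>]+(>|$)/, "");
console.log(firstOnly); // Hello, World!</p> <a href='#'>Click Me</a>

// With the g flag, every tag is removed.
console.log(stripTags(sample)); // Hello, World! Click Me
```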

Using Libraries

While using plain JavaScript is great, sometimes you might want to use a library to handle this task. One such library is Cheerio. Cheerio provides a simple API for manipulating HTML and XML documents, similar to jQuery.

Here's how you can use Cheerio to strip HTML tags:

const cheerio = require('cheerio');

let str = "<p>Hello, World!</p>";
let $ = cheerio.load(str);

console.log($.text());

This will also output: "Hello, World!".

Stripping HTML Entities

HTML entities are a different beast altogether. These are special characters that are written using specific codes to be displayed in an HTML document. For example, &amp; is the HTML entity for the ampersand (&).

Converting HTML entities back into plain characters is a bit trickier, but can be done using the he library. Here's how:

const he = require('he');

let str = "Hello, World &amp; everyone else!";
let decodedStr = he.decode(str);

console.log(decodedStr);

This will output: "Hello, World & everyone else!".

Note: The he.decode() function will decode any HTML entities in your string, converting them back into their original characters.
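
If adding a dependency is not an option, a small subset of entities can be decoded with replace() alone. The sketch below is only an illustration: the function name decodeBasicEntities and the five-entry lookup table are ours, and it covers just a handful of named entities plus numeric references, whereas he handles the full HTML entity set.

```javascript
// Hypothetical lookup table: only a few named entities, unlike he's full set.
const NAMED_ENTITIES = new Map([
  ["amp", "&"], ["lt", "<"], ["gt", ">"], ["quot", '"'], ["apos", "'"],
]);

function decodeBasicEntities(text) {
  // Matches &name;, &#123; (decimal) and &#x7B; (hexadecimal) references.
  return text.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Out-of-range references are left untouched instead of throwing.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    // Names missing from the table are left as they are.
    return NAMED_ENTITIES.has(body) ? NAMED_ENTITIES.get(body) : match;
  });
}

console.log(decodeBasicEntities("Hello, World &amp; everyone else!"));
// Hello, World & everyone else!
console.log(decodeBasicEntities("5 &lt; 10 &#38; 10 &#x3E; 5"));
// 5 < 10 & 10 > 5
```

Because replace() makes a single pass, a string such as &amp;lt; becomes &lt; rather than being decoded twice.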

By combining these techniques, we can strip HTML tags from a string and decode any entities it contains using JavaScript. Remember, while libraries can make our lives easier, understanding how to do it with plain JavaScript is a great skill to have.
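
The order of the two steps matters. The sketch below uses a deliberately tiny stand-in decoder (three entities only, with helper names of our own) rather than he. It removes tags first and decodes entities second, so that escaped text such as &lt;b&gt; survives as literal characters instead of being mistaken for a tag.

```javascript
// Same tag pattern as in the replace() section.
const removeTags = (html) => html.replace(/<\/?[^>]+(>|$)/g, "");

// Tiny stand-in decoder for illustration only (hypothetical helper).
const ENTITIES = { "&lt;": "<", "&gt;": ">", "&amp;": "&" };
const decodeEntities = (text) => text.replace(/&(?:lt|gt|amp);/g, (m) => ENTITIES[m]);

const input = "<p>Use &lt;b&gt; for bold &amp; &lt;i&gt; for italics</p>";

// Tags first, entities second: escaped markup is kept as text.
const tagsFirst = decodeEntities(removeTags(input));
console.log(tagsFirst); // Use <b> for bold & <i> for italics

// Entities first, tags second: the decoded <b> and <i> are stripped too,
// leaving doubled spaces where they used to be.
const entitiesFirst = removeTags(decodeEntities(input));
console.log(JSON.stringify(entitiesFirst)); // "Use  for bold &  for italics"
```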

Handling Nested HTML Tags

Before we conclude, one thing we should look at is whether our technique works on nested HTML tags, which can seem like a challenge to strip out. Let's say we have a string like this:

let str = "<div><p>Hello <strong>World</strong></p></div>";

If we were to use a naive approach, we might end up with some unexpected results. But don't worry, JavaScript's replace() method, combined with a well-crafted regular expression, can handle this scenario quite well. Here's how:

let str = "<div><p>Hello <strong>World</strong></p></div>";
let stripped = str.replace(/<[^>]+>/g, '');
console.log(stripped);

// "Hello World"

Here, the regular expression <[^>]+> matches any sequence that starts with <, followed by one or more characters that are not >, and ends with >. This matches all HTML tags, nested or not, and replaces them with an empty string.
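
Nesting is not where this pattern struggles, but two other cases are worth checking before relying on it. The sketch below, using sample strings of our own, shows that text from adjacent block elements runs together and that a > inside an attribute value ends the match early. Replacing tags with a space and collapsing whitespace is one possible workaround for the first case, assuming that is the output you want.

```javascript
const tagPattern = /<[^>]+>/g;

// Adjacent block elements: the words are joined with no separator.
const blocks = "<p>One</p><p>Two</p>";
const joined = blocks.replace(tagPattern, "");
console.log(joined); // OneTwo

// One possible workaround: swap tags for a space, then collapse whitespace.
const spaced = blocks.replace(tagPattern, " ").replace(/\s+/g, " ").trim();
console.log(spaced); // One Two

// A ">" inside an attribute value ends the match early, leaving debris
// (note the leading space in the result).
const tricky = '<a title="1 > 0">link</a>';
const debris = tricky.replace(tagPattern, "");
console.log(JSON.stringify(debris)); // " 0\">link"
```

For input like the last case, a real parser (such as Cheerio, shown earlier) is the more dependable choice.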

Conclusion

In this Byte, we've explored how to strip HTML tags from text using plain JavaScript. We've learned about the replace() method and how to use regular expressions to match HTML tags. We've also covered how to handle nested HTML tags and special characters. While JavaScript provides us with the tools to do this in a fairly straightforward manner, always consider the complexity and performance implications of your specific use case.

Last Updated: September 2nd, 2023

© 2013-2024 Stack Abuse. All rights reserved.
