Use JavaScript to condense HTML markup and remove extra space

This technique is a little bizarre and probably doesn't have many direct applications. I'm going to share it because it lets me talk about regular expressions, and it's food for thought about other ways to use them.

The problem

We have a DOM element that we want to serialize (basically, we want to store it as a string). The problem is, there is a ton of extra white space in the element's HTML markup. If we stored that white space in a JavaScript variable we would be wasting resources. If we have to store many elements like this, the problem is even bigger!

We need a way to remove space from a string, but not just any space. We only want to remove white space characters that occur between HTML elements.

<p>We want to keep all space between these words.</p>

<p>We want to remove the blank line above this paragraph.</p>

<span>Also, </span>          <span>the space between these two spans</span>

A simple solution

Here's the answer:

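function condenseHTML(elem) {
    // Serialize the element and its children to a string
    const html = elem.outerHTML
    // Remove every run of white space that sits between two tags
    const condensedHtml = html.replace(/>\s+</g, '><')
    return condensedHtml
}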

Let's walk through this:

  • Our function accepts a single argument: an HTMLElement.
  • We then get the HTML that represents this element and its children using the element's outerHTML property.
  • Next, we use a regular expression to execute a search and replace operation.

Let's talk about that regular expression:

  • The regular expression looks for one or more white space characters (\s matches white space characters, + matches one or more).
  • Specifically, we are looking for white space between > and < characters. In other words, white space that appears after the end of one HTML tag and before the beginning of the next. Notice that we are matching the > and < as well.
  • Also notice that the regular expression has the g flag at the end, which stands for "global". This means it matches every occurrence in the string; without it, replace would only change the first match.
  • We will then replace all the matches by using the string's replace method.

Since our regular expression matched the > and < characters, they will also be replaced. That's why we are replacing each individual match with ><, as otherwise the angle brackets would be lost. Now the HTML elements remain valid but there will no longer be space between them.
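To see the replacement in action without a browser, the same regular expression can be applied to a plain string. This is only a sketch: condenseMarkup and the sample markup are hypothetical names made up for this illustration.

```javascript
// Hypothetical helper: uses the same regular expression as condenseHTML,
// but accepts a string so it can run outside a browser.
function condenseMarkup(html) {
    return html.replace(/>\s+</g, '><')
}

// Sample markup with indentation and a blank line between the paragraphs
const sample = '<div>\n    <p>Keep these spaces.</p>\n\n    <p>Second paragraph.</p>\n</div>'

console.log(condenseMarkup(sample))
// <div><p>Keep these spaces.</p><p>Second paragraph.</p></div>

// Without the g flag, only the first run of white space is removed
console.log(sample.replace(/>\s+</, '><'))
```

The spaces between the words inside each paragraph are untouched, because they are not directly surrounded by a > and a <.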

Caveats

This approach isn't perfect. It also removes space between elements that really should be there. See this example:

<span>We still need</span>         <span>space here.</span>

All of the space between these two spans will be removed, which means the words "need" and "space" will be smashed together when viewed in the browser. To be extra safe, our solution could replace each run of white space with a single space instead of removing it entirely.

Remember that in HTML, multiple spaces usually condense down to a single visual space (unless the element's style has white-space set to pre, pre-wrap, or break-spaces).
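Here is a minimal sketch of that safer variant. It works on a plain string so it can run anywhere, and condenseMarkupSafely is a hypothetical name chosen for this example.

```javascript
// Hypothetical variant: collapse each run of white space between tags
// down to a single space instead of removing it.
function condenseMarkupSafely(html) {
    return html.replace(/>\s+</g, '> <')
}

console.log(condenseMarkupSafely('<span>We still need</span>         <span>space here.</span>'))
// <span>We still need</span> <span>space here.</span>
```

The trade-off is that one space is kept even where none is needed (for example, between two paragraphs), so the result is slightly longer. Elements styled to preserve white space would still render differently.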

Extended learning

If you'd like to get more practice using regular expressions, I recommend checking out regex101.com.

If you're interested in further optimizing HTML serialization, try using a DOMParser to parse the HTML source into a Document. You can then use the context of each Element and Text node to determine which spaces are safe to remove.
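As a starting point, here is a rough sketch of what a node-by-node pass might look like. The helper collapseTextNodes and the list of tags to skip are assumptions made for this example, not a complete rule set. It only looks at tag names, so it will not notice white space preserved through CSS.

```javascript
// Node type values defined by the DOM (Node.ELEMENT_NODE and Node.TEXT_NODE)
const ELEMENT_NODE = 1
const TEXT_NODE = 3

// Assumption: white space inside these elements matters, so leave it alone
const PRESERVE_TAGS = new Set(['PRE', 'TEXTAREA', 'SCRIPT', 'STYLE'])

// Hypothetical helper: collapse every run of white space in the text nodes
// under `node` to a single space, skipping the preserved elements.
function collapseTextNodes(node) {
    for (const child of node.childNodes) {
        if (child.nodeType === TEXT_NODE) {
            child.nodeValue = child.nodeValue.replace(/\s+/g, ' ')
        } else if (child.nodeType === ELEMENT_NODE && !PRESERVE_TAGS.has(child.tagName)) {
            collapseTextNodes(child)
        }
    }
}

// In a browser, it could be used like this (htmlSource is a placeholder):
//   const doc = new DOMParser().parseFromString(htmlSource, 'text/html')
//   collapseTextNodes(doc.body)
//   const condensed = doc.body.innerHTML
```

From here, you could go further and drop text nodes that contain only white space when both neighbors are block-level elements.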