Changing HTML Tag Attributes with the WP HTML Tag Processor

So what problem does the WP_HTML_Tag_Processor added in WordPress 6.2 solve?

Well, since its inception, WordPress has used filters to do additional processing on the HTML it generates. This approach allows the introduction of new features without having to change existing content.

For example, the loading="lazy" attribute is added this way to images in the content. This improves page performance in browsers that support lazy-loading, all without having to go in and update every image in every post.

This feature relies on regular expressions to find the images in the content. The HTML representing the images is then passed through several helper functions to apply the needed changes.

Over time, more and more such code has been added to Core. The responsive images feature, for example, uses a very similar approach, but again with its own set of functions to find and change images.

But there are three issues with this approach:

There is a lot of code duplication for common tasks such as finding HTML tags.
The regular expressions used for finding HTML tags and their attributes only work when the actual HTML matches the expectations.
There are no helper functions for common tasks such as adding, changing, or removing an HTML tag attribute.
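
To make the second issue concrete, here is a minimal sketch. The pattern below is a hypothetical one written for this illustration, not one of the regular expressions Core actually uses. It assumes that the src attribute is wrapped in double quotes, so it silently misses markup that is just as valid.

```php
<?php
// Hypothetical pattern for illustration only: it expects src="..." in double quotes.
$pattern = '/<img[^>]*\ssrc="([^"]*)"/i';

$double_quoted = '<img class="wp-image-37" src="image.jpg">';
$single_quoted = "<img class='wp-image-37' src='image.jpg'>";
$unquoted      = '<img class=wp-image-37 src=image.jpg>';

$matches_double = preg_match( $pattern, $double_quoted ); // 1: matches as expected.
$matches_single = preg_match( $pattern, $single_quoted ); // 0: valid HTML, but missed.
$matches_bare   = preg_match( $pattern, $unquoted );      // 0: valid HTML, but missed.
```

All three strings describe the same image, yet only the first one is found. Extending the pattern to cover every quoting style is possible, but it has to be repeated in every place that searches for tags.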

WordPress 6.2 introduces the first part of a new HTML API designed to fix all of these issues.

Introducing the HTML Tag Processor

The HTML Tag Processor is the first part of the HTML API. It’s an HTML parser that allows developers to find specific HTML tags. The attributes of these tags can then safely be modified through helper methods.

A parser doesn’t treat HTML as plain text. Instead it breaks up the HTML into pieces, or tokens. This is much more reliable than pure text processing, as different syntax rules or quoting styles are understood by the parser.

Unlike other HTML parsers, the HTML Tag Processor does not build up a tree of these tokens. Therefore it doesn’t know how the individual HTML tags relate to each other. It cannot match an opening tag with its closing tag, and it doesn’t know how tags are nested.

Instead it starts with the first tag, and from there goes linearly through all tags found in the HTML. Thus changes to any tags have to be done when the tag has been found.
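
As a rough sketch of this linear scan, the snippet below collects the tag names in the order the processor visits them. It assumes that WordPress 6.2 or later is loaded. The markup is made up for this example, and get_tag() is the method that returns the name of the tag the processor currently sits on.

```php
<?php
// Assumes WordPress 6.2 or later is loaded, so WP_HTML_Tag_Processor is available.
$html = '<figure><img src="image.jpg"><figcaption>With a caption</figcaption></figure>';

$processor = new WP_HTML_Tag_Processor( $html );
$tag_names = [];

// Each call to next_tag() moves forward to the next opening tag in document order.
while ( $processor->next_tag() ) {
    // get_tag() returns the upper-case name of the current tag.
    $tag_names[] = $processor->get_tag();
}

// $tag_names now lists the tags in the order they appear, without any nesting information.
```

The result is a flat list (FIGURE, IMG, FIGCAPTION). Nothing in it says that the image and the caption sit inside the figure.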

There is a rudimentary bookmarking system that allows you to mark tags. This way the processing can be restarted at the bookmarked point. However, the system is so rudimentary that we won’t cover it in this article.

Using the HTML Tag Processor

Working with the tag processor always involves four steps:

Preparing the HTML for processing: We create a new instance of the WP_HTML_Tag_Processor class, passing in the HTML that we want to modify.
Finding the tag or tags to process: The parser supports simple queries to limit results to specific tags, or classes.
Modifying the tag attributes: When a tag has been found, the attributes can be changed through helper methods. At this point additional checks can be performed, for example checking whether an attribute exists before making any changes.
Generating the updated HTML: Any changes done so far have been done inside the parser instance. To persist them into an HTML string, the parser needs to regenerate it.
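
The four steps above can be sketched as follows. This is a minimal example that assumes WordPress 6.2 or later is loaded. The markup and the "hero" class are made up for illustration.

```php
<?php
// Assumes WordPress 6.2 or later is loaded.
// The markup and the "hero" class are made up for this example.
$html = '<img class="hero" src="image.jpg"><img src="other.jpg" loading="eager">';

// Step 1: Prepare the HTML for processing.
$processor = new WP_HTML_Tag_Processor( $html );

// Step 2: Find the first img tag that has the class "hero".
if ( $processor->next_tag( [ 'tag_name' => 'img', 'class_name' => 'hero' ] ) ) {
    // Step 3: Only set the attribute if it doesn't exist yet.
    // get_attribute() returns null when the attribute is missing.
    if ( null === $processor->get_attribute( 'loading' ) ) {
        $processor->set_attribute( 'loading', 'lazy' );
    }
}

// Step 4: Generate the updated HTML string.
$updated_html = $processor->get_updated_html();
```

Only the first image gains a loading attribute. The second image is never visited, so its existing loading="eager" stays untouched.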

So we now have a good overview of how the tag processor works. Now let’s look at actual use cases.

Processing a single HTML tag

We’ll start off with the simplest use case, which is changing the attributes of a single HTML tag. The example that we’ll look at is changing the read more link that is used by the the_content function.

To change this link, we’re going to use the the_content_more_link filter. This is the default markup that WordPress generates:

<a href="http://example.com/hello-world/#more-25" class="more-link">
    <span aria-label="Continue reading Hello World">(more…)</span>
</a>

How can we add a class to this link with the HTML tag processor? Let’s first look at the full code snippet.

function wpdc_filter_the_content_more_link( $more_link ) {
    $processor = new WP_HTML_Tag_Processor( $more_link );
    $processor->next_tag( 'a' );
    $processor->add_class( 'the-content-read-more' );

    return $processor->get_updated_html();
}
add_filter( 'the_content_more_link', 'wpdc_filter_the_content_more_link' );

Let’s go over this in more detail:

We use the HTML provided by the filter to create a new WP_HTML_Tag_Processor instance.
We look for the first anchor tag in the HTML. We also could have just called next_tag() without any arguments. But since this HTML could have been filtered before, we want to have a precise query.
We add the class using the add_class() helper method.
We regenerate the HTML to persist the markup changes, and return it from the filter callback.

This is the HTML that we have just created:

<a href="http://example.com/hello-world/#more-25" class="more-link the-content-read-more">
    <span aria-label="Continue reading Hello World">(more…)</span>
</a>

Since we’re filtering the read more link, we know that there will be an anchor tag in the HTML. But what if a tag is not always present?

Processing the HTML of blocks

Much like we can filter the output of template tags, we can also filter the output of blocks using the render_block filter.

The use case that we are looking at here is that we want to add a class to the figcaption tag of image blocks. This is the simplified markup of an example image:

<figure class="wp-block-image">
    <img width="800" height="600" src="http://example.com/wp-content/uploads/2023/04/image.jpg" alt="Sample Image" class="wp-image-37">
    <figcaption class="wp-element-caption">With a caption</figcaption>
</figure>

The challenge here is that images may or may not have a caption. So how can we account for this? Here is the full code.

function wpdc_filter_the_image_block( $block_content, $block ) {
    if ( ! $block_content || $block['blockName'] !== 'core/image' ) {
        return $block_content;
    }

    $processor = new WP_HTML_Tag_Processor( $block_content );

    if ( ! $processor->next_tag( 'figcaption' ) ) {
        return $block_content;
    }

    $processor->add_class( 'image-caption' );

    return $processor->get_updated_html();
}

add_filter( 'render_block', 'wpdc_filter_the_image_block', 10, 2 );

The solution is to check the return value of the next_tag() method. It returns true if a matching tag was found. Otherwise it returns false. This way we can make sure that we have the HTML tag that we are looking for.

While this example is a bit more complex, we have only dealt with single HTML tags so far. What if we need to process all tags of a certain kind?

Processing the entire post content

Imagine that you have a blog, and your content contains affiliate links. You want to mark these links with rel="sponsored" for SEO purposes. With the HTML Tag Processor, this is straightforward to do:

function wpdc_filter_affiliate_links_in_content( $content ) {
    $processor = new WP_HTML_Tag_Processor( $content );

    while ( $processor->next_tag( [ 'tag_name' => 'a', 'tag_closers' => 'skip' ] ) ) {
        if ( ! str_starts_with( (string) $processor->get_attribute( 'href' ), 'https://affiliate.com/' ) ) {
            continue;
        }

        $processor->set_attribute( 'rel', 'sponsored' );
    }

    return $processor->get_updated_html();
}
add_filter( 'the_content', 'wpdc_filter_affiliate_links_in_content' );

But this time we don’t just pass in the HTML of a single link or block. Instead we pass in the entire post content. Due to how the parser is written, this is still performant.

A post could contain any number of links. So how do we catch all of these? We’re relying again on the next_tag() method. But we now call it inside of a while() statement.

This means that the body of the while loop will be executed as long as there are anchor tags remaining. The loop will work through all of them from first to last. We also specify that we want to skip tag closers, so the closing </a> tags are never visited. Skipping closers is also the default behavior, so the option is spelled out here only for clarity.

Once we have found an anchor tag, we use the get_attribute() method to get the referenced URL. If the link does not point to our affiliate site, we will move on to the next one.

Links that do point to the affiliate site get the rel attribute set with the set_attribute() method. Note that set_attribute() replaces any rel value that the link already has.
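
If your links may already carry rel values such as noopener, you may want to keep them. The sketch below shows one possible way to do that. It assumes WordPress 6.2 or later is loaded, and wpdc_add_sponsored_rel() is a hypothetical helper written for this example, not part of the HTML API.

```php
<?php
// Assumes WordPress 6.2 or later is loaded.
// wpdc_add_sponsored_rel() is a hypothetical helper written for this sketch.
function wpdc_add_sponsored_rel( WP_HTML_Tag_Processor $processor ) {
    $rel = $processor->get_attribute( 'rel' );

    // get_attribute() returns null for a missing attribute and true for one without a value.
    $tokens = is_string( $rel ) ? preg_split( '/\s+/', $rel, -1, PREG_SPLIT_NO_EMPTY ) : [];

    // rel keywords are case-insensitive, so compare in lower case.
    if ( ! in_array( 'sponsored', array_map( 'strtolower', $tokens ), true ) ) {
        $tokens[] = 'sponsored';
    }

    $processor->set_attribute( 'rel', implode( ' ', $tokens ) );
}

$processor = new WP_HTML_Tag_Processor( '<a href="https://affiliate.com/item" rel="noopener">Buy</a>' );

if ( $processor->next_tag( 'a' ) ) {
    wpdc_add_sponsored_rel( $processor );
}

// The rel attribute now holds both noopener and sponsored.
$updated_html = $processor->get_updated_html();
```

Inside the filter callback, the helper call would take the place of the plain set_attribute() call in the loop.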

Wrapping up

The HTML Tag Processor turns what would have required extensive use of regular expressions into a straightforward set of instructions. Even developers who are not familiar with the parser will have an idea of what this code does.

Right now the HTML API is limited, and only covers changing the attributes of HTML tags. But the Gutenberg team is already working on several improvements and extensions.

If you have used the HTML Tag Processor in your projects, please share your experiences and thoughts. I look forward to reading any comments and feedback!