The Genuine Article Tag

In the forest of search results, clarity is currency.

When we conduct technical SEO audits at Tallest Tree, we often see developers falling into the trap of being “Good Web Citizens” at the expense of their SEO. They follow the W3C HTML5 specifications with academic precision: wrapping the blog post in <article>, the sidebar widgets in <article>, and every single user comment in a nested <article>.

Technically, this is valid code. Practically, it is a semantic soup.

While you can use multiple <article> tags on a page, the question isn’t what is allowed—it’s what works. If you want Googlebot and other extraction algorithms to easily identify your Main Content (MC), you need to stop thinking about validation and start thinking about DOM scoring.

The Reality: A Look Inside the Open-Source Blueprint (Mozilla)

In the world of SEO, we are usually staring at a black box, guessing at the algorithms inside. But when it comes to content extraction, we actually have a transparent blueprint available to us: Mozilla’s Readability.js.

This is the open-source library that powers the “Reader View” in Firefox. It is also one of the most widely referenced models for how a machine takes a messy HTML document and decides what text actually matters. Because the code is public, we don’t have to guess how it works. We can see the math.

Mozilla’s parser doesn’t just “read” your page; it plays a Scoring Game:

The Scan: The parser iterates through every block element (<div>, <article>, <section>).
The Points: It assigns a “content score” to each container.
- Points Added: It adds points for paragraphs (<p>) and text density. It specifically looks for commas (which indicate sentence structures rather than menu items).
- Points Subtracted: It penalizes containers with high link density. If a container has 500 words but 50 links, the parser assumes it is a navigation menu or a blog roll, not an article.
- The “Class” Penalty: It strips points for class names that imply noise, such as comment, share, or widget.
The Winner: The container with the highest aggregate score is declared the “Main Content” and displayed to the user. Everything else is trashed.

When every block on the page is wrapped in <article>, you make it harder for the algorithm to determine which container is the true winner.
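
To make the scoring game concrete, here is a minimal Python sketch of the idea. It is not Readability.js's actual code: the Container type, the NOISE_CLASSES pattern, the point values, and the penalty weight are all assumptions chosen for illustration, and the real library walks the DOM instead of taking pre-counted inputs.

```python
import re
from dataclasses import dataclass

# Hypothetical noise pattern for this sketch; the real library's list is longer.
NOISE_CLASSES = re.compile(r"comment|share|widget", re.IGNORECASE)


@dataclass
class Container:
    name: str              # label used only in this example
    class_name: str        # value of the element's class attribute
    paragraphs: list[str]  # text of each <p> inside the container
    link_text_chars: int   # characters of text that sit inside <a> tags


def content_score(c: Container) -> float:
    """Toy content score: reward prose, punish links and noisy class names."""
    score = 0.0
    for text in c.paragraphs:
        score += 1                         # one point per paragraph
        score += text.count(",")           # commas hint at real sentences
        score += min(len(text) // 100, 3)  # capped bonus for longer text

    # Scale the text points down by link density (share of text inside links).
    total_chars = sum(len(t) for t in c.paragraphs)
    link_density = min(1.0, c.link_text_chars / total_chars) if total_chars else 0.0
    score *= 1 - link_density

    # Assumed flat penalty for class names that imply noise.
    if NOISE_CLASSES.search(c.class_name):
        score -= 25
    return score


post = Container(
    "post",
    "entry-content",
    ["One, two, three. " * 10, "Plain sentence, with a comma."],
    0,
)
related = Container("related", "related-widget", ["Link A", "Link B", "Link C"], 18)

winner = max([post, related], key=content_score)
print(winner.name)  # prints "post"
```

In this toy run the post scores 24 and the link-only widget scores -25, so the gap between winner and runner-up is wide. Splitting the same prose across several competing containers would narrow that gap.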

The Application: What We Know About Google

So, how does this relate to Google?

Google does not publish its source code, so we cannot say for certain that they use this exact scoring loop. However, we know they are solving the exact same problem.

Martin Splitt, a Google Search Advocate, has explicitly confirmed that Googlebot processes the DOM to separate the “Main Content” from the “Boilerplate” (nav, footer, sidebars). He has referred to this as “Centerpiece Annotation.”

What we know: Google builds a “Layout Tree.” They analyze the visual and semantic structure of the page to decide which section is the “Centerpiece.” This Centerpiece is weighted heavily for ranking, while the boilerplate is weighted lightly.
The Contrast: Unlike Mozilla, Google likely uses more advanced signals, including visual rendering (where the element sits on the screen) and perhaps more complex NLP (Natural Language Processing).

The Strategic Inference: While we can’t see Google’s proprietary code, Mozilla’s Readability.js acts as a practical training dummy. It is the most transparent model we have for content extraction.

If you optimize your HTML structure to “win” the Mozilla scoring game (by consolidating your text into a single <article> and pushing link-heavy widgets into <aside> tags), you are very likely also making the job easier for Google’s Centerpiece Annotation. You are removing the ambiguity and handing the algorithm a clear winner.

The Tallest Tree Strategy: Create a “Content Centerpiece”

To win the scoring game, your goal is to create one dominant, high-scoring <article> container that acts as the undeniable “Centerpiece” of the URL.

While you can use other article tags for independent widgets, we recommend keeping the hierarchy obvious.

1. What Goes INSIDE the Primary <article> Tag

This tag should wrap the specific entity the user came to see. If you were to print the page to PDF, this is the data you would keep.

Some layouts won’t allow all of these to be wrapped in the <article> tag, so they are listed in order of importance:

The Body Content: Your paragraphs, lists, H2s, and data tables.
The Byline & Date: Critical metadata that belongs to the story.
The H1: This is the headline of the entity.
The Featured Image: Assuming it is contextually relevant to the story.

2. What Must Stay OUTSIDE the Primary <article> Tag

This is where most sites fail. By keeping “noise” out of your primary container, you increase its text density score.

Comments: The W3C spec allows comments to be nested <article> tags. Don’t do this for SEO. User-generated content usually has lower quality and different keyword relevance than your expert content. Wrap comments in a <section id="comments"> or a <div> outside the main article.
“Related Posts” Cards: These are internal links, not part of the current story. High link density hurts the content score. Move them to a <nav> or <aside>.
Share Buttons: These are functional tools, not semantic content.
Sidebar Widgets: Even if they contain text, they are supplementary.
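
One way to check a template against the inside/outside rules is a small audit script. The sketch below uses Python's standard html.parser module to count top-level and nested <article> elements. The ArticleAudit class, the audit helper, and both sample snippets are hypothetical and exist only for this example; a real audit would run against your rendered HTML.

```python
from html.parser import HTMLParser


class ArticleAudit(HTMLParser):
    """Counts <article> elements and notes which ones are nested."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0      # number of <article> tags currently open
        self.top_level = 0  # <article> elements not inside another <article>
        self.nested = 0     # <article> elements inside another <article>

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            if self.depth == 0:
                self.top_level += 1
            else:
                self.nested += 1
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1


def audit(html: str) -> tuple[int, int]:
    """Return (top_level_articles, nested_articles) for an HTML string."""
    parser = ArticleAudit()
    parser.feed(html)
    parser.close()
    return parser.top_level, parser.nested


# Sample markup following the recommendations: one article, noise outside it.
RECOMMENDED = """
<main>
  <article>
    <h1>Post title</h1>
    <p>Byline and date</p>
    <p>Body content, tables, and lists.</p>
  </article>
  <section id="comments"><div class="comment">Nice post</div></section>
  <aside><a href="/related-post">Related post</a></aside>
</main>
"""

# Sample "semantic soup": a nested comment article plus a widget article.
SOUP = """
<article>
  <p>Body content.</p>
  <article class="comment">Nice post</article>
</article>
<article class="widget">Newsletter signup</article>
"""

print(audit(RECOMMENDED))  # (1, 0)
print(audit(SOUP))         # (2, 1)
```

On a core content page, a result of (1, 0) matches the single-centerpiece structure described here. Anything else is a prompt to look at what is competing with the primary <article>.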

The Visual Metaphor: Signal vs. Noise

Think of your page’s HTML structure like a desk.

The “Semantic Soup” Desk: It is covered in 50 sticky notes. Some are important tasks, some are doodles, some are phone numbers. A manager (Googlebot) walking by can’t tell which note matters.
The “Centerpiece” Desk: It has a single, clean file folder in the center containing the report. There might be a stapler (sidebar) and a coffee cup (comments) on the desk, but they are clearly not the report.

Summary: Prune for Growth

You don’t need to banish the <article> tag from the rest of your page entirely, especially on Archive or Feed pages where every card is an independent entry.

But on your core content pages—your blog posts, service pages, and case studies—you must establish a clear hierarchy.

By reserving the primary <article> wrapper for your unique content and stripping away the noise, you hand the parser the answer key. You ensure your content wins the scoring game every time.

Keep it semantic. Keep the signal strong.
