Google's Content Warehouse API

The May 2024 leak of Google’s internal API documentation moved the industry from speculation to evidence. While we cannot reverse-engineer the exact values of Google’s scoring functions, the attribute names it exposed (siteFocusScore, siteRadius, and siteEmbeddings) strongly suggest that search ranking relies heavily on Vector Space Models.

Google no longer just matches keywords; it calculates the distance between the mathematical vector of a query and the vector of your content. Below is how our architecture, TopicalBoost, uses Schema and internal linking to align with these revealed parameters.

1. Reducing Ambiguity: siteEmbeddings and pageEmbeddings

The Mechanism: According to the leak, Google generates siteEmbeddings (a vector representation of your domain) and pageEmbeddings (vectors for individual URLs). Ranking depends on the alignment between these vectors.
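
To make "alignment between vectors" concrete, here is a minimal Python sketch using cosine similarity, a common way to compare embeddings. The leak does not reveal which metric Google uses, so both the metric and the toy three-dimensional vectors below are assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors; real embeddings have hundreds of dimensions.
site_embedding = [0.9, 0.1, 0.0]   # assumed: a finance-focused site
on_topic_page = [0.8, 0.2, 0.1]    # assumed: a page about inflation
off_topic_page = [0.0, 0.2, 0.9]   # assumed: a page about cooking

on_topic_score = cosine_similarity(site_embedding, on_topic_page)
off_topic_score = cosine_similarity(site_embedding, off_topic_page)
```

In this toy setup the on-topic page scores much closer to 1 than the off-topic page, which is the kind of alignment the attribute names hint at.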

The Challenge: Google relies on unstructured text analysis to generate these vectors. If your content uses ambiguous terms (e.g., “Apple” could mean a fruit or a tech company), the resulting vector may “drift” away from the user’s intent.

The Solution: TopicalBoost uses Natural Language Processing (NLP) to extract entities and inject them into the page’s <head> using JSON-LD mentions Schema. By explicitly linking text to distinct @id URIs (like Wikidata), we act as a disambiguation layer. We can’t directly manipulate the vector, but we help ensure Google’s NLP interprets your content correctly, resulting in a more accurate embedding assignment.

Example: This much-simplified example (we build on a much more complex JSON structure generated by Yoast SEO) shows how the JSON-LD markup we use helps Google contextualize a page. The mentions array carries the TopicalBoost logic: by explicitly defining entities found in the text and linking them to trusted sources (Wikidata or Wikipedia) via sameAs, we remove ambiguity for the NLP algorithms, which is how we aim to support siteFocusScore.

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How the Federal Reserve Impacts 2025 Inflation Rates",
  "mainEntityOfPage": {
    "@type": "WebPage",
    "@id": "https://www.example.com/federal-reserve-inflation-2025"
  },
  "mentions": [
    {
      "@type": "Organization",
      "name": "Federal Reserve",
      "sameAs": "https://www.wikidata.org/wiki/Q53536"
    },
    {
      "@type": "Person",
      "name": "Jerome Powell",
      "jobTitle": "Chair of the Federal Reserve",
      "sameAs": "https://www.wikidata.org/wiki/Q6182718"
    },
    {
      "@type": "Thing",
      "name": "Inflation",
      "description": "A general increase in prices and fall in the purchasing value of money.",
      "sameAs": "https://en.wikipedia.org/wiki/Inflation"
    },
    {
      "@type": "CreativeWork",
      "name": "Consumer Price Index",
      "alternateName": "CPI",
      "sameAs": "https://www.wikidata.org/wiki/Q180687"
    }
  ]
}
</script>
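
Markup like this can be assembled programmatically once entities have been extracted. The Python sketch below is illustrative only and is not TopicalBoost's actual implementation: the function names and the input dictionary shape (name, type, same_as) are assumptions, and the entity extraction step itself is out of scope.

```python
import json


def build_article_jsonld(headline, page_url, entities):
    """Build an Article object whose "mentions" list the given entities.

    `entities` is an assumed input shape: dicts with "name" and "same_as",
    plus an optional "type" that defaults to the generic schema.org Thing.
    """
    mentions = [
        {
            "@type": entity.get("type", "Thing"),
            "name": entity["name"],
            "sameAs": entity["same_as"],
        }
        for entity in entities
    ]
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "mainEntityOfPage": {"@type": "WebPage", "@id": page_url},
        "mentions": mentions,
    }


def render_script_tag(data):
    """Serialize the object into a JSON-LD script element."""
    payload = json.dumps(data, indent=2, ensure_ascii=False)
    # Escape "</" so a string value cannot close the script element early;
    # "\/" is a valid JSON escape for "/".
    payload = payload.replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


article = build_article_jsonld(
    "How the Federal Reserve Impacts 2025 Inflation Rates",
    "https://www.example.com/federal-reserve-inflation-2025",
    [
        {
            "name": "Federal Reserve",
            "type": "Organization",
            "same_as": "https://www.wikidata.org/wiki/Q53536",
        },
        {"name": "Inflation", "same_as": "https://en.wikipedia.org/wiki/Inflation"},
    ],
)
script_tag = render_script_tag(article)
```

Serializing with json.dumps guarantees valid JSON, which matters because JSON-LD has no comment syntax and a single stray character can make the whole block unparseable.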

2. Defining Scope: siteRadius and siteFocusScore

The Mechanism:

siteRadius implies a measurement of semantic variance. A high radius suggests a diluted, “Jack of all trades” site.
siteFocusScore correlates with the depth of expertise on a specific topic.
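
One way to picture semantic variance is as the average distance of a site's pages from their shared centre. The Python sketch below works under that assumption: the leak exposes only the attribute name, so the formula, the function names, and the toy vectors are hypothetical, and siteFocusScore is not modeled here at all.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def site_radius(page_embeddings):
    """Mean cosine distance from each page to the site's centre.

    Hypothetical definition for illustration; not Google's formula.
    """
    unit_pages = [normalize(v) for v in page_embeddings]
    center = normalize(centroid(unit_pages))
    distances = [
        1.0 - sum(a * b for a, b in zip(page, center)) for page in unit_pages
    ]
    return sum(distances) / len(distances)


# Toy data: a tightly focused site versus a "Jack of all trades" site.
focused_site = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.85, 0.1, 0.05]]
broad_site = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

focused_radius = site_radius(focused_site)
broad_radius = site_radius(broad_site)
```

Under this definition the focused site yields a radius near zero, while the broad site, whose pages point in unrelated directions, yields a much larger one.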

The Challenge: You may have high-quality content that goes unrecognized because the entities are “trapped” in unstructured text strings. If Google’s confidence score in extracting an entity is low, that page contributes less to your overall siteFocusScore.

The Solution: TopicalBoost increases machine confidence. We don’t just tag the main keyword; we map the page’s entities to Google’s Knowledge Graph.

3. Navigational Context: phraseAnchorSpamPenalty

The Mechanism: The leak highlighted phraseAnchorSpamPenalty, confirming Google penalizes unnatural, keyword-stuffed anchor text.

The Solution: TopicalBoost generates internal links using Entity Labels as anchors. Instead of spammy commercial anchors (e.g., “best mortgage rates”), we use the precise entity name (e.g., “The Federal Reserve”). This creates an “Annotated Text” structure that is semantic, helpful to users, and safe from spam classifiers.
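
The entity-label approach can be sketched as a small Python helper that links only the first mention of an entity and reuses the label exactly as it appears in the text. This is an illustration rather than TopicalBoost's actual code: the function name and the /topics/federal-reserve URL are hypothetical, and the input is assumed to be a plain-text fragment with no existing HTML tags.

```python
import html
import re


def link_first_mention(text, label, url):
    """Wrap the first whole-word occurrence of `label` in an anchor.

    Assumes `text` is a plain-text fragment without existing HTML tags.
    The anchor text is taken from the page itself, so it stays natural.
    """
    pattern = re.compile(r"\b" + re.escape(label) + r"\b", re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        return text  # entity not mentioned: leave the text untouched
    anchor = '<a href="{}">{}</a>'.format(
        html.escape(url, quote=True), match.group(0)
    )
    return text[: match.start()] + anchor + text[match.end():]


paragraph = (
    "Markets reacted after the Federal Reserve held rates. "
    "The Federal Reserve meets again in March."
)
linked = link_first_mention(
    paragraph, "Federal Reserve", "/topics/federal-reserve"
)
```

Linking only the first mention is a design choice made for this sketch: it keeps anchors sparse, so the same phrase is not repeated as a link throughout the page.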

TL;DR for the Dev Team

The defining revelation of Google’s API leak is that ranking depends on how well a page aligns with the perceived area of expertise of the overall site. TopicalBoost addresses this mechanism directly.

TopicalBoost moves optimization from the Presentation Layer (HTML text) to the Data Layer (Knowledge Graph). By automating high-fidelity JSON-LD and semantic internal linking, we provide the structured disambiguation required for Google’s vector-based ranking systems to function efficiently.
