Think Tanks Should Embrace AI Bots

A few months back, I learned that one of our clients had hired a new AI director—not from a press release, and not from an intro on a regular client call. I learned because she emailed me with the kind of question I wished every client would ask.

She wanted to know our strategy on LLM crawlers. More precisely, she wanted to know why the strategy appeared to be: block everything except Google. The robots.txt on her site looked like this:

User-Agent: *
Content-signal: search=yes,ai-train=no
Allow: /

User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

To paraphrase her email: “I want to be discoverable in LLMs. I want to know why we have it set up this way and what factors I’m not considering.”

I had no idea who’d put those rules there. I hadn’t.

The culprit was Cloudflare

The path to the rule was Security → Settings → Bot Traffic → “Instruct AI bot traffic with robots.txt.” A toggle, on by default for any domain Cloudflare onboarded after July 1, 2025—the day Cloudflare announced what it called “Content Independence Day”.

The intent of this wasn’t bad. AI training crawlers consume bandwidth, scrape content, and (per Anthropic’s own data) send back almost nothing in return: ClaudeBot crawled 13,528 pages for every one human visit Anthropic sent back to publishers as of mid-April 2026. OpenAI’s ratio was 1,252 to one. Google’s ratio was closer to five to one, and Google sent 87.52% of all search referrals on Cloudflare’s network during the same window. The asymmetry is real, and the case for blocking is real, for the right kind of publisher.

Cloudflare’s block-by-default serves a real interest. AP, Time, DMG Media, and Sky News all publicly endorsed the move when it shipped. It also paved the way for Cloudflare’s Pay-Per-Crawl marketplace, which lets publishers charge AI crawlers per request.

But the rule shipped without anyone asking the client what their economic model was. And Cloudflare isn’t the only infrastructure layer doing this. As Will Scott documented in Search Engine Land this week, managed WordPress hosts like WP Engine apply similar blocks at the edge layer, where they’re invisible to most SEO audits.

The wrong default for influence publishers

Cloudflare’s block-by-default was built for one publisher class. It is the wrong default for another.

Think tanks are not ad-supported publishers. They don’t sell pageviews to advertisers. They sell ideas to policymakers, journalists, donors, and—increasingly—to whoever shows up first when someone asks ChatGPT “what should we do about Iran?” or “is South Africa’s corruption a financial-system risk?”

The academic literature on what think tanks actually do, like Andrew Selee’s What Should Think Tanks Do?, converges on something close to a definition: think tanks succeed when policy moves toward their policy preferences. In that light, pageviews are a means, not the end. So is being cited. So is being read by a Senate staffer, or quoted by a Wall Street Journal columnist, or, increasingly, baked into an LLM’s default model of how the world works.

A line often attributed to Truman, though it may more accurately be traced to a 19th-century English Jesuit named Father Strickland, captures what I’m driving at: “It is amazing what you can accomplish if you do not care who gets the credit.” For an ad-supported publisher, credit is the entire business. A byline at the top of a Google result is an asset. They have a real reason to fight against being summarized inside an AI Overview that strips their brand, and against being trained on without compensation.

For a think tank, credit matters less than influence. If a graduate student writes a stronger paper because GPT-5 was trained on a defense-policy white paper from 2018, that’s a win, even when no name appears. If a Senate staffer asks Claude about housing policy and gets back a response shaped by your fellows’ work, that’s a win, even when Claude doesn’t cite anyone. Both ad-supported publishers and think tanks measure influence, but they measure it on different surfaces, and the surfaces require different bot policies.

Two kinds of crawler, not one

The “block AI bots” debate gets stuck because it treats AI bots as one thing. They aren’t. They split cleanly into two categories.

Training crawlers: GPTBot from OpenAI, Google-Extended, CCBot from Common Crawl, ClaudeBot in its training mode, and others pull content into model training data. They don’t send referral traffic back. Their job is to make the model smarter. Block them and your content does not shape future LLM responses.
RAG crawlers: OAI-SearchBot, ChatGPT-User, PerplexityBot, and others fetch content live when a user asks a question, in order to give a model context for the answer. This is RAG, or Retrieval-Augmented Generation: a fetch-and-explain approach to AI rather than an explain-from-training-data one. These search-powered AIs send referral traffic, even if the volumes are small. Block them and you become invisible in AI search itself. OpenAI’s own bot documentation maintains the distinction explicitly: GPTBot and OAI-SearchBot are independent settings.

Most ad-supported publishers want to block training crawlers and allow RAG crawlers—block what doesn’t pay, allow what might. That is a defensible position.

Think tanks should want both. Training crawlers shape the model that a graduate student will use to write a paper next semester. RAG crawlers shape the answer a journalist will get tomorrow morning. With AI Overviews and AI Mode increasingly visible in Google search, and tools like Claude exploding in popularity with non-coders, these RAG crawlers aren’t some early-adopter or bleeding-edge phenomenon. They’re the default way information gets discovered now.
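
For anyone who maintains robots.txt by hand, here is a minimal Python sketch of the two policies described above. The bot tokens are the ones named in this section; the function name build_robots_txt and its two flags are hypothetical, invented for illustration, and the grouping is this article's rather than any official registry. Treat it as a starting point, not a complete inventory of crawlers.

```python
# Crawler tokens as named in this article. The two-way grouping is the
# article's own, not an official registry, and the lists are not exhaustive.
TRAINING_CRAWLERS = ["GPTBot", "Google-Extended", "CCBot", "ClaudeBot"]
RAG_CRAWLERS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot"]


def build_robots_txt(allow_training: bool, allow_rag: bool) -> str:
    """Return robots.txt text with one explicit group per named crawler.

    Hypothetical helper: the name and flags are illustrative only.
    """
    # Everyone not named below (including ordinary search crawlers) is allowed.
    lines = ["User-agent: *", "Allow: /", ""]
    policy = ((TRAINING_CRAWLERS, allow_training), (RAG_CRAWLERS, allow_rag))
    for bots, allowed in policy:
        rule = "Allow: /" if allowed else "Disallow: /"
        for bot in bots:
            lines += [f"User-agent: {bot}", rule, ""]
    return "\n".join(lines)


# Ad-supported publisher: block what doesn't pay, allow what might.
ad_supported = build_robots_txt(allow_training=False, allow_rag=True)

# Think tank: allow both kinds of crawler.
think_tank = build_robots_txt(allow_training=True, allow_rag=True)

print(think_tank)
```

The explicit Allow groups in the think-tank version are redundant with the wildcard rule, but they make the decision visible to whoever reads the file next.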

What we found across our client base

After that first email, I went back through every Tallest Tree client running on Cloudflare (which is the majority of them) and asked them plainly: do you want to block AI training, or allow it?

Every think tank we asked toggled the block off. Not most. Every one. More than two dozen organizations across different ideological orientations, policy areas, and sizes—the answer was the same as soon as the question was framed in their economic terms. None of them asked to block training. A few asked, with audible enthusiasm, the opposite question: how do we make sure we are showing up more in AI training data?

That same AI director went on to build an internal LLM trained on her organization’s own corpus, so her staff can “talk” to their collective archive. For these sorts of clients, AI isn’t something they’re trying to block or toll; it’s something they’re actively embracing.

Citation was already happening

One thing surprised me before we even made the change at Cloudflare: the client was already being cited.

The night I confirmed Cloudflare was the source of the rule, I tested the live state of things. I asked Gemini and ChatGPT a policy question using some of the specific verbiage from a recent report. Both Gemini and ChatGPT cited the paper. The “no AI bots” toggle was still on in Cloudflare and training crawlers had been blocked for months, yet the client was still cited by both AIs.

The reason is unglamorous. This client ranks consistently high in Google for terms specific to their policy focus, and at the time both Gemini and ChatGPT were leaning on Google’s index for grounding. Strong organic SEO and entity authority were already routing the client into AI answers, even with the training block in place. That is the point of the entity work we have been writing about—Carolina Journal’s 5x Top Stories lift, Illinois Policy’s 62% organic traffic gain, and the entity authority playbook all describe the same mechanism. If you are connected to the Knowledge Graph, you are already partway into the AI answer surface, whether or not anything in Cloudflare has been toggled.

Toggling Cloudflare’s blockers off compounds that. Training crawlers can pull the content. Dedicated RAG crawlers can fetch the page directly instead of relying on Google’s cached index. The strong organic floor stays. The AI ceiling rises.

What to do tomorrow

Again, if you are a publisher on Cloudflare, the toggle is at Security → Settings → Bot Traffic → “Instruct AI bot traffic with robots.txt.” Verify the current state. If you are a think tank, an advocacy organization, an academic institution, or any publisher who doesn’t rely on ad revenue, turning that off is probably the right move. Pair it with a clean robots.txt that allows the AI bots you want and disallows the ones you don’t, on purpose, with knowledge of which is which.
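
One way to verify the current state is to read robots.txt the way a well-behaved crawler would. The sketch below uses Python's standard-library urllib.robotparser against an excerpt of the rules quoted at the top of this article. The audit function name, the list of bots to check, and the example.org URL are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the robots.txt quoted earlier in this article.
ROBOTS_TXT = """\
User-Agent: *
Content-signal: search=yes,ai-train=no
Allow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /
"""

# Bots named in this article; extend the list for your own audit.
BOTS_TO_CHECK = [
    "GPTBot", "ClaudeBot", "CCBot", "Google-Extended",
    "OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
]


def audit(robots_txt: str, bots: list[str],
          url: str = "https://example.org/") -> dict[str, bool]:
    """Map each bot token to True if robots.txt lets it fetch `url`.

    Hypothetical helper; example.org is a placeholder URL.
    """
    parser = RobotFileParser()
    # The parser skips directives it does not recognize, such as Content-signal.
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


report = audit(ROBOTS_TXT, BOTS_TO_CHECK)
for bot, allowed in report.items():
    print(f"{bot:16} {'allowed' if allowed else 'blocked'}")
```

To check a live site instead of pasted text, RobotFileParser also offers set_url() and read(). Keep in mind that this only reports what robots.txt says. It will not reveal blocks applied at the edge layer, which is the gap noted earlier with managed hosts.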

If you are an ad-supported publisher, leaving the block on is totally reasonable, and Cloudflare’s Pay-Per-Crawl is the natural next move for monetizing the access AI bots want.

Either way: make it a deliberate decision, not a CDN default. The bot policy you ship with is the bot policy you live with—until you find out, six months later, that an AI director somewhere is asking you why her stuff isn’t discoverable in LLMs.