Below is a preview of our updated Think Tank SEO Ranking that we’ve been working on between client jobs for the last several months.
In 2020 and 2021 we ranked the SEO performance of 300+ think tanks using these sources:
- Financial disclosure data from ProPublica
- SEO data from Ahrefs
- Indexed web page data from Google
- Web software data from Wappalyzer
We then shipped that data to Jason Sorens, director of the Center for Ethics in Society at St. Anselm College and a political scientist of some note who knows his way around statistics. The good professor worked his statistical magic on the data and sent us back a ranking.
This process was a lot of work on our part. We looked up the data manually and then merged it together using Excel—not the way I prefer to spend my weekends.
We were also imposing on Professor Sorens’s time, so we only updated our ranking annually—he’s busy working on cool projects like the New Hampshire portion of the National Zoning Atlas.
So, we thought we’d save everyone involved time by creating a process that involved less manual data entry, less chance for error, and a lot less time investment. By doing so, we’d not only save ourselves a few headaches, but we’d also be able to update the ranking more frequently. That way, our ranking could become a way for think tanks to gauge their progress as they seek to improve their SEO and expand their reach in the marketplace of ideas.
So we built a process using APIs from ProPublica, Ahrefs, SerpAPI (substituting for Google itself), and Wappalyzer. We’re now able to populate a database with over 6,500 data points after gathering only a few basic inputs for each think tank. We’re also able to collect historical search performance data reaching back many years—very cool.
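For the curious, here's a rough sketch of what that gathering step looks like. This isn't our production code: it's a minimal Python example assuming ProPublica's Nonprofit Explorer v2 API and SerpAPI's Google engine, and the response field names below are our best reading of each API. The Ahrefs and Wappalyzer calls follow the same pattern but depend on your plan, so we've left them out.

```python
# A minimal sketch of the gathering step (not our production code).
# Assumptions: ProPublica's Nonprofit Explorer v2 API and SerpAPI's Google
# engine; the response field names are our best reading of each API.
import requests

SERPAPI_KEY = "YOUR_SERPAPI_KEY"  # placeholder

def fetch_expenses(ein: str) -> float | None:
    """Total functional expenses from the most recent filing with data."""
    url = f"https://projects.propublica.org/nonprofits/api/v2/organizations/{ein}.json"
    data = requests.get(url, timeout=30).json()
    filings = data.get("filings_with_data", [])
    return filings[0].get("totfuncexpns") if filings else None

def fetch_indexed_pages(domain: str) -> int | None:
    """Approximate Google's indexed-page count via a site: query on SerpAPI."""
    params = {"engine": "google", "q": f"site:{domain}", "api_key": SERPAPI_KEY}
    data = requests.get("https://serpapi.com/search.json", params=params, timeout=30).json()
    return data.get("search_information", {}).get("total_results")

# Hypothetical inputs; swap in a real EIN and domain.
print(fetch_expenses("012345678"), fetch_indexed_pages("example.org"))
```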
However, we’re not statisticians. To be precise, my last statistics training was my sophomore year of college, which was over 20 years ago—I got a B because group projects in college are a nightmare. Though we deal with data frequently, we typically filter, sort, create averages, graph, and look for patterns. We don’t run regressions.
Even so, we thought, how hard can it be? With some basic instructions from Statology and help from ChatGPT, we created our own multiple regression analysis.
Using Google Sheets, we gathered our three independent variables:
- Total Annual Expenses
- Domain Rating
- Indexed Pages
With the goal of creating a model that would predict our dependent variable:
- Monthly Organic Traffic
Because think tank expenses, monthly organic traffic, and pages indexed by Google vary by several orders of magnitude, we used the natural log to make our data more normally distributed. We ended up with data that looks like this:
| ln Observed Monthly Organic Traffic | ln Expenses | Domain Rating | ln Indexed Pages |
| --- | --- | --- | --- |
| 15.48721053 | 15.65694028 | 84 | 13.91987057 |
| 10.03249623 | 14.92562478 | 63 | 13.76737018 |
| 13.15532222 | 14.59250382 | 81 | 13.17495583 |
| 14.62737589 | 18.36879218 | 90 | 12.45683136 |
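If you'd rather do this outside a spreadsheet, the transform is a one-liner in Python. Here's a minimal sketch using pandas; the raw values below are approximations chosen only to roughly reproduce the four rows above (the real inputs live in our sheet):

```python
# A minimal sketch of the log transform. Raw values are illustrative
# approximations of the four rows above, not our actual dataset.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "traffic":       [5_320_000, 22_750, 517_000, 2_250_000],
    "expenses":      [6_310_000, 3_040_000, 2_180_000, 95_000_000],
    "domain_rating": [84, 63, 81, 90],
    "indexed_pages": [1_110_000, 953_000, 527_000, 257_000],
})

for col in ["traffic", "expenses", "indexed_pages"]:
    df[f"ln_{col}"] = np.log(df[col])  # natural log tames the wide ranges

print(df[["ln_traffic", "ln_expenses", "domain_rating", "ln_indexed_pages"]])
```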
We then used LINEST to generate these values:
| | Indexed Pages | Domain Rating | Expenses | Intercept |
| --- | --- | --- | --- | --- |
| Coefficient | 0.6089382428 | 0.05931765768 | 0.3408197202 | -5.239243803 |
| Coefficient s. error | 0.0679724013 | 0.008969595353 | 0.06034945207 | 0.7370123999 |
| Coefficient of determination / s. error for y estimate | 0.7785375929 | 1.195599871 | #N/A | #N/A |
| F statistic / degrees of freedom | 282.4069367 | 241 | #N/A | #N/A |
| Regression sum of squares / residual sum of squares | 1211.067455 | 344.4996313 | #N/A | #N/A |
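For anyone replicating this outside Google Sheets: LINEST is just ordinary least squares, so any OLS routine should reproduce these numbers. Here's a sketch using statsmodels (our substitution, not part of the original workflow), continuing from the DataFrame above. With only the four illustrative rows the fit would be exact, so picture df holding all 245 organizations:

```python
# A sketch of the regression, continuing from the df defined above.
# statsmodels is our choice here; any OLS routine gives the same output
# as LINEST. Imagine df holding all 245 rows, not just the four shown.
import statsmodels.api as sm

X = sm.add_constant(df[["ln_expenses", "domain_rating", "ln_indexed_pages"]])
model = sm.OLS(df["ln_traffic"], X).fit()

print(model.params)                  # intercept plus the three coefficients
print(model.bse)                     # coefficient standard errors
print(model.rsquared)                # coefficient of determination (R²)
print(model.fvalue, model.df_resid)  # F statistic, residual degrees of freedom
print(model.ess, model.ssr)          # regression and residual sums of squares
```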
Then, using the equation below, we generated a predicted traffic value for all of the groups in our ranking.
Y = a + b₁X₁ + b₂X₂ + b₃X₃
In this equation, Y is Monthly Organic Traffic, the dependent variable. The intercept calculated in the table above is represented by a. The coefficients b₁, b₂, and b₃ are the values produced by the LINEST function for Total Annual Expenses, Domain Rating, and Indexed Pages respectively. Each is multiplied by the corresponding X value: that group's Expenses, Domain Rating, or Indexed Pages. Because two of the predictors and the dependent variable were log-transformed, the fitted model is actually ln(Traffic) = -5.2392 + 0.3408 ln(Expenses) + 0.0593 (Domain Rating) + 0.6089 ln(Indexed Pages).
The function in Google Sheets looks like this:
=EXP('State LINEST'!$E$2+('State LINEST'!$D$2*(LN(F2)))+('State LINEST'!$C$2*J2)+('State LINEST'!$B$2*(LN(N2))))
This is full of references to other sheets within our Google Sheets file. Note that two of the predictors (Expenses in F2 and Indexed Pages in N2) are converted to natural log inside the formula, and because the dependent variable was also fit in natural log, the whole expression is wrapped in EXP to convert the prediction back into actual monthly traffic.
We then divided the observed monthly traffic by the predicted monthly traffic to generate an efficiency ratio. We found that groups range from generating almost 35x their expected value to only 0.9% of their expected value.
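In code form, the prediction-and-ratio step looks something like this. The coefficients come straight from the LINEST table above; the observed traffic and inputs are illustrative placeholders, not a real group's numbers:

```python
# A sketch of the prediction and efficiency-ratio step. Coefficients are
# from the LINEST table above; inputs are illustrative placeholders.
import numpy as np

A  = -5.239243803     # intercept
B1 =  0.3408197202    # ln(Expenses)
B2 =  0.05931765768   # Domain Rating
B3 =  0.6089382428    # ln(Indexed Pages)

def predicted_traffic(expenses: float, domain_rating: float, indexed_pages: float) -> float:
    # The model was fit on ln(traffic), so np.exp (EXP in Sheets) converts
    # the prediction back to raw monthly traffic.
    return float(np.exp(A + B1 * np.log(expenses) + B2 * domain_rating + B3 * np.log(indexed_pages)))

observed = 1_500_000  # hypothetical observed monthly organic traffic
predicted = predicted_traffic(6_310_000, 84, 1_110_000)
print(f"{observed / predicted:.2f}x expected")  # >1 means beating the model's expectation
```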
Below is a table that uses those efficiency ratios to rank 245 national think tanks:
So is this valid? Are we creating something meaningful here?
These rankings make intuitive sense given our experience working in public policy SEO. Organizations ranked highly in this table do seem to be punching well above their weight class. The low-ranking groups have traffic values that indicate either something is wrong with their website or perhaps web promotion simply isn't a priority for them, which is totally plausible and a fair approach to running a public policy group.
As always, our goal with this project isn't to shame the groups who rank lower on the list. Think tanks influence the policy process in many ways. While making research accessible via search engines can help to influence policymakers, inform journalists, and create new connections with fellow experts, those things can also be accomplished through other means. Many of the groups that rank lower in this table are highly effective in their specific domains and may simply prefer to build relationships that aren't mediated through search engines.
Even so, we think it's valuable to show that success in search is possible for public policy nonprofits no matter their level of resources. This ranking and data are meant to serve as a tool for think tanks to identify their peers and benchmark their own search performance. We think we've achieved that.
That said, this is our first run at something like this and we know we could be missing something. Again, our intuition says this ranking makes sense, but we may have overlooked an issue like collinearity or other concepts of which we have only a basic understanding.
We're looking for feedback on this and we're glad to share our entire dataset and the functions we developed to create this table with anyone interested in contributing to this project.
To help us improve this project, please email [email protected].