How the Google Panda Algorithm Works

In 2005 Google published its “Web Authoring Statistics” report, which provided a unique insight into how a large search engine views the Web at the very basic HTML level.
In August 2009 Matt Cutts invited Webmasters to help test a new indexing technology that Google had dubbed Caffeine. The SEO community immediately fell to rampant speculation about how Caffeine would affect rankings (in fact, the only effect was unintentional).
By February 2010 even I had fallen prey to Caffeine Speculationitis. On February 25, 2010 Matt McGee confirmed that Google had not yet implemented the Caffeine technology on more than 1 data center (at this time, in April 2013, there are only 13 Google Data Centers around the world).
On June 8, 2010 Google announced the completion of rolling out its Caffeine indexing technology. Caffeine gave Google the ability to index more of the Web at a faster rate than ever before. This larger, faster indexing technology invariably changed search results because all the newly discovered content was changing the search engine’s frame of reference for millions of queries.
On November 11, 2010 Matt Cutts said that Google might use as many as 50 variations for some of its 200+ ranking signals, a point that Danny Sullivan used to extrapolate a potential 10,000 “signals” Google might use in its algorithm.
On February 24, 2011 Google announced the release of its first Panda algorithm iteration into the index.
On March 2, 2011 Google asked Webmasters to share URLs of sites they believed should not have been downgraded by Panda. The discussion went on for many months and the thread is more than 1000 posts long. Google engineers occasionally confirmed throughout 2011 that they were still watching the discussion and collecting more information.
The next day Wired published an interview with Amit Singhal and Matt Cutts (see below).
On May 6, 2011 Amit Singhal published 23 questions that drew much criticism from frustrated Web marketers. The angry mobs did not understand the context in which the questions should be used.
On June 21, 2011 Danny Sullivan suggested that Panda may be a ranking factor rather than just a filter (a view that I and others had also come to hold by that time, but Danny was the first to suggest this publicly).
In mid-March 2013 Google announced that the Panda algorithm had been “incorporated into our indexing process”, meaning it was now essentially running on autopilot. Between February 24, 2011 and March 15, 2013 there were more than 20 confirmed and suspected “iterations” of the Panda algorithm that changed the search results for millions of queries.

What Google Has Told Us About the Panda Algorithm

On March 3, 2011 Wired published an interview with Amit Singhal and Matt Cutts where they explained what Panda was and where it came from.

Singhal: So we did Caffeine [a major update that improved Google’s indexing process] in late 2009. Our index grew so quickly, and we were just crawling at a much faster speed. When that happened, we basically got a lot of good fresh content, and some not so good. The problem had shifted from random gibberish, which the spam team had nicely taken care of, into somewhat more like written prose. But the content was shallow.

Matt Cutts: It was like, “What’s the bare minimum that I can do that’s not spam?” It sort of fell between our respective groups. And then we decided, okay, we’ve got to come together and figure out how to address this.
The process that Google developed to respond to this “shallow content” it had suddenly become aware of was not simple. They selected a core group of Websites, handed those sites to “quality raters”, who then reviewed the Websites. The reviews consisted of or included a survey where the quality raters answered intuitive questions:

Wired.com: How do you recognize a shallow-content site? Do you have to wind up defining low quality content?

Singhal: That’s a very, very hard problem that we haven’t solved, and it’s an ongoing evolution how to solve that problem. We wanted to keep it strictly scientific, so we used our standard evaluation system that we’ve developed, where we basically sent out documents to outside testers. Then we asked the raters questions like: “Would you be comfortable giving this site your credit card? Would you be comfortable giving medicine prescribed by this site to your kids?”
Cutts: There was an engineer who came up with a rigorous set of questions, everything from, “Do you consider this site to be authoritative? Would it be okay if this was in a magazine? Does this site have excessive ads?” Questions along those lines.
Singhal: And based on that, we basically formed some definition of what could be considered low quality. In addition, we launched the Chrome Site Blocker [allowing users to specify sites they wanted blocked from their search results] earlier, and we didn’t use that data in this change. However, we compared and it was 84 percent overlap [between sites blocked by the Chrome blocker and downgraded by the update]. So that said that we were in the right direction.
Wired.com: But how do you implement that algorithmically?
Cutts: I think you look for signals that recreate that same intuition, that same experience that you have as an engineer and that users have. Whenever we look at the most blocked sites, it did match our intuition and experience, but the key is, you also have your experience of the sorts of sites that are going to be adding value for users versus not adding value for users. And we actually came up with a classifier to say, okay, IRS or Wikipedia or New York Times is over on this side, and the low-quality sites are over on this side. And you can really see mathematical reasons …
Singhal: You can imagine in a hyperspace a bunch of points, some points are red, some points are green, and in others there’s some mixture. Your job is to find a plane which says that most things on this side of the plane are red, and most of the things on that side of the plane are the opposite of red.
Since the search engineers could not compute a signal for “would you trust this site with your credit card” they had to look for other statistical measurements that would correspond highly with the answers provided in the Quality Raters Survey.
Sample chart demonstrating Hyperplane Separation from a paper co-authored by Navneet Panda.
Amit Singhal’s 23 questions (see link above) are almost certainly taken directly from the Quality Raters’ Survey. I believe they mentioned somewhere that the actual survey had about 100 questions. The answers to these questions do not provide Google with data that can be integrated into any ranking factors. I believe that they did plot the answers on a chart that helped them divide a sample of sites from across the Web into “high quality” and “low quality” sites. They probably used a technique similar to Hyperplane Separation, which is one of the areas that Google engineer Navneet Panda has studied.
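To make the hyperplane idea a bit more concrete, here is a minimal sketch (in Python, using scikit-learn) of what learning a separating plane from rater-labeled examples looks like. The signal names and numbers are entirely hypothetical; Google has never disclosed which signals Panda actually uses or how they are weighted.

```python
# Hypothetical sketch: learning a separating hyperplane from quality-rater labels.
# The signal names and values are invented for illustration only.
import numpy as np
from sklearn.svm import LinearSVC

# Each row describes one site from a (hypothetical) learning set, using
# per-site statistical signals an engineer could actually compute.
signals = np.array([
    # ad_density, thin_page_ratio, avg_kwords_per_page, duplicate_ratio
    [0.05, 0.10, 1.20, 0.02],   # e.g. a reference site
    [0.08, 0.15, 0.90, 0.05],
    [0.45, 0.70, 0.18, 0.40],   # e.g. a thin-content site
    [0.50, 0.80, 0.15, 0.55],
])

# Labels come from the raters' survey answers:
# +1 = "high quality", -1 = "low quality".
labels = np.array([1, 1, -1, -1])

# A linear SVM finds the plane that best separates the two groups, which is
# one simple form of the "hyperplane separation" Singhal describes.
classifier = LinearSVC()
classifier.fit(signals, labels)

print("learned weights:", classifier.coef_)   # one weight per signal
print("new site falls on the side labeled:",
      classifier.predict([[0.30, 0.50, 0.30, 0.25]]))
```

Once such a plane exists, any site whose signals can be measured falls on one side of it or the other, which is exactly the kind of output a search engine can act on at scale.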

What We Know About the Panda Algorithm Independently of Google’s Remarks

The Panda algorithm is a heuristic algorithm. That is, it scans a large data set and looks for specific types of solutions to questions or problems (such as, “What mix of statistical signals would divide data into ALPHA and BETA groups?”). What I believe may be revolutionary about the Panda algorithm is that it seeks to eliminate or bypass unnecessary comparisons and computations, thus reducing the overall number of calculations required to find the best match for a specific desired solution.
What Google needed to do was develop a set of ranking signals and/or weights that would help them separate Websites into “High Quality” and “Low Quality” sites. The Quality Rater Survey was apparently used to divide a pool of secretly selected Websites into such a segregated plane. The Google engineers then turned Panda loose on their immense volumes of data about Websites with the goal of finding the best grouping of signals and weighted values for those signals that would produce the closest match to the quality raters’ collective choices.
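As a purely illustrative sketch of that kind of search, the following Python shows one simple way to hunt for the signal mix that best reproduces the raters’ labels: greedy forward selection, which avoids testing every possible combination of signals. None of this is Google’s actual procedure; the function, the candidate signals, and the evaluation method are my own assumptions for the example.

```python
# Hypothetical sketch of a "find the best mix of signals" search.
# X: one row per learning-set site, one column per candidate signal.
# y: the quality raters' labels for those sites.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def greedy_signal_search(X, y, signal_names):
    """Greedily add one signal at a time, keeping it only if it improves the
    match to the raters' labels (measured by cross-validated accuracy)."""
    chosen = []                      # indexes of signals kept so far
    remaining = list(range(X.shape[1]))
    best_score = 0.0
    improved = True
    while improved and remaining:
        improved = False
        best_candidate = None
        for i in remaining:
            columns = chosen + [i]
            score = cross_val_score(LogisticRegression(), X[:, columns], y, cv=3).mean()
            if score > best_score:   # skip candidates that cannot beat the current best
                best_score = score
                best_candidate = i
                improved = True
        if improved:
            chosen.append(best_candidate)
            remaining.remove(best_candidate)
    return [signal_names[i] for i in chosen], best_score

# Example with invented data: 6 sites, 3 candidate signals, rater labels 1/0.
X = np.array([[0.40, 0.7, 0.1], [0.50, 0.8, 0.2], [0.45, 0.9, 0.3],
              [0.05, 0.1, 0.2], [0.08, 0.2, 0.1], [0.06, 0.1, 0.3]])
y = np.array([0, 0, 0, 1, 1, 1])
print(greedy_signal_search(X, y, ["ad_density", "thin_page_ratio", "broken_link_ratio"]))
```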
Through the many public iterations Google appears to have been changing (probably mostly enlarging) the pool of Websites (the learning set) used to determine what mix of signals and weights should produce a Web page’s or site’s Panda score. This score (if it exists) is probably added to the page’s or site’s PageRank. Matt Cutts described the algorithm as a “document classifier”, which in established usage means a program that scans individual Web documents and assesses them.
Hence, your “Panda score” is assigned to individual pages; if enough pages on your Website are negatively affected, they may cumulatively “drag down” the rest of your site, a possible scenario that Googlers have acknowledged.
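To illustrate that speculation, here is a hypothetical sketch of how per-page scores could roll up into a site-level effect that drags down otherwise acceptable pages. The cutoff, the weighting, and the idea of simply combining the result with PageRank are all assumptions of mine, not anything Google has confirmed.

```python
# Hypothetical sketch only: per-page scores rolled up into a site-level factor.
# Google has never published how (or whether) a Panda score combines with PageRank.

def site_quality_adjustment(page_scores, low_quality_cutoff=-0.5):
    """Return a site-wide multiplier based on how many pages score poorly.

    page_scores: per-page outputs of a (hypothetical) quality classifier,
    where negative values mean "low quality".
    """
    if not page_scores:
        return 1.0
    low_share = sum(1 for s in page_scores if s < low_quality_cutoff) / len(page_scores)
    # If enough pages score poorly, the whole site is discounted,
    # which is one way the "drag down" effect could work.
    return 1.0 - 0.5 * low_share

def adjusted_page_value(pagerank, page_score, site_adjustment):
    # One speculative combination: scale PageRank by site quality,
    # then nudge it by the page's own score.
    return pagerank * site_adjustment + 0.1 * page_score

scores = [0.8, 0.6, -0.9, -0.7, -0.6]        # classifier outputs for five pages
site_adj = site_quality_adjustment(scores)    # 0.7 here: 3 of 5 pages score poorly
print(adjusted_page_value(pagerank=2.0, page_score=0.8, site_adjustment=site_adj))
```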
Changing the learning set should mean that the mix of best-matched signals and weights will also change, even if only subtly.

What I Believe This Means About the Panda Algorithm

How does Google know if a Website in the learning set should be rated as “high quality” or “low quality”? I believe they have conducted several, perhaps many, new Quality Rater Surveys as they have expanded their learning set. Each time sites are added to the learning set the quality raters provide feedback on the sites and the engineers use that feedback to determine whether the sites are “high quality” or “low quality”.
In this way Google always has a fairly current blueprint of what the Web looks like. This blueprint is used to help the Panda algorithm find the best match of Website signals and how to weight those signals to produce a set of scores (to be assigned to individual pages) that divide the Web into “high quality” and “low quality”.
I suspect that, now that the Panda algorithm is more-or-less automated, there must be thresholds that protect an indeterminate “middle layer” of Websites whose pages cannot really be deemed “high quality” or “low quality”. Perhaps this content is not assigned a Panda score at all, or perhaps the score simply doesn’t affect a document’s valuation in the Google index one way or the other.
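If such a protected middle layer exists, the mechanism could be as simple as a neutral band around the classifier’s decision boundary. The cutoffs below are invented purely to illustrate the idea.

```python
# Hypothetical sketch: scores near the hyperplane have no effect either way.
HIGH_QUALITY_THRESHOLD = 0.3    # invented cutoff
LOW_QUALITY_THRESHOLD = -0.3    # invented cutoff

def panda_effect(score):
    """Map a raw classifier score to a ranking effect, with a neutral band."""
    if score >= HIGH_QUALITY_THRESHOLD:
        return "boost"        # clearly on the "high quality" side of the plane
    if score <= LOW_QUALITY_THRESHOLD:
        return "downgrade"    # clearly on the "low quality" side
    return "no effect"        # the indeterminate middle layer

for s in (0.9, 0.1, -0.05, -0.6):
    print(s, "->", panda_effect(s))
```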

How Important is Panda to Webmasters in 2013?

Here in 2013 the Panda algorithm is still upsetting many Webmasters. It is cited more often than any other Google algorithmic change except Penguin across the broad spectrum of SEO discussions that I follow. I continue to receive consulting inquiries from people whose sites cannot seem to recover from Panda.
In late March Eric Enge shared his latest thoughts about Panda on Google+. Way down in the deep comments I finally decided to step out of obscurity and take exception to part of Eric’s logic (which has been reasoned/argued/supported by many people in the industry). The discussion at first focused on bounce rates, but I eventually realized that we were really NOT talking about bounce rates (and certainly not bounce rates that you can track and measure in your analytics).
In my final comment on that discussion I began with the following:

You can make a teacup or you can assemble a collection of teacups. You can also pick one teacup, just one, that someone else has made. So Google is telling people about teacups rather than making them. From their perspective it’s better to create a great collection of teacups than to evaluate every teacup in such meticulous detail that they pick only one. Hence, they need to focus on what makes the best collection of teacups, not the best teacup. It’s a basic principle of economics (or maybe biology is a better comparative source) that a system gravitates toward an equilibrium point that produces the best possible result for the least amount of energy. That “best result” is always a compromise, never a perfect option.
Google’s job is NOT to single out the best websites but rather to find enough acceptable content to show in its SERPs that its users are satisfied. When you know nothing about gold-plated hat trees how do you tell people which gold-plated hat trees are the best? You can’t. You can only help them look at the best presentations from gold-plated hat tree vendors and hope there is real substance behind the presentations.
NOTE: After thinking about this some more, Eric published a nice summation a few days later with which I can agree. What I was referring to in my comment to Eric was what I have often called The Wikipedia Principle, which states that “a search engine intentionally promotes low quality content that is minimally acceptable to searchers because it costs less to do that than to promote better content.”
Search engineers may not agree with my wording but the principle is fundamentally sound. A search engine does not, cannot, and will not attempt to improve upon a searcher’s satisfaction with results. If the results satisfy the user the search engine’s work is done, even if there may be better information available out there that could benefit the searcher more.
Competitive interests motivate search engines to exceed the Wikipedia Principle’s Threshold, to be sure. After all, if someone creates a better search engine than Google then Google must either improve its results or risk losing users to the better search engine. Nonetheless, all that economic competition between search engines means is that the Satisfaction Threshold is elevated, not eliminated. The technology cannot do away with its own inherent equilibria.

So How Do You Recover From a Panda Downgrade?

The short answer is simple: you redesign your site to present information (and create a user experience) that is approximately comparable in quality of presentation to that provided by sites that benefit from the Panda algorithm.
In other words, you have to stop putting your own interests ahead of the interests of the users and create real presentational value for those users. The increasing emphasis on conversions in the Web marketeering industries has all but ensured that Google’s Panda algorithm will have plenty of pages to downgrade for years to come.
The Panda algorithm is rewarding Websites that organize and present information that is useful, unique, and relevant to the user; the algorithm is downgrading Websites that are just publishing content so that someone can earn some money. Was this Google’s intention with Panda? I doubt it. They continue to help many Websites generate billions of dollars in revenue. Panda is not really about the money for Google — not directly. Panda is simply a response to competitive pressures to continually improve the quality of search results.
If it were not for Bing and other search engines, we might never have seen a Panda algorithm. Or maybe it would have behaved differently.

Can We Get Down to Panda-specific Details?

I mentioned to Eric that I am no longer constrained by non-disclosure agreements to keep my Panda research to myself. I don’t have the data I originally collected because that was proprietary, but I know what I learned. And I can now say that I participated in a scientifically rigorous correlation study that evaluated several proposed contributing factors for Panda downgrades. Only 1 of those proposed factors produced a statistically inarguable correlation.
I submitted a proposal to SMX Advanced 2013 to share my research but it looks like that will not happen. I’m not going to put it on SEO Theory for several reasons I don’t want to go into. Understand that since I don’t have access to the original data I would have to reconstruct my research (and perhaps that is sufficient reason NOT to include such a presentation in SMX Advanced).
As for the 1 “statistically inarguable correlation” it is only applicable to Websites that fall into a certain category. By “category” I mean sites that share a certain design and presentation style. This has nothing to do with “content” and it is not a bounce rate.
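For anyone who wants to attempt a similar analysis, the general shape of such a correlation test is easy to sketch: a binary downgrade label compared against a candidate site-level factor. The data and the factor below are invented for illustration; this is not the factor my study identified.

```python
# Hypothetical sketch of a factor-vs-downgrade correlation test.
# The factor values and downgrade labels below are invented for illustration.
from scipy.stats import pointbiserialr

# 1 = site was downgraded after a Panda iteration, 0 = it was not.
downgraded = [1, 1, 1, 0, 0, 0, 1, 0]

# One proposed contributing factor measured for each site
# (for example, the share of pages built from the same template block).
candidate_factor = [0.82, 0.75, 0.90, 0.20, 0.35, 0.15, 0.70, 0.25]

r, p_value = pointbiserialr(downgraded, candidate_factor)
print(f"correlation r={r:.2f}, p={p_value:.4f}")
# A strong r with a small p-value is necessary but not sufficient:
# correlation alone never proves a factor causes the downgrade.
```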
Are there other causes or explanations for Panda downgrades? I am convinced there must be. And yet, to date, I haven’t seen anyone publish any credible studies analyzing Panda factors (and just to be clear, YOU have not seen ME publish anything like that, either).
I have discussed some of my Panda findings on the SEO Theory Premium Newsletter. Much as I would like for you all to subscribe to the newsletter, I would rather you not do so for that reason alone; if you do subscribe, you still have to pay for specific back issues. You cannot just sign up for 1 month, raid the archives, and then leave.
Perhaps somewhere in the future I’ll have the opportunity to make a public presentation. I cannot solve the entire Panda puzzle for you but I have certainly helped to bring a lot of sites back from Panda downgrades. There is no formulaic solution, except in that many Websites have made the same mistakes over and over again.
Simplicity is the best cure for a Panda Downgrade. Barring that, putting the user experience ahead of your financial goals is the optimal path to survival in an age of Pandas and Penguins.
