Unraveling Panda Patterns
Posted: 28 Jul 2014 05:16 PM PDT, by billslawski

This is my first official blog post at Moz.com, and I'm going to be requesting your help, expertise, and imagination. I'm going to ask you to take over as Panda for a little while, to see if you can identify the kinds of things that Google's Navneet Panda addressed when faced with what looked like an incomplete patent created to identify sites as parked domain pages, content farm pages, and link farm pages. You're probably better at this now than he was then.
You're a subject matter expert. To put things in perspective, I'm going to include some information about what appears to be the very first Panda patent, along with some of Google's efforts behind what they were calling the "high-quality site algorithm." I'll then include some of the patterns the patent describes for identifying lower-quality pages, and describe some of the features I personally would suggest for scoring and ranking a higher-quality site of one type. Google's Amit Singhal identified a number of questions about higher-quality sites that Google might use, and told us in the blog post where he listed them that it was an incomplete list, because Google didn't want to make it easy for people to abuse their algorithm. In my opinion, though, any discussion about improving the quality of webpages is worth having, because it can help improve the quality of the Web for everyone, which Google should be happy to see anyway.

Warning searchers about low-quality content

In "Processing web pages based on content quality," the original patent filing for Panda, there's a somewhat mysterious statement that makes it sound as if Google might warn searchers before sending them to a low-quality search result, and give them a choice about whether or not to click through to such a page. As it notes, the types of low-quality pages the patent was supposed to address included parked domain pages, content farm pages, and link farm pages (yes, link farm pages):
This did not sound like a good idea. Recently, Google announced in a Google Webmaster Central blog post, Promoting modern websites for modern devices in Google search results, that they would start providing warning notices on mobile versions of sites if there were issues on pages that visitors might go to. I imagine that as a site owner, you might be disappointed to see such a warning notice shown to searchers about technology used on your site possibly not working correctly on a specific device. That recent blog post mentions Flash as an example of a technology that might not work correctly on some devices; we know, for example, that Apple's mobile devices and Flash don't work well together. That's not a bad warning, in that it provides enough information to act upon and fix, to the benefit of a lot of potential visitors. :) But imagine if you tried to visit your website in 2011 and, instead of getting to the site, you received a Google warning that the page you were trying to visit was a content farm page or a link farm page, along with alternative pages to visit instead. That "your website sucks" warning still doesn't sound like a good idea. One of the inventors listed on the patent is described on LinkedIn as presently working on the Google Play store. The warning for mobile devices might have been something he brought to Google from his work on this Panda patent. We know that when the Panda update was released, it was targeting specific types of pages that people at places such as The New York Times were complaining about, such as parked domains and content farm sites. A follow-up from the Times after the algorithm update was released puts it into perspective for us.
It wasn't easy to know whether your pages had been targeted by that particular Google update, or whether your site was a false positive. Many site owners ended up posting in the Google Help forums after a Google search engineer invited them to post there if they believed they had been targeted by the update when they shouldn't have been. The wording of that invitation is interesting in light of the original name of the Panda algorithm. (Note that the thread was broken into multiple threads when Google migrated posts to new software, and many appear to have disappeared at some point.) As we were told in the invite from the Google search engineer:
The timing for such in-SERP warnings might have been troublesome. A site that mysteriously stops appearing in search results for queries it used to rank well for might be said to have run afoul of Google's guidelines. Instead, such a warning might be a little like the purposefully embarrassing scarlet "A" in Nathaniel Hawthorne's novel The Scarlet Letter.
A page that shows up in search results with a warning to searchers stating that it is a content farm, a link farm, or a parked domain probably shouldn't be ranking well to begin with. Having Google continue to display those results ranking highly, showing both a link and a warning to those pages, and then diverting searchers to alternative pages might have been more than those site owners could handle. Keep in mind that the fates of those businesses are usually tied to such detoured traffic. I can imagine lawsuits filed against Google over such tantalizing warnings, rather than site owners filling up a Google Webmaster Help forum with information about the circumstances of their sites being impacted by the update. In retrospect, it is probably a good idea that the warnings hinted at in the original Panda patent were avoided. Google seems to think that such warnings are appropriate now when it comes to devices and technologies that may not work well together, like Flash and iPhones. But there were still issues with how well or how poorly the algorithm described in the patent might work. In a March 2011 interview with Google's Head of Search Quality, Amit Singhal, and his team member and Head of Web Spam at Google, Matt Cutts, titled TED 2011: The "Panda" That Hates Farms: A Q&A With Google's Top Search Engineers, we learned that Google was using the code name "Panda" for the algorithm update, after an engineer with that name came along and provided suggestions on patterns that could be used by the patent to identify high- and low-quality pages. His input seems to have been so impactful that Google changed the name of the update from the "High Quality Site Algorithm" to the "Panda" update.

How the High-Quality Site Algorithm became Panda

Danny Sullivan named the update the "Farmer update," since it supposedly targeted content farm websites.
Soon afterwards, the joint interview with Singhal and Cutts identified the Panda codename, and that's what it's been called ever since. Google didn't completely abandon the name found in the original patent, the "high quality sites algorithm," as can be seen in the titles of these Google blog posts:
The most interesting of those is the "more guidance" post, in which Amit Singhal lists 23 questions about things Google might look for on a page to determine whether or not it is high quality. I've spent a lot of time since then looking at those questions and thinking of features on a page that might convey quality. The original patent is: Processing web pages based on content quality

Abstract
The patent expands on examples of low-quality web pages, including:
An invitation to crowdsource high-quality patterns

This is the section I mentioned above, where I am asking for your help. You don't have to publish your thoughts on how quality might be identified, but I'm going to start with some examples. Under the patent, a content quality value score is calculated for every page on a website, based upon patterns found on known low-quality pages, "such as parked web pages, content farm web pages, and/or link farm web pages." For each of the patterns identified on a page, the content quality value of the page might be reduced based upon the presence of that particular pattern, and each pattern might be weighted differently. Some simple patterns that might be applied to a low-quality web page might be one or more references to:
One of these references may be in the form of an IP address that the destination hostname resolves to, a Domain Name Server ("DNS server") that the destination domain name points to, an "a href" attribute on the destination page, and/or an "img src" attribute on the destination page. That's a pretty simple pattern, but a web page resolving to an IP address known to exclusively serve parked web pages provided by a particular Internet domain registrar can be deemed a parked web page, so it can be pretty effective. A web page with a DNS server known to be associated with web pages that contain little or no content other than advertisements may very well provide little or no content other than advertising, so that one can be effective, too. Some of the patterns listed in the patent don't seem quite as useful or informative. For example, one states that a web page containing a common typographical error of a bona fide domain name is likely a low-quality web page, or a non-existent web page. I've seen more than a couple of legitimate sites with common misspellings of good domains, so I'm not sure how helpful that pattern is. Of course, some textual content is a dead giveaway, the patent tells us, with terms such as "domain is for sale," "buy this domain," and/or "this page is parked." Likewise, a web page with little or no content is probably (but not always) a low-quality web page. This is a simple but effective pattern, even if not too imaginative:

... page providing 99% hyperlinks and 1% plain text is more likely to be a low-quality web page than a web page providing 50% hyperlinks and 50% plain text.

Another pattern is one that I often check on and address in site audits, and it involves how functional and responsive the pages on a site are.
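The scoring model the patent describes, a per-page content quality value reduced by a weighted penalty for each pattern found, can be sketched roughly as follows. The specific weights, thresholds, and function names here are my own illustrative assumptions; the patent does not publish real values:

```python
import re

# Giveaway parked-domain phrases named in the patent.
PARKED_PHRASES = ("domain is for sale", "buy this domain", "this page is parked")

def content_quality_value(page_html: str, resolved_ip: str,
                          known_parked_ips: set) -> float:
    """Start from a neutral score and subtract a weighted penalty for
    each low-quality pattern detected on the page. Weights are invented
    for illustration."""
    score = 1.0
    lowered = page_html.lower()

    # Pattern: the page resolves to an IP known to serve only parked pages.
    if resolved_ip in known_parked_ips:
        score -= 0.9

    # Pattern: giveaway parked-domain text on the page.
    if any(phrase in lowered for phrase in PARKED_PHRASES):
        score -= 0.7

    # Pattern: hyperlink-heavy pages. A page that is 99% links is a stronger
    # low-quality signal than one that is 50% links, so scale the penalty.
    links = re.findall(r"<a\s[^>]*href=", lowered)
    words = re.findall(r"\b\w+\b", re.sub(r"<[^>]+>", " ", lowered))
    total = len(links) + len(words)
    if total:
        link_ratio = len(links) / total
        if link_ratio > 0.5:
            score -= 0.5 * link_ratio

    return max(score, 0.0)
```

A real system would presumably combine many more patterns and learn the weights rather than hard-coding them, but the shape of the computation follows the patent's description.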
As for user data, sometimes it might play a role as well, as the patent tells us:
My example of some patterns for an e-commerce website

There are a lot of things that you might want to include on an e-commerce site to help indicate that it's high quality. If you look at the questions that Amit Singhal raised in the last Google blog post I mentioned above, one of them was, "Would you be comfortable giving your credit card information to this site?" Patterns that might fit with this question could include:
As I mentioned above, the patent tells us that the amount a pattern affects a page's content quality score might differ from one pattern to another. The questions from Amit Singhal imply a lot of other patterns, but as SEOs who work on, build, and improve a lot of websites, this is an area where we probably have more expertise than Google's search engineers.
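As one illustration of the kind of trust patterns Singhal's credit-card question suggests, here is a small sketch. The specific signals checked (HTTPS, a privacy policy link, contact information, a return policy) are my own assumptions offered in the spirit of the crowdsourcing invitation, not patterns taken from the patent:

```python
from urllib.parse import urlparse

# Hypothetical trust signals for "Would you be comfortable giving your
# credit card information to this site?" -- illustrative assumptions only.
def ecommerce_trust_signals(url: str, html: str) -> dict:
    lowered = html.lower()
    return {
        "serves_over_https": urlparse(url).scheme == "https",
        "links_privacy_policy": "privacy policy" in lowered,
        "shows_contact_info": "contact us" in lowered or "customer service" in lowered,
        "mentions_return_policy": "return policy" in lowered or "refund" in lowered,
    }

def trust_score(signals: dict) -> float:
    # Equal weights here for simplicity; the patent suggests real systems
    # would weight each pattern differently.
    return sum(signals.values()) / len(signals)
```

Checks like these are cheap to run in a site audit, and the per-signal breakdown is often more actionable than a single combined score.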
What other questions would you ask if you were tasked with looking at this original Panda patent? What patterns would you suggest looking for when trying to identify high- or low-quality pages? Perhaps if we share with one another the patterns or features on a site that Google might look for algorithmically, we can build pages that won't be interpreted by Google as low quality. I provided a few patterns for an e-commerce site above. What patterns would you suggest?

(Illustrations: Devin Holmes @DevinGoFish)