Just How Smart Are Search Robots? |
| Just How Smart Are Search Robots?
Posted: 29 Nov 2011 12:51 PM PST
Posted by iPullRank

Matt Cutts announced at Pubcon that Googlebot is “getting smarter.” He also announced that Googlebot can crawl AJAX to retrieve Facebook comments, coincidentally only hours after I unveiled, at SearchLove New York, Joshua Giardino's research suggesting that Googlebot is actually a headless browser based on the Chromium codebase. I'm going to challenge Matt Cutts' statements: Googlebot hasn't just recently gotten smarter. It actually hasn’t been a text-based crawler for some time now, and neither has Bingbot or Slurp for that matter. There is evidence that Search Robots are headless web browsers and that the Search Engines have had this capability since 2004.

Disclaimer: I do not work for any Search Engine. These ideas are speculative, based on patent research done by Joshua Giardino and myself, some direction from Bill Slawski, and what can be observed on Search Engine Results Pages.

A headless browser is simply a full-featured web browser with no visual interface. Similar to the TSR (Terminate and Stay Resident) programs that live in your system tray in Windows, it runs without you seeing anything on your screen, but other programs may interact with it. You can drive a headless browser from a command line or a scripting language, and therefore load a webpage and programmatically examine the same output a user would see in Firefox, Chrome or (gasp) Internet Explorer. Vanessa Fox alluded in January of 2010 that Google may be using these to crawl AJAX.

However, Search Engines would have us believe that their crawlers are still similar to Unix’s Lynx browser and can only see and understand text and its associated markup. Basically, they have trained us to believe that Googlebot, Slurp and Bingbot are a lot like Pacman: you point it in a direction and it gobbles up everything it can without being able to see where it’s going or what it’s looking at.
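Before going further with the Pacman picture, it may help to make the "Lynx-like" model concrete. The sketch below is only an illustration using Python's standard `html.parser`, not anything a Search Engine has published: the class `TextOnlyCrawler`, the helper `crawl_as_text` and the sample `PAGE` are all made up for this example. It shows that a crawler which only reads text and markup never sees a comment that a script writes into the page, which is exactly the content a headless browser would pick up after running the script.

```python
from html.parser import HTMLParser


class TextOnlyCrawler(HTMLParser):
    """Hypothetical text-only crawler: keeps visible text and link targets,
    and never executes anything found inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Script and style bodies are skipped; whitespace-only runs are dropped.
        if not self._skip_depth and data.strip():
            self.text.append(data.strip())


def crawl_as_text(html):
    """Return (visible_text_chunks, link_targets) for a page, with no rendering."""
    parser = TextOnlyCrawler()
    parser.feed(html)
    parser.close()
    return parser.text, parser.links


# Made-up page: the comment only exists after the script has run in a browser.
PAGE = """
<html><body>
<h1>Widgets</h1>
<a href="/about">About</a>
<div id="comments"></div>
<script>
document.getElementById("comments").innerHTML = "Great post!";
</script>
</body></html>
"""

text, links = crawl_as_text(PAGE)
print(text)   # ['Widgets', 'About']
print(links)  # ['/about']
```

The "Great post!" comment is missing from the output because nothing here runs the script. Seeing it would require a real browser engine to execute the JavaScript and build the page, which is the capability the rest of this post argues the crawlers already have.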
Think of the dashes that Pacman eats as webpages. Every once in a while it hits a wall and is forced in another direction. Think of SEOs as the power pills. Think of ghosts as technical SEO issues that might trip up Pacman and cause him to not complete the level that is your page. When an SEO gets involved with a site, it helps a search engine spider eat the ghosts; when one doesn’t, Pacman dies and starts another life on another site.

That’s what they have been selling us for years. The only problem is that it’s simply not true anymore, and hasn’t been for some time. To be fair, though, Google normally only lies by omission, so it’s our fault for taking so long to figure it out. I encourage you to read Josh’s paper in full, but some highlights that indicate this are:
Google has also owned a considerable number of IBM patents since June and August of 2011, and with them comes a lot of IBM's awesome research into remote systems, parallel computing and headless machines, for example the “Simultaneous network configuration of multiple headless machines” patent. That said, Google has clearly done extensive research of its own in these areas.

Not to be left out, there’s a Microsoft patent entitled “High Performance Script Behavior Detection Through Browser Shimming” that leaves little room for interpretation; in so many words it says Bingbot is a browser.

"A method for analyzing one or more scripts contained within a document to determine if the scripts perform one or more predefined functions, the method comprising the steps of: identifying, from the one or more scripts, one or more scripts relevant to the one or more predefined functions; interpreting the one or more relevant scripts; intercepting an external function call from the one or more relevant scripts while the one or more relevant scripts are being interpreted, the external function call directed to a document object model of the document; providing a generic response, independent of the document object model, to the external function call; requesting a browser to construct the document object model if the generic response did not enable further operation of the relevant scripts; and providing a specific response, obtained with reference to the constructed document object model, to the external function call if the browser was requested to construct the document object model." (emphasis mine)

Curious, indeed. Furthermore, Yahoo filed a patent on Feb 22, 2005 entitled "Techniques for crawling dynamic web content" which says "The software system architecture in which embodiments of the invention are implemented may vary. 
FIG 1 is one example of an architecture in which plug-in modules are integrated with a conventional web crawler and a browser engine which, in one implementation, functions like a conventional web browser without a user interface (also referred to as a "headless browser")." Ladies and gentlemen, I believe they call that a "smoking gun." The patent then goes on to discuss automatic and custom form filling and methods for handling JavaScript.

Search Engine crawlers are indeed like Pacman, but not the floating mouth without a face that my parents jerked across the screens of arcades and bars in the mid-80s. Googlebot and Bingbot are actually more like the ray-traced Pacman with eyes, nose and appendages that we’ve continued to ignore on console systems since the 90s. This Pacman can punch, kick, jump and navigate the web with lightning speed in 4 dimensions (the 4th is time; see the freshness update). That is to say, Search Engine crawlers can render pages as we see them in our own web browsers, and have achieved a level of programmatic understanding that allows them to emulate a user.

Have you ever read the EULA for Chrome? Yeah, me neither, but as with most Google products they ask you to opt in to a program in which your usage data is sent back to Google. I would surmise that this usage data is not just used to inform the ranking algorithm (slightly), but that it is also used to train Googlebot’s machine learning algorithms, teaching it what to enter in certain form fields. For example, Google can use user form inputs to figure out what type of data goes into which field and then programmatically fill forms with generic data of that type. If 500 users enter an age in a form field named “age”, it has a valid data set that tells it to input an age. Therefore Pacman no longer runs into doors and walls; he has keys and can scale the face of buildings. |
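The form-filling idea is speculation on my part, but the mechanics are easy to sketch. The following is a minimal illustration under loud assumptions: the value categories, the regular expressions in `classify`, the 80% agreement threshold and the `GENERIC` placeholder values are all invented for this example and are not taken from any patent or from anything Google has said.

```python
import re
from collections import Counter

# Hypothetical placeholder values a crawler might submit for each learned type.
GENERIC = {
    "age": "30",
    "email": "user@example.com",
    "zip": "10001",
    "text": "test",
}


def classify(value):
    """Guess what kind of data a single observed input is (toy rules)."""
    if re.fullmatch(r"\d{1,3}", value) and 0 < int(value) < 120:
        return "age"
    if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value):
        return "email"
    if re.fullmatch(r"\d{5}", value):
        return "zip"
    return "text"


def learn_field_types(observations, min_share=0.8):
    """observations: iterable of (field_name, value) pairs seen from users.

    A field gets a type only if at least min_share of its inputs agree.
    """
    by_field = {}
    for name, value in observations:
        by_field.setdefault(name, Counter())[classify(value)] += 1

    learned = {}
    for name, counts in by_field.items():
        kind, hits = counts.most_common(1)[0]
        if hits / sum(counts.values()) >= min_share:
            learned[name] = kind
    return learned


def fill_form(field_names, learned):
    """Pick a generic value per field; unknown fields fall back to plain text."""
    return {name: GENERIC[learned.get(name, "text")] for name in field_names}


# 500 users typing an age into a field named "age", plus a few email inputs.
observations = [("age", str(20 + i % 50)) for i in range(500)]
observations += [("contact", "a@b.co")] * 10

learned = learn_field_types(observations)
print(learned)
print(fill_form(["age", "contact", "notes"], learned))
```

The threshold is there so that a handful of odd inputs do not flip a field's type; a real system would presumably use far richer signals than a field name and three regular expressions.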
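The Microsoft claim quoted earlier is dense, so here is one way to read its control flow as a toy model. This is a sketch of my interpretation only, not the patented implementation: "scripts" are modeled as plain Python functions, and every name here (`ShimDocument`, `GenericValue`, `RealDocument`, `NeedsRealDom`, `analyze`) is hypothetical. The point it tries to capture is the order of operations: answer DOM calls generically first, and only ask a browser to build the real DOM when the generic answer is not enough for the script to keep going.

```python
class NeedsRealDom(Exception):
    """The generic (shim) answer did not let the script continue."""


class GenericValue:
    """Placeholder handed back by the shim instead of real DOM content."""

    def __eq__(self, other):
        # A script that branches on the value cannot proceed with a placeholder.
        raise NeedsRealDom("script compared a DOM value")


class ShimDocument:
    """Intercepts DOM calls and answers without constructing any DOM."""

    def get_element_text(self, element_id):
        return GenericValue()


class RealDocument:
    """Stands in for the DOM a browser would construct (the expensive path)."""

    def __init__(self, elements):
        self.elements = elements

    def get_element_text(self, element_id):
        return self.elements.get(element_id, "")


def analyze(script, build_dom):
    """Interpret the script against the shim; fall back to the browser if needed.

    Returns (script_result, which_path_was_used).
    """
    try:
        return script(ShimDocument()), "shim"
    except NeedsRealDom:
        return script(build_dom()), "browser"


def ignores_value(document):
    # Touches the DOM but never depends on what comes back.
    document.get_element_text("banner")
    return "redirect:/landing"


def branches_on_value(document):
    # Behavior depends on real page content, so the shim cannot answer it.
    text = document.get_element_text("price")
    return "redirect:/sale" if text == "0" else "no redirect"


builds = []


def build_dom():
    builds.append(1)  # count how often the expensive path is taken
    return RealDocument({"price": "0"})


print(analyze(ignores_value, build_dom))
print(analyze(branches_on_value, build_dom))
print(len(builds))
```

Read this way, the shim is a speed optimization: most scripts can be judged cheaply, and the browser is only brought in for the ones that truly depend on the page. It still implies that a browser is sitting there to be asked.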
| 2nd November Index Update: Our Broadest Index Yet, and New PA/DA Scores are Live
Posted: 29 Nov 2011 12:36 AM PST
Posted by randfish

Hey gang - it's that magical time again when Linkscape's web index has updated with brand new data (for the second time this month). Open Site Explorer, the Mozbar and the PRO Web App all have new links and scores to check out. This index also features the updated Page Authority and Domain Authority models covered by Matt last week on the blog. Here are the current index's metrics:
As you can see, we're crawling a LOT more root domains - we expect to have data for an extremely high percentage of all the domains that you might find active on the web. However, because of this broader crawl, we're not reaching as deeply into some large domains (some of that is us weeding out crap, including many more millions of binary files, error-producing webpages and other web "junk"). You can see below a chart of the root domains we've crawled in the last 6 months vs. the total URLs in each index. We work toward a few key metrics to judge our progress on the index:
We've gotten better with most of these recently - PA/DA have better correlations, more of your requests (via Open Site Explorer, the Mozbar or any third-party application) now have link data, and we're slowly improving freshness (this index was actually completed last week, but didn't launch due to the Thanksgiving holiday). However, we are not improving as much on raw index size (root domains, yes, which we've seen correlate with other metrics, but raw URL count, no). This will continue to be a focus for us in the months to come, and we're still targeting 100 billion+ URLs as a goal (though we're not willing to sacrifice quality, accuracy or freshness to get there). As always, if you've got feedback on the new scores, on the link data or anything related to the index, please do let us know. We love to hear from you! |
| You are subscribed to email updates from SEOmoz Daily SEO Blog. |


















