Machine Learning and Link Spam: My Brush With Insanity
Posted: 23 Apr 2013 07:33 PM PDT
Posted by wrttnwrd

This post was originally in YouMoz, and was promoted to the main blog because it provides great value and interest to our community. The author's views are entirely his or her own and may not reflect the views of SEOmoz, Inc.
Know someone who thinks they’re smart? Tell them to build a machine learning tool. If they majored in, say, History in college, within 30 minutes they’ll be curled up in a ball, rocking back and forth while humming the opening bars of “Oklahoma.”

Sometimes, though, the alternative is rooting through 250,000 web pages by hand, checking them for compliance with Google’s TOS. Doing that will skip you right past the rocking-and-humming stage, and launch you right into the writing-with-crayons-between-your-toes phase.

Those were my two choices six months ago. Several companies came to Portent asking for help with Penguin/manual penalties. They all, for one reason or another, had dirty link profiles.

Link analysis, the hard way

Back when I was a kid... I did the first link profile review by hand, like this:
After all of that prep work, my final review still took 10+ hours of eye-rotting agony. There had to be a better way. I knew just enough about machine learning to realize it had possibilities, so I dove in. After all, how hard can it be?

Machine learning: the basic concept

The concept of machine learning isn’t that hard to grasp:
If you ignore the hysterical laughter, the process seems pretty simple. Alas, the laughter is a dead giveaway: these seven steps are easy the same way “Fly to moon, land on moon, fly home” is three easy steps.

Note: At this point, you could go ahead and use a pre-built toolset like BigML, Datameer, or Google’s Prediction API. Or, you could decide to build it all by hand. Which is what I did. You know, because I have so much spare time. If you’re unsure, keep reading. If this story doesn’t make you run, screaming, to the pre-built tools, start coding. You have my blessings.

The ingredients: Python, NLTK, scikit-learn

I sketched out the process for IIS (Is It Spam, not Internet Information Server) like this:
To do all of this, I needed a programming language, some kind of natural language processing (to figure out meaningful words, clean up HTML, etc.) and a machine learning algorithm that I could connect to the programming language.

I’m already a bit of a Python hacker (not a programmer – my code makes programmers cry), so Python was the obvious choice of programming language. I’d dabbled a little with the Natural Language Toolkit (NLTK). It’s built for Python, and would easily filter out stop words, clean up HTML, and do all the other stuff I needed. For my machine learning toolset, I picked a Python library called scikit-learn, mostly because there were tutorials out there that I could actually read. I smushed it all together using some really-not-pretty Python code, and connected it to a MongoDB database for storage.

A word about the training set

The training set makes or breaks the model. A good training set means your bouncing baby machine learning program has a good teacher. A bad training set means it’s got Edna Krabappel. And accuracy alone isn’t enough. A training set also has to cover the full range of possible classification scenarios. One ‘good’ and one ‘spam’ page aren’t enough. You need hundreds or thousands to provide a nice range of possibilities. Otherwise, the machine learning program staggers around, unable to classify items outside the narrow training set.

Luckily, our initial hand-review reinclusion method gave us a set of carefully-selected spam and good pages. That was our initial training set. Later on, we dug deeper and grew the training set by running Is It Spam and hand-verifying good and bad page results. That worked great on Is It Spam 2.0. It didn’t work so well on 1.0.

First attempt: fail

For my first version of the tool, I used a Bayesian Filter as my machine learning tool. I figured, hey, it works for e-mail spam, why not SEO spam? Apparently, I was already delirious at that point.
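For readers who want to see the moving parts, here is a minimal sketch of a Bayesian text filter in plain Python. It is not the Is It Spam code. The real tool used NLTK and scikit-learn, while this sketch sticks to the standard library so it runs anywhere. The stop-word list, the `NaiveBayesFilter` class, and the sample pages are all made up for illustration.

```python
import math
import re
from collections import Counter
from html import unescape

# Tiny illustrative stop-word list (an assumption; NLTK ships a much fuller one).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "for", "in", "our", "your", "is", "on"}


def tokenize(html):
    """Crudely strip tags, lowercase the text, and drop stop words."""
    text = unescape(re.sub(r"<[^>]+>", " ", html))
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


class NaiveBayesFilter:
    """Hypothetical multinomial naive Bayes filter with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "good": Counter()}
        self.doc_counts = Counter()

    def train(self, html, label):
        self.word_counts[label].update(tokenize(html))
        self.doc_counts[label] += 1

    def spam_probability(self, html):
        tokens = tokenize(html)
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["good"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Smoothed log prior, plus one smoothed log likelihood per token.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for word in tokens:
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            log_scores[label] = score
        # Turn the two log scores into P(spam | page) without underflow.
        top = max(log_scores.values())
        spam = math.exp(log_scores["spam"] - top)
        good = math.exp(log_scores["good"] - top)
        return spam / (spam + good)


nb = NaiveBayesFilter()
nb.train("<p>Cheap payday loans, click here</p>", "spam")
nb.train("<p>Buy cheap pills online, click now</p>", "spam")
nb.train("<p>Our wedding photographer shares planning tips</p>", "good")
nb.train("<p>A guide to planning a garden wedding</p>", "good")

page = "<h1>Our wedding story</h1>"  # a legitimate page
before = nb.spam_probability(page)

# Skew the training set: a batch of spam pages that all mention weddings.
for _ in range(20):
    nb.train("<p>Wedding dresses cheap wedding deals click here</p>", "spam")

after = nb.spam_probability(page)
print(f"P(spam) before skewed batch: {before:.2f}, after: {after:.2f}")
```

Run as written, the legitimate wedding page should score below 0.5 before the skewed batch and above 0.5 after it. The only thing the filter sees is word counts, so once one topic dominates the spam pile, the topic itself starts to look like spam.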
Bayesian filtering works for e-mail spam about as well as fishing with a baseball bat. It does occasionally catch spam. It also misses a lot of it, dumps legitimate e-mail into spam folders, and generally amuses serious spammers the world over. But, in my madness, I forgot all about these little problems.

Is It Spam 1.0 seemed pretty great at first. Initial tests showed 75% accuracy. That may not sound great, but with accurate confidence data, it could really streamline link profile reviews. I was the proud papa of a baby machine learning tool.

But Bayesian filters can be ‘poisoned.’ If you feed the filter a training set where 90% of the spam pages talk about weddings, it’s possible the tool will begin seeing all wedding-related content as spam. That’s exactly what happened in my case: I fed in 10,000 or so pages of spammy wedding links (we do a lot of work in the wedding industry). On the next test run, Is It Spam decided that anything matrimonial was spam. Accuracy fell to 50%.

Since we tend to use the tool to evaluate sites in specific verticals, this would never work. Every test would likely poison the filter. We could build the training set to millions of pages, but my pointy little head couldn’t contemplate the infrastructure required to handle that.

The real problem with a pure Bayesian approach is that there’s really only one feature: The content of the page. It ignores things like links, page trust and authority. Oops.

Back to the drawing board. I sent my little AI in for counseling, and a new brain.

Note: I wouldn’t have figured this out without help from SEOmoz’s Dr. Pete and Matt Peters. A ‘hat tip’ doesn’t seem like enough, but for now, it’ll have to do.

Second attempt: a qualified success

My second test used logistic regression. This machine learning model uses numeric data, not text. So, I could feed it more features. After the first exercise, this actually wasn’t too horrific. A few hours of work got me a tool that evaluates:
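To make the idea of feeding numeric features to logistic regression concrete, here is a minimal sketch in plain Python. It is an illustration, not the Is It Spam code. The three features (outbound link density, an authority score, and a keyword-anchor ratio, each scaled to 0..1) are hypothetical stand-ins rather than the tool's actual feature list, and the training rows and the `triage` thresholds are invented. With scikit-learn installed, its `LogisticRegression` class would replace the hand-rolled training loop.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def spam_probability(x, weights, bias):
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic(rows, labels, lr=0.5, epochs=1000):
    """Fit weights with plain batch gradient descent on log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = spam_probability(x, weights, bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def triage(p, spam_at=0.8, good_at=0.2):
    """Bucket a page by confidence; the thresholds are arbitrary assumptions."""
    if p >= spam_at:
        return "spam"
    if p <= good_at:
        return "good"
    return "review"


# Hypothetical rows: [outbound_link_density, authority_score, keyword_anchor_ratio]
rows = [
    [0.9, 0.1, 0.8], [0.8, 0.2, 0.9], [0.7, 0.1, 0.7], [0.9, 0.3, 0.6],  # spam
    [0.2, 0.8, 0.1], [0.1, 0.9, 0.2], [0.3, 0.7, 0.1], [0.2, 0.6, 0.3],  # good
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = good

weights, bias = train_logistic(rows, labels)

for name, x in [("link farm", [0.85, 0.15, 0.8]),
                ("trusted blog", [0.15, 0.85, 0.15]),
                ("borderline", [0.5, 0.5, 0.5])]:
    p = spam_probability(x, weights, bias)
    print(f"{name}: P(spam)={p:.2f} -> {triage(p)}")
```

Because the model outputs a probability rather than a flat yes/no, anything in the uncertain middle can be routed to a human reviewer instead of being trusted blindly.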
This time, the tool worked a lot better. With vertical-specific training sets, it ran with 85%+ accuracy. In case you're wondering, this is what victory looks like:
When I tried to use the tool for more general tests, though, my coded kid tripped over its big, adolescent feet. Some of the funnier results:
False positives remain a big problem if we try to build a training set outside a single vertical. Disappointing. But the tool chugs along happily within verticals, so we continue using it for that. We build a custom training set for each client, train the model on it, then run the trained model against the remaining links. The result is a relatively clear report:
Results and next steps

With little IIS learning to walk, we’ve cut the brute-force portion of large link profile evaluations from 30 hours to 3 hours. Not. Too. Shabby.

I tried to launch a public version of Is It Spam, but folks started using it to do real link profile evaluations, without checking their results. That scared the crap out of me, so I took the tool down until we cure the false positives problem.

I think we can address the false positives issue by adding a few features to the classification set:
If I'm lucky, one or more of these changes may yield a tool that can evaluate pages across different verticals. If I'm lucky.

Insights

This is by far the most challenging development project I've ever tried. I probably wore another 10 years' enamel off my teeth in just six weeks. But it's been productive:
It's also been a great humility-building exercise.