Central Perk

Wednesday, 24 April 2013

Damn Cool Pics

High Five FAIL Compilation
The Girls of Coachella 2013 - Part 2
Micro Tattoos
Abandoned Dog Gets a Haircut
New vs Old Media Billionaires [Infographic]
The 35 Hottest Mugshots Ever

High Five FAIL Compilation

Posted: 24 Apr 2013 04:08 PM PDT

We've all faced this humiliating situation at least once in our lives.

The Girls of Coachella 2013 - Part 2

Posted: 24 Apr 2013 04:01 PM PDT

The girls of Coachella 2013 on Instagram.

Previous part:
The Girls of Coachella 2013 - Part 1

Micro Tattoos

Posted: 24 Apr 2013 12:57 PM PDT

Collection of small and cute tattoos.

Abandoned Dog Gets a Haircut

Posted: 24 Apr 2013 11:01 AM PDT

This poor dog was found in Leeds, UK, in very bad condition. It looked much better after a shave.

New vs Old Media Billionaires [Infographic]

Posted: 24 Apr 2013 09:47 AM PDT

We all know that new media is growing at a much faster pace than old media businesses. I thought it would be interesting to see just how this is affecting the media billionaires. How much faster did these new media billionaires make their money?

Staff.com presents New vs Old Media Billionaires - Infographic

Via Staff

The 35 Hottest Mugshots Ever

Posted: 23 Apr 2013 07:35 PM PDT

Machine Learning and Link Spam: My Brush With Insanity

Posted: 23 Apr 2013 07:33 PM PDT

Posted by wrttnwrd

This post was originally in YouMoz, and was promoted to the main blog because it provides great value and interest to our community. The author's views are entirely his or her own and may not reflect the views of SEOmoz, Inc.

Know someone who thinks they’re smart? Tell them to build a machine learning tool. If they majored in, say, History in college, within 30 minutes they’ll be curled up in a ball, rocking back and forth while humming the opening bars of “Oklahoma.”

Sometimes, though, the alternative is rooting through 250,000 web pages by hand, checking them for compliance with Google’s TOS. Doing that will skip you right past the rocking-and-humming stage, and launch you straight into the writing-with-crayons-between-your-toes phase.

Those were my two choices six months ago. Several companies came to Portent asking for help with Penguin/manual penalties. They all, for one reason or another, had dirty link profiles.

Link analysis, the hard way. Back when I was a kid...

I did the first link profile review by hand, like this:

Download a list of all external linking pages from SEOmoz, MajesticSEO, and Google Webmaster Tools.
Remove obviously bad links by analyzing URLs. Face it: if a linking page is on a domain like “FreeLinksDirectory.com” or “ArticleSuccess.com,” it’s gotta go.
Analyze the domain and page trustrank and trustflow. Throw out anything with a zero, unless it’s on a list of ‘whitelisted’ domains.
Grab thumbnails of each remaining linking page, using Python, Selenium, and PhantomJS. You don’t have to do this step, but it helps if you’re going to get help from other folks.
Get some poor bugger (sorry, a faithful Portent team member) to review the thumbnails, quickly checking off whether they’re forums, blatant link spam, or something else.
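
Steps 2 and 3 above are mechanical enough to script. The following is only a minimal sketch, not the code used at Portent: the blocklist hints, the whitelist, and the 'trust' key are hypothetical stand-ins for whatever domain patterns and trust metrics you actually downloaded.

```python
from urllib.parse import urlparse

# Hypothetical inputs: these lists and the metric name are illustrative only.
BLOCKLIST_HINTS = ("freelinks", "articlesuccess", "linkdirectory")
WHITELIST = {"example.edu"}


def domain_of(url):
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def prefilter(links):
    """Split links into (keep, discard) using the URL and trust rules.

    Each link is a dict with 'url' and 'trust' keys; 'trust' stands in
    for whichever page or domain trust metric you exported.
    """
    keep, discard = [], []
    for link in links:
        domain = domain_of(link["url"])
        if any(hint in domain for hint in BLOCKLIST_HINTS):
            discard.append(link)   # obviously bad domain name
        elif link["trust"] == 0 and domain not in WHITELIST:
            discard.append(link)   # zero trust and not whitelisted
        else:
            keep.append(link)      # goes on to thumbnail review
    return keep, discard


links = [
    {"url": "http://www.FreeLinksDirectory.com/page1", "trust": 12},
    {"url": "http://example.edu/resources", "trust": 0},
    {"url": "http://someblog.example/post", "trust": 0},
    {"url": "http://news.example/story", "trust": 35},
]
keep, discard = prefilter(links)
```

Here the directory-style domain and the zero-trust blog are discarded, while the whitelisted zero-trust page and the trusted news page move on to human review.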

After all of that prep work, my final review still took 10+ hours of eye-rotting agony.

There had to be a better way. I knew just enough about machine learning to realize it had possibilities, so I dove in. After all, how hard can it be?

Machine learning: the basic concept

The concept of machine learning isn’t that hard to grasp:

Take a large dataset you need to classify. It could be book titles, people’s names, Facebook posts, or, for me, linking web pages.
Define the categories. In this case, I’m looking for ‘spam’ and ‘good.’
Get a collection of those items and classify them by hand. Or, if you’re really lucky, you find a collection that someone else classified for you. The Natural Language Toolkit, for example, has a movie reviews corpus you can use for sentiment analysis. This is your training set.
Pick the right machine learning tool (hah).
Configure it correctly (hahahahahahaha heee heeeeee sniff haa haaa… sorry, I’m ok… ha ha haaaaaaauuuugh).
Feed in your training set, with the features — the item attributes used for classification — pre-selected. The tool will find patterns, if it can (giggle).
Use the tool to compare each item in your dataset to the training set.
The tool returns a classification of each item, plus its confidence in the classification and, if it’s really cool, the features that were most critical in that classification.

If you ignore the hysterical laughter, the process seems pretty simple. Alas, the laughter is a dead giveaway: these eight steps are easy the same way “Fly to moon, land on moon, fly home” is three easy steps.

Note: At this point, you could go ahead and use a pre-built toolset like BigML, Datameer, or Google’s Prediction API. Or, you could decide to build it all by hand. Which is what I did. You know, because I have so much spare time. If you’re unsure, keep reading. If this story doesn’t make you run, screaming, to the pre-built tools, start coding. You have my blessings.

The ingredients: Python, NLTK, scikit-learn

I sketched out the process for IIS (Is It Spam, not Internet Information Server) like this:

Download a list of all external linking pages from SEOmoz, MajesticSEO, and Google Webmaster Tools.
Use a little Python script to scrape the content of those pages.
Get the SEOmoz and MajesticSEO metrics for each linking page.
Build any additional features I wanted to use. I needed to calculate the reading grade level and links per word, for example. I also needed to pull out all meaningful words, and a count of those words.
Finally, compare each result to my training set.
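
The feature-building step might look something like the sketch below. It is an assumption-laden illustration, not the original script: it uses only the standard library (the real tool leaned on NLTK), a tiny made-up stop-word list, a crude vowel-group syllable counter, and a `link_count` argument that you would have to compute from the scraped HTML yourself.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; NLTK ships a much fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on"}


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels (minimum 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def page_features(text, link_count):
    """Build a few numeric features from already-extracted page text."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    syllables = sum(count_syllables(w) for w in words)
    n_words = max(1, len(words))          # avoid dividing by zero
    n_sentences = max(1, len(sentences))
    # Flesch-Kincaid grade level formula.
    grade = 0.39 * (n_words / n_sentences) + 11.8 * (syllables / n_words) - 15.59
    # Count the "meaningful" words, i.e. everything that is not a stop word.
    meaningful = Counter(w.lower() for w in words if w.lower() not in STOP_WORDS)
    return {
        "words": len(words),
        "links_per_word": link_count / n_words,
        "fk_grade": grade,
        "top_words": meaningful.most_common(3),
    }


features = page_features("Buy cheap rings. Cheap rings for the wedding.", link_count=4)
```

For that eight-word sample with four links, `links_per_word` comes out at 0.5, and "cheap" and "rings" top the meaningful-word count.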

To do all of this, I needed a programming language, some kind of natural language processing (to figure out meaningful words, clean up HTML, etc.) and a machine learning algorithm that I could connect to the programming language.

I’m already a bit of a Python hacker (not a programmer – my code makes programmers cry), so Python was the obvious choice of programming language.

I’d dabbled a little with the Natural Language Toolkit (NLTK). It’s built for Python, and would easily filter out stop words, clean up HTML, and do all the other stuff I needed.

For my machine learning toolset, I picked a Python library called scikit-learn, mostly because there were tutorials out there that I could actually read.

I smushed it all together using some really-not-pretty Python code, and connected it to a MongoDB database for storage.

A word about the training set

The training set makes or breaks the model. A good training set means your bouncing baby machine learning program has a good teacher. A bad training set means it’s got Edna Krabappel.

And accuracy alone isn’t enough. A training set also has to cover the full range of possible classification scenarios. One ‘good’ and one ‘spam’ page aren’t enough. You need hundreds or thousands to provide a nice range of possibilities. Otherwise, the machine learning program will stagger around, unable to classify items outside the narrow training set.

Luckily, our initial hand-review reinclusion method gave us a set of carefully-selected spam and good pages. That was our initial training set. Later on, we dug deeper and grew the training set by running Is It Spam and hand-verifying good and bad page results.

That worked great on Is It Spam 2.0. It didn’t work so well on 1.0.

First attempt: fail

For my first version of the tool, I used a Bayesian Filter as my machine learning tool. I figured, hey, it works for e-mail spam, why not SEO spam?

Apparently, I was already delirious at that point. Bayesian filtering works for e-mail spam about as well as fishing with a baseball bat. It does occasionally catch spam. It also misses a lot of it, dumps legitimate e-mail into spam folders, and generally amuses serious spammers the world over.

But, in my madness, I forgot all about these little problems. Is It Spam 1.0 seemed pretty great at first. Initial tests showed 75% accuracy. That may not sound great, but with accurate confidence data, it could really streamline link profile reviews. I was the proud papa of a baby machine learning tool.

But Bayesian filters can be ‘poisoned.’ If you feed the filter a training set where 90% of the spam pages talk about weddings, it’s possible the tool will begin seeing all wedding-related content as spam. That’s exactly what happened in my case: I fed in 10,000 or so pages of spammy wedding links (we do a lot of work in the wedding industry). On the next test run, Is It Spam decided that anything matrimonial was spam. Accuracy fell to 50%.
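
To see how poisoning happens, here is a toy naive Bayes filter written with only the standard library. It is a sketch under heavy assumptions, not Is It Spam 1.0: the training sentences are invented, tokens are just whitespace-split words, and the "poison" is twenty copies of one wedding-themed spam example.

```python
import math
from collections import Counter


def train(examples):
    """examples: list of (text, label). Returns word counts and doc counts per label."""
    word_counts = {"spam": Counter(), "good": Counter()}
    doc_counts = Counter()
    for text, label in examples:
        word_counts[label].update(text.lower().split())
        doc_counts[label] += 1
    return word_counts, doc_counts


def spam_probability(text, model):
    """P(spam | text) under naive Bayes with add-one smoothing."""
    word_counts, doc_counts = model
    vocab = set(word_counts["spam"]) | set(word_counts["good"])
    log_scores = {}
    for label in ("spam", "good"):
        total = sum(word_counts[label].values())
        # Start from the class prior, then add each word's log likelihood.
        score = math.log(doc_counts[label] / sum(doc_counts.values()))
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        log_scores[label] = score
    # Convert the two log scores into a normalised probability.
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    return exp["spam"] / (exp["spam"] + exp["good"])


balanced = [
    ("cheap pills buy now", "spam"),
    ("click here free links", "spam"),
    ("our wedding photos from june", "good"),
    ("review of a local bakery", "good"),
]
# "Poison" the filter: pile on spam examples that all mention weddings.
poisoned = balanced + [("wedding dresses cheap wedding rings", "spam")] * 20

page = "our wedding rings"
before = spam_probability(page, train(balanced))
after = spam_probability(page, train(poisoned))
```

With the balanced set, the harmless page scores below 0.5. After the wedding-heavy spam is added, the same page scores above 0.5, because "wedding" (and the now lopsided class prior) has become strong evidence of spam.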

Since we tend to use the tool to evaluate sites in specific verticals, this would never work. Every test would likely poison the filter. We could build the training set to millions of pages, but my pointy little head couldn’t contemplate the infrastructure required to handle that.

The real problem with a pure Bayesian approach is that there’s really only one feature: The content of the page. It ignores things like links, page trust and authority.

Oops. Back to the drawing board. I sent my little AI in for counseling, and a new brain.

Note: I wouldn’t have figured this out without help from SEOmoz’s Dr. Pete and Matt Peters. A ‘hat tip’ doesn’t seem like enough, but for now, it’ll have to do.

Second attempt: a qualified success

My second test used logistic regression. This machine learning model uses numeric data, not text. So, I could feed it more features. After the first exercise, this actually wasn’t too horrific. A few hours of work got me a tool that evaluates:

Page TrustFlow and CitationFlow (from MajesticSEO – I’m adding SEOmoz metrics now)
Links per word
Page Flesch-Kincaid reading grade level
Page Flesch-Kincaid reading ease
Words per page
Syllables per page
Characters per page
A few other seemingly-random bits, like images per page, misspellings, and grammar errors
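
For readers who want to see the moving parts, here is a bare-bones logistic regression trained by gradient descent. Treat it as a sketch: the actual tool used scikit-learn, whereas this version sticks to the standard library, and the three features, their scaling, and every number in `rows` are made up for illustration.

```python
import math


def sigmoid(z):
    # Clamp to avoid overflow in math.exp for extreme scores.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, weights):
    """Return the modelled probability that a page is spam."""
    # zip stops at the shorter list, so the trailing bias weight is skipped here.
    z = sum(w * v for w, v in zip(weights, x)) + weights[-1]
    return sigmoid(z)


def fit(rows, labels, epochs=2000, lr=0.1):
    """Fit weights (the last entry is the bias) by batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grads = [0.0] * (n_features + 1)
        for x, y in zip(rows, labels):
            error = predict(x, weights) - y
            for j, value in enumerate(x):
                grads[j] += error * value
            grads[-1] += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grads)]
    return weights


# Made-up, pre-scaled rows: [trust_flow / 100, links_per_word, grade_level / 20]
rows = [
    [0.05, 0.40, 0.20],  # spam
    [0.02, 0.55, 0.15],  # spam
    [0.10, 0.35, 0.25],  # spam
    [0.60, 0.03, 0.50],  # good
    [0.45, 0.05, 0.55],  # good
    [0.70, 0.02, 0.45],  # good
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = good

weights = fit(rows, labels)
spam_like = predict([0.04, 0.50, 0.20], weights)
good_like = predict([0.65, 0.04, 0.50], weights)
```

The returned probability doubles as a confidence score: the low-trust, link-heavy page lands above 0.5 and the trusted, link-light page lands below it. The learned weights also hint at which features mattered most.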

This time, the tool worked a lot better. With vertical-specific training sets, it ran with 85%+ accuracy.

In case you're wondering, this is what victory looks like:

[Image: This is what victory looks like]

When I tried to use the tool for more general tests, though, my coded kid tripped over its big, adolescent feet. Some of the funnier results:

It saw itself as spam.
It thought Rand’s blog was a swirling black hole of spammy despair.

False positives remain a big problem if we try to build a training set outside a single vertical.
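
Accuracy on its own hides this problem, so it helps to track false positives separately. The helper below is a hypothetical sketch (the labels are invented, and this is not how the original tool reported results) that compares the tool's output against hand-verified labels.

```python
def evaluate(predicted, actual, positive="spam"):
    """Return accuracy and false-positive rate for two parallel label lists."""
    pairs = list(zip(predicted, actual))
    correct = sum(1 for p, a in pairs if p == a)
    # A false positive is a good page that the tool flagged as spam.
    false_pos = sum(1 for p, a in pairs if p == positive and a != positive)
    negatives = sum(1 for _, a in pairs if a != positive)
    return {
        "accuracy": correct / len(pairs),
        "false_positive_rate": false_pos / negatives if negatives else 0.0,
    }


# Hypothetical hand-verified labels versus the tool's output.
actual = ["spam", "spam", "good", "good", "good", "spam", "good", "good"]
predicted = ["spam", "spam", "spam", "good", "good", "good", "good", "spam"]
report = evaluate(predicted, actual)
```

In this made-up sample the tool gets 5 of 8 pages right (accuracy 0.625), but it flags 2 of the 5 good pages as spam (false-positive rate 0.4), which is the number that should scare you before you disavow anything.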

Disappointing. But the tool chugs along happily within verticals, so we continue using it for that. We build a custom training set for each client, train the tool on it, then run the tool against the remaining links. The result is a relatively clear report:

[Image: Excel report]

Results and next steps

With little IIS learning to walk, we’ve cut the brute-force portion of large link profile evaluations from 30 hours to 3 hours. Not. Too. Shabby.

I tried to launch a public version of Is It Spam, but folks started using it to do real link profile evaluations, without checking their results. That scared the crap out of me, so I took the tool down until we cure the false positives problem.

I think we can address the false positives issue by adding a few features to the classification set:

Bayesian filtering: Instead of depending on a Bayesian classification as 100% of the formula, we’ll use the Bayesian score as one more feature.
Grammar scoring: Anyone know a decent grammar testing algorithm in Python? If so, let me know. I’d love to add grammar quality as a feature.
Anchor text matters a lot. The next generation of the tool needs to score the relevant link based on the anchor text. Is it a name (like in a byline)? Or is it a phrase (like in a keyword-stuffed link)?
Link position may matter, too. This is another great feature that could help with spam detection. It might lead to more false positives, though. If Is It Spam sees a large number of spammy links in press release body copy, it may start rating other links located in body copy as spam, too. We’ll test to see if the other features are enough to help with this.

If I'm lucky, one or more of these changes may yield a tool that can evaluate pages across different verticals. If I'm lucky.

Insights

This is by far the most challenging development project I've ever tried. I probably wore another 10 years' enamel off my teeth in just six weeks. But it's been productive:

When you start digging into automated page analysis and machine learning, you learn a lot about how computers evaluate language. That's awfully relevant if you're a 21st Century marketer.
I uncovered an interesting pattern in Google's Penguin implementation. This is based on my fumbling about with machine learning, so take it with a grain of salt, but have a look here.
We learned that there is no such thing as a spammy page. There are only spammy links. One link from a particular page may be totally fine: For example, a brand link from a press release page. Another link from that same page may be spam: For example, a keyword-stuffed link from the same press release.
We've reduced time required for an initial link profile evaluation by a factor of ten.

It's also been a great humility-building exercise.

Meet Mr. Charbonneau, Teacher of the Year

Your Daily Snapshot for
Wednesday, April 24, 2013

Yesterday President Obama honored Jeff Charbonneau, a teacher from Washington state, as the 2013 National Teacher of the Year.

Educators like Jeff and everyone up here today, they represent the very best of America -- committed professionals who give themselves fully to the growth and development of our kids. And with them at the front of the classroom and leading our schools, I am absolutely confident that our children are going to be prepared to meet the tests of our time and the tests of the future.

President Barack Obama, with Education Secretary Arne Duncan, honors 2013 National Teacher of the Year Jeff Charbonneau, State Teachers of the Year, and Principals of the Year, in the Rose Garden of the White House, April 23, 2013. (Official White House Photo by Pete Souza)

Forget Google AdWords: Pay $3 Per Month NOT $3 Per Click

This is a SubmitStart Sponsor Update.

Dear Business Owner,

If you've tried Google AdWords and other expensive search engine marketing programs that cost more than they deliver, now might be the time to try a flat-fee system that puts your ad on 90+ search engines and web directories as well as 2 PPC networks (Advertise.com and Affinity.com) for just $3 to $4 per month.

No Bidding - No PPC - No Hassle - No SEO

We've provided budget-minded business owners with an inexpensive ad delivery system for 8 years. Some of our advertisers have been with us that entire time.

To find out more, visit our order page or watch our video introduction.

We provide a proven search engine marketing program for budget-minded online businesses. Whether you receive 10 clicks or 1,000 clicks, you still pay the same flat-fee rate of $3 to $4 per month.

As a further incentive to try our program, we'll throw in 6 free bonus ebooks (on SEO, Social Media & Traffic Building) valued at $100 with your purchase.

SEO Secrets v1.4 (49 page ebook)
Article Directory Marketing & Syndication (37 page ebook)
Using Twitter Effectively (71 page ebook)
How to Optimize for Google (10 page whitepaper)
LinkedIn Profile Optimization (40 page ebook)
Traffic Heist (56 page ebook)

Visit ExactSeek today to place your order

Seth's Blog : Your manifesto, your culture

It's so easy to string together a bunch of platitudes and call them a mission statement. But what happens if you actually have a specific mission, a culture in mind, a manifesto for your actions?

The essential choice is this: you have to describe (and live) the difficult choices. You have to figure out who you will disappoint or offend. Most of all, you have to be clear about what's important and what you won't or can't do.

Here's one that was published this week, by my friends at Acumen:

Acumen: It starts by standing with the poor, listening to voices unheard, and recognizing potential where others see despair.

It demands investing as a means, not an end, daring to go where markets have failed and aid has fallen short. It makes capital work for us, not control us.

It thrives on moral imagination: the humility to see the world as it is, and the audacity to imagine the world as it could be. It's having the ambition to learn at the edge, the wisdom to admit failure, and the courage to start again.

It requires patience and kindness, resilience and grit: a hard-edged hope. It's leadership that rejects complacency, breaks through bureaucracy, and challenges corruption. Doing what's right, not what's easy.

Acumen: it's the radical idea of creating hope in a cynical world. Changing the way the world tackles poverty and building a world based on dignity.

Starts, demands, thrives and requires. Four words that are not in the vocabulary of most organizations.

Starts, as in, "here's where we are, where few others are." Most politicians and corporate entities can't imagine standing with the poor. Apart from them, sure. But with them?

Demands? Demands mean making hard choices about who your competition will be and what standards you're willing to set and be held to.

Thrives, because your organization is only worth doing if it gets to the point where it will thrive, where you will be making a difference, not merely struggling or posturing.

And requires, because none of this comes easy.

David highlights a very different (but strikingly similar) document from HubSpot. The same dynamic is at work: no platitudes, merely a difficult-to-follow (but worth it) compass for how to move forward.

Both require the hubris of caring, of thinking big and being willing to fail if that's what it takes to attempt the right thing.

It's easy to write something like this (hey, even the TSA has one) but it's incredibly difficult to live one, because it requires difficult choices and the willingness to own the outcome of your actions. If you're going to permit loopholes, wiggle room and deniability, don't even bother.