Tuesday, January 8, 2013

SEO Blog



Local SEO Marketing Strategies for 2013

Posted: 08 Jan 2013 03:52 AM PST

Local marketing is the application of the marketing mix to a particular region or domestic market. The marketing mix consists of four key elements: Product, Price, Place, and Promotion. These four P's make marketing effective at achieving broad exposure. To target the local online market, local SEO...
Read more »


Visualizing Duplicate Web Pages

Posted: 07 Jan 2013 06:12 PM PST

Posted by David Barts

We've just changed the way we detect duplicate or near-duplicate web pages in our custom crawler to better serve you. Our previous code produced good results, but it could fall apart on large crawls (ones larger than about 85,000 pages) and could take an excessively long time (sometimes on the order of weeks) to finish.
 
Now that the change is live, you’ll see some great improvements and a few changes:
 
  • Results will come in faster (up to an hour faster on small crawls and literally days faster on larger crawls)
  • More accurate duplicate removal, resulting in fewer duplicates in your crawl results

This post provides a high-level look at the motivations behind our decision to change the way our custom crawl detects duplicate and near-duplicate web pages. Enjoy!

Improving our page similarity measurement

The heuristic we currently use to measure the similarity between two pages is called fingerprints. Fingerprints relies on turning each page into a vector of 128 64-bit integers in such a way that duplicate or near-duplicate pages result in an identical, or nearly identical, vector. The difference between a pair of pages is proportional to the number of corresponding entries in the two vectors which are not the same.
 
The faster heuristic we are working on implementing is called a similarity hash, or simhash for short. A simhash is a single, 64-bit, unsigned integer, again calculated in such a way that duplicate or near-duplicate pages result in simhash values which are identical, or nearly so. The difference between pages is proportional to the number of bits that differ in the two numbers.
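To make the two measures concrete, here is a minimal Python sketch of both comparisons: counting differing entries in two fingerprint vectors, and counting differing bits between two simhash values. This is illustrative only, not our production code; the tokenization and the choice of per-token hash are assumptions.

    import hashlib

    def fingerprint_difference(vec_a, vec_b):
        # Legacy-style measure: count of corresponding 64-bit entries that differ
        # between two 128-element fingerprint vectors.
        return sum(1 for a, b in zip(vec_a, vec_b) if a != b)

    def simhash(tokens, bits=64):
        # Near-duplicate token streams produce identical or nearly identical hashes.
        votes = [0] * bits
        for token in tokens:
            # Hash each token to a 64-bit integer (md5 truncated; purely illustrative).
            h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
            for i in range(bits):
                votes[i] += 1 if (h >> i) & 1 else -1
        # Each bit of the final simhash takes the sign of that bit's vote.
        return sum(1 << i for i, v in enumerate(votes) if v > 0)

    def simhash_difference(hash_a, hash_b):
        # Hamming distance: number of bits that differ between the two values.
        return bin(hash_a ^ hash_b).count("1")

    page_a = "the quick brown fox jumps over the lazy dog".split()
    page_b = "the quick brown fox jumped over the lazy dog".split()
    print(simhash_difference(simhash(page_a), simhash(page_b)))  # small, but nonzero
    print(simhash_difference(simhash(page_a), simhash(page_a)))  # identical pages: 0

Identical pages always produce a difference of zero under either measure; the interesting cases are the near-duplicates in between.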
 

The problem: avoid false duplicates

The problem is that these two measures are very different: one is a vector of 128 values, while the other is a single value. Because of this difference, the measurements may vary in how they see page difference. With the possibility of a single crawl containing over a million pages, that's an awful lot of numbers we need to compare to determine the best possible threshold value for the new heuristic.
 
Specifically, we need to set the heuristic threshold to detect as many duplicates and near-duplicates as possible, while minimizing the number of false duplicates. It is most important to minimize the number of page pairs falsely flagged as duplicates, so that we never remove a page as a duplicate unless it actually *is* one. This means we need to be able to detect page pairs where:
  • The two pages are not actually duplicates or near-duplicates,
  • The current fingerprints heuristic correctly views them as different, but
  • The simhash heuristic incorrectly views them as similar.
We're being incredibly careful about this to avoid the most negative customer experience we can anticipate: a behind-the-scenes change to our duplicate-detection heuristic causing a sudden rash of incorrect "duplicate page" errors to appear for no apparent reason.
 

The solution: visualizing the data

Our need to make a decision where many numeric quantities are involved is a classic case where data visualization can be of help. Our SEOmoz data scientist, Matt Peters, suggested that the best way to normalize these two very different measures of page content was to focus on how they measured difference between existing pages. Taking that to heart, I decided on the following approach:
  1. Sample about 10 million pairs of pages from about 25 crawls selected at random.
  2. For each pair of pages sampled, plot their difference as measured by the legacy fingerprints heuristic on the horizontal axis (0 to 128), and their difference as measured by simhash on the vertical axis (0 to 64).
The plot resulting from this approach looks like this:
 
Visualized data
 
Immediately, a problem is obvious: this image gives no sense of the data's central tendency (or lack thereof). If more than one page pair has the same difference as measured by both legacy fingerprints and simhash, the plotting software simply places a second red dot precisely atop the first one. And so on for the third, fourth, hundredth, and possibly thousandth identical data point.
 
One way to address this problem is to color the dots differently depending on how many page pairs they represent. So what happens if we select the color using a light wavelength that corresponds to the number of times we draw a point on the same spot? This tactic gives us a plot with red (a long wavelength) indicating the most data points, down through orange, yellow, green, blue, and violet (really, magenta on this scale) representing only one or two values:
 
Linear data visualization
 
How disappointing! That's almost no change at all. However, if you look carefully, you can see a few blue dots in that sea of magenta, and most important of all, the lower-leftmost dot is red, representing the highest number of instances of all. What's happening here is that red dot represents a count so much higher than all the other counts that most of the other colors between it and the ones representing the lowest numbers end up unused.
 
The solution is to assign colors in such a way that most of the colors end up being used for coding the lower counts, and to assign progressively fewer colors as counts increase. Or, in mathematical terms, to assign colors based on a logarithmic scale rather than a linear one. If we do that, we end up with the following:
 
Logarithmic data visualization
 
Now we're getting somewhere! As expected, there is a central tendency in the data, even though it's pretty broad. One thing that's immediately evident is that, although in theory the difference measured by simhash can go to a maximum of 64, in practice it rarely gets much higher than 46 (roughly three-fourths of the maximum). In contrast, using the fingerprints difference, many pages reach the maximum possible difference of 128 (witness all the red and orange dots along the right side of the graphic). Keep in mind that those red and orange dots represent really big counts, because the color scale is logarithmic.
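For the curious, a plot along these lines can be reproduced with a two-dimensional histogram and a logarithmic color scale. Here is a minimal matplotlib sketch; the arrays of differences are random stand-ins for the roughly ten million sampled page pairs, and the axis labels and colormap are assumptions, not our actual plotting code.

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.colors import LogNorm

    # Stand-in data; the real analysis sampled ~10 million page pairs.
    rng = np.random.default_rng(0)
    fingerprint_diff = rng.integers(0, 129, size=100_000)  # 0..128
    simhash_diff = rng.integers(0, 65, size=100_000)       # 0..64

    plt.hist2d(
        fingerprint_diff,
        simhash_diff,
        bins=(129, 65),
        range=[[0, 128], [0, 64]],
        norm=LogNorm(),  # logarithmic color scale, so low counts still get distinct colors
        cmap="rainbow",
    )
    plt.xlabel("fingerprints difference (0-128)")
    plt.ylabel("simhash difference (0-64)")
    plt.colorbar(label="page-pair count (log scale)")
    plt.show()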
 
Where we have to be most careful is on the bottom edge of things. That edge represents simhash values which indicate pairs of pages that are quite similar. If two pages are not, in fact, similar, yet simhash measures them as similar where fingerprints saw a significant difference, that is precisely the sort of negative customer experience we are trying to avoid. One potential trouble spot is circled below:
 
Pesky data visualization
 
The circled dot represents a pair of pages which are actually quite different, yet which simhash thinks are quite similar. (The dot to the left and even further below turns out to not be a problem: it represents a pair of nearly duplicate pages that the old heuristic missed!)
 
The vertical position of the troublesome dot represents a simhash difference of 6 (6 corresponding bits in the two 64-bit simhash values differ). It's not the only case, either: such pairs of pages come up from time to time. They appear in 1% or less of crawls, but they do appear. If we choose a simhash difference threshold of 6 (matching the threshold we currently have defined for the legacy fingerprints), there will be false positives.
 

Picking a threshold

Thankfully, 6 seems to be a border case: at 6 bits of difference and above, the chance of a false positive increases. Below 6, I was unable to find any such pathological cases, and I examined thousands of crawls trying to find one. So I chose a difference threshold of 5 for simhash-based duplicate detection. That results in the situation represented by the final graphic:
 
Bounded Logarithmic data visualization
 
Here we have lines drawn to represent the two difference thresholds. Everything to the left of the vertical line represents what the current code would report as duplicate. Everything below the horizontal line represents what the simhash code will report. Keeping in mind the logarithmic color scale and the red dot in the lower-left corner, we see that the number of page pairs where the two heuristics agree about similarity outweighs the number of page pairs where they disagree.
 
Note that there are still things in the "false positive" (lower right) quadrant. It turns out that those pairs tend not to differ much from the pairs where the two measures agree, or, for that matter, from the false negative pairs in the upper left quadrant. In other words, with the chosen thresholds, both simhash and the legacy fingerprints miss seeing some true near-duplicates.
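One way to quantify what the quadrants show is simply to count page pairs on each side of the two thresholds. Here is a minimal sketch, assuming per-pair difference arrays like those above and the thresholds discussed in this post (6 for the legacy fingerprints heuristic, 5 for simhash); the function name, the toy data, and the inclusive treatment of the thresholds are illustrative assumptions.

    import numpy as np

    FINGERPRINTS_THRESHOLD = 6  # legacy: fingerprint difference <= 6 treated as duplicate
    SIMHASH_THRESHOLD = 5       # new: simhash Hamming distance <= 5 treated as duplicate

    def quadrant_counts(fingerprint_diff, simhash_diff):
        # Count where the two heuristics agree and disagree on duplicate status.
        fp_dup = np.asarray(fingerprint_diff) <= FINGERPRINTS_THRESHOLD
        sh_dup = np.asarray(simhash_diff) <= SIMHASH_THRESHOLD
        return {
            "both flag as duplicate (lower left)":  int(np.sum(fp_dup & sh_dup)),
            "neither flags (upper right)":          int(np.sum(~fp_dup & ~sh_dup)),
            "only simhash flags (lower right)":     int(np.sum(~fp_dup & sh_dup)),
            "only fingerprints flags (upper left)": int(np.sum(fp_dup & ~sh_dup)),
        }

    # Toy example: five page pairs as (fingerprints difference, simhash difference).
    print(quadrant_counts([2, 3, 120, 128, 60], [1, 40, 4, 50, 20]))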
 

The visible results

With this threshold decision, the number of false negatives outnumbers the number of false positives. This meets our goal of minimizing false positives, even at the cost of incurring false negatives. Note that the "false positives" in the lower-right quadrant are actually quite similar to each other, and would therefore be more accurately described as false negatives of the legacy fingerprints heuristic rather than false positives of the simhash heuristic.
 
The most visible aspects of the change to customers are two-fold:
 
1. Fewer duplicate page errors: a general decrease in the number of reported duplicate page errors. However, it bears pointing out that:
  • We may still miss some near-duplicates. Like the current heuristic, only a subset of the near-duplicate pages is reported.
  • Completely identical pages will still be reported. Two pages that are completely identical will have the same simhash value, and thus a difference of zero as measured by the simhash heuristic, so they will always be caught.
2. Speed, speed, speed: The simhash heuristic detects duplicates and near-duplicates approximately 30 times faster than the legacy fingerprints code. This means that soon, no crawl will spend more than a day working its way through post-crawl processing, which will facilitate significantly faster delivery of results for large crawls.
 
I hope this post provides some meaningful insight into our upcoming changes. I look forward to hearing your thoughts in the comments below.


New Nominations for the National Security Team

The White House: Your Daily Snapshot for Tuesday, January 8, 2013
 

New Nominations for the National Security Team

Speaking from the East Room of the White House yesterday, President Obama announced two key nominations for his national security team. He tapped John Brennan to serve as the Director of the Central Intelligence Agency, and he asked Sen. Chuck Hagel to serve as Secretary of Defense.

Learn about John Brennan, President Obama's pick for Director of the Central Intelligence Agency.

Learn about Sen. Chuck Hagel, President Obama's choice for Secretary of Defense.

President Barack Obama shakes hands with former Sen. Chuck Hagel in the East Room of the White House, Jan. 7, 2013. The President nominated Sen. Hagel for Secretary of Defense and John Brennan, Assistant to the President for Homeland Security and Counterterrorism, second from left, for Director of the CIA. Also pictured on stage are acting CIA Director Michael Morrell, left, and Secretary of Defense Leon Panetta, right. (Official White House Photo by Chuck Kennedy)


In Case You Missed It

Here are some of the top stories from the White House blog:

Hagel for Secretary of Defense
Valerie Jarrett, Senior Advisor to President Obama, discusses Chuck Hagel, the President's choice for Secretary of Defense.

A Whole-of-Government Commitment to Inclusive Entrepreneurial Growth
The Obama Administration recently released a detailed action plan to achieve the goal of increasing Federal services to entrepreneurs and small businesses, with an emphasis on startups and growing firms and underserved markets.

Weekly Address: Working Together in the New Year to Grow Our Economy and Shrink Our Deficits
In this week’s address, President Obama talks about the bipartisan agreement that Congress reached last week which prevented a middle-class tax hike.

Today's Schedule

All times are Eastern Standard Time (EST).

10:30 AM: The President receives the Presidential Daily Briefing

12:30 PM: Press Briefing by Press Secretary Jay Carney WhiteHouse.gov/live

2:00 PM: The President meets with Secretary of Defense Panetta

WhiteHouse.gov/live indicates that the event will be live-streamed on WhiteHouse.gov/live.

 

Seth's Blog : Toward resilience in communication (the end of cc)

 

Toward resilience in communication (the end of cc)

If you saw this post tweeted in your twitter stream, odds are you didn't click on it. And if you've got an aggressive spam filter, it's likely that many people who have sent you email are discovering you didn't receive it. "Did you see the tweet?" or "did you get my email?" are a tax on our attention. Resilience means standing up in all conditions, but in fact, electronic communication has gotten more fragile, not less.

We wait, hesitating, unsure who has received what and what needs to be resent. With this error rate comes an uncertainty where we used to have none (we're certain of the transmission if you're actively talking on the phone with us, and we know if you got that certified mail). It's now hard to imagine the long cc email list as an ideal choice for getting much done.

The last ten years have seen an explosion in asynchronous, broadcast messaging. Asynchronous, because unlike a phone call, the sender and the recipient aren't necessarily interacting in real time. And broadcast, because most of the messaging that's growing in volume is about one person reaching many, not about the intimacy of one to one. That makes sense, since the internet is at its best with low-resolution mass connection.

It's like throwing a thousand bottles into the ocean and waiting to see who gets your message.

Amazon, eBay, Twitter, blogs, Pinterest, Facebook--they are all tools designed to make it easier to reach more and more people with a variation of faux intimacy. And this broadcast approach means that communication breaks down all the time... we have mass, but we've lost resiliency.

Asynchronous creates two problems when it comes to resiliency. First, it's difficult to move the conversation forward because the initiator can't be sure when to report back in with an update. Second, if some of the data changes in between interactions, it's entirely likely that the conversation will go off the rails. If you send two colleagues a word processed doc and, while you're waiting for a response, the file changes, it's entirely possible that you'll get feedback on the wrong file. Source control for any conversation of more than two people becomes a huge issue.

Your boss initiates a digital thread about an upcoming meeting. While two of the people are busy working on the agenda, a third ends up cancelling the meeting, wasting tons of effort because people are out of sync.

But asynchronous communication is also a boon. It means that you don't have to drop everything to get on a call or go to a meeting. Without the ability to spread out our project communication, we'd get a lot less done.

So, here we are in the middle of the communication age, and we're actually creating a system that's less engaging, less resilient to change or dropped signals, and less likely to ensure that small teams are actually contributing efficiently.  The internet funding structure rewards systems that get big, not always systems that work very well.

A simple trade-off has to be made: You can't simultaneously have a wide, open system for communication and also have tight connections and resilience. Open and wide might work great for promoting your restaurant on Twitter, but it's no way to ensure tight collaboration among the three or four investors who need to coordinate your new menu. 

As digital teamwork gets more important, then, team leaders are going to have to figure out how to build resiliency into the way they work. That might include something as simple as affirmative checkins, or more technical solutions to be sure everyone is in sync and also being heard. Someone sitting on a conference call and doing nothing but pretending to listen benefits no one.

Friends and family at Dispatch have built one approach to this problem, a free online collaboration tool that uses the cloud to create a threaded conversation built around online files, with redundancy and a conversation audit trail as part of the process. When someone speaks up, everyone can track it. When a file changes, everyone sees it. And only the invited participate.

It won't be the last tool you'll find that will address an increasingly urgent problem for teams that want to get things done, but it's worth some effort to figure this out. Tightly-knit, coordinated teams of motivated, smart people can change the world. It's a shame to miss that opportunity because your tools are lousy.




 

Monday, January 7, 2013

Mish's Global Economic Trend Analysis



Krugman Supports the $1 Trillion Coin; Why Stop There? I Support the $1 Quadrillion Coin

Posted: 07 Jan 2013 09:15 PM PST

There is a lot of crazy talk out there regarding the minting of a $1 trillion coin to get around the debt ceiling.

Today Paul Krugman hopped on the $1 trillion bandwagon in his New York Times article Be Ready To Mint That Coin.
Should President Obama be willing to print a $1 trillion platinum coin if Republicans try to force America into default? Yes, absolutely. He will, after all, be faced with a choice between two alternatives: one that's silly but benign, the other that's equally silly but both vile and disastrous. The decision should be obvious.

For those new to this, here's the story. First of all, we have the weird and destructive institution of the debt ceiling; this lets Congress approve tax and spending bills that imply a large budget deficit — tax and spending bills the president is legally required to implement — and then lets Congress refuse to grant the president authority to borrow, preventing him from carrying out his legal duties and provoking a possibly catastrophic default.

Enter the platinum coin. There's a legal loophole allowing the Treasury to mint platinum coins in any denomination the secretary chooses. Yes, it was intended to allow commemorative collector's items — but that's not what the letter of the law says. And by minting a $1 trillion coin, then depositing it at the Fed, the Treasury could acquire enough cash to sidestep the debt ceiling — while doing no economic harm at all.

So why not?
$1 Trillion Not Enough

Krugman asks "why not?" The answer should be obvious. It's crazy to think $1 trillion would be enough. A year or two from now, the Treasury would have to mint another coin, with the same silly debate we are having right now about whether the process is legal.

Question of Legality

I do not accept the idea that the proposed process would be legal. Others side with me as well. Here are a few examples:


Nonetheless Tut! Tut! I say to Krugman detractors.

Does any president care what is legal? Roosevelt didn't. Nixon didn't. Bush didn't. Obama didn't.

We all know presidents are above the law and do what they want anyway: kidnapping, torture, wiretapping, holding people without charges in Cuba, and data gathering of all sorts with drones and other measures, without due cause and in direct violation of the Constitution. Clearly, the Constitution is meaningless already.

No one will possibly do anything if Obama breaks the law as Krugman wants. Krugman's major error is that $1 trillion is nowhere near enough.

Let me be the first to support the idea of a $1 quadrillion coin.

Krugman's second error concerns whose picture should be on the coin. I propose this picture for the front of the coin.



The back of the coin should be equally obvious.
Paul Krugman Prays for America.



Mike "Mish" Shedlock
http://globaleconomicanalysis.blogspot.com

"Wine Country" Economic Conference Hosted By Mish

Google's Self-Driving Car Takes on D.C.; Not Quite Ready For Real World ... Yet

Posted: 07 Jan 2013 02:55 PM PST

CNN Money has an update on Google's self-driving car. This time, the car attempts to handle traffic in Washington D.C. The software and radar did reasonably well, but a human driver had to take over a few times.

Please consider Toyota to reveal self-driving car research at CES
Toyota will offer a glimpse into its self-driving car research on Monday, just ahead of the Consumer Electronics Show, in Las Vegas.

The Japanese automaker recently revealed a five-second video of a Lexus-based research vehicle carrying a device similar to that used on Google's so-called self-driving car. Unlike Google's car, however, Toyota's research also involves vehicle-to-vehicle and vehicle-to-infrastructure communication, the automaker said in an announcement.

Those technologies allow cars to wirelessly communicate with one another and with things like traffic lights and stop signs. For example, a car could signal vehicles around it when it stops or turns or when it encounters a slippery road surface. Similarly, a traffic light could wirelessly signal that it is turning red so approaching cars can automatically apply their brakes.

Google's research car was based on a Toyota Prius, but Google and Toyota have not been involved in each other's research projects, according to a source at Toyota.

Automakers generally prefer to use the term "autonomous driving" rather than "self-driving" for these technologies because, even in the future, a human driver should remain at the controls of a vehicle, ready to take over as needed.
The article has an interesting video of the Google test-drive in DC that inquiring minds may wish to see.

Google is not interested in cars per se. Rather, Google is interested in making software that will go into every car.

Each year, technology gets better and better. What's now promoted as "autonomous driving" will indeed morph into "self-driving". I suspect that most people are unaware of the possibilities.

Eventually there will not be much need for skilled pilots or skilled truck drivers. At this moment I cannot put a timeline on "eventually", other than "sooner than most think".

Mike "Mish" Shedlock
http://globaleconomicanalysis.blogspot.com

Humanoid Robots Play Music

Posted: 07 Jan 2013 10:35 AM PST

Perhaps giving away my age, I have never heard of the band "Motorhead" nor have I heard of their song "Ace of Spades".

I have now. Popular Science reports Humanoid Robots Play Motorhead's Ace of Spades.
I love robots. I love Motorhead. And so it stands to reason that I would love robots playing Motorhead. But I haven't actually been able to test that theory -- until now, thanks to some roboticists in Berlin, Germany.

Having a four-armed drummer is pretty metal, and I would love to see what it can do with some Neil Peart, Mike Portnoy or Flo Mounier hijinks.
Ace of Spades Video



Hand movements of the guitar players were quite limited, but the drummer was in full swing.

How long will it be before humanoids are boxing or humanoid teams play basketball?

Mike "Mish" Shedlock
http://globaleconomicanalysis.blogspot.com

6 people you might know on Google+

Hi Mihai!
Here are some people you might know on Google+.
Google+ team
The most popular content on Google+
Update - NOT being given out free

Edit at 3pm EST: Thanks to +Jonathan Black for pointing out this link to an Adobe staff member denying that CS2 is going out free: http://indesignsecrets.com/adobe-is-not-giving-away-cs2.php

Follow the link to the Adobe forum where the statement is made.

Original post kept for posterity:
Adobe's giving away CS2. I've seen some comments saying it doesn't play well with current Mac OS - but Windows machines are fine with it apparently. Anyway, no harm in downloading ...
Grab Photoshop and CS2 For Absolutely Free, Right Now
Grab Photoshop and CS2 For Absolutely Free, Right Now by Gizmodo UK. Adobe's giving us all a late Christmas present. You can grab yourself a free, legitimate copy of Photoshop and the rest of the Crea...
+854 · 186 comments · 722 shares
Man Returns To Work After Vacation With Fresh, Reenergized Hatred For Job
EUGENE, OR—Arriving back at work after a two-week winter vacation, local marketing assistant Matthew Bueso told reporters Monday he was happy to return to the office with a fresh and rejuvenated loath...
+345 · 60 comments · 114 shares
If you're looking for an in-depth introduction to Gmail for yourself or a friend, check out this blogpost by Gmail Top Contributor Wendy Durham.

http://gmail-miscellany.blogspot.co.uk/2012/11/gmail-101.html

#GmailTip
Gmail 101
All your Gmail basics in one place! A primer for newcomers to Gmail, which explains how to find your way around Google's innovative email service and to perform the basic email tasks of reading me...
+688 · 70 comments · 235 shares