vineri, 9 decembrie 2011

I'm Being Outranked by a Spammer - Whiteboard Friday

I'm Being Outranked by a Spammer - Whiteboard Friday


I'm Being Outranked by a Spammer - Whiteboard Friday

Posted: 08 Dec 2011 01:14 PM PST

Posted by Kenny Martin

What do you do when you are being outranked by a spammer? It's one of the most frustrating things that an SEO can face, but before jumping to conclusions it's important to understand what exactly is happening.

In this week's Whiteboard Friday, Rand gives us helpful advice with many important steps to follow if you discover that a spammer is outranking you. Let us know your thoughts on this challenging problem in the comments below!



Video Transcription

Howdy SEOmoz fans! Welcome to another edition of Whiteboard Friday. This week we're talking about a very tough problem, being outranked by spammers.

What I mean very specifically here is link spammers, because it is rare in the SEO world that today you are seeing other sorts of spam. Cloaking, manipulative redirection, doorway pages, they happen a little bit, but they are much less common. The most common forms of spam and the thing that I see people complain about all the time, the thing I get emails about, I get tweets about, we get Q&A about is, "Hey, Rand, I am being outranked by these spammers. Can you send this over to the Google webspam team? Can you tell the Bing webspam team? Who should I email over there? I filed my webspam reports. Do you think I should try and get it published on YOUmoz? Should I try and write to The New York Times and have them write about it because it seems like Google kicks people out when they're written about in The New York Times, at least for a little while?"

These are not always great tactics unfortunately, but I do want to walk you through some things that you should be doing when you think you are being outranked by spammers.

The first part is make sure, make 100% sure, that what you are looking at is really a ranking that's been earned through link spam. What I am talking about here is I will take you back. I will tell you a story of several years ago. This was probably, I am going to say 2007, and I was in the audience, I can't remember if I was on the panel or in the audience, and there was Google's head of webspam, has been for the last decade or so, Matt Cutts, on the panel. Matt was looking at some links using his special Google software where he is investigating a link graph right on his laptop, and someone from the audience had said, "Hey, Matt, I am getting outranked by this particular spammer." He looked and said, "No, you know, we see a few thousands links to that site, but we're actually only counting a few hundred of them, and they're the ones that are making it rank there."

So, think about that. We're talking about thousands of links pointing to a site. Think of all the links that might be pointing to a site here. Here are five different links that are pointing to this particular page. What the webspam team at Google is essentially saying is, "Hey, you know what? We know that this and this and this and this are spam. The reason that this page is ranking is because they do have some good links that we are counting." Remember it is often the case that Google's webspam team and their algorithm will not make these links cause a penalty against you, because then you could just point them at somebody else's site or page and make them drop in the rankings. Instead, what they'll do is drop the value of those, so that essentially it is like having a no-followed link for those pages. Yes, it's a followed link, but they are going to essentially say, "Oh, you know what? Our algorithm has detected that those are manipulative links. We are going to remove the value that they pass."

A lot of the times when you look through a list of hundreds or thousands of links and you see a lot of spam and you think to yourself, "Hey, that's why that guy is ranking. It's because he is spamming." It might not be the case. It could even be the case that person didn't actually build those spammy links. They just came through, you know, crap, junk on the Web. Not all the time, and usually you can tell the difference, but this is really something to keep in mind as you're analyzing that stuff. When you are, think to yourself, "Hey, how did they get the best-looking links that they've got, and could those be the ones that are responsible for making them rank so well?" Because if that's the case, you need to revisit your thesis around I'm being outranked by a spammer and think I am being outranked by a guy who's done some good link building who also happens to have lots of spammy links pointing at him. That's a completely different problem, and you need to solve for that.

If you are sure, so let's say you've gone through step one, confirmed that, you know what, this is a crap link too that Google shouldn't be passing value, but somehow they are. I want you to ask two more critical questions.

The first one: Is focusing on someone else's spam that's outranking you the best possible use of your web marketing time? You've got a lot on your plate. You don't just have to worry about SEO, right? These days you're worrying about SEO; you're worrying about keyword research; you're worrying about link building; you're worrying about content marketing; you're worrying about blogs and blog commenting and RSS and the traffic rating through there. You're worried about social media - Facebook, Twitter, Google+, and LinkedIn - and the longtail of all these other sites - Quora and Pinterest, Reddit, StumbleUpon. You're worried about web analytics and analyzing your success and making sure things are going through. You have to worry about crawlers and XML site maps and robots.txt. Make sure that thinking about and spending time on trying to flag somebody else's spam or trying to get them penalized is absolutely the best possible thing that you can be doing with your day. If it's not, reprioritize and put something else up there.

The second question is, which we don't discuss at SEOmoz here because we really kind of hate this practice, but are you willing to and does your site have risk tolerance to go acquire spammy links? If you see someone's outranking you and you're like, well, I could get those kinds of links too or those exact links too, do you have the risk tolerance for it? If you believe that it is an ethical issue, do you have the moral flexibility for it? If you don't believe it is a moral question, do you have the budget for it? Is it the best use of your budget? Is it the best use of your time? I almost always believe the answer is no, with the possible exception of some super spammy fields, PPC (porn, pills, and casino), which I have never personally operated, and so I don't pretend to understand that world. But virtually every other form of legit business on the Web, I can't get behind this. But maybe you can. Maybe you want to. Decide if that's the route you want to take.

Once you answer those two, you can move on to step three, which is, should you report the spam? The problem here is, you are going to go and send it through, let's say, your Webmaster Tools report, and there are thousands, probably hundreds of thousands of these filed every day. Probably put it somewhere between tens and hundreds of thousands of these filed every day, spam reports. Google says Webmaster Tools are the best place to file, when you are logged into your account, to report spam from other folks. Those reports go to a team of software engineers who work on Google's webspam and search quality teams. Then you can see, they've done a video, where they sit around and they prioritize all the day's projects and they determine who is going to work on what and how much energy they're going to put into it. You can probably tell that over the last couple of years, maybe even three years, there has not been a ton of energy spent to try and devalue link spam. In fact, a lot of paid links are working these days, and it's sort of a sad reality. I think that many people assume that Google's actually trying to move beyond linking signals, particularly social signals, Google+, by using the signals of users and usage data that they're getting through Chrome's market share, which I think was now reported globally as over 25% of all web browsers, which is very impressive, from StatCounter.

So, I would say that this is a low-value activity as well. Not that you necessarily shouldn't do it. I mean, if you want to try and help Google make the Web a better place and you believe in their sort of mission and the quality of the people there, then by all means, spend two minutes, report them for webspam. It's not going to take a ton of your time. But please, don't think this is a solution. This will not solve any of the problems you're trying to solve. It might help Google in the long run to get better, to try and analyze some forms of webspam and link spam that they might not have otherwise caught if you hadn't told them. Is it going to help you rank better? Boy, probably not, and even if it is, not for a long time, because these algorithmic developments take a tremendous amount of time and energy to implement. Panda was years in the making. Most of the link spam devaluations that happened in '07 and '08 were years in the making. You could see patents that were filed two years, four years before those things actually came out. But reporting spam is an option.

Then I want you to move on to step four, which is can you - I think you almost always can - outmaneuver the spammers using their own tactics? What I mean by this is you might see where those links are coming from, but what's winning? Is it coming from high PageRank or high MozRank pages? Sort of home pages of domains? Is it coming from internal pages? Are they coming from directories? Are they coming from forums? Are they coming from blogs? Are they coming from .edu sites? Where are those links coming from, what are they pointing to, and what kind of anchor text are they using? Is it diverse anchor text? Is it all exact match anchor text? You want to find, you want to identify all the patterns. You're going to say, "Oh, this is anchor text pattern and this is the diversity of those patterns of where those links are coming from and this is a type of site it is coming from and this is the quantity or the number of sites I'm seeing and here's where the link target's pointing to." You look at all those things and then you find ways to do it inbound. Find ways to do it white hat. I promise you, you can. Think of one of the most common forms of spam, which is someone hijacking .edu webpages on student domains and then they essentially have all these anchor text links pointing to a specific page on their site from .edu pages that are buried deep in a site, but because it is an .edu it is a trusted domain. Usually there are only 50 or 100 of them, but they seem to be passing juice.

So how do you get 20 or 30 good links from .edu? I'll give you some great examples. Well, I'll give you one, and then you can figure out tons more on your own and certainly there is tons of link building content that you can look at on the SEOmoz site. But here's a great one. Go do a search like your keyword - whoa, that's a lot of smudging - your keyword + file type:pdf or xls or something like that and site:.edu. What this is going to give you is essentially here is a bunch of research that has been done on .edu sites that's been published, that's probably kind of buried. Now, I want you to go create some great blog posts, some great content, that references this stuff, that turns it into a graphic, that makes a clever video about it, and then I want you to email whoever was responsible for the research, and I guarantee half the time they are going to link to you from that website, from the .edu website. They're going to be like, "Oh, this is great. Someone turned my research into an infographic on a commercial site. Very cool. Great to see that application in the real world. Thank you. Here's a link . . ." from an .edu that's not spammy, that's completely inbound, white hat because it's making the Web a better place. There are ways to figure out all of this stuff.

Then I want you take this last and final step, step five. Beat them by targeting the tactics, the channels, the people, and the keywords that they don't target. Remember what spam does. Spam tends to look at, if here is the keyword demand curve and we've got the head in here with all the popular keywords, that's where all the spam is. You very rarely, extremely rarely, see spam down in the tail. So if you can do things like user generated content, building a community, building tons of longtail great content, having a blog, having a forum, a place where real people participate and are creating a kind of Q&A site, you're going to target all that longtail. Remember 70% of the keyword volume is in here. This is only 30% up in the fat head and the chunky middle. Great. Fantastic way to work around them. Or think about ways that they can't target, the channels that spammers, especially link spammers never target - social media, forums, and communities. Rarely do they ever target blogs. Those people don't take those sites seriously. They don't take them authentically. Think about the branding elements you can build. You can have a better site design, a higher conversion rate, a way better funnel. People subscribe to your email. To follow you, subscribe to your RSS feed. No spammer is ever going to get that, and those are customers that you can keep capturing again and again and again, because when you do inbound marketing, when you do white hat marketing, you don't have to just push your site up the rankings. You can approach it from a holistic point of view and win in all sorts of tactics and all sorts of channels. That's what I love about this field too.

All right everyone. I hope you've enjoyed this edition of Whiteboard Friday. I hope you'll feel maybe a little bit less stressed out about that nasty spammer who is ranking above you. I hope you'll see us again next week for another edition of Whiteboard Friday.

Video transcription by Speechpad.com


Do you like this post? Yes No

Search Engine Algorithm Basics

Posted: 08 Dec 2011 03:33 AM PST

Posted by rolfbroer

A good search engine does not attempt to return the pages that best match the input query. A good search engine tries to answer the underlying question. If you become aware of this you'll understand why Google (and other search engines), use a complex algorithm to determine what results they should return. The factors in the algorithm consist of "hard factors" as the number of backlinks to a page and perhaps some social recommendations through likes and +1' s. These are usually external influences. You also have the factors on the page itself. For this the way a page is build and various page elements play a role in the algorithm. But only by analyzing the on-site and off-site factors is it possible for Google to determine which pages will answer is the question behind the query. For this Google will have to analyze the text on a page.

In this article I will elaborate on the problems of a search engine and optional solutions. At the end of this article we haven’t revealed Google’s algorithm (unfortunately), but we’ll be one step closer to understand some advice we often give as an SEO. There will be some formulas, but do not panic. This article isn’t just about those formulas. The article contains a excel file. Oh and the best thing: I will use some Dutch delights to illustrate the problems.

Croquets and Bitterballen
Behold: Croquets are the elongated and bitterballen are the round ones ;-)

True OR False
Search engines have evolved tremendously in recent years, but at first they could only deal with Boolean operators. In simple terms, a term was included in a document or not. Something was true or false, 1 or 0. Additionally you could use the operators as AND, OR and NOT to search documents that contain multiple terms or to exclude terms. This sounds fairly simple, but it does have some problems with it. Suppose we have two documents, which consist of the following texts:

Doc1:
“And our restaurant in New York serves croquets and bitterballen.”

Doc2:
“In the Netherlands you retrieve croquets and frikandellen from the wall.”
 

Frikandellen
Oops, almost forgot to show you the frikandellen ;-)

If we were to build a search engine, the first step is tokenization of the text. We want to be able to quickly determine which documents contain a term. This is easier if we all put tokens in a database. A token is any single term in a text, so how many tokens does Doc1 contain?

At the moment you started to answer this question for yourself, you probably thought about the definition of a "term". Actually, in the example "New York" should be recognized as one term. How we can determine that the two individual words are actually one word is outside the scope of this article, so at the moment we threat each separate word as a separate token. So we have 10 tokens in Doc1 and 11 tokens in Doc2. To avoid duplication of information in our database, we will store types and not the tokens.

Types are the unique tokens in a text. In the example Doc1 contains twice the token "and". In this example I ignore the fact that “and” appears once with and once without being capitalized. As with the determination of a term, there are techniques to determine whether something actually needs to be capitalized. In this case, we assume that we can store it without a capital and that “And” & “and” are the same type.

By storing all the types in the database with the documents where we can find them, we’re able to search within the database with the help of Booleans. The search "croquets" will result in both Doc1 and Doc2. The search for "croquets AND bitterballen" will only return Doc1 as a result. The problem with this method is that you are likely to get too much or too little results. In addition, it lacks the ability to organize the results. If we want to improve our method we have to determine what we can use other then the presence / absence of a term in a document. Which on-page factors would you use to organize the results if you were Google?

Zone Indexes
A relatively simple method is to use zone indexes. A web page can be divided into different zones. Think of a title, description, author and body. By adding a weight to each zone in a document, we’re able to calculate a simple score for each document. This is one of the first on page methods search engines used to determine the subject of a page. The operation of scores by zone indexes is as follows:

Suppose we add the following weights ​​to each zone:

Zone Weight
title 0.4
description 0.1
content 0.5

We perform the following search query:
“croquets AND bitterballen”

And we have a document with the following zones:

Zone Content Boolean Score
title New York Café 0 0
description Café with delicious croquets and bitterballen 1 0.1
content Our restaurant in New York serves croquets and bitterballen 1 0.5
    Total 0.6

Because at some point everyone started abusing the weights assigned to for example the description, it became more important for Google to split the body in different zones and assign a different weight to each individual zone in the body.

This is quite difficult because the web contains a variety of documents with different structures. The interpretation of an XML document by such a machine is quite simple. When interpreting an HTML document it becomes harder for a machine. The structure and tags are much more limited, which makes the analysis more difficult. Of course there will be HTML5 in the near future and Google supports microformats, but it still has its limitations. For example if you know that Google assigns more weight to content within the <content> tag and less to content in the <footer> tag, you’ll never use the <footer> tag.

To determine the context of a page, Google will have to divide a web page into blocks. This way Google can judge which blocks on a page are important and which are not. One of the methods that can be used is the text / code ratio. A block on a page that contains much more text than HTML code contains probably the main content on the page. A block that contains many links / HTML code and little content is probably the menu. This is why choosing the right WYSIWYG editor is very important. Some of these editors use a a lot of unnecessary HTML code.

The use of text / code ratio is just one of the methods which a search engine can use to divide a page into blocks. Bill Slawski talked about identifying blocks earlier this year.

The advantage of the zone indexes method is that you can calculate quite simple a score for each document. A disadvantage of course is that many documents can get the same score.

Term frequency
When I asked you to think of on-page factors you would use to determine relevance of a document, you probably thought about the frequency of the query terms. It is a logical step to increase weight to each document using the search terms more often.

Some SEO agencies stick to the story of using the keywords on a certain percentage in the text. We all know that isn’t true, but let me show you why. I'll try to explain it on the basis of the following examples. Here are some formulas to emerge, but as I said it is the outline of the story that matters.

The numbers in the table below are the number of occurrences of a word in the document (also called term frequency or tf). So which document has a better score for the query: croquets and bitterballen ?

  croquets and café bitterballen Amsterdam ...
Doc1 8 10 3 2 0  
Doc2 1 20 3 9 2  
DocN ... ... ... ... ...  
Query 1 1 0 1 0  

The score for both documents would be as follows:
score(“croquets and bitterballen”, Doc1) = 8 + 10 + 2 = 20
score(“croquets and bitterballen”, Doc2) = 1 + 20 + 9 = 30

Document 2 is in this case closer related to the query. In this example the term “and” gains the most weight, but is this fair? It is a stop word, and we like to give it only a little value. We can achieve this by using inverse document frequency (tf-idf), which is the opposite of document frequency (df). Document frequency is the number of documents where a term occurs. Inverse document frequency is, well, the opposite. As the number of documents in which a term grows, idf will shrink.

You can calculate idf by dividing the total number of documents you have in your corpus by the number of documents containing the term and then take the logarithm of that quotient.

Suppose that the IDF of our query terms are as follows:
Idf(croquets)            = 5
Idf(and)                   = 0.01
Idf(bitterballen)         = 2

Then you get the following scores:
score(“croquets and bitterballen”, Doc1) = 8*5  + 10*0.01 + 2*2 = 44.1
score(“croquets and bitterballen”, Doc2) = 1*5 + 20*0.01 + 9*2 = 23.2

Now Doc1 has a better score. But now we don’t take the length into account. One document can contain much more content then another document, without being more relevant. A long document gains a higher score quite easy with this method.

Vector model
We can solve this by looking at the cosine similarity of a document. An exact explanation of the theory behind this method is outside the scope of this article, but you can think about it as an kind of harmonic mean between the query terms in the document. I made an excel file, so you can play with it yourself. There is an explanation in the file itself. You need the following metrics:

  • Query terms - each separate term in the query.
  • Document frequency - how many documents does Google know containing that term?
  • Term frequency - the frequency for each separate query term in the document (add this Focus Keyword widget made by Sander Tamaëla to your bookmarks, very helpful for this part)

Here's an example where I actually used the model. The website had a page that was designed to rank for "fiets kopen" which is Dutch for “buying bikes”. The problem was that the wrong page (the homepage) was ranking for the query.

For the formula, we include the previously mentioned inverse document frequency (idf). For this we need the total number of documents in the index of Google. For this we assume N = 10.4 billion.

An explanation of the table below:

  • tf = term frequency
  • df = document frequency
  • idf = inverse document frequency
  • Wt,q = weight for term in query
  • Wt,d = weight for term in document
  • Product = Wt,q * Wt,d
  • Score = Sum of the products

The main page, which was ranking: http://www.fietsentoko.nl/

term Query Document Product
  tf df idf Wt,q tf Wf Wt,d  
Fiets 1 25.500.000 3.610493159 3.610493159 21 441 0.70711 2.55302
Kopen 1 118.000.000 2.945151332 2.9452 21 441 0.70711 2.08258
              Score: 4.6356

The page I wanted to rank: http://www.fietsentoko.nl/fietsen/

term Query Document Product
  tf df idf Wt,q tf Wf Wt,d  
Fiets 1 25.500.000 3.610493159 3.610493159 22 484 0.61782 2.23063
Kopen 1 118.000.000 2.945151332 2.945151332 28 784 0.78631 2.31584
              Score: 4.54647

Although the second document contains the query terms more often, the score of the document for the query was lower (higher is better). This was because the lack of balance between the query terms. Following this calculation, I changed the text on the page, and increased the use of the term “fietsen” and decreased the use of “kopen” which is a more generic term in the search engine and has less weight. This changed the score as follows:

term Query Document Product
  tf df idf Wt,q tf Wf Wt,d  
Fiets 1 25.500.000 3.610493159 3.610493159 28 784 0.78631 2.83897
Kopen 1 118.000.000 2.945151332 2.945151332 22 484 0.61782 1.81960
              Score: 4.6586

After a few days, Google crawled the page and the document I changed started to rank for the term. We can conclude that the number of times you use a term is not necessarily important. It is important to find the right balance for the terms you want to rank.

Speed up the process
To perform this calculation for each document that meets the search query, cost a lot of processing power. You can fix this by adding some static values ​​to determine for which documents you want to calculate the score. For example PageRank is a good static value. When you first calculate the score for the pages matching the query and having an high PageRank, you have a good change to find some documents which would end up in the top 10 of the results anyway.

Another possibility is the use of champion lists. For each term take only the top N documents with the best score for that term. If you then have a multi term query, you can intersect those lists to find documents containing all query terms and probably have a high score. Only if there are too few documents containing all terms, you can search in all documents. So you’re not going to rank by only finding the best vector score, you have the have your statics scores right as well.

Relevance feedback
Relevance feedback is assigning more or less value to a term in a query, based on the relevance of a document. Using relevance feedback, a search engine can change the user query without telling the user.

The first step here is to determine whether a document is relevant or not. Although there are search engines where you can specify if a result or a document is relevant or not, Google hasn’t had such a function for a long time. Their first attempt was by adding the favorite star at the search results. Now they are trying it with the Google+ button. If enough people start pushing the button at a certain result, Google will start considering the document relevant for that query.

Another method is to look at the current pages that rank well. These will be considered relevant. The danger of this method is topic drift. If you're looking for bitterballen and croquettes, and the best ranking pages are all snack bars in Amsterdam, the danger is that you will assign value to Amsterdam and end up with just snack bars in Amsterdam in the results.

Another way for Google is to use is by simply using data mining. They can also look at the CTR of different pages. Pages where the CTR is higher and have a lower bounce rate then average can be considered relevant. Pages with a very high bounce rate will just be irrelevant.

An example of how we can use this data for adjusting the query term weights is Rochio's feedback formula. It comes down to adjusting the value of each term in the query and possibly adding additional query terms. The formula for this is as follows:
Rochhio feedback formula

The table below is a visual representation of this formula. Suppose we apply the following values ​​:
Query terms: +1 (alpha)
Relevant terms: +1 (beta)
Irrelevant terms: -0.5 (gamma)

We have the following query:
“croquets and bitterballen”

The relevance of the following documents is as follows:
Doc1   : relevant
Doc2   : relevant
Doc3   : not relevant

Terms Q Doc1 Doc2 Doc3 Weight new query
croquets 1 1 1 0 1 + 1 – 0        = 2
and 1 1 0 1 1 + 0.5 – 0.5  = 1
bitterballen 1 0 0 0 1 + 0 - 0         = 1
café 0 0 1 0 0 + 0.5 – 0     = 0.5
Amsterdam 0 0 0 1 0 + 0 – 0.5     = -0.5  = 0

The new query is as follows:
croquets(2) and(1) bitterballen(1) cafe(0.5)

The value for each term is the weight that it gets in your query. We can use those weights in our vector calculations. Although the term Amsterdam was given a score of -0.5, the adjust negative values back to 0. In this way we do not exclude terms from the search results. And although café did not appear in the original query, it was added and was given a weight in the new query.

Suppose Google uses this way of relevance feedback, then you could look at pages that already rank for a particular query. By using the same vocabulary, you can ensure that you get the most out of this way of relevance feedback.

Takeaways
In short, we’ve considered one of the options for assigning a value to a document based on the content of the page. Although the vector method is fairly accurate, it is certainly not the only method to calculate relevance. There are many adjustments to the model and it also remains only a part of the complete algorithm of search engines like Google. We have taken a look into relevance feedback as well. *cough* panda *cough*. I hope I’ve given you some insights in the methods search engine can use other then external factors. Now it's time to discuss this and to go play with the excel file :-)


Do you like this post? Yes No

Niciun comentariu:

Trimiteți un comentariu