Investigating Panda & Duplicate Content Issues

Posted: 02 Aug 2012 05:56 AM PDT

During a recent analysis of a website (a blog with fewer than 50k visitors a week), we came across some interesting factors that led us to take a different approach to the investigation.

The Problem:

  • The site faced a 20-40% drop in traffic, with the dates corresponding to roll-outs of the Panda algorithm.
  • The site saw a loss in rankings, but with no consistency across them – some keywords moved down a few positions, while others dropped off the first 2-3 pages of the SERPs.
  • The site is a blog, so most of the content was original and unique, written by a single person based on their own research and experience.
  • The site has been in existence for over 6 years and attracts a lot of natural links – in fact, no link building for the site has ever been carried out.

From the above, this doesn't seem like your typical target for Panda, but the dates of the traffic drops were too much of a coincidence.

Panda: A Reminder

High-quality sites algorithm improvements. [launch codenames "PPtl" and "Stitch", project codename "Panda"] In 2011, we launched the Panda algorithm change, targeted at finding more high-quality sites. We improved how Panda interacts with our indexing and ranking systems, making it more integrated into our pipelines. We also released a minor update to refresh the data for Panda.

And one piece of advice for working on a recovery:

One other specific piece of guidance we’ve offered is that low-quality content on some parts of a website can impact the whole site's rankings, and thus removing low quality pages, merging or improving the content of individual shallow pages into more useful pages, or moving low quality pages to a different domain could eventually help the rankings of your higher-quality content.

(bolded sections highlighted by us)

A full list of questions to ask and answer when analysing a Panda-hit site: http://googlewebmastercentral.blogspot.co.uk/2011/05/more-guidance-on-building-high-quality.html

I previously covered this here: http://www.seoptimise.com/blog/2011/10/seo-tactics-to-tame-the-panda.html and Kevin covered it here: http://www.seoptimise.com/blog/2011/05/how-to-survive-a-panda-attack.html

The Investigation

Often, QUALITY sites that were affected by Panda lost rankings for only a few key pages across a whole host of content. Keep in mind that Panda looked at a range of signals of content quality. On the sites that we have worked on, the two main factors were:

  1. Originality of content (duplicates? Does it appear original and unique?).
  2. Volume of content and signals of sharing the content.

NOTE: A lot of SEOs believe that the Panda algorithm is NOT page specific – but experience shows that a few major ranking losses and a site-wide dip due to Panda are often linked to specific pieces of content that trigger the Panda filter. We have found that a cross-section of Panda-hit sites would either take major site-wide hits (losing rankings right across the board) or would lose a few key sections and content pieces AND have smaller losses across the site.

"…improve the quality of content or if there is some part of your site that has got especially low-quality content, or stuff that really (is) not all that useful, then it might make sense to not have that content on your site…." http://www.youtube.com/watch?v=gMDx8wFAYYE

(bolding ours)

Keyword Referral Variation Analysis

Typically we would run a range of different investigations on the site, including a keyword referral variation analysis of the pre- and post-Panda periods. This is done for sites that never really did any SEO and, as such, didn't keep a record of rankings and ranking fluctuations. It gives a decent snapshot of what the primary traffic drivers were.

However, one more issue reared its head – "not provided". In the same year-on-year period pre-Panda, the site had virtually no "not provided" keywords, but in the post-Panda period the value grew to such a large proportion that a year-on-year keyword analysis would be largely flawed:

As you can see from the above sheet, the top 10 key phrases show a significant dip, but the massive growth in "not provided" makes that analysis unreliable. Over 31K visitors weren't attributed a keyword, so it's difficult to gauge where the primary ranking drops were.
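If you want to put a rough number on how badly "not provided" skews the comparison, a minimal sketch in Python is below. The filenames and column names ("Keyword", "Visits") are assumptions about a keyword-referral export, not the client's actual data:

# Minimal sketch: quantify how much "(not provided)" skews a year-on-year
# keyword comparison. Filenames and columns are hypothetical.
import pandas as pd

def not_provided_share(csv_path):
    df = pd.read_csv(csv_path)
    total = df["Visits"].sum()
    hidden = df.loc[df["Keyword"] == "(not provided)", "Visits"].sum()
    return hidden / total if total else 0.0

for period, path in [("2011 (pre-Panda)", "keywords_2011.csv"),
                     ("2012 (post-Panda)", "keywords_2012.csv")]:
    print(f"{period}: {not_provided_share(path):.1%} of organic visits had no keyword attributed")

The higher that share, the less a keyword-level comparison can tell you – which is why the analysis below switches to entry pages instead.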

Isolating Page Losses

With "not provided" hiding referrers, it's not possible to isolate specific keyword hits. This means we may have to take another route to isolating areas of loss.

An interesting way to do this is to look at the pages that previously drove traffic and identify which of them have lost search referrals compared to previous periods.

Next, we extracted the top 100 entry pages for each period – 2011 and 2012 – cross-referenced them to highlight common pages and new pages, and compared the traffic for each.

In 2011, the top 100 pages captured 367K visitors. In 2012, this figure was 290K – a difference of 77K.

The key was to isolate all the top content pages to understand what their primary losses were.

What we did:

  1. Downloaded all the top 100 entry pages and visitors for Period 1 and Period 2.
  2. Dropped them into Excel and, instead of messing about with pivots etc., went for the low-tech version: used conditional formatting to find and highlight every URL that appeared in both columns.
  3. Sorted by colour, then alphabetically, which gave us a neat comparison.
  4. Ran a simple formula for the visitor drop per page, then used conditional formatting to highlight its scale (a scripted equivalent of these steps is sketched after this list).
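For anyone who prefers to script it, here is a minimal sketch of the same comparison in pandas rather than Excel. The filenames and column names ("Landing Page", "Visits") are assumptions about the analytics export, not a prescribed format:

# Minimal sketch of the entry-page comparison described above.
import pandas as pd

p2011 = pd.read_csv("entry_pages_2011.csv")   # hypothetical export, period 1
p2012 = pd.read_csv("entry_pages_2012.csv")   # hypothetical export, period 2

# Outer join on URL so pages that only appear in one period are kept too.
merged = p2011.merge(p2012, on="Landing Page", how="outer",
                     suffixes=("_2011", "_2012")).fillna(0)

# Visitor drop per page (positive = traffic lost year on year).
merged["drop"] = merged["Visits_2011"] - merged["Visits_2012"]

# Equivalent of sorting by colour/severity in Excel: biggest losers first.
worst = merged.sort_values("drop", ascending=False)
print(worst[["Landing Page", "Visits_2011", "Visits_2012", "drop"]].head(25))

Sorting by the drop column gives the same priority list that the conditional formatting produces visually.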

The result:

This is just a snapshot, but you can see that there were a couple of really good pages in there, and some that were completely obliterated.

The next step would be to isolate all the big drops and look at those pages individually. We like to strip them all out and create a new sheet for clients to look at, using conditional formatting to highlight severe losses and order of priority:

Building a Strategy to Recover from Panda

Is it possible to recover? Yes. Is it easy? No. The change is algorithm-based, which means tweaking, then waiting, then tweaking a bit more. But a recovery is possible if we isolate the issues.

Quoting Matt Cutts' latest video:  http://www.youtube.com/watch?feature=player_embedded&v=8IzUuhTyvJk

"Remember, Panda is a hundred percent algorithmic. There's nothing manual involved in that. And we haven't made any manual exceptions. And the Panda algorithm, we tend to run it every so often. It's not like it runs every day. It's one of these things where you might run it once a month or something like that. Typically, you're gonna refresh the data for it. And at the same time, you might pull in new signals. And those signals might be able to say, 'Ah, this is a higher-quality site."

The steps needed in this case:

  1. Isolate all the top losses and work out the total loss as a result.  (done!)
  2. Identify a common factor amongst these to try and form a pattern.
  3. Confirm that the pattern identified matches industry-reported reasons for Panda filters.
  4. Investigate any other issues such as loss of link equity, navigation, crawlability.
  5. Fix said issues.
  6. Try and add more positive signals to fixed content.
  7. Wait for a Panda refresh and see how the site performs.
  8. Rinse and repeat.

Step 1. Isolating the data

In the analysis of the top 100 pages, the 25 pages identified with losses contributed a 95K drop in visitors over the time frame analysed! These pages should be the starting point for the site in terms of trying to understand patterns.

Step 2. Common patterns

Interestingly, poor content and duplicate content can often trigger Panda hits – one of the algorithm's main targets was scraper sites. One of the first things we tend to do is isolate content from hit pages and "fuzzy match" it against search results – which is a fancy way of saying: take a piece of content, drop it into the search bar and see what comes up!

 Random samples of content from a page dropped into Google:

The pink sections are matches to the content – i.e. Google bolds them. As you can see, 3 out of the first 4 are exact or close matches. The grey block is the site in question.

We took another part of the page, in quotes this time, and dropped it into Google:

Insanely, there were 42 exact matches – the original site didn't even show up on the first page!  As a side note, I checked the date on most of those sites – they were published at least a year AFTER the client's original content.
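To repeat this spot check without doing everything by hand, here is a minimal sketch that samples a few sentences from a page and turns each into an exact-phrase Google query URL to inspect manually. The page URL is hypothetical, and the results are still checked by eye rather than scraped:

# Minimal sketch: sample sentences from a page and build quoted Google
# queries for a manual duplication check.
import random
import re
import urllib.parse
import urllib.request

def sample_queries(page_url, samples=3, min_words=8):
    html = urllib.request.urlopen(page_url).read().decode("utf-8", "ignore")
    text = re.sub(r"<[^>]+>", " ", html)               # crude tag stripping
    sentences = [" ".join(s.split()) for s in re.split(r"[.!?]", text)
                 if len(s.split()) >= min_words]
    for sentence in random.sample(sentences, min(samples, len(sentences))):
        print("https://www.google.com/search?q=" +
              urllib.parse.quote_plus(f'"{sentence}"'))

sample_queries("http://example.com/blog/some-hit-post/")  # hypothetical URL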

We did this for the top 100 hit pages and came up with:

Interestingly, as we moved down the scale of pages that were hit, the smaller the traffic loss, the lower the duplication! Don't take that as proof of causation, but it is interesting nonetheless (a quick way to quantify the pattern is sketched below).
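If you want to put a number on that pattern, a rank correlation is a quick check. The figures below are purely illustrative placeholders, not the client's data:

# Minimal sketch: rank correlation between per-page traffic loss and the
# number of exact-phrase duplicate matches found for that page.
from scipy.stats import spearmanr

visitor_drop    = [41000, 18500, 9200, 4100, 2300, 900, 450]  # illustrative
duplicate_count = [42, 35, 12, 9, 4, 1, 0]                    # illustrative

rho, p_value = spearmanr(visitor_drop, duplicate_count)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")

A high rho would only show the two measures move together; it still wouldn't tell you which way any causation runs.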

Summary

What did we learn? Essentially, although this blog owner has been publishing GOOD content for many, many years, the fact that people have been copying and pasting that content for years only came to light AFTER Panda.

  • Most of the pages hit were copied, in part or in full, by either scraper sites or really powerful sites.
  • Despite the age of the articles, some of them have been overtaken by other sites on this basis.
  • Some older articles that have lost rankings, to add insult to injury, have had inbound link loss.
  • In certain situations, articles have been overtaken by “authority sites” such as .gov sites – something that isn't easy to recover from.
  • In some situations, articles are a lot less useful to queries than they were, primarily because exact match articles have replaced them or the query no longer has any volume.
  • Google also updated its Caffeine filters (the freshness algorithm), and the age of the content may cause it to drop in favour of fresher content.

The types of sites we found copying their content, or sections of it:

  • Scrapers.
  • Forums (such as MoneySavingExpert forum users quoting them!).
  • Other blogs taking large blocks of text.
  • Genuine businesses(!) lifting whole sections to promote their own businesses.
  • Government sites using portions of text the client had written years earlier about certain guidelines.

The losses came from a dip in rankings through:

  • Massively copied pages disappearing from the index.
  • Content that had portions quoted in forums dropping a few places.
  • Pages whose content was copied by high-authority forums and government sites falling out of the index or dropping massively in rankings.
  • A resultant Panda filter which, in our opinion, lowered the value of the whole site.
  • Some content that was topical in the previous year and now has little search volume – a natural dip in traffic.

Other issues that may be playing a contributing role:

  • Most of the losses came from older, established blog posts that had not been updated in a long time. Was Google using a filter to show newer pages with that content instead?
  • Some key pages suffered link loss – although this needs further investigation, there may be room to help that along.
  • Some of these posts were written well before social sharing existed, so they have had low or no social shares for a very long time – nor was there any incentive on the content to share it.
  • Bounce rates had gone up from a 70% average to an 85% average. We investigated the content – it is still relevant and correct, but we feel the date prominently displayed on each post may be putting visitors off and causing them to bounce.

Related Google algorithm factors that may also be to blame:

  • Caffeine/freshness (close correlation with this content being outranked).
  • Bounce backs (increase in bounce rate)? (not proven).
  • Lack of social signals (not proven).

Potential Actions and Solutions:

Try and get author attribution – since this is a blog that has been running for a long time and has established followers, getting authorship markup in place helps make the site look a lot more "legitimate". More reading here:

  • http://www.seomoz.org/blog/authorship-google-plus-link-building
  • http://yoast.com/wordpress-rel-author-rel-me/

Legal action and a DMCA request – where necessary and appropriate, get the copied content removed from Google's index. Further resources here:

  • http://blog.kissmetrics.com/content-scrapers/
  • https://docs.google.com/spreadsheet/viewform?formkey=dGM4TXhIOFd3c1hZR2NHUDN1NmllU0E6MQ
  • https://www.google.com/webmasters/tools/dmca-notice?pli=1&

Refresh and rewrite badly hit pages. A takedown request takes time – it is often easier to rewrite the content, which also makes it "fresh" in the algorithm's view: a double benefit. Adding more content to "light" pages and adding resources to heavier pages also seems to help make a piece of content more authoritative.

Social shares and fresh links. The content is actually of decent quality, but the lack of focus on social sharing means the site is losing out on valuable traffic and is not sending the social signals that legitimise the content. Generating a decent number of +1s, tweets, etc. can also help get the content re-crawled more often. At the same time, to speed up the re-ranking process, we suggest getting some fresh links into the content that has suffered the most link loss.

 

© SEOptimise

Related posts:

  1. Did Google Just Roll-Out Panda 3.2 (2012 Edition)?
  2. Is Reading Level a Google Panda Algorithm Factor?
  3. SEO Tactics to Tame the Panda
