Monday, July 2, 2012

Googlebot Crawl Issue Identification Through Server Logs


Posted: 01 Jul 2012 07:56 PM PDT

Posted by Dave Sottimano

Sifting through server logs has made me infinitely better at my job as an SEO. If you're already using them as part of your analysis, congrats - if not, I encourage you to read this post.

In this post we’re going to:

  • Briefly introduce a server log hit
  • Understand common issues with Googlebot's crawl
  • Use a server log to see Googlebot's crawl path
  • Look at a real issue with Googlebot wasting crawl budget and fix it
  • Introduce or reacquaint you with my favourite data analyzer

It’s critical to SEOs because:

  • Webmaster Tools, third-party crawlers and search operators won’t give you the full story.
  • You’ll understand how Googlebot behaves on your site, and it will make you a better SEO.

I’m going to casually assume that you at least know what server logs are and how to obtain them. Just in case you've never seen a server log before, let's take a look at a sample "hit".

Anatomy of a server log hit

Each line in a server log represents a "hit" to the web server. The following illustrations can help explain:

File request example: brochure_download.pdf

A request for /page-a.html will likely result in multiple hits, because the browser also has to fetch the images, CSS and any other files needed to render that page.

Image credit: Media College 

Example hit

Every server logs hits a little differently, but they typically record similar information, organized into fields. Below is a sample hit to an Apache web server. I've purposely cut down the fields to make it simpler to understand:

50.56.92.47 - - [31/May/2012:12:21:17 +0100] "GET" - "/wp-content/themes/esp/help.php"  - "404" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" - www.example.com - 
 
Field Name      Value
IP              50.56.92.47
Date            31/May/2012:12:21:17 +0100
Method          GET
Response Code   404
User-agent      Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
URI_request     /wp-content/themes/esp/help.php
Host            www.example.com

In reality, there are many more fields and a wealth of information that can only be gained through web server logs. 
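To make the field breakdown concrete, here is a minimal Python sketch that splits the sample hit above into the same fields. The regular expression is written only for this cut-down sample line; a real Apache or IIS log will have a different layout, so the pattern and the function name parse_hit are illustrative assumptions, not a general parser.

```python
import re
from datetime import datetime

# The cut-down sample hit from this post, copied as-is.
SAMPLE_HIT = (
    '50.56.92.47 - - [31/May/2012:12:21:17 +0100] "GET" - '
    '"/wp-content/themes/esp/help.php"  - "404" "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
    ' - www.example.com - '
)

# Pattern tailored to the sample layout only (an assumption, not a standard).
HIT_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<method>[A-Z]+)" - '
    r'"(?P<uri>[^"]*)"\s+- '
    r'"(?P<status>\d{3})" "[^"]*" '
    r'"(?P<user_agent>[^"]*)" - '
    r'(?P<host>\S+)'
)


def parse_hit(line):
    """Return a dict of fields for one hit, or None if the line does not fit."""
    match = HIT_PATTERN.match(line)
    if match is None:
        return None
    hit = match.groupdict()
    hit["status"] = int(hit["status"])
    # Apache-style timestamp: day/month/year:hour:minute:second offset
    hit["date"] = datetime.strptime(hit["date"], "%d/%b/%Y:%H:%M:%S %z")
    return hit


hit = parse_hit(SAMPLE_HIT)
for field in ("ip", "date", "method", "status", "user_agent", "uri", "host"):
    print(field, "=", hit[field])
```

Returning None for lines that do not fit keeps the sketch honest: anything the pattern cannot read is skipped rather than guessed at.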
 

Googlebot crawl issues you can find with logs

With regard to SEO specifically, we want to make sure that Google is crawling the pages we want crawled on our site, because we want them to rank. We already know what we can do internally to help pages rank in search results, such as:

  1. Ensure the pages are internally linked.
  2. Keep important pages as close to the root as possible.
  3. Ensure that the pages do not return errors.
This is all fairly standard stuff, and you can get this information easily without server logs. But I want more: I want to see Googlebot.
 
I want to look for Googlebot-specific issues like:
  1. Unnecessary crawl budget expenditure
  2. Pages it considers important / not important
  3. Bot traps
  4. 404 errors Google creates by making up URLs (think JavaScript)
  5. Google trying to fill out forms (yes, it happens)

Using server logs to see Googlebot

Step 1:  Get some server logs.

Ask your client, or download a set of server logs from your hosting company. The point is to capture Googlebot visiting your site, but we don't know when that will happen, so you might need a few days' worth of logs, or just a few hours.

To give you a real example:

Example domain has a PageRank of 6, DA of 80 and receives 200,000 visits a day. Their IIS server logs amount to 4 GB a day, and because the site is so popular, Googlebot visits at least once a day.

In this case, I would recommend a full day's worth of logs to ensure we catch Googlebot.

Step 2:  Download & Install Splunk.

Head over to http://www.splunk.com, sign up and download the product – free edition.

Note: the free edition will only let you upload 500 MB per 24 hours.

Step 3: Adding your server log data to Splunk

I would recommend that you put your server logs on your local machine to make this process nice and easy.

I've put together a few quick screencasts. I know they sound cheesy, but whatever.

Step 4: Only displaying hits containing Googlebot as the user-agent
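If you want to sanity-check the same filter outside Splunk, a rough Python equivalent is sketched below. It keeps any line that mentions "Googlebot" anywhere (not only in the user-agent field), which is an assumption that is usually close enough for a first pass. The second sample line is made up purely for contrast.

```python
def googlebot_hits(lines):
    """Yield the log lines that mention Googlebot anywhere in the line."""
    for line in lines:
        if "googlebot" in line.lower():
            yield line.rstrip("\n")


sample_lines = [
    # The sample hit from this post.
    '50.56.92.47 - - [31/May/2012:12:21:17 +0100] "GET" - '
    '"/wp-content/themes/esp/help.php"  - "404" "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
    ' - www.example.com - \n',
    # A made-up browser hit, for contrast only.
    '192.0.2.44 - - [31/May/2012:12:21:18 +0100] "GET" - "/index.html"  - '
    '"200" "-" "Mozilla/5.0 (Windows NT 6.1; rv:12.0) Gecko/20100101 '
    'Firefox/12.0" - www.example.com - \n',
]

kept = list(googlebot_hits(sample_lines))
print(len(kept), "of", len(sample_lines), "hits mention Googlebot")
```

For a real log you would pass an open file object instead of the list, since the function reads one line at a time and never loads the whole file into memory.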

Step 5: Export to Excel

Simply click on the Export link and wait for your massive CSV to download. (Note: If the link doesn't appear, it's because the search isn't finished yet)

The Analysis, problem & the fix

The problem

Every time Googlebot came by the site, it spent most of its time crawling PPC pages and internal JSON scripts. Just to give you an idea of how much time and crawl budget was wasted, please see below:

The real problem is that we had pages on the site that hadn't been indexed, and this was the cause. I wouldn't have found this without the server logs and I'm very grateful I did.
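A quick way to put numbers on that kind of waste is to bucket the Googlebot URIs by the path segment they contain. The sketch below assumes you already have a list of requested URIs (for example, one column of the exported CSV). The segment names cppcr and json come from the robots.txt rules later in this post, while the function name crawl_share and the sample URIs are hypothetical.

```python
from collections import Counter


def crawl_share(uris, markers=("cppcr", "json")):
    """Return the fraction of hits per marker segment, plus 'other'."""
    counts = Counter()
    for uri in uris:
        # Drop the query string, then split the path into its segments.
        segments = uri.split("?")[0].strip("/").split("/")
        label = next((m for m in markers if m in segments), "other")
        counts[label] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Hypothetical URIs, only to show the shape of the result.
example_uris = ["/uk/cppcr/a", "/cppcr", "/uk/json/x?y=1", "/about"]
for label, share in sorted(crawl_share(example_uris).items()):
    print(f"{label}: {share:.0%}")
```

Matching on whole path segments rather than substrings avoids counting an unrelated page such as /json-guide as a JSON script hit.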

A look into my Excel spreadsheet

How to confirm what you're seeing is actually Googlebot

It's possible to crawl or visit a site using the Googlebot user agent, and even worse, it's possible to spoof the Googlebot IP. I always double-check a list of IPs against what I see in the server log report, and I use the method officially outlined by Google.
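As I understand it, the method Google documents is a reverse DNS lookup on the IP, a check that the hostname belongs to googlebot.com or google.com, and then a forward DNS lookup to confirm the hostname resolves back to the same IP. Here is a minimal Python sketch of that check. The function name is_verified_googlebot is an assumption, the lookups are injectable so the logic can be tried without network access, and the hostnames and addresses in the usage lines are stubs from the documentation range, not real Googlebot addresses.

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex returns (hostname, aliases, ipv4_addresses)
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse_lookup(ip)
        if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
            return False
        return ip in forward_lookup(host)
    except OSError:  # covers socket.herror and socket.gaierror
        return False


# Stubbed lookups (made-up data) so the example runs offline.
genuine = is_verified_googlebot(
    "192.0.2.10",
    reverse_lookup=lambda addr: "crawl-192-0-2-10.googlebot.com",
    forward_lookup=lambda host: ["192.0.2.10"],
)
spoofed = is_verified_googlebot(
    "192.0.2.99",
    reverse_lookup=lambda addr: "googlebot.com.attacker.example",
    forward_lookup=lambda host: ["192.0.2.99"],
)
print(genuine, spoofed)
```

Called with just an IP, the function falls back to the real socket lookups. This sketch only handles IPv4 addresses.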

How did I fix this?

1) Crawling PPC pages

I first checked that these pages weren't indexed or receiving any traffic, then I used robots.txt to block only Googlebot from these pages. I was very careful about this, since I wanted to make sure that I didn't block Google's AdsBot (the robot that needs to crawl PPC pages).

  User-agent: Googlebot
  Disallow: /*/cppcr/
  Disallow: /cppcr

2) Infinite GET requests to JSON scripts

This was just another simple robots.txt block because Google didn't need to request these scripts. Googlebot basically got caught in a form, over and over again. Realistically, there's no reason for any bot to crawl this, so I set the user-agent to all (*).

  User-agent: *
  Disallow: /*/json/
  Disallow: /json
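Before deploying rules like these, it can help to test them against a handful of URLs. The sketch below is a small hand-rolled matcher for Google-style wildcard patterns (* matches any run of characters, a trailing $ anchors the end). It is an illustration, not Google's implementation; the names rule_matches and is_blocked and the test paths are hypothetical, and user agents are matched by exact lowercase name for simplicity.

```python
import re

# The two rule groups from this post, keyed by lowercase user-agent name.
RULES = {
    "googlebot": ["/*/cppcr/", "/cppcr"],
    "*": ["/*/json/", "/json"],
}


def rule_matches(pattern, path):
    """True if a Disallow pattern matches the start of the path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each '*' match any characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None


def is_blocked(user_agent, path):
    """Pick the agent's own group if one exists, else the '*' group."""
    group = RULES.get(user_agent.lower(), RULES["*"])
    return any(rule_matches(pattern, path) for pattern in group)


for agent, path in [
    ("Googlebot", "/uk/cppcr/offer"),
    ("Googlebot", "/cppcr"),
    ("AdsBot-Google", "/uk/cppcr/offer"),
    ("SomeOtherBot", "/uk/json/feed"),
    ("Googlebot", "/uk/json/feed"),
]:
    print(agent, path, "blocked" if is_blocked(agent, path) else "allowed")
```

One thing this sketch makes visible: as I understand Google's documented behaviour, a crawler follows only the most specific group that names it. If both groups live in the same robots.txt, Googlebot would obey its own group and skip the * group, so it may be worth repeating the JSON rules under User-agent: Googlebot and checking the result in Webmaster Tools.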

Results

I'm pretty happy to say that a week later, there was an increase of 7,000 pages in the index as reported by Webmaster Tools.

Rand wrote about some good tips to prevent crawling issues, so I recommend you check it out. Special thanks also go to the folks at ratedpeople.com for being kind enough to let me analyze and experiment on their site.

Additional resources

Feel free to follow me on Twitter @dsottimano, and don't forget to randomly hug a developer - even if they say they don't like it :)



Photo of the Day: President Obama Views Colorado Fire Damage

The White House

Your Daily Snapshot for
Monday, July 2, 2012

 

President Barack Obama views fire damage with firefighters and elected officials in Colorado Springs, Colo., June 29, 2012. (Official White House Photo by Pete Souza)

In Case You Missed It

Here are some of the top stories from the White House blog:

First Lady Michelle Obama at the African Methodist Episcopal Church's General Conference
The First Lady addressed the 49th Quadrennial Session of the African Methodist Episcopal (AME) Church's General Conference in Nashville, Tennessee.

Weekly Address: An All-Hands-On-Deck Approach to Fighting the Colorado Wildfires
President Obama thanks the brave firefighters and volunteers who are providing food, water, and shelter to those who have been impacted by the devastating Waldo Canyon fire, and makes clear that his administration will continue to bring all resources available to assist efforts to combat the fires.

Major Step Forward for Gulf Coast Restoration
Our goal and commitment is not simply to address the damage caused by the Deepwater Horizon oil spill - it is to ensure the long-term improvement and restoration of the Gulf Coast and its unique ecosystems.


 

Seth's Blog : Patina vs. shine

Patina vs. shine

Shine is fresh and new and it sparkles. Shiny catches the eye and it appeals to the neophiliac, to the person in search of polish.

Patina, on the other hand, can only be earned. Patina communicates trust (because the untrusted don't last long enough to earn a patina) and it appeals to a very different audience.

The old guy at the gym in spandex, taking steroids and brutalizing himself on the big machine--he's trying to be both and accomplishing neither.

Brands and organizations face the same choice. A book like Permission Marketing could be updated weekly, in a vain attempt on my part to keep it shiny. But that makes no sense, as the ideas in it are important because they've been right for a decade, not because they're new. That's what a new title is for.

The challenge, then, is to let your classics thrive precisely because they've earned the right, because they have a patina of quality, and yet not to rest on those laurels: get busy inventing the new shiny thing for those that demand it.




 

Sunday, July 1, 2012

Mish's Global Economic Trend Analysis


China Manufacturing Weakens 8th Month; Will the US Economy Continue to Decouple From the Rest of the World?

Posted: 01 Jul 2012 09:07 PM PDT

The global economy led by Europe and China continues its downward path. Will the US follow?

First let's take a look at China. Markit reports China Manufacturing PMI Declines 8th Consecutive Month.

Key points

  • New orders fall to greatest extent in seven months, as export orders slump
  • Factory output declines marginally in comparison; stocks of finished goods rise 
  • Input costs and output charges down at sharpest rates in 39- and 42-months respectively

China's goods producers reported an eighth successive month-on-month deterioration in operating conditions during June, as output, incoming new orders and employment continued to decrease. After adjusting for seasonal factors, the HSBC Purchasing Managers' Index™ (PMI™) – a composite indicator designed to give a single-figure snapshot of operating conditions in the manufacturing economy – inched lower from 48.4 to 48.2 in June, a level indicative of a modest pace of deterioration in business conditions. For the second quarter as a whole, the index averaged its lowest quarterly value since Q1 2009.

A lack of demand was behind the latest deterioration in operating conditions, with total and foreign new orders falling at accelerated rates in June. New export orders placed at goods producers dropped at the steepest rate in over three years. North America and Europe were both cited as sources of new order book weakness. Meanwhile, the month-on-month fall in overall new orders (exports plus domestic) was the strongest in 2012 to date. The drop in total new orders led to a further decline in manufacturing output, extending the current period of contraction to four months. However, the rate of decline in factory output remained marginal.

Comment

Commenting on the China Manufacturing PMI™ survey, Hongbin Qu, Chief Economist, China & Co-Head of Asian Economic Research at HSBC said: "It is all about growth and employment. As external demand has weakened and domestic demand hasn't shown a meaningful improvement in response to earlier easing measures, growth is likely to be on track for further slowdown, hence weighing on the jobs market. But as inflation eases sharply, Beijing has plenty of room and policy ammunition to avoid a hard landing. We expect more decisive easing efforts to come through in the coming months."

China PMI vs. Shanghai Stock Index

The following charts show an interesting story of unsustainable growth and over-exuberance by China cheerleaders nearly everywhere.

China PMI



$SSEC Shanghai Stock Index



Decoupling Review

Notice the bubble in 2007. That's when all sorts of ridiculous decoupling theories, US hyperinflation scenarios, US treasury crash scenarios, crude is going to $200, Natural Gas is going to $40, and other nonsensical ideas came out of the woodwork, many in book form, some still persisting to this day.

Instead, the reverse happened! It was the US that decoupled from the global economy. Moreover, China has been exposed for the malinvestment bubble that it is.

Now, in 2012, nearly everyone but the die-hard hyperinflationists thinks the US will decouple from the global economy. This reverse-decoupling idea is primarily based on the absurd belief the Fed will not let the economy or the stock market down (when the Fed is in fact not in control). For further discussion, please see Is There a Limit on Central Bank's Ability to Inflate?

The debate on the Fed will remain, but the facts show that I disagreed with decoupling in 2007 and I disagree with reverse-decoupling theories now.

Please see 12 Reasons US Recession Has Arrived (Or Will Shortly) for detailed rationale.

Mike "Mish" Shedlock
http://globaleconomicanalysis.blogspot.com


Email From Lead Analyst at EIA on Petroleum Usage

Posted: 01 Jul 2012 11:27 AM PDT

In response to my post 3-Month Petroleum Usage Chart for March, April, May Shows 14 Years of Supply Demand Growth has Vanished (with charts from Tim Wallace) I received a nice email from James Beck, Lead Analyst, Weekly Petroleum Status Report Team, Energy Information Administration (EIA).

Hello Mike and Tim

I just wanted to chime in on your latest charts. As the Lead Analyst for the Weekly Petroleum Status Report at the Energy Information Administration, I appreciate that you use our numbers.

While I do appreciate the use of the weekly numbers, I wanted to send you these three charts (with all of their data included) based on the EIA's Petroleum Supply Monthly which supports your point that demand for gasoline is at 2002 levels and that total petroleum product demand is at 1997/98 levels.

Additionally, I have included the distillate demand chart which shows that since the recession began in 2008, we have had distillate demand at 2000-2002 levels, and 2012 has the second-weakest Jan-Mar level since 2002 (2012 is 0.3% higher than the Jan-Mar demand for 2010, which was the lowest since 2000).

Since diesel demand is a very good proxy for the health of the economy (all shipping uses diesel--trucking, rail, barge, etc.), this weakening might be an indication of things to come.

The reason to look at the monthly numbers is that they are more reliable than the weekly as the survey is of the entire industry and there is a great deal of extra time used to verify the data. Many people believe that the monthly numbers are a revision of the weekly numbers. This is not true. These are separate surveys. Where the monthly surveys the entire industry and collects much more detailed information, the weekly information is based on a sample of the industry drawn from the monthly reporters, collects less information, and is focused on timeliness versus completeness.

The weekly numbers are estimates of the most recent week's data based on the sample and are a snapshot in time. The weekly is a very good indicator of the data, but the monthly is the touchstone (at least until the Petroleum Supply Annual is released--which is, in fact, a revision of the monthly data).

I hope you can make use of the charts. Please let me know if I can be of further assistance.

Thank you,
James Beck
Lead Analyst,
Weekly Petroleum Supply Team
Energy Information Administration Office of Petroleum and Biofuels Statistics

Jet Fuel and Propane

In a follow-up email I asked about jet fuel and received this response.

Hello Mish

Seems KJet is at lowest Jan-Mar level since 1992. KJet suffered post 9/11 then with high fuel costs in 2006-2008. There has been a watershed change in how airlines operate because of the fuel cost (higher occupancy; fewer routes; different business processes for taxiing, at-gate operations, for efficient jets, etc.). Even when passenger miles recovered to pre-9/11 levels, the demand for kjet remained much lower.

Propane is highly seasonal, but even there the Jan-Mar level is lowest since 1995.
James Beck

Monthly Delays

The reason Tim Wallace uses weekly data is one of timeliness. There are long delays in waiting for monthly stats. It is nice to see that the monthly charts below confirm what Tim Wallace has been saying.

Here are the monthly charts from James Beck.

Because of seasonal variations, the proper comparison in each of the charts below is red-dot to red-dot.

Total Petroleum Usage



Diesel and Heating Oil



Gasoline



Jet Fuel



Propane



Thanks James and Tim!

Mike "Mish" Shedlock
http://globaleconomicanalysis.blogspot.com


Seth's Blog : "All we need is 250 votes..."

"All we need is 250 votes..."

This is cruel marketing.

If you're like me, you've gotten dozens of emails over the last week about a promotion that Chase and Living Social are running in which they're promising local businesses that work within their community a chance to win a grant for $250,000. The emails almost always have the line,

All we need is a vote from 250 kind friends and supporters like you.

Here's why it's doubly dangerous. First, clearly the organization doesn't actually get a grant in exchange for only getting 250 online votes. Hey, 250 online votes won't even get you a pack of chewing gum these days. No, all the votes do is make you eligible to apply for the grant. And yet the organization, perhaps a worthy one, is now spamming thousands of people offering this sliver of hope, all in a rush to get 250 votes, even though the chances that anything will happen are perilously close to zero. There are only 12 grants available in total. That's pitiful. Hopes raised, hopes dashed.

And then, for the small businesses, the ones who get through this hurdle and then get through the hurdle of the application, once again, hopes raised, hopes dashed.

There's nothing wrong with competitions and difficult-to-achieve goals. Nothing wrong with making it hard to get into Brown or get a Gates Foundation grant. The dangerous mistake is making the organizations (and then their core supporters) think it's likely, or easy. You end up not only burning the brand of Living Social and Chase (who probably had good intentions) but by extension, hurting the brand and permission relationships of the very organizations you're trying to help. The boy who cried wolf... the villagers aren't going to come next time.

Pepsi did the same thing with charities last year, and my concern is the same: when you activate your supporters, you need a clear path to victory, not a wild goose chase.

One significant way around this: have the outbound messages of the tribe be about more than the grant. Figure out how putting in the effort to help your local organization actually strengthens ties, instead of weakening them. The pursuit could be even better than the prize if you establish the right groundwork.

To be really clear: it's harder to cut through the clutter than ever before, but just because a gimmick is going to cut through the clutter doesn't mean you should use it. It doesn't pay to make a lot of noise if that noise ends up hurting you in the long run.


