A Content Marketer's Guide to Data Scraping

Posted: 01 Jun 2014

Posted by MatthewBarby

As digital marketers, big data should be what we use to inform a lot of the decisions we make. Using intelligence to understand what works within your industry is absolutely crucial within content campaigns, but it blows my mind to know that so many businesses aren't focusing on it.

One reason I often hear from businesses is that they don't have the budget to invest in complex and expensive tools that can feed in reams of data to them. That said, you don't always need to invest in expensive tools to gather valuable intelligence — this is where data scraping comes in.

Just so you understand, here's a very brief overview of what data scraping is from Wikipedia:

"Data scraping is a technique in which a computer program extracts data from human-readable output coming from another program."

Essentially, it involves crawling through a web page and gathering nuggets of information that you can use for your analysis. For example, you could search through a site like Search Engine Land and scrape the author names of each of the posts that have been published, and then you could correlate this to social share data to find who the top performing authors are on that website.

Hopefully, you can start to see how this data can be valuable. What's more, it doesn't require any coding knowledge — if you're able to follow my simple instructions, you can start gathering information that will inform your content campaigns. I've recently used this research to help me get a post published on the front page of BuzzFeed, getting viewed over 100,000 times and channeling a huge amount of traffic through to my blog.

Disclaimer: One thing that I really need to stress before you read on is the fact that scraping a website may breach its terms of service. You should ensure that this isn't the case before carrying out any scraping activities. For example, Twitter completely prohibits the scraping of information on their site. This is from their Terms of Service:

"crawling the Services is permissible if done in accordance with the provisions of the robots.txt file, however, scraping the Services without the prior consent of Twitter is expressly prohibited"

Google similarly forbids the scraping of content from their web properties:

Google's Terms of Service do not allow the sending of automated queries of any sort to our system without express permission in advance from Google.

So be careful, kids.

Content analysis

Mastering the basics of data scraping will open up a whole new world of possibilities for content analysis. I'd advise any content marketer (or at least a member of their team) to get clued up on this.

Before I get started on the specific examples, you'll need to ensure that you have Microsoft Excel on your computer (everyone should have Excel!) and also the SEO Tools plugin for Excel (free download here). I put together a full tutorial on using the SEO tools plugin that you may also be interested in.

Alongside this, you'll want a web crawling tool like Screaming Frog's SEO Spider or Xenu Link Sleuth (both have free options). Once you've got these set up, you'll be able to do everything that I outline below.

So here are some ways in which you can use scraping to analyse content and how this can be applied into your content marketing campaigns:

1. Finding the different authors of a blog

Analysing big publications and blogs to find who the influential authors are can give you some really valuable data. Once you have a list of all the authors on a blog, you can find out which of those have created content that has performed well on social media, had a lot of engagement within the comments and also gather extra stats around their social following, etc.

I use this information on a daily basis to build relationships with influential writers and get my content placed on top tier websites. Here's how you can do it:

Step 1: Gather a list of the URLs from the domain you're analysing using Screaming Frog's SEO Spider. Simply add the root domain into Screaming Frog's interface and hit start (if you haven't used this tool before, you can check out my tutorial here).

Once the tool has finished gathering all the URLs (this can take a little while for big websites), simply export them all to an Excel spreadsheet.

Step 2: Open up Google Chrome and navigate to one of the article pages of the domain you're analysing and find where they mention the author's name (this is usually within an author bio section or underneath the post title). Once you've found this, right-click their name and select inspect element (this will bring up the Chrome developer console).

Within the developer console, the line of code associated to the author's name that you selected will be highlighted (see the below image). All you need to do now is right-click on the highlighted line of code and press Copy XPath.

For the Search Engine Land website, the following code would be copied:

//*[@id="leftCol"]/div[2]/p/span/a

This may not make any sense to you at this stage, but bear with me and you'll see how it works.

Step 3: Go back to your spreadsheet of URLs and get rid of all the extra information that Screaming Frog gives you, leaving just the list of raw URLs – add these to the first column (column A) of your worksheet.

Step 4: In cell B2, add the following formula:

=XPathOnUrl(A2,"//*[@id='leftCol']/div[2]/p/span/a")

Just to break this formula down for you, the function XPathOnUrl allows you to use the XPath code directly within Excel (this is with the SEO Tools plugin installed; it won't work without this). The first element of the function specifies which URL we are going to scrape. In this instance I've selected cell A2, which contains a URL from the crawl I did within Screaming Frog (alternatively, you could just type the URL, making sure that you wrap it within quotation marks).

Finally, the last part of the function is our XPath code that we gathered. One thing to note is that you have to remove the quotation marks from the code and replace them with apostrophes. In this example, I'm referring to the "leftCol" section, which I've changed to 'leftCol' — if you don't do this, Excel won't read the formula correctly.

Once you press enter, there may be a couple of seconds delay whilst the SEO Tools plugin crawls the page, then it will return a result. It's worth mentioning that within the example I've given above, we're looking for author names on article pages, so if I try to run this on a URL that isn't an article (e.g. the homepage) I will get an error.

For those interested, the XPath code itself works by starting at the top of the code of the URL specified and following the instructions outlined to find on-page elements and return results. So, for the following code:

//*[@id='leftCol']/div[2]/p/span/a

We're telling it to look for any element (//*) that has an id of leftCol (@id='leftCol') and then go down to the second div tag after this (div[2]), followed by a p tag, a span tag and finally, an a tag (/p/span/a). The result returned should be the text within this a tag.
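If you'd like to see the same lookup outside of Excel, here's a rough Python sketch using only the standard library. The page markup below is invented for illustration (it is not Search Engine Land's real HTML), and `scrape_author` is a hypothetical helper name. Note that the standard library's `ElementTree` only understands a subset of XPath and needs well-formed markup, so treat this as a way to follow the path step by step rather than a production scraper.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for an article page (an assumption,
# not the real markup of any site).
SAMPLE_PAGE = """
<html>
  <body>
    <div id="leftCol">
      <div><h1>Example headline</h1></div>
      <div>
        <p><span><a href="/author/jane-doe">Jane Doe</a></span></p>
      </div>
    </div>
  </body>
</html>
"""


def scrape_author(page_source):
    """Return the text of the author link, or None if the path matches nothing."""
    root = ET.fromstring(page_source)
    # ElementTree needs a leading "." so the search is relative to the root.
    # The rest mirrors the path from the article: any element with
    # id='leftCol', its second div child, then p, span and a.
    link = root.find(".//*[@id='leftCol']/div[2]/p/span/a")
    return link.text if link is not None else None


print(scrape_author(SAMPLE_PAGE))
```

Returning None when nothing matches plays the same role as the error Excel shows when you point the formula at a page that isn't an article.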

Don't worry if you don't understand this, but if you do, it will help you to create your own XPath. For example, if you wanted to grab the output of an a tag that has rel=author attached to it (another great way of finding page authors), then you could use some XPath that looked a little something like this:

//a[@rel='author']

As a full formula within Excel it would look something like this:

=XPathOnUrl(A2,"//a[@rel='author']")

Once you've created the formula, you can drag it down and apply it to a large number of URLs all at once. This is a huge time-saver as you'd have to manually go through each website and copy/paste each author to get the same results without scraping – I don't need to explain how long this would take.

Now that I've explained the basics, I'll show you some other ways in which scraping can be used…

2. Finding extra details around page authors

So, we've found a list of author names, which is great, but to really get some more insight into the authors we will need more data. Again, this can often be scraped from the website you're analysing.

Most blogs/publications that list the names of the article author will actually have individual author pages. Again, using Search Engine Land as an example, if you click my name at the top of this post you will be taken to a page that has more details on me, including my Twitter profile, Google+ profile and LinkedIn profile. This is the kind of data that I'd want to gather because it gives me a point of contact for the author I'm looking to get in touch with.

Here's how you can do it.

Step 1: First we need to get the author profile URLs so that we can scrape the extra details from them. To do this, you can use the same approach you used to find the author's name, with just a little addition to the formula:

=XPathOnUrl(A2,"//a[@rel='author']", "href")

The addition of the "href" part of the formula will extract the value of the href attribute of the a tag. In layman's terms, it will find the hyperlink attached to the author name and return that URL as a result.
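To make the difference between the link text and the href attribute concrete, here's a small Python sketch using the standard library. The byline markup and the `author_profile_url` helper are assumptions made up for this example; real pages will be structured differently and are rarely well-formed enough for `ElementTree` without cleaning.

```python
import xml.etree.ElementTree as ET

# Invented byline markup, purely for illustration.
SAMPLE_ARTICLE = """<html><body>
<p class="byline">By <a rel="author" href="https://example.com/author/jane-doe">Jane Doe</a></p>
</body></html>"""


def author_profile_url(page_source):
    """Return the href of the first rel='author' link, or None if there isn't one."""
    root = ET.fromstring(page_source)
    link = root.find(".//a[@rel='author']")
    if link is None:
        return None
    # link.text would give the visible name; get("href") gives the URL behind it.
    return link.get("href")


print(author_profile_url(SAMPLE_ARTICLE))
```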

Step 2: Now that we have the author profile page URLs, you can go on and gather the social media profiles. Instead of scraping the article URLs, we'll be using the profile URLs.

So, like last time, we need to find the XPath code to gather the Twitter, Google+ and LinkedIn links. To do this, open up Google Chrome and navigate to one of the author profile pages, right-click on the Twitter link and select Inspect Element.

Once you've done this, hover over the highlighted line of code within Chrome's developer tools, right-click and select Copy XPath.

Step 3: Finally, open up your Excel spreadsheet and add in the following formula (using the XPath that you've copied over):

=XPathOnUrl(C2,"//*[@id='leftCol']/div[2]/p/a[2]", "href")  

Remember that this is the code for scraping Search Engine Land, so if you're doing this on a different website, it will almost certainly be different. One important thing to highlight is that I've selected cell C2 here, which contains the URL of the author profile page and not the article page. As well as this, you'll notice that I've included "href" at the end because we want the actual Twitter profile URL and not just the word 'Twitter'.

You can now repeat this same process to get the Google+ and LinkedIn profile URLs and add it to your spreadsheet. Hopefully you're starting to see the value in this, and how it can be used to gather a lot of intelligence that can be used for all kinds of online activity, not least your SEO and social media campaigns.

3. Gathering the follower counts across social networks

Now that we have the author's social media accounts, it makes sense to get their follower counts so that they can be ranked based on influence within the spreadsheet.

Here are the final XPath formulae that you can plug straight into Excel for each network to get their follower counts. All you'll need to do is replace the text INSERT SOCIAL PROFILE URL with the cell reference to the Google+/LinkedIn URL:





4. Scraping page titles

Once you've got a list of URLs, you're going to want to get an idea of what the content is actually about. Using this quick bit of XPath against any URL will display the title of the page:


To be fair, if you're using the SEO Tools plugin for Excel then you can just use the built-in feature to scrape page titles, but it's always handy to know how to do it manually!
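For anyone curious what pulling a page title looks like in code, here's a minimal Python sketch built on the standard library's `html.parser`. `TitleParser` and `page_title` are names I've made up for the example, and the sample HTML is invented; in practice you would feed in the source of a page you've already downloaded.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects whatever text sits between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def page_title(page_source):
    """Return the page title with surrounding whitespace removed."""
    parser = TitleParser()
    parser.feed(page_source)
    parser.close()
    return "".join(parser.title_parts).strip()


sample = "<html><head><title> A Guide to Scraping </title></head><body><p>Hi</p></body></html>"
print(page_title(sample))
```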

A nice extra touch for analysis is to look at the number of words used within the page titles. To do this, use the following formula:


From this you can get an understanding of what the optimum title length of a post within a website is. This is really handy if you're pitching an article to a specific publication. If you make the post the best possible fit for the site and back up your decisions with historical data, you stand a much better chance of success.
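The word count itself is a one-liner in most languages. Here's a Python sketch of the idea, with `title_word_count` as a hypothetical helper; it simply splits on whitespace, so hyphenated words and numbers each count as one word.

```python
def title_word_count(title):
    """Count whitespace-separated words in a page title."""
    return len(title.split())


for title in ["A Content Marketer's Guide to Data Scraping", "Scraping page titles"]:
    print(title_word_count(title), title)
```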

Taking this a step further, you can gather the social shares for each URL using the following functions:







Note: You can also use a tool like URL Profiler to pull in this data, which is much better for large data sets. The tool also helps you to gather large chunks of data from other social networks and from link data sources like Ahrefs, Majestic SEO and Moz, which is awesome.

If you want to get even more social stats then you can use the SharedCount API, and this is how you go about doing it…

Firstly, create a new column in your Excel spreadsheet and add the following formula (where A2 is the URL of the webpage you want to gather social stats for):


You should now have a cell that contains your webpage URL prefixed with the SharedCount API URL. This is what we will use to gather social stats. Now here's the Excel formula to use for each network (where B2 is the cell that contains the formula above):
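The prefixing step is just string concatenation, with one wrinkle: the page URL should be URL-encoded so that its own slashes and query string don't confuse the API. Here's a hedged Python sketch. The `API_PREFIX` value is a placeholder I've invented, not SharedCount's real endpoint, so swap in whatever the API's documentation specifies.

```python
from urllib.parse import quote

# Placeholder only: NOT the real SharedCount endpoint (an assumption for this sketch).
API_PREFIX = "https://api.example.com/?url="


def build_api_url(page_url):
    """Prefix a page URL with the API address, encoding it so it survives as a parameter."""
    return API_PREFIX + quote(page_url, safe="")


print(build_api_url("https://example.com/post"))
```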













Facebook Shares:


Facebook Comments:


Once you have this data, you can start looking much deeper into the elements of a successful post. Here's an example of a chart that I created around a large sample of articles that I analysed on Upworthy:

The chart looks at the average number of social shares that an article on Upworthy receives vs the number of words within its title. This is invaluable data that can be used across a whole host of different on-page elements to get the perfect article template for the site you're pitching to.
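If you'd rather do this grouping in code than in a pivot table, here's a rough Python sketch of the calculation behind a chart like that. The titles and share counts are invented purely to show the mechanics (they are not real Upworthy figures), and `average_shares_by_title_length` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean

# Invented (title, total shares) pairs; not real data from any site.
SAMPLE_ARTICLES = [
    ("Five Ways to Pitch Editors", 120),
    ("How to Scrape Author Names", 80),
    ("Why Headlines Matter More Than You Think", 300),
    ("What Nobody Tells You About Content Pitching", 500),
]


def average_shares_by_title_length(articles):
    """Group articles by title word count and average the shares in each group."""
    shares_by_length = defaultdict(list)
    for title, shares in articles:
        shares_by_length[len(title.split())].append(shares)
    return {length: mean(values) for length, values in sorted(shares_by_length.items())}


for length, average in average_shares_by_title_length(SAMPLE_ARTICLES).items():
    print(f"{length} words: {average} average shares")
```

With a real data set you'd want a decent number of articles in each group before reading much into the averages.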

See, big data is useful!

5. Date/time the post was published

Along with analysing the details of headlines that are working within a site, you may want to look at the optimal posting times for best results. This is something that I regularly do within my blogs to ensure that I'm getting the best possible return from the time I spend writing.

Every site is different, which makes it very difficult for an automated, one-size-fits-all tool to gather this information. Some sites will have this data within the <head> section of their webpages, but others will display it directly under the article headline. Again, Search Engine Land is a perfect example of a website doing this…

So here's how you can scrape this information from the articles on Search Engine Land:


Now you've got the date and time of the post. You may want to trim this down and reformat it for your data analysis, but you've got it all in Excel so that should be pretty easy.
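As an illustration of the trimming and reformatting step, here's a short Python sketch. The input format ("Jun 1, 2014 at 9:30 am") is an assumption on my part, not a confirmed description of how any particular site prints its dates, so adjust `ASSUMED_FORMAT` to match whatever your scrape actually returns.

```python
from datetime import datetime

# Assumed layout of the scraped text; change this to match the site you're analysing.
ASSUMED_FORMAT = "%b %d, %Y at %I:%M %p"


def parse_published(raw):
    """Turn a scraped date string into a datetime for sorting and grouping."""
    return datetime.strptime(raw.strip(), ASSUMED_FORMAT)


published = parse_published("  Jun 1, 2014 at 9:30 am ")
# weekday() counts from Monday = 0, which is handy for grouping posts by day.
print(published.isoformat(), published.weekday(), published.hour)
```

Once the values are real dates, you can group by weekday or hour to see when the best-performing posts went live.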

Extra reading

Data scraping is seriously powerful, and once you've had a bit of a play around with it you'll also realise that it's not that complicated. The examples that I've given are just a starting point but once you get your creative head on, you'll soon start to see the opportunities that arise from this intelligence.

Here's some extra reading that you might find useful:


  • Start using actual data to inform your content campaigns instead of going on your gut feeling.
  • Gather intelligence around specific domains you want to target for content placement and create the perfect post for their audience.
  • Get clued up on XPath and JSON through using the SEO Tools plugin for Excel.
  • Spend more time analysing what content will get you results as opposed to what sites will give you links!
  • Check the website's ToS before scraping.

Treating people with kindness


One theory says that if you treat people well, you're more likely to encourage them to do what you want, making all the effort pay off. Do this, get that.

Another one, which I prefer, is that you might consider treating people with kindness merely because you can. Regardless of what they choose to do in response, this is what you choose to do. Because you can.



Mish's Global Economic Trend Analysis

Japan Orders and Output Decline Second Month; Does it Mean Anything?

Posted: 01 Jun 2014

Orders and output in Japan contracted for the second month. However, the decline was small and it comes on the heels of a tax increase that shifted demand forward a couple months ago.

Markit reports Slower Decline in Japanese Manufacturing Output in May
Key Points:

  • Output and new orders fall for the second month running, but at slower pace
  • Exports continue to fall
  • Rate of job creation eases


Japanese manufacturing firms saw a decline in output for the second month running in May alongside a continued fall in new orders and new export orders. That said, rates of decline for both new orders and output eased from those seen in April. Employment numbers grew in May for the tenth month running, albeit at a slower pace. The headline seasonally adjusted Markit/JMMA Purchasing Managers' Index™ (PMI™) – a composite indicator designed to provide a single-figure snapshot of the performance of the manufacturing economy – posted at 49.9 in May, up from 49.4 in April. This signalled a broad stabilisation in business conditions in the sector, following the decline in April.

Output fell for the second month running in May. Similar to April, panellists commented on a decline in demand due to the sales tax increase. That said, the deterioration in output eased in comparison to the previous month. Following a similar trend, new orders continued to fall with panellists again blaming the sales tax rise. However, the decline in new orders was only slight and weaker than in the previous month, with the seasonally adjusted New Orders Index moving closer to the 50.0 no-change mark. Alongside the falls in output and new orders was also a reduction in new export business.

May recorded the fastest fall in work outstanding since July 2013. Japanese manufacturing companies attributed this to a drop in business after the increase in sales tax. Despite falls in output and new orders, Japanese manufacturers in May saw employment growth for the tenth month running as companies took on extra staff in anticipation of workload growth. That said, the rate of job creation eased to the slowest since last November.

Too Early To Tell

The positive aspect in the report is the strength in jobs, yet job growth has slowed. The weak aspect is declining exports which cannot be blamed on a sales tax hike.

Moreover, the Yen has been fairly stable recently so one cannot blame the drop in exports on a strengthening currency.

All things considered, the report is somewhat a mixed bag. It will take another month or two to assess Japan properly.

Mike "Mish" Shedlock

Mish Reader Who Speaks Russian and Reads Ukrainian Updates the Situation in Ukraine

Posted: 01 Jun 2014 10:24 AM PDT

News media reporting of the situation in Ukraine has nearly vanished in the past couple of weeks. Here is an update from reader Jacob Dreizin, a US citizen who is fluent in Russian and can read Ukrainian.

Just wanted to give you an update on Ukraine. A lot has happened since I wrote you last.

The rebels in the eastern regions are clearing their hinterland of minor government positions, and securing their supply lines from Russia. The Russian government has started to allow volunteers and weapons to move across the border without interference from its side.

Whether intended or not, it is the classic insurgent battle plan. Once this stage is completed, the remaining government forces in Donetsk and Lugansk regions will be so isolated as to have no choice but to "temporarily redeploy and regroup", that is, to retreat. 

The rebels now have MANPADS (man-portable air-defense systems), as suggested by the photo below.

Keep in mind that the average Ukrainian soldier is being paid around $100 per month, which often arrives late.  Most of these people have no motivation to risk their lives in a prolonged war of subjugation and occupation in the eastern regions. So far, Kiev has been compensating for this with its better-paid special police detachments as well as with various yahoo militias funded by the oligarch Igor Kolomoisky, owner of Ukraine's largest bank.

This cannot last. At some point, the body count will be such that a critical mass of the security forces will simply refuse to fight or even to be deployed in the war zones.  We are already starting to see this.  According to the Ministry of Interior, 13 west Ukrainian paramilitary police were fired today for refusing deployment to the eastern regions.  There was also a recent case where around 100 soldiers in one reserve unit refused call-up orders. 

Moreover, it is well known that the military is poorly fed, subsisting on donations as well as shipments of American MREs (meals ready to eat). There is a lot of grumbling in the ranks, and at some point we are going to start seeing mutinies and mass desertions.

Eventually, the Ukrainian war effort will grind down and then collapse.  Meanwhile, the rebels grow stronger by roughly 100 men each day, on average. If Russia cuts off the gas in a few days, as threatened, this will be a huge blow, as it will degrade European support for Kiev.

Finally, please take a look at the video below of a pre-funeral memorial service for five rebel militiamen in a small city in the east. Pay special attention to the segment between 1:48 and 2:07. It looks to me like around two thousand people turned out for this event. So when you read that these rebels are "terrorists" and that theirs is not a popular movement, and that the recent independence vote was a total fraud, and that it is all the work of Russia, you know you are reading Western media bias.

If you want, I can start sending you photos of destroyed homes and dead civilians, just to show that it is happening, and the media here couldn't care less.

All the best,

Jacob Dreizin

For Jacob's previous email, please see Inside Ukraine: Mish Reader Who Speaks Ukrainian and Russian Challenges Western Media View of Events

Alternative Viewpoints

Clearly Jacob's point of view is completely different from that presented by Western media.

Is it accurate?

I really do not know. However, it certainly is possible, even though some may consider his viewpoint to be nothing but pro-Russian propaganda.

In a propaganda-war, the truth is frequently somewhere in the middle. But where in the middle? I will leave that for the reader to decide.

Meanwhile, my own opinion has been that the Eastern regions will not become part of Russia, for the simple reason that Russia does not want the associated problems.

Will the issue of natural gas supply bring the crisis to a head sooner rather than later? We will find out shortly.

Mike "Mish" Shedlock

The people who started Staples didn't do it...


because they love office supplies.

They did it because they love organizing and running profitable retail businesses. They love hiring and leasing and telling a story that converts prospects into customers. Postits are sort of irrelevant.

You shouldn't become a middle school math teacher because you love math. You should do it because you love teaching.

I hope Staples has a senior buyer who actually does love office supplies. I hope that textbooks get written by people who love, really love, the topic they're writing about. It's easy, though, to fool ourselves into believing that going up the ladder means we get to do more of the thing we started out doing.

It's often the case that the people we surround ourselves with (and the tasks we do) have far more to do with job satisfaction and performance than the subject of our work.



Mish's Global Economic Trend Analysis

Wine Country Conference II Videos: Stephanie Pomboy "Confessions of Ben Bernanke", Mebane Faber "Global Stock Valuations"

Posted: 31 May 2014

A second set of Wine Country Conference Speaker Presentation videos is now available.

This set features Stephanie Pomboy on the "Confessions of Ben Bernanke", Mebane Faber on "Global Stock Valuations", and a panel discussion with John Hussman, Mebane Faber, and Stephanie Pomboy. The final set will be out next week.

This Year's Charity

As with last year, Wine Country Conference II was for charity. This year's cause was Autism. Many of the speakers donated all or part of their expense honorarium to the cause. I did as well, and lost money putting this event on.

Once again, John Hussman and the Hussman Foundation were amazingly generous. The foundation will match donations dollar for dollar, up to $50,000!

If you enjoy the videos (or even if you don't) please Make a Donation to the Autism Society.

Stephanie Pomboy "Confessions of Ben Bernanke"

Mebane Faber "Global Stock Valuations"

John Hussman, Mebane Faber, Stephanie Pomboy Panel Discussion

Here is a link to the first set of videos: Wine Country Conference II Videos: Introduction and Hussman on "A Very Mean Reversion"

Mike "Mish" Shedlock

Reducing Carbon Pollution in Our Power Plants

Here's what's going on at the White House today.

Weekly Address: Reducing Carbon Pollution in Our Power Plants

In this week's address, President Obama discussed new actions by the Environmental Protection Agency to cut dangerous carbon pollution, a plan that builds on the efforts already taken by many states, cities and companies. These new commonsense guidelines to reduce carbon pollution from power plants were created with feedback from businesses, and state and local governments, and they would build a clean energy economy while reducing carbon pollution.

The President discussed this new plan from the Children's National Medical Center in Washington, D.C., where he visited children whose asthma is aggravated by air pollution. As a parent, the President said he is dedicated to making sure our planet is cleaner and safer for future generations.

Click here to watch this week's Weekly Address.

Watch: President Obama delivers the weekly address

  Top Stories

Helping Young People Stay on Track

Three months ago, President Obama launched My Brother's Keeper -- a new initiative to ensure that America's boys and young men of color reach their full potential. And yesterday, the My Brother's Keeper Task Force released a report on its progress over the initiative's first 90 days.

Learn more about the initiative -- and find out how you can get involved in your own community.


"America Must Always Lead"

On Wednesday, President Obama traveled to West Point to congratulate the newest officers in the U.S. Army and to reflect on America's foreign policy agenda. The President acknowledged that our world is changing with accelerating speed and that America must be equipped to respond to an increasingly dynamic environment.

The President at West Point

President Obama stressed that the United States is a global leader -- a nation that "must always lead on the world stage."


President Obama's 'Inner Nerd' Comes Out at the White House Science Fair

Auto-retracting bridges made of Legos, remote-controlled search-and-rescue robots, and a 12 year-old who already has two patents. Those were just a few of the highlights from the fourth-ever White House Science Fair on Tuesday, which featured some of the nation's brightest and most innovative young scientists.

The Fourth-Ever White House Science Fair

The President spent almost an hour chatting with the participants, calling the event "one of my favorite things all year long."


Ending the War in Afghanistan

In the White House Rose Garden on Tuesday, President Obama talked about the United States' next steps in Afghanistan, and how "we will bring America's longest war to a responsible end."

Bringing Our Troops Home

"When I took office, we had nearly 180,000 troops in harm's way," President Obama said. "By the end of this year, we will have less than 10,000."


Honoring Our Veterans on Memorial Day

Hours after returning from a surprise visit to Afghanistan, President Obama traveled across the Potomac to Arlington National Cemetery to honor fallen servicemembers and their families.

President Obama Honoring Our Veterans

The President laid a wreath at the Tomb of the Unknown Soldier and closed his remarks by saying that Memorial Day is a day to "rededicate ourselves to our sacred obligations to all who wear America's uniform, and to the families who stand by them, always."


As always, to see even more of this week's events, watch the latest episode of West Wing Week:

Video player: West Wing Week



Mish's Global Economic Trend Analysis

Balancing the Budget and the Trade Deficit is Easy: Return to Gold Standard

Posted: 30 May 2014

The Daily Ticker's Lauren Lyster conducted an interesting interview today with British Member of Parliament Kwasi Kwarteng on gold and balancing the budget.

To play the video, click on the preceding link.

Kwarteng is author of War and Gold, a Five-Hundred-Year History of Empires, Adventures, and Debt.

Kwarteng notes the historic stability under gold standards, specifically citing the 2008 financial crisis and national debt level as problems related to the Fed and printing paper money.

"The credit crunch, the credit bubble that preceded it, and the huge amounts of debt and deficits that we have are related to paper money," says Kwarteng.

Lauren asked, "After the Fed has printed trillions of dollars, just to look at the past several years of expanding the money supply, how do you put the genie back in the bottle?"

"If the Chinese unilaterally declared that the renminbi would be pegged to gold it would essentially recreate the gold standard," responds Kwarteng.

Yet, Kwarteng admits that China's export policy likely precludes that from happening.

Missing the Boat on Trade

My one disagreement in an otherwise excellent discussion is that Kwarteng misses the boat on trade in a major way. He maintains that the UK cannot go back to the gold standard because of trade deficits. His take is that governments need to balance budgets and increase exports first.

He has that point backwards. Trade deficits will not fix themselves. Competitive, beggar-thy-neighbor tactics would prevent that. And if the UK and US balanced their budgets, the pound and dollar would soar, and exports would drop.

In contrast, a return to the gold standard will not only fix deficit spending, but it will cure trade deficits in a flash.

That's what I mean by "easy". Certainly, the current political environment and the ensuing short-term pain would be anything but "easy".

The trade issue is extremely important. For a recap, please see Hugo Salinas Price and Michael Pettis on the Trade Imbalance Dilemma; Gold's Honest Discipline Revisited.

Mike "Mish" Shedlock