miercuri, 13 martie 2013

SEO Blog



How To Avoid Wasting Money On Marketing

Posted: 12 Mar 2013 10:29 PM PDT

Marketing your medical practice is one of those things every doctor knows they need to do, but most don't know how. Furthermore, a misguided marketing campaign can have detrimental effects, from attracting the wrong clients to attracting no clients at all. And worst of all, bad marketing...
Read more »

Mobile SEO: The Best Practices

Posted: 12 Mar 2013 10:13 PM PDT

A recent report from Nielsen revealed that 50% of mobile phone users in the United States are using smartphones, and those who are buying a new phone are more likely to choose a smartphone. This is driving the growth in the number of Americans using high-speed mobile broadband. According to Informa...
Read more »

How To Create A Good Infographic

Posted: 12 Mar 2013 10:06 PM PDT

We are besieged by the flood of information given to us every day; thus, the amount of information that needs to be processed can become overwhelming and time-consuming. For this reason, we need to find an effective way to communicate information in an appealing way. Infographics: An In-Depth Look Infographics...
Read more »

Behind the Scenes of Fresh Web Explorer



Behind the Scenes of Fresh Web Explorer

Posted: 12 Mar 2013 07:22 PM PDT

Posted by dan.lecocq

Fresh Web Explorer is conceptually simple -- it's really just a giant feed reader: just a few million of what we think are the most important feeds on the web.

At a high level, it's arranged as a pipeline, beginning with crawling the feeds themselves and ending with inserting the crawled data into our index. In between, we filter out URLs that we've already seen in the last few months, and then crawl and do a certain amount of processing. Of course, this wouldn't be much of an article if it ended here, with the simplicity. So, onwards!

The smallest atom of work the pipeline deals with is a job. Jobs are pulled off of various queues by a fleet of workers, processed, and then handed off to other workers. Different stages take different amounts of time and are best suited to certain types of machines, so it makes sense to use queues in this way. Because of the volume of data that must move through the system, it's impractical to pass the data along with each job. In fact, workers are frequently uploading to and downloading from S3 (Amazon's Simple Storage Service) and just passing around references to the data stored there.
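The reference-passing pattern can be sketched in a few lines. This is an illustrative toy, not Moz's actual code: an in-memory dict and `queue.Queue` stand in for S3 and qless, and the job/key names are made up.

```python
import json
import queue

blob_store = {}            # stand-in for S3: key -> payload
crawl_q = queue.Queue()    # stand-in for a qless queue of jobs

def enqueue_crawl_job(job_id, urls):
    """Upload the bulky payload to the blob store; enqueue only a reference."""
    key = f"jobs/{job_id}.json"
    blob_store[key] = json.dumps(urls)
    crawl_q.put({"jid": job_id, "s3_key": key})   # the job carries a pointer, not data

def worker_step(next_q):
    """Pop a job, fetch its payload by reference, process, hand off downstream."""
    job = crawl_q.get()
    urls = json.loads(blob_store[job["s3_key"]])
    processed_key = f"processed/{job['jid']}.json"
    blob_store[processed_key] = json.dumps([u.lower() for u in urls])  # the "processing"
    next_q.put({"jid": job["jid"], "s3_key": processed_key})

dedupe_q = queue.Queue()
enqueue_crawl_job("j1", ["http://EXAMPLE.com/feed"])
worker_step(dedupe_q)
handoff = dedupe_q.get()   # only a small reference moves between stages
print(handoff)
```

The payoff is that the queueing system only ever sees a few bytes per job, no matter how large the crawled data gets.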

The queueing system itself is one we talked about several months ago called "qless." Fresh Web Explorer is actually one of the two projects for which qless was written (campaign crawl is the other), though it has since been adopted by other projects, from our data science team to other as-yet-unannounced projects. Here's an example of what part of our crawl queue looks like:

In each of the following sections, I'll talk about some of the hidden challenges in many of these seemingly-innocuous stages of the pipeline, as well as the particular ways in which we've tackled them. To kick this process off, we begin with the primordial soup out of which this crawl emerges: the schedule of feeds to crawl.


Scheduling

As you might expect, a few domains are responsible for most of the feeds that we crawl -- Feedburner and Blogspot come to mind in particular. This becomes problematic when balancing politeness against crawling in a reasonable timeframe. For context, our goal is to crawl every feed in our index roughly every four hours, and yet some of these domains have hundreds of thousands of feeds. To make matters worse, this is a distributed crawl across several workers, and coordination between workers is severely detrimental to performance.

With job queues in general, it's important to strike a balance between too many jobs and jobs that take too long. Jobs sometimes fail and must be retried, but if a job represents too much work, a retry wastes a lot of work. Yet if there are too many jobs, the queueing system becomes inundated with the overhead of maintaining the state of the queues.

To allow crawlers to work independently without coordinating page fetches with one another, we pack as many URLs from one domain as we can into a single job, subject to the constraint that it can be crawled in a reasonable amount of time (on the order of minutes, not hours). With large domains, fortunately, the intuition is that if they're sufficiently popular on the web, they can handle larger amounts of traffic. So we pack all these URLs into a handful of slightly larger-than-normal jobs to limit the parallelism, and so long as each worker obeys politeness rules, we're guaranteed a close global approximation to politeness.
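The packing step above can be sketched as follows. This is a simplified illustration, not the production scheduler: the politeness delay and job-duration budget are assumed numbers, and real feed URLs are replaced with made-up ones.

```python
from collections import defaultdict
from urllib.parse import urlparse

def pack_jobs(urls, crawl_delay=2.0, max_job_seconds=600):
    """Group URLs by domain, then split each domain's list into jobs small
    enough to finish within max_job_seconds at one fetch per crawl_delay."""
    per_job = max(1, int(max_job_seconds / crawl_delay))  # URLs one job can politely crawl
    by_domain = defaultdict(list)
    for url in urls:
        by_domain[urlparse(url).netloc].append(url)
    jobs = []
    for domain, domain_urls in by_domain.items():
        for i in range(0, len(domain_urls), per_job):
            jobs.append({"domain": domain, "urls": domain_urls[i:i + per_job]})
    return jobs

urls = [f"http://blogspot.com/feed/{i}" for i in range(700)] + ["http://example.com/rss"]
jobs = pack_jobs(urls)
# blogspot.com's 700 feeds split into jobs of at most 300; example.com gets one tiny job
print([(j["domain"], len(j["urls"])) for j in jobs])
```

Because each job holds URLs from exactly one domain, any single worker can enforce the crawl delay locally and politeness holds globally without cross-worker coordination.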

Deduping URLs

Suffice it to say, we're reluctant to recrawl URLs repeatedly. To that end, one stage of the pipeline keeps track of and removes all the URLs that we've seen in the last few months. We intentionally kept the feed crawling stage simple and filter-free; it just passes _every_ URL it sees to the deduplication stage. As a result, we need to process hundreds of millions of URLs in a streaming fashion and filter as needed.

As you can imagine, simply storing a list of all the URLs we've seen (even normalized) would consume a lot of storage, and checking would be relatively slow. Even using an index would likely not be fast enough, or small enough, to fit on a few machines. Enter the bloom filter. Bloom filters are probabilistic data structures that allow you to relatively compactly store information about objects in a set (say, the set of URLs we've seen in the last week or month). You can't ask a bloom filter to list out all the members of the set, but it does allow you to add and query specific members.

Fortunately, we don't need to list all the URLs we've seen, but just answer the question: have we seen _this_ URL or _that_ one? Bloom filters have a couple of downsides: 1) they don't support deletions, and 2) they have a small false positive rate. The false positive rate can be controlled by allocating more memory, and we've limited ours to 1 in 100,000. In practice it often turns out to be lower than that limit, but it's the highest rate we're comfortable with. To get around the inability to remove items from the set, we resort to other tricks.

We actually maintain several bloom filters: one for the current month, another for the previous month, and so on. We only add URLs to the current month's filter, but when filtering URLs out, we check each of the filters for the last _k_ months. To distribute these operations across a number of workers, we use an in-memory (but disk-backed) database called Redis and our own Python bindings for an in-Redis bloom filter, pyreBloom. This enables us to filter tens of thousands of URLs per second and thus keep pace.
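The rotating-filter trick can be shown with a toy in-process bloom filter (the real system uses pyreBloom on Redis; the sizes and hash counts here are arbitrary small choices for illustration):

```python
import hashlib
from collections import deque

class Bloom:
    """A minimal bloom filter: k hash positions per item over a fixed bit array."""
    def __init__(self, size_bits=1 << 20, hashes=7):
        self.size, self.hashes = size_bits, hashes
        self.bits = bytearray(size_bits // 8)
    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size
    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)
    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

class RollingDeduper:
    """One filter per month: add to the current month, query the last k months."""
    def __init__(self, months=3):
        self.filters = deque([Bloom()], maxlen=months)
    def rollover(self):
        """Call at each month boundary; the oldest filter silently drops off."""
        self.filters.appendleft(Bloom())
    def seen(self, url):
        """True if url appears in any tracked month; otherwise record it now."""
        if any(url in f for f in self.filters):
            return True
        self.filters[0].add(url)
        return False

d = RollingDeduper(months=2)
first = d.seen("http://example.com/post")    # first sighting: crawl it
d.rollover()
second = d.seen("http://example.com/post")   # still inside the window: skip
d.rollover()
third = d.seen("http://example.com/post")    # the old filter expired: crawl again
print(first, second, third)
```

Discarding a whole month's filter at rollover is what substitutes for the per-item deletion that bloom filters can't provide.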

Crawling

We've gone through several iterations of a Python-based crawler, and we've learned a number of lessons in the process. This subject is enough to merit its own article, so if you're interested, keep an eye on the dev blog for an article on the subject.

The gist of it is that we need a way to efficiently fetch URLs from many sources in parallel. In practice for Fresh Web Explorer, this means hundreds or thousands of hosts at any one time, though at peak it's been on the order of tens of thousands. Your first instinct might be to reach for threads (and it's not a bad instinct), but while threads buy conceptual simplicity, at this scale they come with a lot of inefficiencies.

There are relatively well-known mechanisms for the ever-popular asynchronous I/O; depending on what circles you travel in, you may have encountered some of them: Node.js, Twisted, Tornado, libev, libevent, etc. At their root, these all use two main system facilities: kqueue and epoll (depending on your platform). The trouble is that these libraries expose a callback interface that can make it quite difficult to keep code concise and straightforward. A callback is a function you've written that you hand to a library to run when it's done with its processing. It's something along the lines of saying, 'fetch this page, and when you're done, run this function with the result.' While this doesn't always lead to convoluted code, it can all too easily lead to so-called 'callback hell.'

To our rescue comes threading's lesser-known cousin, the coroutine, incarnated here in gevent. We've tried a number of approaches, and in particular we've been burned by the aptly-named "twisted." Gevent has been the sword that cut the Gordian knot of crawling. Of course, it's not a panacea, and we've written a lot of code to make common crawling tasks easy: URL parsing and normalization, robots.txt parsing, and so on. In fact, the Python bindings for qless even have a gevent-compatible mode, so we can keep our job code simple and still make full use of gevent's power.
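To see why coroutines tame callback hell, compare the two styles below. This sketch uses Python's standard-library asyncio rather than gevent (gevent achieves the same effect by monkey-patching blocking calls, but asyncio keeps the example dependency-free); the fetch is faked with a sleep, and the URLs are invented.

```python
import asyncio

# Callback style (what raw epoll/kqueue wrappers push you toward):
#   fetch(url, on_done=lambda page: parse(page, on_done=lambda links: store(links)))
# Coroutine style reads top to bottom like blocking code, while the event loop
# interleaves thousands of in-flight tasks.

async def fetch(url):
    await asyncio.sleep(0.001)         # stands in for a non-blocking network read
    return f"<html>page at {url}</html>"

async def crawl_one(url):
    page = await fetch(url)            # looks sequential; actually yields to other tasks
    return url, len(page)

async def crawl_many(urls):
    return await asyncio.gather(*(crawl_one(u) for u in urls))

results = asyncio.run(crawl_many([f"http://example.com/{i}" for i in range(100)]))
print(len(results))
```

All 100 "fetches" overlap on one thread; the same structure scales to the tens of thousands of concurrent hosts mentioned above because each idle coroutine costs almost nothing.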

A few crawlers is actually all it takes to maintain steady state for us, but we've had periods where we wanted to accelerate crawling (to clear backlogs, or to recrawl when experimenting). As an example of the kind of power coroutines offer, here are some of our crawl rates for various status codes, scaled down to 10%. This graph is from a time when we were using 10 modestly-sized machines which, while maintaining politeness, sustained about 1,250 URLs/second including parsing. That amounts to about 108 million URLs a day at a cost of about $1 per million. Of course, this step alone is just a portion of the work that goes into making Fresh Web Explorer.

Dechroming

There's a certain amount of processing associated with our crawling -- parsing the page, looking at some headers, and so on -- but the most interesting part of this stage is dechroming: trying to remove all the non-content markup in a page, from sidebars to headers to ads. It's a difficult task, and no solution will be perfect. Despite that, through numerous hours and great effort (the vast majority provided by our data scientist, Dr. Matt Peters), we have a reasonable approach.

Dechroming is an area of active research in certain fields, and there are certainly some promising approaches. Many of the earlier approaches (including that of Blogscape, Fresh Web Explorer's predecessor from our tools section) relied on gathering many example pages from a given site and then using them to find the common groups of elements. This has the obvious downside of needing quick access to other examples from any given site at any given time, and it's also quite sensitive to changes in a site's markup and chrome.

Most current research focuses instead on differentiating chrome from content given only a single example page. We actually began our work by implementing a couple of algorithms described in papers. Perhaps the easiest to understand conceptually is one that computes the distribution of the amount of text per block (which doesn't necessarily have a 1:1 correspondence with HTML tags) and then finds the clumps within that distribution. The intuition is that the main content tends to come in larger sequential blocks of text than, say, comments or sidebars. In the end, our approach ended up being a combination of several techniques, and you can find out more about it in our "dragnet" repo.
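The text-per-block intuition can be demonstrated with a deliberately crude sketch. This is not the dragnet algorithm -- just a toy that splits on a few closing tags, counts words per block, and keeps the longest contiguous run of "texty" blocks; the tag list and word threshold are arbitrary choices.

```python
import re

def dechrome(html, min_words=10):
    """Split a page into text blocks and keep the largest contiguous run of
    texty blocks -- a crude proxy for the main content."""
    blocks = [re.sub(r"<[^>]+>", " ", b) for b in re.split(r"</(?:p|div|li|td)>", html)]
    texty = [len(b.split()) >= min_words for b in blocks]
    best, cur = (0, 0), (0, 0)                     # (run length, start index)
    for i, flag in enumerate(texty):
        cur = (cur[0] + 1, cur[1]) if flag else (0, i + 1)
        best = max(best, cur)
    run_len, start = best
    return " ".join(blocks[start:start + run_len]).split()

html = ("<div>Home | About | Contact</div>"
        "<p>" + "word " * 30 + "</p><p>" + "word " * 25 + "</p>"
        "<div>Copyright 2013 footer links</div>")
content = dechrome(html)
print(len(content))
```

The short navigation and footer blocks fall below the threshold and are discarded, while the two long adjacent paragraphs survive as the "clump" of main content.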


All told

Fresh Web Explorer has been in the works for a long while -- perhaps longer than I'd care to admit. It has been rife with obstacles overcome (both operational and algorithmic) and lessons learned. These lessons will be carried forward in subsequent iterations and future projects. There are many changes we’d like to make given this hindsight and of course we will. Refactoring and maintaining code is often more time-consuming than writing the original!

The feedback from our community has generally been positive so far, which is encouraging. Obviously we hope this is something that will be not only useful but also enjoyable for our customers. The less-than-positive feedback has highlighted some issues of which we are aware, most of which are high on our priority list, and it leaves us raring to make the product better.

On many points here there are many equally valid approaches. While time and space don’t permit us to present a complete picture, we’ve tried to pull out the most important parts. If there are particular questions you have about other aspects of this project or why we chose to tackle an issue one way or another, please comment! We’re happy to field any thoughts you might have on the subject :)


Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don't have time to hunt down but want to read!

Photo of the Day: Waiting on the Rain

The White House: Your Daily Snapshot for Wednesday, March 13, 2013
 

Today at 2 p.m. ET: Director of the National Economic Council and Assistant to President Obama for Economic Policy Gene Sperling answers your questions on Reddit. Find out how to participate.

Photo of the Day: Waiting on the Rain

President Barack Obama waits for a heavy rain to pass before crossing West Executive Avenue from the Eisenhower Executive Office Building to the West Wing of the White House, March 12, 2013. (Official White House Photo by Pete Souza)


In Case You Missed It

Here are some of the top stories from the White House blog:

Sunshine Week: In Celebration of Civic Engagement
As part of our Sunshine Week series, Macon Phillips discusses We the People.

President Obama Meets with the Sultan of Brunei
President Obama hosts His Majesty Sultan of Brunei for a bilateral meeting in the Oval Office to affirm the relationship between our two countries that dates back more than 160 years.

President Obama Talks Trade with His Export Council
President Obama stops by a meeting of his Export Council, a group of business executives and government leaders who advise him on trade and export issues.

Today's Schedule

All times are Eastern Daylight Time (EDT).

10:15 AM: The Vice President and Attorney General Eric Holder deliver remarks

10:30 AM: The President receives the Presidential Daily Briefing

12:00 PM: Press Briefing by Press Secretary Jay Carney WhiteHouse.gov/live

1:30 PM: The President meets with the House Republican Conference

3:05 PM: The President and the Vice President meet with Secretary of State Kerry

3:40 PM: The President meets with CEOs to discuss cybersecurity

4:05 PM: The President meets with business leaders on immigration reform

4:30 PM: The President and the Vice President meet with Treasury Secretary Lew

6:30 PM: The President delivers remarks at the Organizing for Action dinner

WhiteHouse.gov/live indicates that the event will be live-streamed on WhiteHouse.gov/live.

Get Updates

Sign up for the Daily Snapshot

Stay Connected



The White House • 1600 Pennsylvania Ave NW • Washington, DC 20500 • 202-456-1111

 

Automated Monthly Search Engine Submission

Automated Submission - Only $0.95 per month. With this fully automated search engine submission service, your website will be submitted every month to make sure the search engines don't drop your listing.



 

Quick & Easy
Increase your website's exposure and save time with this one-stop automated monthly search engine submission service.

 
 

Over 120,000 sites submitted
More than 120,000 websites have been submitted to the leading search engines using SubmitStart - now it's your turn!

 
 

Automatic Resubmissions
Your website will be submitted every month to make sure the search engines don't drop your listing.

 



Seth's Blog : Choose your customers first

 

Choose your customers first

It seems obvious, doesn't it? Each cohort of customers has a particular worldview, a set of problems, a small possible set of solutions available. Each cohort has a price they're willing to pay, a story they're willing to hear, a period of time they're willing to invest.

And yet...

And yet too often, we pick the product or service first, deciding that it's perfect and then rushing to market, sure that the audience will sort itself out. Too often, though, we end up with nothing.

Examples:

The real estate broker ought to decide which sort of buyer she's after before she goes out to buy business cards, rent an office, or get listings.

The bowling alley investor ought to pick whether he's hoping for serious league players or girls-night-out partiers before he buys a building or uniforms.

The yoga instructor, the corporate coach, the app developer--in every case, first figure out who you'd like to do business with, then go make something just for them. The more specific the better...


More Recent Articles


 

marți, 12 martie 2013

Mish's Global Economic Trend Analysis



Spain's Budget Deficit Grew by 35.4% in January to 1.2% of GDP; Spain's Tax Revenue Drops 20% in Face of VAT Hikes

Posted: 12 Mar 2013 04:52 PM PDT

Here's a story you can expect to see in the Wall Street Journal or Financial Times tomorrow. You can read it here today.

Via Google Translate, El Economista reports Spain's Budget Deficit Grew by 35.4% in January to 1.2% of GDP.
The government deficit in terms of national accounts in January reached 12.729 billion euros, equivalent to 1.2% of GDP, representing an increase of 35.4% over January 2012.

According to the budget execution data for January published on the Ministry of Finance website, the cash deficit in January came to 15.252 billion euros, the result of a 37% fall in net income (5.789 billion) and a 15.4% increase in expenses (21.041 billion).

Tax revenues fell 20% to 10.608 billion, due among other causes to the accumulation of refunds early in the year, says Finance.

Negative Indirect Tax Collection

In fact, the state's revenue from indirect taxes (VAT and special taxes) was negative at EUR 1.647 billion as a result of increased refunds, while revenues retained for the territorial governments totaled 1.530 billion, 29.1% less.

Corporate tax revenue also turned negative, by 1.131 billion, due to refunds. Revenue from direct taxes (income and corporate) fell by 18.2% to 9.078 billion.

On the expenditure side, there was a notable increase of 23.3% in payments of interest on the debt, which rose from 6.250 billion euros in January 2012 to 7.709 billion in January 2013. There was also a 10.4% increase in payments of current transfers, to 9.474 billion.

Within these transfers, Social Security payments grew by 40.2% (2.334 billion), mainly because the state budget for 2013 makes a greater contribution to minimum pension supplements.

In the first month of the year, the central government deficit reached 0.89%. The state deficit target for 2013 is 3.8% of GDP, while the target for the total deficit of Public Administration is 4.5% from 6.7% in 2012, as recently reported by the Ministry of Finance.

Spain must cut its budget deficit to 2.8% in 2014, but the European Commission is expected to extend the deadline for compliance with this commitment.

Summary

  • Spain's budget deficit for the month of January was 0.89% not counting regional deficits. 
  • The target for the entire year is 3.8% of GDP. 
  • On that basis, Spain went through 23.42% of its annual budget in a single month.
  • Spain's deficit target including regions and transfer payment is 4.5% of GDP.
  • The deficit including regions and transfer payments was 1.2% of GDP.
  • On that basis, Spain blew 26.67% of its budget in a single month.
  • Territorial government revenues declined 29.1%
  • Income Tax revenue (corporate + personal) fell 18.2%
  • Social Security payments grew by 40.2%
  • Interest payments on the debt increased 23.3%; current transfers rose 10.4%

Odds of Success Zero Percent

The odds Spain hits its budget target of 4.5% in 2013 are precisely 0.00%.

I believe we have an answer to the question I asked earlier today: Offer You Cannot Refuse; EU Passes Law Forcing Countries to Take Bailout; Is Spain the First Target?

Mike "Mish" Shedlock
http://globaleconomicanalysis.blogspot.com

An Offer You Cannot Refuse; EU Passes Law Forcing Countries to Take Bailout; Is Spain the First Target?

Posted: 12 Mar 2013 11:56 AM PDT

Want a bailout? Need a bailout? Actually, it does not matter what your country wants or needs.

By a 526 to 86 vote, the nannycrats in Brussels just passed a regulation that will require a country to accept a bailout if offered.

Via Google translate from El Economista, Brussels may force a country to ask for a rescue if eurozone threat.
The full European Parliament on Tuesday gave its final approval to the rule giving new powers to the European Commission to monitor national budgets of eurozone countries and even request changes before parliamentary approval. According to this regulation, agreed with the Twenty, Brussels may force a state to ransom.

According to this rule, which goes ahead with 526 votes in favor, 86 against and 66 abstentions, the governments are obliged to send to Brussels its draft budget for next year by 15 October each year.

The EU executive may publish its opinion on the national drafts and even request changes if it believes they deviate from the consolidation objectives undertaken by each country. However, its request will not be binding.

In addition, the new standard allows Brussels to place under increased surveillance countries that threaten the stability of the eurozone, and even to force them to ask for a rescue, with the objective of minimizing the costs.

Surveillance cycle

The Commission Vice President responsible for Economic Affairs, Olli Rehn, said on Tuesday that the adoption of this standard "will complete the cycle of budgetary surveillance for euro area Member States."

Rehn has argued that if these rules had existed since the birth of the euro, "we would never have experienced a crisis of such magnitude."
An Offer You Cannot Refuse

Rehn is a liar, a fool, or both. I vote both.

The EU had nothing but praise for Spain when the Spanish housing bubble was brewing. It would not have done anything other than what it did, which is cheerlead the housing boom, just as Bernanke and Greenspan did in the US.

I like the translation "force a state to ransom".

The EU has twice offered Spain a bailout. Spain has rejected the offer twice. The next offer just may be the one that Spain cannot refuse.

Mike "Mish" Shedlock
http://globaleconomicanalysis.blogspot.com

Mish Android App on Google Play

Posted: 12 Mar 2013 10:29 AM PDT

I have a new Android app available on Google Play. You can download it to your Android device via a button on the right sidebar of my blog that looks like this:

Android app on Google Play

Just click on that button from an Android device to load it.

I am reworking my iPhone app and hope to have something out in a couple of weeks or so.

Mike "Mish" Shedlock
http://globaleconomicanalysis.blogspot.com

Housing Construction in France Lowest in 50 Years; Hollande Responds With Measures to Support Building "For the Public Good"

Posted: 11 Mar 2013 11:55 PM PDT

Housing starts in France will fall to 280,000-300,000 in 2013, the lowest level in 50 years, warns developer Nexity. The government wants 500,000 units per year.

French president Francois Hollande thinks he knows the proper number of houses that need to be built. Therefore, Hollande confirmed measures to support building quickly.

Here is a Mish-modified translation from Les Echos...
Emergency. This is the word on everyone's lips when it comes to building. Housing construction is at its lowest level in fifty years. François Hollande confirmed in an interview yesterday that "support for building" will be stepped up quickly for the "public good."

The Ministry of Housing was happy about yesterday's statements from the Head of State: "This means we are moving towards an ambitious plan". We recall the campaign promise to build 500,000 homes per year, of which 150,000 will be in social housing to offset the increase in the VAT rate.
France is in the midst of a deflating property bubble. Nonetheless, Hollande wants to build more houses anyway. His rationale is interesting: Hollande wants to offset the increase in the VAT, a tax that he himself hiked.

Hollande is on a mission to wreck France, and he is succeeding spectacularly as the following history shows.

June 8, 2012: Please consider economically insane proposal by French president Francois Hollande "Make Layoffs So Expensive For Companies That It's Not Worth It"

August 13, 2012: In France, Government spending amounts to 55% of total domestic output. For discussion, please see Hollande's Honeymoon is Over; 54% of Voters Unhappy; Unions Promise "War" in September.

November 29, 2012: Given that any clear-thinking person should quickly realize that if companies cannot fire workers they will be extremely reluctant to hire them in the first place, it should be no surprise to discover French Unemployment Highest in 14 Years (And It's Going to Get Much Worse).

December 28, 2012: Economic implosion in France is underway. French Retail Sales Contract 9th Consecutive Month as Cost Inflation Surges

February 6, 2013: Germany Rebounds but ... France Economic Implosion Accelerates; Record Decrease in Service Employment in Italy

February 21, 2013: France Sinks Further Into Gutter; PMI Accelerates to 4-Year Low; "Core" of Europe Now Consists of Germany Only

March 6, 2013: Eurozone Downturn Accelerates Despite German Growth; Divergence to France Widest in 15 Years

For the public good, Hollande ought to resign along with his entire socialist government.

Mike "Mish" Shedlock
http://globaleconomicanalysis.blogspot.com

"Wine Country" Economic Conference Hosted By Mish
Click on Image to Learn More