Central Perk: 7 Sept. 2010

Tuesday, 7 September 2010

Mish's Global Economic Trend Analysis

European Credit Stress Returns With Vengeance - Irish, Portuguese Bond Spread at All Time High - Yen Soars - Gold Hits All Time High
NBER Likely to say "Recession Ended" July 2009; Assessing the Real Time Probability US Back in Recession
B.J. Lawson Pulls Into Lead in North Carolina Over Democratic Incumbent
Infrastructure Bank: Obama's Desperate Attempt to Win Midterm Democrat Votes; Stimulus Déjà Vu

European Credit Stress Returns With Vengeance - Irish, Portuguese Bond Spread at All Time High - Yen Soars - Gold Hits All Time High

Posted: 07 Sep 2010 07:56 PM PDT

The risk aversion trade was back in play today with treasuries, the dollar, the Yen, and gold all rallying while the Euro and European government bonds (except German Bunds) were under significant pressure.

Please consider Stocks, Irish Bonds Drop, Gold, Yen Rally on Europe Concern

Stocks slid, while Greek, Portuguese and Irish bonds tumbled, gold rose to a record and the yen surged to a 15-year high versus the dollar on concern Europe's debt crisis will worsen. U.S. and German bonds rallied.

The MSCI World Index slid 1.1 percent and the Standard & Poor's 500 Index lost 1.2 percent at 4 p.m. in New York. The gaps between 10-year German bond yields and Irish and Portuguese debt grew to all-time highs, while the German-Greek yield spread increased to the widest since May. The yen rose to as little as 83.52 per dollar as the Bank of Japan refrained from increasing bank loans. Ten-year Treasury yields lost 10 basis points to 2.6 percent. Gold futures closed at $1,259.30 an ounce.

Banks led stocks lower on concern European lenders will require more capital to compensate for holdings of bonds in the region's weakest economies. Germany's banking association said yesterday that the nation's banks need to raise $135 billion and Pacific Investment Management Co. said Greece still faces "substantial" default risk.

"The challenges haven't gone away," said James Dunigan, chief investment officer at PNC Wealth Management in Philadelphia, which oversees $103 billion. "The European debt worries that haunted us earlier this year are showing up again. Even as last week we had a couple of economic signals that weren't as bad as we thought, the headwinds have been around."

The rally in gold, Treasuries and the yen came as investors sought assets perceived as the safest. Even after a 750 billion euro ($960 billion) bailout for the weaker economies in the euro zone, investors are skittish about sovereign debt of some nations -- and about the banks that hold the region's government bonds. A default by Greece could trigger the collapse of banks with large sovereign-bond holdings, says Konrad Becker, an analyst at Merck Finck & Co. in Munich.

The German bund yield dropped 8 basis points to 2.26 percent. Greek bonds plunged, pushing the yield on the 10-year security up 28 basis points relative to bunds to 942 basis points, the most since the European Union and International Monetary Fund crafted the bailout package in May.

The German-Irish 10-year yield spread climbed to as wide as 380 basis points, the highest since Bloomberg started compiling the data, from 343 basis points. It was at 372 basis points as of the close of trading in New York. The Portuguese-German spread reached 356 basis points, also a record, from 333 basis points. Wider spreads signaled increased concern that the most indebted European nations will struggle to fund budget deficits.

Stocks Rally Halted

The S&P 500 dropped for the first time in five days, halting its longest streak of gains since July. Wells Fargo & Co., JPMorgan Chase & Co. and Bank of America Corp. dropped at least 2.2 percent to pace a retreat in 77 of 80 financial companies in the index. Oracle Corp. rallied 5.9 percent after naming Mark Hurd, former chief executive officer of Hewlett-Packard Co., as a president.

Japan's and Australia's central banks signaled the outlook for U.S. growth is deteriorating, making it tougher for them to set monetary policy. The Reserve Bank of Australia extended a pause in raising interest rates "for the time being" today, even after the nation's gross domestic product rose the most since 2007. The Bank of Japan said it's prepared to add more monetary stimulus after last week's emergency decision to expand a credit program.

Dollar, Yen Strengthen

The dollar strengthened against all 16 major currencies except the yen and franc. The Dollar Index, which gauges the currency against six major trading partners, rallied 1 percent to 82.827.

Gold Record Close

Gold futures for December delivery rose 0.7 percent to $1,259.30 an ounce on the Comex in New York, its highest closing price ever. Copper for delivery in December fell 0.8 percent to $3.4705 a pound in New York. Crude for October delivery retreated 0.7 percent to $74.09 a barrel on the New York Mercantile Exchange.

Another name for the risk aversion play is the deflation play. There is plenty of room for the dollar, treasuries, gold, and German government bonds to rally while the rest of the commodity complex drifts lower.

Mike "Mish" Shedlock
http://globaleconomicanalysis.blogspot.com

NBER Likely to say "Recession Ended" July 2009; Assessing the Real Time Probability US Back in Recession

Posted: 07 Sep 2010 12:03 PM PDT

One problem with the NBER recession dating analysis is that it is months and sometimes years late in making its assessments.

Marcelle Chauvet, professor of economics at the University of California, addresses those shortfalls in an interesting article called Real Time Analysis of the U.S. Business Cycle.

Although careful deliberations are applied to determine turning points, the NBER procedure cannot be used to monitor business cycles on a current basis. Generally, the committee meets months after a turning point (that is, the beginning or end of an economic recession) has occurred and releases a decision only when there is no doubt regarding the dating. This certainty can be achieved only by examining a substantial amount of ex post revised data. Thus, the NBER dating procedure cannot be used in real time. For example, the NBER announced only in July 2003, twenty months after the fact, that the 2001 recession had ended in November 2001.

Some models, however, can gauge how weak or strong the economy is and date business cycles in real time. In particular, the dynamic factor Markov switching model (DFMS) in Chauvet (1998) has been very successful in dating business cycles in real time and in closely reproducing the NBER dating.

The model yields a monthly indicator of the U.S. business cycles and probabilities of recessions and expansions when applied to the same series used by the NBER: nonagricultural employment, real personal income, real manufacturing and trade sales, and industrial production.

What does the DFMS nonlinear probability model tell us about U.S. recessions?

Since 1959 the U.S. economy has experienced eight recessions. Figure 1 shows the business cycle indicator, and Figure 2 shows the smoothed probabilities of recessions obtained from the DFMS model and the NBER recession dating. The probabilities are obtained using full sample information (that is, all information available from 1959 up to now).

As Figure 2 illustrates, the probabilities increase substantially at the beginning of recessions (peaks) and decrease around the end of the recessions (troughs). Recessions are generally short, lasting on average a year, whereas expansions are much longer, averaging about five years. The 1990s experienced the longest U.S. expansion (ten years) in the past 150 years, while the 2007–09 recession was the longest in the past 50 years.

Current probability of recession

Because of a two-month delay in the availability of the manufacturing and trade sales series, the probabilities of recession are also available only with a two-month delay.

The most recent probability of recession from the DFMS model is for June 2010, which uses information up to September 2010. The probability that the U.S. economy is in a recession in June is 24.7 percent.

The Beginning and End of the 2007-2009 Recession

Inquiring minds are reading the Center for Research on Economic and Financial Cycles article dating The Beginning and End of the 2007-2009 Recession

The Figure shows the real time probabilities of recession from the Dynamic Factor Model with Regime Switching (Chauvet 1998). The probabilities indicate that the U.S. recession started in December 2007.

The NBER only announced that the recession began in December 2007 twelve months later, in December 2008.

Review of the Odds Over Time - To Date

Marcelle Chauvet updates the odds that we are currently in recession on her blog, Real Time Probabilities of Recession.

U.S. Recession ended in June/July 2009

Probability of Recession in June 2010 INCREASED to 24.7% after being below 10% for the last 7 months and below 50% since July 2009.

Year  Month      Probability of recession (%)  Second value (unlabeled)
2009  January    100.0                         -4.4
2009  February    98.7                         -2.8
2009  March       96.1                         -2.6
2009  April       82.8                         -0.9
2009  May         77.2                         -1.1
2009  June        66.2                         -1.6
2009  July        27.9                          0.4
2009  August      21.9                          0.4
2009  September   18.6                         -0.3
2009  October     11.0                         -0.2
2009  November     3.1                          1.1
2009  December     2.2                          0.8
2010  January      1.4                          0.8
2010  February     0.9                          0.8
2010  March        0.5                          1.3
2010  April        0.8                          1.6
2010  May          2.5                          1.4
2010  June        24.7                          0.2

That is a partial table. Marcelle Chauvet shows the odds starting in October 2007. Notice how the odds the US is currently in recession have risen from 2.5% in May to 24.7% in June.
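To make the table easier to reason about, here is a minimal Python sketch that reads the turning point and the June jump straight off the numbers. The 50% cutoff is my assumption, taken from the "below 50% since July 2009" remark; it is not a documented rule of the DFMS model, and the function names are made up for illustration.

```python
# Recession probabilities (percent) copied from the partial table above.
probabilities = [
    ("2009-01", 100.0), ("2009-02", 98.7), ("2009-03", 96.1),
    ("2009-04", 82.8), ("2009-05", 77.2), ("2009-06", 66.2),
    ("2009-07", 27.9), ("2009-08", 21.9), ("2009-09", 18.6),
    ("2009-10", 11.0), ("2009-11", 3.1), ("2009-12", 2.2),
    ("2010-01", 1.4), ("2010-02", 0.9), ("2010-03", 0.5),
    ("2010-04", 0.8), ("2010-05", 2.5), ("2010-06", 24.7),
]

# Assumed cutoff for "in recession"; not a documented rule of the model.
THRESHOLD = 50.0


def first_month_below(series, threshold):
    """Return the first month whose probability is below the threshold."""
    for month, prob in series:
        if prob < threshold:
            return month
    return None


def largest_monthly_jump(series):
    """Return (month, change) for the biggest month-over-month increase."""
    changes = [
        (curr_month, round(curr - prev, 1))
        for (_, prev), (curr_month, curr) in zip(series, series[1:])
    ]
    return max(changes, key=lambda item: item[1])


print(first_month_below(probabilities, THRESHOLD))  # 2009-07
print(largest_monthly_jump(probabilities))          # ('2010-06', 22.2)
```

Under that assumed cutoff the first month out of recession is July 2009, which lines up with the "ended in June/July 2009" call, and the 22.2 point jump in June 2010 is by far the largest monthly increase in the table.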

Odds Higher Today Than June

We will see the odds for August in a couple months. Those odds will be higher than today because of all the recent grim economic data.

Bear in mind Chauvet posts the odds we are already in recession. When the current odds are soaring at an amazing rate (which they are), the odds of going into recession at a future time will be much higher.

Some don't see it that way.

Here is a snip from Bloomberg as discussed in Nonsense from NBER on Odds of Double-Dip

Harvard University Professor Martin Feldstein, who sits on the Business Cycle Dating Committee of the National Bureau of Economic Research says "There's still a significant risk, maybe one chance in three, that there will be a double dip." Fellow panel member and Princeton University Professor Mark Watson said those odds are "way too high" and puts them instead at "one in 10 or maybe one in 20."

Double-dip odds of one-in-10 or one-in-20? When the odds are roughly 1-in-4 we are already in recession as of June?

If Marcelle Chauvet's model is accurate (and assuming the recession is over, her model is the most realistic model I have seen to date), then the above snip was indeed nonsense, not from the NBER per se, but rather from one or two economists who sit on the panel, most notably Princeton University Professor Mark Watson.

Mike "Mish" Shedlock
http://globaleconomicanalysis.blogspot.com

B.J. Lawson Pulls Into Lead in North Carolina Over Democratic Incumbent

Posted: 07 Sep 2010 10:47 AM PDT

On Sunday, I received an email from B.J. Lawson saying he has pulled into a small lead over incumbent David Price.

B.J. Writes ...

Dear Friends,

This election will be proof positive that public sentiment has shifted towards shrinking the size of the federal government, restoring the Constitution and returning to fiscal responsibility. This shift is essential -- not just for my campaign -- but also for the future of our country.

My opponent, David Price, has been described as a "true blue liberal" who has fought his entire career for single payer healthcare and other big government, big spending programs. That would explain why Mr. Price has voted with Nancy Pelosi more than any other congressman. They also entered Congress together in 1987.

Washington is broken and voters are waking up.

Washington takes too much and spends too much in an attempt to be all things to all people. I am running for Congress to end runaway government spending, balance our budget, cut taxes and restore the Constitution.

If we don't change directions in 2010, our children and future generations will suffer immensely. Our race is the perfect opportunity to send Washington D.C. a message: It's time to stop mortgaging our children's future and get our nation's financial affairs in order.

I hope you will support us in this effort. There are too many issues needing our attention for us to remain silent and divided, and I hope you'll join us in this historic race to take back our government, and our future.

Yours in Freedom,
William "BJ" Lawson
Republican for Congress North Carolina's 4th District

Sentiment has indeed shifted. I am starting to expect a blowout in the midterm elections as Voters Strongly Favor Non-Incumbent GOP Newcomers in Midterm Elections.

B.J. Lawson Profile

Inquiring minds will want to check out where Lawson Stands on the Issues

Cut Taxes to Stimulate Job Growth
Reduce the Size of Government
Reform the Federal Regulatory Burden
Reduce Spending to Restore Fiscal Balance
Empower Local Education
Restore Trust in Government

Those are highlights. Click on the above link for details.

David Price, his Democratic congressional opponent in the upcoming election, admits to not reading or understanding health care legislation before voting.

Lawson says "Passing legislation that is not fully understood, or understandable, is simply legislative malpractice. We must demand better of our elected representatives if we are to restore the trust and legitimacy of our federal government."

Charles Goyette Interview

Please click on this link to download and play this B.J. Lawson Interview with talk show host Charles Goyette.

Money is always welcome, but so is your time and energy! Please click here to Volunteer Time or Services to B.J. Lawson.

Please do what you can to support B.J. Lawson. He is of a rare Ron Paul mold, and we cannot afford to let any opportunities to elect such candidates slip through the cracks.

Mike "Mish" Shedlock
http://globaleconomicanalysis.blogspot.com

Infrastructure Bank: Obama's Desperate Attempt to Win Midterm Democrat Votes; Stimulus Déjà Vu

Posted: 06 Sep 2010 11:40 PM PDT

The president's pandering to public unions has backfired and now he wants to create an "infrastructure bank" which would be run by the government but would pool tax dollars with private investment.

The New York Times reports Obama Offers a Transit Plan to Create Jobs.

President Obama, looking to stimulate a sluggish economy and create jobs, called Monday for Congress to approve major upgrades to the nation's roads, rail lines and runways — part of a six-year plan that would cost tens of billions of dollars and create a government-run bank to finance innovative transportation projects.

Central to the plan is the president's call for an "infrastructure bank," which would be run by the government but would pool tax dollars with private investment, the White House says.

some leading proponents of such a bank — including Gov. Arnold Schwarzenegger, Republican of California; Gov. Ed Rendell, Democrat of Pennsylvania; and Michael R. Bloomberg, the independent mayor of New York — would like to see it finance a broader range of projects, including water and clean-energy projects. They say such a bank would spur innovation by allowing a panel of experts to approve projects on merit, rather than having lawmakers simply steer transportation money back home.

"It will change the way Washington spends your tax dollars," Mr. Obama said here, "reforming the haphazard and patchwork way we fund and maintain our infrastructure to focus less on wasteful earmarks and outdated formulas, and more on competition and innovation that gives us the best bang for the buck."

Mish Comment: What a bunch of crock. If the president was genuinely interested in keeping costs down he would ask Democrats to scrap Davis Bacon and collective bargaining.

The White House did not offer a price tag for the full measure or say how many jobs it would create. If Congress simply reauthorized the expired transportation bill and accounted for inflation, the new measure would cost about $350 billion over the next six years. But Mr. Obama wants to "frontload" the new bill with an additional $50 billion in initial investment to generate jobs, and vowed it would be "fully paid for." The White House is proposing to offset the $50 billion by eliminating tax breaks and subsidies for the oil and gas industry.

Mish Comment: If the bill was fully paid for, the President ought to have the balls to say how. In simple terms he is either disingenuous or a blatant liar. Is there not even $50 billion in military spending he could cut? Nothing?

After months of campaigning on the theme that the president's $787 billion stimulus package was wasteful, Republicans sought Monday to tag the new plan with the stimulus label. The Republican National Committee called it "stimulus déjà vu," and Representative Eric Cantor of Virginia, the House Republican whip, characterized it as "yet another government stimulus effort."

But Governors Rendell and Schwarzenegger, and Mayor Bloomberg, who in 2008 founded a bipartisan coalition to promote transportation upgrades, praised Mr. Obama. And in policy circles, the plan, especially the call for the infrastructure bank, is generating serious debate.

Mish Comment: Schwarzenegger and Mayor Bloomberg both have no backbone. Bloomberg panders to unions and until recently Schwarzenegger refused to play hardball. All three are hoping for a large share of the transit plan.

There is no shortage of projects in search of money. The problem, analysts say, is that Congress, which would create the bank, is not known for its ability to single out strategic priorities for growth. Instead, it traditionally builds broad support by giving a little something to everybody — Montana, for instance, would get a small amount of Amtrak money in return for its support for improvements along the Northeast corridor.

Samuel Staley, director of urban growth and land-use policy for the Reason Foundation, a libertarian research group, said the best way to spend money efficiently would be to establish the bank as a revolving loan fund so that money for new projects would not become available until money for previous projects had been repaid.

Mr. Staley expressed concern that in their zeal to spur growth and create jobs, Congress and the Obama administration would not impose such limits.

"With the $800 billion stimulus program, they were literally just dumping money into the economy," he said. "There was little legitimate cost-benefit analysis."

Mish Comment: There is never a shortage of ways Congress can and will waste taxpayer money. This will not change unless and until there is a balanced budget amendment. Until taxes have to be raised to fund projects, Congress and any

Business Tax Relief

In addition to the Infrastructure Bank, Obama to Propose Business Tax Relief, Spending to Spur Growth

Obama tomorrow will announce an expanded tax incentive to encourage business investment, an administration official said on condition of anonymity. Obama also will urge Congress to extend permanently and expand a research-and-development tax credit for businesses, costing about $100 billion over a decade. He began the rollout of initiatives yesterday in Milwaukee, calling for $50 billion in the first of a six-year program to fix roads, railways and runways and modernize the air-traffic control system.

Elections in less than two months to decide U.S. House seats and about a third of the Senate are focused on unemployment near 10 percent and a budget deficit swelled by the government's financial-system bailout. Obama is traveling this week to Midwestern states where joblessness is hurting some Democratic candidates' chances of getting elected.

At an event tomorrow in Cleveland, Obama will propose allowing companies to fully deduct the cost of purchasing equipment such as tractors, wind turbines, computers and solar panels, the official said.

In 2008 and 2009, companies could deduct 50 percent of their costs using so-called bonus depreciation. The latest proposal would increase the tax break to 100 percent through the end of 2011 and would make it retroactive to Sept. 8, 2010, the official said. The bonus depreciation measure would cost $30 billion over 10 years. It and the proposed permanent extension of the research tax credit have garnered the support of the business community.

Speaking to union members and their families on the Labor Day holiday in the U.S., Obama called for an "infrastructure bank" and requested money to rebuild 150,000 miles (241,400 kilometers) of roads, construct and maintain 4,000 miles of rail and overhaul 150 miles of runways.

Senate Republican Leader Mitch McConnell, of Kentucky, responded in a statement that the "latest plan for another stimulus should be met with justifiable skepticism," and "Americans are rightly skeptical about Washington Democrats asking for more money."

"Infrastructure programs are always popular for stimulus talk but disappointing in practice," Douglas Holtz-Eakin, president of the Washington-based American Action Forum and a former adviser to the 2008 presidential campaign of Senator John McCain, a Republican from Arizona.

Holtz-Eakin also questioned whether Congress will agree to more spending, given signs of growing voter opposition to a deficit that the Congressional Budget Office estimates will reach $1.3 trillion in the fiscal year ending Sept. 30, near last year's record shortfall of $1.4 trillion.

'Politics'

"The ratio of politics to substance in this effort is infinite," Holtz-Eakin said.

Things are looking very bleak for Obama in the midterm elections, and even Democrats are starting to shy away from many of his policies, including healthcare.

For more on Labor Day pandering, please see Labor Day Insanity from Clinton's Secretary of Labor

Mike "Mish" Shedlock
http://globaleconomicanalysis.blogspot.com

SEOmoz Daily SEO Blog

Latent Dirichlet Allocation (LDA) and Google's Rankings are Remarkably Well Correlated

Posted: 06 Sep 2010 11:42 AM PDT

Posted by randfish

Last week at our annual mozinar, Ben Hendrickson gave a talk on a unique methodology for improving SEO. The reception was overwhelming - I've never previously been part of a professional event where thunderous applause broke out not once but multiple times in the midst of a speaker's remarks.

Figure: Ben Hendrickson of SEOmoz speaking last fall at the Distilled/SEOmoz PRO Training in London (he'll be returning this year)

I doubt I can recreate the energy and excitement of the 320-person filled room that day, but my goal in this post is to help explain the concepts of topic modeling, vector space models as they relate to information retrieval and the work we've done on LDA (Latent Dirichlet Allocation). I'll also try to explain the relationship and potential applications to the practice of SEO.

A Request: Curiously, prior to the release of this post and our research publicly, there have been a number of negative remarks and criticisms from several folks in the search community suggesting that LDA (or topic modeling in general) is definitively not used by the search engines. We think there's a lot of evidence to suggest engines do use these, but we'd be excited to see contradicting evidence presented. If you have such work, please do publish!

The Search Rankings Pie Chart

Many of us are likely familiar with the ranking factors survey SEOmoz conducts every two years (we'll have another one next year and I expect some exciting/interesting differences). Of course, we know that this aggregation of opinion is likely missing out on many factors and may over- or under-emphasize the ones it does show.

Here's an illustration I created for a presentation recently to help illustrate the major categories in the overall results:

Figure: Illustration of Ranking Factors Survey Data. This suggests that many SEOs don't ascribe much weight to on-page optimization.

I myself have often felt that from all the metrics, tests and observations of Google's ranking results, the importance of on-page factors like keyword usage or TF*IDF (explained below) is fairly small. Certainly, I've not observed many results, even in low competitive spaces, where one can simply add in a few more repetitions of the keyword, maybe toss in a few synonyms or "related searches" and improve rankings. This experience, which many SEOs I've talked to share, has led me to believe that linking signals are an overwhelming majority of how the engines order results.

But, I love to be wrong.

Some of the work we've been doing around topic modeling, specifically using a process called LDA (Latent Dirichlet Allocation), has shown some surprisingly strong results. This has made me (and I think a lot of the folks who attended Ben's talk last Tuesday) question whether it was simply a naive application of the concept of "relevancy" or "keyword usage" that gave us this biased perspective.

Why Search Engines Need Topic Modeling

Some queries are very simple - a search for "wikipedia" is non-ambiguous, straightforward and can be effectively returned by even a very basic web search engine. Other searches aren't nearly as simple. Let's look at how engines might order two results - a simple problem most of the time that can be somewhat complex depending on the situation.

Figures: Query for Batman; Query for Chief Wiggum; Query for Superman; Query for Pianist

For complex queries or when relating large quantities of results with lots of content-related signals, search engines need ways to determine the intent of a particular page. Simply because it mentions the keyword 4 or 5 times in prominent places or even mentions similar phrases/synonyms won't necessarily mean that it's truly relevant to the searcher's query.

Historically, lots of SEOs have put effort into this process, so what we're doing here isn't revolutionary, and topic models, LDA included, have been around for a long time. However, no one in the field, to our knowledge, has made a topic modeling system public or compared its output with Google rankings (to help see how potentially influential these signals might be). The work Ben presented, and the really exciting bit (IMO), is in those numbers.

Term Vector Spaces & Topic Modeling

Term vector spaces, topic modeling and cosine similarity sound like tough concepts, and when Ben first mentioned them on stage, a lot of the attendees (myself included) felt a bit lost. However, Ben (along with Will Critchlow, whose Cambridge mathematics degree came in handy) helped explain these to me, and I'll do my best to replicate that here:

Figure: Simplistic Term Vector Model

In this imaginary example, every word in the English language is related to either "cat" or "dog," the only topics available. To measure whether a word is more related to "dog," we use a vector space model that creates those relationships mathematically. The illustration above does a reasonable job showing our simplistic world. Words like "bigfoot" are perfectly in the middle with no more closeness to "cat" than to "dog." But words like "canine" and "feline" are clearly closer to one than the other and the degree of the angle in the vector model illustrates this (and gives us a number).

BTW - in an LDA vector space model, topics wouldn't have exact label associations like "dog" and "cat" but would instead be things like "the vector around the topic of dogs."

Unfortunately, I can't really visualize beyond this step, as it relies on taking the simple model above and scaling it to thousands or millions of topics, each of which would have its own dimension (and anyone who's tried knows that drawing more than 3 dimensions in a blog post is pretty hard). Using this construct, the model can compute the similarity between any word or groups of words and the topics it's created. You can learn more about this from Stanford University's posting of Introduction to Information Retrieval, which has a specific section on Vector Space Models.
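For anyone who prefers code to diagrams, the two-topic "cat"/"dog" world can be sketched in a few lines of Python. This is only a toy: the word vectors are numbers I made up for illustration, not output from our model, and a real system would have far more than two dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Each vector is (cat weight, dog weight). All values are hypothetical.
CAT = (1.0, 0.0)
DOG = (0.0, 1.0)
words = {
    "feline": (0.9, 0.1),
    "canine": (0.1, 0.9),
    "bigfoot": (0.5, 0.5),
}

for word, vector in words.items():
    to_cat = cosine_similarity(vector, CAT)
    to_dog = cosine_similarity(vector, DOG)
    print(f"{word}: cat={to_cat:.3f} dog={to_dog:.3f}")
```

With these made-up weights, "bigfoot" scores the same against both topics, while "feline" and "canine" each lean toward one side. Adding more topics just means longer vectors; the function itself doesn't change.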

Correlation of our LDA Results w/ Google.com Rankings

Over the last 10 months, Ben (with help from other SEOmoz team members) has put together a topic modeling system based on a relatively simple implementation of LDA. While it's certainly challenging to do this work, we doubt we're the first SEO-focused organization to do so, though possibly the first to make it publicly available.

When we first started this research, we didn't know what kind of an input LDA/topic modeling might have on search engines. Thus, on completion, we were pretty excited (maybe even ecstatic) to see the following results:

Figure: Correlation Between Google.com Rankings and Various Single Metrics (the vertical blue bars indicate standard error in the diagram, which is relatively low thanks to the large sample set)

Using the same process we did for our release of Google vs. Bing correlation/ranking data at SMX Advanced (we posted much more detail on the process here), we've shown the Spearman correlations for a set of metrics familiar to most SEOs against some of the LDA results, including:

TF*IDF - the classic term weighting formula, TF*IDF measures keyword usage in a more accurate way than a more primitive metric like keyword density. In this case, we just took the TF*IDF score of the page content that appeared in Google's rankings
Followed IPs - this is our highest correlated single link-based metric, and shows the number of unique IP addresses hosting a website that contains a followed link to the URL. As we've shown in the past, with metrics like Page Authority (which uses machine learning to build more complex ranking models) we can do even better, but it's valuable in this context to just think and compare raw link numbers.
LDA Cosine - this is the score produced from the new LDA labs tool. It measures the cosine similarity of topics between a given page or content block and the topics produced by the query.
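As a rough illustration of the TF*IDF idea in the list above, here is a minimal Python sketch. It uses one common textbook variant (term frequency as a share of the document's words, inverse document frequency as the log of corpus size over the number of documents containing the term). That choice, the function name and the tiny corpus are all assumptions for the example, not a description of the exact formula behind our numbers.

```python
import math
from collections import Counter


def tf_idf(term, document, corpus):
    """Score a term in one document against a small corpus of documents."""
    words = document.lower().split()
    if not words:
        return 0.0
    # Term frequency: share of the document's words that are the term.
    tf = Counter(words)[term] / len(words)
    # Document frequency: how many corpus documents contain the term.
    docs_with_term = sum(1 for doc in corpus if term in doc.lower().split())
    if docs_with_term == 0:
        return 0.0
    # Inverse document frequency: rarer terms get a larger weight.
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf


# Hypothetical three-document corpus.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the dog barked",
]

for term in ("the", "cat", "mat"):
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

A word like "the" that shows up in every document scores zero however often it is repeated, which is why TF*IDF is a better gauge of keyword usage than raw keyword density.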

The correlation with rankings of the LDA scores is uncanny. Certainly, it's not a perfect correlation, but that shouldn't be expected given the supposed complexity of Google's ranking algorithm and the many factors therein. But, seeing LDA scores show this dramatic result made us seriously question whether there was causation at work here (and we hope to do additional research via our ranking models to attempt to show that impact). Perhaps good links are more likely to point to pages that are more "relevant" via a topic model, or some other aspect of Google's algorithm that we don't yet understand naturally biases towards these.
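If Spearman correlation is unfamiliar, the sketch below shows the idea in plain Python: turn both lists into ranks, then take the ordinary Pearson correlation of the ranks. The five LDA scores are invented for the example and the helper names are mine; this is not the pipeline we ran, just the statistic on a toy result set.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # 1-based average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical results for one query: position 1 is the top result.
positions = [1, 2, 3, 4, 5]
lda_scores = [0.62, 0.55, 0.58, 0.40, 0.31]

# Negate positions so that "ranks higher" and "scores higher" point the same way.
rho = spearman([-p for p in positions], lda_scores)
print(round(rho, 2))  # 0.9
```

One swapped pair out of five still leaves a correlation of 0.9 here, which helps put an average in the 0.3 range in perspective: a real signal, but a long way from results sorted by score.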

However, given that many SEO best practices (e.g. keywords in title tags, static URLs and ) have dramatically lower correlations and the same difficulties proving causation, we suspect a lot of SEO professionals will be deeply interested in trying this approach.

The LDA Labs Tool Now Available; Some Recommendations for Testing & Use

We've just recently made the LDA Labs tool available. You can use this to input a word, phrase, chunk of text or an entire page's content (via the URL input box) along with a desired query (the keyword term/phrase you want to rank for) and the tool will give back a score that represents the cosine similarity in a percentage form (100% = perfect, 0% = no relationship).

When you use the tool, be aware of a few issues:

Scores Change Slightly with Each Run
This is because, like a pollster interviewing 100 voters in a city to get a sense of the local electorate, we check a sample of the topics a content+query combo could fit with (checking every possibility would take an exceptionally long time). You can, therefore, expect the percentage output to fluctuate 1-5% each time you check a page/content block against a query.
Scores are for English Only
Unfortunately, because our topics are built from a corpus of English language documents, we can't currently provide scores for non-English queries.
LDA isn't the Whole Picture
Remember that while the average correlation is in the 0.33 range, we shouldn't expect scores for any given set of search results to go in precisely descending order (a correlation of 1.0 would suggest that behavior).
The Tool Currently Runs Against Google.com in the US only
You should be able to see the same results the tool extracts from by using a personalization-agnostic search string like http://www.google.com/xhtml?q=my+search&pws=0
Using Synonyms, "Related Searches" or Wonder Wheel Suggestions May Not Help
Term vector models are more sophisticated representations of "concepts" and "topics." Many SEOs have long recommended using synonyms or adding "related searches" as keywords on their pages, and others have suggested the importance of "topically relevant content," but there haven't been great ways to measure these or show their correlation with rankings. The scores you see from the tool will be based on a much less naive interpretation of the connections between words than these classic approaches.
Scores are Relative (20% might not be bad)
Don't presume that getting a 15% or a 20% is always a terrible result. If the folks ranking in the top 10 all have LDA scores in the 10-20% range, you're likely doing a reasonable job. Some queries simply won't produce results that fit remarkably well with given topics (which could be a weakness of our model or a weirdness about the query itself).
Our Topic Models Don't Currently Use Phrases
Right now, the topics we construct are around single word concepts. We imagine that the search engines have probably gone above and beyond this into topic modeling that leverages multi-word phrases, too, and we hope to get there someday ourselves.
Keyword Spamming Might Improve Your LDA Score, But Probably Not Your Rankings
Like anything else in the SEO world, manipulatively applying the process is probably a terrible idea. Even if this tool worked perfectly to measure keyword relevance and topic modeling in Google, it would be unwise to simply stuff 50 words over and over on your page to get the highest LDA score you could. Quality content that real people actually want to find should be the goal of SEO, and Google's almost certainly sophisticated enough to determine the difference between junk content that matches topic models and real content that real users will like (even if the tool's scoring can't do that).
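
If you want to check the same results the tool sees for a batch of queries, the personalization-agnostic search string mentioned above is easy to build in code. This is a small sketch using only the URL and parameters quoted in this post; the function name is ours, and whether Google keeps serving that address is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.google.com/xhtml"  # address quoted in the post

def depersonalized_search_url(query):
    """Build the search URL with personalization switched off (pws=0)."""
    # urlencode escapes the query; spaces become "+" as in the example URL.
    return BASE_URL + "?" + urlencode({"q": query, "pws": "0"})

print(depersonalized_search_url("my search"))
```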

If you're trying to do serious SEO analysis and improvement, my suggested methodology is to build a chart something like this:

SERPs analysis of "SEO" in Google.com w/ Linkscape Metrics + LDA

Right now, you can use Keyword Difficulty's export function and then add in some of these metrics manually (though in the future, we're working towards building this type of analysis right into the web app beta).

Once you've got a chart like this, you can get a better sense of what's propping up your competitors' rankings - anchor text, domain authority, or maybe something related to topic modeling relevancy (which the LDA tool could help with).
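
Since scores are relative, one simple thing to do with such a chart is to see where your own page's LDA score falls among the pages already ranking. Here's a rough sketch; the pages, positions and scores are all invented, and the field layout is hypothetical rather than the format of any SEOmoz export.

```python
# Hypothetical rows: (position, page, LDA score in percent). All values are made up.
serp = [
    (1, "example.com/a", 18.0),
    (2, "example.org/b", 14.5),
    (3, "example.net/c", 11.0),
    (4, "example.com/d", 19.5),
    (5, "example.org/e", 12.0),
]

def score_context(my_score, rows):
    """Summarize how one score compares with the scores of the ranking pages."""
    scores = [score for _, _, score in rows]
    low, high = min(scores), max(scores)
    return {
        "low": low,
        "high": high,
        "within_range": low <= my_score <= high,
        "beats": sum(1 for score in scores if my_score > score),
    }

context = score_context(15.0, serp)

for position, page, score in serp:
    print(f"{position:>2}  {page:<16} {score:5.1f}%")
print(context)
```

A 15% looks weak in isolation, but against competitors who all sit between 11% and 19.5% it is right in the pack.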

Undoubtedly, Google's More Sophisticated than This

While the correlations are high, and the excitement around the tool both inside SEOmoz and from a lot of our members and community is equally high, this is not us "reversing the algorithm." We may have built a great tool for improving the relevancy of your pages and helping to judge whether topic modeling is another component in the rankings, but it remains to be seen if we can simply improve scores on pages and see them rise in the results.

What's exciting to us isn't that we've found a secret formula (LDA has been written about for years and vector space models have been around for decades), but that we're making a potentially valuable addition to the parts of SEO we've traditionally had little measurement around.

BTW - Thanks to Michael Cottam, who pointed us to research by a number of Googlers on pLDA. There are hundreds of papers from Google and Microsoft (Bing) researchers around LDA-related topics, too, for those interested. Reading through some of these, you can see that major search engines have almost certainly built more advanced models to handle this problem. Our correlation work and testing of the tool's usefulness will show whether a naive implementation can still provide value for optimizing pages.

For those who'd like to investigate more, we've made all of our raw data available here (in XLS format, though you'll need a more sophisticated model to do LDA). If you have interest in digging into this, feel free to email Ben at SEOmoz dot org.

How Do I Explain this to the Boss/Client?

The simplest method I've found is to use an analogy like:

If we want to rank well for "the rolling stones" it's probably a really good idea to use words like "Mick Jagger," "Keith Richards," and "tour dates." It's also probably not super smart to use words like "rubies," "emeralds," "gemstones," or the phrase "gathers no moss," as these might confuse search engines (and visitors) as to the topic we're covering.

This tool tries to give a best guess number about how well we're doing on this front vs. other people on the web (or sample blocks of words or content we might want to try). Hopefully, it can help us figure out when we've done something like writing about the Stones but forgetting to mention Keith Richards.

As always, we're looking forward to your feedback and results. We've already had some folks write in to us saying they used the tool to optimize the contents of some pages and saw dramatic ranking boosts. As we know, that might not mean anything about the tool itself or the process, but it certainly has us hoping for great things.

p.s. The next step, obviously, is to produce a tool that can make recommendations on words to add or remove to help improve this score. That's certainly something we're looking into.

p.p.s. We're leaving the Labs LDA tool free for anyone to use for a while, as we'd love to hear what the community thinks of the process and want to get as broad input as possible. Future iterations may be PRO-only.


Daily Snapshot: Renewing and Expanding America's Roads, Rails and Runways

Your Daily Snapshot for
Tuesday, September 7, 2010

Photo of the Day

President Barack Obama waves to the crowd after speaking at the Milwaukee Laborfest in Milwaukee, Wisc., Sept. 6, 2010. (Official White House Photo by Pete Souza)


Today's Schedule

All times are Eastern Daylight Time

10:00 AM: The President and the Vice President receive the Presidential Daily Briefing

10:30 AM: The President and the Vice President receive the Economic Daily Briefing

11:10 AM: The President and the Vice President meet with Secretary Clinton

11:50 AM: The President welcomes NATO Secretary General Rasmussen

12:00 PM: Briefing by Press Secretary Robert Gibbs