Central Perk: Introducing SEOmoz's Updated Page Authority and Domain Authority

Thursday, 24 November 2011

Posted: 23 Nov 2011 06:39 AM PST

Here at Moz, we take metrics and analytics seriously and work hard to ensure that our metrics are first rate. Among our most important link metrics are Page Authority and Domain Authority. Accordingly, we have been working to improve these so that they more accurately reflect a given page or domain's ability to rank in search results. This blog entry provides an overview of these metrics and introduces our new Authority models with a deep technical dive.

What are Page and Domain Authority?

Page and Domain Authority are machine learning ranking models that predict how likely a single page or domain is to rank in search results, regardless of page content. Their input is the 41 link metrics available in our Linkscape URL Metrics API call, and their output is a score on a scale from 1 to 100. They are keyword agnostic because they do not use any information about the page content.

Why are Page and Domain Authority being updated?

Since these models predict search engine position, it is important to update them periodically to capture changes in the search engines' ranking algorithms. In addition, this update includes some changes to the underlying models resulting in increased accuracy. Our favorite measure of accuracy is the mean Spearman Correlation over a collection of SERPs. The next chart compares the correlations on several previous indices and the next index release (Index 47).

The new model outperforms the old model on the same data using the top 30 search results, and performs better still if more results are used (top 50). Note that these are out-of-sample predictions.
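For readers who want to see the accuracy measure concretely, here is a minimal Python sketch of a mean Spearman correlation over a collection of SERPs. The function names, the toy data, and the sign convention (position is negated so that a high score at position 1 gives a positive correlation) are my assumptions for illustration, not a description of our production pipeline.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def mean_spearman(serps):
    """serps: list of (positions, scores) pairs, one pair per keyword."""
    # Assumed convention: negate position so that better-ranked pages
    # with higher scores produce a positive correlation.
    return mean(spearman([-p for p in positions], scores)
                for positions, scores in serps)


# Hypothetical SERPs: search positions paired with made-up authority scores.
example_serps = [
    ([1, 2, 3, 4, 5], [80, 75, 60, 62, 40]),
    ([1, 2, 3, 4], [50, 70, 45, 30]),
]
score = mean_spearman(example_serps)
print(round(score, 2))
```

Each SERP contributes one correlation and the SERPs are averaged with equal weight, so a keyword with 50 results counts the same as one with 30.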

When will the models change? Will this affect my scores?

The models will be updated when we roll out the next Linkscape index update, sometime during the week of November 28. Your scores will likely change a little, and in some cases may change by 20 points or more. I'll present some data later in this post that shows most PRO and Free Trial members with campaigns will see a slight increase in their Page Authority.

What does this mean if I use Page Authority and Domain Authority data?

First, the metrics will be better at predicting search position, and Page Authority will remain the single highest correlated metric with search position that we have seen (including mozRank and the other 100+ metrics we examined in our Search Engine Ranking Factors study). However, since we don't yet have a good web spam scoring system, sites that manipulate search engines will slip by us (and look like outliers), so a human review is still wise.

Before presenting some details of the models, I'd like to illustrate what we mean by a "machine learning ranking model." The table below shows the top 26 results for the keyword "pumpkin recipes" with a few of our Linkscape metrics (Google-US search engine; this is from an older data set and older index, but serves as a good illustration).

Pumpkin Recipes SERP result

As you can see, there is quite a spread among the different metrics illustrated, with some of the pages having a few links and others 1,000+ links. The Linking Root Domains also range from only 46 to 200,000+. The Page Authority model takes these link metrics as input (plus 36 other link metrics not shown) and predicts the SERP ordering. Since it only takes into account link metrics (and explicitly ignores any page or keyword content), while search engines take many ranking factors into consideration, the model cannot be 100% accurate. Indeed, in this SERP, the top result benefits from an exact domain match to the keyword, which helps explain its #1 position despite its relatively low link metrics. Still, because Page Authority only takes link metrics as input, it is a single aggregate score that explains how likely a page is to rank in search based only on links. Domain Authority is the analogous score for domain-wide ranking. The models are trained on a large collection of Google-US SERP results.

Despite restricting to only link metrics, the new Page and Domain Authority models do a good job of predicting SERP ordering and improve substantially over the existing models. This increased accuracy is due in part to the new model's ability to better separate pages with moderate Page Authority values into higher and lower scores.

This chart shows the distribution of the Page Authority values for the new and old models over a data set generated from 10,000+ SERPs that includes 200,000+ unique pages (similar to the one used in our Search Engine Ranking Factors). As you can see, the new model has "fatter tails" and moves some of the pages with moderate scores to higher and lower values, resulting in better discriminating power. The average Page Authority for both sets is about the same, but the new model has a higher standard deviation, consistent with a larger spread. This larger spread is not limited to the smaller SERP data set; it is also present in our entire 40+ billion page index (plotted with the logarithm of page/domain count to show the details in the tails):

One interesting comparison is the change in Page Authority for the domains, subdomains and sub-folders PRO and Free Trial members are tracking in our campaign-based tools.

The top left panel in the chart shows that the new model shifts the distribution of Page Authority for the active domains, subdomains and sub-folders to the right. The distribution of the change in Page Authority is included in the top right panel, and shows that most of the campaigns have a small increase in their scores (average increase is 3.7), with some sites increasing by 20 points or more. A scatter plot of the individual campaign changes is illustrated in the bottom panel, and shows that 82% of the active domains, subdomains and sub-folders will see an increase in their Page Authority (these are the dots above the gray line). It should be noted that these comparisons are based solely on changes in the model, and any additional links that these campaigns have acquired since the last index update will act to increase the scores (and conversely, any links that have been dropped will act to decrease scores).
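The summary numbers quoted above (average change, share of campaigns that increase, share that jump by 20 points or more) are simple to compute once you have old and new scores side by side. Here is a small Python sketch; the function name and the five score pairs are made up for illustration and are not taken from our campaign data.

```python
from statistics import mean


def summarize_changes(old_scores, new_scores):
    """Summarize per-campaign score changes between two model versions."""
    deltas = [new - old for old, new in zip(old_scores, new_scores)]
    return {
        "mean_change": mean(deltas),
        # Campaigns whose score went up (dots above the diagonal in a scatter plot).
        "share_increased": sum(d > 0 for d in deltas) / len(deltas),
        "share_up_20_or_more": sum(d >= 20 for d in deltas) / len(deltas),
    }


# Hypothetical old and new Page Authority values for five campaigns.
old_pa = [30, 42, 55, 61, 48]
new_pa = [35, 41, 60, 83, 50]
summary = summarize_changes(old_pa, new_pa)
print(summary)
```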

The remainder of this post provides more detail about these metrics. To sum up this first part, the models underlying the Page and Domain Authority metrics will be updated with the next Linkscape index update. This will improve their ability to predict search position, due in part to the new model's better ability to separate pages based on their link profiles. Page Authority will remain the single highest correlated metric with search position that we have seen.

The rest of the post provides a deeper look at these models, and a lot of what follows is quite technical. Fortunately, none of this information is needed to actually use these Authority scores (just as understanding the details of Google's search algorithm is not necessary to use it). However, if you are curious about some of the details then read on.

The previous discussion has centered around distributions of Page Authority across a set of pages. To gain a better understanding of the models' characteristics, we need to explore their behavior on the inputs. However, the inputs form a 41-dimensional space and it's impossible (for me at least!) to visualize anything in 41 dimensions. As an alternative, we can attempt to reduce the dimensionality to something more manageable. The intuition here is that pages that have a lot of links probably have a lot of external links, followed links, a high mozRank, etc. Domains that have a lot of linking root domains probably have a lot of linking IPs, linking subdomains, a high domain mozRank, etc. One approach we could take is simply to select a subset of metrics (like the table in the "pumpkin recipes" SERP above) and examine those. However, this throws away the information from the other metrics and will inherently be noisier than something that uses all of them. Principal Component Analysis (PCA) is an alternate approach that uses all of the data. Before diving into the PCA decomposition of the data, I'll take a step back and explain what PCA is with an example.

Principal Component Analysis is a technique that reduces dimensionality by projecting the data onto Principal Components (PC) that explain most of the variability in the original data. This figure illustrates PCA on a small two dimensional data set:

This sample data looks roughly like an ellipse. PCA computes two principal components illustrated by the red lines and labeled in the graph that roughly align with the axes of the ellipse. One representation of the data is the familiar (x, y) coordinates. A second, equivalent representation is the projection of this data onto the principal components illustrated by the labeled points. Take the upper point (7.5, 6). Given these two values, it's hard to determine where it is in the ellipse. However, if we project it onto the PCs we get (4.5, 1.2) which tells us that it is far to the right of the center along the main axis (the 4.5 value) and a little up along the second axis (the 1.2 value).
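If you'd like to try the two dimensional case yourself, here is a minimal Python sketch using only the standard library. The data is a synthetic ellipse-shaped cloud that I generate for illustration (its center, tilt and spread are assumptions), so projecting the point (7.5, 6) here will not reproduce the (4.5, 1.2) values from the figure.

```python
import math
import random


def pca_2d(points):
    """Return (center, pc1, pc2, var1, var2) for a list of (x, y) points."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - cx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - cy) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - cx) * (y - cy) for x, y in points) / (n - 1)
    # Closed-form eigen decomposition of a symmetric 2x2 matrix:
    # theta is the angle of the axis with the largest variance.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pc1 = (math.cos(theta), math.sin(theta))
    pc2 = (-math.sin(theta), math.cos(theta))
    half_trace = (sxx + syy) / 2
    radius = math.hypot((sxx - syy) / 2, sxy)
    return (cx, cy), pc1, pc2, half_trace + radius, half_trace - radius


def project(point, center, pc1, pc2):
    """Coordinates of a point along the two principal components."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (dx * pc1[0] + dy * pc1[1], dx * pc2[0] + dy * pc2[1])


# Synthetic data (an assumption): a cloud centered near (5, 4),
# stretched along an axis tilted 30 degrees from horizontal.
rng = random.Random(0)
angle = math.radians(30)
points = []
for _ in range(500):
    a, b = rng.gauss(0, 3.0), rng.gauss(0, 0.7)
    points.append((5 + a * math.cos(angle) - b * math.sin(angle),
                   4 + a * math.sin(angle) + b * math.cos(angle)))

center, pc1, pc2, var1, var2 = pca_2d(points)
explained_by_pc1 = var1 / (var1 + var2)  # share of variance along the main axis
coords = project((7.5, 6.0), center, pc1, pc2)
print(coords, explained_by_pc1)
```

The two numbers in `coords` are the same point described relative to the axes of the ellipse: adding `coords[0]` times the first PC and `coords[1]` times the second PC to the center gives back the original (x, y).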

We can do the same thing with the link metrics, only instead of using two inputs we use all 41 inputs. After doing so, something remarkable happens:

Two principal components naturally emerge that collectively explain 88% of the covariance in the original data! Put another way, almost all of the data lies in some sort of strange ellipse in our 41-dimensional space. Moreover, these PCs have a very natural link to our intuition. The first PC, which I'll call the Domain/Subdomain PC, projects strongly onto the domain and subdomain related metrics (upper panel, blue and red lines), and has a very small projection onto the page metrics (upper panel, green lines). The second PC has the opposite property and projects strongly onto page related metrics with a small projection onto Domain/Subdomain metrics.

Don't worry if you didn't follow all of that technical mumbo jumbo in the last few paragraphs. Here's the key point: instead of talking about number of links, followed external links to domains, linking root domains, etc. we can instead talk about just two things - an aggregate domain/subdomain link metric and an aggregate page link metric and recover most of the information in the original 41 metrics.
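To make the "two aggregate metrics" idea concrete, here is a toy Python sketch. It is not our 41-metric data: I generate six synthetic metrics, three driven by a hidden "domain" factor and three by a hidden "page" factor (all names and numbers are assumptions), and then recover the top two principal components with power iteration, one simple standard-library way to compute them.

```python
import random


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenpair(matrix, iterations=500):
    """Power iteration: largest eigenvalue and unit eigenvector of a symmetric matrix."""
    d = len(matrix)
    vec = [1.0 / d ** 0.5] * d
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = sum(v * v for v in nxt) ** 0.5
        vec = [v / norm for v in nxt]
    value = sum(vec[i] * matrix[i][j] * vec[j] for i in range(d) for j in range(d))
    return value, vec


def deflate(matrix, value, vec):
    """Remove a found component so the next power iteration finds the next PC."""
    d = len(matrix)
    return [[matrix[i][j] - value * vec[i] * vec[j] for j in range(d)]
            for i in range(d)]


# Synthetic metrics (an assumption): columns 0-2 follow a hidden domain factor,
# columns 3-5 follow a hidden page factor, each with a little independent noise.
rng = random.Random(1)
rows = []
for _ in range(1000):
    domain_factor = rng.gauss(0, 3.0)
    page_factor = rng.gauss(0, 2.0)
    rows.append([domain_factor + rng.gauss(0, 0.5) for _ in range(3)] +
                [page_factor + rng.gauss(0, 0.5) for _ in range(3)])

cov = covariance_matrix(rows)
total_variance = sum(cov[i][i] for i in range(len(cov)))
value1, pc1 = top_eigenpair(cov)
value2, pc2 = top_eigenpair(deflate(cov, value1, pc1))
explained = (value1 + value2) / total_variance
print(round(explained, 3))
print([round(w, 2) for w in pc1])
print([round(w, 2) for w in pc2])
```

In this toy setup the first PC loads almost entirely on the three domain-style columns and the second on the three page-style columns, which is the same qualitative pattern described above for the real link metrics.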

Armed with this new knowledge, we can revisit the 10K SERP data and analyze it with these aggregate metrics.

This chart shows the joint distribution of the 10K SERP data projected onto these PCs, along with the marginal distribution of each on the top and right hand side. At the bottom left side of the chart are pages with low values for each PC, signifying that the page doesn't have many links and is on a domain without many links. There aren't many of these in the SERP data since they are unlikely to rank in search results. In the upper right are heavily linked-to pages on heavily linked-to domains, the most popular pages on the internet. Again, there aren't many of these pages in the SERP data because there aren't many of them on the internet (e.g. twitter.com, google.com, etc.). Interestingly, most of the SERP data falls into one of two distinct clusters. By examining the following figure we can identify these clusters:

This chart shows the average folder depth of each search result, where folder depth is defined as the number of slashes (/) after the home page (with 1 defined to be the home page). By comparing with the previous chart, we can identify the two distinct clusters as home pages and pages deep on heavily linked to domains.
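Here is one literal reading of that folder depth definition as a short Python sketch. The function name is mine, and the handling of edge cases (a missing trailing slash on the home page, pages sitting directly under the root, trailing slashes on folders) is an assumption, since the definition above doesn't spell them out.

```python
from urllib.parse import urlsplit


def folder_depth(url):
    """Count slashes in the URL path, treating the home page as depth 1."""
    # An empty path ("http://example.com") is treated the same as "/".
    path = urlsplit(url).path or "/"
    return path.count("/")


print(folder_depth("http://example.com"))
print(folder_depth("http://example.com/"))
print(folder_depth("http://example.com/recipes/pumpkin-pie"))
print(folder_depth("http://example.com/a/b/c/"))
```

Under this reading a page directly under the root, such as /about, has the same depth as the home page, and a trailing slash adds one. Only the path is examined, so slashes inside a query string are not counted.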

To circle back to search position, we can plot the average search position:

We see a general trend toward higher search position as the aggregate page and domain metrics increase. This data set only collected the top 30 results for each keyword, so values of average search position greater than 16 are in the bottom half of our data set. Finally, we can visually confirm that our Page and Domain Authority models capture this behavior and gain further insight into the new vs old model differences:

This is a dense figure, but here are the most important pieces. First, Page Authority captures the overall behavior seen in the Average Search position plot, with higher scores for pages that rank higher and lower scores for pages that rank lower (top left). Second, comparing the old vs new models, we see that the new model predicts higher scores for the most heavily linked to pages and lower scores for the least heavily linked to pages, consistent with our previous observation that the new model does a better job discriminating among pages.
