Alexa Technology


How and Why We Crawl the Web

Alexa is continually crawling all publicly available web sites to create a series of snapshots of the Web. We use the data we collect to create features and services:

  • Site Information: Traffic rankings, pictures of sites, links pointing to sites and more
  • Related Links: Sites that are similar to the one you are currently viewing

Alexa has been crawling the Web since early 1996, and we have continually increased the amount of information that we gather. We are currently gathering approximately 1.6 terabytes (1,600 gigabytes) of Web content per day. Each snapshot of the Web takes approximately two months to complete and contains 4.5 billion pages from over 16 million sites.
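
The figures above imply some rough per-snapshot numbers. The sketch below works through that arithmetic. It assumes a snapshot lasts exactly 60 days, that the daily rate holds for the whole snapshot, and that all of the gathered content belongs to the snapshot; none of these assumptions is stated in the text, so treat the results as order-of-magnitude estimates only.

```python
# Back-of-the-envelope estimates derived from the figures quoted in the text.
TB_PER_DAY = 1.6             # from the text
DAYS_PER_SNAPSHOT = 60       # assumption: "approximately two months"
PAGES_PER_SNAPSHOT = 4.5e9   # from the text
SITES_PER_SNAPSHOT = 16e6    # from the text ("over 16 million")


def snapshot_estimates(tb_per_day=TB_PER_DAY,
                       days=DAYS_PER_SNAPSHOT,
                       pages=PAGES_PER_SNAPSHOT,
                       sites=SITES_PER_SNAPSHOT):
    """Return rough totals and averages for one snapshot of the Web."""
    total_tb = tb_per_day * days
    # The text equates 1.6 terabytes with 1,600 gigabytes, so decimal units are used.
    total_bytes = total_tb * 1e12
    return {
        "total_tb": total_tb,
        "kb_per_page": total_bytes / pages / 1e3,
        "pages_per_site": pages / sites,
    }


if __name__ == "__main__":
    for name, value in snapshot_estimates().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions a snapshot comes to roughly 96 terabytes, or about 21 kilobytes per page and about 280 pages per site on average.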

To programmatically access Alexa's vast information about the Web, please visit Alexa Web Information Service. To keep Alexa from crawling your site, please visit this page.
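
Crawler exclusion is conventionally expressed in a site's robots.txt file. The sketch below uses Python's standard urllib.robotparser module to check whether a given robots.txt would turn a crawler away. It assumes the Alexa crawler identifies itself with the user-agent token ia_archiver; that token is not stated in this text, so rely on the page linked above for the authoritative instructions.

```python
from urllib.robotparser import RobotFileParser

# Assumption: "ia_archiver" is the user-agent token of the Alexa crawler.
ALEXA_AGENT = "ia_archiver"

# Example robots.txt that excludes that one crawler from the whole site.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_blocked(robots_txt: str, url: str, agent: str = ALEXA_AGENT) -> bool:
    """Return True if robots_txt forbids `agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(is_blocked(ROBOTS_TXT, "http://example.com/page.html"))
```

Other crawlers are unaffected by this file because the rule names a single user agent.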

Gathering Web Usage Information

In addition to the Alexa Crawl, which tells us what is on the Web, Alexa uses Web usage information, which tells us what is being seen on the Web. This information comes from the community of Alexa Toolbar users. Each member of the community, in addition to getting a useful tool, is giving back. Simply by using the toolbar, each member contributes valuable information about the Web: how it is used, what is important and what is not. This information is returned to the community in the form of improved Related Links, Traffic Rankings and more.

Finding Patterns in Data

The Alexa services are derived from our uniquely powerful combination of Web content and usage information.

  • Site Stats

    Alexa gathers Site Stats from a variety of sources to provide key statistics about each site on the Web. Traffic Rank and Speed are derived from Web usage information, while Other Sites That Link to This Site and Online Since come from Web content. For an example of Site Stats, see the Alexa Overview page for Schwab.com.

  • Contact Info

    Alexa provides contact information for Web sites by mining the Web content gathered in the crawl. This information includes Site Owner, Address, Phone Number and contact e-mail address. See Contact Info for Schwab.com.

  • Traffic Details

    Web usage information is used to report the number of page views and the number of users that Web sites receive. This data is also the basis for the Alexa traffic rank and traffic history graphs. See Traffic Details for Schwab.com.

    Our goal for these features is to help people navigate the Web more efficiently by giving them all the information they need to make informed decisions about the sites they visit.

  • Related Links

    Whenever an Alexa Toolbar user visits a web page, the Alexa Toolbar retrieves information from the Alexa servers to suggest other pages that might be of interest to the user. To generate Related Links, we use several techniques, including:

    • The usage paths of the collective Alexa community - this is the most important source of our information, since these paths show us which Web sites our users believe are important and interesting.
    • Clustering - the hundreds of millions of links on the Web can be used to find clusters of sites that are similar and relevant to one another. We mine this data by using custom databases to find and identify these clusters.
    • Users' suggestions - we consider our users' suggestions to augment our Related Links recommendations.
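
To illustrate how usage paths might feed Related Links, the sketch below counts how often two sites appear in the same browsing session and ranks candidates by that count. This simple co-visitation count is an illustrative stand-in and is not a description of Alexa's actual method; the function names and the sample sessions are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_covisits(sessions):
    """Count how often each pair of sites appears in the same session."""
    covisits = defaultdict(Counter)
    for session in sessions:
        # Each site counts once per session, so repeat visits do not inflate scores.
        for a, b in combinations(sorted(set(session)), 2):
            covisits[a][b] += 1
            covisits[b][a] += 1
    return covisits


def related_links(covisits, site, top_n=3):
    """Return up to top_n sites most often co-visited with `site`."""
    candidates = covisits.get(site, Counter())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


# Hypothetical sessions; the site names are invented for illustration.
SESSIONS = [
    ["broker-a.example", "broker-b.example", "news.example"],
    ["broker-a.example", "broker-b.example"],
    ["broker-a.example", "news.example", "weather.example"],
    ["weather.example", "news.example"],
]

if __name__ == "__main__":
    table = build_covisits(SESSIONS)
    print(related_links(table, "broker-a.example", top_n=2))
```

A production system would also need to weight by link clusters and user suggestions, as the list above describes; this sketch covers only the usage-path signal.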

The Alexa Toolbar

The Alexa Toolbar is a program written by Alexa Internet that users install in their browser. Every time the user changes pages, the Alexa Toolbar communicates with Alexa servers to retrieve information, which is then displayed in the toolbar.

Donation of the Information to the Internet Archive

As a service to future historians, scholars, and other interested parties, Alexa Internet donates a copy of each crawl of the Web to the Internet Archive, a 501(c)(3) nonprofit organization committed to the long-term preservation and maintenance of a growing collection of data about the Web. At Alexa, we believe that saving and preserving our early digital heritage is important today and essential for future generations. We also believe that a public charity is the best kind of organization for preserving this global asset. More information about accessing archived materials is available at the Internet Archive, www.archive.org.



