HTTP Archive

The HTTP Archive is an open-source project that tracks how the World Wide Web is built. Operating as a permanent repository of web performance data, it periodically crawls millions of URLs and records detailed metrics about page composition, load times, accessibility, and the adoption of web technologies. The project is maintained by a core group of developers and is part of the Internet Archive, a 501(c)(3) non-profit organization.[1]

History

The HTTP Archive was founded in October 2010 by web performance pioneer Steve Souders.[2] Inspired by Brewster Kahle's creation of the Internet Archive—which focuses on preserving the content of the Web—Souders recognized a missing piece in web preservation: the need to maintain a historical record of how digitized content is constructed, coded, and served to users.

Methodology and data collection

The HTTP Archive collects data by running tests on infrastructure based on WebPageTest. The system periodically loads URLs in simulated browser sessions and records granular data on the resources each page loads. This includes the total size of web pages, failed requests, and the specific technologies used (such as JavaScript frameworks, image formats, server types, and web fonts).
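
The kind of per-page aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual pipeline: it assumes each page load is available as a HAR-style JSON document, uses a small hand-written sample rather than real crawl data, and treats any response with status 0 or a status of 400 and above as a "failed request", which is an assumption made for this example. The name `summarize_har` is hypothetical.

```python
import json
from collections import Counter

# Hand-written, HAR-style sample (hypothetical data, not from a real crawl).
SAMPLE_HAR = json.loads("""
{
  "log": {
    "entries": [
      {"request": {"url": "https://example.com/"},
       "response": {"status": 200, "bodySize": 12000,
                    "content": {"mimeType": "text/html"}}},
      {"request": {"url": "https://example.com/app.js"},
       "response": {"status": 200, "bodySize": 85000,
                    "content": {"mimeType": "application/javascript"}}},
      {"request": {"url": "https://example.com/hero.webp"},
       "response": {"status": 200, "bodySize": 40000,
                    "content": {"mimeType": "image/webp"}}},
      {"request": {"url": "https://example.com/missing.woff2"},
       "response": {"status": 404, "bodySize": 300,
                    "content": {"mimeType": "text/html"}}}
    ]
  }
}
""")


def summarize_har(har):
    """Return total body bytes, failed-request count, and bytes per MIME type."""
    total_bytes = 0
    failed_requests = 0
    bytes_by_type = Counter()

    for entry in har["log"]["entries"]:
        response = entry["response"]
        # HAR records -1 when a size is unknown; count that as zero bytes.
        size = max(response.get("bodySize", -1), 0)
        total_bytes += size

        # Assumption for this sketch: status 0 or >= 400 counts as a failure.
        status = response["status"]
        if status == 0 or status >= 400:
            failed_requests += 1

        mime = response.get("content", {}).get("mimeType", "unknown")
        bytes_by_type[mime] += size

    return {
        "total_bytes": total_bytes,
        "failed_requests": failed_requests,
        "bytes_by_type": dict(bytes_by_type),
    }


summary = summarize_har(SAMPLE_HAR)
print(summary["total_bytes"], summary["failed_requests"])  # prints: 137300 1
```

Grouping bytes by MIME type is one simple way to approximate the "page composition" breakdown; a real analysis would likely distinguish more carefully between transferred and decoded sizes.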

When the project launched, it tracked the performance metrics of approximately 17,000 top websites, deriving its list from sources such as Alexa Internet and the Fortune 500. Over the years, the scope of the archive has expanded significantly. Today, the project relies heavily on the Chrome User Experience Report (CrUX) to source its URL lists. CrUX is a dataset managed by Google that reflects real-world field data, including Core Web Vitals and actual user loading experiences. By drawing on it, the HTTP Archive analyzes millions of active URLs globally across both mobile and desktop platforms. This gives web developers, performance engineers, and researchers a dataset grounded in sites that real users visit, which they can use to spot emerging trends in web development.

Available data and technical details

The HTTP Archive has amassed a multi-petabyte dataset of historical web performance data dating back to 2010. Because of the scale of collecting metadata from millions of websites each month, the raw datasets are stored in Google BigQuery, where they are publicly queryable.

The archive stores detailed information about each page load, structured largely around the HAR (HTTP Archive) format, a JSON-based format for recording a browser's network activity. A specification for HAR was drafted within the W3C Web Performance Working Group but was never published as a standard. The data is segmented into specialized BigQuery tables to facilitate efficient research and reduce querying costs. Key tables and data types include:

  • Pages and Requests (HAR Tables): The core datasets containing HAR extracts for each page URL (`crawl.pages`) and individual resource requests (`crawl.requests`).
  • Blobs and Payloads (Lighthouse Evaluations): Beyond basic metrics, the archive stores detailed Lighthouse audit reports, page metadata, page summaries, and raw response payload bodies. By integrating Lighthouse, the dataset natively provides comprehensive scoring and evaluations for a page's performance, accessibility, SEO, and overall adherence to web best practices.
  • Blink Features: Tables (such as `blink_features.usage`) that track the detection, usage percentages, and sampling of specific Blink browser engine features across the web.
  • Custom Metrics: Data structures capturing specialized evaluations and metadata. This includes in-depth insights into privacy configurations (such as cookie usage, consent management, and tracker detection), the prevalence of advertising elements, web sustainability (carbon footprints), and the underlying technology stacks websites use.
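
As a rough illustration of how the tables listed above might be queried, the following Python sketch builds a SQL string for a single crawl. It is hedged throughout: the project name `httparchive` and the column names `date`, `client`, `page`, `rank`, and `is_root_page` are assumptions that should be checked against the published schema, and `build_pages_query` is a hypothetical helper, not part of any official tooling.

```python
import datetime


def build_pages_query(crawl_date, client="mobile", limit=10):
    """Build a SQL string that reads root pages from one crawl.

    Table and column names are assumptions made for this sketch;
    verify them against the project's schema documentation.
    """
    if not isinstance(crawl_date, datetime.date):
        raise TypeError("crawl_date must be a datetime.date")
    if client not in ("mobile", "desktop"):
        raise ValueError("client must be 'mobile' or 'desktop'")
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")

    # Restricting to a single date and client keeps the query focused
    # on one crawl instead of the full history.
    return (
        "SELECT page, rank\n"
        "FROM `httparchive.crawl.pages`\n"
        f"WHERE date = '{crawl_date.isoformat()}'\n"
        f"  AND client = '{client}'\n"
        "  AND is_root_page\n"
        f"LIMIT {limit}"
    )


query = build_pages_query(datetime.date(2024, 6, 1), client="desktop", limit=5)
print(query)
```

The resulting string could be pasted into the BigQuery console. Note that in BigQuery a `LIMIT` clause alone generally does not reduce the amount of data scanned; whether the `date` and `client` filters lower query cost depends on how the table is partitioned and clustered, which the project documentation should confirm.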


Researchers, scholars, and developers looking to conduct their own analysis, view database schemas, or explore interactive query examples can find comprehensive documentation and a "Getting Started" guide at the project's technical resource hub, har.fyi.[3]

Publications

The Web Almanac

Since 2019, the HTTP Archive has published the Web Almanac, an annual, comprehensive "State of the Web" report.[4] The Almanac is authored by dozens of industry experts and volunteers who analyze the HTTP Archive's datasets. The publication features chapters detailing year-over-year trends in various categories, including web performance, JavaScript, CSS, security, accessibility, sustainability, and SEO.

Organization and sponsorship

The HTTP Archive operates as an open-source initiative run by community contributors and a core maintenance team. It is a recognized project under the umbrella of the Internet Archive.

Because crawling and storing data for millions of websites is resource-intensive, the operational costs and infrastructure of the HTTP Archive are supported by a coalition of corporate sponsors from the technology and web performance sectors. Notable sponsors and partners have included Google, Mozilla, New Relic, O'Reilly Media, Fastly, Akamai, Catchpoint, and Etsy.

References

  1. "About the HTTP Archive". httparchive.org. Retrieved May 15, 2026.
  2. Souders, Steve (March 30, 2011). "Announcing the HTTP Archive". stevesouders.com. Retrieved May 15, 2026.
  3. "Getting started | har.fyi". har.fyi. Retrieved May 15, 2026.
  4. "The Web Almanac by HTTP Archive". almanac.httparchive.org. Retrieved May 15, 2026.