
Wasteback Machine


What is Wasteback Machine?

Wasteback Machine is a JavaScript library for analysing archived web pages, measuring their size and composition to support retrospective, quantitative web research.

Features

  • Archive-agnostic access: Works with web archives that use the Memento Protocol and expose the unmodified archived page via the id_ endpoint.
  • Page composition analysis: Analyses the full structure of an archived page, including HTML, stylesheets, scripts, images, fonts, and more.
  • Resource inventory: Produces an optional structured list of all discovered resources with their URLs, types, and byte sizes.
  • Byte-accurate measurement: Precisely measures the size of each resource, cleans stylesheets and scripts to remove archive-injected content, and excludes any resources that are not part of the original page.
  • Completeness scoring: Calculates how completely an archived page and its resources were retrieved.

Supported Web Archives

| Web Archive | Organisation | Web Archive ID ⭐️ |
| --- | --- | --- |
| Arquivo.pt 🇵🇹 | FCCN/FCT | arq |
| Australia Web Archive (Trove) 🇦🇺 | National Library of Australia | awa |
| Webarchiv 🇨🇿 | National Library of the Czech Republic | cz |
| Government of Canada Web Archive 🇨🇦 | Library and Archives Canada | gcwa |
| Wayback Machine 🇺🇸 | Internet Archive | ia |
| Icelandic Web Archive (Vefsafn.is) 🇮🇸 | National and University Library of Iceland | iwa |
| Library of Congress Web Archive 🇺🇸 | Library of Congress | loc |
| National Library of Ireland Web Archive 🇮🇪 | National Library of Ireland | nliwa |
| New Zealand Web Archive 🇳🇿 | National Library of New Zealand | nzwa |
| PRONI Web Archive 🇬🇧 | The Public Record Office of Northern Ireland | pwa |
| Spletni Arhiv 🇸🇮 | National and University Library of Slovenia | slo |
| UK Government Web Archive (UKGWA) 🇬🇧 | The National Archives | ukgwa |
| UK Web Archive (Offline) 🇬🇧 | British Library | ukwa |

⭐️ This ID is used to select the web archive you want to query.

Adding a New Web Archive

If you maintain a web archive not currently supported, please contact us at overbrowsing@ed.ac.uk.

Installation

Using NPM

To install the Wasteback Machine as a dependency for your projects using NPM:

npm i @overbrowsing/wasteback-machine

Using Yarn

To install the Wasteback Machine as a dependency for your projects using Yarn:

yarn add @overbrowsing/wasteback-machine

Usage

The Wasteback Machine provides two primary functions:

  1. Fetch available memento-datetimes within a specific web archive for a given URL and time range.
  2. Analyse a specific memento from a specific web archive to measure its page size and composition.

1. Fetch Available Memento-datetimes

Get all memento-datetimes for https://nytimes.com between 1996 and 2025 from the Wayback Machine (ia):

import { getMementos } from "@overbrowsing/wasteback-machine";

const mementos = await getMementos(
  "ia", // Web archive ID (ia = Wayback Machine)
  "https://nytimes.com", // Target URL
  1996, // Start year
  2025 // End year
);

console.log(mementos);

Example Output

[
  '19961112181513',
  '19961112181513',
  '19961112181513',
  '19961219002950'...
]
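
The returned values are 14-digit strings in the form YYYYMMDDhhmmss, and the example output above contains repeated entries. As a minimal sketch (not part of the library), the helpers below remove duplicates and convert each string to a JavaScript Date. The names `parseMementoDatetime` and `uniqueMementos` are hypothetical, and the code assumes the timestamps are in UTC, as is the convention for Wayback Machine captures; other archives may differ.

```javascript
// Hypothetical helper: convert a 14-digit memento datetime
// (YYYYMMDDhhmmss, assumed to be UTC) into a Date object.
function parseMementoDatetime(stamp) {
  const match = /^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/.exec(stamp);
  if (!match) {
    throw new Error(`Not a 14-digit memento datetime: ${stamp}`);
  }
  const [year, month, day, hour, minute, second] = match.slice(1).map(Number);
  // Date.UTC expects a zero-based month.
  return new Date(Date.UTC(year, month - 1, day, hour, minute, second));
}

// Hypothetical helper: drop repeated entries while keeping the original order.
function uniqueMementos(stamps) {
  return [...new Set(stamps)];
}

// Sample values taken from the example output above.
const sample = ["19961112181513", "19961112181513", "19961219002950"];
const parsed = uniqueMementos(sample).map((s) => parseMementoDatetime(s).toISOString());
console.log(parsed);
// [ '1996-11-12T18:15:13.000Z', '1996-12-19T00:29:50.000Z' ]
```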

2. Analyse a Specific Memento

Analyse the memento of https://nytimes.com captured on November 12, 1996 in the Wayback Machine (ia):

import { getMementoSizes } from "@overbrowsing/wasteback-machine";

const mementoData = await getMementoSizes(
  "ia", // Web Archive ID (ia = Wayback Machine)
  "https://nytimes.com", // Target URL
  "19961112181513", // Memento datetime
  { includeResources: true } // Resource list (true/false)
);

console.log(mementoData);

Example Output

{
  url: 'https://nytimes.com',
  requestedMemento: '19961112181513',
  memento: '19961112181513',
  mementoUrl: 'https://web.archive.org/web/19961112181513if_/https://nytimes.com',
  archive: 'Wayback Machine',
  archiveOrg: 'Internet Archive',
  archiveUrl: 'https://web.archive.org',
  sizes: {
    html: { bytes: 1653, count: 1 },
    stylesheet: { bytes: 0, count: 0 },
    script: { bytes: 0, count: 0 },
    image: { bytes: 46226, count: 2 },
    video: { bytes: 0, count: 0 },
    audio: { bytes: 0, count: 0 },
    font: { bytes: 0, count: 0 },
    flash: { bytes: 0, count: 0 },
    plugin: { bytes: 0, count: 0 },
    data: { bytes: 0, count: 0 },
    document: { bytes: 0, count: 0 },
    other: { bytes: 0, count: 0 },
    total: { bytes: 47879, count: 3 }
  },
  completeness: '100%',
  resources: [
    {
      url: 'https://web.archive.org/web/19961112181513im_/http://www.nytimes.com/index.gif',
      type: 'image',
      size: 45259
    },
    {
      url: 'https://web.archive.org/web/19961112181513im_/http://www.nytimes.com/free-images/marker.gif',
      type: 'image',
      size: 967
    }
  ]
}
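
The `sizes` object can be post-processed to see how much each resource type contributes to the page. The sketch below is an illustration only: `compositionShares` is a hypothetical helper, not a library function, and the input is a trimmed copy of the example output above.

```javascript
// Trimmed copy of the `sizes` object from the example output above.
const sizes = {
  html: { bytes: 1653, count: 1 },
  script: { bytes: 0, count: 0 },
  image: { bytes: 46226, count: 2 },
  total: { bytes: 47879, count: 3 },
};

// Hypothetical helper: percentage of total bytes per resource type,
// skipping the `total` entry and any type with zero bytes.
function compositionShares(sizes) {
  const totalBytes = sizes.total.bytes;
  const shares = {};
  for (const [type, { bytes }] of Object.entries(sizes)) {
    if (type === "total" || bytes === 0) continue;
    shares[type] = (bytes / totalBytes) * 100;
  }
  return shares;
}

const shares = compositionShares(sizes);
for (const [type, percent] of Object.entries(shares)) {
  console.log(`${type}: ${percent.toFixed(1)}%`);
}
// html: 3.5%
// image: 96.5%
```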

Wasteback Machine CLI

The Wasteback Machine CLI lets you easily query web archives, fetch mementos for a given URL and date, and see page size, composition, and estimated emissions using CO2.js.

Quick Start

To start the Wasteback Machine CLI using NPM:

npm run cli

CLI Prompts

1. Enter web archive ID ('help' to list archives or [Enter ↵] = Wayback Machine):
2. Enter URL to analyse:
3. Enter target year (YYYY):
4. Enter target month (MM or [Enter ↵] = 01):
5. Enter target day (DD or [Enter ↵] = 01):
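
The month and day prompts fall back to 01 when left empty. As a rough sketch of that defaulting behaviour (this is not the CLI's actual code, and `buildTargetDate` is a hypothetical name), the answers could be combined into a YYYYMMDD string like this:

```javascript
// Hypothetical helper: combine the prompt answers into a YYYYMMDD string.
// An empty month or day falls back to "01", mirroring the prompts above.
function buildTargetDate(year, month = "", day = "") {
  if (!/^\d{4}$/.test(year)) {
    throw new Error("Year must be four digits (YYYY)");
  }
  const mm = (month.trim() || "01").padStart(2, "0");
  const dd = (day.trim() || "01").padStart(2, "0");

  // Reject impossible dates (e.g. 31 February) by checking that the
  // calendar does not roll the values over into another month.
  const candidate = new Date(Date.UTC(Number(year), Number(mm) - 1, Number(dd)));
  if (candidate.getUTCMonth() !== Number(mm) - 1 || candidate.getUTCDate() !== Number(dd)) {
    throw new Error(`Invalid date: ${year}-${mm}-${dd}`);
  }
  return `${year}${mm}${dd}`;
}

console.log(buildTargetDate("1996", "11", "12")); // 19961112
console.log(buildTargetDate("1996")); // 19960101
```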

Example Output

________________________________________________________

MEMENTO INFO

  Memento URL:    https://web.archive.org/web/19961112181513if_/https://nytimes.com
  Web Archive:    Wayback Machine
  Organisation:   Internet Archive
  Website:        https://web.archive.org

________________________________________________________

PAGE SIZE

  Data:           46.76 KB
  Emissions:      0.014 g CO₂e
  Completeness:   100%

________________________________________________________

PAGE COMPOSITION

  HTML
      Count:      1
      Data:       1653 bytes (3.5%)
      Emissions:  0.000 g CO₂e

  IMAGE
      Count:      2
      Data:       46226 bytes (96.5%)
      Emissions:  0.013 g CO₂e

________________________________________________________
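
The Data figure under PAGE SIZE is consistent with the byte total from the library output (47879 bytes) divided by 1024. The snippet below reproduces that conversion; `formatKilobytes` is a hypothetical helper, and the 1024-byte kilobyte is an inference from the example figures rather than documented behaviour.

```javascript
// Hypothetical helper: format a byte count as kilobytes with two decimals,
// assuming 1 KB = 1024 bytes (inferred from the example output).
function formatKilobytes(bytes) {
  return `${(bytes / 1024).toFixed(2)} KB`;
}

console.log(formatKilobytes(47879)); // 46.76 KB
```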

Methodology

For details of the underlying methodology, assumptions, and limitations, please refer to our paper (DOI: 10.1371/journal.pclm.0000767).

Wasteback Machine was developed as part of doctoral research at The University of Edinburgh’s Institute for Design Informatics.

Disclaimer

Wasteback Machine is provided for informational and research purposes only. The authors make no guarantees about the accuracy of the results and disclaim any liability for their use. Use of Wasteback Machine is subject to the terms of service of each respective web archive.

Contributing

Contributions are welcome! Please submit an issue or a pull request.

License

The Wasteback Machine is licensed under Apache 2.0. For full licensing details, see the LICENSE file.
