Glossary Term
Web archiving
History and Development of Web Archiving
- The Internet Archive, founded in 1996, was one of the first large-scale web archiving projects.
- The Internet Archive released the Wayback Machine in 2001, a service for browsing archived versions of web pages.
- As of 2018, the Internet Archive stored 40 petabytes of data.
- Other web archiving projects launched around the same time include the National Library of Canada's archive, Australia's PANDORA, Tasmania's web archives, and Sweden's Kulturarw3.
- The International Web Archiving Workshop (IWAW) provided a platform for sharing experiences and ideas from 2001 to 2010.
Methods of Collection for Web Archiving
- Web archivists collect various types of web content, including HTML web pages, style sheets, JavaScript, images, and video.
- Metadata about the collected resources, such as access time and content length, is archived to establish authenticity and provenance.
- Web crawlers, such as Heritrix, HTTrack, and Wget, are commonly used to automate the collection process.
- Services like the Wayback Machine and WebCite offer on-demand web archiving through web crawling techniques.
- Database archiving involves extracting the content of database-driven websites into a standard schema, allowing multiple databases to be accessed using a single system.
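To make the metadata point concrete, the sketch below builds an archival record for a fetched resource, capturing access time, content length, and a digest that can later establish authenticity. It is a minimal illustration, not a standard record format; the function and field names are assumptions.

```python
import hashlib
from datetime import datetime, timezone

def capture_record(url, body, headers):
    """Build an archival metadata record for a fetched resource.
    (Illustrative sketch; the field names are assumptions, not a standard.)"""
    return {
        "url": url,
        "access_time": datetime.now(timezone.utc).isoformat(),
        "content_length": len(body),
        "content_type": headers.get("Content-Type", "unknown"),
        # A digest helps establish authenticity and provenance: the
        # archived copy can later be verified against this fingerprint.
        "sha256": hashlib.sha256(body).hexdigest(),
    }

record = capture_record(
    "https://example.org/",
    b"<html><body>Hello</body></html>",
    {"Content-Type": "text/html"},
)
```

Production systems record such metadata in standardized container formats (the WARC format is the common choice) rather than ad hoc dictionaries.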
Remote Harvesting for Web Archiving
- Web crawlers are commonly used for remote harvesting of web content.
- Web crawlers access web pages much as a user's browser does, following links from page to page.
- Examples of web crawlers used for web archiving include Heritrix, HTTrack, and Wget.
- The Wayback Machine and WebCite are free services that use web crawling techniques to archive web resources.
- Remote harvesting provides a simple method for collecting web content.
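The crawling behavior described above can be sketched as a small breadth-first harvester. This is a toy illustration of the idea behind tools like Heritrix or Wget, not their implementation; the `fetch` callable is injected (a hypothetical stand-in for an HTTP request) so the sketch works offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags, as a crawler would."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, fetch, max_pages=10):
    """Breadth-first harvest starting from seed_url.
    `fetch(url) -> html` is injected; a real harvester would issue
    HTTP requests and persist responses (e.g. into WARC files)."""
    seen, queue, archive = {seed_url}, deque([seed_url]), {}
    host = urlparse(seed_url).netloc
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        archive[url] = html  # store the page as harvested
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            # Stay within the seed site, as scoped crawls typically do.
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return archive

# Usage against a tiny in-memory "site":
pages = {
    "http://site.test/": '<a href="/about">About</a>',
    "http://site.test/about": "<p>About page</p>",
}
archived = crawl("http://site.test/", lambda u: pages.get(u, ""))
```

Real crawlers add politeness delays, robots.txt handling, and deduplication on top of this basic queue-and-follow loop.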
Database Archiving for Web Archiving
- Database archiving involves extracting database content into a standard schema, often using XML.
- Tools like DeepArc and Xinq enable the archiving and online delivery of database content.
- DeepArc maps a relational database to an XML schema, exporting the content into an XML document.
- Xinq allows basic querying and retrieval functionality for the archived content.
- While the original layout and behavior may not be preserved, database archiving facilitates access to multiple databases through a single system.
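The core mapping that tools like DeepArc perform can be illustrated in miniature: walk a relational table and serialize each row as XML elements. This is a simplified sketch of the idea only; DeepArc's actual schema mapping is more elaborate, and the element names here are assumptions.

```python
import sqlite3
import xml.etree.ElementTree as ET

def export_table_to_xml(conn, table):
    """Serialize one relational table into a flat XML document.
    (Sketch of the database-archiving idea; `table` is trusted here,
    so no SQL-injection handling is shown.)"""
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [desc[0] for desc in cursor.description]
    root = ET.Element("table", name=table)
    for row in cursor:
        row_el = ET.SubElement(root, "row")
        for col, value in zip(columns, row):
            # Each column becomes a child element named after the column.
            field = ET.SubElement(row_el, col)
            field.text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")

# Usage with an in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER, title TEXT)")
conn.execute("INSERT INTO articles VALUES (1, 'Web archiving')")
xml_doc = export_table_to_xml(conn, "articles")
```

Once many databases are flattened into a common XML schema like this, a single query tool (the role Xinq plays) can search across all of them.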
Transactional Archiving for Web Archiving
- Transactional archiving collects actual transactions between web servers and web browsers.
- It is used to preserve evidence of viewed content for legal or regulatory compliance.
- A transactional archiving system intercepts HTTP requests and responses, filters out duplicates, and stores the responses as bitstreams.
- It captures the content viewed on a particular website on a given date.
- Transactional archiving is important for organizations that need to disclose and retain information.