
Web archiving

History and Development of Web Archiving

- The Internet Archive, founded in 1996, was one of the first large-scale web archiving projects.
- In 2001 the Internet Archive released the Wayback Machine, a service for looking up and browsing archived copies of web pages.
- As of 2018, the Internet Archive stored 40 petabytes of data.
- Other early web archiving projects include the National Library of Canada's initiative, Australia's Pandora, Tasmania's web archive, and Sweden's Kulturarw3.
- The International Web Archiving Workshop (IWAW) provided a platform for sharing experiences and ideas from 2001 to 2010.

Methods of Collection for Web Archiving

- Web archivists collect many types of web content, including HTML pages, style sheets, JavaScript, images, and video.
- Metadata about each collected resource, such as access time and content length, is archived alongside it to establish authenticity and provenance.
- Web crawlers such as Heritrix, HTTrack, and Wget are commonly used to automate collection.
- Services such as the Wayback Machine and WebCite offer on-demand web archiving built on web crawling techniques.
- Database archiving extracts the content of database-driven websites into a standard schema, allowing multiple databases to be accessed through a single system.

Remote Harvesting for Web Archiving

- Web crawlers are commonly used for remote harvesting of web content.
- A crawler accesses web pages much as a user with a browser does, following links from page to page.
- Heritrix, HTTrack, and Wget are examples of crawlers used for web archiving.
- The Wayback Machine and WebCite are free services that archive web resources using crawling techniques.
- Remote harvesting provides a simple, general method for collecting web content.

Database Archiving for Web Archiving

- Database archiving extracts database content into a standard schema, often expressed in XML.
- Tools such as DeepArc and Xinq support the archiving and online delivery of database content.
- DeepArc maps the structure of a relational database onto an XML schema and exports the content as an XML document.
- Xinq then provides basic querying and retrieval over the archived content.
- Although the original layout and behavior of the site are not preserved, database archiving allows multiple databases to be accessed through a single system.

Transactional Archiving for Web Archiving

- Transactional archiving captures the actual transactions that take place between web servers and web browsers.
- It is used to preserve evidence of what content was actually viewed, for legal or regulatory compliance.
- A transactional archiving system intercepts HTTP requests and responses, filters out duplicate content, and stores the responses as bitstreams.
- This makes it possible to show what content was served on a particular website on a given date.
- Transactional archiving matters to organizations that must disclose and retain records of the information they publish.
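The crawler-based remote harvesting described above can be sketched in miniature. This is a hedged Python illustration of the general idea (breadth-first link following), not how Heritrix, HTTrack, or Wget actually work internally; the `fetch` parameter is a caller-supplied function standing in for an HTTP GET that, in a real harvester, would also write each response to the archive.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, as a crawler would."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(base_url, html):
    """Return absolute URLs for every link found in one fetched page."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]

def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl from seed_url; fetch(url) returns HTML text.
    Returns the list of URLs harvested, in crawl order."""
    seen, queue, archived = {seed_url}, deque([seed_url]), []
    while queue and len(archived) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        archived.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return archived
```

The `seen` set keeps the crawler from revisiting pages that link back to each other, which is why it can terminate on a site whose pages form cycles.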
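The metadata bullet under "Methods of Collection" (access time, content length, authenticity, provenance) can be made concrete with a small sketch. The field names below are illustrative, not a formal record format such as WARC; the content digest is my addition, included because a digest is one common way to let a later reader verify stored bytes are unchanged.

```python
import hashlib
from datetime import datetime, timezone

def capture_record(url, body, headers):
    """Build a provenance record for one harvested resource.

    body is the raw response bytes; headers is a dict of HTTP
    response headers.  Field names here are illustrative only.
    """
    return {
        "url": url,
        "access_time": datetime.now(timezone.utc).isoformat(),
        "content_length": len(body),
        "content_type": headers.get("Content-Type", "unknown"),
        # Digest of the stored bytes, so authenticity can be checked later.
        "sha256": hashlib.sha256(body).hexdigest(),
    }
```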
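The relational-to-XML mapping described for database archiving can be sketched as follows. This is a toy in the spirit of DeepArc's export step, not DeepArc's actual schema or API; the element names (`table`, `row`, `field`) are my own.

```python
import sqlite3
import xml.etree.ElementTree as ET

def export_table_to_xml(conn, table):
    """Serialise one relational table as an XML document, so the
    archived content can later be queried without the original DBMS.
    Element names here are illustrative."""
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    root = ET.Element("table", name=table)
    for row in cur:
        row_el = ET.SubElement(root, "row")
        for col, value in zip(columns, row):
            ET.SubElement(row_el, "field", name=col).text = str(value)
    return ET.tostring(root, encoding="unicode")
```

Because every exported table shares the same generic schema, a single retrieval tool (the role Xinq plays) can query archives drawn from many different source databases.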
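The transactional-archiving pipeline (intercept responses, filter duplicates, store bitstreams) can be sketched with a content-addressed store. The interface below is hypothetical; a real transactional archiver sits in the web server's request path as a filter or proxy rather than being called explicitly.

```python
import hashlib

class TransactionArchive:
    """Toy transactional archive: each response bitstream is stored
    once, keyed by content digest, and an access log records which
    URL served it and when."""
    def __init__(self):
        self.blobs = {}   # digest -> response bytes (stored once)
        self.log = []     # (url, timestamp, digest) access records

    def record(self, url, timestamp, response_bytes):
        digest = hashlib.sha256(response_bytes).hexdigest()
        # Duplicate responses are filtered: identical bitstreams
        # share one stored copy, but every transaction is logged.
        self.blobs.setdefault(digest, response_bytes)
        self.log.append((url, timestamp, digest))
        return digest

    def content_seen_at(self, url, timestamp):
        """Return the exact bytes served for a URL at a given time,
        which is the evidentiary claim transactional archiving makes."""
        for logged_url, logged_time, digest in self.log:
            if logged_url == url and logged_time == timestamp:
                return self.blobs[digest]
        return None
```

Separating the log from the blob store is what keeps storage bounded: a page served unchanged a million times costs one bitstream plus a million small log entries.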