History and Development of Web Archiving
– The Internet Archive, founded in 1996, was one of the first large-scale web archiving projects.
– The Internet Archive released the Wayback Machine in 2001, a service for looking up and viewing archived versions of web pages.
– As of 2018, the Internet Archive stored 40 petabytes of data.
– Other web archiving projects launched around the same time included the National Library of Canada’s project, Australia’s Pandora and Tasmanian web archives, and Sweden’s Kulturarw3.
– The International Web Archiving Workshop (IWAW) provided a platform for sharing experiences and ideas from 2001 to 2010.
Methods of Collection for Web Archiving
– Web archivists collect various types of web content, including HTML web pages, style sheets, JavaScript, images, and video.
– Metadata about the collected resources, such as access time and content length, is archived alongside them to help establish authenticity and provenance (a minimal capture sketch follows this list).
– Web crawlers, such as Heritrix, HTTrack, and Wget, are commonly used to automate the collection process.
– Services like the Wayback Machine and WebCite offer on-demand web archiving through web crawling techniques.
– Database archiving involves extracting the content of database-driven websites into a standard schema, allowing multiple databases to be accessed using a single system.
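As a rough illustration of the metadata point above, here is a minimal capture sketch that records access time, content length, and a content hash alongside the payload. It uses only the Python standard library; the field names and output file names are illustrative, and production crawlers typically write standardized WARC records instead.

```python
import hashlib
import json
from datetime import datetime, timezone
from urllib.request import urlopen

def capture(url):
    """Fetch one resource and record provenance metadata alongside the payload."""
    access_time = datetime.now(timezone.utc).isoformat()
    with urlopen(url) as response:
        body = response.read()
        content_type = response.headers.get("Content-Type")
    return body, {
        "url": url,
        "access_time": access_time,                   # when the capture happened
        "content_length": len(body),                  # bytes actually received
        "content_type": content_type,
        "sha256": hashlib.sha256(body).hexdigest(),   # fixity value for later verification
    }

if __name__ == "__main__":
    payload, metadata = capture("https://example.com/")
    with open("page.html", "wb") as f:
        f.write(payload)
    with open("page.meta.json", "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)
```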
Remote Harvesting for Web Archiving
– Web crawlers are commonly used for remote harvesting of web content.
– Web crawlers access web pages similarly to how users with browsers view the web.
– Examples of web crawlers used for web archiving include Heritrix, HTTrack, and Wget.
– The Wayback Machine and WebCite are free services that use web crawling techniques to archive web resources.
– Remote harvesting provides a simple method for collecting web content; a minimal crawl loop is sketched below.
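Production crawlers such as Heritrix add politeness delays, robots.txt handling, scoping rules, deduplication, and WARC output. The sketch below shows only the core fetch-parse-follow loop of remote harvesting, assuming a reachable seed URL and using the Python standard library.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect href targets from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def harvest(seed, limit=20):
    """Breadth-first crawl from a seed URL, staying on the seed's host."""
    host = urlparse(seed).netloc
    queue, seen, pages = deque([seed]), {seed}, {}
    while queue and len(pages) < limit:
        url = queue.popleft()
        try:
            with urlopen(url) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip unreachable resources and keep crawling
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

if __name__ == "__main__":
    archive = harvest("https://example.com/")
    print(f"captured {len(archive)} pages")
```

Restricting the crawl to the seed's host keeps the sketch bounded; real harvesters manage scope, revisit policy, and rate limiting explicitly.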
Database Archiving for Web Archiving
– Database archiving involves extracting database content into a standard schema, often using XML.
– Tools like DeepArc and Xinq enable the archiving and online delivery of database content.
– DeepArc maps a relational database to an XML schema, exporting the content into an XML document.
– Xinq provides basic querying and retrieval functionality for the archived content.
– While the original layout and behavior of the site may not be preserved, database archiving allows multiple databases to be accessed through a single system; a rough export sketch follows this list.
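The following is not DeepArc itself, only a rough sketch of the underlying idea: flattening relational content into an XML document that can later be queried independently of the original site. It uses SQLite from the Python standard library, and the database, table, and file names are placeholders.

```python
import sqlite3
import xml.etree.ElementTree as ET

def export_table_to_xml(db_path, table, out_path):
    """Dump every row of one table into a flat XML document.

    A tool in the spirit of DeepArc would drive this from an explicit
    mapping between the site's relational schema and a target XML schema;
    here a blanket SELECT * stands in for that mapping.
    """
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    root = ET.Element("archive", {"table": table})
    for row in conn.execute(f"SELECT * FROM {table}"):  # table name is a placeholder
        record = ET.SubElement(root, "record")
        for column in row.keys():
            field = ET.SubElement(record, column)        # assumes column names are valid XML tags
            field.text = "" if row[column] is None else str(row[column])
    conn.close()
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    # "catalog.db" and "articles" are illustrative names, not part of any real tool.
    export_table_to_xml("catalog.db", "articles", "articles.xml")
```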
Transactional Archiving for Web Archiving
– Transactional archiving collects actual transactions between web servers and web browsers.
– It is used to preserve evidence of viewed content for legal or regulatory compliance.
– A transactional archiving system intercepts HTTP requests and responses, filters out duplicate content, and stores the responses as bitstreams (a minimal interception sketch follows this list).
– It captures the content viewed on a particular website on a given date.
– Transactional archiving is important for organizations that need to disclose and retain information.
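As a hypothetical illustration of the interception step, the sketch below wraps a Python WSGI application: it hashes each response body actually served, stores content it has not seen before, and logs every transaction with a timestamp. The class name and storage scheme are invented for this example; real transactional archiving products persist captures durably and record request details as well.

```python
import hashlib
import time

class TransactionalArchiver:
    """WSGI middleware sketch: record every response actually sent to a client.

    Identical content for the same path is stored once; the per-transaction
    log still records what was served and when.
    """

    def __init__(self, app):
        self.app = app
        self.store = {}   # (path, digest) -> response body
        self.log = []     # (timestamp, path, digest), one entry per transaction

    def __call__(self, environ, start_response):
        result = self.app(environ, start_response)
        try:
            body = b"".join(result)   # buffer the response (a sketch; breaks streaming)
        finally:
            if hasattr(result, "close"):
                result.close()
        path = environ.get("PATH_INFO", "")
        digest = hashlib.sha256(body).hexdigest()
        if (path, digest) not in self.store:   # filter out duplicate content
            self.store[(path, digest)] = body
        self.log.append((time.time(), path, digest))
        return [body]
```

Wrapping an existing application, for example application = TransactionalArchiver(my_app), is enough to start recording; my_app here stands for any WSGI callable.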
Web archiving is the process of collecting portions of the World Wide Web to ensure the information is preserved in an archive for future researchers, historians, and the public. Because of the Web's massive size and the volume of information it holds, web archivists typically employ web crawlers for automated capture. The largest web archiving effort based on a bulk crawling approach is the Internet Archive's Wayback Machine, which strives to maintain an archive of the entire Web.
The growing portion of human culture created and recorded on the Web makes it inevitable that more and more libraries and archives will have to face the challenges of web archiving. National libraries, national archives and various consortia of organizations are also involved in archiving culturally important Web content.
Commercial web archiving software and services are also available to organizations that need to archive their own web content for corporate heritage, regulatory, or legal purposes.