Web scraping

History and Techniques of Web Scraping
- Web scraping dates back to the birth of the World Wide Web in 1989.
- The first web robot, World Wide Web Wanderer, was created in June 1993.
- In December 1993, the first crawler-based web search engine, JumpStation, was launched.
- In 2000, the first Web API and API crawler were created.
- Salesforce and eBay launched their own APIs in 2000, allowing programmatic access to public data.
- Web scraping is the process of automatically mining data from the World Wide Web.
- It shares a common goal with the semantic web vision, which still needs breakthroughs in text processing, semantic understanding, AI, and human-computer interaction.
- Techniques include human copy-and-paste, text pattern matching, HTTP programming, HTML parsing, and DOM parsing.
- Human copy-and-paste: manually copying and pasting data from a web page is the simplest form of web scraping. Sometimes even the best web-scraping technology cannot replace human intervention, and when a website sets up barriers that prevent machine automation, manual examination and copy-and-paste may be the only workable solution.
- Text pattern matching: the UNIX grep command or the regular expression-matching facilities of programming languages such as Perl or Python can extract information from web pages. Patterns are matched against the page text to retrieve the desired data, which makes this a simple yet powerful approach.
- HTTP programming: both static and dynamic web pages can be retrieved by posting HTTP requests to the remote web server, for example using socket programming.

Software for Web Scraping
- Many software tools are available for web scraping.
- Software can automatically recognize the data structure of a page.
- Some software provides a recording interface, so that scraping code does not have to be written by hand.
- Scripting functions can be used to extract and transform content.
- Database interfaces can store scraped data in local databases.

Legal Issues of Web Scraping (United States)
- The legality of web scraping varies worldwide.
- A website's terms of service may prohibit web scraping, but the enforceability of such terms is unclear.
- In the United States, website owners can use three legal claims to prevent web scraping: copyright infringement, violation of the Computer Fraud and Abuse Act (CFAA), and trespass to chattels.
- Case law on web scraping is still evolving.
- Copyright law allows duplication of facts.
- Users of scrapers may be held liable for trespass to chattels.
- American Airlines successfully obtained an injunction against FareChase for web scraping.
- Southwest Airlines also challenged screen-scraping practices with a legal claim.

Legal Issues of Web Scraping (European Union)
- A Danish court ruled on systematic crawling, indexing, and deep linking of a real estate site by a portal site, and found that these practices do not conflict with Danish law or the database directive of the European Union.
- The ruling highlights the legal issues surrounding web scraping in the EU.
- Laws regarding web scraping vary across EU member states.
- The European Court of Justice has made some rulings on web scraping.

Case Studies and Legality of Web Scraping
- Craigslist successfully sued 3Taps for violating the Computer Fraud and Abuse Act.
- QVC objected to Resultly's excessive crawling of its site.
- Facebook sued Power Ventures for scraping Facebook pages.
- In the case of Ryanair Ltd v Billigfluege.de GmbH, Ireland's High Court ruled that Ryanair's click-wrap agreement was legally binding.
- The French Data Protection Authority (CNIL) released guidelines stating that publicly available data is still personal data and cannot be repurposed without consent.
- The Spam Act 2003 in Australia outlaws some forms of web harvesting, specifically related to email addresses.
- Indian courts have not expressly ruled on the legality of web scraping, but violating terms of use prohibiting data scraping would be a violation of contract law and the Information Technology Act, 2000.
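
The text pattern matching technique listed under History and Techniques can be illustrated with Python's regular expression module. This is only a minimal sketch: the page fragment, the `price` class name, and the `extract_prices` helper are hypothetical and chosen for illustration, and a regular expression this specific will break as soon as the markup changes.

```python
import re

# Hypothetical page fragment; a real page would be retrieved first.
PAGE = """
<ul>
  <li class="item">Widget <span class="price">$19.99</span></li>
  <li class="item">Gadget <span class="price">$5.00</span></li>
</ul>
"""

# Capture the numeric part of every price span (assumed markup).
PRICE_RE = re.compile(r'<span class="price">\$(\d+\.\d{2})</span>')


def extract_prices(html: str) -> list[float]:
    """Return every price found in the page text, in document order."""
    return [float(match) for match in PRICE_RE.findall(html)]


if __name__ == "__main__":
    print(extract_prices(PAGE))
```

The same pattern could be applied from the command line with grep; the Python version is shown because it also converts the matched text into numbers.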
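
The HTTP programming technique listed under History and Techniques (sending HTTP requests to a remote web server through socket programming) might look roughly like the sketch below. The function names are hypothetical, and the sketch assumes plain HTTP on port 80 without TLS, redirects, or error handling; it uses HTTP/1.0 so that the server is not expected to reply with chunked transfer encoding.

```python
import socket


def build_get_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.0 GET request as raw bytes."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "Connection: close",
        "",  # blank line that ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def split_response(raw: bytes) -> tuple[int, bytes]:
    """Split a raw HTTP response into (status code, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line = head.split(b"\r\n", 1)[0]
    status = int(status_line.split()[1])
    return status, body


def fetch(host: str, path: str = "/", port: int = 80,
          timeout: float = 10.0) -> tuple[int, bytes]:
    """Send the request over a TCP socket and read until the server closes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return split_response(b"".join(chunks))
```

In practice a higher-level HTTP client is usually preferable, since it also handles HTTPS, redirects, and encodings; the raw socket version is shown only because it makes the request and response format visible.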
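
HTML parsing is named among the techniques under History and Techniques but not described further. As one possible illustration, the sketch below uses the `html.parser` module from Python's standard library to collect links from a page. The `LinkExtractor` class and `extract_links` helper are hypothetical names, and the sketch assumes that anchor tags are not nested.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[tuple[str, str]] = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html: str) -> list[tuple[str, str]]:
    """Parse the page and return its links in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Unlike a regular expression, the parser follows the tag structure, so text inside nested inline tags is still attributed to the enclosing link.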