Glossary Term
Web scraping
History and Techniques of Web Scraping
- Web scraping dates back to the birth of the World Wide Web in 1989.
- The first web robot, World Wide Web Wanderer, was created in June 1993.
- In December 1993, the first crawler-based web search engine, JumpStation, was launched.
- In 2000, the first Web API and API crawler were created.
- Salesforce and eBay launched their own APIs in 2000, allowing programmatic access to public data.
- Web scraping is the process of automatically mining data from the World Wide Web.
- It shares a common goal with the semantic web vision, an initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence, and human-computer interaction.
- Techniques include human copy-and-paste, text pattern matching, HTTP programming, HTML parsing, and DOM parsing.
- Manually copying and pasting data from a web page is the simplest form of web scraping.
- Sometimes even the best web-scraping technology cannot replace a human's manual examination and copy-and-paste.
- When a website sets up barriers that prevent machine automation, human copy-and-paste may be the only workable solution.
- Text pattern matching is a simple yet powerful approach to extracting information from web pages.
- It can rely on the UNIX grep command or on the regular expression-matching facilities of programming languages such as Perl or Python.
- Patterns are matched against the page text to extract the desired data.
- In HTTP programming, both static and dynamic web pages are retrieved by posting HTTP requests to the remote web server.
- Socket programming can be used to send these requests and read back the page content.
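
The text pattern matching and HTTP programming techniques above can be sketched together in Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the helper names (build_get_request, fetch, split_response, extract_prices), the User-Agent string, and the price markup that the pattern targets are all invented for the example. It speaks plain HTTP/1.0 on port 80, so it will not work against sites that only accept HTTPS.

```python
import re
import socket


def build_get_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.0 GET request as raw bytes."""
    lines = [
        # HTTP/1.0 is used so the server should not reply with chunked encoding.
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "User-Agent: example-scraper/0.1",  # hypothetical agent string
        "Accept: text/html",
        "",  # blank line terminates the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(host: str, path: str = "/", port: int = 80, timeout: float = 10.0) -> str:
    """Send the request over a TCP socket and return the raw response text."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(host, path))
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection: response complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def split_response(raw: str) -> tuple[str, str]:
    """Split a raw HTTP response into (headers, body) at the first blank line."""
    head, _, body = raw.partition("\r\n\r\n")
    return head, body


# Assumed markup for illustration: <span class="price">$19.99</span>
PRICE_RE = re.compile(
    r'<span class="price">\s*\$?([0-9]+(?:\.[0-9]{2})?)\s*</span>'
)


def extract_prices(html: str) -> list[str]:
    """Return every price string matched by the pattern, in document order."""
    return PRICE_RE.findall(html)
```

A typical call chain would be split_response(fetch("example.com")) followed by extract_prices on the body. Regular expressions of this kind are brittle: a small change in the page markup can silently stop them from matching, which is one reason HTML and DOM parsing are listed as separate techniques.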
Software for Web Scraping
- Many software tools are available for web scraping.
- Some tools attempt to automatically recognize the data structure of a page.
- Some software provides a recording interface to avoid manual coding.
- Scripting functions can be used to extract and transform content.
- Database interfaces can store scraped data in local databases.
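
The extract, transform, and store steps listed above can be sketched with Python's standard library alone. This is a hedged illustration rather than how any particular scraping tool works: the LinkExtractor class, the transform, store, and scrape_to_db helpers, and the links table schema are all assumptions made for the example.

```python
import sqlite3
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs from <a> tags that carry an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def transform(links):
    """Normalize whitespace in link text and drop links with no text."""
    rows = []
    for href, text in links:
        title = " ".join(text.split())
        if title:
            rows.append((href, title))
    return rows


def store(rows, conn):
    """Write rows into a local SQLite table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links (href TEXT PRIMARY KEY, title TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO links (href, title) VALUES (?, ?)", rows
    )
    conn.commit()


def scrape_to_db(html, conn):
    """Extract links from html, clean them, store them; return the row count."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    rows = transform(parser.links)
    store(rows, conn)
    return len(rows)
```

Passing sqlite3.connect(":memory:") as conn keeps the data in memory for experimentation; a file path would persist it to a local database instead.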
Legal Issues of Web Scraping (United States)
- Web scraping legality varies worldwide.
- Some websites prohibit web scraping in their terms of service.
- The enforceability of such terms is unclear.
- In the United States, website owners can use three major legal claims to prevent undesired web scraping: copyright infringement, violation of the Computer Fraud and Abuse Act (CFAA), and trespass to chattels.
- Case law on web scraping is still evolving.
- Under United States copyright law, facts themselves are not protected, so duplicating facts is allowed.
- Users of scrapers may be held liable for trespass to chattels.
- American Airlines successfully obtained an injunction against FareChase for web scraping.
- Southwest Airlines also challenged screen-scraping practices.
Legal Issues of Web Scraping (European Union)
- A Danish court ruled on systematic crawling, indexing, and deep linking by a portal site and found that these practices did not conflict with Danish law.
- The case involved the crawling and indexing of a real estate site.
- The ruling highlights the legal issues surrounding web scraping in the EU.
- Laws regarding web scraping vary across EU member states.
- The European Court of Justice has also issued rulings that bear on web scraping.
Case Studies and Legality of Web Scraping
- Craigslist successfully sued 3Taps for violating the Computer Fraud and Abuse Act.
- QVC objected to Resultly's excessive crawling of its site.
- Facebook sued Power Ventures for scraping Facebook pages.
- A Danish court held that systematic crawling, indexing, and deep linking of a website do not conflict with Danish law or the database directive of the European Union.
- In the case of Ryanair Ltd v Billigfluege.de GmbH, Ireland's High Court ruled that Ryanair's click-wrap agreement was legally binding.
- The French Data Protection Authority (CNIL) released guidelines stating that publicly available data is still personal data and cannot be repurposed without consent.
- The Spam Act 2003 in Australia outlaws some forms of web harvesting, specifically related to email addresses.
- Indian courts have not expressly ruled on the legality of web scraping, but violating terms of use prohibiting data scraping would be a violation of contract law and the Information Technology Act, 2000.