GSA Future Focus: Web Scraping
Updates from GSA’s Emerging Technology office.
Post filed in: Policy
This is the first in a monthly series of posts from GSA’s Emerging Technology Office. We’ll explore emerging technologies and possible policy stances for the US Government, and we’ll seek input from other agencies and the public. The thoughts and opinions expressed in these posts are not official federal guidance.
What’s “Web Scraping?”
Web scraping was invented in the 1990s and is the primary mechanism that search engines, such as Google and Bing, use to find and organize content online. Programs that perform web scraping are variously referred to as bots, crawlers and spiders.
Web scraping, often referred to as screen scraping, web-data extraction, web harvesting, or web crawling, is a process of extracting often unstructured data and textual information from web pages into a form that can easily be used for analysis. This process uses website architecture to identify the data to be collected, or “scraped,” from websites. There are even forms of web scraping that use machine learning and artificial intelligence. Federal agencies are adopting web scraping tools to reduce the need for humans to perform repetitive tasks, which results in both cost and time savings.
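To make the idea of using website architecture concrete, here is a minimal sketch in Python. It relies only on the standard library's html.parser module and pulls the cells out of an HTML table. The sample page, the class name TableCellExtractor, and the helper extract_rows are made up for this illustration; real pages vary widely, and many teams use a dedicated scraping framework instead.

```python
from html.parser import HTMLParser


class TableCellExtractor(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableCellExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical page content, standing in for a downloaded web page.
SAMPLE_PAGE = """
<html><body>
<table>
  <tr><td>Widget</td><td>4.50</td></tr>
  <tr><td>Gadget</td><td>12.00</td></tr>
</table>
</body></html>
"""

rows = extract_rows(SAMPLE_PAGE)
```

The page's tags (its architecture) tell the program where each value starts and stops, so the result is structured rows that are ready for analysis.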
Responsible Web Scraping
The programs agencies use for web scraping can overwhelm websites, depending on the frequency and method of scraping. To avoid this, we’ve compiled some guidance and best practices for scraping data:
- Make sure your scraping bot is transparent about who you are and why you’re scraping data.
- If possible, give the owners of targeted websites more than one way to work with you: a channel for submitting data directly in a structured format, which is more efficient than scraping, and a way to request that scraping not be performed.
- Use a modern scraping framework to minimize the impact on the targeted website.
- Consider scraping data during off-peak hours to avoid overloading servers and adversely impacting services and performance of websites.
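The practices above could be sketched in Python roughly as follows, using only the standard library. This is an illustration under stated assumptions, not an official implementation: the bot name, contact addresses, off-peak window, and the names build_request, is_off_peak, and Throttle are all hypothetical, and the right off-peak hours depend on the target website's audience and time zone.

```python
import time
from datetime import datetime
from urllib.request import Request

# Hypothetical identity string: says who runs the bot and where to learn more.
USER_AGENT = (
    "ExampleAgencyBot/1.0 "
    "(+https://agency.example/bot-info; bot-contact@agency.example)"
)


def build_request(url):
    """Build a request that identifies the bot through the User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def is_off_peak(now, start_hour=1, end_hour=5):
    """Return True if `now` falls in the assumed off-peak window.

    The 01:00-05:00 default is a placeholder; `now` should be expressed in
    the target site's local time.
    """
    return start_hour <= now.hour < end_hour


class Throttle:
    """Enforce a minimum pause between consecutive requests."""

    def __init__(self, min_interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval_seconds
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        if self._last is not None:
            remaining = self._min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

A scraper would call throttle.wait() before each urllib.request.urlopen(build_request(url)) call and skip the run when is_off_peak(datetime.now()) is False. A modern scraping framework typically provides equivalent settings for identification and request pacing.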
Our Advice to Agencies
GSA’s Emerging Technology office recommends that all civilian federal agencies adopt the following standards and best practices for web scraping public-facing, non-federal data. These help reduce risk to the government and help data owners understand ways to avoid having their data scraped.
Federal agencies may scrape public-facing data from non-government sources, subject to the following limitations:
- Follow the Robots Exclusion Protocol (robots.txt) for all web scraping activities. A robots.txt file lets website owners identify which sections of a website may or may not be scraped.
- If a login or account is required to gain access to data, federal agencies must first review the terms of service and ensure that they do not prohibit scraping data from the site. If scraping is permitted, agencies must accept and comply with the terms of service.
- Follow applicable federal guidance for sensitive information if scraped data inadvertently contains information that requires enhanced protection (e.g., privacy data, personally identifiable information (PII), or protected health information). Consult your agency’s privacy officer for handling requirements or impacts to the agency Privacy Impact Assessment.
- If scraped data, when aggregated with other sources, results in sensitive data, please follow applicable guidance from your agency’s privacy officer, as in the case above.
- Be aware of and adhere to copyright law when scraping data. For instance:
  - The Fair Use doctrine under the Copyright Act provides limited authorization for use of copyrighted work(s) for teaching, scholarship, research and other purposes.
  - The Digital Millennium Copyright Act (DMCA) provides further regulations that agencies must comply with.
  - While facts are not copyrightable, specific website designs or “creative selections” may be protected from web scraping per section 201 of the Digital Millennium Copyright Act.
  - If the website content contains copyright controls, such as Digital Rights Management, the DMCA prohibits circumvention of these copyright access control measures. Fair use of this copyrighted material is not a defense to circumvention of protections.
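As an illustration of the robots.txt recommendation, the sketch below uses Python's standard urllib.robotparser module to decide whether a URL may be scraped. The robots.txt contents, the bot name, the example.com URLs, and the helper may_scrape are assumptions made for this example; a real scraper would load the file from the target site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt published by a website owner.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

# Hypothetical bot name; it should match the name sent in the User-Agent header.
BOT_NAME = "ExampleAgencyBot"


def load_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def may_scrape(rules, url):
    """Return True only if the rules allow this bot to fetch the URL."""
    return rules.can_fetch(BOT_NAME, url)


rules = load_rules(SAMPLE_ROBOTS_TXT)
allowed = may_scrape(rules, "https://example.com/public/data.html")
blocked = may_scrape(rules, "https://example.com/private/report.html")
```

Against a live site, the same parser can fetch the file itself by calling set_url() with the site's robots.txt address and then read(). Checking every URL before requesting it keeps the scraper inside the sections the website owner has opened to bots.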
GSA’s Emerging Technology office was created to help make proactive federal policy. We explore how new technologies impact federal agencies and recommend ways the federal government can adapt to them.
What do you think about our recommendations? Let us know by using the hashtag #FutureFocus on social media or email us at firstname.lastname@example.org!