Hackfest 2023 - Back to the Future

Web Scraping Unleashed: Mastering Techniques for Data Harvesting
2023-10-14, 10:00–10:20, Workshops & Speed

This talk offers a concise introduction to web scraping techniques using Python, focusing on automated data extraction from websites. Web scraping enables the systematic collection of web data for various purposes, including content aggregation, research, job hunting, social media analysis, and monitoring legal and compliance issues. It is also a valuable tool for preserving government data, as evidenced during Donald Trump's presidency when various government website data, such as climate change information and LGBTQ+ resources, were altered or removed. This overview equips attendees with a versatile toolkit for extracting valuable web data.

This talk presents a brief introduction to various techniques for scraping websites using the Python programming language. Web scraping refers to the automated process of extracting data from websites using software tools or scripts. Web scraping allows you to gather data from multiple sources on the internet and collect it in a structured format for analysis, research, and other purposes.

There are numerous legitimate uses for scraping websites, notably content aggregation, research and data analysis, job searching, social media analysis, and legal/compliance monitoring. Backing up government data can also be helpful. Before being indicted four times (so far), Donald Trump was the 45th president of the United States. During his administration, there were several instances where government data was removed from government websites or altered. Some examples of altered or removed data include climate change and environmental data, healthcare enrollment information, animal welfare records, and LGBTQ+ rights and resources.

We will provide a brief introduction to three different approaches to website scraping. The simplest involves sending HTTP requests and using a library like Beautiful Soup to parse the responses. This doesn't always work, though, since some sites use client-side JavaScript to fetch their content from the server after the page loads, so the data is missing from the initial HTML. There are a couple of ways to deal with this. Browsers can be automated with tools like Selenium, and the rendered page can then be parsed with a Python library. However, the rendered page may not be easy to parse, so the final approach presented will show how to intercept and emulate XHR requests. This can potentially yield more data than the page displays.
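
As a rough illustration of the first and third approaches, here is a minimal sketch rather than the code shown in the talk. To stay runnable without extra installs it swaps in the standard library's html.parser where the talk uses Beautiful Soup, and urllib where a requests-style library would normally go. The sample markup, the JSON payload, the "items" key, the User-Agent string, and every function and class name below are assumptions made up for the example. Browser automation with Selenium is left out because it needs a browser and driver installed.

```python
import json
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Hypothetical page fragment and XHR response body, invented for this sketch.
SAMPLE_HTML = (
    '<ul><li><a href="/jobs/1">Data Engineer</a></li>'
    '<li><a href="/jobs/2">Security <b>Analyst</b></a></li></ul>'
)
SAMPLE_JSON = '{"items": [{"id": 1, "title": "Data Engineer", "salary": 90000}]}'


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the link currently being read, if any
        self._text = []    # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch(url, accept="text/html"):
    """Download a URL and return its body as text (not called in this sketch)."""
    request = Request(
        url,
        headers={"User-Agent": "example-scraper/0.1", "Accept": accept},
    )
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


def parse_links(html):
    """Approach 1: parse static HTML and pull out the links."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def parse_xhr_payload(body):
    """Approach 3: decode the JSON that an XHR endpoint returns.

    Assumes the payload keeps its records under an "items" key.
    """
    return json.loads(body)["items"]


if __name__ == "__main__":
    print(parse_links(SAMPLE_HTML))
    print(parse_xhr_payload(SAMPLE_JSON))
```

Against a live site, the text passed to parse_links would come from fetch(page_url), and the text passed to parse_xhr_payload from fetch(endpoint_url, accept="application/json"), where the endpoint URL is whatever the browser's network inspector shows the page requesting. In the sample payload the salary field stands in for data that the endpoint returns but the rendered page never displays.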

Are you releasing a tool? – No