Web Scraping

Scrape the web to retrieve your data

Python
Web scraping
Python offers us multiple battle-tested solutions to tackle this problem properly.
Author

Alexandro Disla

Published

August 19, 2023

Modified

June 29, 2024

Web scraping: Extracting Data from Websites

Web scraping refers to techniques for collecting or extracting data from websites. It lets you programmatically obtain large amounts of data that would be difficult or time-consuming to gather manually. It is also a great way to put your data-analysis skills to work, as we will see below.

Why Web Scraping is Useful

Web scraping has many practical applications and uses. Here are some examples:

  • Price monitoring - Track prices and price changes for products on e-commerce websites. This can help find deals or monitor trends.
  • Lead generation - Gather contact information like emails and phone numbers from directories or listings. This is useful for sales and marketing.
  • Research - Collect data from websites to perform analyses or conduct studies. Examples include gathering product reviews, compiling real estate listings, or analyzing social media trends.
  • Content aggregation - Build databases or summaries by scraping news sites, blogs, classifieds, and other sources to create curated content sites.
  • SEO monitoring - Check rankings and keyword positions for a site on search engines like Google. Helps optimize search marketing efforts.

How Web Scraping Works

Web scrapers access webpages programmatically and extract the desired information. Here is the general process:

  1. Find the URL of the page to scrape.

  2. Send an HTTP request to download the page content.

  3. Parse through the HTML content to identify relevant data. Common approaches include:

    • Pattern matching - Search for strings or regex patterns that identify data.

    • DOM parsing - Traverse the DOM (Document Object Model) tree to locate elements.

    • XPath queries - Write expressions to navigate through HTML structure and find data.

  4. Extract and store the data, often in a database or spreadsheet.

  5. Repeat the process across many pages to gather larger data sets.

Web scraping can be done through scripting languages like Python, libraries like BeautifulSoup, browser automation tools like Selenium, or fully integrated scraping solutions.
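The steps above can be sketched in a few lines of Python. To keep the sketch self-contained, the page content here is a hard-coded HTML snippet (the tag names and values are made up), so only the parsing and extraction steps are exercised:

```python
from bs4 import BeautifulSoup

# Step 2 normally downloads the page; here we use a hard-coded
# snippet so the sketch runs offline.
html = """
<html><body>
  <div class="product"><span class="name">Widget</span><span class="price">9.99</span></div>
  <div class="product"><span class="name">Gadget</span><span class="price">19.99</span></div>
</body></html>
"""

# Step 3: parse the HTML and locate the relevant elements.
soup = BeautifulSoup(html, "html.parser")
rows = []
for product in soup.select(".product"):
    rows.append({
        "name": product.select_one(".name").getText(),
        "price": float(product.select_one(".price").getText()),
    })

# Step 4: `rows` can now be stored in a database or spreadsheet.
print(rows)
```

The same pattern scales to step 5 by looping the download-and-parse cycle over many URLs.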

Key Considerations

There are a few key factors to keep in mind when web scraping:

  • Avoid overloading websites with too many rapid requests, which can be seen as denial of service attacks. Add delays and throttles.

  • Check websites’ terms of use and robots.txt files to understand if they allow scraping. Some sites prohibit it.

  • Use caches, proxies, and rotation to distribute requests and avoid getting IP addresses blocked.

  • In some cases, explicitly identifying as a scraper through a user agent string can help avoid blocks.

  • Make sure to follow relevant laws and regulations regarding data collection and usage.
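As a minimal sketch of the throttling advice, here is a hypothetical polite_fetch helper that spaces requests out over time; the function name and the default delay are our own choices, not a standard API:

```python
import time
from typing import Callable, Dict, Iterable

def polite_fetch(urls: Iterable[str],
                 fetch: Callable[[str], str],
                 delay: float = 2.0) -> Dict[str, str]:
    """Fetch each URL through `fetch`, pausing `delay` seconds between requests."""
    pages: Dict[str, str] = {}
    for i, url in enumerate(urls):
        if i:  # no delay needed before the very first request
            time.sleep(delay)
        pages[url] = fetch(url)
    return pages

# With a real scraper, `fetch` could wrap requests.get and identify
# the bot through a user-agent header, for example:
#   fetch = lambda u: get(u, headers={"User-Agent": "my-bot/0.1"}).text
```

Injecting the fetch function keeps the throttling logic testable without touching the network.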

In summary, web scraping is a versatile technique to automate the extraction of data from websites for various purposes. When done properly, it is an extremely useful tool for data collection and analysis.

Analyzing Stack Overflow Question Data

Stack Overflow is one of the largest online communities for software developers to ask and answer programming questions. The site contains a wealth of data that can be analyzed to uncover interesting insights.

In this article, we will scrape a sample of recent Stack Overflow questions using Python and BeautifulSoup. We will then load the data into a Pandas DataFrame to analyze question statistics like views, answers, votes, etc.

Scraping Stack Overflow Questions with BS4

We can use the BeautifulSoup library in Python to parse the HTML of the Stack Overflow homepage and extract the question data.

  1. The first step, of course, is to load the libraries that give us the required objects.
from requests import get
from bs4 import BeautifulSoup, ResultSet
from pandas import DataFrame
from typing import List, Dict, Any
import plotly.express as px
  2. Next, we use the get function from requests to obtain the response object. The BeautifulSoup class, combined with a parser, turns the raw content into a navigable HTML tree. The key call is the select method, which picks every tag carrying the given CSS classes; with that, all the questions land in our variable. We then need to process this block further to get the data into a friendly Python dictionary.
url = "https://stackoverflow.com/questions"
response = get(url)

# Check if the request was successful
if response.status_code != 200:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
    exit()

# the html.parser allows us to parse the content
soup = BeautifulSoup(response.content, "html.parser")


# Just in case you need the html file
""" with open("stackoverflow_data.html",'wb') as file:
    file.write(
        soup.find('div',id="mainbar").find('div',class_="flush-left").prettify("utf-8")
    ) """

questions = soup.select('.s-post-summary.js-post-summary')
  3. This handler turns the questions into a dictionary. For each question, the loop selects the title, user, votes, answers, and views.
def questions_handler(questions: ResultSet) -> Dict[str, List[Any]]:
    data: Dict[str, List[Any]] = {
        'title': [],
        'user': [],
        'vote_count': [],
        'answer_count': [],
        'view_count': []
    }
    for question in questions:
        title = question.select_one('.s-link')
        user = question.select_one('.s-user-card--link a')
        vote_count = question.select_one('.s-post-summary--stats-item__emphasized')
        answer_count = question.select_one('.s-post-summary--stats-item:nth-child(2)')
        view_count = question.select_one('.s-post-summary--stats-item:nth-child(3)')
        # Skip the question entirely if any field is missing, so the
        # five lists stay the same length (DataFrame requires that).
        if None in (title, user, vote_count, answer_count, view_count):
            continue
        data['title'].append(title.getText())
        data['user'].append(user.getText())
        data['vote_count'].append(vote_count.getText())
        data['answer_count'].append(answer_count.getText())
        data['view_count'].append(view_count.getText())
    return data
  4. Now we bring in the DataFrame class, which will turn the Python dict into a basic dataframe so we can manipulate the data at will. This leads to the data-cleansing phase. The function also detects when the scraping step came back empty (for instance, after a Stack Overflow redesign breaks the selectors); the user sees a DataFrame either way.
def build_dataframe(questions_set: ResultSet):
    data = questions_handler(questions=questions_set)
    if not data["title"]:
        return DataFrame({
            "message": ["The Stackoverflow page has been changed"],
            "action": ["correct the scraper"]
        })
    return DataFrame({
        "titles": data["title"],
        "users": data["user"],
        "vote_counts": data["vote_count"],
        "answer_counts": data["answer_count"],
        "view_counts": data["view_count"],
    })
  5. We build the dataframe with the handler.
questions_df = build_dataframe(questions)
questions_df.head(1)
titles users vote_counts answer_counts view_counts
0 add attrs to a relationship model field on dja... Reginaldo Uset Padron \n0\nvotes\n \n0\nanswers\n \n2\nviews\n

EDA

  1. Remove the newline characters that are messing up the count strings.
questions_df['vote_counts'] = questions_df['vote_counts'].str.replace('\n', ' ')
questions_df['answer_counts'] = questions_df['answer_counts'].str.replace('\n', ' ')
questions_df['view_counts'] = questions_df['view_counts'].str.replace('\n', ' ')

questions_df.head(1)
titles users vote_counts answer_counts view_counts
0 add attrs to a relationship model field on dja... Reginaldo Uset Padron 0 votes 0 answers 2 views
  2. Extract the numeric views, answers, and votes, and rename the columns.
questions_df['vote_counts'] = questions_df['vote_counts'].str.extract(r'(\d+)').astype(int)
questions_df['answer_counts'] = questions_df['answer_counts'].str.extract(r'(\d+)').astype(int)
questions_df['view_counts'] = questions_df['view_counts'].str.extract(r'(\d+)').astype(int)
questions_df.rename(columns={
    "vote_counts": "votes",
    "answer_counts": "answers",
    "view_counts": "views"
},
inplace=True
)
questions_df.head(5)
titles users votes answers views
0 add attrs to a relationship model field on dja... Reginaldo Uset Padron 0 0 2
1 Kube-OVN configuration on OpenStack Vladouse 0 0 2
2 Design suggestion for a reporting component eshwar 0 0 2
3 Angular - unable to unsubscribe from valueChan... apex2022 0 0 6
4 Load filter from path with deepar_flutter Ivan David 0 0 4
  3. From here we can save the data to Excel, produce plots, or build pivot tables.
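As a sketch of the saving step: the tiny DataFrame below stands in for questions_df (its values are illustrative), and the Excel call is left commented out because it needs the optional openpyxl dependency:

```python
from io import StringIO
from pandas import DataFrame

# A tiny DataFrame standing in for questions_df (values are illustrative).
df = DataFrame({"titles": ["example question"], "votes": [3], "answers": [1], "views": [42]})

# Saving to Excel requires the optional openpyxl package:
# df.to_excel("questions.xlsx", index=False)

# CSV works out of the box:
buffer = StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue().splitlines()[0])  # header row
```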

Barplots

The views by Users:

px.bar(questions_df, x="views", y="users", text_auto=True)

The answers by Users:

px.bar(questions_df, x="answers", y="users", text_auto=True)

The votes by Users:

px.bar(questions_df, x="votes", y="users", text_auto=True)

Pivot Table

questions_df.pivot_table(
    values=["views","answers","votes"],
    index = 'users',
    aggfunc='sum',
    fill_value=0,
    margins=True,
    margins_name="Total"
)
answers views votes
users
AKD2022 1 32 3
Abdulrahman Youssef 0 6 0
Alessandra 0 3 0
Amirhussein 0 2 0
Anand Kannan 0 7 1
BDav25 0 5 0
Branden Jones 1 5 0
Christine Brydges 0 3 0
CrazyCodingCranberry 0 13 2
DARSIN 0 4 0
Damian Kowalski 0 5 0
Danilo 0 9 1
David Thielen 0 3 0
Dosado 0 6 0
Electrious_46 1 9 2
G. Bittencourt 0 7 0
Hamza Bilal 0 5 0
Husamuldeen Abdlrahman 0 6 0
Inder Singh 0 3 0
Ishan Popatia 0 3 0
Ivan David 0 4 0
Josh.K 0 5 0
Kachi 0 13 1
Learn on hard way 0 5 0
Louis.vgn 0 4 1
Netanel 0 5 0
PilotPatel1 0 5 0
Piotr Sygnatowicz 0 5 0
Reginaldo Uset Padron 0 2 0
Roelof Berkepeis 0 5 0
Ronaldo 0 22 3
Sertac TULLUK 0 8 0
SorryBaby 0 6 0
The_Matrix 0 5 0
Tyler Seppala 0 3 0
Vladouse 0 2 0
Yannik Brüggemann 1 9 0
Yehia Gamal 0 8 2
Yurizinhokkjkjk 0 10 4
ZorTik 0 8 0
ansh87 0 9 0
apex2022 0 6 0
devblock 0 4 0
eshwar 0 2 0
negativeattitude 0 7 1
neksodebe 0 6 0
petesing 0 5 0
sapo_cosmico 1 4 0
user1998863 0 15 2
wforl 0 6 2
Total 5 334 25