Web Scraping

Scrape the web to retrieve your data

Python
Web scraping
Python offers us multiple battle-tested solutions to tackle this problem properly.
Author

Alexandro Disla

Published

August 19, 2023

Modified

June 29, 2024

Web scraping: Extracting Data from Websites

Web scraping refers to techniques for collecting or extracting data from websites. It lets you programmatically obtain large amounts of data that would be difficult or time-consuming to gather manually. It is also a great way to put your data-analysis skills to work, as we will see below.

Why Web Scraping is Useful

Web scraping has many practical applications and uses. Here are some examples:

  • Price monitoring - Track prices and price changes for products on e-commerce websites. This can help find deals or monitor trends.
  • Lead generation - Gather contact information like emails and phone numbers from directories or listings. This is useful for sales and marketing.
  • Research - Collect data from websites to perform analyses or conduct studies. Examples include gathering product reviews, compiling real estate listings, or analyzing social media trends.
  • Content aggregation - Build databases or summaries by scraping news sites, blogs, classifieds, and other sources to create curated content sites.
  • SEO monitoring - Check rankings and keyword positions for a site on search engines like Google. Helps optimize search marketing efforts.

How Web Scraping Works

Web scrapers access webpages programmatically and extract the desired information. Here is the general process:

  1. Find the URL of the page to scrape.

  2. Send an HTTP request to download the page content.

  3. Parse through the HTML content to identify relevant data. Common approaches include:

    • Pattern matching - Search for strings or regex patterns that identify data.

    • DOM parsing - Traverse the DOM (Document Object Model) tree to locate elements.

    • XPath queries - Write expressions to navigate through HTML structure and find data.

  4. Extract and store the data, often in a database or spreadsheet.

  5. Repeat the process across many pages to gather larger data sets.

Web scraping can be done through scripting languages like Python, libraries like BeautifulSoup, browser automation tools like Selenium, or fully integrated scraping solutions.
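The steps above can be sketched in a few lines of Python. To keep the sketch self-contained, the page content here is a hard-coded HTML snippet (the tag names and values are made up), so only the parsing and extraction steps are exercised:

```python
from bs4 import BeautifulSoup

# Step 2 normally downloads the page; here we use a hard-coded
# snippet so the sketch runs offline.
html = """
<html><body>
  <div class="product"><span class="name">Widget</span><span class="price">9.99</span></div>
  <div class="product"><span class="name">Gadget</span><span class="price">19.99</span></div>
</body></html>
"""

# Step 3: parse the HTML and locate the relevant elements.
soup = BeautifulSoup(html, "html.parser")
rows = []
for product in soup.select(".product"):
    rows.append({
        "name": product.select_one(".name").getText(),
        "price": float(product.select_one(".price").getText()),
    })

# Step 4: `rows` can now be stored in a database or spreadsheet.
print(rows)
```

The same pattern scales to step 5 by looping the download-and-parse cycle over many URLs.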

Key Considerations

There are a few key factors to keep in mind when web scraping:

  • Avoid overloading websites with too many rapid requests, which can be seen as denial of service attacks. Add delays and throttles.

  • Check websites’ terms of use and robots.txt files to understand if they allow scraping. Some sites prohibit it.

  • Use caches, proxies, and rotation to distribute requests and avoid getting IP addresses blocked.

  • In some cases, explicitly identifying as a scraper through a user agent string can help avoid blocks.

  • Make sure to follow relevant laws and regulations regarding data collection and usage.
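As a minimal sketch of the throttling advice, here is a hypothetical polite_fetch helper that spaces requests out over time; the function name and the default delay are our own choices, not a standard API:

```python
import time
from typing import Callable, Dict, Iterable

def polite_fetch(urls: Iterable[str],
                 fetch: Callable[[str], str],
                 delay: float = 2.0) -> Dict[str, str]:
    """Fetch each URL through `fetch`, pausing `delay` seconds between requests."""
    pages: Dict[str, str] = {}
    for i, url in enumerate(urls):
        if i:  # no delay needed before the very first request
            time.sleep(delay)
        pages[url] = fetch(url)
    return pages

# With a real scraper, `fetch` could wrap requests.get and identify
# the bot through a user-agent header, for example:
#   fetch = lambda u: get(u, headers={"User-Agent": "my-bot/0.1"}).text
```

Injecting the fetch function keeps the throttling logic testable without touching the network.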

In summary, web scraping is a versatile technique to automate the extraction of data from websites for various purposes. When done properly, it is an extremely useful tool for data collection and analysis.

Analyzing Stack Overflow Question Data

Stack Overflow is one of the largest online communities for software developers to ask and answer programming questions. The site contains a wealth of data that can be analyzed to uncover interesting insights.

In this article, we will scrape a sample of recent Stack Overflow questions using Python and BeautifulSoup. We will then load the data into a Pandas DataFrame to analyze question statistics like views, answers, votes, etc.

Scraping Stack Overflow Questions with BS4

We can use the BeautifulSoup library in Python to parse the HTML of the Stack Overflow homepage and extract the question data.

  1. The first step, of course, is to load the libraries that give us the required objects.
from requests import get
from bs4 import BeautifulSoup, ResultSet
from pandas import DataFrame
from typing import List, Dict, Any
import plotly.express as px
  2. Next, we use the get function from requests to obtain the response object. The BeautifulSoup class, combined with a parser, turns the raw content into a navigable HTML tree. The key call is the select method, which picks every tag carrying the given CSS classes; with that, all the questions land in our variable. We then need to process this block further to get the data into a friendly Python dictionary.
url = "https://stackoverflow.com/questions"
response = get(url)

# Check if the request was successful
if response.status_code != 200:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
    exit()

# the html.parser allows us to parse the content
soup = BeautifulSoup(response.content, "html.parser")


# Just in case you need the html file
""" with open("stackoverflow_data.html",'wb') as file:
    file.write(
        soup.find('div',id="mainbar").find('div',class_="flush-left").prettify("utf-8")
    ) """

questions = soup.select('.s-post-summary.js-post-summary')
  3. This handler turns the questions into a dictionary. For each question, the loop selects the title, user, votes, answers, and views.
def questions_handler(questions: ResultSet) -> Dict[str, List[Any]]:
    data: Dict[str, List[Any]] = {
        'title': [],
        'user': [],
        'vote_count': [],
        'answer_count': [],
        'view_count': []
    }
    for question in questions:
        title = question.select_one('.s-link')
        user = question.select_one('.s-user-card--link a')
        vote_count = question.select_one('.s-post-summary--stats-item__emphasized')
        answer_count = question.select_one('.s-post-summary--stats-item:nth-child(2)')
        view_count = question.select_one('.s-post-summary--stats-item:nth-child(3)')
        # Skip the question entirely if any field is missing, so the
        # five lists stay the same length (DataFrame requires that).
        if None in (title, user, vote_count, answer_count, view_count):
            continue
        data['title'].append(title.getText())
        data['user'].append(user.getText())
        data['vote_count'].append(vote_count.getText())
        data['answer_count'].append(answer_count.getText())
        data['view_count'].append(view_count.getText())
    return data
  4. Now we bring in the DataFrame class, which will turn the Python dict into a basic dataframe so we can manipulate the data at will. This leads to the data-cleansing phase. The function also detects when the scraping step came back empty (for instance, after a Stack Overflow redesign breaks the selectors); the user sees a DataFrame either way.
def build_dataframe(questions_set: ResultSet):
    data = questions_handler(questions=questions_set)
    if not data["title"]:
        return DataFrame({
            "message": ["The Stackoverflow page has been changed"],
            "action": ["correct the scraper"]
        })
    return DataFrame({
        "titles": data["title"],
        "users": data["user"],
        "vote_counts": data["vote_count"],
        "answer_counts": data["answer_count"],
        "view_counts": data["view_count"],
    })
  5. We build the dataframe with the handler.
questions_df = build_dataframe(questions)
questions_df.head(1)
titles users vote_counts answer_counts view_counts
0 add attrs to a relationship model field on dja... Reginaldo Uset Padron \n0\nvotes\n \n0\nanswers\n \n2\nviews\n

EDA

  1. Remove the newline characters that are messing up the count strings.
questions_df['vote_counts'] = questions_df['vote_counts'].str.replace('\n', ' ')
questions_df['answer_counts'] = questions_df['answer_counts'].str.replace('\n', ' ')
questions_df['view_counts'] = questions_df['view_counts'].str.replace('\n', ' ')

questions_df.head(1)
titles users vote_counts answer_counts view_counts
0 add attrs to a relationship model field on dja... Reginaldo Uset Padron 0 votes 0 answers 2 views
  2. Extract the numeric views, answers, and votes, and rename the columns.
questions_df['vote_counts'] = questions_df['vote_counts'].str.extract(r'(\d+)').astype(int)
questions_df['answer_counts'] = questions_df['answer_counts'].str.extract(r'(\d+)').astype(int)
questions_df['view_counts'] = questions_df['view_counts'].str.extract(r'(\d+)').astype(int)
questions_df.rename(columns={
    "vote_counts": "votes",
    "answer_counts": "answers",
    "view_counts": "views"
},
inplace=True
)
questions_df.head(5)
titles users votes answers views
0 add attrs to a relationship model field on dja... Reginaldo Uset Padron 0 0 2
1 Kube-OVN configuration on OpenStack Vladouse 0 0 2
2 Design suggestion for a reporting component eshwar 0 0 2
3 Angular - unable to unsubscribe from valueChan... apex2022 0 0 6
4 Load filter from path with deepar_flutter Ivan David 0 0 4
  3. From here we can save the data to Excel, produce plots, or build pivot tables.
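As a sketch of the saving step: the tiny DataFrame below stands in for questions_df (its values are illustrative), and the Excel call is left commented out because it needs the optional openpyxl dependency:

```python
from io import StringIO
from pandas import DataFrame

# A tiny DataFrame standing in for questions_df (values are illustrative).
df = DataFrame({"titles": ["example question"], "votes": [3], "answers": [1], "views": [42]})

# Saving to Excel requires the optional openpyxl package:
# df.to_excel("questions.xlsx", index=False)

# CSV works out of the box:
buffer = StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue().splitlines()[0])  # header row
```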

Barplots

The views by Users:

px.bar(questions_df, x="views", y="users", text_auto=True)

The answers by Users:

px.bar(questions_df, x="answers", y="users", text_auto=True)

The votes by Users:

px.bar(questions_df, x="votes", y="users", text_auto=True)

Pivot Table

questions_df.pivot_table(
    values=["views","answers","votes"],
    index = 'users',
    aggfunc='sum',
    fill_value=0,
    margins=True,
    margins_name="Total"
)
answers views votes
users
AKD2022 1 32 3
Abdulrahman Youssef 0 6 0
Alessandra 0 3 0
Amirhussein 0 2 0
Anand Kannan 0 7 1
BDav25 0 5 0
Branden Jones 1 5 0
Christine Brydges 0 3 0
CrazyCodingCranberry 0 13 2
DARSIN 0 4 0
Damian Kowalski 0 5 0
Danilo 0 9 1
David Thielen 0 3 0
Dosado 0 6 0
Electrious_46 1 9 2
G. Bittencourt 0 7 0
Hamza Bilal 0 5 0
Husamuldeen Abdlrahman 0 6 0
Inder Singh 0 3 0
Ishan Popatia 0 3 0
Ivan David 0 4 0
Josh.K 0 5 0
Kachi 0 13 1
Learn on hard way 0 5 0
Louis.vgn 0 4 1
Netanel 0 5 0
PilotPatel1 0 5 0
Piotr Sygnatowicz 0 5 0
Reginaldo Uset Padron 0 2 0
Roelof Berkepeis 0 5 0
Ronaldo 0 22 3
Sertac TULLUK 0 8 0
SorryBaby 0 6 0
The_Matrix 0 5 0
Tyler Seppala 0 3 0
Vladouse 0 2 0
Yannik Brüggemann 1 9 0
Yehia Gamal 0 8 2
Yurizinhokkjkjk 0 10 4
ZorTik 0 8 0
ansh87 0 9 0
apex2022 0 6 0
devblock 0 4 0
eshwar 0 2 0
negativeattitude 0 7 1
neksodebe 0 6 0
petesing 0 5 0
sapo_cosmico 1 4 0
user1998863 0 15 2
wforl 0 6 2
Total 5 334 25