55 BS4 Soup: Extract Table to DF

```python
import requests
from bs4 import BeautifulSoup
```
55.1 Understanding the Process
When we scrape a table from a website, we’re essentially following these steps:
- Parse the HTML: Use BeautifulSoup to navigate the HTML structure
- Find the table: Locate the specific table element we want
- Extract data: Pull out the rows and cells from the table
- Structure the data: Organize it into a format pandas can understand
- Create DataFrame: Convert our structured data into a pandas DataFrame
55.2 Complete Example
Let me show you a practical example using fake HTML content that represents a typical table you might find on a website:
```python
from bs4 import BeautifulSoup
import pandas as pd

# Create fake HTML content with a table
html_content = """
<html>
<body>
<div class="content">
<h2>Employee Information</h2>
<table id="employee-table" class="data-table">
<thead>
<tr>
<th>Name</th>
<th>Department</th>
<th>Salary</th>
<th>Years of Experience</th>
</tr>
</thead>
<tbody>
<tr>
<td>John Smith</td>
<td>Engineering</td>
<td>$75,000</td>
<td>5</td>
</tr>
<tr>
<td>Sarah Johnson</td>
<td>Marketing</td>
<td>$65,000</td>
<td>3</td>
</tr>
<tr>
<td>Mike Wilson</td>
<td>Sales</td>
<td>$70,000</td>
<td>7</td>
</tr>
<tr>
<td>Lisa Brown</td>
<td>HR</td>
<td>$60,000</td>
<td>2</td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
"""
```

```python
def scrape_table_to_dataframe(html_content):
    """Convert HTML table to pandas DataFrame."""
    # Step 1: Parse the HTML content
    soup = BeautifulSoup(html_content, 'html.parser')

    # Step 2: Find the table (using ID in this case)
    table = soup.find('table', {'id': 'employee-table'})

    # Step 3: Extract headers
    headers = []
    header_row = table.find('thead').find('tr')
    for th in header_row.find_all('th'):
        headers.append(th.text.strip())  # strip() removes whitespace

    # Step 4: Extract data rows
    rows_data = []
    tbody = table.find('tbody')
    for row in tbody.find_all('tr'):
        row_data = []
        for td in row.find_all('td'):
            row_data.append(td.text.strip())
        rows_data.append(row_data)

    # Step 5: Create DataFrame
    df = pd.DataFrame(rows_data, columns=headers)
    return df

# Execute the function
df = scrape_table_to_dataframe(html_content)
df
```

|   | Name | Department | Salary | Years of Experience |
|---|---|---|---|---|
| 0 | John Smith | Engineering | $75,000 | 5 |
| 1 | Sarah Johnson | Marketing | $65,000 | 3 |
| 2 | Mike Wilson | Sales | $70,000 | 7 |
| 3 | Lisa Brown | HR | $60,000 | 2 |
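One thing to notice about the result: every cell arrives as a string, including Salary and Years of Experience, so numeric work needs an explicit conversion step first. Here is a minimal sketch on a hypothetical stand-in frame with the same string values:

```python
import pandas as pd

# Hypothetical stand-in for the scraped frame: every cell is a string
df = pd.DataFrame({
    "Salary": ["$75,000", "$65,000", "$70,000", "$60,000"],
    "Years of Experience": ["5", "3", "7", "2"],
})

# Strip '$' and ',' from Salary, then convert both columns to integers
df["Salary"] = df["Salary"].str.replace(r"[$,]", "", regex=True).astype(int)
df["Years of Experience"] = df["Years of Experience"].astype(int)
```

After this, aggregations like `df["Salary"].mean()` behave as expected instead of failing on strings.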
55.3 Scrape from Table
```python
soup1 = BeautifulSoup(html_content, 'html.parser')
soup1_table = soup1.find('table', {'id': 'employee-table'})
soup1_table
```

```html
<table class="data-table" id="employee-table">
<thead>
<tr>
<th>Name</th>
<th>Department</th>
<th>Salary</th>
<th>Years of Experience</th>
</tr>
</thead>
<tbody>
<tr>
<td>John Smith</td>
<td>Engineering</td>
<td>$75,000</td>
<td>5</td>
</tr>
<tr>
<td>Sarah Johnson</td>
<td>Marketing</td>
<td>$65,000</td>
<td>3</td>
</tr>
<tr>
<td>Mike Wilson</td>
<td>Sales</td>
<td>$70,000</td>
<td>7</td>
</tr>
<tr>
<td>Lisa Brown</td>
<td>HR</td>
<td>$60,000</td>
<td>2</td>
</tr>
</tbody>
</table>
```
```python
import bs4

def bs4_table_to_df(bs4table: bs4.element.Tag):
    """Transform a single bs4 table to a dataframe"""
    # Extract headers (column names)
    headers = []
    header_row = bs4table.find("thead").find("tr")
    for th in header_row.find_all("th"):
        headers.append(th.text.strip())  # strip() removes whitespace

    # Extract data rows
    rows_data = []
    tbody = bs4table.find("tbody")
    for row in tbody.find_all("tr"):
        row_data = []
        for td in row.find_all("td"):
            row_data.append(td.text.strip())
        rows_data.append(row_data)

    # Create DataFrame
    df = pd.DataFrame(rows_data, columns=headers)
    return df

bs4_table_to_df(soup1_table)
```

|   | Name | Department | Salary | Years of Experience |
|---|---|---|---|---|
| 0 | John Smith | Engineering | $75,000 | 5 |
| 1 | Sarah Johnson | Marketing | $65,000 | 3 |
| 2 | Mike Wilson | Sales | $70,000 | 7 |
| 3 | Lisa Brown | HR | $60,000 | 2 |
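Because the function takes any table Tag rather than the whole page, a page with several tables can be converted in one pass. Here is a sketch with hypothetical inline HTML, assuming each table has a `<thead>` and `<tbody>` like the example above (a compact restatement of the function is included so the snippet runs on its own):

```python
from bs4 import BeautifulSoup
import bs4
import pandas as pd

def bs4_table_to_df(bs4table: bs4.element.Tag) -> pd.DataFrame:
    """Compact version of the function above, for a self-contained demo."""
    headers = [th.text.strip() for th in bs4table.find("thead").find_all("th")]
    rows_data = [[td.text.strip() for td in tr.find_all("td")]
                 for tr in bs4table.find("tbody").find_all("tr")]
    return pd.DataFrame(rows_data, columns=headers)

# Hypothetical page containing two small tables
html = """
<table><thead><tr><th>A</th><th>B</th></tr></thead>
<tbody><tr><td>1</td><td>2</td></tr></tbody></table>
<table><thead><tr><th>X</th></tr></thead>
<tbody><tr><td>9</td></tr></tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")
dfs = [bs4_table_to_df(t) for t in soup.find_all("table")]
```

The result is a list with one DataFrame per `<table>` on the page.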
55.4 Understanding Each Step
Let me break down what’s happening in each part of our function:
Step 1 - HTML Parsing: BeautifulSoup creates a navigable tree structure from the HTML string. Think of it like creating a map of the webpage’s structure that we can explore programmatically.
Step 2 - Table Location: We use find() to locate our specific table. I used the ID selector here, but you could also use class names, tag names, or other attributes.
Step 3 - Header Extraction: We navigate to the table header (<thead>) and extract all the column names from the <th> elements. The strip() method is crucial here because HTML often contains extra whitespace.
Step 4 - Data Extraction: We iterate through each row in the table body (<tbody>), then through each cell (<td>) in each row, building a list of lists structure.
Step 5 - DataFrame Creation: Finally, we pass our structured data to pandas, specifying our headers as column names.
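The whitespace issue from Step 3 is easy to see in isolation. A small sketch with a hypothetical header cell:

```python
from bs4 import BeautifulSoup

# Hypothetical header cell with stray whitespace, as real HTML often has
cell = BeautifulSoup("<th>\n  Name  \n</th>", "html.parser").find("th")

print(repr(cell.text))          # the raw text keeps the newlines and spaces
print(repr(cell.text.strip()))  # 'Name'
```

Without `strip()`, those invisible characters would end up inside your column names and make lookups like `df["Name"]` fail.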
55.5 Alternative Table Finding Methods
Sometimes tables don’t have convenient IDs or classes. Here are other ways to find tables:
```python
soup = BeautifulSoup(html_content, 'html.parser')

# Find by class name
table = soup.find('table', {'class': 'data-table'})

# Find by tag name (gets the first table)
table = soup.find('table')

# Find all tables and select by index
tables = soup.find_all('table')
table = tables[0]  # First table

# Find by contained text: match any <table> whose text includes 'Department'
# (a plain string= filter won't work here, because a <table> holds child tags
# rather than a single text node)
table = soup.find(lambda tag: tag.name == 'table' and 'Department' in tag.get_text())
```

55.6 Handling Edge Cases
Real-world tables can be messy. Here’s a more robust version that handles common issues:
```python
def robust_table_scraper(html_content, table_selector=None):
    """More robust table scraping with error handling."""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Find table with flexible selector
    if table_selector:
        table = soup.select_one(table_selector)
    else:
        table = soup.find('table')

    if not table:
        raise ValueError("No table found in the HTML content")

    # Try to find headers - they might be in thead or in the first row
    headers = []
    thead = table.find('thead')
    if thead:
        header_row = thead.find('tr')
        headers = [th.text.strip() for th in header_row.find_all(['th', 'td'])]
    else:
        # Headers might be in the first row of the table
        first_row = table.find('tr')
        if first_row:
            headers = [cell.text.strip() for cell in first_row.find_all(['th', 'td'])]

    # Extract data rows
    rows_data = []
    tbody = table.find('tbody')
    if tbody:
        rows = tbody.find_all('tr')
        # If there was no <thead>, the first body row held the headers - skip it
        if not thead:
            rows = rows[1:]
    else:
        rows = table.find_all('tr')[1:]  # Skip the header row
    for row in rows:
        cells = row.find_all(['td', 'th'])
        row_data = [cell.text.strip() for cell in cells]
        if row_data:  # Only add non-empty rows
            rows_data.append(row_data)

    # Create DataFrame, with fallback column names if no headers were found
    if not headers:
        headers = [f'Column_{i+1}' for i in range(len(rows_data[0]))]
    df = pd.DataFrame(rows_data, columns=headers)
    return df
```

This enhanced version handles tables without proper `<thead>` sections and provides fallback column names when headers aren't found.
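The header fallback is the part that matters most in practice. Here is a minimal sketch of that same logic in isolation, using a hypothetical messy table with no `<thead>` where the headers live in the first row:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical messy table: no <thead>, headers are in the first row
html = ("<table>"
        "<tr><th>City</th><th>Pop</th></tr>"
        "<tr><td>Oslo</td><td>0.7M</td></tr>"
        "</table>")
table = BeautifulSoup(html, "html.parser").find("table")

# Fall back to the first row for headers, then treat the remaining rows as data
headers = [c.text.strip() for c in table.find("tr").find_all(["th", "td"])]
rows = [[c.text.strip() for c in tr.find_all("td")]
        for tr in table.find_all("tr")[1:]]
df = pd.DataFrame(rows, columns=headers)
```

A naive scraper that only looked inside `<thead>` would come back with no column names at all for a table like this.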
From here, the natural next step is scraping tables from actual websites: fetch the page with the requests library, then hand the response HTML to BeautifulSoup exactly as we did above.