How To Web Scrape
Using Python’s Beautiful Soup
Step 1
Import BeautifulSoup and the other libraries used in this walkthrough.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
Step 2
In this example, we’ll set the URL to the Wikipedia page for the most recent season of The Bachelorette (Season 16, the Clare and Tayshia season).
url = 'https://en.wikipedia.org/wiki/The_Bachelorette_(season_16)'
results = requests.get(url)
Then run results on its own to check the response.
results
The response should look like this: <Response [200]>. A status code of 200 means the request succeeded. Any other code (for example, 404) means the page was not retrieved as expected.
From this URL we’ll be scraping several columns of the contestants table.
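As a rough illustration of reading a status code, here is a minimal sketch that uses only Python’s standard library. The helper name describe_status is hypothetical (it is not part of requests). With a real response you would pass in results.status_code, or call results.raise_for_status() to raise an exception on a 4xx or 5xx code.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short, human-readable summary of an HTTP status code.

    Hypothetical helper for illustration only.
    """
    phrase = HTTPStatus(code).phrase  # e.g. "OK" or "Not Found"
    kind = "success" if 200 <= code < 300 else "problem"
    return f"{code} {phrase} ({kind})"


print(describe_status(200))
print(describe_status(404))
```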
Step 3
“Beautiful Soup is a Python library for pulling data out of HTML files. It works with your parser to provide ways of navigating and parsing.” https://beautiful-soup-4.readthedocs.io/en/latest/
soup = BeautifulSoup(results.text, "html.parser")
print(soup.prettify())
Step 4
On the Bachelorette Wikipedia page, right-click and choose “Inspect.”
After choosing “Inspect,” the browser’s developer tools will open alongside the page.
Step 5
Click the element-picker icon in the top-left corner of the developer tools (the arrow pointing into a box).
Step 6
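Hover over the table on the page and look in the developer tools for the line of HTML that highlights the Bachelorette table. It is a table element with the class wikitable.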
Step 7
Search the parsed page for every table with the class wikitable.
soup.find_all('table', class_='wikitable')
Step 8
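Assign that result back to soup. Note that this overwrites the full parsed page, so from here on soup refers only to the list of matching tables.
soup = soup.find_all('table', class_='wikitable')
Next, take the first matching table and drill down to its body.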
soup = soup[0].find('tbody')
soup
Step 9
Continue parsing through the HTML to reach the row that contains the name Dale Moss.
x = soup.find_all('tr')[1]
x
Step 10
The name Dale Moss (Clare’s winner) has been isolated from the table!
x.find_all('td')[0].find('a')['title']
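To make the chain from table to row to cell to link easier to follow, here is a small self-contained sketch that runs the same calls against a made-up HTML snippet. The table contents and the variable names (page, table, body, rows) are placeholders, not the real Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the general shape of a wikitable.
html = """
<table class="wikitable">
  <tbody>
    <tr><th>Name</th><th>Age</th></tr>
    <tr><td><a href="/wiki/Example" title="Example Person">Example</a></td><td>30</td></tr>
    <tr><td>Another Person[1]</td><td>28</td></tr>
  </tbody>
</table>
"""

page = BeautifulSoup(html, "html.parser")
table = page.find_all("table", class_="wikitable")[0]  # first matching table
body = table.find("tbody")                             # the table body
rows = body.find_all("tr")                             # header row plus data rows

# The name in the first data row sits in a link, so read its title attribute.
linked_name = rows[1].find_all("td")[0].find("a")["title"]
# The name in the second data row is plain text inside the cell.
plain_name = rows[2].find_all("td")[0].text

print(linked_name)
print(plain_name)
```

Using separate variable names for each level keeps the full parsed page available if you need to go back to it.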
Step 11
The name Zac Clark (Tayshia’s winner) has been isolated!
soup.find_all('tr')[2].find_all('td')[0].find('b').text
Step 12
Isolating Zac’s and Dale’s names required a different process for each of them. The rest of the contestants, however, all follow the same pattern, as the next steps show.
Step 13
Now let’s isolate the rest of the names.
soup.find_all('tr')[3].find_all('td')[0].text.replace('[22]\n','')
Step 14
names = []
for n in range(3,36):
    names.append(soup.find_all('tr')[n].find_all('td')[0].text.replace('\n',''))
names
To get rid of the extra characters (the brackets and the footnote numbers), use a second for loop.
for i in range(len(names)):
    num = names[i].index('[')
    names[i] = names[i][:num]
names
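The loop above relies on names[i].index('['), which raises a ValueError if a name has no bracket in it. As one possible alternative, here is a minimal sketch that removes footnote markers with a regular expression and leaves names without a marker untouched. The helper name strip_footnotes and the sample names are placeholders.

```python
import re


def strip_footnotes(text: str) -> str:
    """Remove footnote markers such as [22] and surrounding whitespace."""
    return re.sub(r"\[\d+\]", "", text).strip()


# Placeholder names: one with a marker, one without, one with a trailing newline.
raw_names = ["Contestant A[22]", "Contestant B", "Contestant C[23]\n"]
clean_names = [strip_footnotes(name) for name in raw_names]
print(clean_names)
```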
Step 15
hometown = []
for x in range(3,36):
    hometown.append(soup.find_all('tr')[x].find_all('td')[2].text.replace('\n',''))

jobs = []
for n in range(3,36):
    jobs.append(soup.find_all('tr')[n].find_all('td')[3].get_text().replace('\n',''))

ages = []
for age in range(3,36):
    ages.append(soup.find_all('tr')[age].find_all('td')[1].text.replace('\n',''))
# Convert the age strings to integers and keep the result
ages = list(map(int, ages))
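The loops above call soup.find_all('tr') again on every iteration. A possible alternative is to fetch the rows once and pull all four columns out of each row in a single pass. The sketch below does this on a made-up table. The HTML, the records list and the column order (name, age, hometown, job) are assumptions based on the indexes used above.

```python
from bs4 import BeautifulSoup

# Made-up table with the same column order the loops above assume.
html = """
<table class="wikitable"><tbody>
<tr><th>Name</th><th>Age</th><th>Hometown</th><th>Job</th></tr>
<tr><td>Contestant A[1]</td><td>30</td><td>Town A, State A</td><td>Job A</td></tr>
<tr><td>Contestant B[2]</td><td>28</td><td>Town B, State B</td><td>Job B</td></tr>
</tbody></table>
"""

# Fetch the rows once instead of once per loop iteration.
rows = BeautifulSoup(html, "html.parser").find("tbody").find_all("tr")

records = []
for row in rows[1:]:  # skip the header row
    # get_text(strip=True) trims the surrounding whitespace and newlines
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    name, age, hometown, job = cells[:4]
    records.append(
        {"names": name, "ages": int(age), "hometowns": hometown, "jobs": job}
    )

print(records)
```

A list of dictionaries like this can be passed straight to pd.DataFrame.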
Step 16
Create a DataFrame
import pandas as pd
from pandas import DataFrame
df = pd.DataFrame(ages, columns=['ages'])
df['names'] = names
df['hometowns'] = hometown
df['jobs'] = jobs
df
Step 17
Dale and Zac have to be added in separately because the web scraping for them was a little different.
new_row1 = {'ages': [31, 36],
            'names': ['Dale Moss', 'Zac Clark'],
            'hometowns': ['Brandon, South Dakota', 'Haddonfield, New Jersey'],
            'jobs': ['Former Pro Football Wide Receiver', 'Addiction Specialist']}
top_row = pd.DataFrame(new_row1)
df = pd.concat([top_row, df]).reset_index(drop = True)
df
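As a side note, pd.concat also accepts ignore_index=True, which renumbers the rows in one step instead of calling reset_index(drop=True) afterwards. Here is a minimal sketch with two small frames. The "Contestant A" and "Contestant B" rows are placeholder data, and only two columns are shown to keep it short.

```python
import pandas as pd

# Placeholder rows standing in for the scraped contestants.
rest = pd.DataFrame({"ages": [30, 28], "names": ["Contestant A", "Contestant B"]})
# The two winners, added by hand as in the step above.
top = pd.DataFrame({"ages": [31, 36], "names": ["Dale Moss", "Zac Clark"]})

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
combined = pd.concat([top, rest], ignore_index=True)
print(combined)
```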