How To Web Scrape

Raizel Bernstein
4 min read · Dec 29, 2020

Using Python’s Beautiful Soup

Step 1

Import BeautifulSoup

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

Step 2

In this specific instance we’ll set the url to the Wikipedia page of the most recent Bachelorette season (season 16, Clare and Tayshia’s season).

url = 'https://en.wikipedia.org/wiki/The_Bachelorette_(season_16)'
results = requests.get(url)

Then run results alone to check the response.

results

The response should look like this: <Response [200]>. Any status code other than 200 means something went wrong (for example, 404 means the page was not found).
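If you’re ever unsure what a non-200 code means, the standard library can translate it for you. This is a side note, not part of the scraping flow:

```python
from http import HTTPStatus

# Stdlib lookup of what an HTTP status code means -- no request needed.
print(HTTPStatus(200).phrase)  # OK
print(HTTPStatus(404).phrase)  # Not Found
```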

From the url we’ll be scraping some columns from the contestant table on the page.

Step 3

“Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.” https://beautiful-soup-4.readthedocs.io/en/latest/

soup = BeautifulSoup(results.text, "html.parser")
print(soup.prettify())
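Before moving on to the real page, here is the same parsing chain on a tiny, hypothetical snippet of markup, so you can try the calls without a network request:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the Wikipedia page (hypothetical markup), so the
# same parsing calls can be tried without a network request.
html = """
<table class="wikitable">
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td><a title="Dale Moss">Dale Moss</a></td><td>31</td></tr>
</table>
"""
mini = BeautifulSoup(html, "html.parser")
# find the table, then index into its rows, just as in the steps below
row = mini.find('table', class_='wikitable').find_all('tr')[1]
print(row.find('a')['title'])  # Dale Moss
```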

Step 4

On the Bachelorette Wikipedia page, right-click and hit “Inspect.”

After hitting “Inspect,” the browser’s developer tools panel will open.

Step 5

Click the element-picker icon in the top left of the developer tools (the arrow pointing into a box).

Step 6

Look in the Elements panel for the line of HTML that highlights the contestant table.

Step 7

Search the parsed page for every table with the wikitable class:

soup.find_all('table', class_='wikitable')

Step 8

Set the result of the call above equal to soup. (Note this overwrites the page-level soup object, so re-run Step 3 if you need the full page again.)

soup = soup.find_all('table', class_='wikitable')

Next, grab the body of the first matching table.

soup = soup[0].find('tbody')
soup

Step 9

Continue parsing the HTML to isolate the name Dale Moss, starting with the second row of the table.

x = soup.find_all('tr')[1]
x

Step 10

The name Dale Moss (Clare’s winner) has been isolated from the table!

x.find_all('td')[0].find('a')['title']

Step 11

The name Zac Clark (Tayshia’s winner) has been isolated!

soup.find_all('tr')[2].find_all('td')[0].find('b').text

Step 12

Isolating Zac’s and Dale’s names involved a different process, because their rows are marked up differently as season winners. But as you can see below, the rest of the contestants all follow the same pattern.
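The difference between the two winner lookups comes down to where the name lives in the markup. A minimal sketch, using hypothetical cells that mimic the two patterns:

```python
from bs4 import BeautifulSoup

# Hypothetical cells mimicking the two winner rows: Dale's name sits in
# an <a> tag's title attribute, Zac's in a <b> tag's text.
dale_cell = BeautifulSoup('<td><a title="Dale Moss">Dale</a></td>', 'html.parser')
zac_cell = BeautifulSoup('<td><b>Zac Clark</b></td>', 'html.parser')
print(dale_cell.find('a')['title'])  # Dale Moss  (attribute lookup)
print(zac_cell.find('b').text)       # Zac Clark  (tag text)
```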

Step 13

Now let’s isolate the rest of the names.

soup.find_all('tr')[3].find_all('td')[0].text.replace('[22]\n','')

Step 14

names = []
for n in range(3,36):
    names.append(soup.find_all('tr')[n].find_all('td')[0].text.replace('\n',''))
names

To get rid of the extra characters (the footnote brackets and numbers), another for loop was created.

for i in range(len(names)):
    num = names[i].index('[')
    names[i] = names[i][:num]
names
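The same cleanup can be done with a regex, which also won’t raise an error on a name that happens to have no footnote. A sketch on sample strings (not the scraped data), assuming the markers are always bracketed digits:

```python
import re

# Sample strings stand in for the scraped names; the pattern strips any
# "[22]"-style footnote marker and leaves unmarked names untouched.
sample = ['Name One[22]', 'Name Two[23]', 'Name Three']
cleaned = [re.sub(r'\[\d+\]', '', s) for s in sample]
print(cleaned)  # ['Name One', 'Name Two', 'Name Three']
```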

Step 15

hometown = []
for x in range(3,36):
    hometown.append(soup.find_all('tr')[x].find_all('td')[2].text.replace('\n',''))

jobs = []
for n in range(3,36):
    jobs.append(soup.find_all('tr')[n].find_all('td')[3].get_text().replace('\n',''))

ages = []
for age in range(3,36):
    ages.append(soup.find_all('tr')[age].find_all('td')[1].text.replace('\n',''))
ages = list(map(int, ages))  # assign the result back, or the cast is discarded
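Note that each loop above re-runs soup.find_all('tr') on every iteration. A sketch (on a hypothetical two-row table) of caching the row list once and collecting all the columns in a single pass:

```python
from bs4 import BeautifulSoup

# Hypothetical rows standing in for the Wikipedia table, to sketch a
# single pass that collects every column and casts ages as it goes.
html = """<table>
<tr><td>Name One</td><td>28</td><td>Denver, Colorado</td><td>Teacher</td></tr>
<tr><td>Name Two</td><td>31</td><td>Austin, Texas</td><td>Chef</td></tr>
</table>"""
rows = BeautifulSoup(html, "html.parser").find_all('tr')  # parsed once
names, ages, hometowns, jobs = [], [], [], []
for row in rows:
    cells = row.find_all('td')
    names.append(cells[0].text)
    ages.append(int(cells[1].text))  # cast while collecting
    hometowns.append(cells[2].text)
    jobs.append(cells[3].text)
print(ages)  # [28, 31]
```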

Step 16

Create a DataFrame

import pandas as pd
df = pd.DataFrame(ages, columns=['ages'])
df['names'] = names
df['hometowns'] = hometown
df['jobs'] = jobs
df
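The same frame could also be built in one shot from a dict of columns instead of adding them one at a time. A sketch with sample values, not the scraped lists:

```python
import pandas as pd

# One-shot construction from a dict of columns (sample values only).
sample_df = pd.DataFrame({
    'ages': [28, 31],
    'names': ['Name One', 'Name Two'],
    'hometowns': ['Denver, Colorado', 'Austin, Texas'],
    'jobs': ['Teacher', 'Chef'],
})
print(sample_df.shape)  # (2, 4)
```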

Step 17

Dale and Zac have to be added separately because the web scraping for them was a little different.

new_row1 = {'ages': [31, 36], 'names': ['Dale Moss', 'Zac Clark'], 'hometowns': ['Brandon, South Dakota', 'Haddonfield, New Jersey'], 'jobs': ['Former Pro Football Wide Receiver', 'Addiction Specialist']}
top_row = pd.DataFrame(new_row1)
df = pd.concat([top_row, df]).reset_index(drop=True)
df
All names included
