How To Web Scrape

Step 1

Import the libraries you'll need (BeautifulSoup for parsing, requests for fetching, and pandas for the DataFrame at the end).

import requests
from bs4 import BeautifulSoup
import pandas as pd

Step 2

In this instance we'll set the URL to the Wikipedia page for the most recent Bachelorette season (season 16, the Clare and Tayshia season).

url = 'https://en.wikipedia.org/wiki/The_Bachelorette_(season_16)'
results = requests.get(url)
results
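Displaying `results` in a notebook shows the response's status code, but it's worth knowing what checks a `requests.Response` supports. A minimal sketch, using a constructed `Response` object so it runs without a network connection (in the tutorial you'd call these on `results` directly):

```python
import requests

# A constructed Response standing in for `results = requests.get(url)`,
# so this sketch runs offline.
response = requests.Response()
response.status_code = 200

print(response.ok)            # True for any status code below 400
response.raise_for_status()   # raises requests.HTTPError for 4xx/5xx responses
```

A `<Response [200]>` display means the page came back successfully; anything in the 4xx/5xx range means the parsing steps below would be working on an error page.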

Step 3

“Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.” https://beautiful-soup-4.readthedocs.io/en/latest/

soup = BeautifulSoup(results.text, "html.parser")
print(soup.prettify())

Step 4

On the Bachelorette Wikipedia page, right-click and choose “Inspect.”

Step 5

Click the element-picker icon in the top-left corner of the developer tools panel (the arrow pointing into a box). This lets you select elements directly on the page.

Step 6

Hover over the contestant table on the page and look in the Elements panel for the line that highlights it. That line shows the element we need: a table with the class "wikitable".

Step 7

soup.find_all('table', class_='wikitable')

Step 8

Next, I reassign the result of the call above to soup, then pull the tbody out of the first matching table.

soup = soup.find_all('table', class_='wikitable')
soup = soup[0].find('tbody')
soup
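To see why the [0] index is needed: find_all returns a list-like ResultSet of every match, even when there's only one. A sketch on a hypothetical two-table page (not the real Wikipedia markup):

```python
from bs4 import BeautifulSoup

# Minimal stand-in page with two tables; only one has class "wikitable"
html = ('<table><tr><td>other</td></tr></table>'
        '<table class="wikitable"><tbody><tr><td>match</td></tr></tbody></table>')
soup = BeautifulSoup(html, 'html.parser')

tables = soup.find_all('table', class_='wikitable')  # ResultSet of matches
body = tables[0].find('tbody')                       # tbody of the first match
```

The class_ keyword (with the trailing underscore) exists because class is a reserved word in Python.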

Step 9

Continuing to parse the HTML, we pull out the table row containing the name Dale Moss.

x = soup.find_all('tr')[1]
x

Step 10

The name, Dale Moss, has been isolated from the table! (Clare’s winner)

x.find_all('td')[0].find('a')['title']
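The square-bracket access at the end reads an HTML attribute rather than the tag's text. A sketch on a hypothetical row with the same shape as the one above:

```python
from bs4 import BeautifulSoup

# A single hypothetical row: the link's title attribute carries the clean name
html = '<tr><td><a href="/wiki/Dale_Moss" title="Dale Moss">Dale Moss</a></td></tr>'
row = BeautifulSoup(html, 'html.parser')

# Chain: first <td> in the row -> its <a> tag -> that tag's title attribute
name = row.find_all('td')[0].find('a')['title']
```

Reading the title attribute gives a clean string even when the link text itself is wrapped in extra markup.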

Step 11

The name, Zac Clark, has been isolated! (Tayshia’s winner)

soup.find_all('tr')[2].find_all('td')[0].find('b').text

Step 12

Isolating Zac's and Dale's names required slightly different steps, but as you can see below, the rest of the contestants all follow the same pattern.

Step 13

Now let’s isolate the rest of the names.

soup.find_all('tr')[3].find_all('td')[0].text.replace('[22]\n','')

Step 14

names = []
for n in range(3, 36):
    names.append(soup.find_all('tr')[n].find_all('td')[0].text.replace('\n', ''))
names
for i in range(len(names)):
    num = names[i].index('[')
    names[i] = names[i][:num]
names
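The cleanup loop above uses str.index, which raises a ValueError for any name without a footnote bracket. A small helper (hypothetical, not part of the original code) handles both cases:

```python
def strip_footnote(text):
    """Drop a trailing [nn] Wikipedia citation marker, if one is present."""
    idx = text.find('[')                     # find returns -1 instead of raising
    return text[:idx] if idx != -1 else text

# Hypothetical scraped strings, with and without footnote markers
raw = ['Example A[22]', 'Example B[23]', 'Example C']
cleaned = [strip_footnote(s) for s in raw]
```

On the season-16 table every row happens to carry a footnote, so the original loop works there; the helper just makes the step reusable on other pages.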

Step 15

hometown = []
for x in range(3, 36):
    hometown.append(soup.find_all('tr')[x].find_all('td')[2].text.replace('\n', ''))
jobs = []
for n in range(3, 36):
    jobs.append(soup.find_all('tr')[n].find_all('td')[3].get_text().replace('\n', ''))
ages = []
for age in range(3, 36):
    ages.append(soup.find_all('tr')[age].find_all('td')[1].text.replace('\n', ''))
ages = [int(age) for age in ages]
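Each of the loops above calls soup.find_all('tr') from scratch, so the same columns can be collected in a single pass over the rows instead. A sketch on a hypothetical miniature of the contestant table (the names and values here are made up):

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the Wikipedia contestant table
html = """
<table class="wikitable"><tbody>
<tr><th>Name</th><th>Age</th><th>Hometown</th><th>Occupation</th></tr>
<tr><td>Example A[22]</td><td>29</td><td>Austin, Texas</td><td>Teacher</td></tr>
<tr><td>Example B[23]</td><td>31</td><td>Denver, Colorado</td><td>Nurse</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, 'html.parser')

names, ages, hometowns, jobs = [], [], [], []
for row in soup.find_all('tr')[1:]:                # skip the header row
    cells = row.find_all('td')
    names.append(cells[0].text.split('[')[0])      # drop the footnote marker
    ages.append(int(cells[1].text))
    hometowns.append(cells[2].text.strip())
    jobs.append(cells[3].text.strip())
```

One pass keeps the row indices in sync across all four lists automatically, rather than relying on the same range(3, 36) in every loop.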

Step 16

Create a DataFrame

df = pd.DataFrame(ages, columns=['ages'])
df['names'] = names
df['hometowns'] = hometown
df['jobs'] = jobs
df
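As a design note, for table-heavy pages pandas can also parse HTML tables into DataFrames directly, skipping the cell-by-cell loops. A sketch on a hypothetical mini-table (pd.read_html needs an HTML parser such as lxml installed):

```python
import io
import pandas as pd

# Hypothetical mini-table with the same class as the Wikipedia one
html = ('<table class="wikitable">'
        '<tr><th>Name</th><th>Age</th></tr>'
        '<tr><td>Example A</td><td>29</td></tr>'
        '</table>')

# Returns one DataFrame per matching table; th cells become the header
tables = pd.read_html(io.StringIO(html), attrs={'class': 'wikitable'})
df = tables[0]
```

The manual BeautifulSoup route used in this tutorial still earns its keep when cells need custom handling, like the footnote stripping and the link-title lookups above.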

Step 17

Dale and Zac have to be added in separately because the web scraping for them was a little different.

new_row1 = {'ages': [31, 36], 'names': ['Dale Moss', 'Zac Clark'], 'hometowns': ['Brandon, South Dakota', 'Haddonfield, New Jersey'], 'jobs': ['Former Pro Football Wide Receiver', 'Addiction Specialist']}
top_row = pd.DataFrame(new_row1)
df = pd.concat([top_row, df]).reset_index(drop=True)
df
All names are now included.
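The prepending step can be seen in isolation on a hypothetical two-row DataFrame (the non-winner rows here are made up):

```python
import pandas as pd

# Hypothetical miniature of the contestant DataFrame built above
df = pd.DataFrame({'ages': [29, 31], 'names': ['Example A', 'Example B']})

# The two winners, prepended so they appear first
top_row = pd.DataFrame({'ages': [31, 36], 'names': ['Dale Moss', 'Zac Clark']})
df = pd.concat([top_row, df]).reset_index(drop=True)
```

Without reset_index(drop=True), the concatenated frame would keep the original indices (0, 1, 0, 1); dropping and rebuilding them gives a clean 0..3 range.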
