Using Python’s Beautiful Soup

Step 1

# requests downloads the page, BeautifulSoup parses it, pandas holds the final table
import requests
from bs4 import BeautifulSoup
import pandas as pd

Step 2

url = 'https://en.wikipedia.org/wiki/The_Bachelorette_(season_16)'
results = requests.get(url)

Then run results alone to check the response.

results

The response should look like this: <Response [200]>. Any status code other than 200 means the request did not succeed and the page content should not be trusted.
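The same check can be written in code so a failed request stops the script early. This is a minimal sketch: ensure_ok and fetch_page are hypothetical helper names chosen for illustration, not part of requests, and the 10-second timeout is an assumed value.

```python
import requests


def ensure_ok(response):
    """Return the response unchanged if its status is 200, otherwise raise.

    ensure_ok is a hypothetical helper; it only reads response.status_code.
    """
    if response.status_code != 200:
        raise RuntimeError(f"Request failed with status {response.status_code}")
    return response


def fetch_page(url, timeout=10):
    """Download a page and stop early when the status is not 200."""
    return ensure_ok(requests.get(url, timeout=timeout))
```

Calling fetch_page(url) in place of requests.get(url) would apply the check automatically. requests also provides response.raise_for_status(), which raises an exception for 4xx and 5xx status codes.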

From this URL we’ll scrape a few columns from the contestants table: names, ages, hometowns, and jobs.

Step 3

soup = BeautifulSoup(results.text, "html.parser")
print(soup.prettify())

Step 4

After hitting “Inspect” you will see the HTML behind the table in the browser’s developer tools.

Step 5

Step 6

Step 7

soup.find_all('table', class_='wikitable')

Step 8

soup = soup.find_all('table', class_='wikitable')

Next

soup = soup[0].find('tbody')
soup
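To see what these two lookups return, here is a small sketch that runs on a made-up HTML snippet instead of the live page. The snippet, the variable names, and the fallback to the table itself are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page.
html = """
<table class="infobox"><tr><td>not this one</td></tr></table>
<table class="wikitable sortable">
  <tbody>
    <tr><th>Name</th><th>Age</th></tr>
    <tr><td>First Person</td><td>30</td></tr>
  </tbody>
</table>
"""

page = BeautifulSoup(html, "html.parser")

# class_ matches any one of an element's classes,
# so class="wikitable sortable" still qualifies.
tables = page.find_all("table", class_="wikitable")

# html.parser does not add a missing <tbody>,
# so fall back to the table itself if find() returns None.
first = tables[0]
body = first.find("tbody") or first

rows = body.find_all("tr")
print(len(tables), len(rows))
```

find_all returns a list-like result, which is why the first table has to be picked with [0] before calling find on it.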

Step 9

x = soup.find_all('tr')[1]
x

Step 10

x.find_all('td')[0].find('a')['title']
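Indexing a tag with ['title'] raises an error when the cell has no link or the link has no title. A more defensive version might look like the sketch below, where link_title is a hypothetical helper and the HTML is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up row: one cell with a titled link, one with bold text, one with plain text.
html = """
<table><tr>
  <td><a href="/wiki/Example" title="Example Person">Example</a></td>
  <td><b>Bold Name</b></td>
  <td>No link here</td>
</tr></table>
"""
cells = BeautifulSoup(html, "html.parser").find_all("td")


def link_title(cell):
    """Return the title attribute of the first link in a cell, or None."""
    link = cell.find("a")
    # .get() returns None instead of raising when the attribute is missing.
    return link.get("title") if link is not None else None


print(link_title(cells[0]))
print(link_title(cells[2]))
```

The same guard applies to .find('b'): it returns None when no bold tag exists, so check the result before reading .text.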

Step 11

soup.find_all('tr')[2].find_all('td')[0].find('b').text

Step 12

Step 13

soup.find_all('tr')[3].find_all('td')[0].text.replace('[22]\n','')

Step 14

names = []
for n in range(3,36):
    names.append(soup.find_all('tr')[n].find_all('td')[0].text.replace('\n',''))
names

To get rid of the extra characters (the brackets and the footnote numbers inside them), a second for loop trims each name.

for i in range(len(names)):
    # Keep only the text before the first '['; names without a footnote stay unchanged
    names[i] = names[i].split('[')[0]
names
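A regular expression is another way to strip the footnote markers, and it also handles several markers in one string. This is a sketch using only the standard library; strip_footnotes and the sample strings are made up for illustration.

```python
import re

# Matches bracketed footnote markers such as [22] or [a].
FOOTNOTE = re.compile(r"\[[^\]]*\]")


def strip_footnotes(text):
    """Remove every bracketed marker, then trim surrounding whitespace."""
    return FOOTNOTE.sub("", text).strip()


samples = ["First Person[22]\n", "Second Person[3][4]", "Third Person"]
cleaned = [strip_footnotes(s) for s in samples]
print(cleaned)
```

Strings with no brackets pass through unchanged, so the same function can be applied to every column.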

Step 15

hometown = []
for x in range(3,36):
    hometown.append(soup.find_all('tr')[x].find_all('td')[2].text.replace('\n',''))
jobs = []
for n in range(3,36):
    jobs.append(soup.find_all('tr')[n].find_all('td')[3].get_text().replace('\n',''))
ages = []
for age in range(3,36):
    ages.append(soup.find_all('tr')[age].find_all('td')[1].text.replace('\n',''))
# Convert the age strings to integers and store the result back in ages
ages = list(map(int, ages))
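The four loops above each walk the same rows, so they could be folded into a single pass. The sketch below runs on made-up HTML and assumes the same column order as the steps above (name, age, hometown, job). Real Wikipedia tables can contain merged cells, so treat it as a starting point only.

```python
from bs4 import BeautifulSoup

# Made-up table with the assumed column order: name, age, hometown, job.
html = """
<table class="wikitable"><tbody>
  <tr><th>Name</th><th>Age</th><th>Hometown</th><th>Job</th></tr>
  <tr><td>First Person</td><td>29</td><td>Sample City, Ohio</td><td>Teacher</td></tr>
  <tr><td>Second Person</td><td>31</td><td>Other Town, Texas</td><td>Engineer</td></tr>
</tbody></table>
"""
table = BeautifulSoup(html, "html.parser").find("table", class_="wikitable")

records = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) < 4:
        continue  # header rows use <th>, so they contain no <td> cells
    # get_text(strip=True) removes the surrounding whitespace and newlines.
    name, age, hometown, job = (c.get_text(strip=True) for c in cells[:4])
    records.append(
        {"names": name, "ages": int(age), "hometowns": hometown, "jobs": job}
    )

print(records)
```

Skipping rows by cell count avoids hard-coding a range such as 3 to 36, which would break if the table gained or lost a row.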

Step 16

# pandas was already imported as pd in Step 1
df = pd.DataFrame(ages, columns=['ages'])
df['names'] = names
df['hometowns'] = hometown
df['jobs'] = jobs
df
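The frame can also be built in one call by passing a dictionary of lists. This sketch uses short placeholder lists in place of the scraped ones; the values and the df_sketch name are made up.

```python
import pandas as pd

# Placeholder lists standing in for the scraped columns.
ages = [29, 31]
names = ["First Person", "Second Person"]
hometown = ["Sample City, Ohio", "Other Town, Texas"]
jobs = ["Teacher", "Engineer"]

# Dictionary keys become column names, in insertion order.
df_sketch = pd.DataFrame(
    {"ages": ages, "names": names, "hometowns": hometown, "jobs": jobs}
)
print(df_sketch)
```

pandas raises a ValueError when the lists differ in length, which makes a skipped row in one of the scraping loops easy to notice.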

Step 17

new_row1 = {'ages': [31, 36], 'names': ['Dale Moss', 'Zac Clark'], 'hometowns': ['Brandon, South Dakota', 'Haddonfield, New Jersey'], 'jobs': ['Former Pro Football Wide Receiver', 'Addiction Specialist']}
top_row = pd.DataFrame(new_row1)
df = pd.concat([top_row, df]).reset_index(drop=True)
df
All names included
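When rows are stacked with pd.concat, passing ignore_index=True renumbers the result in the same call and gives the same index as calling reset_index(drop=True) afterwards. A minimal sketch with placeholder frames (all values made up):

```python
import pandas as pd

# Placeholder frames standing in for the manually added rows and the scraped table.
top = pd.DataFrame({"names": ["A", "B"], "ages": [31, 36]})
rest = pd.DataFrame({"names": ["C"], "ages": [29]})

# ignore_index=True discards the old row labels and numbers the rows 0..n-1.
combined = pd.concat([top, rest], ignore_index=True)
print(combined)
```

Without renumbering, the combined frame would contain the label 0 twice, once from each input frame.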

