Commit 2c8bb01b authored by Amanda Belton

Update scrape_data_from_member_organisation.ipynb

update
parent 416b4474
%% Cell type:markdown id:fdd0ffa8 tags:
# Digital Studio workshop: Web scraping neighbourhood house addresses
## What are we doing
There is a wealth of data available on the Internet that is not immediately downloadable through conventional methods. Web scraping provides us with a way to programmatically extract important research data, where appropriate, from almost any website.
In this example, we will look at how to extract search results from the website of an organisation for Neighbourhood Houses across Victoria.
https://www.nhvic.org.au/find-a-neighbourhood-house
%% Cell type:markdown id:e64d6416 tags:
## What python packages do we need and why?
For this exercise we will be using three python packages:
1. *requests* will allow us to fetch HTML from the nhvic.org.au website.
2. *BeautifulSoup* will allow us to extract relevant metadata from this HTML.
3. *pandas* will let us create a table of our results and export these to a CSV.
Note how each Python package has a specific responsibility within our script. Once you learn the basics of a programming language, the next step is to become familiar with existing packages that will help you get your work done faster.
We are also going to do some text wrangling to make the data easier to use. For this we will be using:
1. *re* will allow us to read (or parse) the text and get rid of characters that are not needed to understand the data.
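As a minimal sketch of the kind of text wrangling *re* makes possible, the function below collapses runs of whitespace in a scraped string. The function name `tidy_text` and the messy sample string are assumptions made for illustration; they are not part of the workshop code.

```python
import re

def tidy_text(raw):
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    # \s+ matches one or more whitespace characters in a row
    return re.sub(r"\s+", " ", raw).strip()

# a made-up messy string, similar to what can come back from a web page
messy = "  1 Livingstone   Street,\n\tIVANHOE  "
print(tidy_text(messy))
```

The same pattern can be adapted to strip other unwanted characters by changing the regular expression.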
%% Cell type:code id:0de629db tags:
``` python
# Import the libraries needed to read data from the web
from bs4 import BeautifulSoup
import requests
# import library for data handling
import pandas as pd
# import library to pause so we don't overload their webpage
import time
```
%% Cell type:markdown id:87b13fc2 tags:
## Where is the data?
We will be getting data from this set of search results:
https://www.nhvic.org.au/find-a-neighbourhood-house
It is worth spending a moment to look at the different components of the URL.
1. 'https://www.nhvic.org.au' is the origin, or root URL
2. 'https://www.nhvic.org.au/find-a-neighbourhood-house' is the location of the search page we are going to scrape
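The two components can also be pulled apart in code. The sketch below uses the standard library module `urllib.parse`, which is an addition for illustration and is not one of the packages used in this workshop; the variable names are likewise assumptions.

```python
from urllib.parse import urlsplit, urljoin

search_url = "https://www.nhvic.org.au/find-a-neighbourhood-house"
parts = urlsplit(search_url)

# the origin (root URL) is the scheme plus the network location
root = f"{parts.scheme}://{parts.netloc}"
print(root)
print(parts.path)

# a relative link found on a page can be joined back onto the root
full_link = urljoin(root, "/mount-beauty-neighbourhood-centre")
print(full_link)
```

Knowing the root URL matters because links inside a page are often relative, so the root has to be added back before they can be requested.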
## What does it look like?
Next we will look at the HTML of this search page. In Google Chrome you can use the menu *View* / *Developer* / *View Source* to see the HTML, that is, the website text that has been 'marked up' so that the browser can display it in a way that is easy to read and appealing to look at.
%% Cell type:markdown id:738927ee tags:
## How do we get the raw HTML?
%% Cell type:code id:557d7065 tags:
``` python
# grab the search page that contains urls for each neighbourhood house
pageUrl = "https://www.nhvic.org.au/list-of-neighbourhood-houses-via-local-government-area"
# remember to set the root url for later
rooturl = "https://www.nhvic.org.au"
```
%% Cell type:markdown id:037edb55 tags:
![Screenshot%202024-05-20%20at%208.47.04%E2%80%AFAM.png](attachment:Screenshot%202024-05-20%20at%208.47.04%E2%80%AFAM.png)
%% Cell type:markdown id:ec1e5df4 tags:
# Retrieve
%% Cell type:code id:54e9609f tags:
``` python
# grab this page
response = requests.get(pageUrl)
soup = BeautifulSoup(response.text, "html.parser")
```
%% Cell type:markdown id:27cdee74 tags:
# Parse
%% Cell type:markdown id:0ec3c92c tags:
### Loop through the subpages from this search page?
We can grab just the links from this page by first finding the table that holds them, then finding every `a` element (that is, every HTML link) inside that table.
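Here is a minimal sketch of this step on a tiny, made-up piece of HTML, so it can be run without touching the website. The HTML snippet and the function name `extract_links` are assumptions for illustration; the `summary` and `target` attribute values mirror the ones used in this notebook. The sketch also returns an empty list when the table is missing, because `soup.find` returns `None` when nothing matches.

```python
from bs4 import BeautifulSoup

# made-up HTML standing in for the real search page
sample_html = """
<table summary="Table summary">
  <tr><td><a target="_self" href="/mount-beauty-neighbourhood-centre">Mount Beauty</a></td></tr>
  <tr><td><a target="_self" href="/myrtleford-neighbourhood-centre">Myrtleford</a></td></tr>
</table>
<a href="/contact">Contact</a>
"""

def extract_links(html):
    """Return the href of every matching link inside the summary table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"summary": "Table summary"})
    if table is None:
        # find() returns None when no matching element exists
        return []
    return [a.get("href") for a in table.find_all("a", {"target": "_self"})]

print(extract_links(sample_html))
```

The link outside the table (`/contact`) is left out, which is why the table is located first rather than collecting every link on the page.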
%% Cell type:code id:0b04265a tags:
``` python
# get the table of links on this page
tableoflinks = soup.find("table",{'summary': 'Table summary'})
# get the links inside the table
allLinks = tableoflinks.find_all('a',{'target': '_self'})
# get the neighbourhood house links only
links = [l.get("href") for l in allLinks]
# show the first couple of links
links[0:2]
```
%% Output
['/mount-beauty-neighbourhood-centre', '/myrtleford-neighbourhood-centre']
%% Cell type:code id:01306202 tags:
``` python
len(links)
```
%% Output
407
%% Cell type:code id:d89d7022 tags:
``` python
# add the root url into the links we've scraped
fullLinks = [rooturl+l for l in links]
# show just a few
fullLinks[10:15]
```
%% Output
['https://www.nhvic.org.au/livingstone-community-centre',
'https://www.nhvic.org.au/rosanna-fire-station-community-house',
'https://www.nhvic.org.au/watsonia-neighbourhood-house',
'https://www.nhvic.org.au/bass-valley-community-centre',
'https://www.nhvic.org.au/corinella-and-district-community-centre']
%% Cell type:code id:a5ef0ad2 tags:
%% Cell type:code id:d632a5a7 tags:
``` python
# save this initial data we've webscraped
dfLinks = pd.DataFrame(fullLinks, columns=["links"])
dfLinks.to_csv("FullLinks.csv",index=False)
```
%% Cell type:markdown id:4d64f1c4 tags:
%% Cell type:markdown id:31822e4b tags:
### Jumping off point
If you would like to continue and scrape the addresses as well, let's move on to retrieving the address for each of these neighbourhood houses.
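Retrieving addresses means visiting several hundred pages one after another, and a single failed request would otherwise stop the whole run. The sketch below shows one possible way to loop politely and keep going after an error. The names `scrape_all` and `fake_fetch` are hypothetical, and the stand-in fetch function exists only so the sketch runs offline; in practice any function that takes a link and returns a dictionary could be passed in.

```python
import time

def scrape_all(links, fetch, pause=0.5):
    """Call fetch(link) for each link, pausing between calls.

    Returns (results, failures) so that one bad page does not stop the run.
    """
    results, failures = [], []
    for link in links:
        try:
            results.append(fetch(link))
        except Exception as err:  # broad on purpose: record the problem and move on
            failures.append((link, str(err)))
        time.sleep(pause)  # be gentle on the website
    return results, failures

# a stand-in fetch function, used here so the sketch runs offline
def fake_fetch(link):
    if "broken" in link:
        raise ValueError("page not found")
    return {"url": link}

results, failures = scrape_all(["/house-a", "/broken-link", "/house-b"], fake_fetch, pause=0)
print(results)
print(failures)
```

Keeping the failed links in their own list makes it easy to retry just those pages later.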
%% Cell type:markdown id:418cd4cc tags:
# Extract
%% Cell type:code id:242cc868 tags:
``` python
# make a little function to grab the name and address from a link
def getPageData(thisLink):
    # grab this page
    response = requests.get(thisLink)
    soup = BeautifulSoup(response.text, "html.parser")
    # get the meta content (name, address) on this page
    address = soup.find('meta', {'property': 'og:description'})['content']
    name = soup.find('meta', {'property': 'og:title'})['content']
    pagedata = {'name': name,
                'address': address,
                'url': thisLink}
    time.sleep(.5)
    return pagedata
```
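`getPageData` assumes that both meta tags are present on every page. If one is missing, `soup.find` returns `None` and the `['content']` lookup raises an error. The sketch below is one possible defensive variant that separates parsing from fetching so it can be tried on a made-up piece of HTML. The names `parsePageData` and `metaContent` and the sample HTML are assumptions for illustration, not part of the original notebook.

```python
from bs4 import BeautifulSoup

def parsePageData(html, thisLink):
    """Pull the name and address out of a page's Open Graph meta tags."""
    soup = BeautifulSoup(html, "html.parser")

    def metaContent(prop):
        tag = soup.find("meta", {"property": prop})
        # find() returns None if the tag is missing, so guard before reading it
        return tag.get("content") if tag is not None else None

    return {"name": metaContent("og:title"),
            "address": metaContent("og:description"),
            "url": thisLink}

# made-up HTML standing in for a neighbourhood house page
sample_html = """
<html><head>
<meta property="og:title" content="Livingstone Community Centre">
<meta property="og:description" content="1 Livingstone Street, IVANHOE">
</head></html>
"""
print(parsePageData(sample_html, "https://www.nhvic.org.au/livingstone-community-centre"))
```

Pages without the expected tags then produce `None` values, which show up as empty cells in the final table rather than halting the scrape.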
%% Cell type:code id:32bb0e6d tags:
``` python
# grab just one page to check this works
getPageData(fullLinks[10])
```
%% Output
{'name': 'Livingstone Community Centre',
'address': '1 Livingstone Street, IVANHOE',
'url': 'https://www.nhvic.org.au/livingstone-community-centre'}
%% Cell type:code id:6691d9b8 tags:
``` python
# now loop through all the links and get each page's data
pages = []
for l in fullLinks:
    print(l)
    # grab this page
    pages.append(getPageData(l))
```
%% Output
https://www.nhvic.org.au/foundation-learning-centre
https://www.nhvic.org.au/oakgrove-community-centre
https://www.nhvic.org.au/dunolly-and-district-neighbourhood-centre
https://www.nhvic.org.au/maryborough-community-house
https://www.nhvic.org.au/mill-house-neighbourhood-house
https://www.nhvic.org.au/marrar-woorn-neighbourhood-house
https://www.nhvic.org.au/colac-neighbourhood-house
https://www.nhvic.org.au/forrest-and-district-neighbourhood-house
https://www.nhvic.org.au/gellibrand-community-house
https://www.nhvic.org.au/camperdown-community-house
https://www.nhvic.org.au/simpson-and-district-community-centre
https://www.nhvic.org.au/alphington-community-centre
https://www.nhvic.org.au/jika-jika-community-centre
https://www.nhvic.org.au/bridge-darebin-preston
https://www.nhvic.org.au/northern-community-careworks
https://www.nhvic.org.au/preston-reservoir-adult-community-education
https://www.nhvic.org.au/reservoir-neighbourhood-house
https://www.nhvic.org.au/bridge-darebin-thornbury
https://www.nhvic.org.au/span-community-house
https://www.nhvic.org.au/bairnsdale-neighbourhood-house
%% Cell type:code id:20dc8076 tags:
%% Cell type:code id:d6b88cc3 tags:
``` python
pages
```
%% Output
[{'name': 'Foundation Learning Centre',
'address': '1 Malcolm Court, NARRE WARREN',
'url': 'https://www.nhvic.org.au/foundation-learning-centre'},
{'name': 'Oakgrove Community Centre',
'address': '89-101 Oakgrove Drive, NARRE WARREN SOUTH',
'url': 'https://www.nhvic.org.au/oakgrove-community-centre'},
{'name': 'Dunolly and District Neighbourhood Centre',
'address': 'Havelock Street, DUNOLLY',
'url': 'https://www.nhvic.org.au/dunolly-and-district-neighbourhood-centre'},
{'name': 'Maryborough Community House',
'address': '23 Primrose Street, MARYBOROUGH',
'url': 'https://www.nhvic.org.au/maryborough-community-house'},
{'name': 'Mill House Neighbourhood House',
'address': '88-90 Burke Street, MARYBOROUGH',
'url': 'https://www.nhvic.org.au/mill-house-neighbourhood-house'},
{'name': 'Marrar Woorn Neighbourhood House',
'address': '6 Pengilley Avenue, APOLLO BAY',
'url': 'https://www.nhvic.org.au/marrar-woorn-neighbourhood-house'},
{'name': 'Colac Neighbourhood House',
'address': '23 Miller Street, COLAC',
'url': 'https://www.nhvic.org.au/colac-neighbourhood-house'},
{'name': 'Forrest and District Neighbourhood House',
'address': '14 Grant Street, FORREST',
'url': 'https://www.nhvic.org.au/forrest-and-district-neighbourhood-house'},
{'name': 'Gellibrand Community House',
'address': '5 Main Road, GELLIBRAND',
'url': 'https://www.nhvic.org.au/gellibrand-community-house'},
{'name': 'Camperdown Community House',
'address': '6 Gunner Street, CAMPERDOWN',
'url': 'https://www.nhvic.org.au/camperdown-community-house'},
{'name': 'Simpson and District Community Centre',
'address': '11 Jayarra Street, SIMPSON',
'url': 'https://www.nhvic.org.au/simpson-and-district-community-centre'},
{'name': 'Alphington Community Centre',
'address': '2 Kelvin Road, ALPHINGTON',
'url': 'https://www.nhvic.org.au/alphington-community-centre'},
{'name': 'Jika Jika Community Centre',
'address': '1B Plant Street, NORTHCOTE',
'url': 'https://www.nhvic.org.au/jika-jika-community-centre'},
{'name': 'Bridge Darebin - Preston',
'address': '218 High Street, PRESTON',
'url': 'https://www.nhvic.org.au/bridge-darebin-preston'},
{'name': 'Northern Community Careworks',
'address': '81 High Street, PRESTON',
'url': 'https://www.nhvic.org.au/northern-community-careworks'},
{'name': 'Preston Reservoir Adult Community Education',
'address': '35 Sturdee Street, RESERVOIR',
'url': 'https://www.nhvic.org.au/preston-reservoir-adult-community-education'},
{'name': 'Reservoir Neighbourhood House',
'address': '2B Cuthbert Road, RESERVOIR',
'url': 'https://www.nhvic.org.au/reservoir-neighbourhood-house'},
{'name': 'Bridge Darebin - Thornbury',
'address': '131 Shaftesbury Parade, THORNBURY',
'url': 'https://www.nhvic.org.au/bridge-darebin-thornbury'},
{'name': 'Span Community House',
'address': '64 Clyde Street, THORNBURY',
'url': 'https://www.nhvic.org.au/span-community-house'},
{'name': 'Bairnsdale Neighbourhood House',
'address': '27 Dalmahoy Street, BAIRNSDALE',
'url': 'https://www.nhvic.org.au/bairnsdale-neighbourhood-house'}]
%% Cell type:code id:52b36471 tags:
``` python
df = pd.DataFrame(pages)
df.sample(5)
```
%% Output
    name                                       address                          url
4   Mill House Neighbourhood House             88-90 Burke Street, MARYBOROUGH  https://www.nhvic.org.au/mill-house-neighbourh...
6   Colac Neighbourhood House                  23 Miller Street, COLAC          https://www.nhvic.org.au/colac-neighbourhood-h...
12  Jika Jika Community Centre                 1B Plant Street, NORTHCOTE       https://www.nhvic.org.au/jika-jika-community-c...
2   Dunolly and District Neighbourhood Centre  Havelock Street, DUNOLLY         https://www.nhvic.org.au/dunolly-and-district-...
16  Reservoir Neighbourhood House              2B Cuthbert Road, RESERVOIR      https://www.nhvic.org.au/reservoir-neighbourho...
%% Cell type:markdown id:8e28c840 tags:
## Save
Save the neighbourhood houses and their addresses to a CSV file.
%% Cell type:code id:8cd50089 tags:
``` python
df.to_csv('MoM.T2.NeighbourhoodHouses.csv')
```
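Note that `to_csv` writes the row index as an unnamed first column unless `index=False` is passed, as was done for `FullLinks.csv` earlier. Here is a small sketch of the difference, writing to in-memory buffers rather than files; the one-row table reuses a name and address from the output above, and the variable names are assumptions for illustration.

```python
import io
import pandas as pd

df_demo = pd.DataFrame([{"name": "Livingstone Community Centre",
                         "address": "1 Livingstone Street, IVANHOE"}])

with_index = io.StringIO()
df_demo.to_csv(with_index)                  # default: row numbers become a first column
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)  # only the named columns are written

# compare the header line of each version
print(with_index.getvalue().splitlines()[0])
print(without_index.getvalue().splitlines()[0])
```

Either form loads back into pandas; leaving the index out simply gives a cleaner file for use in other tools.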
%% Cell type:code id:0c1c3dcb tags:
``` python
```