# Digital Studio workshop: Web scraping neighbourhood house addresses
## What are we doing?
There is a wealth of data available on the Internet that is not immediately downloadable through conventional methods. Web scraping provides us with a way to programmatically extract important research data, where appropriate, from almost any website.
In this example, we will look at how to extract search results for Neighbourhood Houses across Victoria from the nhvic.org.au website.
For this exercise we will be using three Python packages:
1. *requests* will allow us to fetch HTML from the nhvic.org.au website.
2. *BeautifulSoup* will allow us to extract relevant metadata from this HTML.
3. *pandas* will let us create a table of our results and export these to a CSV.

Note how each Python package has a specific responsibility within our script. Once you learn the basics of a programming language, the next step is to become familiar with existing packages that will help you get your work done faster.

We are also going to do some text wrangling to make the data easier to use. For this we will be using:
1. *re* will allow us to read (or parse) the text and get rid of characters that are not needed to understand the data.
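
As a rough illustration of the kind of text wrangling *re* makes possible, the sketch below collapses stray whitespace in a scraped string. The sample text and the `tidy_text` helper are made up for this example and are not taken from the nhvic.org.au website.

```python
import re

def tidy_text(raw):
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    # \s+ matches one or more whitespace characters in a row
    return re.sub(r"\s+", " ", raw).strip()

# Hypothetical messy text, similar to what can come out of scraped HTML
messy = "\n\t 12 Example Street,\n   Sampletown   VIC 3000 \t"
clean = tidy_text(messy)
print(clean)
```

The same pattern can be adapted to strip other unwanted characters by changing the regular expression passed to `re.sub`.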
%% Cell type:code id:0de629db tags:
``` python
# Import the libraries needed to read data from the web
from bs4 import BeautifulSoup
import requests
# import library for data handling
import pandas as pd
# import library to pause so we don't overload their webpage
import time
```
%% Cell type:markdown id:87b13fc2 tags:
## Where is the data?
We will be getting data from this set of search results: https://www.nhvic.org.au/find-a-neighbourhood-house

It is worth spending a moment to look at the different components of the URL.
1. 'https://www.nhvic.org.au' is the origin or root URL
2. 'https://www.nhvic.org.au/find-a-neighbourhood-house' is the location of the search page we are going to scrape
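
To make these two components concrete, the short sketch below uses Python's built-in `urllib.parse` module to split the search URL into its parts and to rebuild it from the root. This module is not one of the packages listed for the workshop; it is used here only to illustrate how the URL is put together.

```python
from urllib.parse import urlparse, urljoin

search_url = "https://www.nhvic.org.au/find-a-neighbourhood-house"

# Split the URL into its named components
parts = urlparse(search_url)
root_url = f"{parts.scheme}://{parts.netloc}"
print(root_url)    # the origin or root URL
print(parts.path)  # the location of the search page on the site

# Joining the root and the path gives back the full search URL
rebuilt = urljoin(root_url, parts.path)
print(rebuilt == search_url)
```

Knowing the root URL is useful because links inside a page are often written relative to it, and `urljoin` is one way to turn them back into full addresses.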
## What does it look like?
Next we will look at the HTML of this search page. In Google Chrome you can use the menu *View* / *Developer* / *View Source* to see the HTML, that is, the website text that is 'marked up' so the browser can display it in a way that is easier to read and more appealing to look at.
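
If 'marked up' text is new to you, the sketch below may help. It feeds a small, made-up HTML fragment (not copied from the nhvic.org.au page) to Python's built-in `html.parser` and lists the tags and the visible text separately. The workshop itself uses BeautifulSoup for this job; the standard library parser and the `TagLister` class are used here only so the example runs without any extra installs.

```python
from html.parser import HTMLParser

# A made-up fragment of HTML; a real search page is much longer
sample_html = """
<div class="house">
  <h3><a href="/example-house">Example Neighbourhood House</a></h3>
  <p>12 Example Street, Sampletown</p>
</div>
"""

class TagLister(HTMLParser):
    """Record each opening tag and each piece of visible text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag, e.g. <div> or <p>
        self.tags.append(tag)

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only pieces
        if data.strip():
            self.text.append(data.strip())

lister = TagLister()
lister.feed(sample_html)
print(lister.tags)
print(lister.text)
```

The tags describe the structure of the page, while the text between them is the content we usually want to keep when scraping.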
%% Cell type:markdown id:738927ee tags:
## How do we get the raw HTML?
%% Cell type:code id:557d7065 tags:
``` python
# grab the search page that contains urls for each neighbourhood house