Golang Code Examples
I stumbled across a scraper and crawler framework written in Go called Colly. Colly makes it really easy to scrape content from web pages with its fast speed and simple interface. I have always been interested in web scrapers ever since I built one for a university project, which you can read about here. Before continuing, please note that scraping websites is not always allowed and is sometimes even illegal. In the guide below we will be parsing this blog, GoPHP.io.
This is a quick guide on how to use Colly to parse content on any page using Golang. We will expand a basic example that parses links to also parse page headings and more.
To begin let’s take a look at the Colly Github page and scroll down to the example code listed there. We will create a new project with a new main.go file that looks like this:
You may need to run go get -u github.com/gocolly/colly/... to download the framework into your Go path. Now let's go ahead and change the URL to the gophp.io website.
Then we can run the script by typing go run main.go in your terminal, making sure you are in the project directory when you do this. You can press Ctrl+C in your terminal to cancel, as the program may run for a long time. What do we get as our output? For me it looked like this:
What we see here is exactly what you would expect. Our program parsed all the URLs on the main gophp.io page and then proceeded to the first link. That first link is a post on gophp.io, but the first link on that page points to VirtualBox, and our program will keep following links until it stops finding new ones. That could take a long time, and unless you want to build a search engine spider it won't be very efficient. What I want is a server that I can call from a PHP script, one that just fetches and formats the data I need. Luckily, Colly has a complete example of exactly this: a scraper server.
What does the above code do? It starts a web server running locally on your machine on port 7171. It takes a url parameter and returns all the links found at the URL you pass in. Let's give it a go by visiting http://127.0.0.1:7171/?url=https://gophp.io/ in a browser. Here is an example of the JSON-encoded output we get:
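The original output was not preserved; the response has roughly this shape, assuming the pageInfo struct sketched above (the URLs and counts here are illustrative, not real output):

```json
{
  "StatusCode": 200,
  "Links": {
    "https://gophp.io/": 3,
    "https://gophp.io/about/": 1
  }
}
```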
The above JSON output is only one level deep: notice that it does not keep following links on the pages it finds. This is great, because now we can use this program as a sort of microservice. A PHP application could call this microservice and receive all the links for a given URL, which it could then process further. Now, links are good, but we might want to parse other content on the page. Let's customize our code for this purpose.
Queries For Specific Content With Colly
If we take a look at the source of gophp.io, we can see that every title has the CSS class entry-title, which we can use for our query. We will modify the handler function by adding another map for headings. I am only including the section of code that I have changed below:
Now if we restart our program and navigate to our page on port 7171 again, we will see some additional output in our JSON response.
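With the Headings map added, the response takes roughly this shape (the titles and counts shown are illustrative, not captured output):

```json
{
  "StatusCode": 200,
  "Links": {
    "https://gophp.io/": 3
  },
  "Headings": {
    "Web Scraping With Colly": 1,
    "Getting Started With Go": 1
  }
}
```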
As you can see, we have now parsed all the titles on the page and added them to our JSON output. Using queries we can build very general or very specific parsers for any kind of website.
I hope this guide helps someone get started with web scraping. There are several real-world examples in the Colly documentation if you would like to learn more. I would love to hear your feedback, questions, and comments below!