Web scraping relies on the HTML structure of the page, and thus cannot be completely stable. When the HTML structure changes, the scraper may break. Keep this in mind while reading: by the time you read this article, the CSS selectors used here may already be outdated.
In the previous article, we created a scraper to parse movie data from IMDB. We also used a simple in-memory queue to avoid sending hundreds or thousands of concurrent requests, and thus to avoid being blocked. But what if you are already blocked? The site that you are scraping has added your IP to its blacklist, and you don't know whether the block is temporary or permanent.
Such issues can be resolved with a proxy server. Using proxies and rotating IP addresses can prevent you from being detected as a scraper. The idea of rotating IP addresses while scraping is to make your scraper look like real users accessing the website from multiple locations. If you implement it right, you drastically reduce the chances of being blocked.
In this article, I will show you how to send concurrent HTTP requests with ReactPHP using a proxy server. We will play around with some concurrent HTTP requests and then we will come back to the scraper, which we have written before. We will update the scraper to use a proxy server for performing requests.
How to send requests through a proxy in ReactPHP
For sending concurrent HTTP requests we will use the clue/reactphp-buzz package. To install it, run the following command:
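Assuming Composer is available, the install command would be (clue/buzz-react being the package's Packagist name):

```shell
composer require clue/buzz-react
```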
Now, let’s write a simple asynchronous HTTP request:
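A minimal sketch of such a request, assuming clue/buzz-react and its dependencies are installed via Composer:

```php
<?php

require __DIR__ . '/vendor/autoload.php';

use Clue\React\Buzz\Browser;
use Psr\Http\Message\ResponseInterface;

$loop = React\EventLoop\Factory::create();
$client = new Browser($loop);

$client->get('http://google.com')
    ->then(function (ResponseInterface $response) {
        // The whole response is buffered; print its HTML body
        echo (string) $response->getBody();
    });

$loop->run();
```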
We create an instance of Clue\React\Buzz\Browser, which is an asynchronous HTTP client. Then we request the Google web page via the get($url) method. This method returns a promise, which resolves with an instance of Psr\Http\Message\ResponseInterface. The snippet above requests http://google.com and then prints its HTML.
For a more detailed explanation of working with this asynchronous HTTP client check this post.
The Browser class is very flexible. You can specify different connection settings, such as DNS resolution, TLS parameters, timeouts and, of course, proxies. All these settings are configured via an instance of React\Socket\Connector. The Connector class accepts a loop and a configuration array. So, let's create one and pass it to our client as a second argument:
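A sketch of a connector configured for custom DNS resolution (8.8.8.8 is Google's public DNS server):

```php
<?php

require __DIR__ . '/vendor/autoload.php';

use Clue\React\Buzz\Browser;
use React\Socket\Connector;

$loop = React\EventLoop\Factory::create();

// Configure the connector to use Google's public DNS server
$connector = new Connector($loop, ['dns' => '8.8.8.8']);

$client = new Browser($loop, $connector);
```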
This connector tells the client to use 8.8.8.8 for DNS resolution.
Before we can start using a proxy, we need to install the clue/reactphp-socks package:
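The package installs via Composer (clue/socks-react being its Packagist name):

```shell
composer require clue/socks-react
```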
This library provides SOCKS4, SOCKS4a and SOCKS5 proxy client/server implementation for ReactPHP. In our case, we need a client. This client will be used to connect to a proxy server. Then our main HTTP client will use this proxy client to send connections through a proxy server.
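Creating such a proxy client might look like this (127.0.0.1:1080 is a placeholder address, as noted below):

```php
use Clue\React\Socks\Client;
use React\Socket\Connector;

$loop = React\EventLoop\Factory::create();

// The proxy client wraps a plain connector and speaks the
// SOCKS protocol to the proxy server at the given address
$proxy = new Client('127.0.0.1:1080', new Connector($loop));
```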
Notice that 127.0.0.1:1080 is just a dummy address; of course, there is no proxy server running on our machine.
The constructor of the Clue\React\Socks\Client class accepts the address of the proxy server (127.0.0.1:1080) and an instance of Connector. We have already covered Connector above; here we create an empty one, with no configuration array.

The name Clue\React\Socks\Client may suggest that this is one more HTTP client in our code, but it is not the same thing as Clue\React\Buzz\Browser: it doesn't send requests. Consider it a connection rather than a client. Its main purpose is to establish a connection to a proxy server; the real client then uses this connection to perform requests.
To use this proxy connection we need to update the connector and specify the tcp option:
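A sketch of the updated connector; setting dns to false is an assumption on my part, so that hostnames are resolved by the proxy rather than locally:

```php
// Route all TCP connections through the SOCKS proxy client;
// disabling local DNS lets the proxy resolve hostnames instead
$connector = new Connector($loop, [
    'tcp' => $proxy,
    'dns' => false,
]);

$client = new Browser($loop, $connector);
```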
The full code now looks like this:
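Putting the pieces together, the full example might look like this:

```php
<?php

require __DIR__ . '/vendor/autoload.php';

use Clue\React\Buzz\Browser;
use Clue\React\Socks\Client;
use Psr\Http\Message\ResponseInterface;
use React\Socket\Connector;

$loop = React\EventLoop\Factory::create();

// Proxy connection (dummy address for now)
$proxy = new Client('127.0.0.1:1080', new Connector($loop));
$connector = new Connector($loop, ['tcp' => $proxy, 'dns' => false]);

$client = new Browser($loop, $connector);
$client->get('http://google.com')
    ->then(function (ResponseInterface $response) {
        echo (string) $response->getBody();
    });

$loop->run();
```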
Now, the problem is: where to get a real proxy?
Let’s find a proxy
On the Internet, you can find many sites dedicated to providing free proxies. For example, you can use https://www.socks-proxy.net. Visit it and pick a proxy from the Socks Proxy list.
In this tutorial, I use 184.178.172.13:15311.
By the time you read this article, this particular proxy will probably no longer work. Please pick another proxy from the site mentioned above.
Now, the working example looks like this:
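A sketch of the working example with the proxy above and an onRejected callback:

```php
<?php

require __DIR__ . '/vendor/autoload.php';

use Clue\React\Buzz\Browser;
use Clue\React\Socks\Client;
use Psr\Http\Message\ResponseInterface;
use React\Socket\Connector;

$loop = React\EventLoop\Factory::create();

$proxy = new Client('184.178.172.13:15311', new Connector($loop));
$connector = new Connector($loop, ['tcp' => $proxy, 'dns' => false]);

$client = new Browser($loop, $connector);
$client->get('http://google.com')
    ->then(
        function (ResponseInterface $response) {
            echo (string) $response->getBody();
        },
        function (Exception $e) {
            // The proxy may be dead or refusing connections
            echo 'ERROR: ' . $e->getMessage() . PHP_EOL;
        }
    );

$loop->run();
```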
Notice that I have added an onRejected callback. A proxy server might not work (especially a free one), so it is useful to show an error if our request has failed. Run the code and you will see the HTML code of the Google main page.
Updating the scraper
To refresh your memory, here is the consumer code of the scraper from the previous article:
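Since the previous article's listing is not reproduced here, the consumer code can be sketched as follows; the Scraper class, its scrape($urls, $timeout) method, the getMovieData() getter and the IMDB URL are assumed names based on the description below:

```php
<?php

require __DIR__ . '/vendor/autoload.php';

use Clue\React\Buzz\Browser;

$loop = React\EventLoop\Factory::create();
$scraper = new Scraper(new Browser($loop));

// Scrape two movie pages with a 40-second timeout (illustrative URLs)
$scraper->scrape([
    'https://www.imdb.com/title/tt0111161/',
    'https://www.imdb.com/title/tt1375666/',
], 40);

$loop->run();
print_r($scraper->getMovieData());
```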
We create an event loop. Then we create an instance of Clue\React\Buzz\Browser. The scraper uses this instance to perform concurrent requests. We scrape two URLs with a 40-second timeout. As you can see, we don't even need to touch the scraper's code. All we need is to update the Browser constructor and provide a Connector configured to use a proxy server. First, create a proxy client with an empty connector:
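For example:

```php
use Clue\React\Socks\Client;
use React\Socket\Connector;

$proxy = new Client('184.178.172.13:15311', new Connector($loop));
```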
Then we need a new connector for the Browser with a configured tcp option, where we provide our client:
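For example (disabling local DNS so the proxy resolves hostnames is an assumption):

```php
$connector = new Connector($loop, [
    'tcp' => $proxy,
    'dns' => false,
]);
```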
And the last step is to update the Browser constructor by providing the connector:
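For example (Scraper being the assumed class name from the previous article):

```php
$scraper = new Scraper(new Browser($loop, $connector));
```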
The updated proxy version looks like the following:
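A full sketch of the proxied consumer code, under the same naming assumptions as above:

```php
<?php

require __DIR__ . '/vendor/autoload.php';

use Clue\React\Buzz\Browser;
use Clue\React\Socks\Client;
use React\Socket\Connector;

$loop = React\EventLoop\Factory::create();

$proxy = new Client('184.178.172.13:15311', new Connector($loop));
$connector = new Connector($loop, ['tcp' => $proxy, 'dns' => false]);

$scraper = new Scraper(new Browser($loop, $connector));
$scraper->scrape(['https://www.imdb.com/title/tt0111161/'], 40);

$loop->run();
print_r($scraper->getMovieData());
```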
But, as I have mentioned before, proxies might not work. It would be nice to know why we have scraped nothing. So it looks like we still have to update the scraper's code and add error handling. The part of the scraper that performs HTTP requests looks like the following:
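A sketch of such a scrape() method; the scrapeFromHtml() helper and the $client/$loop properties are hypothetical names standing in for the previous article's code:

```php
public function scrape(array $urls, $timeout = 5)
{
    foreach ($urls as $url) {
        // onFulfilled handler: parse the buffered response body
        $promise = $this->client->get($url)->then(
            function (ResponseInterface $response) {
                $this->scrapeFromHtml((string) $response->getBody());
            }
        );

        // Cancel the request if it takes longer than $timeout seconds
        $this->loop->addTimer($timeout, function () use ($promise) {
            $promise->cancel();
        });
    }
}
```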
The request logic is located inside the scrape() method. We loop through the specified URLs and perform a concurrent request for each of them. Each request returns a promise. As an onFulfilled handler, we provide a closure where the response body is scraped. Then we set a timer to cancel the promise, and thus the request, by timeout. One thing is missing here: there is no error handling for this promise. When the parsing is done, there is no way to figure out what errors have occurred. It would be nice to have a list of errors, with URLs as keys and the corresponding errors as values. So, let's add a new $errors property and a getter for it:
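For example:

```php
private $errors = [];

public function getErrors()
{
    return $this->errors;
}
```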
Then we need to update the scrape() method and add a rejection handler for the request promise:
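The rejection handler can be added as a second callback to then(), with names as in the earlier sketch:

```php
$promise = $this->client->get($url)->then(
    function (ResponseInterface $response) {
        $this->scrapeFromHtml((string) $response->getBody());
    },
    function (Exception $e) use ($url) {
        // Remember what went wrong for this URL
        $this->errors[$url] = $e->getMessage();
    }
);
```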
When an error occurs, we store it in the $errors property with the corresponding URL. Now we can keep track of all errors that occur during scraping. Also, before scraping, don't forget to reset the $errors property to an empty array; otherwise we will keep storing old errors. Here is the updated version of the scrape() method:
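Combining these pieces, the updated method might look like this (helper and property names remain assumptions):

```php
public function scrape(array $urls, $timeout = 5)
{
    $this->errors = []; // reset errors from a previous run

    foreach ($urls as $url) {
        $promise = $this->client->get($url)->then(
            function (ResponseInterface $response) {
                $this->scrapeFromHtml((string) $response->getBody());
            },
            function (Exception $e) use ($url) {
                $this->errors[$url] = $e->getMessage();
            }
        );

        $this->loop->addTimer($timeout, function () use ($promise) {
            $promise->cancel();
        });
    }
}
```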
Now, the consumer code can be the following:
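For example, printing both the results and the errors at the end:

```php
$scraper->scrape(['https://www.imdb.com/title/tt0111161/'], 40);

$loop->run();

print_r($scraper->getMovieData());
print_r($scraper->getErrors());
```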
At the end of this snippet, we print both the scraped data and the errors. A list of errors can be very useful: besides tracking dead proxies, we can also detect whether or not we have been banned.
What if my proxy requires authentication?
All the examples above work fine for free proxies. But when you are serious about scraping, chances are high that you have private proxies, which in most cases require authentication. Providing your credentials is very simple: just update your proxy connection string like this:
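For example, embedding the credentials in the proxy URI (placeholder values):

```php
$proxy = new Client('username:password@127.0.0.1:1080', new Connector($loop));
```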
But keep in mind that if your credentials contain special characters, they should be URL-encoded:
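For example, using PHP's rawurlencode() on each part (placeholder values):

```php
$user = rawurlencode('us:er');
$password = rawurlencode('p@ss');

$proxy = new Client($user . ':' . $password . '@127.0.0.1:1080', new Connector($loop));
```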
You can find examples from this article on GitHub.
This article is a part of the ReactPHP Series.