Scraping web pages with ScrapingBee
Web scraping is legal for now, and one excellent service that has emerged since the death of PhantomJS is ScrapingBee. It is a reliable substitute for developers familiar with PhantomJS, CasperJS, and NickJS. In the demo below, I wrap each ScrapingBee API request in a queue task.
Why ScrapingBee?
If your team wants to harvest content at scale without stressing over the obstacles that site operators such as LinkedIn and Craigslist place around their data, then a managed web-scraping service is a good fit.
Suppose you decide to go DIY and build your own scraper with PhantomJS. Your team must then consider two things: 1) how to mimic an actual web browser and 2) how to simulate a real person using that browser.
Mimic an Established Browser
Aside from managing your servers, you will need to 1) configure your bot's "User-Agent" to resemble a real browser, 2) handle JavaScript obfuscation from Single Page Apps (SPAs), 3) design a believable browser fingerprint, since content providers use fingerprints to identify unique users and track online behavior, and 4) scale your service to manage 20+ simultaneous instances of headless Chrome.
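To make the "User-Agent" point concrete, here is a minimal sketch of request options that send browser-like headers through Node's https module. The helper names (buildBrowserHeaders, requestOptions) and the User-Agent string are my own placeholders, not values any particular site requires, and headers alone will not get you past fingerprinting.

```javascript
// Example User-Agent value only; real browsers change this string with every release.
const EXAMPLE_USER_AGENT =
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36';

// Hypothetical helper: headers that resemble what a desktop browser sends.
const buildBrowserHeaders = (userAgent) => ({
    'User-Agent': userAgent,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9'
});

// Hypothetical helper: an options object in the shape https.request() accepts.
const requestOptions = (hostname, path) => ({
    hostname: hostname,
    port: 443,
    path: path,
    method: 'GET',
    headers: buildBrowserHeaders(EXAMPLE_USER_AGENT)
});

console.log(requestOptions('www.example.com', '/'));
```

The object returned by requestOptions can be passed straight to https.request(), the same call the demo further down uses.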
Mimic a Real Human
Not only will content providers conduct programmatic due diligence on your browser, but they will also run a second set of tests to confirm that a real user is behind it. The four most common tactics for verifying human behavior are 1) IP checking, 2) Captcha brain teasers, 3) requiring a username/password or a session ID, and 4) identifying strange patterns, such as downloading 1000 documents in sequential order (000, 001, 002, 003, 004...).
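The fourth tactic is the easiest to illustrate. A rough sketch, assuming the goal is simply to avoid a strictly sequential, evenly paced download pattern: shuffle the work list and add a random pause before each request. The function names (shuffle, jitter) and the 1 to 5 second window are illustrative choices of mine, and this alone does nothing about IP checks or Captchas.

```javascript
// Fisher-Yates shuffle; returns a new array and leaves the input untouched.
const shuffle = (items, random = Math.random) => {
    const copy = items.slice();
    for (let i = copy.length - 1; i > 0; i--) {
        const j = Math.floor(random() * (i + 1));
        [copy[i], copy[j]] = [copy[j], copy[i]];
    }
    return copy;
};

// Random whole number of milliseconds, at least minMs and below maxMs.
const jitter = (minMs, maxMs, random = Math.random) =>
    Math.floor(minMs + random() * (maxMs - minMs));

// Instead of fetching 000, 001, 002... in order, build a randomized plan.
const documentIds = ['000', '001', '002', '003', '004'];
const plan = shuffle(documentIds).map(id => ({ id: id, waitMs: jitter(1000, 5000) }));

console.log(plan);
```

Each entry in plan says which document to fetch next and how long to wait before doing so.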
If your team wants to build a web scraper, you'll need to consider these eight issues and more.
Simple Demo
In my example below, I am scraping seven pages from my website. The code is written for Node.js and uses the better-queue library to simplify orchestration.
const https = require('https');
const fs = require('fs');
const util = require("util");
const Queue = require('better-queue');
const data = [
    "https://www.chrisjmendez.com/2020/05/01/mba-glossary/",
    "https://www.chrisjmendez.com/2020/04/13/nuxtjs/",
    "https://www.chrisjmendez.com/2020/04/12/how-to-simultaneously-unrar-multiple-files-at-once-into-individual-folders/",
    "https://www.chrisjmendez.com/2020/03/10/installing-octave-on-macos-for/",
    "https://www.chrisjmendez.com/2019/12/25/find-files-on-your-mac-using-command-line/",
    "https://www.chrisjmendez.com/2019/12/02/deploying-rails-on-elastic-beanstalk/",
    "https://www.chrisjmendez.com/2019/10/30/managing-virtual-environments-for-python/"
];
const API_KEY = "REGISTER HERE => https://www.scrapingbee.com?fpr=chris-m37"
// Build the https.request() options for one target URL.
const config = (url) => {
    return {
        hostname: 'app.scrapingbee.com',
        port: 443,
        // Not an https option; kept so the worker can derive a file name from it.
        url: url,
        path: util.format('/api/v1?api_key=%s&url=%s', API_KEY, encodeURIComponent(url)),
        method: 'GET'
    };
};
// Write the scraped HTML to disk.
const save = (fileName, html) => {
    fs.writeFile(fileName, html, function(err){
        if(err) return console.log(err);
        console.log("Document Saved:", fileName);
    });
};
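const q = new Queue( (url, cb) => {
    const options = config(url);
    // Make sure the queue callback fires only once per task.
    let finished = false;
    const done = (err, result) => {
        if (finished) return;
        finished = true;
        cb(err, result);
    };
    const req = https.request(options, res => {
        console.log(`\nStatusCode: ${ res.statusCode }`);
        // The URLs end with "/", so drop empty segments before taking the last one.
        const fileName = options.url.split('/').filter(Boolean).pop();
        const chunks = [];
        let body = '';
        res
            .on('data', chunk => {
                chunks.push(chunk);
            })
            .on('end', () => {
                body = Buffer.concat(chunks).toString();
                save(`${fileName}.html`, body);
            })
            .on('close', () => {
                // Go to Next Item in Queue. better-queue expects cb(error, result).
                done(null, body);
            });
    });
    req.on('error', err => {
        console.error(err.message);
        // Fail this task instead of exiting, so the remaining URLs still run.
        done(err);
    });
    req.end();
});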
// /////////////////////////
// Task-Level Events
// /////////////////////////
q.on('task_started', (taskId, obj) => {
    console.log('task_started', taskId, obj);
});
q.on('task_finish', (taskId, result, stats) => {
    console.log('task_finish', taskId, stats);
});
q.on('task_failed', (taskId, err, stats) => {
    console.log('task_failed', taskId, stats);
});
// /////////////////////////
// Queue-Level Events
// /////////////////////////
// All tasks have been pulled off of the queue
// (there may still be tasks running!)
q.on('empty', () => {
    console.log('empty');
});
// There are no more tasks on the queue and no tasks running
q.on('drain', () => {
    console.log('drain');
});
// /////////////////////////
// Start Queue
// /////////////////////////
data.forEach(function (item) {
    q.push(item);
});
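If you want to pass more options to the API than the demo does, one possible approach is to build the query string with URLSearchParams rather than util.format, so every value gets encoded for you. This is only a sketch: buildPath is a hypothetical helper of mine, and render_js is shown as an example of an optional parameter, so check the ScrapingBee documentation for the options your plan supports.

```javascript
// Hypothetical helper: builds the request path for the /api/v1 endpoint used in the demo.
// extraParams is assumed to hold optional API parameters as plain strings.
const buildPath = (apiKey, targetUrl, extraParams = {}) => {
    const params = new URLSearchParams({
        api_key: apiKey,
        url: targetUrl,
        ...extraParams
    });
    return `/api/v1?${params.toString()}`;
};

const examplePath = buildPath(
    'YOUR_API_KEY',
    'https://www.chrisjmendez.com/2020/04/13/nuxtjs/',
    { render_js: 'false' } // assumption: skip JavaScript rendering for a static page
);

console.log(examplePath);
```

The returned string can take the place of the path value in the config function above.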
Questions, comments, and feedback welcome.
Resources
- ScrapingBee Web Scraping Handbook PDF
- Mirror your Header
- Open-source Web Scraping through NickJS
- Understanding PhantomJS
- Concurrency in JavaScript
- Web Scraping using Headless Chrome