Web scraping is legal for now, and one excellent service to emerge since the death of PhantomJS is ScrapingBee. The service is a reliable substitute for developers familiar with PhantomJS, CasperJS, and NickJS. In the demo below, I wrap the ScrapingBee API request in a queue.
Why ScrapingBee?
If your team wants to harvest content at scale without stressing about the obstacles webmasters (LinkedIn, Craigslist) place around their data, then a managed web-scraping service is a good fit.
Suppose you decide to go DIY and build your own scraper on PhantomJS. Below are a few things your team must consider: 1) how to mimic an actual web browser, and 2) how to simulate a real person using the browser.
Mimic an Established Browser
Aside from managing your own servers, you will need to 1) configure your bot's "User-Agent" to resemble a real browser, 2) handle JavaScript obfuscation from Single Page Apps (SPAs), 3) manage your browser fingerprint, which content providers use to identify unique users and track online behavior, and 4) scale your service to run 20+ simultaneous instances of headless Chrome.
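Node's built-in https client announces itself as a bot unless you override the header yourself. Here is a minimal sketch of point 1, using a hypothetical buildOptions helper (not part of the demo below) and one example Chrome User-Agent string; real crawlers typically rotate through several:

```javascript
// An example desktop Chrome User-Agent string (one of many you might rotate).
const USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ' +
  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36';

// Hypothetical helper: build https.request options that present the bot
// as a real browser rather than Node's default agent.
const buildOptions = (url) => {
  const { hostname, pathname, search } = new URL(url);
  return {
    hostname,
    path: pathname + search,
    method: 'GET',
    headers: {
      'User-Agent': USER_AGENT,
      'Accept-Language': 'en-US,en;q=0.9'
    }
  };
};

const options = buildOptions('https://example.com/articles?page=1');
console.log(options.headers['User-Agent']);
```

Pass the resulting object straight to https.request; a missing or default Node User-Agent is one of the first things a content provider checks.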
Mimic a Real Human
Not only will content providers conduct programmatic due diligence on your browser, they will also run a second set of tests to confirm a real user is behind it. The four most common tactics for verifying human behavior are 1) IP checking, 2) CAPTCHA brain teasers, 3) requiring a username/password or a session ID, and 4) flagging strange patterns, such as downloading 1,000 documents in sequential order (000, 001, 002, 003, 004...).
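That last check, pattern detection, is the easiest to trip. One common countermeasure is to add randomized delays between requests so the crawl's timing does not look machine-generated. A sketch, using two hypothetical helpers not found in the demo below:

```javascript
// Hypothetical helper: wait a randomized interval between requests.
const randomDelay = (minMs, maxMs) =>
  new Promise((resolve) =>
    setTimeout(resolve, minMs + Math.random() * (maxMs - minMs)));

// Hypothetical helper: fetch URLs one at a time with (by default)
// 2-8 seconds of jitter between requests.
const crawl = async (urls, fetchOne, minMs = 2000, maxMs = 8000) => {
  for (const url of urls) {
    await fetchOne(url);
    await randomDelay(minMs, maxMs);
  }
};
```

Jitter alone will not defeat IP checks or CAPTCHAs, but it removes the metronome-like request cadence that pattern detectors flag first.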
If your team wants to build a web scraper, you'll need to consider these eight issues and more.
Simple Demo
In my example below, I am scraping seven pages from my website. The code is written for NodeJS and uses the better-queue library to simplify orchestration.
const https = require('https');
const fs = require('fs');
const util = require('util');
const Queue = require('better-queue');

const data = [
  "https://www.chrisjmendez.com/2020/05/01/mba-glossary/",
  "https://www.chrisjmendez.com/2020/04/13/nuxtjs/",
  "https://www.chrisjmendez.com/2020/04/12/how-to-simultaneously-unrar-multiple-files-at-once-into-individual-folders/",
  "https://www.chrisjmendez.com/2020/03/10/installing-octave-on-macos-for/",
  "https://www.chrisjmendez.com/2019/12/25/find-files-on-your-mac-using-command-line/",
  "https://www.chrisjmendez.com/2019/12/02/deploying-rails-on-elastic-beanstalk/",
  "https://www.chrisjmendez.com/2019/10/30/managing-virtual-environments-for-python/"
];

const API_KEY = "REGISTER HERE => https://www.scrapingbee.com?fpr=chris-m37";

// Build the https.request options for a single ScrapingBee API call.
const config = (url) => {
  return {
    hostname: 'app.scrapingbee.com',
    port: '443',
    url: url,
    path: util.format('/api/v1?api_key=%s&url=%s', API_KEY, encodeURIComponent(url)),
    method: 'GET'
  };
};

// Write the scraped HTML to disk.
const save = (fileName, html) => {
  fs.writeFile(fileName, html, function (err) {
    if (err) return console.log(err);
    console.log("Document Saved:", fileName);
  });
};

const q = new Queue((url, cb) => {
  let options = config(url);
  let req = https.request(options, res => {
    console.log(`\nStatusCode: ${res.statusCode}`);
    // Last non-empty path segment (the URLs end with "/", so filter out
    // the trailing empty string before popping).
    let fileName = options.url.split('/').filter(Boolean).pop();
    let body = [];
    res
      .on('data', html => {
        body.push(html);
      })
      .on('end', () => {
        body = Buffer.concat(body).toString();
        save(`${fileName}.html`, body);
      })
      .on('close', () => {
        // Go to the next item in the queue. better-queue treats the first
        // callback argument as an error, so pass null on success.
        cb(null, body);
      });
  });
  req.on('error', err => {
    console.error(err.message);
    process.exit(1);
  });
  req.end();
});
// /////////////////////////
// Task-Level Events
// /////////////////////////
q.on('task_started', (taskId, obj) => {
  console.log('task_started', taskId, obj);
});
q.on('task_finish', (taskId, result, stats) => {
  console.log('task_finish', taskId, stats);
});
q.on('task_failed', (taskId, err, stats) => {
  console.log('task_failed', taskId, stats);
});
// /////////////////////////
// Queue-Level Events
// /////////////////////////
// All tasks have been pulled off of the queue
// (there may still be tasks running!)
q.on('empty', () => {
  console.log('empty');
});
// There are no more tasks on the queue and no tasks running
q.on('drain', () => {
  console.log('drain');
});
// /////////////////////////
// Start Queue
// /////////////////////////
data.forEach(function (item) {
  q.push(item);
});
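One refinement worth mentioning: the config function above builds the API path with util.format and a manual encodeURIComponent call. Node's built-in URLSearchParams encodes every parameter consistently and makes it easy to pass extra ScrapingBee options such as render_js. A sketch, using a hypothetical buildPath helper against the same /api/v1 endpoint:

```javascript
// Hypothetical helper: build the ScrapingBee request path with
// URLSearchParams so all parameters are URL-encoded consistently.
const buildPath = (apiKey, url, extraParams = {}) => {
  const params = new URLSearchParams({ api_key: apiKey, url, ...extraParams });
  return `/api/v1?${params.toString()}`;
};

const path = buildPath('MY_KEY', 'https://example.com/a b', { render_js: 'true' });
console.log(path);
// → /api/v1?api_key=MY_KEY&url=https%3A%2F%2Fexample.com%2Fa+b&render_js=true
```

The returned string can be dropped into the path field of the options object unchanged.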
Questions, comments, and feedback welcome.
Resources
- ScrapingBee Web Scraping Handbook PDF
- Mirror your Header
- Opensource Web Scraping through NickJS
- Understanding PhantomJS
- Concurrency in Javascript
- Web Scraping using Headless Chrome