Build a Web-Scraped API with Express and Cheerio

The growth of the World Wide Web over the last couple of decades has led to an enormous amount of data being collected and published on webpages throughout the internet. A corollary of this prolific production and distribution of content is a vast body of information that can be used in innumerable ways, provided one can effectively extract and aggregate it.

The most common ways to collect and aggregate data available on the web are (1) to request it from an API with technologies such as REST or SOAP and (2) to write a program to parse or scrape it from loosely structured data, like HTML. The former is by far the preferable method for the programmer consuming the information, but it is often not available because of the development time and resources it requires from the producer. Therefore, more often than not, the only available means of getting at the prized data is to scrape it.

This article is going to present a technique for building a standalone Node.js application that collects (scrapes) tax data for the state of Nebraska and either presents that info to the user or calculates taxes based on a city and amount supplied.

The technologies to be utilized are:

  • Node.js: A JavaScript runtime built on Chrome's V8 engine
  • Express: A Node.js web framework
  • Cheerio: An HTML parsing library that mirrors the familiar jQuery library API

The source code can be found on GitHub here.

Base Project Setup

This tutorial will utilize the Node Package Manager (npm) to initialize the project, install libraries, and manage dependencies. Before we get started, make sure you've configured npm for your environment.

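Initialize the project, accepting the basic default options. Running npm init -y in an empty project directory is one way to do this.

Install the dependencies, Express and Cheerio. A command such as npm install --save express cheerio should do it.

Basic project structure, as implied by the require(...) calls used later in this article: server.js sits in the project root, next to an api directory containing routes/index.js, controllers/index.js, models/index.js, and services/index.js.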

Express

We will be using Express to build out our RESTful API for the tax calculation application. Express is a web application framework for Node.js that is flexible, in that it imposes few restrictions on how you develop your applications, yet powerful, because it provides several features that are useful across a multitude of web applications.

Setting Up Express

In server.js we will include some boilerplate Express setup code that creates the Express application and then registers the routes module that we will make in the next subsection. At the end of the file we instruct the Express app to listen on the port supplied through the PORT environment variable or, if none is set, on the hardcoded fallback port 3500.

In server.js copy and paste in the following code:

'use strict';

const express = require('express');  
const app = express();  
const port = process.env.PORT || 3500;

const routes = require('./api/routes');  
routes(app);

app.listen(port);

console.log("Node application running on port " + port);  

Routes

We will set up routing in our application to respond to requests made to specific and meaningful URI paths. What do I mean by meaningful, you may be asking? Well, in the REST paradigm, route paths are designed to expose resources within the application in self-describing ways.

In the routes/index.js file copy and paste in the following code:

'use strict';

const taxCtrl = require('../controllers');

module.exports = (app) => {  
    app.use(['/calculate/:stateName/:cityName/:amount', '/taxrate/:stateName/:cityName'], (req, res, next) => {
        const state = req.params.stateName;
        if (!taxCtrl.stateUrls.hasOwnProperty(state.toLowerCase())) {
            res.status(404)
                    .send({message: `No state info found for ${state}`});
        } else {
            next();
        }
    });

    app.route('/taxrate/:stateName/:cityName')
        .get(taxCtrl.getTaxRate);

    app.route('/calculate/:stateName/:cityName/:amount')
        .get(taxCtrl.calculateTaxes);

    app.use((req, res) => {
        res.status(404)
            .send({url: `sorry friend, but url ${req.originalUrl} is not found`});
    });
}

The two routes being defined in this module are /taxrate/:stateName/:cityName and /calculate/:stateName/:cityName/:amount. They are registered with the app object that was passed into the module from the server.js script described above by calling the route method on the app. Within the route method the route is specified and then the get method is called, or chained, on the result of calling route. Inside the chained get method is a callback function that we will further discuss in the section on controllers. This method of defining routes is known as "route chaining".

The first route describes an endpoint that will display the state and city tax rates in response to a GET request corresponding to :stateName and :cityName, respectively. In Express you specify what are known as "route parameters" by preceding a section of a route delimited between forward slashes with a colon to indicate a placeholder for a meaningful route parameter. The second route /calculate/:stateName/:cityName/:amount describes an endpoint that will calculate the city and state tax amounts as well as total amount based off the amount parameter of the route.
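To make the idea of route parameters more concrete, here is a minimal sketch of how a pattern with colon-prefixed placeholders can be matched against a request path. The matchRoute function is a hypothetical helper written for this illustration. It is not how Express implements routing internally, and it ignores features such as optional segments and regular expressions.

```javascript
'use strict';

// Hypothetical helper, NOT Express's implementation: it only illustrates how
// a pattern such as '/taxrate/:stateName/:cityName' maps onto a request path.
const matchRoute = (pattern, path) => {
    const patternParts = pattern.split('/').filter(Boolean);
    const pathParts = path.split('/').filter(Boolean);
    if (patternParts.length !== pathParts.length) {
        return null; // different number of segments, so no match
    }

    const params = {};
    for (let i = 0; i < patternParts.length; i++) {
        const part = patternParts[i];
        if (part.startsWith(':')) {
            // A leading colon marks a placeholder; store the value under its name.
            params[part.slice(1)] = decodeURIComponent(pathParts[i]);
        } else if (part !== pathParts[i]) {
            return null; // a literal segment did not match
        }
    }
    return params;
};

console.log(matchRoute('/taxrate/:stateName/:cityName', '/taxrate/nebraska/omaha'));
console.log(matchRoute('/calculate/:stateName/:cityName/:amount', '/calculate/nebraska/omaha/1000'));
```

In the real application Express does this work for you and exposes the result as req.params. Note that every value arrives as a string, which is why the controller later converts the amount with parseFloat.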

The two other invocations of the app object specify middleware. Express middleware is an incredibly useful feature with many applications, which could easily warrant its own series of articles, so I won't be going into great depth here. Just know that middleware functions hook into the Express request-response cycle: they can access and modify the request and response objects, and they receive a next function (plus an error argument in error-handling middleware) that passes control onward.

You register a middleware function by calling the use method on the app object and passing in unique combinations of routes and callback functions. The first middleware declared on the app object specifies our two URLs within an array and a callback function that checks to see if the state being passed to request tax information is available.

This demo application will only be developed to respond to requests for cities in Nebraska, but someone could quite easily extend it with other states, provided they have a publicly available static webpage of similar information. The second middleware serves as a catch-all for any requested URL paths that are not specified.
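Because a middleware is just a function of (req, res, next), the state check can be tried out without starting a server. The sketch below is one way to do that. The requireKnownState factory, the createFakeResponse stand-in, and the example.com URL are all assumptions made for this illustration and are not part of the article's project.

```javascript
'use strict';

// Factory returning an Express-style middleware (req, res, next) that rejects
// requests for states missing from the given lookup table.
const requireKnownState = (stateUrls) => (req, res, next) => {
    const state = req.params.stateName;
    const key = String(state).toLowerCase();
    if (!Object.prototype.hasOwnProperty.call(stateUrls, key)) {
        res.status(404).send({ message: `No state info found for ${state}` });
        return;
    }
    next(); // known state: hand control to the next handler
};

// Minimal stand-in for the Express response object (hypothetical, illustration only).
const createFakeResponse = () => {
    const res = { statusCode: 200, body: undefined };
    res.status = (code) => { res.statusCode = code; return res; };
    res.send = (body) => { res.body = body; return res; };
    return res;
};

// Placeholder URL, not the real tax page.
const middleware = requireKnownState({ nebraska: 'http://example.com/rates' });

const okRes = createFakeResponse();
let nextCalled = false;
middleware({ params: { stateName: 'Nebraska' } }, okRes, () => { nextCalled = true; });
console.log(nextCalled, okRes.statusCode);

const missingRes = createFakeResponse();
middleware({ params: { stateName: 'Iowa' } }, missingRes, () => {});
console.log(missingRes.statusCode, missingRes.body);
```

Writing the check as a factory keeps the lookup table out of the function body, which may make it easier to add states later.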

Controllers

Controllers are the part of an Express application that handles the actual requests made to the defined routes and returns a response. Strictly speaking, controllers are not a requirement for developing Express applications. A callback function, anonymous or otherwise, can be used instead, but using controllers leads to better-organized code and separation of concerns. That being said, we will be using controllers, as it is always a good idea to follow best practices.

In your controllers/index.js file copy and paste the following code.

'use strict';

const svc = require('../services');

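// Case-insensitive lookup of a city's TaxRate; undefined when the city is not listed.
const findRate = (rates, cityName) => {
    return rates.find(rate => rate.city.toLowerCase() === cityName.toLowerCase());
};

const getTaxRate = (req, res) => {
    const state = req.params.stateName;
    svc.scrapeTaxRates(state, stateUrls[state.toLowerCase()], (rates) => {
        const rate = findRate(rates, req.params.cityName);
        if (!rate) {
            // Without this guard an unknown city would produce an empty response.
            return res.status(404).send({message: `No tax rate found for ${req.params.cityName}`});
        }
        res.send(rate);
    });
};

const calculateTaxes = (req, res) => {
    const state = req.params.stateName;
    svc.scrapeTaxRates(state, stateUrls[state.toLowerCase()], (rates) => {
        const rate = findRate(rates, req.params.cityName);
        if (!rate) {
            // Without this guard rate.calculateTax would throw a TypeError.
            return res.status(404).send({message: `No tax rate found for ${req.params.cityName}`});
        }
        res.send(rate.calculateTax(parseFloat(req.params.amount)));
    });
};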


const stateUrls = {
    nebraska: 'http://www.revenue.nebraska.gov/question/sales.html'
};

module.exports = {  
    getTaxRate,
    calculateTaxes,
    stateUrls
};

The first thing you see being imported and declared in the controllers module is a constant called svc, which is short for "service". This service object serves as a reusable piece of functionality to request a webpage and parse the resultant HTML. I will go into more depth in the section on Cheerio and services about what is going on behind the scenes with this service object, but for now just know that it parses the HTML for the meaningful bits we are interested in (i.e., tax rates).

The two functions that we are most interested in are getTaxRate and calculateTaxes. Both functions receive the request and response objects (req and res) from the get(...) handlers registered in the routes module. The getTaxRate function reads the stateName route parameter from the params object of the request.

The state name and its corresponding target URL (in this case only Nebraska and its government webpage displaying taxable information) are passed to the service object's method scrapeTaxRates. A callback function is passed as a third parameter to filter out and respond with the city information corresponding to the cityName parameter found in the route path.

The second controller function, calculateTaxes, again uses the service method scrapeTaxRates to request and parse the HTML, but this time it calculates the taxes via a method within the TaxRate class, which we'll discuss next in the section on models.

Models

Similar to controllers, models are not something that are strictly required for an Express application. However, models come in quite handy when we want to encapsulate data (state) and behavior (actions) within our applications in an organized manner.

In your models/index.js file, copy and paste the following code:

'use strict'

class TaxRate {  
    constructor(state, city, localRate, stateRate) {
        this.state = state;
        this.city = city;
        this.localRate = localRate;
        this.stateRate = stateRate;
    }

    calculateTax (subTotal) {
        const localTax = this.localRate * subTotal;
        const stateTax = this.stateRate * subTotal;
        const total = subTotal + localTax + stateTax;
        return {
            localTax,
            stateTax,
            total
        };
    }
}

module.exports = TaxRate;  

The only model (or, more correctly stated, class) that we will define in our application is TaxRate. TaxRate contains member fields for holding data on state, city, local tax rate, and state tax rate. These are the class fields that make up the state of the object. There is only one class method, calculateTax(...), which takes a subtotal (the :amount parameter of the /calculate/:stateName/:cityName/:amount route) and returns an object containing the calculated local tax, state tax, and final total.
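As a quick worked example of the calculation, the sketch below repeats the TaxRate class so that it runs on its own and applies it to a $1000 subtotal. The rates used here (1.5% local, 5.5% state) are assumed sample values for illustration, and toCents is a hypothetical helper that is not part of the article's code.

```javascript
'use strict';

// Copy of the TaxRate class from models/index.js so the example is self-contained.
class TaxRate {
    constructor(state, city, localRate, stateRate) {
        this.state = state;
        this.city = city;
        this.localRate = localRate;
        this.stateRate = stateRate;
    }

    calculateTax(subTotal) {
        const localTax = this.localRate * subTotal;
        const stateTax = this.stateRate * subTotal;
        const total = subTotal + localTax + stateTax;
        return { localTax, stateTax, total };
    }
}

// Hypothetical helper: round a dollar amount to whole cents, which also hides
// the tiny errors that binary floating-point multiplication can introduce.
const toCents = (amount) => Math.round(amount * 100) / 100;

// Assumed sample rates; the real values come from the scraped page.
const sample = new TaxRate('Nebraska', 'Omaha', 0.015, 0.055);
const result = sample.calculateTax(1000);

console.log({
    localTax: toCents(result.localTax),
    stateTax: toCents(result.stateTax),
    total: toCents(result.total)
});
```

With these assumed rates the local tax comes to $15, the state tax to $55, and the total to $1070.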

Cheerio

Cheerio is a lightweight JavaScript library that implements a subset of core jQuery for accessing, selecting, and querying HTML in server-side apps. In our case we will be using Cheerio to parse the HTML of the static webpage we request from the Nebraska government's website displaying tax information.

Services

In our little application we will use a custom services module to implement the requesting of the HTML page from the Nebraska government's web site as well as the parsing of the resultant HTML to extract the data we desire.

In your services/index.js file copy and paste the following code:

'use strict';

const http = require('http');  
const cheerio = require('cheerio');  
const TaxRate = require('../models');

const scrapeTaxRates = (state, url, cb) => {  
    http.get(url, (res) => {
        let html = '';

        res.on('data', chunk => {
            html += chunk;
        });

        res.on('end', () => {
            const parser = new Parser(state);
            const rates = parser.parse(html);
            cb(rates);
        });
    });
};

class Parser {  
    constructor(state) {
        this.state = state;
    }

    parse(html) {
        switch(this.state.toLowerCase()) {
            case 'nebraska':
                return this.parseNebraska(html);
            default:
                return null;
        }
    }

    parseNebraska(html) {
        const $ = cheerio.load(html);
        let rates = [];
        $('tr').each((idx, el) => {
            const cells = $(el).children('td');
            if (cells.length === 5 && !$(el).attr('bgcolor')) {
                const rawData = {
                    city: $(cells[0]).first().text(),
                    cityRate: $(cells[1]).first().text(),
                    totalRate: $(cells[2]).first().text()
                };
                rawData.cityRate = parseFloat(rawData.cityRate.replace('%', ''))/100;
                rawData.totalRate = parseFloat(rawData.totalRate.substr(0, rawData.totalRate.indexOf('%')))/100;
                rawData.stateRate = rawData.totalRate - rawData.cityRate;
                rates.push(new TaxRate('Nebraska', rawData.city, rawData.cityRate, rawData.stateRate));
            }
        });
        return rates;
    }
}

module.exports = {
    scrapeTaxRates
};

The first three lines import (via require()) the module-level objects http, cheerio, and TaxRate. TaxRate was described in the previous section on models, so rather than beat the proverbial dead horse, suffice it to say that it is used to store tax rate data and calculate taxes.

The http object is a Node module that is used to make requests from the server to another networked resource, which in our case is the tax rate webpage from the Nebraska government. The remaining one is Cheerio, which is used to parse the HTML using the familiar jQuery API.

The services module only exposes one publicly available function called scrapeTaxRates, which takes a state name string, URL string (for the state's page displaying tax rates), and a callback function to process the tax rates in unique ways specified by the calling client code.

Within the body of the scrapeTaxRates function, the get method of the http object is called to request the webpage at the specified URL. The callback function passed to the http.get(...) method handles the processing of the response. In the processing of the response, a string of HTML is built and stored in a variable called html. This is done incrementally: each time the data event fires, a buffered chunk of the response is appended to the string.
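The chunk-by-chunk accumulation works the same way for any readable stream, so it can be sketched without touching the network. In the example below, collectBody is a hypothetical helper and the stream built with Readable.from stands in for the HTTP response. The error-first callback is an assumption of this sketch, since the service in this article passes only the parsed rates to its callback.

```javascript
'use strict';

const { Readable } = require('stream');

// Collect every chunk emitted by a readable stream into one string, then call
// a Node-style callback: cb(err) on failure, cb(null, body) on success.
const collectBody = (stream, cb) => {
    let body = '';
    stream.on('data', (chunk) => {
        body += chunk; // each chunk is appended as it arrives
    });
    stream.on('end', () => cb(null, body));
    stream.on('error', (err) => cb(err));
};

// Stand-in for an HTTP response: a stream that delivers the HTML in three chunks.
const fakeResponse = Readable.from(['<table><tr>', '<td>Omaha</td>', '</tr></table>']);

collectBody(fakeResponse, (err, html) => {
    if (err) {
        console.error('request failed:', err.message);
        return;
    }
    console.log(html);
});
```

With a real request, connection failures are reported through an error event on the request object returned by http.get(...), so that is a sensible place to attach a handler. Calling setEncoding('utf8') on the response is also worth considering, because it prevents multi-byte characters from being split across chunks.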

Upon the firing of the end event a final callback function is invoked. Inside this callback the Parser class is instantiated and the parse method is called to parse the HTML and extract the information specific to the structure and layout of the Nebraska webpage. The parsed data is loaded into a series of TaxRate objects stored in an array and passed to the callback function to execute the logic specified in the calling client code (in our case, in the controller functions described earlier). It's in this last step that data is serialized and sent as a response to the caller of the REST API.
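The numeric part of the parsing step can be looked at separately from the HTML handling. The sketch below shows one way to turn percentage strings into fractions and to derive the state rate from the total and city rates, as parseNebraska does. The percentToFraction and roundRate helpers and the sample cell values are assumptions made for this illustration.

```javascript
'use strict';

// Convert a cell such as '5.5%' (possibly followed by other text) into 0.055.
// Returns NaN when no number precedes the percent sign.
const percentToFraction = (text) => {
    const idx = text.indexOf('%');
    const numeric = idx === -1 ? text : text.slice(0, idx);
    return parseFloat(numeric) / 100;
};

// Hypothetical helper: trim the floating-point noise left over from subtraction
// by rounding to six decimal places.
const roundRate = (rate) => Math.round(rate * 1e6) / 1e6;

// Assumed sample cells, shaped like the city-rate and total-rate columns.
const cityRate = percentToFraction('1.5%');
const totalRate = percentToFraction('7.0%');
const stateRate = roundRate(totalRate - cityRate);

console.log({ cityRate, totalRate, stateRate });
console.log(Number.isNaN(percentToFraction('n/a')));
```

Subtracting two binary floating-point fractions rarely yields a clean decimal, so rounding the derived state rate may be worthwhile before it is stored in a TaxRate object. Checking for NaN is a cheap way to notice when the page layout has changed.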

Conclusion

In this short article we investigated how to design a simple lightweight Node.js application that scrapes useful tax rate information from a government website, which could be useful for e-commerce applications. The two primary purposes of the application are to collect tax rates, and either display that information for a given city, or to calculate taxes based on a state, city, and subtotal.

For example, below you will find screenshots of the application displaying taxes for the city of Omaha, and calculating taxes for a subtotal of $1000. To test this application, cd into the root directory and run node server.js in the console. You will see a message that says, "Node application running on port 3500". You can then request paths such as /taxrate/nebraska/omaha or /calculate/nebraska/omaha/1000 on that port.

I hope this article inspires you to further investigate the world of data scraping so you can create useful applications and meaningful data products. As always, I welcome any and all comments below.
