Automate and scrape a website with XPath Expressions

I've been scraping websites for a long time. One of my first "real" programming projects used Java and Selenium to automate a website and scrape data. I later used Node.js to pull data from many different education-related websites so teachers could see it all in one place.

In this post, I want to share some tips for using XPath expressions to find page elements and data. Scroll to the bottom for two helper functions that make it just as easy to use XPath expressions as CSS selectors!

Why Scrape?

In a perfect world, every site would be regularly maintained and would have an API you could use to efficiently get the data you need. In reality, the data we need often lives on an older legacy site that has no API and is clunky and time-consuming to use.

We might use web scraping to make sure people aren't doing work that a computer can do automatically. For example, instead of having a person search for individual records manually, we can make a list of records we want to search and build an automation that finds those records. If you have multiple sources of data, you can scrape them concurrently with an automation.

In summary, for repetitive lookups like these, automated web scraping is usually far more efficient than manual searching and data entry.

What is an XPath Expression?

XPath stands for "XML Path Language." Since HTML is an XML-like language, we can use XPath to find specific HTML nodes. This is similar to using a CSS selector to find an element on the page, but XPath can be much more powerful.

For example, here is how we would find the same elements with a CSS selector and with an XPath expression:

Description	CSS Selector	XPath Expression
H1 Element	`h1`	`//h1`
Paragraph within a section	`section p`	`//section//p`
Anchor element with title "home"	`a[title="home"]`	`//a[@title="home"]`
Second direct descendant list item in an ordered list	`ol > li:nth-of-type(2)`	`//ol/li[2]`
Paragraph with exact text "lorem ipsum"	impossible	`//p[text()="lorem ipsum"]`
Paragraph containing text "lorem ipsum"	impossible	`//p[contains(text(), "lorem ipsum")]`

As you can see, XPath expressions have capabilities beyond CSS selectors. For example, XPath expressions can find an element based on its text content, which CSS selectors cannot do.
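One practical wrinkle with text matching: when the text comes from a variable (say, a list of records to search for), a quote character inside it can break the expression, and XPath 1.0 has no escape character for string literals. Here is a minimal sketch of one way around that. The helper names `toXPathLiteral` and `paragraphByExactText` are my own inventions for illustration, not part of any browser API.

```javascript
// Hypothetical helper: turn arbitrary text into a valid XPath 1.0 string literal.
const toXPathLiteral = (text) => {
  // No double quotes inside: wrap in double quotes.
  if (!text.includes('"')) return `"${text}"`;
  // Double quotes but no single quotes: wrap in single quotes.
  if (!text.includes("'")) return `'${text}'`;
  // Both kinds of quotes: split on the double quotes and glue the pieces
  // back together with concat(), writing each double quote as '"'.
  const parts = text.split('"').map((part) => `"${part}"`);
  const joined = parts.join(", '\"', ");
  return `concat(${joined})`;
};

// Hypothetical helper: build the "exact text" expression from the table above.
const paragraphByExactText = (text) => `//p[text()=${toXPathLiteral(text)}]`;
```

With this in place, `paragraphByExactText('lorem ipsum')` produces the same `//p[text()="lorem ipsum"]` expression shown in the table, and text containing quotes still yields a valid expression.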

How to use XPath Expressions in JavaScript

Most developers quickly learn about the querySelector and querySelectorAll methods for querying the DOM with CSS selectors. For example:

// find a heading
const heading = document.querySelector('h1');

// find all paragraphs
const paragraphs = document.querySelectorAll('p');

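Similarly, the document.evaluate function allows us to search the DOM using an XPath expression:

// find a heading
const heading = document.evaluate(
  '//h1',
  document,
  null,
  // FIRST_ORDERED_NODE_TYPE returns the first match in document order;
  // ANY_UNORDERED_NODE_TYPE may return any matching node.
  XPathResult.FIRST_ORDERED_NODE_TYPE,
  null,
).singleNodeValue;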

// find all paragraphs and log their text content
const paragraphs = document.evaluate(
  '//p',
  document,
  null,
  XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
  null,
);

for (let i = 0; i < paragraphs.snapshotLength; i++) {
  const text = paragraphs.snapshotItem(i).textContent;
  console.log(text);
}

document.evaluate parameters

The document.evaluate function receives several parameters. This makes it seem more complicated than the querySelector and querySelectorAll functions. This is the price we pay for the power of XPath.

I'll be showing you a wrapper function to make XPath much easier to use. Since you won't have to worry about the details, I'm going to skip most of them. You can always check the documentation over at MDN if you want to know more:

https://developer.mozilla.org/en-US/docs/Web/API/Document/evaluate
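One detail that may be worth a quick look is the fourth parameter, the result type. Besides nodes, document.evaluate can return plain numbers and strings. The sketch below assumes a browser environment, and the helper names `countByXpath` and `getStringByXpath` are hypothetical ones I made up for this example.

```javascript
// Hypothetical helper: how many nodes match the expression?
// Wraps the expression in XPath's count() and asks for a number back.
const countByXpath = (xpathExpression, contextNode = document) => document.evaluate(
  `count(${xpathExpression})`,
  contextNode,
  null,
  XPathResult.NUMBER_TYPE,
  null,
).numberValue;

// Hypothetical helper: text of the first matching node as a string.
// XPath's string() yields an empty string when nothing matches.
const getStringByXpath = (xpathExpression, contextNode = document) => document.evaluate(
  `string(${xpathExpression})`,
  contextNode,
  null,
  XPathResult.STRING_TYPE,
  null,
).stringValue;
```

For example, `countByXpath('//p')` would give the number of paragraphs on the page, and `getStringByXpath('//h1')` would give the heading text without touching the node itself.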

Search within a parent element using XPath Context Nodes

One thing you do want to be aware of is how you can search within a specific element using an XPath expression. You can do this with the contextNode parameter of document.evaluate. It is the second parameter.

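In the examples above, the context node was always the document itself. However, it can be any node you provide. To return only matches within that node, make the expression relative by starting it with a dot, as in `.//p`. An expression that starts with `//` is absolute, so it still searches the entire document no matter which context node you pass.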

This is similar to how we can call querySelector on the document itself, but we can also call it on a specific element within the document:

// searches the whole document for paragraphs
const allParagraphs = document.querySelectorAll('p');

// searches only for paragraphs inside the section element
const section = document.querySelector('section');
const sectionParagraphs = section.querySelectorAll('p');

This is what it looks like to search a specific element for descendants using XPath:

// searches only for paragraphs inside the section element
const section = document.querySelector('section');

const sectionParagraphs = document.evaluate(
  './/p', // <-- the leading "." keeps the search inside the context node
  section, // <-- notice the change!
  null,
  XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
  null,
);

for (let i = 0; i < sectionParagraphs.snapshotLength; i++) {
  const text = sectionParagraphs.snapshotItem(i).textContent;
  console.log(text);
}
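A common gotcha here is that an expression beginning with `//` always starts from the document root, even when a context node is supplied. Only a relative expression such as `.//p` is limited to the context node's descendants. If you tend to forget the dot, a tiny helper can add it for you. This is only a sketch: `toRelativeXPath` is a hypothetical name, and it handles just the simple leading `//` case, not unions like `//a | //b` or expressions wrapped in functions.

```javascript
// Hypothetical helper: make a "//..." expression relative to its context node.
// Simplification: only a leading "//" is rewritten; everything else is
// returned unchanged (apart from trimming surrounding whitespace).
const toRelativeXPath = (xpathExpression) => {
  const trimmed = xpathExpression.trim();
  return trimmed.startsWith('//') ? `.${trimmed}` : trimmed;
};
```

You would then pass `toRelativeXPath('//p')`, which evaluates to `.//p`, as the first argument to document.evaluate along with your context node.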

Mix and Match XPath Expressions and CSS Selectors

You can use XPath expressions and CSS selectors side by side in the same code. Both return references to DOM nodes, so an element found with one can serve as the starting point for a search with the other.
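Here is a rough sketch of what that mixing can look like, assuming a browser environment. The page structure (sections containing paragraphs and links) and the function name `findLoremLinks` are hypothetical and only serve the illustration.

```javascript
// Hypothetical example: CSS selector -> XPath -> CSS selector.
function findLoremLinks(root = document) {
  const links = [];
  // CSS selector: collect every section under the root.
  for (const section of root.querySelectorAll('section')) {
    // XPath, relative to this section: the first paragraph whose text
    // contains "lorem ipsum" (something a CSS selector cannot express).
    const paragraph = document.evaluate(
      './/p[contains(text(), "lorem ipsum")]',
      section,
      null,
      XPathResult.FIRST_ORDERED_NODE_TYPE,
      null,
    ).singleNodeValue;
    if (paragraph === null) continue;
    // CSS selector again: the first link inside the paragraph XPath found.
    const link = paragraph.querySelector('a');
    if (link !== null) links.push(link);
  }
  return links;
}
```

Calling `findLoremLinks()` returns an array of anchor elements, one per section at most, and the array is empty when nothing matches.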

Simple document.evaluate wrappers

Using document.evaluate is more verbose than using a CSS selector. However, we can simplify things by creating a wrapper function for document.evaluate that mimics querySelector and querySelectorAll.

Find one element by XPath easily

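You can use this function to easily find a single document node using an XPath expression. This is similar to using the querySelector function:

/**
 * Shorthand for calling `document.evaluate` to get a single node via XPath.
 * @param {string} xpathExpression - XPath expression to evaluate
 * @param {Node} [contextNode=document] - node to evaluate the expression against
 * @returns {Node|null} the first matching node in document order, or null if nothing matches
 */
const getNodeByXpath = (xpathExpression, contextNode = document) => document.evaluate(
  xpathExpression,
  contextNode,
  null,
  // FIRST_ORDERED_NODE_TYPE returns the first match in document order, like
  // querySelector; ANY_UNORDERED_NODE_TYPE may return any matching node.
  XPathResult.FIRST_ORDERED_NODE_TYPE,
  null,
).singleNodeValue;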

// From root of document
const heading = getNodeByXpath('//h1');

// From specific ancestor element ("context node").
// The leading "." makes the expression relative to that node.
const section = document.querySelector('section');
const sectionHeading = getNodeByXpath('.//h2', section);

Find multiple elements by XPath easily

You can use this function to easily find multiple document nodes using an XPath expression. This is similar to using the querySelectorAll function:

/**
 * Shorthand for calling `document.evaluate` to get multiple nodes via XPath.
 * Returns an array of nodes, which may be empty.
 * @param {string} xpathExpression - XPath expression to evaluate
 * @param {Node} [contextNode=document] - node to evaluate the expression against
 * @returns {Node[]} matching nodes in document order
 */
const getNodesByXpath = (xpathExpression, contextNode = document) => {
  const result = document.evaluate(
    xpathExpression,
    contextNode,
    null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
    null,
  );

  const nodes = [];
  for (let i = 0; i < result.snapshotLength; i++) {
    const node = result.snapshotItem(i);
    nodes.push(node);
  }
  return nodes;
};

// From root of document
const paragraphs = getNodesByXpath('//p');

// From specific ancestor element ("context node").
// The leading "." makes the expression relative to that node.
const section = document.querySelector('section');
const sectionParagraphs = getNodesByXpath('.//p', section);

References

For help building your own XPath expressions, check out this cheatsheet:

https://devhints.io/xpath