How to automate and scrape a React web app

I’ve mentioned before that I’ve been automating and scraping websites for a long time. It’s not always pretty, but sometimes it’s absolutely necessary. In this post I want to share a few tips and tricks for automating React apps.

The Irony

There’s irony here. Often, we scrape websites because they are old and unmaintained and never had an API… we simply have no other choice. However, React is a more modern framework. Check out this timeline:

  • 2006 – Facebook releases its app platform and API
  • 2009 – Node.js is released, allowing API servers to be written in JavaScript
  • 2013 – Facebook releases React

Any website that uses React could also have been created with a convenient API for data access. In 2006, APIs were already popular enough that Facebook offered one, and then in 2009 Node.js enabled developers to use JavaScript to build API servers. This means any devs who built a React app could have also built an API server.

To be fair, most React apps probably are making HTTP requests to get things done, but that doesn’t mean there’s an API there for you to consume. The backend of a site is often built to only handle the needs of the frontend, with things like authentication and session management tightly coupling the two together. If that’s the case, it can be better to scrape the React frontend than to try to interact directly with the server.

Typical Automation

Automating a website means getting a computer to use the website as if it were a person, clicking on things and entering text and copying data from the page.

There are different tools for doing this like Selenium and Puppeteer, which have their own methods to call. We can also scrape a website without any external tools. For example, we can run a script in the browser console. We can also create a bookmarklet to load an external JavaScript file and run it. In this case, we will use plain vanilla JavaScript to move through the website and find data.
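
As a rough illustration of the bookmarklet idea, the sketch below builds a javascript: URL that injects a script tag into the current page. The helper name buildBookmarklet and the example.com URL are made up for this example, and sites with a strict Content Security Policy may block the injected script.

```javascript
// Hypothetical helper: builds a bookmarklet URL that loads an external script into the current page.
const buildBookmarklet = (scriptUrl) => {
  // JSON.stringify produces a safely quoted JavaScript string literal for the URL.
  const source = `(function(){var s=document.createElement('script');s.src=${JSON.stringify(scriptUrl)};document.body.appendChild(s);})();`;
  // Encode the source so it survives being stored as a bookmark URL.
  return `javascript:${encodeURIComponent(source)}`;
};

// Example: paste the printed result into the URL field of a new bookmark.
console.log(buildBookmarklet('https://example.com/my-scraper.js'));
```

Clicking the bookmark then runs the external file in the context of whatever page is open.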

We can find elements with CSS selectors or XPath expressions.

If we want to click on a link or a button, we can use the element.click() method.

If we want to enter text into a form, we can simply find the <input> or <textarea> element and set its value: element.value = "robot user input".

The above approach also works for selecting an option in a <select> dropdown.
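
Put together, the usual approach might look something like the sketch below. The helper names and the selectors in the usage comments are made up for illustration; this is the plain-DOM style that works on classic pages.

```javascript
// Hypothetical helpers for a classic (non-React) page. `root` is usually `document`.
const findOrThrow = (root, selector) => {
  const element = root.querySelector(selector);
  if (!element) throw new Error(`No element matches "${selector}"`);
  return element;
};

// Click a link or button.
const clickElement = (root, selector) => findOrThrow(root, selector).click();

// Set the value of an <input>, <textarea>, or <select>.
const setElementValue = (root, selector, value) => {
  const element = findOrThrow(root, selector);
  element.value = value;
  return element;
};

// Usage in the browser console (selectors are made up):
//   setElementValue(document, 'input[name="q"]', 'robot user input');
//   setElementValue(document, 'select[name="sort"]', 'newest');
//   clickElement(document, 'button[type="submit"]');
```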

However, there are unique challenges with automating React apps that make these usual approaches ineffective. React has a reactivity system (hence the name) that listens for user input events and then uses them to update its internal state data.

This reactivity system listens for specific events. Changing an element’s value directly fires no events at all, and calling an element’s click() method fires only a click event, without the mousedown and mouseup events that accompany a real click. The page may appear to respond as usual, but React’s internal state will not reflect the changes we made on the page.

Eventually, our automation will likely fail because React is not updating in response to our input.

Clicking in a React app

For many websites, using element.click() is enough to click the element from JavaScript. In fact, we can still use that method to follow links and submit some forms.

However, many React components require a click event to function properly. Maybe a switch component will only toggle when clicked. Maybe a dropdown or menu will only open when clicked. In these cases, we need to trigger React’s reactivity system.

To do so, we will dispatch a MouseEvent three times, once for each type of event that fires when a user clicks an element, in the same order the browser fires them:

  • mousedown
  • mouseup
  • click

This ensures that React will recognize the click no matter what specific mouse events it was listening for. If React was expecting all three to happen in order, we satisfy that requirement:

/**
* Simulates a mouse click event in the browser in a way that React will recognize.
* @param {Element} element - the element to click
*/
const simulateMouseClick = (element) => {
  // Same order a real click produces: press, release, then click.
  const mouseClickEvents = ['mousedown', 'mouseup', 'click'];
  mouseClickEvents.forEach(mouseEventType =>
    element.dispatchEvent(
      new MouseEvent(mouseEventType, {
        view: window,
        bubbles: true,
        cancelable: true,
        buttons: 1,
      }),
    ),
  );
};

React app mouse hover

There are times when we need to hover our cursor over an element to proceed with automation. For example, we may need to move our mouse over a dropdown component in order for a menu to appear.

On some websites, such a menu will always exist but will be hidden by CSS classes. In that case we can still click links in the menu even if they are hidden. With React, however, those menu options won’t even be rendered unless the menu is activated by a mouse hover event.

In order to simulate a hover event, we will dispatch a mousemove event in the middle of our target element. React’s event listeners will pick up this event and recognize that we are hovering over the element:

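/**
* Simulates mouse movement across an element in a way that React will recognize.
* @param {Element} element - the element to hover over
*/
const simulateMouseHover = (element) => {
  // clientX/clientY are viewport coordinates, so measure with getBoundingClientRect().
  // (offsetLeft/offsetTop are relative to the offsetParent, not the viewport.)
  const rect = element.getBoundingClientRect();
  const x = rect.left + rect.width / 2;
  const y = rect.top + rect.height / 2;

  element.dispatchEvent(
    new MouseEvent('mousemove', {
      view: window,
      bubbles: true,
      cancelable: true,
      buttons: 0, // no button is held down during a plain hover
      clientX: x,
      clientY: y,
    }),
  );
};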
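
Some components listen with React's onMouseEnter or onMouseOver props rather than for mouse movement. As far as I know, React derives both of those from the native mouseover event, so a mousemove alone may not be enough for them. The sketch below is a hypothetical variant (simulateMouseEnter is a name made up here) that dispatches mouseover first and mousemove second. Treat it as something to try, not a guaranteed fix.

```javascript
// Hypothetical variant: "enter" an element, then move the mouse over its center.
const simulateMouseEnter = (element) => {
  // Viewport coordinates of the element's center.
  const rect = element.getBoundingClientRect();
  const eventInit = {
    bubbles: true,
    cancelable: true,
    clientX: rect.left + rect.width / 2,
    clientY: rect.top + rect.height / 2,
  };

  // mouseover comes first, as it would when a real cursor arrives over the element.
  ['mouseover', 'mousemove'].forEach((type) =>
    element.dispatchEvent(new MouseEvent(type, eventInit)),
  );
};
```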

React app text entry

Entering text in a React app is very similar to how we would enter text in any website. We find the form input element we want to update, and we set its value directly.

However, React keeps track of each input’s value by wrapping the value setter on the element itself. When we assign element.value directly, React’s wrapper records the new value. If an input event fires afterwards, React compares the recorded value with the current one, sees no difference, and ignores the event. Our text appears in the browser, but React’s state never receives it.

For example, if we enter text into a React form and then submit that form, React won’t actually have our data. It will appear in the browser, but React will have an empty form since it didn’t “hear” the input events that should have fired.

To get around this, we need to do two things:

  1. Get the original (native) setter function from the element’s prototype and call it, so the element’s value changes without going through React’s wrapper
  2. Dispatch an input event from that element so React “hears” the change happen and pulls the element’s value into its internal state

This is what that looks like:

/**
* Enters text in an input field in a way that React will respond to.
* Directly setting an input value works, but React will not copy the value to its internal model.
* The value would disappear if the component is re-rendered, and even if it didn't it would never
* be submitted in an API request.
* @param {HTMLInputElement} input - the <input> element to fill
* @param {string} text - the text to enter
*/
const simulateTextEntry = (input, text) => {
  const nativeInputValueSetter = Object.getOwnPropertyDescriptor(window.HTMLInputElement.prototype, 'value').set;
  nativeInputValueSetter.call(input, text);

  const event = new Event('input', { bubbles: true });
  input.dispatchEvent(event);
};

Reference: https://stackoverflow.com/a/46012210
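
The function above looks up the setter on HTMLInputElement.prototype, so it only covers <input> elements. One possible generalization, sketched below, walks the element's own prototype chain to find the native value setter, which should also cover <textarea> and <select>. The names findNativeValueSetter and simulateValueChange are made up here, and the suggestion that a <select> needs a change event rather than an input event is my assumption, not something from the referenced answer.

```javascript
// Hypothetical helper: finds the native `value` setter on the element's prototype chain,
// skipping any setter that was installed on the element instance itself.
const findNativeValueSetter = (element) => {
  let proto = Object.getPrototypeOf(element);
  while (proto) {
    const descriptor = Object.getOwnPropertyDescriptor(proto, 'value');
    if (descriptor && typeof descriptor.set === 'function') return descriptor.set;
    proto = Object.getPrototypeOf(proto);
  }
  return null;
};

// Hypothetical helper: sets a value with the native setter, then fires a bubbling event.
const simulateValueChange = (element, value, eventType = 'input') => {
  const setter = findNativeValueSetter(element);
  if (!setter) throw new TypeError('Element has no native value setter');
  setter.call(element, value);
  element.dispatchEvent(new Event(eventType, { bubbles: true }));
};

// Usage in the browser console (selectors are made up):
//   simulateValueChange(document.querySelector('textarea'), 'robot user input');
//   simulateValueChange(document.querySelector('select'), 'newest', 'change'); // assumption
```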

Automate and scrape a website with XPath Expressions

I’ve been scraping websites for a long time. One of my first “real” programming projects used Java and Selenium to automate a website and scrape data. I later used Node.js to pull data from many different education-related websites so teachers could see it all in one place.

In this post, I want to share some tips for using XPath expressions to find page elements and data. Scroll to the bottom for two helper functions that make it just as easy to use XPath expressions as CSS selectors!

Why Scrape?

In a perfect world, every site would be regularly maintained and would have an API you could use to efficiently get the data you need. In reality, we often need to use an older legacy site to access the data we need. There is no API, and the website we need to use is clunky and time consuming.

We might use web scraping to make sure people aren’t doing work that a computer can do automatically. For example, instead of having a person search for individual records manually, we can make a list of records we want to search and build an automation that finds those records. If you have multiple sources of data, you can scrape them concurrently with an automation.

In summary, for repetitive lookups and data entry, automated web scraping is usually far more efficient than doing the same work by hand.

What is an XPath Expression?

XPath stands for “XML Path Language.” Since HTML is an XML-like language, we can use XPath to find specific HTML nodes. This is similar to using a CSS Selector to find an element on the page, but it can be much more powerful.

For example, this is how we would find the same element using a CSS selector and XPath expression:

| Description | CSS Selector | XPath Expression |
| --- | --- | --- |
| H1 element | h1 | //h1 |
| Paragraph within a section | section p | //section//p |
| Anchor element with title “home” | a[title="home"] | //a[@title="home"] |
| Second direct descendant list item in an ordered list | ol > li:nth-of-type(2) | //ol/li[2] |
| Paragraph with exact text “lorem ipsum” | impossible | //p[text()="lorem ipsum"] |
| Paragraph containing text “lorem ipsum” | impossible | //p[contains(text(), "lorem ipsum")] |

XPath expressions give us more power than just CSS selectors.

As you can see, XPath expressions have capabilities beyond CSS selectors. For example, XPath expressions can find an element based on its text content.
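
When the text to match comes from a variable, quoting needs some care: XPath 1.0 string literals have no escape character, so text containing both kinds of quote has to be assembled with concat(). The helpers below are a minimal sketch of that idea. Their names are made up for this example.

```javascript
// Hypothetical helper: turns arbitrary text into a valid XPath 1.0 string literal.
const toXPathLiteral = (text) => {
  if (!text.includes('"')) return `"${text}"`;
  if (!text.includes("'")) return `'${text}'`;
  // Both quote types present: split on double quotes and glue the pieces back with concat().
  const parts = text.split('"').map((part) => `"${part}"`);
  return `concat(${parts.join(`, '"', `)})`;
};

// Hypothetical helpers: build the text-matching expressions from the table above.
const xpathForExactText = (tag, text) => `//${tag}[text()=${toXPathLiteral(text)}]`;
const xpathForContainedText = (tag, text) =>
  `//${tag}[contains(text(), ${toXPathLiteral(text)})]`;

console.log(xpathForExactText('p', 'lorem ipsum'));
// prints: //p[text()="lorem ipsum"]
```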

How to use XPath Expressions in JavaScript

Most developers quickly learn about the querySelector and querySelectorAll methods for querying the DOM with CSS selectors. For example:

// find a heading
const heading = document.querySelector('h1');

// find all paragraphs
const paragraphs = document.querySelectorAll('p');

Similarly, the document.evaluate function allows us to search the DOM using an XPath:

// find a heading
const heading = document.evaluate(
  '//h1',
  document,
  null,
  XPathResult.ANY_UNORDERED_NODE_TYPE,
  null,
).singleNodeValue;

// find all paragraphs and log their text content
const paragraphs = document.evaluate(
  '//p',
  document,
  null,
  XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
  null,
);

for (let i = 0; i < paragraphs.snapshotLength; i++) {
  const text = paragraphs.snapshotItem(i).textContent;
  console.log(text);
}

document.evaluate parameters

The document.evaluate function receives several parameters. This makes it seem more complicated than the querySelector and querySelectorAll functions. This is the price we pay for the power of XPath.

I’ll be showing you a wrapper function to make XPath much easier to use. Since you won’t have to worry about the details, I’m going to skip most of them. You can always check the documentation over at MDN if you want to know more:

https://developer.mozilla.org/en-US/docs/Web/API/Document/evaluate
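
For a quick feel of what each argument does, here is a small annotated sketch. It uses a different result type from the examples above (NUMBER_TYPE, read through numberValue) to show that the fourth parameter controls what comes back. The helper name countMatches is made up for this example.

```javascript
// Hypothetical helper: counts the nodes matching an XPath expression.
const countMatches = (xpathExpression, contextNode = document) =>
  document.evaluate(
    `count(${xpathExpression})`, // 1. xpathExpression: the XPath string to evaluate
    contextNode,                 // 2. contextNode: the node the search starts from
    null,                        // 3. namespaceResolver: only needed for namespaced XML
    XPathResult.NUMBER_TYPE,     // 4. resultType: here we ask for a number back
    null,                        // 5. result: an existing XPathResult to reuse, or null
  ).numberValue;

// Usage in the browser console:
//   countMatches('//p');
```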

Search within a parent element using XPath Context Nodes

One thing you do want to be aware of is how you can search within a specific element using an XPath expression. You can do this with the contextNode parameter of document.evaluate. It is the second parameter.

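In the examples above, the context node was always the document itself. However, it can be any element you provide. Calling document.evaluate with a specific context node limits the search to that node, as long as the expression is relative: it must start with a dot, as in .//p. An expression that starts with // always searches from the document root, no matter which context node you pass.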

This is similar to how we can call querySelector on the document itself, but we can also call it on a specific element within the document:

// searches the whole document for paragraphs
const allParagraphs = document.querySelectorAll('p');

// searches only for paragraphs inside the section element
const section = document.querySelector('section');
const sectionParagraphs = section.querySelectorAll('p');

This is what it looks like to search a specific element for descendants using XPath:

// searches only for paragraphs inside the section element
const section = document.querySelector('section');

const sectionParagraphs = document.evaluate(
  './/p', // <-- leading dot: search relative to the context node, not the document root
  section, // <-- notice the change!
  null,
  XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
  null,
);

for (let i = 0; i < sectionParagraphs.snapshotLength; i++) {
  const text = sectionParagraphs.snapshotItem(i).textContent;
  console.log(text);
}

Mix and Match XPath Expressions and CSS Selectors

You can use XPath expressions and CSS selectors together in the same code. Both return references to DOM nodes, and either one can take a node found by the other as the starting point for a search of its descendants.
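
Here is a minimal sketch of that hand-off: a CSS selector finds a list, an XPath expression finds the items inside it by their text, and a CSS selector then finds a link inside each item. The helper name and the selector in the usage comment are made up, and the sketch assumes the search text contains no double quotes.

```javascript
// Hypothetical helper: CSS selector -> XPath -> CSS selector again.
const findLinksInItemsWithText = (listSelector, text, doc = document) => {
  // 1. CSS selector: find the list.
  const list = doc.querySelector(listSelector);
  if (!list) return [];

  // 2. XPath, relative to the list (note the leading dot): items containing the text.
  //    Assumes `text` contains no double quotes.
  const items = doc.evaluate(
    `.//li[contains(., "${text}")]`,
    list,
    null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
    null,
  );

  // 3. CSS selector again, this time on each node that XPath returned.
  const links = [];
  for (let i = 0; i < items.snapshotLength; i++) {
    const link = items.snapshotItem(i).querySelector('a');
    if (link) links.push(link);
  }
  return links;
};

// Usage in the browser console (selector is made up):
//   findLinksInItemsWithText('ul.results', 'lorem ipsum');
```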

Simple document.evaluate wrappers

Using document.evaluate is more verbose than using a CSS selector. However, we can simplify things by creating a wrapper function for document.evaluate that mimics querySelector and querySelectorAll.

Find one element by XPath easily

You can use this function to easily find a single document node using an XPath. This is similar to using the querySelector function:

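/**
* Shorthand for calling `document.evaluate` to get a single element via XPath.
* @param {string} xpathExpression - the XPath expression to evaluate
* @param {Node} contextNode - the node to search from (defaults to the whole document)
* @returns {Node|null} the first matching node in document order, or null if nothing matches
*/
const getNodeByXpath = (xpathExpression, contextNode = document) => document.evaluate(
  xpathExpression,
  contextNode,
  null,
  // FIRST_ORDERED_NODE_TYPE returns the first match in document order, like querySelector.
  // (ANY_UNORDERED_NODE_TYPE may return any matching node.)
  XPathResult.FIRST_ORDERED_NODE_TYPE,
  null,
).singleNodeValue;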

// From root of document
const heading = getNodeByXpath('//h1');

// From specific ancestor element ("context node").
// The leading dot keeps the search inside the section; '//h2' would search the whole document.
const section = document.querySelector('section');
const sectionHeading = getNodeByXpath('.//h2', section);

Find multiple elements by XPath easily

You can use this function to easily find multiple document nodes using an XPath. This is similar to using the querySelectorAll function:

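/**
* Shorthand for calling `document.evaluate` to get multiple elements via XPath.
* Returns an array of nodes, which may be empty.
* @param {string} xpathExpression - the XPath expression to evaluate
* @param {Node} contextNode - the node to search from (defaults to the whole document)
* @returns {Node[]} the matching nodes in document order
*/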
const getNodesByXpath = (xpathExpression, contextNode = document) => {
  const result = document.evaluate(
    xpathExpression,
    contextNode,
    null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
    null,
  );

  const nodes = [];
  for (let i = 0; i < result.snapshotLength; i++) {
    const node = result.snapshotItem(i);
    nodes.push(node);
  }
  return nodes;
};

// From root of document
const paragraphs = getNodesByXpath('//p');

// From specific ancestor element ("context node").
// The leading dot keeps the search inside the section; '//p' would search the whole document.
const section = document.querySelector('section');
const sectionParagraphs = getNodesByXpath('.//p', section);

References

For help building your own XPath expressions, check out this cheatsheet:

https://devhints.io/xpath