How to automate and scrape a React web app

I've mentioned before that I've been automating and scraping websites for a long time. It's not always pretty, but sometimes it's absolutely necessary. In this post I want to share a few tips and tricks for automating React apps.

The Irony

There's irony here. Often, we scrape websites because they are old and unmaintained and never had an API… we simply have no other choice. However, React is a more modern framework. Check out this timeline:

  • 2006 – Facebook releases its app platform and API
  • 2009 – Node.js is released, allowing API servers to be written in JavaScript
  • 2013 – Facebook releases React

Any website that uses React could also have been created with a convenient API for data access. In 2006, APIs were already popular enough that Facebook offered one, and then in 2009 Node.js enabled developers to use JavaScript to build API servers. This means any devs who built a React app could have also built an API server.

To be fair, most React apps probably are making HTTP requests to get things done, but that doesn't mean there's an API there for you to consume. The backend of a site is often built to only handle the needs of the frontend, with things like authentication and session management tightly coupling the two together. If that's the case, it can be better to scrape the React frontend than to try to interact directly with the server.

Typical Automation

Automating a website means getting a computer to use the website as if it were a person, clicking on things and entering text and copying data from the page.

Tools like Selenium and Puppeteer can do this, each with its own API. We can also scrape a website without any external tools. For example, we can run a script in the browser console, or create a bookmarklet that loads an external JavaScript file and runs it. In this case, we will use plain JavaScript to move through the website and find data.
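
For the bookmarklet route, a small helper can generate the javascript: URL to save as a bookmark. This is only a sketch: buildBookmarklet is a made-up name, and the script URL in the usage comment is a placeholder for wherever the automation script is hosted.

```javascript
// Hypothetical helper: builds a bookmarklet URL that injects an external
// script into whatever page is open when the bookmark is clicked.
const buildBookmarklet = (scriptUrl) => {
  const loader =
    '(function(){' +
    "var s=document.createElement('script');" +
    's.src=' + JSON.stringify(scriptUrl) + ';' +
    'document.body.appendChild(s);' +
    '})();';
  // Encode the code so it survives being stored as a URL.
  return 'javascript:' + encodeURIComponent(loader);
};

// Usage (placeholder URL): paste the result into a bookmark's address field.
// buildBookmarklet('https://example.com/automation.js');
```

Pages with a strict Content Security Policy may refuse to load the external script, in which case pasting the script into the console is the fallback.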

We can find elements with CSS selectors or XPath expressions.
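
As a rough sketch, the two kinds of lookup might be wrapped like this. The helper names findByCss and findByXPath and the selectors in the usage comments are invented for illustration; document.querySelector and document.evaluate are the standard DOM calls underneath.

```javascript
// Hypothetical helpers; `doc` is normally the page's `document`.

// Returns the first element matching a CSS selector, or null.
const findByCss = (doc, selector) => doc.querySelector(selector);

// Returns the first node matching an XPath expression, or null.
const findByXPath = (doc, expression) =>
  doc.evaluate(
    expression,
    doc, // context node: search the whole document
    null, // no namespace resolver
    XPathResult.FIRST_ORDERED_NODE_TYPE,
    null, // do not reuse an existing result object
  ).singleNodeValue;

// Usage in the browser console (example selectors only):
// findByCss(document, 'button[type="submit"]');
// findByXPath(document, '//button[contains(., "Save")]');
```

XPath is handy when the only stable hook is the visible text of an element, which CSS selectors cannot match.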

If we want to click on a link or a button, we can use the element.click() method.

If we want to enter text into a form, we can simply find the <input> or <textarea> element and set its value: element.value = "robot user input".

The above approach also works for selecting an option in a <select> dropdown.
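
Put together, the conventional approach on a non-React page might look roughly like this. The function name fillAndSubmit and every selector and value in it are placeholders, not taken from any real site.

```javascript
// Sketch of the "usual" approach on a non-React page.
// fillAndSubmit and all selectors below are hypothetical.
const fillAndSubmit = (doc) => {
  // Enter text by assigning to the value property.
  doc.querySelector('input[name="email"]').value = 'robot@example.com';
  doc.querySelector('textarea[name="message"]').value = 'robot user input';

  // Choose a dropdown option by setting the <select> value to an option's value.
  doc.querySelector('select[name="country"]').value = 'US';

  // Click the submit button.
  doc.querySelector('button[type="submit"]').click();
};

// Usage in the browser console:
// fillAndSubmit(document);
```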

However, there are unique challenges with automating React apps that make these usual approaches ineffective. React has a reactivity system (hence the name) that listens for user input events and then uses them to update its internal state data.

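This reactivity system listens for specific events, and some of them never fire when we change an element's value directly or call an element's click() method. Assigning to value fires no event at all, and click() fires only a click event, without the mousedown and mouseup that come with a real click. The page may appear to respond as usual, but React's internal state will not reflect the changes we made on the page.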

Eventually, our automation will likely fail because React is not updating in response to our input.

Clicking in a React app

For many websites, using element.click() is enough to click the element from JavaScript. In fact, we can still use that method on React components to follow links and submit some forms.

However, many React components listen for the lower-level mouse events that accompany a real click, which element.click() never fires. Maybe a switch component will only toggle on a real click. Maybe a dropdown or menu will only open on one. In these cases, we need to dispatch those events ourselves so that React's handlers see them.

To do so, we will dispatch a MouseEvent three times, once for each type of event that fires when a user clicks an element:

  • mousedown
  • mouseup
  • click

This ensures that React will recognize the click no matter what specific mouse events it was listening for. Browsers fire these events in this order, so if React was expecting all three to happen in order, we satisfy that requirement:

/**
 * Simulates a mouse click event in the browser in a way that React will recognize.
 * @param {HTMLElement} element
 */
const simulateMouseClick = (element) => {
  // Same order a real click produces: press, release, then click.
  const mouseClickEvents = ['mousedown', 'mouseup', 'click'];
  mouseClickEvents.forEach(mouseEventType =>
    element.dispatchEvent(
      new MouseEvent(mouseEventType, {
        view: window,
        bubbles: true,
        cancelable: true,
        buttons: 1,
      }),
    ),
  );
};

React app mouse hover

There are times when we need to hover our cursor over an element to proceed with automation. For example, we may need to move our mouse over a dropdown component in order for a menu to appear.

On some websites, such a menu will always exist but will be hidden by CSS classes. In that case we can still click links in the menu even if they are hidden. With React, however, those menu options often won't even be rendered unless the menu is activated by a mouse hover event.

To simulate a hover, we will dispatch a mousemove event at the center of our target element. React handlers that listen for mouse movement will pick up this event and treat it as the cursor hovering over the element:

/**
 * Simulates mouse movement across an element in a way that React will recognize.
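 * @param {HTMLElement} element
 */
const simulateMouseHover = (element) => {
  // Use viewport-relative coordinates, which is what clientX/clientY expect.
  // offsetLeft/offsetTop are relative to the offsetParent, not the viewport.
  const rect = element.getBoundingClientRect();
  const x = rect.left + rect.width / 2;
  const y = rect.top + rect.height / 2;
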
  element.dispatchEvent(
    new MouseEvent('mousemove', {
      view: window,
      bubbles: true,
      cancelable: true,
      buttons: 1,
      screenX: x,
      screenY: y,
      clientX: x,
      clientY: y,
    }),
  );
};
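
If a mousemove alone does not open the menu, one thing to try is dispatching the events a browser fires when the cursor first enters an element: mouseover, mouseenter, then mousemove. My understanding is that React derives its onMouseEnter handler from the native mouseover event, so the bubbling mouseover is the one most likely to matter; treat this as a sketch under that assumption rather than a guaranteed fix. The name simulateMouseEnter is made up for this example.

```javascript
/**
 * Hypothetical variant: simulates the cursor entering an element.
 * @param {HTMLElement} element
 */
const simulateMouseEnter = (element) => {
  const rect = element.getBoundingClientRect();
  const init = {
    view: window,
    bubbles: true,
    cancelable: true,
    clientX: rect.left + rect.width / 2,
    clientY: rect.top + rect.height / 2,
  };

  // Order a browser uses when the cursor moves onto an element.
  element.dispatchEvent(new MouseEvent('mouseover', init));
  // A native mouseenter neither bubbles nor can be cancelled.
  element.dispatchEvent(
    new MouseEvent('mouseenter', { ...init, bubbles: false, cancelable: false }),
  );
  element.dispatchEvent(new MouseEvent('mousemove', init));
};
```

After calling it, the menu items may take a moment to render, so look them up only once they exist in the DOM.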

React app text entry

Entering text in a React app is very similar to how we would enter text in any website. We find the form input element we want to update, and we set its value directly.

However, React tracks each input's value by installing its own value setter on the element, in place of the native one. When we assign to element.value, React records the new value immediately, so when an input event arrives later it sees nothing new and ignores it. Our text appears in the browser, but React's state never picks it up.

For example, if we enter text into a React form and then submit that form, React won't actually have our data. It will appear in the browser, but React will have an empty form since it didn't "hear" the input events that should have fired.

To get around this, we need to do two things:

  1. Get the original setter function and call it, so the element's value changes without React's setter recording it
  2. Dispatch an input event from that element so React "hears" the change happen and pulls the element's value into its internal state

This is what that looks like:

/**
 * Enters text in an input field in a way that React will respond to.
 * Directly setting an input value works, but React will not copy the value to its internal model.
 * The value would disappear if the component is re-rendered, and even if it didn't it would never
 * be submitted in an API request.
 * @param {HTMLInputElement} input 
 * @param {string} text 
 */
const simulateTextEntry = (input, text) => {
  const nativeInputValueSetter = Object.getOwnPropertyDescriptor(window.HTMLInputElement.prototype, 'value').set;
  nativeInputValueSetter.call(input, text);

  const event = new Event('input', { bubbles: true });
  input.dispatchEvent(event);
};
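
The function above reads the setter from HTMLInputElement.prototype, so it would not apply cleanly to a <textarea>. A possible generalization, sketched below, looks the setter up on the element's own prototype instead and dispatches a change event for <select> elements. The name simulateValueEntry is invented here, and using change for dropdowns is an assumption about how React listens to <select>, so verify it against the app you are automating.

```javascript
/**
 * Hypothetical generalization for <input>, <textarea> and <select>.
 * @param {HTMLInputElement|HTMLTextAreaElement|HTMLSelectElement} element
 * @param {string} value
 */
const simulateValueEntry = (element, value) => {
  // Read the native `value` setter from the element's prototype, which
  // bypasses any setter installed on the element instance itself.
  const prototype = Object.getPrototypeOf(element);
  const descriptor = Object.getOwnPropertyDescriptor(prototype, 'value');
  if (!descriptor || !descriptor.set) {
    throw new TypeError('Element has no native value setter');
  }
  descriptor.set.call(element, value);

  // Assumption: dropdowns are notified with `change`, text fields with `input`.
  const eventType = element.tagName === 'SELECT' ? 'change' : 'input';
  element.dispatchEvent(new Event(eventType, { bubbles: true }));
};

// Usage in the browser console (example selector only):
// simulateValueEntry(document.querySelector('textarea'), 'robot user input');
```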

Reference: https://stackoverflow.com/a/46012210