Published on 14-07-2020

Swinging on an AST branch

Reading time: 10 minutes

Categories:

Articles

There are things that we tend not to notice since they are so obvious — or we just do not need to think about them. There are many such things in JavaScript, things that we do not need to worry about in our everyday work (unless you develop Babel or Rollup — then these things are your work…). Let’s look into one of such things: Abstract Syntax Tree (AST).

# What is Abstract Syntax Tree?

Dealing with strings is not the best way to spend time on and I probably do not need to convince you about it. Yes, there are regular expressions, but they are not the most readable thing under the sun (especially in JavaScript, still lacking several regex goodies). Using them is also often very error-prone.

Probably the most famous task that cannot be sensibly achieved with regular expressions is parsing HTML. Let’s pretend that you must get the content of a paragraph with the .myClass class:

<p class="myClass">content</p>

It shouldn’t be difficult:

pageContent.match( /<p.*?class="myClass".*?>(.+?)<\/p>/g )

Easy peasy, isn’t it? Well, not exactly, because the following code is also totally valid HTML:

<P cLaSs = 'myClass'>content

And don’t get me started about syntax errors, which should be resolved using an algorithm described in the specification:

<p class="otherClass" class="myClass">content</p>

Most of the time you will not have any certainty that the source of the data is syntactically valid. And even if it is valid, you still cannot expect that it uses a certain type of HTML syntax. What’s more, the fact that it is syntactically valid and uses the syntax you expect does not mean that when a new version of the application is released, it will not start generating some different HTML output (e.g. due to code minification).

This is exactly what’s wrong with regular expressions: they have to strictly match the text to the pattern. When even a single character does not match, it is over. Working with more complex strings becomes really cumbersome. Additionally, strings are not good at passing additional information, like relationships between particular HTML elements.

Due to that, some very smart people realized that we need a more sensible data structure that would show what exactly is going on inside the HTML code and how the browser sees it. And this is how the DOM was created: as a tree structure that contains information about all elements building a web page and their mutual relationships. Thanks to that, getting content from a p.myClass element is as simple as:

document.querySelector( 'p.myClass' ).innerHTML;

And this is it. There is no need to worry about syntax differences or even errors — the browser does it for you, normalizing HTML code to a sensible, coherent form. A form that also allows you to see where exactly in the page’s structure a particular element is:

p.parentNode; // <body>
p.previousElementSibling; // <p>
p.nextElementSibling; // null

This way we went from a simple string to a spatial tree structure.

This idea works not only for HTML but it can also be used for other languages like JavaScript. A tree structure that is created after parsing the JavaScript source code (a string) and that contains nodes (instructions, blocks, expressions, etc.), is called AST, as in Abstract Syntax Tree. Its task is very similar to DOM’s one: to enable sensible traversing through the program’s code and analyzing or even modifying it on the go, without the need to resort to regular expressions. This is why AST is used by compilers and other tools that need to analyze the code of particular programs.

# JavaScript AST

AST is used in JavaScript in two totally different ways. On the one hand, it is an integral part of JavaScript engines (e.g. V8), which parse every piece of code to AST. On the other hand, it is used by tools written in JavaScript itself, extending the abilities of other JavaScript programs (e.g. Babel).

In this article I will not cover the AST magic done by JavaScript engines (like speculative optimizations). Instead, I will focus on how a totally average JavaScript programmer that treats the JavaScript engine as a monolithic platform can use AST for more mundane work.

There are quite a lot of tools that produce JavaScript AST. The most popular ones are:

@babel/parser – The parser used by Babel (surprise!).
Esprima – Formerly connected with jQuery and the JS Foundation, now an OpenJS Foundation project.
Espree – Used by ESLint, compatible with Esprima.
Acorn – A modular parser that is the foundation of Espree and that @babel/parser is based on.

As you can see, all main players on the market are similar or they even stem from the same project. Nevertheless, there are also some other parsers (e.g. the one in UglifyJS/Terser) and they do not necessarily play by the same rules. This is why the need to create a standard for unified AST appeared.

The standard is called ESTree and is based on the syntax of SpiderMonkey’s AST (SpiderMonkey is the JavaScript engine used by Firefox). Most of the parsers should be compliant with the standard or at least allow to import ESTree-based AST that will be then transformed into parser-specific AST (like UglifyJS/Terser’s parser does).

However, this is all just some boring theory. You may ask: What does such AST look like in practice? Let’s see on a simple JavaScript example:

console.log( 'Hello world!' );

To see its AST, you can use AST Explorer. As you can see, AST for such a simple code is not so simple. At first glance, it actually seems pretty complicated.

The generated JavaScript AST is based on Babel’s parser output; for other parsers it looks slightly different — despite the fact that these parsers are based on the same standard.

Similar to the DOM, where there is the document node, here you have the File node. This node contains the Program node (an equivalent to document.documentElement in the DOM). Going deeper you will notice that the entire program’s line of code (including the semicolon) forms one node, ExpressionStatement (this is a special kind of expression that is also a statement). At the same time, this same line without the semicolon forms a CallExpression node — so an expression that contains a function invocation.

A small digression here: The statement contains a semicolon because, according to the ECMAScript specification, this is the character that delimits a statement. Theoretically, a piece of code without a semicolon at the end would be syntactically incorrect. However, JavaScript contains an automatic semicolon insertion mechanism.

Every AST node contains information about the previous one (left) and the next one (right). What is interesting, every AST node also contains information about where it was located in the source code (loc). Not especially useful in most scenarios, but it can be useful, for example, when creating source maps.

# Modifying the code!

OK, enough with the theory — let’s jump to some practice! Let’s come up with a practical issue that could be solved with JavaScript AST.

# The issue

Imagine that you have a program consisting of several JavaScript files that contain some console.log() invocations. You want to replace them with a customLog() function. And before someone suggests that it can be easily done using regular expressions, I must warn that the source code can contain some surprises:

console.log( `console.log( 'Hello world!' ) is a really nice expression!` );

And if it is still not enough for die-hard regular expression enthusiasts, pretend that your boss has just forbidden their use. And this is exactly when Abstract Syntax Tree enters the room! The task is simple:

Load the code that needs to be transformed.
Parse it into AST.
Find AST nodes that need to be changed.
Replace the nodes with the new ones.
Generate the new JavaScript code.
Save the generated code back to the file.

It can be done using four Babel’s packages:

@babel/parser – To parse the code into AST.
@babel/traverse – To find and replace particular AST nodes.
@babel/generator – To generate code back from AST.
@babel/types – To… erm, this is just a library of AST node types that Babel understands and for some reason, it was put into a separate package ¯\_(ツ)_/¯.

The sample file that needs to be transformed is called input.js and looks like this:

function someFunction() {
	if ( true ) {
		console.log( `console.log( 'Hello world!' ) is a really nice expression!` );
	}
}

console.log( 'FORWAAAAARD!!!' );
other.call( 'whatever' );

someFunction();

# Parsing the code

Let’s start with the installation of dependencies:

npm install @babel/parser @babel/traverse @babel/generator @babel/types

The first step is to change the JavaScript code into Abstract Syntax Tree. To do it, you must read the file contents and then use the parse() method from the @babel/parser package on it:

const { join: joinPath} = require( 'path' ); // 3
const { readFileSync } = require( 'fs' ); // 2
const { parse } = require( '@babel/parser' ); // 5

const path = joinPath( __dirname, 'input.js' );
const code = readFileSync( path, 'utf8' ); // 1
const ast = parse( code ); // 4

To load the file (1), you use the readFileSync() method from the built-in fs module (2). To be sure that the file path is correct, you can use the join() method from another built-in module, path (3).

After saving the file contents into a variable, you just have to pass it to the parse() method (4) from the @babel/parser package (5). And this is it — you got your AST!

# Traversing the tree

This is all that we needed from the @babel/parser package. Now you can continue to the @babel/traverse package that will allow you to move through the entire tree and get every node. Pass AST to it then:

const { default: traverse } = require( '@babel/traverse' );

[...]

traverse( ast, { // 1
	enter( path ) { // 2

	}
} );

The package has a default export that is a function (1) taking AST and the options object as arguments. The enter() method (2) of this object is called every time the traverse() method enters another node. There is also an exit() method that is called upon exiting nodes. For your use case, though, this difference is irrelevant.

The path parameter passed to the enter() method is an object wrapping the AST node. It contains additional methods and properties that help understand which nodes are siblings to the one being inspected. They also allow modifying the node itself. All operations are done on the passed ast tree, which is alive and mutable.

Filter out nodes that are not interesting to you. You only need console.log() calls, so CallExpression nodes. Additionally, they should contain MemberExpression, so the syntax for referencing the object’s property (which console.log() undeniably does). Now code it:

const { isMemberExpression, isCallExpression } = require( '@babel/types' ); // 5

[...]

traverse( ast, {
	enter( path ) {
		const { node, parentPath } = path; // 1

		if ( !parentPath ) { // 2
			return;
		}

		const { node: parentNode } = parentPath; // 3

		if ( !isMemberExpression( node ) || !isCallExpression( parentNode ) ) { // 4
			return;
		}
	}
} );

First of all, get the node and parentPath properties from the path parameter (1). The first one is the wrapped node, the second one is the path of the current node’s parent (AST nodes, similar to DOM ones, have parents). If the node does not have a parent (2), it means it is a File or Program one. If it has a parent, it can be obtained from its path (3). Then you check if the current node is of the MemberExpression type and if its parent is of CallExpression one (4). The methods from @babel/types will be helpful here (5).

Now that all uninteresting nodes are filtered out, time to move on to ones that actually are interesting! You need to replace all console.log() calls with customLog() ones, so — in terms of AST — you need to replace MemberExpression nodes with Identifier ones (customLog() is a function name, so: an identifier). It is a fairly easy operation:

const { [...], identifier } = require( '@babel/types' ); // 1

[...]

traverse( ast, {
	enter( path ) {
		[...]

		path.replaceWith( // 3
			identifier( 'customLog' ) // 2
		);
	}
} );

You use another type, identifier (1). Thanks to it you create an identifier for the customLog() function (2) that you can use to replace the current node (3).

And this is it! You just got rid of console.log() calls in favor of customLog() ones.

# Generating and saving the code

The last thing to do is to generate the JavaScript code from the transformed AST and save it to the file:

const { [...], writeFileSync } = require( 'fs' ); // 4
const { default: generate } = require( '@babel/generator' ); // 2

[...]

const { code:transformedCode } = generate( ast ); // 1

const outputPath = joinPath( __dirname, 'output.js' );
writeFileSync( outputPath, transformedCode, 'utf8' ); // 3

To generate the new code you simply need to call the generate() function (1), which is the default export of the @babel/generator package (2). Saving it to a file (3) can be achieved by using the writeFileSync() function from the built-in fs module (4).

# Additional node filtering

If you ran this code now, you would notice two things:

The formatting of the code is not the same as in the original one.
The other.call() call was also changed to the customLog() one.

The first “error” is simple to explain: Abstract Syntax Tree does not contain any information about white space characters because they are totally irrelevant to the program. Due to that, this information is lost during the transformation from and to AST. Fortunately, it can be easily fixed by tools like ESLint and Prettier.

The second error is caused by the fact that the current filtering is not entirely accurate. You should ensure that the node is, in fact, a console.log() call. Write an isConsoleLog() function, which will do exactly that:

const { [...], isIdentifier, identifier } = require( '@babel/types' ); // 1

[...]

function isConsoleLog( { object, property } ) { // 3
	return isIdentifier( object ) && isIdentifier( property ) && object.name === 'console' && property.name === 'log';
}

traverse( ast, {
	enter( path ) {
		[...]

		if ( !isMemberExpression( node ) || !isCallExpression( parentNode ) || !isConsoleLog( node ) ) { // 2
			return;
		}

		[...]
	}
} );

It uses another function from @babel/types — isIdentifier() (1). The isConsoleLog() function, on the other hand, is used inside the filtering method in traverse() (2). It takes a node as a parameter.

The two node’s properties you should be interested in are object and property (3). The first one informs about the object which is referenced by the expression, the second one — which of the object’s properties is referenced. You should check if both properties are identifiers and their names are console and log. Only after such a check you will be sure that you deal with a console.log() call.

# Summary

This is how you can write a simple script that allows you to modify JavaScript code without the need of using regular expressions, by using a friendlier tree structure. What’s more, this is only a fraction of Abstract Syntax Tree possibilities! You can use a parser to, for example, extend the JavaScript syntax (as in JSX) or polyfill new ECMAScript features (this is exactly why Babel was created). AST is also used in bundlers (e.g. webpack or Rollup) for searching imports and exports.

The whole code is available on GitHub. Enjoy!

See other How to articles: