Syntax Nodes

During the compilation of the Cangjie language, the code is converted into Tokens through lexical analysis, and then the syntax of Tokens is parsed to obtain a syntax tree. A node of each syntax tree could be an expression, declaration, type, pattern, among others. The ast library of Cangjie provides a class for each node, with an appropriate inheritance relationship among the classes. The main abstract classes are as follows:

  • Node: parent class of all syntax nodes.
  • TypeNode: parent class of all type nodes.
  • Expr: parent class of all expression nodes.
  • Decl: parent class of all declaration nodes.
  • Pattern: parent class of all pattern nodes.

For more details about the various node types, see Cangjie Programming Language Library API. In the examples below, the following two nodes are used:

  • BinaryExpr: binary operation expression.
  • FuncDecl: function declaration.

Node Parsing

The ast library enables basically each type of node to be parsed from Tokens. There are two ways to parse a node.

Using Functions for Expression and Declaration Parsing

The following functions can be used to parse any expression or declaration from Tokens:

  • parseExpr(input: Tokens): Expr: Parses the input Tokens into an expression.
  • parseExprFragment(input: Tokens, startFrom!: Int64 = 0): (Expr, Int64): Parses a segment of the input Tokens into an expression. The segment starts from the startFrom index. The parsing may consume only a part of the segment starting from the startFrom index and return the index of the first unconsumed Token. If the entire segment is consumed, input.size is returned.
  • parseDecl(input: Tokens, astKind!: String = ""): Parses the input Tokens into a declaration. astKind is an additional setting. For details, see Cangjie Programming Language Library API.
  • parseDeclFragment(input: Tokens, startFrom!: Int64 = 0): (Decl, Int64): Parses a segment of the input Tokens into a declaration. The meanings of the startFrom parameter and return index are the same as those of parseExpr.

The following code cases illustrate how to use these functions.

let tks1 = quote(a + b)
let tks2 = quote(u + v, x + y)
let tks3 = quote(
    func f1(x: Int64) { return x + 1 }
)
let tks4 = quote(
    func f2(x: Int64) { return x + 2 }
    func f3(x: Int64) { return x + 3 }
)

let binExpr1 = parseExpr(tks1)
let (binExpr2, mid) = parseExprFragment(tks2)
let (binExpr3, _) = parseExprFragment(tks2, startFrom: mid + 1) // Skip the comma.
println("binExpr1 = ${binExpr1.toTokens()}")
println("binExpr2 = ${binExpr2.toTokens()}, binExpr3 = ${binExpr3.toTokens()}")
let funcDecl1 = parseDecl(tks3)
let (funcDecl2, mid2) = parseDeclFragment(tks4)
let (funcDecl3, _) = parseDeclFragment(tks4, startFrom: mid2)
println("${funcDecl1.toTokens()}")
println("${funcDecl2.toTokens()}")
println("${funcDecl3.toTokens()}")

The output is as follows:

binExpr1 = a + b
binExpr2 = u + v, binExpr3 = x + y
func f1(x: Int64) {
    return x + 1
}

func f2(x: Int64) {
    return x + 2
}

func f3(x: Int64) {
    return x + 3
}

Using Constructors for Parsing

Most node types support the init(input: Tokens) constructor, which can parse the input Tokens to the node of the corresponding type. For example:

import std.ast.*

let binExpr = BinaryExpr(quote(a + b))
let funcDecl = FuncDecl(quote(func f1(x: Int64) { return x + 1 }))

If the parsing fails, an exception is thrown. This parsing method applies to code snippets whose types are known, so manual type casting are not needed.

Node Composition

After a node is parsed based on Tokens, you can view the components of the node. The components of BinaryExpr and FuncDecl are listed as examples. For details about the components of other nodes, see Cangjie Programming Language Library API.

  • BinaryExpr node:
    • leftExpr: Expr: expression on the left of the operator
    • op: Token: operator
    • rightExpr: Expr: expression on the right of the operator
  • FuncDecl node (partial):
    • identifier: Token: function name
    • funcParams: ArrayList<FuncParam>: parameter list
    • declType: TypeNode: return value type
    • block: Block: function body
  • FuncParam node (partial):
    • identifier: Token: parameter name
    • paramType: TypeNode: parameter type
  • Block node (partial):
    • nodes: ArrayList<Node>: expressions and declarations in a block

Each component is a public mut prop and can be viewed and updated. Here are some examples of update results.

Case of BinaryExpr

let binExpr = BinaryExpr(quote(x * y))
binExpr.leftExpr = BinaryExpr(quote(a + b))
println(binExpr.toTokens())

binExpr.op = Token(TokenKind.ADD)
println(binExpr.toTokens())

The output is as follows:

(a + b) * y
a + b + y

To begin with, it is identified through parsing that binExpr represents the node x * y, as depicted in the following figure.

    *
  /   \
 x     y

After that, we replace the left node (that is, x) with a + b, which obtains the following syntax tree:

      *
    /   \
   +     y
  / \
 a   b

When the syntax tree is tokenized (with toTokens), parentheses must be added around a + b to obtain (a + b) * y. (If a + b * y is a result, multiplication is performed before addition, which contradicts the connotation of the syntax tree.) The ast library can automatically add parentheses when the syntax tree is tokenized.

Finally, we replace the operator at the root of the syntax tree from * to + to obtain the following syntax tree:

      +
    /   \
   +     y
  / \
 a   b

The syntax tree can be tokenized as a + b + y because addition is inherently left-associative and requires no parentheses on the left.

Case of FuncDecl

let funcDecl = FuncDecl(quote(func f1(x: Int64) { x + 1 }))
funcDecl.identifier = Token(TokenKind.IDENTIFIER, "foo")
println("Number of parameters: ${funcDecl.funcParams.size}")
funcDecl.funcParams[0].identifier = Token(TokenKind.IDENTIFIER, "a")
println("Number of nodes in body: ${funcDecl.block.nodes.size}")
let binExpr = (funcDecl.block.nodes[0] as BinaryExpr).getOrThrow()
binExpr.leftExpr = parseExpr(quote(a))
println(funcDecl.toTokens())

In this case, a FuncDecl node is constructed through parsing, and the function name, parameter name, and part of the expression in the function body are modified. The output is as follows:

Number of parameters: 1
Number of nodes in body: 1
func foo(a: Int64) {
    a + 1
}

Using the quote Interpolation Syntax Node

Any node of an abstract syntax tree (AST) can be interpolated in the quote statement, and the ArrayList lists of some AST nodes can also be interpolated (corresponding to scenarios where such node lists appear in practice). Interpolation can be directly expressed by $(node), where node is an instance of any node type.

Next, we will demonstrate how to interpolate nodes through some cases.

var binExpr = BinaryExpr(quote(1 + 2))
let a = quote($(binExpr))
let b = quote($binExpr)
let c = quote($(binExpr.leftExpr))
let d = quote($binExpr.leftExpr)
println("a: ${a.toTokens()}")
println("b: ${b.toTokens()}")
println("c: ${c.toTokens()}")
println("d: ${d.toTokens()}")

The output is as follows:

a: 1 + 2
b: 1 + 2
c: 1
d: 1 + 2.leftExpr

Generally, the expression after an interpolation operator uses parentheses to limit the scope, for example, $(binExpr). If there is only one identifier, the expression can be written as $binExpr, with the parentheses omitted. In this case, both a and b insert the binExpr node into the quote statement, and the result is 1 + 2. However, if the expression after the interpolation operator is more complex, a scope error may occur if parentheses are not added. For example, the expression binExpr.leftExpr is evaluated as the left part of expression 1 + 2, that is, 1. Therefore, 1 is correctly assigned to c. However, the interpolation in d is interpreted as ($binExpr).leftExpr, so the result is 1 + 2.leftExpr. To specify the scope of interpolation, you are advised to use parentheses in interpolation operations.

The following example shows the interpolation of a node list (ArrayList).

var incrs = ArrayList<Node>()
for (i in 1..=5) {
    incrs.append(parseExpr(quote(x += $(i))))
}
var foo = quote(
    func foo(n: Int64) {
        let x = n
        $(incrs)
        x
    })
println(foo)

The output is as follows:

func foo(n: Int64) {
    let x = n
    x += 1
    x += 2
    x += 3
    x += 4
    x += 5
    x
}

In this case, we create a node list incrs that contains expressions x += 1, ..., x += 5. The interpolation of incrs lists the nodes in sequence and wraps the line for each node. This applies to the scenario where expressions and declarations that need to be executed sequentially are inserted.

The following example shows that in some cases, parentheses need to be added around the interpolation for correctness.

var binExpr1 = BinaryExpr(quote(x + y))
var binExpr2 = BinaryExpr(quote($(binExpr1) * z))      // Error: x + y x z is obtained.
println("binExpr2: ${binExpr2.toTokens()}")
println("binExpr2.leftExpr: ${binExpr2.leftExpr.toTokens()}")
println("binExpr2.rightExpr: ${binExpr2.rightExpr.toTokens()}")
var binExpr3 = BinaryExpr(quote(($(binExpr1)) * z))     / Correct: (x + y) x z is obtained.
println("binExpr3: ${binExpr3.toTokens()}")

The output is as follows:

binExpr2: x + y * z
binExpr2.leftExpr: x
binExpr2.rightExpr: y * z
binExpr3: (x + y) * z

First, we construct the expression x + y, and then insert the expression into the $(binExpr1) * z template. The intent is to obtain an expression that first computes x + y and then multiplies the result by z. However, the result of interpolation is x + y * z, in which y * z is performed before x is added. This is because interpolation cannot automatically add parentheses to ensure the atomicity of the inserted expression, which differs from the replacement of leftExpr introduced in the previous section. Therefore, parentheses need to be added around $(binExpr1) to ensure result correctness.