Lexical Structure

This chapter describes the lexical structure of the Cangjie programming language. For the complete BNF of the lexicon and syntax, see "Cangjie Syntax."

Note: To improve the readability of the document, the syntax definition in the text body is slightly different from that in the appendix. In the text body, symbols and keywords are replaced with their literal representations (rather than names in the lexical structure).

Identifiers and Keywords

Identifiers are classified into common identifiers and raw identifiers. A common identifier starts with an XID_Start character followed by any number of XID_Continue characters, or with an underscore (_) followed by one or more XID_Continue characters. Cangjie keywords are excluded from the set of common identifiers. XID_Start and XID_Continue are properties defined in the Unicode Standard, as detailed in [Unicode Standard Annex #31](https://www.unicode.org/reports/tr31/tr31-37.html). Cangjie currently uses Unicode version 15.0.0. A raw identifier is a common identifier enclosed in a pair of backquotes (``). Keywords can also be used within the backquotes.

Identifier
    : Ident
    | RawIdent
    ;

fragment RawIdent
    : '`' Ident '`'
    ;

fragment Ident
    : XID_Start XID_Continue*
    | '_' XID_Continue+
    ;

In Cangjie, all identifiers are recognized in Normalization Form C (NFC). Two identifiers that are equal after normalization to NFC are considered the same identifier. For the definition of NFC, see Unicode Standard Annex #15 (tr15).

For example, the following are some valid Cangjie identifiers:

foo
_bar
Cangjie
`if`
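
The Ident and RawIdent rules and the NFC comparison can be approximated in a few lines of Python. This is only a sketch under stated assumptions: Python's str.isidentifier() follows the same XID_Start/XID_Continue scheme but uses whatever Unicode version ships with the interpreter (not necessarily 15.0.0), and SAMPLE_KEYWORDS is a hypothetical subset of the keyword list, not the full set.

```python
import unicodedata

# Hypothetical sample only; the full keyword list is given in this chapter.
SAMPLE_KEYWORDS = frozenset({"if", "let", "var", "func", "class"})


def is_common_identifier(text):
    """Approximate Ident: XID_Start XID_Continue* | '_' XID_Continue+."""
    # str.isidentifier() accepts XID_Start or '_' followed by XID_Continue*,
    # so a lone underscore has to be rejected separately.
    if text == "_" or not text.isidentifier():
        return False
    return text not in SAMPLE_KEYWORDS


def is_raw_identifier(text):
    """Approximate RawIdent: an Ident (keywords allowed) between backquotes."""
    if len(text) < 3 or text[0] != "`" or text[-1] != "`":
        return False
    inner = text[1:-1]
    return inner != "_" and inner.isidentifier()


def same_identifier(left, right):
    """Two spellings name the same identifier if their NFC forms are equal."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)
```

For instance, same_identifier treats a precomposed "é" (U+00E9) and the sequence "e" followed by U+0301 as the same spelling, because both normalize to the same NFC string.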

Keywords are special character strings that cannot be used as identifiers. The following table lists the Cangjie keywords.

Keyword
as           break          Bool
case         catch          class
const        continue       Rune
do           else           enum
extend       for            from
func         false          finally
foreign      Float16        Float32
Float64      if             in
is           init           inout
import       interface      Int8
Int16        Int32          Int64
IntNative    let            mut
main         macro          match
Nothing      operator       prop
package      quote          return
spawn        super          static
struct       synchronized   try
this         true           type
throw        This           unsafe
Unit         UInt8          UInt16
UInt32       UInt64         UIntNative
var          VArray         where
while

Contextual keywords are special character strings that act as keywords only in certain syntactic positions. Everywhere else they can be used as common identifiers.

Contextual Keyword
abstract     open           override
private      protected      public
redef        get            set
sealed

Semicolons and Newline Characters

Two symbols can indicate the end of an expression or declaration: a semicolon (;) and a newline character. The meaning of ; is fixed. It indicates the end of an expression or declaration regardless of its position, so multiple expressions or declarations can be written on the same line, separated by ;. The meaning of a newline character, however, is not fixed. Depending on its position, it can act as a separator between two tokens, like a space character, or indicate the end of an expression or declaration, like a ;.

A newline character can be used between any two tokens. Generally, the "longest match" principle (using as many tokens as possible to form a valid expression or declaration) is followed to determine whether to treat the newline character as the separator between tokens or the terminator of the expression or declaration. The newline character encountered before the "longest match" ends is treated as the separator between tokens, and that encountered after the "longest match" ends is treated as the terminator of the expression or declaration. The following shows examples:

let width1: Int32 = 32 // The newline character is treated as a terminator.
let length1: Int32 = 1024 // The newline character is treated as a terminator.
let width2: Int32 = 32; let length2: Int32 = 1024 // The newline character is treated as a terminator.
var x = 100 + // The newline character is treated as a separator.
200 * 300 - // The newline character is treated as a separator.
50 // The newline character is treated as a terminator.

However, the "longest match" principle does not apply to scenarios where a newline character cannot be used as the separator between two tokens, string literals, and multi-line comments. A newline character cannot be used as the separator between two tokens in the following scenarios:

  • Do not use a newline character as the separator between a unary operator and its operand.

  • In a function call expression, do not use a newline character as the separator between ( and the token preceding it.

  • In an index access expression, do not use a newline character as the separator between [ and the token preceding it.

  • In a constant pattern, do not use a newline character as the separator between $ and the identifier following it.

Note: In the preceding scenarios, the newline character cannot be used as the separator between two tokens. It does not mean that the newline character cannot be used in these scenarios. (If a newline character is used, it will be directly treated as the terminator of the expression or declaration.)

The "longest match" principle does not apply to string literals and multi-line comments.

  • For a single-line string, the matching ends at the first unescaped double quotation mark.

  • For a multi-line string, the matching ends at the first run of three unescaped double quotation marks.

  • For a multi-line raw string, the matching ends at the first quotation mark that is followed by the same number of number signs (#) as at the beginning of the literal.

  • For a multi-line comment, the matching ends at the first */.

Literals

A literal is an expression that represents a value that cannot be modified.

Literals also have types. In Cangjie, the types of literals include integer, floating-point, Rune, Boolean, and string. The syntax of literals is as follows:

literalConstant
    : IntegerLiteral
    | FloatLiteral
    | RuneLiteral
    | booleanLiteral
    | stringLiteral
    ;
    
stringLiteral
    : lineStringLiteral
    | multiLineStringLiteral
    | MultiLineRawStringLiteral
    ;

Literals of the Integer Type

An integer literal can be expressed using four number notations: binary (using the 0b or 0B prefix), octal (using the 0o or 0O prefix), decimal (without a prefix), and hexadecimal (using the 0x or 0X prefix). In addition, you can add an optional suffix to specify the type of the integer literal.

The syntax of the literal of the integer type is defined as follows:

IntegerLiteralSuffix
    : 'i8' | 'i16' | 'i32' | 'i64' | 'u8' | 'u16' | 'u32' | 'u64'
    ;

IntegerLiteral
    : BinaryLiteral IntegerLiteralSuffix?
    | OctalLiteral IntegerLiteralSuffix?
    | DecimalLiteral '_'* IntegerLiteralSuffix?
    | HexadecimalLiteral IntegerLiteralSuffix?
    ;

BinaryLiteral
    : '0' [bB] BinDigit (BinDigit | '_')*
    ;

BinDigit
    : [01]
    ;

OctalLiteral
    : '0' [oO] OctalDigit (OctalDigit | '_')*
    ;

OctalDigit
    : [0-7]
    ;

DecimalLiteral
    : ([1-9] (DecimalDigit | '_')*) | DecimalDigit
    ;

DecimalDigit
    : [0-9]
    ;

HexadecimalLiteral
    : '0' [xX] HexadecimalDigits
    ;

HexadecimalDigits
    : HexadecimalDigit (HexadecimalDigit | '_')*
    ;

HexadecimalDigit
    : [0-9a-fA-F]
    ;

The mappings between suffixes and types for IntegerLiteralSuffix are as follows:

Suffix   Type     Suffix   Type
i8       Int8     u8       UInt8
i16      Int16    u16      UInt16
i32      Int32    u32      UInt32
i64      Int64    u64      UInt64
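
As an illustration, the IntegerLiteral grammar and the suffix table can be transcribed into a regular expression. The Python sketch below is not part of Cangjie. The names INTEGER_LITERAL and parse_integer_literal are hypothetical, and the sketch checks only the lexical shape, not whether the value fits the range of the suffix type.

```python
import re

SUFFIX_TYPES = {
    "i8": "Int8", "i16": "Int16", "i32": "Int32", "i64": "Int64",
    "u8": "UInt8", "u16": "UInt16", "u32": "UInt32", "u64": "UInt64",
}

# One alternative per notation, mirroring the grammar rules above.
INTEGER_LITERAL = re.compile(
    r"""
    (?P<number>
        0[bB][01][01_]*                    # BinaryLiteral
      | 0[oO][0-7][0-7_]*                  # OctalLiteral
      | 0[xX][0-9a-fA-F][0-9a-fA-F_]*      # HexadecimalLiteral
      | (?:[1-9][0-9_]*|[0-9])_*           # DecimalLiteral '_'*
    )
    (?P<suffix>[iu](?:8|16|32|64))?        # IntegerLiteralSuffix?
    """,
    re.VERBOSE,
)


def parse_integer_literal(text):
    """Return (value, type name or None), or None if text is not a literal."""
    match = INTEGER_LITERAL.fullmatch(text)
    if match is None:
        return None
    # Underscores are only visual separators, so drop them before converting.
    digits = match.group("number").replace("_", "")
    return int(digits, 0), SUFFIX_TYPES.get(match.group("suffix"))
```

Under this sketch, 0b1010_1010u8 yields the value 170 with type UInt8, while 007 is rejected because DecimalLiteral allows a leading 0 only as the single digit 0.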

Literals of the Floating-Point Type

A floating-point literal can be expressed in two formats: decimal (without a prefix) and hexadecimal (with a 0x or 0X prefix). A decimal floating-point literal must contain the integer part, the fractional part (including the decimal point), or both. If there is no fractional part, the exponent part (with an e or E prefix) is required. A hexadecimal floating-point literal must contain the integer part, the fractional part (including the decimal point), or both, and the exponent part (with a p or P prefix) is always required. In addition, you can add an optional suffix to specify the type of the floating-point literal.

The syntax of the literal of the floating-point type is defined as follows:

FloatLiteralSuffix
    : 'f16' | 'f32' | 'f64'
    ;
 
FloatLiteral
    : (DecimalLiteral DecimalExponent | DecimalFraction DecimalExponent? | (DecimalLiteral DecimalFraction) DecimalExponent?) FloatLiteralSuffix?
    | (Hexadecimalprefix (HexadecimalDigits | HexadecimalFraction | (HexadecimalDigits HexadecimalFraction)) HexadecimalExponent)
    ;

DecimalFraction 
    : '.' DecimalFragment
    ;

DecimalFragment
    : DecimalDigit (DecimalDigit | '_')*
    ;
    
DecimalExponent 
    : FloatE Sign? DecimalFragment
    ;
    
HexadecimalFraction 
    : '.' HexadecimalDigits
    ;

HexadecimalExponent 
    : FloatP Sign? DecimalFragment
    ;
    
FloatE 
    : [eE]
    ;

FloatP 
    : [pP]
    ;

Sign 
    : [-]
    ;

Hexadecimalprefix 
    : '0' [xX]
    ;

The mappings between suffixes and types for FloatLiteralSuffix are as follows:

Suffix   Type
f16      Float16
f32      Float32
f64      Float64
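
The FloatLiteral grammar can be sketched in the same way. The Python below is an illustration under stated assumptions, not Cangjie's implementation. The names are hypothetical, the result is a Python double regardless of the suffix, and the suffix is accepted only on decimal literals because that is what the grammar in this section shows.

```python
import re

FLOAT_SUFFIX_TYPES = {"f16": "Float16", "f32": "Float32", "f64": "Float64"}

# Building blocks named after the grammar rules above.
DEC_LITERAL = r"(?:[1-9][0-9_]*|[0-9])"
DEC_FRAGMENT = r"[0-9][0-9_]*"
DEC_FRACTION = r"\." + DEC_FRAGMENT
DEC_EXPONENT = r"[eE]-?" + DEC_FRAGMENT
HEX_DIGITS = r"[0-9a-fA-F][0-9a-fA-F_]*"
HEX_FRACTION = r"\." + HEX_DIGITS
HEX_EXPONENT = r"[pP]-?" + DEC_FRAGMENT

DECIMAL_FLOAT = re.compile(
    "(?P<number>"
    + DEC_LITERAL + DEC_EXPONENT
    + "|" + DEC_FRACTION + "(?:" + DEC_EXPONENT + ")?"
    + "|" + DEC_LITERAL + DEC_FRACTION + "(?:" + DEC_EXPONENT + ")?"
    + ")(?P<suffix>f16|f32|f64)?"
)
HEX_FLOAT = re.compile(
    "0[xX](?:" + HEX_DIGITS + "(?:" + HEX_FRACTION + ")?|" + HEX_FRACTION + ")"
    + HEX_EXPONENT
)


def parse_float_literal(text):
    """Return (value, type name or None), or None if text is not a literal."""
    match = DECIMAL_FLOAT.fullmatch(text)
    if match is not None:
        value = float(match.group("number").replace("_", ""))
        return value, FLOAT_SUFFIX_TYPES.get(match.group("suffix"))
    if HEX_FLOAT.fullmatch(text) is not None:
        # float.fromhex understands the 0x...p... notation directly.
        return float.fromhex(text.replace("_", "")), None
    return None
```

The sketch rejects 1. (a fraction needs at least one digit), 0x1.8 (the hexadecimal exponent is mandatory), and 42 (which is an integer literal, not a floating-point one).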

Literals of the Boolean Type

There are only two Boolean literals: true and false.

booleanLiteral
    : 'true'
    | 'false'
    ;

Literals of the String Type

String literals are classified into three types: single-line string literals, multi-line string literals, and multi-line raw string literals.

Single-line string literals are enclosed in a pair of single or double quotation marks. The content between the quotation marks can be any number of characters. To include the enclosing quotation mark or a backslash (\) in the string, precede it with a backslash (\). A single-line string literal cannot contain a newline character, so it cannot span multiple lines.

The syntax of single-line string literals is defined as follows:

lineStringLiteral
    : '"' (lineStringExpression | lineStringContent)* '"'
    ;
    
lineStringExpression
    : '${' SEMI* (expressionOrDeclaration (SEMI+ expressionOrDeclaration?)*) SEMI* '}'
    ;
    
lineStringContent
    : LineStrText
    ;
    
LineStrText
    : ~["\\\r\n]
    | EscapeSeq
    ;

A multi-line string literal must start and end with three double quotation marks (""") or three single quotation marks ('''). The content between them can be any number of characters. To include the enclosing three quotation marks (" or ') or a backslash (\) in the string, precede them with a backslash (\). A compilation error is reported if there is no newline character after the opening three quotation marks, or if no closing three unescaped quotation marks are encountered before the end of the current file. Unlike single-line string literals, multi-line string literals can span multiple lines.

The syntax of multi-line string literals is defined as follows:

multiLineStringLiteral
    : '"""' NL (multiLineStringExpression | multiLineStringContent)* '"""'
    ;
    
multiLineStringExpression
    : '${' end* (expressionOrDeclaration (end+ expressionOrDeclaration?)*) end* '}'
    ;
    
multiLineStringContent
    : MultiLineStrText
    ;

MultiLineStrText
    : ~('\\')
    | EscapeSeq
    ;

A multi-line raw string literal starts with one or more number signs (#) followed by a single quotation mark (') or double quotation mark ("), and ends with the same quotation mark followed by the same number of number signs. The literal content can be any number of valid characters. If no matching quotation mark followed by the same number of number signs is encountered before the current file ends, a compilation error is reported. Like multi-line string literals, multi-line raw string literals can span multiple lines. The difference is that escape rules do not apply to multi-line raw string literals: the content is kept exactly as written (escape sequences are not interpreted).

The syntax of multi-line raw string literals is defined as follows:

MultiLineRawStringLiteral
    : MultiLineRawStringContent
    ;

fragment MultiLineRawStringContent
    : '#' MultiLineRawStringContent '#' 
    | '#' '"' .*? '"' '#'
    ;
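
The termination rule for raw strings (the first quotation mark followed by the same number of # characters as at the start) can be sketched as a small scanner. The Python below is illustrative only. The function name scan_raw_string is hypothetical, and returning None stands in for the compilation error described above.

```python
def scan_raw_string(source, start=0):
    """Scan a raw string literal beginning at source[start].

    Returns (content, index just past the literal), or None if the text at
    start is not a terminated raw string literal.
    """
    pos = start
    while pos < len(source) and source[pos] == "#":
        pos += 1
    hashes = pos - start
    if hashes == 0 or pos >= len(source) or source[pos] not in ('"', "'"):
        return None
    quote = source[pos]
    # The closing delimiter mirrors the opening one: quote, then the hashes.
    closer = quote + "#" * hashes
    end = source.find(closer, pos + 1)
    if end == -1:
        return None  # unterminated: a compiler would report an error here
    # The content is taken verbatim; no escape processing is applied.
    return source[pos + 1:end], end + len(closer)
```

With two leading # characters, a single "# inside the literal does not end it, so ##"a"#b"## has the content a"#b.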

Literals of the Rune Type

A Rune literal starts with the character r, followed by a single-line string literal (with single or double quotation marks). The string literal must contain exactly one character. The syntax is as follows:

RuneLiteral
    : 'r' '\'' (SingleChar | EscapeSeq) '\''
    | 'r' '"' (SingleChar | EscapeSeq) '"'
    ;

fragment SingleChar
    : ~['\\\r\n]
    ;

EscapeSeq
    : UniCharacterLiteral
    | EscapedIdentifier
    ;

fragment UniCharacterLiteral
    : '\\' 'u' '{' HexadecimalDigit '}'
    | '\\' 'u' '{' HexadecimalDigit HexadecimalDigit '}'
    | '\\' 'u' '{' HexadecimalDigit HexadecimalDigit HexadecimalDigit '}'
    | '\\' 'u' '{' HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit '}'
    | '\\' 'u' '{' HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit '}'
    | '\\' 'u' '{' HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit '}'
    | '\\' 'u' '{' HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit '}'
    | '\\' 'u' '{' HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit '}'
    ;

fragment EscapedIdentifier
    : '\\' ('t' | 'b' | 'r' | 'n' | '\'' | '"' | '\\' | 'f' | 'v' | '0' | '$')
    ;
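
To make the EscapeSeq rules concrete, the following Python sketch decodes a single escape sequence into the character it denotes. It is an illustration, not Cangjie's implementation. The name decode_escape is hypothetical, and rejecting values above 0x10FFFF is an assumption of this sketch, since the grammar admits up to eight hexadecimal digits and this section does not say how larger values are handled.

```python
# One entry per alternative of the EscapedIdentifier rule.
SIMPLE_ESCAPES = {
    "t": "\t", "b": "\b", "r": "\r", "n": "\n", "'": "'", '"': '"',
    "\\": "\\", "f": "\f", "v": "\v", "0": "\0", "$": "$",
}
HEX_DIGIT_CHARS = "0123456789abcdefABCDEF"


def decode_escape(sequence):
    """Return the character denoted by one escape sequence, or None."""
    if len(sequence) == 2 and sequence[0] == "\\" and sequence[1] in SIMPLE_ESCAPES:
        return SIMPLE_ESCAPES[sequence[1]]
    if sequence.startswith("\\u{") and sequence.endswith("}"):
        digits = sequence[3:-1]
        # UniCharacterLiteral allows between one and eight hexadecimal digits.
        if 1 <= len(digits) <= 8 and all(ch in HEX_DIGIT_CHARS for ch in digits):
            code_point = int(digits, 16)
            if code_point <= 0x10FFFF:  # assumption: reject non-code-points
                return chr(code_point)
    return None
```

For example, the escape \u{4f60} decodes to the single character U+4F60, and \$ decodes to a literal dollar sign.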

Operators

The following table lists all operators supported by Cangjie in descending order of precedence (operators nearer the top have higher precedence). For details about each operator, see [Expressions].

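Operator    Description
@           Macro call expression
.           Member access
[]          Index access
()          Function call
++          Postfix increment
--          Postfix decrement
?           Question mark
!           Logic NOT
-           Unary negative
**          Power
*           Multiply
/           Divide
%           Remainder
+           Add
-           Subtract
<<          Bitwise left shift
>>          Bitwise right shift
..          Range operator
..=         Range operator
<           Less than
<=          Less than or equal
>           Greater than
>=          Greater than or equal
is          Type test
as          Type cast
==          Equal
!=          Not equal
&           Bitwise AND
^           Bitwise XOR
|           Bitwise OR
&&          Logic AND
||          Logic OR
??          Coalescing
|>          Pipeline
~>          Composition
=           Assignment
**=         Compound assignment
*=          Compound assignment
/=          Compound assignment
%=          Compound assignment
+=          Compound assignment
-=          Compound assignment
<<=         Compound assignment
>>=         Compound assignment
&=          Compound assignment
^=          Compound assignment
|=          Compound assignment
&&=         Compound assignment
||=         Compound assignment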

Comments

Cangjie supports the following comment formats:

A single-line comment starts with //. The syntax is as follows:

LineComment
    : '//' ~[\n\r]*
    ;

A multi-line comment is enclosed with /* and */ and supports nesting. The syntax is as follows:

DelimitedComment
    : '/*' ( DelimitedComment | . )*? '*/'
    ;
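
Because DelimitedComment refers to itself, a scanner needs a nesting depth rather than a simple search for the first */. The Python sketch below illustrates that idea. The function name scan_delimited_comment is hypothetical, and returning None stands in for whatever error an unterminated comment produces.

```python
def scan_delimited_comment(source, start=0):
    """Return the index just past a nested /* ... */ comment, or None."""
    if not source.startswith("/*", start):
        return None
    depth = 0
    pos = start
    while pos < len(source):
        if source.startswith("/*", pos):
            depth += 1  # entering a (possibly nested) comment
            pos += 2
        elif source.startswith("*/", pos):
            depth -= 1  # leaving the innermost open comment
            pos += 2
            if depth == 0:
                return pos
        else:
            pos += 1
    return None  # ran out of input with a comment still open
```

Given the text /* a /* b */ c */ x, the scan stops after the second */, so the whole nested comment is consumed and only " x" remains.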