Lexical Structure
This chapter describes the lexical structure of the Cangjie programming language. For the complete BNF of the lexicon and syntax, see "Cangjie Syntax."
Note: To improve the readability of the document, the syntax definition in the text body is slightly different from that in the appendix. In the text body, symbols and keywords are replaced with their literal representations (rather than names in the lexical structure).
Identifiers and Keywords
Identifiers are classified into common identifiers and raw identifiers. A common identifier either starts with an XID_Start character followed by any number of XID_Continue characters, or starts with an underscore (ASCII character _) followed by at least one XID_Continue character; Cangjie keywords are excluded. XID_Start and XID_Continue are properties defined in the Unicode Standard, as detailed in [Unicode Standard Annex #31](https://www.unicode.org/reports/tr31/tr31-37.html). The Unicode version currently used by Cangjie is 15.0.0. A raw identifier is a common identifier enclosed in a pair of backquotes (``). Keywords can also be used within the backquotes.
Identifier
: Ident
| RawIdent
;
fragment RawIdent
: '`' Ident '`'
;
fragment Ident
: XID_Start XID_Continue*
| '_' XID_Continue+
;
In Cangjie, all identifiers are recognized in Normalization Form C (NFC). If two identifiers are equal after normalization to NFC, they are considered to be the same identifier. For the definition of NFC, see Unicode Standard Annex #15 (tr15).
For example, the following are some valid Cangjie identifiers:
foo
_bar
Cangjie
`if`
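The identifiers above can be seen in use in the following minimal sketch. The variable names and values are illustrative only, and the sketch assumes that a raw identifier may name a local variable like any other identifier.

```cangjie
main() {
    let foo = 1        // common identifier starting with an XID_Start character
    let _bar = 2       // common identifier starting with an underscore
    let `if` = 3       // raw identifier: the keyword if enclosed in backquotes
    println(foo + _bar + `if`)
}
```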
Keywords are special character strings that cannot be used as identifiers. The following table lists the Cangjie keywords.
Keyword | ||
---|---|---|
as | break | Bool |
case | catch | class |
const | continue | Rune |
do | else | enum |
extend | for | from |
func | false | finally |
foreign | Float16 | Float32 |
Float64 | if | in |
is | init | inout |
import | interface | Int8 |
Int16 | Int32 | Int64 |
IntNative | let | mut |
main | macro | match |
Nothing | operator | prop |
package | quote | return |
spawn | super | static |
struct | synchronized | try |
this | true | type |
throw | This | unsafe |
Unit | UInt8 | UInt16 |
UInt32 | UInt64 | UIntNative |
var | VArray | where |
while |
Contextual keywords are special strings that can be used as identifiers. They act as keywords in certain syntactic positions, but can be used as common identifiers elsewhere.
Contextual Keyword | ||
---|---|---|
abstract | open | override |
private | protected | public |
redef | get | set |
sealed |
Semicolons and Newline Characters
There are two symbols that can indicate the end of an expression or declaration: a semicolon (;) and a newline character. The meaning of ; is fixed. It indicates the end of an expression or declaration regardless of its position, and multiple expressions or declarations can be written on the same line, separated by ;. However, the meaning of a newline character is not fixed. Depending on its position, it can act as a separator between two tokens, like a space character, or indicate the end of an expression or declaration, like a ;.
A newline character can be used between any two tokens. Generally, the "longest match" principle (using as many tokens as possible to form a valid expression or declaration) is followed to determine whether to treat the newline character as the separator between tokens or as the terminator of the expression or declaration. A newline character encountered before the "longest match" ends is treated as the separator between tokens, and one encountered after the "longest match" ends is treated as the terminator of the expression or declaration. The following are examples:
let width1: Int32 = 32 // The newline character is treated as a terminator.
let length1: Int32 = 1024 // The newline character is treated as a terminator.
let width2: Int32 = 32; let length2: Int32 = 1024 // The newline character is treated as a terminator.
var x = 100 + // The newline character is treated as a separator.
200 * 300 - // The newline character is treated as a separator.
50 // The newline character is treated as a terminator.
However, the "longest match" principle does not apply in three cases: positions where a newline character cannot be used as the separator between two tokens, string literals, and multi-line comments. A newline character cannot be used as the separator between two tokens in the following scenarios:
- Between a unary operator and its operand.
- In a call expression, between ( and the token preceding it.
- In an index access expression, between [ and the token preceding it.
- In a constant pattern, between $ and the identifier following it.
Note: These rules mean only that a newline character in these positions is not treated as the separator between two tokens. They do not mean that a newline character cannot be used there. If a newline character is used, it is directly treated as the terminator of the expression or declaration.
The "longest match" principle does not apply to string literals and multi-line comments:
- For a single-line string, the matching ends when the first unescaped double quotation mark is encountered.
- For a multi-line string, the matching ends when three unescaped double quotation marks are first encountered.
- For a multi-line raw string, the matching ends when a quotation mark followed by the same number of number signs (#) as at the beginning is first encountered.
- For a multi-line comment, the matching ends when the first */ is encountered.
Literals
A literal is an expression that represents a value that cannot be modified.
Literals also have types. In Cangjie, the types of literals include integer, floating-point, Rune, Boolean, and string. The syntax of literals is as follows:
literalConstant
: IntegerLiteral
| FloatLiteral
| RuneLiteral
| booleanLiteral
| stringLiteral
;
stringLiteral
: lineStringLiteral
| multiLineStringLiteral
| MultiLineRawStringLiteral
;
Literals of the Integer Type
An integer literal can be expressed using four number notations: binary (using the 0b or 0B prefix), octal (using the 0o or 0O prefix), decimal (without a prefix), and hexadecimal (using the 0x or 0X prefix). In addition, you can add an optional suffix to specify the type of the integer literal.
The syntax of the literal of the integer type is defined as follows:
IntegerLiteralSuffix
: 'i8' |'i16' |'i32' |'i64' |'u8' |'u16' |'u32' | 'u64'
;
IntegerLiteral
: BinaryLiteral IntegerLiteralSuffix?
| OctalLiteral IntegerLiteralSuffix?
| DecimalLiteral '_'* IntegerLiteralSuffix?
| HexadecimalLiteral IntegerLiteralSuffix?
;
BinaryLiteral
: '0' [bB] BinDigit (BinDigit | '_')*
;
BinDigit
: [01]
;
OctalLiteral
: '0' [oO] OctalDigit (OctalDigit | '_')*
;
OctalDigit
: [0-7]
;
DecimalLiteral
: ([1-9] (DecimalDigit | '_')*) | DecimalDigit
;
DecimalDigit
: [0-9]
;
HexadecimalLiteral
: '0' [xX] HexadecimalDigits
;
HexadecimalDigits
: HexadecimalDigit (HexadecimalDigit | '_')*
;
HexadecimalDigit
: [0-9a-fA-F]
;
The mappings between suffixes and types for IntegerLiteralSuffix are as follows:
Suffix | Type | Suffix | Type |
---|---|---|---|
i8 | Int8 | u8 | UInt8 |
i16 | Int16 | u16 | UInt16 |
i32 | Int32 | u32 | UInt32 |
i64 | Int64 | u64 | UInt64 |
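As an illustration of the notations and suffixes above, the following minimal sketch writes the value 24 in each notation and then adds type suffixes. The variable names are illustrative only.

```cangjie
main() {
    let bin = 0b0001_1000      // binary notation for 24, with _ as a digit separator
    let oct = 0o30             // octal notation for 24
    let dec = 24               // decimal notation, no prefix
    let hex = 0x18             // hexadecimal notation for 24
    let million = 1_000_000    // underscores may separate decimal digits
    let small = 100i8          // the suffix i8 gives the literal the type Int8
    let wide = 0x10u64         // the suffix u64 gives the literal the type UInt64
    println(bin == oct && dec == hex)
}
```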
Literals of the Floating-Point Type
A floating-point literal can be expressed in two formats: decimal (without a prefix) and hexadecimal (with a 0x or 0X prefix). A decimal floating-point number must contain the integer part, the fractional part (including the decimal point), or both. If there is no fractional part, the exponent part (with an e or E prefix) is required. A hexadecimal floating-point number must contain the integer part, the fractional part (including the decimal point), or both, and the exponent part (with a p or P prefix) is required. In addition, you can add an optional suffix to specify the type of the floating-point literal.
The syntax of the literal of the floating-point type is defined as follows:
FloatLiteralSuffix
: 'f16' | 'f32' | 'f64'
;
FloatLiteral
: (DecimalLiteral DecimalExponent | DecimalFraction DecimalExponent? | (DecimalLiteral DecimalFraction) DecimalExponent?) FloatLiteralSuffix?
| (Hexadecimalprefix (HexadecimalDigits | HexadecimalFraction | (HexadecimalDigits HexadecimalFraction)) HexadecimalExponent)
;
DecimalFraction
: '.' DecimalFragment
;
DecimalFragment
: DecimalDigit (DecimalDigit | '_')*
;
DecimalExponent
: FloatE Sign? DecimalFragment
;
HexadecimalFraction
: '.' HexadecimalDigits
;
HexadecimalExponent
: FloatP Sign? DecimalFragment
;
FloatE
: [eE]
;
FloatP
: [pP]
;
Sign
: [-]
;
Hexadecimalprefix
: '0' [xX]
;
The mappings between suffixes and types for FloatLiteralSuffix are as follows:
Suffix | Type |
---|---|
f16 | Float16 |
f32 | Float32 |
f64 | Float64 |
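The following minimal sketch shows one literal for each form that the grammar above allows. The variable names are illustrative only, and no particular printed format is assumed for the value passed to println.

```cangjie
main() {
    let a = 3.14       // integer part and fractional part
    let b = 2e3        // no fractional part, so the exponent part is required
    let c = 2.4e-1     // fractional part with a negative exponent
    let d = .123e2     // fractional part only, followed by an exponent
    let e = 3.14f32    // the suffix f32 gives the literal the type Float32
    let f = 0x1.1p0    // hexadecimal: the p exponent is always required
    let g = 0x1p2      // hexadecimal with an integer part only
    let h = 0x.2p4     // hexadecimal with a fractional part only
    println(g)
}
```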
Literals of the Boolean Type
There are only two Boolean literals: true and false.
booleanLiteral
: 'true'
| 'false'
;
Literals of the String Type
String literals are classified into three types: single-line string literals, multi-line string literals, and multi-line raw string literals.
Single-line string literals are enclosed in a pair of single or double quotation marks. The content between the quotation marks can be any number of characters. To include a quotation mark or a backslash (\) as part of the string, add a backslash (\) before it. A single-line string literal cannot contain a line break and therefore cannot span multiple lines.
The syntax of single-line string literals is defined as follows:
lineStringLiteral
: '"' (lineStringExpression | lineStringContent)* '"'
;
lineStringExpression
: '${' SEMI* (expressionOrDeclaration (SEMI+ expressionOrDeclaration?)*) SEMI* '}'
;
lineStringContent
: LineStrText
;
LineStrText
: ~["\\\r\n]
| EscapeSeq
;
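The following minimal sketch illustrates single-line string literals: both kinds of quotation marks, escaped characters, and a ${...} expression embedded as described by lineStringExpression. The variable names and contents are illustrative only.

```cangjie
main() {
    let name = "Cangjie"                          // double quotation marks
    let alt = 'single quotation marks also work'  // single quotation marks
    let escaped = "a \"quoted\" word and a backslash: \\"
    let greeting = "Hello, ${name}!"              // the expression in ${...} is embedded in the string
    println(alt)
    println(escaped)
    println(greeting)
}
```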
A multi-line string literal must start and end with three double quotation marks (""") or three single quotation marks ('''). The content between the quotation marks can be any number of characters. To include the three quotation marks (" or ') used to enclose the string, or a backslash (\), as part of the string, you must add a backslash (\) before them. If there is no newline character after the three opening quotation marks, or if the three closing unescaped quotation marks are not encountered before the end of the current file, a compilation error is reported. Unlike single-line string literals, multi-line string literals can span multiple lines.
The syntax of multi-line string literals is defined as follows:
multiLineStringLiteral
: '"""' NL (multiLineStringExpression | multiLineStringContent)* '"""'
;
multiLineStringExpression
: '${' end* (expressionOrDeclaration (end+ expressionOrDeclaration?)*) end* '}'
;
multiLineStringContent
: MultiLineStrText
;
MultiLineStrText
: ~('\\')
| EscapeSeq
;
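A minimal sketch of a multi-line string literal follows. The opening quotation marks are followed directly by a newline, as the grammar requires, and the content starts on the next line. The variable name and text are illustrative only.

```cangjie
main() {
    let poem = """
Roses are red,
violets are blue."""
    println(poem)
}
```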
A multi-line raw string literal starts and ends with one or more number signs (#) and one single quotation mark (') or double quotation mark ("). The number of number signs and the kind of quotation mark at the end must be the same as those at the beginning. The literal content can be any number of valid characters. If the matching quotation mark followed by the same number of number signs is not encountered before the current file ends, a compilation error is reported. Like multi-line string literals, a multi-line raw string literal can span multiple lines. The difference is that escape rules do not apply to multi-line raw string literals: the content of the literal is kept exactly as written (escape sequences are not interpreted).
The syntax of multi-line raw string literals is defined as follows:
MultiLineRawStringLiteral
: MultiLineRawStringContent
;
fragment MultiLineRawStringContent
: '#' MultiLineRawStringContent '#'
| '#' '"' .*? '"' '#'
;
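The following minimal sketch illustrates raw string literals with one and with two number signs. The variable names and contents are illustrative only.

```cangjie
main() {
    let path = #"C:\temp\new"#                // \t and \n are kept as written, not interpreted
    let quote = ##"He said "hi" and left"##   // two number signs open the literal, two close it
    println(path)
    println(quote)
}
```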
Literals of the Rune Type
A Rune literal starts with the character r, followed by a single-line string literal (with single or double quotation marks). The string literal must contain exactly one character. The syntax is as follows:
RuneLiteral
: 'r' '\'' (SingleChar | EscapeSeq) '\''
| 'r' '"' (SingleChar | EscapeSeq) '"'
;
fragment SingleChar
: ~['\\\r\n]
;
EscapeSeq
: UniCharacterLiteral
| EscapedIdentifier
;
fragment UniCharacterLiteral
: '\\' 'u' '{' HexadecimalDigit '}'
| '\\' 'u' '{' HexadecimalDigit HexadecimalDigit '}'
| '\\' 'u' '{' HexadecimalDigit HexadecimalDigit HexadecimalDigit '}'
| '\\' 'u' '{' HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit '}'
| '\\' 'u' '{' HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit '}'
| '\\' 'u' '{' HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit '}'
| '\\' 'u' '{' HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit '}'
| '\\' 'u' '{' HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit HexadecimalDigit '}'
;
fragment EscapedIdentifier
: '\\' ('t' | 'b' | 'r' | 'n' | '\'' | '"' | '\\' | 'f' | 'v' | '0' | '$')
;
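A minimal sketch of Rune literals in each form allowed by the grammar above: a single character in single or double quotation marks, an escaped character, and a Unicode escape. The variable names are illustrative only.

```cangjie
main() {
    let a: Rune = r'a'          // single character in single quotation marks
    let b: Rune = r"b"          // single character in double quotation marks
    let nl: Rune = r'\n'        // escaped character
    let ni: Rune = r'\u{4f60}'  // Unicode escape with one to eight hexadecimal digits
    println(a)
}
```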
Operators
The following table lists all operators supported by Cangjie, in descending order of precedence: operators nearer the top bind more tightly. For details about each operator, see [Expressions].
Operator | Description |
---|---|
@ | Macro call expression |
. | Member access |
[] | Index access |
() | Function call |
++ | Postfix increment |
-- | Postfix decrement |
? | Question mark |
! | Logic NOT |
- | Unary negative |
** | Power |
* | Multiply |
/ | Divide |
% | Remainder |
+ | Add |
- | Subtract |
<< | Bitwise left shift |
>> | Bitwise right shift |
.. | Range operator |
..= | |
< | Less than |
<= | Less than or equal |
> | Greater than |
>= | Greater than or equal |
is | Type test |
as | Type cast |
== | Equal |
!= | Not equal |
& | Bitwise AND |
^ | Bitwise XOR |
\| | Bitwise OR |
&& | Logic AND |
\|\| | Logic OR |
?? | Coalescing |
\|> | Pipeline |
~> | Composition |
= | Assignment |
**= | Compound assignment |
*= | |
/= | |
%= | |
+= | |
-= | |
<<= | |
>>= | |
&= | |
^= | |
\|= | |
&&= | |
\|\|= | |
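As a small illustration of how the precedence order in the table groups operands, consider the following sketch. The variable names and values are illustrative only, and the groupings in the comments follow from the table.

```cangjie
main() {
    let a = 2 + 3 * 4          // * is above +, so this is 2 + (3 * 4), that is, 14
    let b = 1 << 2 + 1         // + is above <<, so this is 1 << (2 + 1), that is, 8
    let c = a > 10 && b == 8   // > and == are above &&, so this is (a > 10) && (b == 8)
    println(a)
    println(b)
    println(c)
}
```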
Comments
Cangjie supports the following comment formats:
A single-line comment starts with //. The syntax is as follows:
LineComment
: '//' ~[\n\r]*
;
A multi-line comment is enclosed in /* and */ and supports nesting. The syntax is as follows:
DelimitedComment
: '/*' ( DelimitedComment | . )*? '*/'
;
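Both comment formats, including a nested multi-line comment, are shown in the following minimal sketch. The comment text is illustrative only.

```cangjie
/* Outer comment
   /* nested comment */
   still inside the outer comment */
main() {
    // A single-line comment runs to the end of the line.
    println("comments are ignored") /* a multi-line comment can also sit on one line */
}
```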