std.regex Package

Function Description

The regex package provides the capability of analyzing and processing text (ASCII character strings only) through regular expressions, and supports functions such as search, segmentation, replacement, and verification.
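The four operations named above can be illustrated with a short sketch. It uses Python's `re` module purely as a stand-in, because this section does not show the corresponding Cangjie calls. The function names below belong to Python, not to std.regex, and the sample text is made up for illustration.

```python
import re

text = "id=42, id=7, id=300"   # hypothetical ASCII input

# Search: find the first run of digits.
first = re.search(r"\d+", text)
first_number = first.group(0)

# Segmentation: split on a comma followed by optional whitespace.
fields = re.split(r",\s*", text)

# Replacement: mask every run of digits.
masked = re.sub(r"\d+", "#", text)

# Verification: check that a whole string has the expected shape.
is_valid = re.fullmatch(r"id=\d+", "id=42") is not None
is_invalid = re.fullmatch(r"id=\d+", "id=42x") is not None
```

In the Cangjie package, the classes listed in the API List below (Regex, Matcher, MatchData) fill these roles. Consult their own reference pages for the actual signatures.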

regex Rule Set

Cangjie regular expressions currently support only the following rules. Using an unsupported rule produces unexpected results.

Character	Description
`\`	Marks the next character as a special character (File Format Escape, listed in this table), a literal character (Identity Escape, 12 characters in total: `^ $ ( ) * + ? . [ { \ |`), or a backreference. For example, `n` matches the character `n`, while `\n` matches a line feed. The sequence `\\` matches `\`, and `\(` matches `(`.
`^`	Matches the start position of an input string. If multiLine() in RegexOption is selected, ^ also matches the position following `\n` or `\r`.
`$`	Matches the end position of an input string.
`*`	Matches the preceding subexpression zero or more times. For example, `zo*` can match `z`, `zo`, and `zoo`. `*` is equivalent to `{0,}`.
`+`	Matches the preceding subexpression one or more times. For example, `zo+` can match `zo` and `zoo`, but cannot match `z`. `+` is equivalent to `{1,}`.
`?`	Matches the preceding subexpression zero times or once. For example, `do(es)?` can match `do` or `does`. `?` is equivalent to `{0,1}`.
`{n}`	n is a non-negative integer, and matches exactly n times. For example, `o{2}` cannot match `o` in `Bob`, but can match two letters `o` in `food`.
`{n,}`	n is a non-negative integer, and matches at least n times. For example, `o{2,}` cannot match `o` in `Bob`, but can match all letters `o` in `foooood`. `o{1,}` is equivalent to `o+`, and `o{0,}` is equivalent to `o*`.
`{n,m}`	m and n are non-negative integers, where n ≤ m. Matches at least n times and at most m times. For example, `o{1,3}` matches the first three letters `o` in `fooooood`. `o{0,1}` is equivalent to `o?`. Note that there is no space between the comma and the two numbers.
`?`	Non-greedy quantifier: when this character follows any other quantifier (`*`, `+`, `?`, `{n}`, `{n,}`, `{n,m}`), the match is non-greedy. A non-greedy match consumes as few characters as possible, while the default greedy match consumes as many as possible. For example, in the string `oooo`, `o+?` matches a single letter `o`, while `o+` matches all letters `o`.
`.`	Matches any single character except `\n`. To match any character including `\n`, use a pattern such as `(.|\n)`.
`(pattern)`	Matches a pattern and captures the matched substring. The substring is used for backreference and can be obtained from the generated set of matches. To match a literal parenthesis character, use `\(` or `\)`. A quantity suffix is allowed.
`x|y`	Matches x or y. When not enclosed in parentheses, the alternation applies to the entire regular expression. For example, `z|food` matches `z` or `food`, and `(?:z|f)ood` matches `zood` or `food`.
`[xyz]`	Specifies a character set (character class), and matches any character contained. For example, `[abc]` can match `a` in `plain`. Among special characters, only the backslash (\) retains its special meaning and is used as an escape character. Other special characters, such as an asterisk, a plus sign, and brackets, are all used as normal characters. A caret (^) denotes a negated character set if it appears at the beginning of the set, and it is used as a normal character if it appears in the middle. A hyphen (-) denotes a character range if it appears in the middle of the set, and it is used only as a normal character if it appears at the beginning (or end). A right square bracket can be included either escaped or as the first character of the set.
`[^xyz]`	Specifies a negated character set (negated character class), and matches any character not listed. For example, `[^abc]` can match `p`, `l`, `i`, and `n` in `plain`.
`[a-z]`	Specifies a character range, and matches any character within the specified range. For example, `[a-z]` can match any lowercase letter from `a` to `z`.
`[^a-z]`	Specifies a range of negated characters, and matches any character outside the specified range. For example, `[^a-z]` can match any character not in the range from `a` to `z`.
`\b`	Matches a word boundary, that is, the position between a word and a space. For example, `er\b` can match `er` in `never`, but cannot match `er` in `verb`.
`\B`	Matches a non-word boundary. `er\B` can match `er` in `verb`, but cannot match `er` in `never`.
`\d`	Matches a digit character. It is equivalent to `[0-9]`.
`\D`	Matches a non-digit character. It is equivalent to `[^0-9]`.
`\f`	Matches a form feed. It is equivalent to `\x0c`.
`\n`	Matches a line feed. It is equivalent to `\x0a`.
`\r`	Matches a carriage return character. It is equivalent to `\x0d`.
`\s`	Matches any whitespace character, including a space, a tab character, and a form feed. It is equivalent to `[ \f\n\r\t\v]`.
`\S`	Matches any non-whitespace character. It is equivalent to `[^ \f\n\r\t\v]`.
`\t`	Matches a tab character. It is equivalent to `\x09`.
`\v`	Matches `\n\v\f\r\x85`.
`\w`	Matches any word character including the underscore. It is equivalent to `[A-Za-z0-9_]`.
`\W`	Matches any non-word character. It is equivalent to `[^A-Za-z0-9_]`.
`\xnm`	Specifies a hexadecimal escape character sequence, and matches characters represented by two hexadecimal digits nm. For example, `\x41` matches `A`. ASCII codes can be used in regular expressions.
`\num`	Backreferences a substring. The substring is the one matched by the (num)th capture group, that is, the (num)th subexpression enclosed in parentheses in the regular expression. num is a positive decimal integer starting from 1, and the upper limit of capture groups in Regex is 63. For example, `(.)\1` matches two consecutive identical characters.
`(?:pattern)`	Matches a pattern but does not obtain the matched substring (shy group). In other words, it is a non-capturing match and the matched substring is not stored for backreference. This is helpful when the OR character `|` is used to combine parts of a pattern.
`(?=pattern)`	Specifies a positive lookahead assertion. The match succeeds only at a position where the text that follows matches the pattern. This is a non-capturing match: the asserted text is not stored for later use. For example, `Windows(?=95|98|NT|2000)` can match `Windows` in `Windows2000`, but cannot match `Windows` in `Windows3.1`. An assertion does not consume characters: after a match occurs, the next search starts immediately after the last match, not after the asserted characters.
`(?!pattern)`	Specifies a negative lookahead assertion. The match succeeds only at a position where the text that follows does not match the pattern. This is a non-capturing match: the asserted text is not stored for later use. For example, `Windows(?!95|98|NT|2000)` can match `Windows` in `Windows3.1`, but cannot match `Windows` in `Windows2000`. An assertion does not consume characters: after a match occurs, the next search starts immediately after the last match, not after the asserted characters.
`(?<=pattern)`	Specifies a positive lookbehind assertion. It is similar to the positive lookahead assertion but in the opposite direction. For example, `(?<=95|98|NT|2000)Windows` can match `Windows` in `2000Windows`, but cannot match `Windows` in `3.1Windows`.
`(?<!pattern)`	Specifies a negative lookbehind assertion. It is similar to the negative lookahead assertion but in the opposite direction. For example, `(?<!95|98|NT|2000)Windows` can match `Windows` in `3.1Windows`, but cannot match `Windows` in `2000Windows`.
`(?i)`	Inline flag that requests case-insensitive matching. Currently, Regex supports global case-insensitivity only: if this flag is specified, it applies to the entire expression.
`(?-i)`	Inline flag that requests case-sensitive matching. Regex is case-sensitive by default, so this flag is accepted for compilation compatibility and does not change case sensitivity.
`+`	Specifies a separate plus sign rather than an escaped `\+`.
`*`	Specifies a separate asterisk rather than an escaped `\*`.
`-`	Specifies a separate minus sign, rather than an escaped `\-`.
`]`	Specifies a separate right square bracket rather than an escaped `\]`.
`}`	Specifies a separate right curly bracket rather than an escaped `\}`.
`[[:alpha:]]`	Specifies any uppercase or lowercase letter.
`[[:^alpha:]]`	Specifies any character except uppercase and lowercase letters.
`[[:lower:]]`	Specifies any lowercase letter.
`[[:^lower:]]`	Specifies any character except lowercase letters.
`[[:upper:]]`	Specifies any uppercase letter.
`[[:^upper:]]`	Specifies any character except uppercase letters.
`[[:digit:]]`	Specifies any single digit from 0 to 9.
`[[:^digit:]]`	Specifies any character except a single digit from 0 to 9.
`[[:xdigit:]]`	Specifies hexadecimal letters and digits.
`[[:^xdigit:]]`	Specifies any character except hexadecimal letters and digits.
`[[:alnum:]]`	Specifies any digit or letter.
`[[:^alnum:]]`	Specifies any character except digits or letters.
`[[:space:]]`	Specifies any whitespace character, including a space and a tab.
`[[:^space:]]`	Specifies any character except whitespace characters.
`[[:punct:]]`	Specifies any punctuation mark.
`[[:^punct:]]`	Specifies any character except punctuation marks.
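Several of the examples in the table can be checked with a short sketch. It again uses Python's `re` module as a stand-in, not the Cangjie Regex class, and covers only rules on which the two syntaxes agree for these inputs. Variable names are illustrative. Behavior in Cangjie may differ for rules not shown here (for instance, Python does not support the `[[:alpha:]]` style classes).

```python
import re

# Greedy vs. non-greedy quantifiers on the string "oooo".
greedy = re.match(r"o+", "oooo").group(0)    # takes all four letters
lazy = re.match(r"o+?", "oooo").group(0)     # takes a single letter

# Bounded repetition: o{1,3} takes at most three letters.
bounded = re.search(r"o{1,3}", "fooooood").group(0)

# Word boundaries: er\b matches at the end of "never" but not inside "verb".
er_at_end = re.search(r"er\b", "never") is not None
er_inside = re.search(r"er\b", "verb") is not None
er_non_boundary = re.search(r"er\B", "verb") is not None

# Negated character set: each character outside the set matches on its own.
not_abc = re.findall(r"[^abc]", "plain")

# Backreference: (.)\1 matches two consecutive identical characters.
doubled = re.search(r"(.)\1", "letter").group(0)

# Lookahead assertions constrain the match without consuming the asserted text.
pos = re.search(r"Windows(?=95|98|NT|2000)", "Windows2000")
neg = re.search(r"Windows(?!95|98|NT|2000)", "Windows2000")
neg_ok = re.search(r"Windows(?!95|98|NT|2000)", "Windows3.1")
```

Here `pos` covers only the seven characters of `Windows`. The `2000` that satisfied the assertion is not part of the match, which is what "does not consume characters" means in the table.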

Cangjie also has other special rules:

- Unquantifiable characters before `?`, `+`, and `*` are ignored. Exception: `*` is treated as an ordinary character when the pattern begins with `(*`, `|*`, or `*`.
- When `*?` is matched against a string made up entirely of the character that precedes `*?`, that character is not matched.
- The maximum number of capture groups in a regular expression is 63, and the maximum length of a compiled rule is 65535.
- The `((pattern1){m1,n1}pattern2){m2,n2}` scenario is not supported currently. That is, the case where all of the following hold:
  - Group definition 1 is modified by `{m1,n1}`
  - Group definition 1 is wrapped by group definition 2
  - Group definition 2 is modified by `{m2,n2}`

API List

Class

Name	Description
Matcher	Specifies a regular expression matcher, used to scan an input sequence for matching.
MatchData	Stores regular expression matching results, and provides functions for querying the regular expression matching results.
Regex	Specifies the compilation type and input sequence.
RegexOption	Specifies the regular expression matching pattern.

Struct

Struct Name	Description
Position	Stores position information, indicating a range with a closed starting point and an open endpoint.
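A range with a closed starting point and an open endpoint means the start index is part of the match and the end index is the first index after it. The sketch below shows the same convention using Python's match spans as a stand-in. The member names of the Cangjie Position struct are not listed in this section, so a plain `(start, end)` tuple is assumed here.

```python
import re

text = "abc123def"                 # hypothetical input
m = re.search(r"\d+", text)

# Half-open range: index `start` is included, index `end` is not.
start, end = m.span()
matched = text[start:end]          # slicing uses the same half-open convention
length = end - start               # length follows directly from the two bounds
```

With this convention, the length of a match is simply the end minus the start, and an empty match has equal start and end.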

Exception Class

Name	Description
RegexException	Provides regex-related exception processing.