std.regex Package

Function Description

The regex package analyzes and processes text (ASCII character strings only) through regular expressions. It supports search, segmentation, replacement, and verification.

regex Rule Set

Cangjie regular expressions currently support only the following rules. Using an unsupported rule produces unexpected results.

| Character | Description |
| --- | --- |
| `\` | Marks the next character as a special character (file format escape, listed in this table), a literal character (identity escape, 12 characters in total: `^ $ ( ) * + ? . [ { \ \|`), or a backreference. For example, `n` matches the character `n`, while `\n` matches a line feed. The sequence `\\` matches `\`, and `\(` matches `(`. |
| `^` | Matches the start position of the input string. If `multiLine()` in `RegexOption` is selected, `^` also matches the position following `\n` or `\r`. |
| `$` | Matches the end position of the input string. |
| `*` | Matches the preceding subexpression zero or more times. For example, `zo*` can match `z`, `zo`, and `zoo`. `*` is equivalent to `{0,}`. |
| `+` | Matches the preceding subexpression one or more times. For example, `zo+` can match `zo` and `zoo`, but cannot match `z`. `+` is equivalent to `{1,}`. |
| `?` | Matches the preceding subexpression zero times or once. For example, `do(es)?` can match `do` or `does`. `?` is equivalent to `{0,1}`. |
| `{n}` | `n` is a non-negative integer. Matches exactly `n` times. For example, `o{2}` cannot match the `o` in `Bob`, but can match the two letters `o` in `food`. |
| `{n,}` | `n` is a non-negative integer. Matches at least `n` times. For example, `o{2,}` cannot match the `o` in `Bob`, but can match all letters `o` in `foooood`. `o{1,}` is equivalent to `o+`, and `o{0,}` is equivalent to `o*`. |
| `{n,m}` | `m` and `n` are non-negative integers, where n ≤ m. Matches at least `n` times and at most `m` times. For example, `o{1,3}` matches the first three letters `o` in `fooooood`. `o{0,1}` is equivalent to `o?`. Note that there must be no space between the comma and the two numbers. |
| `?` | Non-greedy quantifier. When this character follows any other quantifier (`*`, `+`, `?`, `{n}`, `{n,}`, `{n,m}`), matching is non-greedy. Non-greedy matching consumes as few characters as possible, whereas the default greedy matching consumes as many as possible. For example, in the string `oooo`, `o+?` matches a single `o`, while `o+` matches all four. |
| `.` | Matches any single character except `\n`. To match any character including `\n`, use a pattern such as `(.\|\n)`. |
| `(pattern)` | Matches the pattern and captures the matched substring, which can be used for backreferences and retrieved from the resulting set of matches. To match literal parentheses, use `\(` or `\)`. A quantifier suffix is allowed. |
| `x\|y` | Alternation. When not enclosed in parentheses, its scope is the entire regular expression. For example, `z\|food` matches `z` or `food`, while `(?:z\|f)ood` matches `zood` or `food`. |
| `[xyz]` | Specifies a character set (character class) and matches any one of the characters it contains. For example, `[abc]` can match the `a` in `plain`. Inside a set, only the backslash (`\`) keeps its special meaning as the escape character. Other special characters, such as the asterisk, plus sign, and brackets, are treated as ordinary characters. A caret (`^`) negates the set if it appears at the beginning of the set, and is an ordinary character elsewhere. A hyphen (`-`) denotes a character range if it appears in the middle of the set, and is an ordinary character at the beginning or end. A right square bracket must be escaped or placed as the first character. |
| `[^xyz]` | Specifies a negated character set (negated character class) and matches any character not listed. For example, `[^abc]` can match `p`, `l`, `i`, and `n` in `plain`. |
| `[a-z]` | Specifies a character range and matches any character within it. For example, `[a-z]` can match any lowercase letter from `a` to `z`. |
| `[^a-z]` | Specifies a negated character range and matches any character outside it. For example, `[^a-z]` can match any character not in the range from `a` to `z`. |
| `\b` | Matches a word boundary, that is, the position between a word and a space. For example, `er\b` can match the `er` in `never`, but cannot match the `er` in `verb`. |
| `\B` | Matches a non-word boundary. For example, `er\B` can match the `er` in `verb`, but cannot match the `er` in `never`. |
| `\d` | Matches a digit character. Equivalent to `[0-9]`. |
| `\D` | Matches a non-digit character. Equivalent to `[^0-9]`. |
| `\f` | Matches a form feed. Equivalent to `\x0c`. |
| `\n` | Matches a line feed. Equivalent to `\x0a`. |
| `\r` | Matches a carriage return. Equivalent to `\x0d`. |
| `\s` | Matches any whitespace character, including a space, a tab character, and a form feed. Equivalent to `[ \f\n\r\t\v]`. |
| `\S` | Matches any non-whitespace character. Equivalent to `[^ \f\n\r\t\v]`. |
| `\t` | Matches a tab character. Equivalent to `\x09`. |
| `\v` | Matches `\n\v\f\r\x85`. |
| `\w` | Matches any word character, including the underscore. Equivalent to `[A-Za-z0-9_]`. |
| `\W` | Matches any non-word character. Equivalent to `[^A-Za-z0-9_]`. |
| `\xnm` | Specifies a hexadecimal escape sequence and matches the character represented by the two hexadecimal digits `nm`. For example, `\x41` matches `A`. This allows ASCII codes to be used in regular expressions. |
| `\num` | Backreferences a substring: matches the same text as the `num`th capture group (subexpression enclosed in parentheses) in the regular expression. `num` is a positive decimal integer starting from 1, and a Regex can contain at most 63 capture groups. For example, `(.)\1` matches two consecutive identical characters. |
| `(?:pattern)` | Matches the pattern but does not capture the matched substring (shy group). The matched substring is not stored for backreference. This is useful when the OR character (`\|`) is used to combine parts of a pattern. |
| `(?=pattern)` | Positive lookahead assertion. Matches at any position where a string matching `pattern` begins. This is a non-capturing match, so the match is not stored for later use. For example, `Windows(?=95\|98\|NT\|2000)` can match `Windows` in `Windows2000`, but cannot match `Windows` in `Windows3.1`. An assertion does not consume characters: after a match, the next search starts immediately after the last match, not after the characters covered by the assertion. |
| `(?!pattern)` | Negative lookahead assertion. Matches at any position where a string that does not match `pattern` begins. This is a non-capturing match, so the match is not stored for later use. For example, `Windows(?!95\|98\|NT\|2000)` can match `Windows` in `Windows3.1`, but cannot match `Windows` in `Windows2000`. An assertion does not consume characters: after a match, the next search starts immediately after the last match, not after the characters covered by the assertion. |
| `(?<=pattern)` | Positive lookbehind assertion. Similar to the positive lookahead assertion, but in the opposite direction. For example, `(?<=95\|98\|NT\|2000)Windows` can match `Windows` in `2000Windows`, but cannot match `Windows` in `3.1Windows`. |
| `(?<!pattern)` | Negative lookbehind assertion. Similar to the negative lookahead assertion, but in the opposite direction. For example, `(?<!95\|98\|NT\|2000)Windows` can match `Windows` in `3.1Windows`, but cannot match `Windows` in `2000Windows`. |
| `(?i)` | Marks rules as case-insensitive from within the pattern. Currently, Regex supports only global case-insensitivity, so specifying this option makes the entire expression case-insensitive. |
| `(?-i)` | Marks rules as case-sensitive from within the pattern. Regex is case-sensitive by default, so this option is accepted only for compilation compatibility and does not change sensitivity. |
| `+` | A standalone plus sign can be used as a literal character without the escaped form `\+`. |
| `*` | A standalone asterisk can be used as a literal character without the escaped form `\*`. |
| `-` | A standalone minus sign can be used as a literal character without the escaped form `\-`. |
| `]` | A standalone right square bracket can be used as a literal character without the escaped form `\]`. |
| `}` | A standalone right curly bracket can be used as a literal character without the escaped form `\}`. |
| `[[:alpha:]]` | Matches any uppercase or lowercase letter. |
| `[[:^alpha:]]` | Matches any character except uppercase and lowercase letters. |
| `[[:lower:]]` | Matches any lowercase letter. |
| `[[:^lower:]]` | Matches any character except lowercase letters. |
| `[[:upper:]]` | Matches any uppercase letter. |
| `[[:^upper:]]` | Matches any character except uppercase letters. |
| `[[:digit:]]` | Matches any single digit from 0 to 9. |
| `[[:^digit:]]` | Matches any character except a single digit from 0 to 9. |
| `[[:xdigit:]]` | Matches any hexadecimal letter or digit. |
| `[[:^xdigit:]]` | Matches any character except hexadecimal letters and digits. |
| `[[:alnum:]]` | Matches any digit or letter. |
| `[[:^alnum:]]` | Matches any character except digits and letters. |
| `[[:space:]]` | Matches any whitespace character, including the space and the tab. |
| `[[:^space:]]` | Matches any character except whitespace characters. |
| `[[:punct:]]` | Matches any punctuation mark. |
| `[[:^punct:]]` | Matches any character except punctuation marks. |
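
Several of the rules above (greedy and non-greedy quantifiers, bounded repetition, lookahead assertions, backreferences, and word boundaries) are easier to follow when run against the table's own sample strings. The sketch below uses Python's `re` module purely as a stand-in engine; it is not the Cangjie API, and `first_match` is a hypothetical helper introduced only for this illustration. Cangjie's Regex may differ in details, so treat this as an approximation of the documented behavior rather than a specification.

```python
import re

def first_match(pattern: str, text: str):
    """Return the first substring of text matching pattern, or None.

    Hypothetical helper for illustration; it wraps Python's re.search,
    not the Cangjie Regex class.
    """
    m = re.search(pattern, text)
    return m.group(0) if m else None

# Greedy vs. non-greedy: o+ takes every o, o+? takes only one.
greedy = first_match(r"o+", "oooo")            # "oooo"
lazy = first_match(r"o+?", "oooo")             # "o"

# Bounded repetition: {1,3} stops after three letters o.
bounded = first_match(r"o{1,3}", "fooooood")   # "ooo"

# Lookahead assertions match "Windows" without consuming the suffix.
ahead_hit = first_match(r"Windows(?=95|98|NT|2000)", "Windows2000")    # "Windows"
ahead_miss = first_match(r"Windows(?=95|98|NT|2000)", "Windows3.1")    # None
neg_hit = first_match(r"Windows(?!95|98|NT|2000)", "Windows3.1")       # "Windows"
neg_miss = first_match(r"Windows(?!95|98|NT|2000)", "Windows2000")     # None

# Backreference: (.)\1 finds two consecutive identical characters.
doubled = first_match(r"(.)\1", "food")        # "oo"

# Word boundary: er\b matches at the end of "never" but not inside "verb".
boundary_hit = first_match(r"er\b", "never")   # "er"
boundary_miss = first_match(r"er\b", "verb")   # None

print(greedy, lazy, bounded, ahead_hit, ahead_miss,
      neg_hit, neg_miss, doubled, boundary_hit, boundary_miss)
```

The lookbehind and POSIX class rows are left out on purpose: Python's `re` rejects variable-width lookbehind alternatives and does not support `[[:alpha:]]`-style classes, so it cannot stand in for those rules.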

Cangjie also has other special rules:

  1. Unquantifiable characters before ?, +, and * are ignored. Exception: in strings beginning with (*, |*, or *, the * is treated as an ordinary character.

  2. When *? is applied to a string that consists entirely of the character preceding *?, that character is not matched.

  3. The maximum number of capture groups in a regular expression is 63, and the maximum length of a compiled rule is 65535.

  4. Patterns of the form ((pattern1){m1,n1}pattern2){m2,n2} are not currently supported. That is, patterns in which all of the following hold:

    • Group definition 1 is modified by {m1,n1}
    • Group definition 1 is wrapped by group definition 2
    • Group definition 2 is modified by {m2,n2}

API List

Class

| Name | Description |
| --- | --- |
| Matcher | Specifies a regular expression matcher, used to scan an input sequence for matches. |
| MatchData | Stores regular expression matching results and provides functions for querying them. |
| Regex | Specifies the compilation type and input sequence. |
| RegexOption | Specifies the regular expression matching pattern. |

Struct

| Name | Description |
| --- | --- |
| Position | Stores position information, indicating a range with a closed starting point and an open endpoint. |
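
A range with a closed starting point and an open endpoint means the start index is included and the end index is excluded. The sketch below illustrates that convention with Python's `re` match spans, which follow the same half-open rule. It is only an analogy: the field names of Cangjie's Position are not given in this section, so the variables `start` and `end` here are assumptions made for the example.

```python
import re

text = "never verb"

# Python match spans are half-open: [start, end).
m = re.search(r"er\b", text)
start, end = m.span()          # (3, 5)

# Slicing with the same bounds yields exactly the matched text,
# and the match length is simply end - start.
matched = text[start:end]      # "er"
length = end - start           # 2

print(start, end, matched, length)
```

One practical benefit of the half-open convention is that adjacent ranges share a boundary value without overlapping, so an empty match is simply a range whose start equals its end.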

Exception Class

| Name | Description |
| --- | --- |
| RegexException | Provides regex-related exception processing. |