Note
The freeze() method not only makes the set immutable, it also makes important methods much faster: contains(c), containsNone(...), span(...), spanBack(...), etc. After the object is frozen, any subsequent call that tries to change it throws UnsupportedOperationException.
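The freeze contract can be pictured with a minimal sketch. The class `FreezableSketch` and its fields are hypothetical and are not part of UnicodeSet; the sketch only assumes that every mutator checks a frozen flag before touching the data.

```java
import java.util.BitSet;

// Hypothetical stand-in for a freezable set; this is not the real UnicodeSet.
final class FreezableSketch {
    private final BitSet bits = new BitSet();
    private boolean frozen;

    // Every mutator checks the frozen flag before changing anything.
    FreezableSketch add(int c) {
        checkFrozen();
        bits.set(c);
        return this; // returned for chaining
    }

    boolean contains(int c) {
        return bits.get(c);
    }

    FreezableSketch freeze() {
        frozen = true;
        return this;
    }

    boolean isFrozen() {
        return frozen;
    }

    private void checkFrozen() {
        if (frozen) {
            throw new UnsupportedOperationException("Attempt to modify frozen object");
        }
    }

    public static void main(String[] args) {
        FreezableSketch set = new FreezableSketch().add('a').freeze();
        System.out.println(set.contains('a')); // true
        try {
            set.add('b');
        } catch (UnsupportedOperationException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Read-only calls such as `contains` stay usable after freezing; only mutators are rejected.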
The UnicodeSet class is not designed to be subclassed.
UnicodeSet supports two APIs. The first is the operand API that allows the caller to modify the value of a UnicodeSet object. It conforms to Java 2's java.util.Set interface, although UnicodeSet does not actually implement that interface. All methods of Set are supported, with the modification that they take a character range or single character instead of an Object, and they take a UnicodeSet instead of a Collection. The operand API may be thought of in terms of boolean logic: a boolean OR is implemented by add, a boolean AND is implemented by retain, a boolean XOR is implemented by complement taking an argument, and a boolean NOT is implemented by complement with no argument. In terms of traditional set theory function names, add is a union, retain is an intersection, remove is an asymmetric difference, and complement with no argument is a set complement with respect to the superset range MIN_VALUE-MAX_VALUE.
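The boolean-logic reading of the operand API can be tried out with java.util.BitSet as a stand-in. This is only an analogy: the class `OperandLogicSketch`, its method names, and the 16-element universe are assumptions made for the illustration, and nothing here calls UnicodeSet itself.

```java
import java.util.BitSet;

// Analogy only: a BitSet over a tiny universe 0..15 stands in for a set of
// code points. The real class works over MIN_VALUE..MAX_VALUE.
final class OperandLogicSketch {
    static final int UNIVERSE = 16; // assumed small universe for the demo

    static BitSet of(int... values) {
        BitSet bits = new BitSet(UNIVERSE);
        for (int v : values) {
            bits.set(v);
        }
        return bits;
    }

    // add: boolean OR, set union
    static BitSet add(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.or(b);
        return r;
    }

    // retain: boolean AND, set intersection
    static BitSet retain(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.and(b);
        return r;
    }

    // remove: asymmetric difference
    static BitSet remove(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.andNot(b);
        return r;
    }

    // complement with an argument: boolean XOR
    static BitSet complement(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.xor(b);
        return r;
    }

    // complement with no argument: boolean NOT over the whole universe
    static BitSet complement(BitSet a) {
        BitSet r = (BitSet) a.clone();
        r.flip(0, UNIVERSE);
        return r;
    }

    public static void main(String[] args) {
        BitSet a = of(1, 2, 3);
        BitSet b = of(3, 4);
        System.out.println(add(a, b));                   // {1, 2, 3, 4}
        System.out.println(retain(a, b));                // {3}
        System.out.println(remove(a, b));                // {1, 2}
        System.out.println(complement(a, b));            // {1, 2, 4}
        System.out.println(complement(a).cardinality()); // 13
    }
}
```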
The second API is the applyPattern()/toPattern() API from the java.text.Format-derived classes. Unlike the methods that add characters, add categories, and control the logic of the set, the method applyPattern() sets all attributes of a UnicodeSet at once, based on a string pattern.
Pattern syntax
Patterns are accepted by the constructors and the applyPattern() methods and returned by the toPattern() method. These patterns follow a syntax similar to that employed by version 8 regular expression character classes. Here are some simple examples:
[] | No characters |
[a] | The character 'a' |
[ae] | The characters 'a' and 'e' |
[a-e] | The characters 'a' through 'e' inclusive, in Unicode code point order |
[\\u4E01] | The character U+4E01 |
[a{ab}{ac}] | The character 'a' and the multicharacter strings "ab" and "ac" |
[\p{Lu}] | All characters in the general category Uppercase Letter |
Any character may be preceded by a backslash in order to remove any special meaning. White space characters, as defined by the Unicode Pattern_White_Space property, are ignored, unless they are escaped.
Property patterns specify a set of characters having a certain property as defined by the Unicode standard. Both the POSIX-like "[:Lu:]" and the Perl-like syntax "\p{Lu}" are recognized. For a complete list of supported property patterns, see the User's Guide for UnicodeSet at https://unicode-org.github.io/icu/userguide/strings/unicodeset. Actual determination of property data is defined by the underlying Unicode database as implemented by UCharacter.
Patterns specify individual characters, ranges of characters, and Unicode property sets. When elements are concatenated, they specify their union. To complement a set, place a '^' immediately after the opening '['. Property patterns are inverted by modifying their delimiters: "[:^foo:]" and "\P{foo}". In any other location, '^' has no special meaning.
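Because the pattern syntax resembles regular expression character classes, the basic forms can be explored with java.util.regex as an analogy. The sketch below does not use UnicodeSet; the class `CharClassAnalogy` and its `matches` helper are hypothetical. Java regular expressions also differ from the syntax described here: they do not accept the POSIX-like "[:Lu:]" form or multicharacter strings such as "{ab}", and they do not ignore white space by default.

```java
import java.util.regex.Pattern;

// Analogy only: java.util.regex character classes, not UnicodeSet patterns.
final class CharClassAnalogy {
    // True if the single code point matches the given character class.
    static boolean matches(String charClass, int codePoint) {
        String input = new String(Character.toChars(codePoint));
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[a-e]", 'c'));       // true: inside the range
        System.out.println(matches("[^a-e]", 'c'));      // false: '^' complements the class
        System.out.println(matches("\\p{Lu}", 'A'));     // true: uppercase letter
        System.out.println(matches("\\P{Lu}", 'A'));     // false: inverted property
        System.out.println(matches("[\\u4E01]", 0x4E01)); // true: escaped code point
    }
}
```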
Since ICU 70, "[^...]", "[:^foo:]", "\P{foo}", and "[:binaryProperty=No:]" perform a "code point complement" (all code points minus the original set), removing all multicharacter strings, equivalent to .complement().removeAllStrings(). The complement() API function continues to perform a symmetric difference with all code points and thus retains all multicharacter strings.
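The difference between the two complements can be modeled with a toy data structure. In this sketch the class `ComplementSketch`, its public fields, and the ASCII-only universe are assumptions made for illustration; it models the behavior described above rather than reproducing the real implementation.

```java
import java.util.BitSet;
import java.util.TreeSet;

// Toy model: code points in a BitSet (ASCII only, an assumption for the demo)
// plus a separate collection of multicharacter strings.
final class ComplementSketch {
    static final int UNIVERSE = 128;
    final BitSet codePoints = new BitSet(UNIVERSE);
    final TreeSet<String> strings = new TreeSet<>();

    // Models the complement() API call: flips code points, keeps strings.
    ComplementSketch complement() {
        codePoints.flip(0, UNIVERSE);
        return this;
    }

    ComplementSketch removeAllStrings() {
        strings.clear();
        return this;
    }

    // A set holding the code point 'a' and the string "ab".
    static ComplementSketch sample() {
        ComplementSketch s = new ComplementSketch();
        s.codePoints.set('a');
        s.strings.add("ab");
        return s;
    }

    public static void main(String[] args) {
        ComplementSketch api = sample().complement();
        System.out.println(api.codePoints.get('a')); // false
        System.out.println(api.strings);             // [ab]

        // Models the pattern form "[^...]": complement, then drop the strings.
        ComplementSketch pattern = sample().complement().removeAllStrings();
        System.out.println(pattern.strings);         // []
    }
}
```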
Ranges are indicated by placing a '-' between two characters, as in "a-z". This specifies the range of all characters from the left to the right, in Unicode order. If the left character is greater than or equal to the right character it is a syntax error. If a '-' occurs as the first character after the opening '[' or '[^', or if it occurs as the last character before the closing ']', then it is taken as a literal. Thus "[a\\-b]", "[-ab]", and "[ab-]" all indicate the same set of three characters, 'a', 'b', and '-'.
Sets may be intersected using the '&' operator, or the asymmetric set difference may be taken using the '-' operator. For example, "[[:L:]&[\\u0000-\\u0FFF]]" indicates the set of all Unicode letters with values less than 4096. Operators ('&' and '-') have equal precedence and bind left-to-right. Thus "[[:L:]-[a-z]-[\\u0100-\\u01FF]]" is equivalent to "[[[:L:]-[a-z]]-[\\u0100-\\u01FF]]". This only really matters for difference; intersection is commutative.
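Why left-to-right binding matters for difference can be checked with plain java.util sets. The class `DifferenceOrderSketch` and the small integer sets are made up for this illustration; they stand in for the bracketed sub-patterns.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Plain integer sets stand in for the sub-patterns A, B and C.
final class DifferenceOrderSketch {
    static Set<Integer> setOf(Integer... values) {
        return new TreeSet<>(Arrays.asList(values));
    }

    // Asymmetric difference a - b, leaving both inputs untouched.
    static Set<Integer> minus(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a);
        r.removeAll(b);
        return r;
    }

    public static void main(String[] args) {
        Set<Integer> a = setOf(1, 2, 3, 4);
        Set<Integer> b = setOf(2, 3);
        Set<Integer> c = setOf(3, 4);

        // Left-to-right grouping, as the pattern syntax specifies: (A - B) - C
        System.out.println(minus(minus(a, b), c)); // [1]

        // The other grouping gives a different set: A - (B - C)
        System.out.println(minus(a, minus(b, c))); // [1, 3, 4]
    }
}
```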
[a] | The set containing 'a' |
[a-z] | The set containing 'a' through 'z' and all letters in between, in Unicode order |
[^a-z] | The set containing all characters but 'a' through 'z', that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF |
[[pat1][pat2]] | The union of sets specified by pat1 and pat2 |
[[pat1]&[pat2]] | The intersection of sets specified by pat1 and pat2 |
[[pat1]-[pat2]] | The asymmetric difference of sets specified by pat1 and pat2 |
[:Lu:] or \p{Lu} | The set of characters having the specified Unicode property; in this case, Unicode uppercase letters |
[:^Lu:] or \P{Lu} | The set of characters not having the given Unicode property |
Formal syntax
pattern := ('[' '^'? item* ']') | property
item := char | (char '-' char) | pattern-expr
pattern-expr := pattern | pattern-expr pattern | pattern-expr op pattern
op := '&' | '-'
special := '[' | ']' | '-'
char := any character that is not special | ('\\' any character) | ('\u' hex hex hex hex)
hex := '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f'
property := a Unicode property set pattern
Legend:
a := b means a may be replaced by b
a? means zero or one instance of a
a* means zero or more instances of a
a | b means either a or b
'a' means the literal string between the quotes
To iterate over contents of UnicodeSet, the following are available:
ranges() to iterate through the ranges
strings() to iterate through the strings
iterator() to iterate through the entire contents in a single loop. That method is, however, not particularly efficient, since it "boxes" each code point into a String.
UnicodeSetIterator can also be used, but not in for loops.
To replace, count elements, or delete spans, see UnicodeSetSpanner.
Modifier and Type | Class and Description |
---|---|
private static interface | |
public static enum | UnicodeSet. Argument values for whether span() and similar functions continue while the current character is contained vs. not contained in the set. |
private static class |
Modifier and Type | Field and Description |
---|---|
private volatile BMPSet | |
private int[] | |
private static final int | |
private static final int | |
private static UnicodeSet | |
private int | |
private int[] | |
private static final int | |
public static final int | MAX_VALUE Maximum value that can be stored in a UnicodeSet. |
public static final int | MIN_VALUE Minimum value that can be stored in a UnicodeSet. |
private static final VersionInfo | |
private int[] | |
private static final int | START_EXTRA The pattern representation of this set. |
pack-priv TreeSet | |
private volatile UnicodeSetStringSpan |
Access | Constructor and Description |
---|---|
private | |
private | |
public | UnicodeSet(int start, int end) Constructs a set containing the given range. Parameters: start, first character, inclusive, of range; end, last character, inclusive, of range. |
public | UnicodeSet(String pattern) Constructs a set from the given pattern. Parameters: pattern, a string specifying what characters are in the set. |
Modifier and Type | Method and Description |
---|---|
public final UnicodeSet | |
public final UnicodeSet | add(CharSequence s) Adds the specified multicharacter to this set if it is not already present. Parameters: s, the source string. Returns: this object, for chaining. |
private UnicodeSet | |
private UnicodeSet | |
private final UnicodeSet | |
private UnicodeSet | applyFilter(UnicodeSet. Generic filter-based scanning code for UCD property UnicodeSets. |
private UnicodeSet | applyPattern(String pattern, ParsePosition pos) Parses the given pattern, starting at the given position. Parameters: pattern, the string containing the pattern to be parsed. The portion of the string from pos.getIndex(), which must be a '[', to the corresponding closing ']', is parsed; pos, upon entry, the position at which to begin parsing. The character at pattern.charAt(pos.getIndex()) must be a '['. Upon return from a successful parse, pos.getIndex() is either the character after the closing ']' of the parsed pattern, or pattern.length() if the closing ']' is the last character of the pattern string. Returns: an inversion list for the parsed substring of pattern. |
private void | |
public UnicodeSet | |
public UnicodeSet | cloneAsThawed() Clone a thawed version of this class, according to the Freezable interface. Returns: the clone, not frozen. |
public UnicodeSet | complement(int start, int end) Complements the specified range in this set. Parameters: start, first character, inclusive, of range; end, last character, inclusive, of range. |
public boolean | contains(int c) Returns true if this set contains the given character. Parameters: c, character to be checked for containment. Returns: true if the test condition is met. |
private void | |
private void | |
private final int | findCodePoint(int c) Returns the smallest value i such that c < list[i]. Parameters: c, a character in the range MIN_VALUE..MAX_VALUE inclusive. Returns: the smallest integer i in the range 0..len-1, inclusive, such that c < list[i]. |
public UnicodeSet | |
private static synchronized UnicodeSet | |
public int | |
public int | getRangeEnd(int index) Iteration method that returns the last character in the specified range of this set. |
public int | getRangeStart(int index) Iteration method that returns the first character in the specified range of this set. |
private static int | getSingleCP(CharSequence s) Utility for getting code point from single code point CharSequence. Parameters: s, to test. Returns: a code point if the string consists of a single one, otherwise -1. |
public boolean | |
private static final int | |
private int[] | |
private UnicodeSet | |
public UnicodeSet | retainAll(UnicodeSet c) Retains only the elements in this set that are contained in the specified set. Parameters: c, set that defines which elements this set will retain. |
public UnicodeSet | set(UnicodeSet other) Make this object represent the same set as. Parameters: other, a UnicodeSet whose value will be copied to this object. |
public int | size() Returns the number of elements in this set (its cardinality). Note that the elements of a set may include both individual codepoints and strings. Returns: the number of elements in this set (its cardinality). |
public int | span(CharSequence s, UnicodeSet. Span a string using this UnicodeSet. Parameters: s, the string to be spanned; spanCondition, the span condition. Returns: the length of the span. |
public int | span(CharSequence s, int start, UnicodeSet. Span a string using this UnicodeSet. Parameters: s, the string to be spanned; start, the start index that the span begins; spanCondition, the span condition. Returns: the string index which ends the span (i.e. exclusive). |
public int | spanAndCount(CharSequence s, int start, UnicodeSet. Same as span() but also counts the smallest number of set elements on any path across the span. Parameters: outCount, an output-only object (must not be null) for returning the count. Returns: the limit (exclusive end) of the span. |
public int | spanBack(CharSequence s, int fromIndex, UnicodeSet. Span a string backwards (from the fromIndex) using this UnicodeSet. Parameters: s, the string to be spanned; fromIndex, the index of the char (exclusive) that the string should be spanned backwards; spanCondition, the span condition. Returns: the string index which starts the span (i.e. inclusive). |
private int | spanCodePointsAndCount(CharSequence s, int start, UnicodeSet. |
private UnicodeSet |
bmpSet | back to summary |
---|---|
private volatile BMPSet bmpSet |
buffer | back to summary |
---|---|
private int[] buffer |
GROW_EXTRA | back to summary |
---|---|
private static final int GROW_EXTRA |
HIGH | back to summary |
---|---|
private static final int HIGH |
INCLUSION | back to summary |
---|---|
private static UnicodeSet INCLUSION |
len | back to summary |
---|---|
private int len |
list | back to summary |
---|---|
private int[] list |
LOW | back to summary |
---|---|
private static final int LOW |
MAX_VALUE | back to summary |
---|---|
public static final int MAX_VALUE Maximum value that can be stored in a UnicodeSet.
|
MIN_VALUE | back to summary |
---|---|
public static final int MIN_VALUE Minimum value that can be stored in a UnicodeSet.
|
NO_VERSION | back to summary |
---|---|
private static final VersionInfo NO_VERSION |
rangeList | back to summary |
---|---|
private int[] rangeList |
START_EXTRA | back to summary |
---|---|
private static final int START_EXTRA The pattern representation of this set. This may not be the most economical pattern. It is the pattern supplied to applyPattern(), with variables substituted and whitespace removed. For sets constructed without applyPattern(), or modified using the non-pattern API, this string will be null, indicating that toPattern() must generate a pattern representation from the inversion list. |
strings | back to summary |
---|---|
pack-priv TreeSet<String> strings |
stringSpan | back to summary |
---|---|
private volatile UnicodeSetStringSpan stringSpan |
UnicodeSet | back to summary |
---|---|
private UnicodeSet() Constructs an empty set.
|
UnicodeSet | back to summary |
---|---|
private UnicodeSet(UnicodeSet other) Constructs a copy of an existing set.
|
UnicodeSet | back to summary |
---|---|
public UnicodeSet(int start, int end) Constructs a set containing the given range. If
|
UnicodeSet | back to summary |
---|---|
public UnicodeSet(String pattern) Constructs a set from the given pattern. See the class description for the syntax of the pattern language. Whitespace is ignored.
|
add | back to summary |
---|---|
public final UnicodeSet add(int c) Adds the specified character to this set if it is not already present. If this set already contains the specified character, the call leaves this set unchanged.
|
add | back to summary |
---|---|
public final UnicodeSet add(CharSequence s) Adds the specified multicharacter to this set if it is not already
present. If this set already contains the multicharacter,
the call leaves this set unchanged.
Thus
|
add | back to summary |
---|---|
private UnicodeSet add(int[] other, int otherLen, int polarity) |
add_unchecked | back to summary |
---|---|
private UnicodeSet add_unchecked(int start, int end) |
add_unchecked | back to summary |
---|---|
private final UnicodeSet add_unchecked(int c) |
applyFilter | back to summary |
---|---|
private UnicodeSet applyFilter(UnicodeSet. Generic filter-based scanning code for UCD property UnicodeSets. |
applyPattern | back to summary |
---|---|
private UnicodeSet applyPattern(String pattern, ParsePosition pos) Parses the given pattern, starting at the given position. The character at pattern.charAt(pos.getIndex()) must be '[', or the parse fails. Parsing continues until the corresponding closing ']'. If a syntax error is encountered between the opening and closing brace, the parse fails. Upon return from a successful parse, the ParsePosition is updated to point to the character following the closing ']', and an inversion list for the parsed pattern is returned. This method calls itself recursively to parse embedded subpatterns.
|
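The ParsePosition contract described for applyPattern can be illustrated with a much smaller recursive parser. This is a sketch under heavy assumptions, not the real implementation: the class `PatternParseSketch` is hypothetical, input is limited to ASCII, the result is a java.util.BitSet instead of an inversion list, and escapes, properties, multicharacter strings, white space skipping, and the '&' and '-' set operators are all left out.

```java
import java.text.ParsePosition;
import java.util.BitSet;

// Simplified recursive parser for '[' '^'? (char | char '-' char | pattern)* ']'.
final class PatternParseSketch {
    static final int UNIVERSE = 128; // assumption: ASCII-only patterns

    static BitSet parse(String pattern, ParsePosition pos) {
        int i = pos.getIndex();
        if (i >= pattern.length() || pattern.charAt(i) != '[') {
            throw new IllegalArgumentException("Expected '[' at index " + i);
        }
        i++;
        boolean invert = false;
        if (i < pattern.length() && pattern.charAt(i) == '^') {
            invert = true;
            i++;
        }
        BitSet result = new BitSet(UNIVERSE);
        while (i < pattern.length()) {
            char c = pattern.charAt(i);
            if (c == ']') {
                if (invert) {
                    result.flip(0, UNIVERSE);
                }
                pos.setIndex(i + 1); // point just past the closing ']'
                return result;
            }
            if (c == '[') {
                // Embedded subpattern: recurse and take the union.
                ParsePosition inner = new ParsePosition(i);
                result.or(parse(pattern, inner));
                i = inner.getIndex();
            } else if (i + 2 < pattern.length()
                    && pattern.charAt(i + 1) == '-'
                    && pattern.charAt(i + 2) != ']') {
                char end = pattern.charAt(i + 2);
                if (c >= end) {
                    throw new IllegalArgumentException("Invalid range at index " + i);
                }
                result.set(c, end + 1); // inclusive range c..end
                i += 3;
            } else {
                result.set(c); // single character, including a literal '-'
                i++;
            }
        }
        throw new IllegalArgumentException("Missing closing ']'");
    }

    public static void main(String[] args) {
        ParsePosition pos = new ParsePosition(0);
        BitSet set = parse("[a-e]x", pos);
        System.out.println(set.cardinality()); // 5
        System.out.println(pos.getIndex());    // 5
    }
}
```

On a parse failure this sketch throws and leaves the ParsePosition untouched; how the real method reports errors is not specified by the text above.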
checkFrozen | back to summary |
---|---|
private void checkFrozen() |
clear | back to summary |
---|---|
public UnicodeSet clear() Removes all of the elements from this set. This set will be empty after this call returns.
|
cloneAsThawed | back to summary |
---|---|
public UnicodeSet cloneAsThawed() Clone a thawed version of this class, according to the Freezable interface.
|
complement | back to summary |
---|---|
public UnicodeSet complement(int start, int end) Complements the specified range in this set. Any character in
the range will be removed if it is in this set, or will be
added if it is not in this set. If
|
contains | back to summary |
---|---|
public boolean contains(int c) Returns true if this set contains the given character.
|
ensureBufferCapacity | back to summary |
---|---|
private void ensureBufferCapacity(int newLen) |
ensureCapacity | back to summary |
---|---|
private void ensureCapacity(int newLen) |
findCodePoint | back to summary |
---|---|
private final int findCodePoint(int c) Returns the smallest value i such that c < list[i]. Caller must ensure that c is a legal value or this method will enter an infinite loop. This method performs a binary search.
|
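The binary search described for findCodePoint can be sketched over an inversion list. The layout used here is an assumption for illustration: a sorted int array whose even indices start a range (inclusive), whose odd indices end it (exclusive), and whose last entry is a terminator `HIGH` one past U+10FFFF. The class `InversionListSketch` and the value of `HIGH` are not taken from the text above.

```java
// Sketch of a binary search over an inversion list; layout is an assumption.
final class InversionListSketch {
    static final int HIGH = 0x110000; // assumed terminator, one past U+10FFFF

    // Returns the smallest i in 0..len-1 such that c < list[i].
    // Requires list[len-1] == HIGH and 0 <= c < HIGH so that such an i exists.
    static int findCodePoint(int[] list, int len, int c) {
        int lo = 0;
        int hi = len - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (c < list[mid]) {
                hi = mid;     // list[mid] is a candidate, keep it
            } else {
                lo = mid + 1; // everything up to mid is <= c
            }
        }
        return lo;
    }

    // An odd index means c lies inside a range, between a start and its limit.
    static boolean contains(int[] list, int len, int c) {
        return (findCodePoint(list, len, c) & 1) != 0;
    }

    public static void main(String[] args) {
        // Represents [a-ex-z]: ranges 'a'..'e' and 'x'..'z'.
        int[] list = {'a', 'f', 'x', 'z' + 1, HIGH};
        System.out.println(contains(list, list.length, 'c')); // true
        System.out.println(contains(list, list.length, 'f')); // false
        System.out.println(contains(list, list.length, 'z')); // true
    }
}
```

This loop shrinks the search interval on every step, so it terminates even for an out-of-range c, although the answer is only meaningful for legal values.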
freeze | back to summary |
---|---|
public UnicodeSet freeze() Freeze this class, according to the Freezable interface.
|
getInclusions | back to summary |
---|---|
private static synchronized UnicodeSet getInclusions(int src) |
getRangeCount | back to summary |
---|---|
public int getRangeCount() Iteration method that returns the number of ranges contained in this set.
|
getRangeEnd | back to summary |
---|---|
public int getRangeEnd(int index) Iteration method that returns the last character in the specified range of this set.
|
getRangeStart | back to summary |
---|---|
public int getRangeStart(int index) Iteration method that returns the first character in the specified range of this set.
|
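Index-based iteration with getRangeCount(), getRangeStart() and getRangeEnd() can be sketched over an inversion list. The array layout (start inclusive at even indices, limit exclusive at odd indices, final terminator `HIGH`) and the class `RangeIterationSketch` are assumptions for illustration; the helper names merely mirror the documented methods.

```java
// Sketch of range iteration over an assumed inversion-list layout.
final class RangeIterationSketch {
    static final int HIGH = 0x110000; // assumed terminator entry

    // Each range uses two entries; the terminator is dropped by integer division.
    static int getRangeCount(int[] list, int len) {
        return len / 2;
    }

    static int getRangeStart(int[] list, int index) {
        return list[index * 2];
    }

    // The stored limit is exclusive, so the last character is one less.
    static int getRangeEnd(int[] list, int index) {
        return list[index * 2 + 1] - 1;
    }

    // Number of code points in all ranges (strings are not modeled here).
    static int countCodePoints(int[] list, int len) {
        int total = 0;
        for (int i = 0; i < getRangeCount(list, len); i++) {
            total += getRangeEnd(list, i) - getRangeStart(list, i) + 1;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] list = {'a', 'f', 'x', 'z' + 1, HIGH}; // [a-ex-z]
        for (int i = 0; i < getRangeCount(list, list.length); i++) {
            System.out.println(String.format("U+%04X..U+%04X",
                    getRangeStart(list, i), getRangeEnd(list, i)));
        }
        System.out.println(countCodePoints(list, list.length)); // 8
    }
}
```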
getSingleCP | back to summary |
---|---|
private static int getSingleCP(CharSequence s) Utility for getting code point from single code point CharSequence. See the public UTF16.getSingleCodePoint() (which returns -1 for null rather than throwing NPE).
|
isFrozen | back to summary |
---|---|
public boolean isFrozen() Is this frozen, according to the Freezable interface?
|
max | back to summary |
---|---|
private static final int max(int a, int b) |
range | back to summary |
---|---|
private int[] range(int start, int end) Assumes start <= end. |
retain | back to summary |
---|---|
private UnicodeSet retain(int[] other, int otherLen, int polarity) |
retainAll | back to summary |
---|---|
public UnicodeSet retainAll(UnicodeSet c) Retains only the elements in this set that are contained in the specified set. In other words, removes from this set all of its elements that are not contained in the specified set. This operation effectively modifies this set so that its value is the intersection of the two sets.
|
set | back to summary |
---|---|
public UnicodeSet set(UnicodeSet other) Make this object represent the same set as
|
size | back to summary |
---|---|
public int size() Returns the number of elements in this set (its cardinality). Note that the elements of a set may include both individual codepoints and strings.
|
span | back to summary |
---|---|
public int span(CharSequence s, UnicodeSet. Span a string using this UnicodeSet. To replace, count elements, or delete spans, see
|
span | back to summary |
---|---|
public int span(CharSequence s, int start, UnicodeSet. Span a string using this UnicodeSet. If the start index is less than 0, span will start from 0. If the start index is greater than the string length, span returns the string length. To replace, count elements, or delete spans, see
|
spanAndCount | back to summary |
---|---|
public int spanAndCount(CharSequence s, int start, UnicodeSet. Same as span() but also counts the smallest number of set elements on any path across the span. To replace, count elements, or delete spans, see
|
spanBack | back to summary |
---|---|
public int spanBack(CharSequence s, int fromIndex, UnicodeSet. Span a string backwards (from the fromIndex) using this UnicodeSet. If the fromIndex is less than 0, spanBack will return 0. If fromIndex is greater than the string length, spanBack will start from the string length. To replace, count elements, or delete spans, see
|
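For a set that holds only single code points, the index handling of span() and spanBack() can be sketched as follows. The class `SpanSketch` is hypothetical; the set is passed as an IntPredicate and the span condition is reduced to a boolean (`true` for contained, `false` for not contained), both simplifications made for this illustration.

```java
import java.util.function.IntPredicate;

// Code-point-only sketch of forward and backward spans.
final class SpanSketch {
    // Returns the exclusive end index of the span that begins at start.
    static int span(CharSequence s, int start, IntPredicate set, boolean contained) {
        int length = s.length();
        if (start < 0) {
            start = 0;          // negative start: begin at 0
        } else if (start >= length) {
            return length;      // start past the end: return the length
        }
        int i = start;
        while (i < length) {
            int cp = Character.codePointAt(s, i);
            if (set.test(cp) != contained) {
                break;
            }
            i += Character.charCount(cp);
        }
        return i;
    }

    // Returns the inclusive start index of the span that ends before fromIndex.
    static int spanBack(CharSequence s, int fromIndex, IntPredicate set, boolean contained) {
        if (fromIndex <= 0) {
            return 0;                 // nothing to span backwards
        }
        if (fromIndex > s.length()) {
            fromIndex = s.length();   // clamp to the string length
        }
        int i = fromIndex;
        while (i > 0) {
            int cp = Character.codePointBefore(s, i);
            if (set.test(cp) != contained) {
                break;
            }
            i -= Character.charCount(cp);
        }
        return i;
    }

    public static void main(String[] args) {
        IntPredicate lower = cp -> cp >= 'a' && cp <= 'z';
        System.out.println(span("abc123", 0, lower, true));      // 3
        System.out.println(span("123abc", 0, lower, false));     // 3
        System.out.println(spanBack("123abc", 6, lower, true));  // 3
    }
}
```

Character.codePointAt and Character.codePointBefore return an unpaired surrogate as its own code point, so the loops always advance by whole code points.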
spanCodePointsAndCount | back to summary |
---|---|
private int spanCodePointsAndCount(CharSequence s, int start, UnicodeSet. |
xor | back to summary |
---|---|
private UnicodeSet xor(int[] other, int otherLen, int polarity) |
Modifier and Type | Method and Description |
---|---|
public boolean |
contains | back to summary |
---|---|
public boolean contains(int codePoint) |
The functionality is straightforward for sets with only single code points, without strings (which is the common case):
Note
Unpaired surrogates are treated like surrogate code points. Similarly, set strings match only on code point boundaries, never in the middle of a surrogate pair.
Modifier and Type | Field and Description |
---|---|
public static final UnicodeSet. | CONTAINED Spans the longest substring that is a concatenation of set elements (characters or strings). |
public static final UnicodeSet. | NOT_CONTAINED Continues a span() while there is no set element at the current position. |
public static final UnicodeSet. | SIMPLE Continues a span() while there is a set element at the current position. |
Access | Constructor and Description |
---|---|
private |
Modifier and Type | Method and Description |
---|---|
public static UnicodeSet. | |
public static UnicodeSet. |
CONTAINED | back to summary |
---|---|
public static final UnicodeSet. Spans the longest substring that is a concatenation of set elements (characters or strings). (For characters only, this is like while contains(current)==true). When span() returns, the substring between where it started and the position it returned consists only of set elements (characters or strings) that are in the set.
If a set contains strings, then the span will be the longest substring for which there
exists at least one non-overlapping concatenation of set elements (characters or strings).
This is equivalent to a POSIX regular expression for
|
NOT_CONTAINED | back to summary |
---|---|
public static final UnicodeSet. Continues a span() while there is no set element at the current position. Increments by one code point at a time. Stops before the first set element (character or string). (For code points only, this is like while contains(current)==false). When span() returns, the substring between where it started and the position it returned consists only of characters that are not in the set, and none of its strings overlap with the span.
|
SIMPLE | back to summary |
---|---|
public static final UnicodeSet. Continues a span() while there is a set element at the current position. Increments by the longest matching element at each position. (For characters only, this is like while contains(current)==true). When span() returns, the substring between where it started and the position it returned consists only of set elements (characters or strings) that are in the set. If a set only contains single characters, then this is the same as CONTAINED. If a set contains strings, then the span will be the longest substring with a match at each position with the longest single set element (character or string). Use this span condition together with other longest-match algorithms, such as ICU converters (ucnv_getUnicodeSet()).
|
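The difference between CONTAINED and SIMPLE only shows up when a set contains strings. The sketch below models both strategies on plain java.util collections. It is an illustration under assumptions, not the real algorithm: the class `SpanConditionSketch` is hypothetical, single characters are modeled as one-character strings, and matching is done on UTF-16 units, so elements are assumed not to split surrogate pairs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Models two span strategies over a set of string elements.
final class SpanConditionSketch {
    // SIMPLE-like: at each position take the longest matching element.
    static int spanSimple(String s, Set<String> elements) {
        int pos = 0;
        while (pos < s.length()) {
            int best = 0;
            for (String e : elements) {
                if (e.length() > best && s.startsWith(e, pos)) {
                    best = e.length();
                }
            }
            if (best == 0) {
                break; // no element matches here
            }
            pos += best;
        }
        return pos;
    }

    // CONTAINED-like: farthest index reachable by any concatenation of elements.
    static int spanContained(String s, Set<String> elements) {
        boolean[] reachable = new boolean[s.length() + 1];
        reachable[0] = true;
        int farthest = 0;
        for (int pos = 0; pos < s.length(); pos++) {
            if (!reachable[pos]) {
                continue;
            }
            for (String e : elements) {
                if (!e.isEmpty() && s.startsWith(e, pos)) {
                    reachable[pos + e.length()] = true;
                    farthest = Math.max(farthest, pos + e.length());
                }
            }
        }
        return farthest;
    }

    public static void main(String[] args) {
        Set<String> elements = new HashSet<>(Arrays.asList("a", "ab", "bc"));
        // SIMPLE takes "ab" first and then cannot match "c".
        System.out.println(spanSimple("abc", elements));    // 2
        // CONTAINED finds the concatenation "a" + "bc".
        System.out.println(spanContained("abc", elements)); // 3
    }
}
```

With only single-character elements the two methods return the same index, which matches the statement that SIMPLE is then the same as CONTAINED.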
SpanCondition | back to summary |
---|---|
private SpanCondition() |
valueOf | back to summary |
---|---|
public static UnicodeSet. |
values | back to summary |
---|---|
public static UnicodeSet. |
Modifier and Type | Field and Description |
---|---|
pack-priv VersionInfo |
Access | Constructor and Description |
---|---|
pack-priv |
Modifier and Type | Method and Description |
---|---|
public boolean |
version | back to summary |
---|---|
pack-priv VersionInfo version |
VersionFilter | back to summary |
---|---|
pack-priv VersionFilter(VersionInfo version) |
contains | back to summary |
---|---|
public boolean contains(int ch) Implements jdk. |