Lines Matching full:unicode
1 // Copyright (C) 2016 and later: Unicode, Inc. and others.
2 // License & terms of use: http://www.unicode.org/copyright.html
18 * 8/19/1999 srl Upgraded scripts to Unicode 3.0
28 #include "unicode/utypes.h"
33 /* Unicode version number */
36 * Unicode version number, default for the current ICU version.
37 * The actual Unicode Character Database (UCD) data is stored in uprops.dat
38 * and may be generated from UCD files from a different Unicode version.
39 * Call u_getUnicodeVersion to get the actual Unicode version of the data.
48 * \brief C API: Unicode Properties
50 * This C API provides low-level access to the Unicode Character Database.
54 * Unicode assigns each code point (not just assigned character) values for
60 * "About the Unicode Character Database" (http://www.unicode.org/ucd/)
72 * Instead, Unicode properties should be used directly.
84 * Annex C: Compatibility Properties of UTS #18 Unicode Regular Expressions
85 * (http://www.unicode.org/reports/tr18/#Compatibility_Properties).
112 * - u_isUWhiteSpace=UCHAR_WHITE_SPACE: Unicode White_Space property;
125 /** The lowest Unicode code point value. Code points are non-negative. @stable ICU 2.0 */
129 * The highest Unicode code point value (scalar value) according to
130 * The Unicode Standard. This is a 21-bit value (20.1 bits, rounded up).
145 * Selection constants for Unicode properties.
147 * one of the Unicode properties.
149 * The properties APIs are intended to reflect Unicode properties as defined
150 * in the Unicode Character Database (UCD) and Unicode Technical Reports (UTR).
151 * For details about the properties see http://www.unicode.org/ucd/ .
152 * For names of Unicode properties see the UCD file PropertyAliases.txt.
154 * Important: If ICU is built with UCD files from Unicode versions below, e.g., 3.2,
155 * then properties marked with "new in Unicode 3.2" are not or not fully available.
167 * UCHAR_<Unicode property name>=<integer>,
178 /** First constant for binary Unicode properties. @stable ICU 2.1 */
193 /** Binary property Default_Ignorable_Code_Point (new in Unicode 3.2).
197 /** Binary property Deprecated (new in Unicode 3.2).
211 /** Binary property Grapheme_Base (new in Unicode 3.2).
215 /** Binary property Grapheme_Extend (new in Unicode 3.2).
219 /** Binary property Grapheme_Link (new in Unicode 3.2).
240 /** Binary property IDS_Binary_Operator (new in Unicode 3.2).
244 /** Binary property IDS_Trinary_Operator (new in Unicode 3.2).
251 /** Binary property Logical_Order_Exception (new in Unicode 3.2).
266 /** Binary property Radical (new in Unicode 3.2).
270 /** Binary property Soft_Dotted (new in Unicode 3.2).
279 /** Binary property Unified_Ideograph (new in Unicode 3.2).
301 /** Binary property STerm (new in Unicode 4.0.1).
303 (http://www.unicode.org/reports/tr29/)
306 /** Binary property Variation_Selector (new in Unicode 4.0.1).
342 Unicode normalization and combining character sequences.
351 /** Binary property Pattern_Syntax (new in Unicode 4.1).
353 (http://www.unicode.org/reports/tr31/)
356 /** Binary property Pattern_White_Space (new in Unicode 4.1).
358 (http://www.unicode.org/reports/tr31/)
405 * See http://www.unicode.org/reports/tr51/#Emoji_Properties
412 * See http://www.unicode.org/reports/tr51/#Emoji_Properties
419 * See http://www.unicode.org/reports/tr51/#Emoji_Properties
426 * See http://www.unicode.org/reports/tr51/#Emoji_Properties
434 * One more than the last constant for binary Unicode properties.
443 /** First constant for enumerated/integer Unicode properties. @stable ICU 2.2 */
455 See http://www.unicode.org/reports/tr11/
476 /** Enumerated property Hangul_Syllable_Type, new in Unicode 4.
495 see UNORM_FCD and http://www.unicode.org/notes/tn5/#FCD .
502 see UNORM_FCD and http://www.unicode.org/notes/tn5/#FCD .
505 /** Enumerated property Grapheme_Cluster_Break (new in Unicode 4.1).
507 (http://www.unicode.org/reports/tr29/)
510 /** Enumerated property Sentence_Break (new in Unicode 4.1).
512 (http://www.unicode.org/reports/tr29/)
515 /** Enumerated property Word_Break (new in Unicode 4.1).
517 (http://www.unicode.org/reports/tr29/)
520 /** Enumerated property Bidi_Paired_Bracket_Type (new in Unicode 6.3).
521 Used in UAX #9: Unicode Bidirectional Algorithm
522 (http://www.unicode.org/reports/tr9/)
527 * One more than the last constant for enumerated/integer Unicode properties.
542 /** First constant for bit-mask Unicode properties. @stable ICU 2.4 */
546 * One more than the last constant for bit-mask Unicode properties.
555 /** First constant for double Unicode properties. @stable ICU 2.4 */
559 * One more than the last constant for double Unicode properties.
568 /** First constant for string Unicode properties. @stable ICU 2.4 */
612 /** String property Bidi_Paired_Bracket (new in Unicode 6.3).
617 * One more than the last constant for string Unicode properties.
623 /** Miscellaneous property Script_Extensions (new in Unicode 6.0).
625 For more information, see UAX #24: http://www.unicode.org/reports/tr24/.
629 /** First constant for Unicode properties with unusual value types. @stable ICU 4.6 */
633 * One more than the last constant for Unicode properties with unusual value types.
644 * Data for enumerated Unicode general category types.
645 * See http://www.unicode.org/Public/UNIDATA/UnicodeData.html .
653 * / ** <Unicode 2-letter General_Category value> comment... * /
722 * http://www.unicode.org/policies/stability_policy.html#Property_Value
730 * U_GC_XX_MASK constants are bit flags corresponding to Unicode
849 * / ** <Unicode 1..3-letter Bidi_Class value> comment... * /
920 * U_BPT_<Unicode Bidi_Paired_Bracket_Type value name>
941 * Constants for Unicode blocks, see the Unicode Data file Blocks.txt
948 * UBLOCK_<Unicode Block value name> = <integer>,
951 /** New No_Block value in Unicode 4. @stable ICU 2.6 */
976 * Unicode 3.2 renames this block to "Greek and Coptic".
1084 * Unicode 3.2 renames this block to "Combining Diacritical Marks for Symbols".
1193 * Until Unicode 3.1.1, the corresponding block name was "Private Use",
1195 * Unicode 3.2 renames the block for the BMP PUA to "Private Use Area" and
1203 * Until Unicode 3.1.1, the corresponding block name was "Private Use",
1205 * Unicode 3.2 renames the block for the BMP PUA to "Private Use Area" and
1239 /* New blocks in Unicode 3.1 */
1260 /* New blocks in Unicode 3.2 */
1265 * Unicode 4.0.1 renames the "Cyrillic Supplementary" block to "Cyrillic Supplement".
1296 /* New blocks in Unicode 4 */
1329 /* New blocks in Unicode 4.1 */
1372 /* New blocks in Unicode 5.0 */
1393 /* New blocks in Unicode 5.1 */
1430 /* New blocks in Unicode 5.2 */
1485 /* New blocks in Unicode 6.0 */
1512 /* New blocks in Unicode 6.1 */
1537 /* New blocks in Unicode 7.0 */
1604 /* New blocks in Unicode 8.0 */
1627 /* New blocks in Unicode 9.0 */
1680 * U_EA_<Unicode East_Asian_Width value name>
1703 * Unicode character; or the name that was defined in
1704 * Unicode version 1.0, before the Unicode standard merged
1706 * Unicode code point a unique name.
1712 /** Unicode character name (Name property). @stable ICU 2.0 */
1740 * Unicode allows for additional names, beyond the long and short
1770 * U_DT_<Unicode Decomposition_Type value name>
1812 * U_JT_<Unicode Joining_Type value name>
1842 * U_JG_<Unicode Joining_Group value name>
1956 * U_GCB_<Unicode Grapheme_Cluster_Break value name>
1970 U_GCB_SPACING_MARK = 10, /*[SM]*/ /* from here on: new in Unicode 5.1/ICU 4.0 */
1974 U_GCB_REGIONAL_INDICATOR = 12, /*[RI]*/ /* new in Unicode 6.2/ICU 50 */
1976 U_GCB_E_BASE = 13, /*[EB]*/ /* from here on: new in Unicode 9.0/ICU 58 */
2007 * U_WB_<Unicode Word_Break value name>
2019 U_WB_CR = 8, /*[CR]*/ /* from here on: new in Unicode 5.1/ICU 4.0 */
2029 U_WB_REGIONAL_INDICATOR = 13, /*[RI]*/ /* new in Unicode 6.2/ICU 50 */
2031 U_WB_HEBREW_LETTER = 14, /*[HL]*/ /* from here on: new in Unicode 6.3/ICU 52 */
2037 U_WB_E_BASE = 17, /*[EB]*/ /* from here on: new in Unicode 9.0/ICU 58 */
2067 * U_SB_<Unicode Sentence_Break value name>
2081 U_SB_CR = 11, /*[CR]*/ /* from here on: new in Unicode 5.1/ICU 4.0 */
2106 * U_LB_<Unicode Line_Break value name>
2124 /** Renamed from the misspelled "inseperable" in Unicode 4.0.1/ICU 3.0 @stable ICU 3.0 */
2141 U_LB_NEXT_LINE = 29, /*[NL]*/ /* from here on: new in Unicode 4/ICU 2.6 */
2145 U_LB_H2 = 31, /*[H2]*/ /* from here on: new in Unicode 4.1/ICU 3.4 */
2155 U_LB_CLOSE_PARENTHESIS = 36, /*[CP]*/ /* new in Unicode 5.2/ICU 4.4 */
2157 U_LB_CONDITIONAL_JAPANESE_STARTER = 37,/*[CJ]*/ /* new in Unicode 6.1/ICU 49 */
2159 U_LB_HEBREW_LETTER = 38, /*[HL]*/ /* new in Unicode 6.1/ICU 49 */
2161 U_LB_REGIONAL_INDICATOR = 39,/*[RI]*/ /* new in Unicode 6.2/ICU 50 */
2163 U_LB_E_BASE = 40, /*[EB]*/ /* from here on: new in Unicode 9.0/ICU 58 */
2189 * U_NT_<Unicode Numeric_Type value name>
2217 * U_HST_<Unicode Hangul_Syllable_Type value name>
2238 * Check a binary Unicode property for a code point.
2240 * Unicode, especially in version 3.2, defines many more properties than the
2243 * The properties APIs are intended to reflect Unicode properties as defined
2244 * in the Unicode Character Database (UCD) and Unicode Technical Reports (UTR).
2245 * For details about the properties see http://www.unicode.org/ucd/ .
2246 * For names of Unicode properties see the UCD file PropertyAliases.txt.
2248 * Important: If ICU is built with UCD files from Unicode versions below 3.2,
2249 * then properties marked with "new in Unicode 3.2" are not or not fully available.
2254 * @return TRUE or FALSE according to the binary Unicode property value for c.
2255 * Also FALSE if 'which' is out of bounds or if the Unicode version
2267 * Check if a code point has the Alphabetic Unicode property.
2271 * @return true if the code point has the Alphabetic Unicode property, false otherwise
2282 * Check if a code point has the Lowercase Unicode property.
2286 * @return true if the code point has the Lowercase Unicode property, false otherwise
2297 * Check if a code point has the Uppercase Unicode property.
2301 * @return true if the code point has the Uppercase Unicode property, false otherwise
2312 * Check if a code point has the White_Space Unicode property.
2320 * @return true if the code point has the White_Space Unicode property, false otherwise.
2333 * Get the property value for an enumerated or integer Unicode property for a code point.
2336 * Unicode, especially in version 3.2, defines many more properties than the
2339 * The properties APIs are intended to reflect Unicode properties as defined
2340 * in the Unicode Character Database (UCD) and Unicode Technical Reports (UTR).
2341 * For details about the properties see http://www.unicode.org/ .
2342 * For names of Unicode properties see the UCD file PropertyAliases.txt.
2357 * Returns 0 or 1 (for FALSE/TRUE) for binary Unicode properties.
2359 * Returns 0 if 'which' is out of bounds or if the Unicode version
2373 * Get the minimum value for an enumerated/integer/binary Unicode property.
2380 * @return Minimum value returned by u_getIntPropertyValue for a Unicode property.
2394 * Get the maximum value for an enumerated/integer/binary Unicode property.
2398 * Examples for min/max values (for Unicode 3.2):
2409 * @return Maximum value returned by u_getIntPropertyValue for a Unicode property.
2423 * Get the numeric value for a Unicode code point as defined in the
2424 * Unicode Character Database.
2429 * For characters without any numeric values in the Unicode Character Database,
2431 * Note: This is different from the Unicode Standard which specifies NaN as the default value.
2529 * Beginning with Unicode 4, this is the same as
2652 * TRUE for Unicode White_Space characters except for "vertical space controls"
2748 * - It is a Unicode Separator character (categories "Z" = "Zs" or "Zl" or "Zp"), but is not
2762 * the exact same results because of the Unicode version
2765 * Note: Unicode 4.0.1 changed U+200B ZERO WIDTH SPACE from a Space Separator (Zs)
2767 * See http://www.unicode.org/versions/Unicode4.0.1/
2845 * Note that this is different from the Unicode definition in
2863 * which is used in the Unicode bidirectional algorithm
2864 * (UAX #9 http://www.unicode.org/reports/tr9/).
2901 * sometimes need a "poor man's" mapping to another Unicode
2909 * @return another Unicode code point that may serve as a mirror-image
2924 * See http://www.unicode.org/reports/tr9/
2971 * with the same Unicode general category ("character type").
2989 * Enumerate efficiently all code points with their Unicode general categories.
2997 * The Unicode Standard guarantees that the numeric value of the type is 0..31.
3033 * Unicode 4 explicitly assigns Han number characters the Numeric_Type
3038 * for complete numeric Unicode properties.
3051 * Returns the Unicode allocation block that contains the character.
3063 * Retrieve the name of a Unicode character.
3066 * in Unicode version 1.0.
3069 * Unicode 1.0 names are only retrieved if they are different from the modern
3103 * The Unicode ISO_Comment property is deprecated and has no values.
3125 * Find a Unicode character by its name and return its code point value.
3129 * A Unicode 1.0 name is matched only if it differs from the modern name.
3130 * Unicode names are all uppercase. Extended names are lowercase followed
3136 * @return The Unicode value of the code point with the given name,
3151 * for each Unicode character with the code point value and
3156 * @param code The Unicode code point for the character with this name.
3173 * Enumerate all assigned Unicode characters between the start and limit
3176 * For Unicode 1.0 names, only those are enumerated that differ from the
3201 * Return the Unicode name for a given property, as given in the
3202 * Unicode database file PropertyAliases.txt.
3214 * have a short name, but some do not. Unicode allows for
3237 * in the Unicode database file PropertyAliases.txt. Short, long, and
3258 * Return the Unicode name for a given property value, as given in the
3259 * Unicode database file PropertyValueAliases.txt.
3287 * a short name, but some do not. Unicode allows for
3311 * specified in the Unicode database file PropertyValueAliases.txt.
3346 * first character in an identifier according to Unicode
3347 * (The Unicode Standard, Version 3.0, chapter 5.16 Identifiers).
3373 * Almost the same as Unicode's ID_Continue (UCHAR_ID_CONTINUE)
3374 * except that Unicode recommends to ignore Cf which is less than
3398 * Note that Unicode just recommends to ignore Cf (format controls).
3535 * Before Unicode 3.2, CaseFolding.txt contains mappings marked with 'I' that
3539 * Unicode 3.2 CaseFolding.txt instead contains mappings marked with 'T' that
3646 * The "age" is the Unicode version when the code point was first
3654 * @param versionArray The Unicode version number array, to be filled in.
3662 * Gets the Unicode version information.
3664 * for the Unicode standard that is currently used by ICU.
3665 * For example, Unicode version 3.1.1 is represented as an array with
3669 * the Unicode version number
3678 * See Unicode Standard Annex #15 for details, search for "FC_NFKC_Closure"
3679 * or for "FNC": http://www.unicode.org/reports/tr15/
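
The hits at lines 39, 2248 and 3662-3669 describe a Unicode version as a small numeric array and warn that properties marked "new in Unicode 3.2" may be unavailable when the data comes from an older UCD. The sketch below shows one way such a version array could be built and compared. It does not use ICU itself: `VersionInfo`, `make_version` and `version_below` are hypothetical names introduced only for illustration. The four-byte layout (major, minor, and two further fields) is an assumption about how the array is organized; the listing does not show it.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the library's version array type.
 * Assumption: four one-byte fields, most significant field first. */
enum { VERSION_LENGTH = 4 };
typedef uint8_t VersionInfo[VERSION_LENGTH];

/* Fill dest with major.minor.milli; the fourth field is set to zero. */
static void make_version(VersionInfo dest,
                         uint8_t major, uint8_t minor, uint8_t milli) {
    dest[0] = major;
    dest[1] = minor;
    dest[2] = milli;
    dest[3] = 0;
}

/* Return nonzero if version a is lower than version b.
 * memcmp compares the bytes as unsigned values from the first field on,
 * which matches field-by-field version ordering for this layout. */
static int version_below(const VersionInfo a, const VersionInfo b) {
    return memcmp(a, b, VERSION_LENGTH) < 0;
}
```

With this layout, a caller could fill one array with 3.1.1 and another with 3.2.0, then use `version_below` to decide whether properties introduced in 3.2 can be expected in the data. In real code the version array would come from the library call named in the listing (`u_getUnicodeVersion`) rather than from `make_version`.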