1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" 2"https://www.w3.org/TR/html4/loose.dtd"> 3<html> 4<head> 5 <meta name="generator" content= 6 "HTML Tidy for HTML5 for Apple macOS version 5.6.0"> 7 <meta http-equiv="Content-Type" content= 8 "text/html; charset=utf-8"> 9 <meta http-equiv="Content-Language" content="en-us"> 10 <link rel="stylesheet" href= 11 "../reports.css" type="text/css"> 12 <title>UTS #35: Unicode LDML: Collation</title> 13 <style type="text/css"> 14 <!-- 15 .dtd { 16 font-family: monospace; 17 font-size: 90%; 18 background-color: #CCCCFF; 19 border-style: dotted; 20 border-width: 1px; 21 } 22 23 .xmlExample { 24 font-family: monospace; 25 font-size: 80% 26 } 27 28 .blockedInherited { 29 font-style: italic; 30 font-weight: bold; 31 border-style: dashed; 32 border-width: 1px; 33 background-color: #FF0000 34 } 35 36 .inherited { 37 font-weight: bold; 38 border-style: dashed; 39 border-width: 1px; 40 background-color: #00FF00 41 } 42 43 .element { 44 font-weight: bold; 45 color: red; 46 } 47 48 .attribute { 49 font-weight: bold; 50 color: maroon; 51 } 52 53 .attributeValue { 54 font-weight: bold; 55 color: blue; 56 } 57 58 li, p { 59 margin-top: 0.5em; 60 margin-bottom: 0.5em 61 } 62 63 h2, h3, h4, table { 64 margin-top: 1.5em; 65 margin-bottom: 0.5em; 66 } 67 --> 68 </style> 69</head> 70<body> 71 <table class="header" width="100%"> 72 <tr> 73 <td class="icon"><a href="https://unicode.org"><img alt= 74 "[Unicode]" src="../logo60s2.gif" 75 width="34" height="33" style= 76 "vertical-align: middle; border-left-width: 0px; border-bottom-width: 0px; border-right-width: 0px; border-top-width: 0px;"></a> 77 <a class="bar" href= 78 "https://www.unicode.org/reports/">Technical Reports</a></td> 79 </tr> 80 <tr> 81 <td class="gray"> </td> 82 </tr> 83 </table> 84 <div class="body"> 85 <h2 style="text-align: center">Unicode Technical Standard #35</h2> 86 <h1>Unicode Locale Data Markup Language (LDML)<br> 87 Part 5: Collation</h1> 88 <!-- At least the first row of this header table should be identical across the parts of this UTS. --> 89 <table border="1" cellpadding="2" cellspacing="0" class="wide"> 90 <tr> 91 <td>Version</td> 92 <td>38</td> 93 </tr> 94 <tr> 95 <td>Editors</td> 96 <td>Markus Scherer (<a href="mailto:markus.icu@gmail.com">markus.icu@gmail.com</a>) and 97 <a href="tr35.html#Acknowledgments">other CLDR committee 98 members</a></td> 99 </tr> 100 </table> 101 <p>For the full header, summary, and status, see <a href= 102 "tr35.html">Part 1: Core</a></p> 103 <h3><i>Summary</i></h3> 104 <p>This document describes parts of an XML format 105 (<i>vocabulary</i>) for the exchange of structured locale data. 106 This format is used in the <a href= 107 "https://unicode.org/cldr/">Unicode Common Locale Data 108 Repository</a>.</p> 109 <p>This is a partial document, describing only those parts of 110 the LDML that are relevant for collation (sorting, searching 111 & grouping). For the other parts of the LDML see the 112 <a href="tr35.html">main LDML document</a> and the links 113 above.</p> 114 <h3><i>Status</i></h3> 115 116 <!-- NOT YET APPROVED 117 <p> 118 <i class="changed">This is a<b><font color="#ff3333"> 119 draft </font></b>document which may be updated, replaced, or superseded by 120 other documents at any time. Publication does not imply endorsement 121 by the Unicode Consortium. This is not a stable document; it is 122 inappropriate to cite this document as other than a work in 123 progress. 124 </i> 125 </p> 126 END NOT YET APPROVED --> 127 <!-- APPROVED --> 128 <p><i>This document has been reviewed by Unicode members and 129 other interested parties, and has been approved for publication 130 by the Unicode Consortium. This is a stable document and may be 131 used as reference material or cited as a normative reference by 132 other specifications.</i></p> 133 <!-- END APPROVED --> 134 135 <blockquote> 136 <p><i><b>A Unicode Technical Standard (UTS)</b> is an 137 independent specification. Conformance to the Unicode 138 Standard does not imply conformance to any UTS.</i></p> 139 </blockquote> 140 <p><i>Please submit corrigenda and other comments with the CLDR 141 bug reporting form [<a href="tr35.html#Bugs">Bugs</a>]. Related 142 information that is useful in understanding this document is 143 found in the <a href="tr35.html#References">References</a>. For 144 the latest version of the Unicode Standard see [<a href= 145 "tr35.html#Unicode">Unicode</a>]. For a list of current Unicode 146 Technical Reports see [<a href= 147 "tr35.html#Reports">Reports</a>]. For more information about 148 versions of the Unicode Standard, see [<a href= 149 "tr35.html#Versions">Versions</a>].</i></p> 150 <h2><a name="Parts" href="#Parts" id="Parts">Parts</a></h2> 151 <!-- This section of Parts should be identical in all of the parts of this UTS. --> 152 <p>The LDML specification is divided into the following 153 parts:</p> 154 <ul class="toc"> 155 <li>Part 1: <a href="tr35.html#Contents">Core</a> (languages, 156 locales, basic structure)</li> 157 <li>Part 2: <a href="tr35-general.html#Contents">General</a> 158 (display names & transforms, etc.)</li> 159 <li>Part 3: <a href="tr35-numbers.html#Contents">Numbers</a> 160 (number & currency formatting)</li> 161 <li>Part 4: <a href="tr35-dates.html#Contents">Dates</a> 162 (date, time, time zone formatting)</li> 163 <li>Part 5: <a href= 164 "tr35-collation.html#Contents">Collation</a> (sorting, 165 searching, grouping)</li> 166 <li>Part 6: <a href= 167 "tr35-info.html#Contents">Supplemental</a> (supplemental 168 data)</li> 169 <li>Part 7: <a href= 170 "tr35-keyboards.html#Contents">Keyboards</a> (keyboard 171 mappings)</li> 172 </ul> 173 <h2><a name="Contents" href="#Contents" id="Contents">Contents 174 of Part 5, Collation</a></h2> 175 <!-- START Generated TOC: CheckHtmlFiles --> 176 <ul class="toc"> 177 <li>1 <a href="#CLDR_Collation">CLDR Collation</a> 178 <ul class="toc"> 179 <li>1.1 <a href="#CLDR_Collation_Algorithm">CLDR 180 Collation Algorithm</a> 181 <ul class="toc"> 182 <li>1.1.1 <a href="#Algorithm_FFFE">U+FFFE</a></li> 183 <li>1.1.2 <a href= 184 "#Context_Sensitive_Mappings">Context-Sensitive 185 Mappings</a></li> 186 <li>1.1.3 <a href="#Algorithm_Case">Case 187 Handling</a></li> 188 <li>1.1.4 <a href= 189 "#Algorithm_Reordering_Groups">Reordering 190 Groups</a></li> 191 <li>1.1.5 <a href="#Combining_Rules">Combining 192 Rules</a></li> 193 </ul> 194 </li> 195 </ul> 196 </li> 197 <li>2 <a href="#Root_Collation">Root Collation</a> 198 <ul class="toc"> 199 <li>2.1 <a href= 200 "#grouping_classes_of_characters">Grouping classes of 201 characters</a></li> 202 <li>2.2 <a href="#non_variable_symbols">Non-variable 203 symbols</a></li> 204 <li>2.3 <a href="#tibetan_contractions">Additional 205 contractions for Tibetan</a></li> 206 <li>2.4 <a href="#tailored_noncharacter_weights">Tailored 207 noncharacter weights</a></li> 208 <li>2.5 <a href="#Root_Data_Files">Root Collation Data 209 Files</a></li> 210 <li>2.6 <a href="#Root_Data_File_Formats">Root Collation 211 Data File Formats</a> 212 <ul class="toc"> 213 <li>2.6.1 <a href= 214 "#File_Format_allkeys_CLDR_txt">allkeys_CLDR.txt</a></li> 215 <li>2.6.2 <a href= 216 "#File_Format_FractionalUCA_txt">FractionalUCA.txt</a></li> 217 <li>2.6.3 <a href= 218 "#File_Format_UCA_Rules_txt">UCA_Rules.txt</a></li> 219 </ul> 220 </li> 221 </ul> 222 </li> 223 <li>3 <a href="#Collation_Tailorings">Collation 224 Tailorings</a> 225 <ul class="toc"> 226 <li>3.1 <a href="#Collation_Types">Collation Types</a> 227 <ul class="toc"> 228 <li>3.1.1 <a href= 229 "#Collation_Type_Fallback">Collation Type 230 Fallback</a> 231 <ul class="toc"> 232 <li>Table: <a href= 233 "#Sample_requested_and_actual_collation_locales_and_types"> 234 Sample requested and actual collation locales and 235 types</a></li> 236 </ul> 237 </li> 238 </ul> 239 </li> 240 <li>3.2 <a href="#Collation_Version">Version</a></li> 241 <li>3.3 <a href="#Collation_Element">Collation 242 Element</a></li> 243 <li>3.4 <a href="#Setting_Options">Setting Options</a> 244 <ul class="toc"> 245 <li>Table: <a href="#Collation_Settings">Collation 246 Settings</a></li> 247 <li>3.4.1 <a href="#Common_Settings">Common settings 248 combinations</a></li> 249 <li>3.4.2 <a href="#Normalization_Setting">Notes on 250 the normalization setting</a></li> 251 <li>3.4.3 <a href="#Variable_Top_Settings">Notes on 252 variable top settings</a></li> 253 </ul> 254 </li> 255 <li>3.5 <a href="#Rules">Collation Rule Syntax</a></li> 256 <li>3.6 <a href="#Orderings">Orderings</a> 257 <ul class="toc"> 258 <li>Table: <a href= 259 "#Specifying_Collation_Ordering">Specifying Collation 260 Ordering</a></li> 261 <li>Table: <a href= 262 "#Abbreviating_Ordering_Specifications">Abbreviating 263 Ordering Specifications</a></li> 264 </ul> 265 </li> 266 <li>3.7 <a href="#Contractions">Contractions</a> 267 <ul class="toc"> 268 <li>Table: <a href= 269 "#Specifying_Contractions">Specifying 270 Contractions</a></li> 271 </ul> 272 </li> 273 <li>3.8 <a href="#Expansions">Expansions</a></li> 274 <li>3.9 <a href="#Context_Before">Context Before</a> 275 <ul class="toc"> 276 <li>Table: <a href= 277 "#Specifying_Previous_Context">Specifying Previous 278 Context</a></li> 279 </ul> 280 </li> 281 <li>3.10 <a href= 282 "#Placing_Characters_Before_Others">Placing Characters 283 Before Others</a></li> 284 <li>3.11 <a href="#Logical_Reset_Positions">Logical Reset 285 Positions</a> 286 <ul class="toc"> 287 <li>Table: <a href= 288 "#Specifying_Logical_Positions">Specifying Logical 289 Positions</a></li> 290 </ul> 291 </li> 292 <li>3.12 <a href= 293 "#Special_Purpose_Commands">Special-Purpose Commands</a> 294 <ul class="toc"> 295 <li>Table: <a href= 296 "#Special_Purpose_Elements">Special-Purpose 297 Elements</a></li> 298 </ul> 299 </li> 300 <li>3.13 <a href="#Script_Reordering">Collation 301 Reordering</a> 302 <ul class="toc"> 303 <li>3.13.1 <a href= 304 "#Interpretation_reordering">Interpretation of a 305 reordering list</a></li> 306 <li>3.13.2 <a href= 307 "#Reordering_Groups_allkeys">Reordering Groups for 308 allkeys.txt</a></li> 309 </ul> 310 </li> 311 <li>3.14 <a href="#Case_Parameters">Case Parameters</a> 312 <ul class="toc"> 313 <li>3.14.1 <a href="#Case_Untailored">Untailored 314 Characters</a></li> 315 <li>3.14.2 <a href="#Case_Weights">Compute Modified 316 Collation Elements</a></li> 317 <li>3.14.3 <a href="#Case_Tailored">Tailored 318 Strings</a></li> 319 </ul> 320 </li> 321 <li>3.15 <a href="#Visibility">Visibility</a></li> 322 <li>3.16 <a href="#Collation_Indexes">Collation 323 Indexes</a> 324 <ul class="toc"> 325 <li>3.16.1 <a href="#Index_Characters">Index 326 Characters</a></li> 327 <li>3.16.2 <a href="#CJK_Index_Markers">CJK Index 328 Markers</a></li> 329 </ul> 330 </li> 331 </ul> 332 </li> 333 </ul><!-- END Generated TOC: CheckHtmlFiles --> 334 <h2>1 <a name="CLDR_Collation" href="#CLDR_Collation" id= 335 "CLDR_Collation">CLDR Collation</a></h2> 336 <p>Collation is the general term for the process and function 337 of determining the sorting order of strings of characters, for 338 example for lists of strings presented to users, or in 339 databases for sorting and selecting records.</p> 340 <p>Collation varies by language, by application (some languages 341 use special phonebook sorting), and other criteria (for 342 example, phonetic vs. visual).</p> 343 <p>CLDR provides collation data for many languages and styles. 344 The data supports not only sorting but also language-sensitive 345 searching and grouping under index headers. All CLDR collations 346 are based on the [<a href= 347 "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>] default 348 order, with common modifications applied in the CLDR root 349 collation, and further tailored for language and style as 350 needed.</p> 351 <h3>1.1 <a name="CLDR_Collation_Algorithm" href= 352 "#CLDR_Collation_Algorithm" id="CLDR_Collation_Algorithm">CLDR 353 Collation Algorithm</a></h3> 354 <p>The CLDR collation algorithm is an extension of the <a href= 355 "https://www.unicode.org/reports/tr10/#Main_Algorithm">Unicode 356 Collation Algorithm</a>.</p> 357 <h4>1.1.1 <a name="Algorithm_FFFE" href="#Algorithm_FFFE" id= 358 "Algorithm_FFFE">U+FFFE</a></h4> 359 <p>U+FFFE maps to a CE with a minimal, unique primary weight. 360 Its primary weight is not "variable": U+FFFE must not become 361 ignorable in alternate handling. On the identical level, a 362 minimal, unique “weight” must be emitted for U+FFFE as well. 363 This allows for <a href= 364 "https://www.unicode.org/reports/tr10/#Merging_Sort_Keys">Merging 365 Sort Keys</a> within code point space.</p> 366 <p>For example, when sorting names in a database, a sortable 367 string can be formed with <em>last_name</em> + '\uFFFE' + 368 <em>first_name</em>. These strings would sort properly, without 369 ever comparing the last part of a last name with the first part 370 of another first name.</p> 371 <p>For backwards secondary level sorting, text <i>segments</i> 372 separated by U+FFFE are processed in forward segment order, and 373 <i>within</i> each segment the secondary weights are compared 374 backwards. This is so that such combined strings are processed 375 consistently with merging their sort keys (for example, by 376 concatenating them level by level with a low separator).</p> 377 <p class="note">Note: With unique, low weights on <i>all</i> 378 levels it is possible to achieve <code>sortkey(str1 + "\uFFFE" 379 + str2) == mergeSortkeys(sortkey(str1), sortkey(str2))</code> . 380 When that is not necessary, then code can be a little simpler 381 (no special handling for U+FFFE except for 382 backwards-secondary), sort keys can be a little shorter (when 383 using compressible common non-primary weights for U+FFFE), and 384 another low weight can be used in tailorings.</p> 385 <h4>1.1.2 <a name="Context_Sensitive_Mappings" href= 386 "#Context_Sensitive_Mappings" id= 387 "Context_Sensitive_Mappings">Context-Sensitive 388 Mappings</a></h4> 389 <p>Contraction matching, as in the UCA, starts from the first 390 character of the contraction string. It slows down processing 391 of that first character even when none of its contractions 392 matches. In some cases, it is preferrable to change such 393 contractions to mappings with a prefix (context before a 394 character), so that complex processing is done only when the 395 less-frequently occurring trailing character is 396 encountered.</p> 397 <p>For example, the DUCET contains contractions for several 398 variants of L· (L followed by middle dot). Collating ASCII text 399 is slowed down by contraction matching starting with L/l. In 400 the CLDR root collation, these contractions are replaced by 401 prefix mappings (L|·) which are triggered only when the middle 402 dot is encountered. CLDR also uses prefix rules in the Japanese 403 tailoring, for processing of Hiragana/Katakana length and 404 iteration marks.</p> 405 <p>The mapping is conditional on the prefix match but does not 406 change the mappings for the preceding text. As a result, a 407 contraction mapping for "px" can be replaced by a prefix rule 408 "p|x" only if px maps to the collation elements for p followed 409 by the collation elements for "x if after p". In the DUCET, L· 410 maps to CE(L) followed by a special secondary CE (which differs 411 from CE(·) when · is not preceded by L). In the CLDR root 412 collation, L has no context-sensitive mappings, but · maps to 413 that special secondary CE if preceded by L.</p> 414 <p>A prefix mapping for p|x behaves mostly like the contraction 415 px, except when there is a contraction that overlaps with the 416 prefix, for example one for "op". A contraction matches only 417 new text (and consumes it), while a prefix matches only 418 already-consumed text.</p> 419 <ul> 420 <li>With mappings for "op" and "px", only the first 421 contraction matches in text "opx". (It consumes the "op" 422 characters, and there is no context-sensitive mapping for 423 x.)</li> 424 <li>With mappings for "op" and "p|x", both the contraction 425 and the prefix rule match in text "opx". (The prefix always 426 matches already-consumed characters, regardless of whether 427 they mapped as part of contractions.)</li> 428 </ul> 429 <p class="note">Note: Matching of discontiguous contractions 430 should be implemented without rewriting the text (unlike in the 431 [<a href="https://www.unicode.org/reports/tr41/#UTS10">UCA</a>] 432 algorithm specification), so that prefix matching is 433 predictable. (It should also help with contraction matching 434 performance.) An implementation that does rewrite the text, as 435 in the UCA, will get different results for some (unusual) 436 combinations of contractions, prefix rules, and input text.</p> 437 <p>Prefix matching uses a simple longest-match algorithm (op|c 438 wins over p|c). It is recommended that prefix rules be limited 439 to mappings where both the prefix string and the mapped string 440 begin with an NFC boundary (that is, with a normalization 441 starter that does not combine backwards). (In op|ch both o and 442 c should be starters (ccc=0) and NFC_QC=Yes.) Otherwise, prefix 443 matching would be affected by canonical reordering and 444 discontiguous matching, like contractions. Prefix matching is 445 thus always contiguous.</p> 446 <p>A character can have mappings with both prefixes (context 447 before) and contraction suffixes. Prefixes are matched first. 448 This is to keep them reasonably implementable: When there is a 449 mapping with both a prefix and a contraction suffix (like in 450 Japanese: ぐ|ゞ), then the matching needs to go in both 451 directions. The contraction might involve discontiguous 452 matching, which needs complex text iteration and handling of 453 skipped combining marks, and will consume the matching suffix. 454 Prefix matching should be first because, regardless of whether 455 there is a match, the implementation will always return to the 456 original text index (right after the prefix) from where it will 457 start to look at all of the contractions for that prefix.</p> 458 <p>If there is a match for a prefix but no match for any of the 459 suffixes for that prefix, then fall back to mappings with the 460 next-longest matching prefix, and so on, ultimately to mappings 461 with no prefix. (Otherwise mappings with longer prefixes would 462 “hide” mappings with shorter prefixes.)</p> 463 <p>Consider the following mappings.</p> 464 <ol> 465 <li>p → CE(p)</li> 466 <li>h → CE(h)</li> 467 <li>c → CE(c)</li> 468 <li>ch → CE(d)</li> 469 <li>p|c → CE(u)</li> 470 <li>p|ci → CE(v)</li> 471 <li>p|ĉ → CE(w)</li> 472 <li>op|ck → CE(x)</li> 473 </ol> 474 <p>With these, text collates like this:</p> 475 <ul> 476 <li>pc → CE(p)CE(u)</li> 477 <li>pci → CE(p)CE(v)</li> 478 <li>pch → CE(p)CE(u)CE(h)</li> 479 <li>pĉ → CE(p)CE(w)</li> 480 <li>pĉ̣ → CE(p)CE(w)CE(U+0323) // discontiguous</li> 481 <li>opck → CE(o)CE(p)CE(x)</li> 482 <li>opch → CE(o)CE(p)CE(u)CE(h)</li> 483 </ul> 484 <p>However, if the mapping p|c → CE(u) is missing, then text 485 "pch" maps to CE(p)CE(d), "opch" maps to CE(o)CE(p)CE(d), and 486 "pĉ̣" maps to CE(p)CE(c)CE(U+0323)CE(U+0302) (because 487 discontiguous contraction matching extends <i>an existing 488 match</i> by one non-starter at a time).</p> 489 <h4>1.1.3 <a name="Algorithm_Case" href="#Algorithm_Case" id= 490 "Algorithm_Case">Case Handling</a></h4> 491 <p>CLDR specifies how to sort lowercase or uppercase first, as 492 a stronger distinction than other tertiary variants 493 (<strong>caseFirst</strong>) or while completely ignoring all 494 other tertiary distinctions (<strong>caseLevel</strong>). See 495 <i>Section 3.3 <a href="#Setting_Options">Setting 496 Options</a></i> and <i>Section 3.13 <a href= 497 "#Case_Parameters">Case Parameters</a></i>.</p> 498 <h4>1.1.4 <a name="Algorithm_Reordering_Groups" href= 499 "#Algorithm_Reordering_Groups" id= 500 "Algorithm_Reordering_Groups">Reordering Groups</a></h4> 501 <p>CLDR specifies how to do parametric reordering of groups of 502 scripts (e.g., “native script first”) as well as special groups 503 (e.g., “digits after letters”), and provides data for the 504 effective implementation of such reordering.</p> 505 <h4>1.1.5 <a name="Combining_Rules" href="#Combining_Rules" id= 506 "Combining_Rules">Combining Rules</a></h4> 507 <p>Rules from different sources can be combined, with the later 508 rules overriding the earlier ones. The following is an example 509 of how this can be useful.</p> 510 <p>There is a root collation for "emoji" in CLDR. So use of 511 "-u-co-emoji" in a Unicode locale identifier will access that 512 ordering.</p> 513 <p>Example, using ICU:</p> 514 <blockquote> 515 <p>collator = 516 Collator.getInstance(ULocale.forLanguageTag("en-u-co-emoji"));</p> 517 </blockquote> 518 <p>However, use of the emoji will supplant the language's 519 customizations. So the above is the equivalent of:</p> 520 <blockquote> 521 <p>collator = 522 Collator.getInstance(ULocale.forLanguageTag("und-u-co-emoji"));</p> 523 </blockquote> 524 <p>The same structure will not work for a language that does 525 require customization, like Danish. That is, the following will 526 fail.</p> 527 <blockquote> 528 <p>collator = 529 Collator.getInstance(ULocale.forLanguageTag("da-u-co-emoji"));</p> 530 </blockquote> 531 <p>For that, a slightly more cumbersome method needs to be 532 employed, which is to take the rules for Danish, and explicitly 533 add the rules for emoji.</p> 534 <blockquote> 535 <p>RuleBasedCollator collator = new RuleBasedCollator(<br> 536 ((RuleBasedCollator) 537 Collator.getInstance(ULocale.forLanguageTag("da"))).getRules() 538 +<br> 539 ((RuleBasedCollator) 540 Collator.getInstance(ULocale.forLanguageTag("und-u-co-emoji")))<br> 541 542 .getRules());</p> 543 </blockquote> 544 <p>The following table shows the differences. When emoji 545 ordering is supported, the two faces will be adjacent. When 546 Danish ordering is supported, the ü is after the y.</p> 547 <table class='simple'> 548 <tbody> 549 <tr> 550 <td>code point order</td> 551 <td>,</td> 552 <td></td> 553 <td></td> 554 <td>Z</td> 555 <td>a</td> 556 <td>y</td> 557 <td>ü</td> 558 <td>☹️</td> 559 <td>✈️️</td> 560 <td>글</td> 561 <td></td> 562 </tr> 563 <tr> 564 <td>en</td> 565 <td>,</td> 566 <td>☹️</td> 567 <td>✈️️</td> 568 <td></td> 569 <td>a</td> 570 <td>ü</td> 571 <td>y</td> 572 <td>Z</td> 573 <td>글</td> 574 </tr> 575 <tr> 576 <td>en-u-co-emoji</td> 577 <td>,</td> 578 <td></td> 579 <td>☹️</td> 580 <td>✈️️</td> 581 <td>a</td> 582 <td>ü</td> 583 <td>y</td> 584 <td>Z</td> 585 <td>글</td> 586 </tr> 587 <tr> 588 <td>da</td> 589 <td>,</td> 590 <td>☹️</td> 591 <td>✈️️</td> 592 <td></td> 593 <td>a</td> 594 <td>y</td> 595 <td><strong><u>ü</u></strong></td> 596 <td>Z</td> 597 <td>글</td> 598 </tr> 599 <tr> 600 <td>da-u-co-emoji</td> 601 <td>,</td> 602 <td></td> 603 <td>☹️</td> 604 <td>✈️️</td> 605 <td>a</td> 606 <td><strong><u>ü</u></strong></td> 607 <td>y</td> 608 <td>Z</td> 609 <td>글</td> 610 </tr> 611 <tr> 612 <td>combined rules</td> 613 <td>,</td> 614 <td></td> 615 <td>☹️</td> 616 <td>✈️️</td> 617 <td>a</td> 618 <td>y</td> 619 <td><strong><u>ü</u></strong></td> 620 <td>Z</td> 621 <td>글</td> 622 </tr> 623 </tbody> 624 </table><br> 625 <p> </p> 626 <h2>2 <a name="Root_Collation" href="#Root_Collation" id= 627 "Root_Collation">Root Collation</a></h2> 628 <p>The CLDR root collation order is based on the <a href= 629 "https://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table"> 630 Default Unicode Collation Element Table (DUCET)</a> defined in 631 <em>UTS #10: Unicode Collation Algorithm</em> [<a href= 632 "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>]. It is 633 used by all other locales by default, or as the base for their 634 tailorings. (For a chart view of the UCA, see Collation Chart 635 [<a href="tr35.html#UCAChart">UCAChart</a>].)</p> 636 <p>Starting with CLDR 1.9, CLDR uses modified tables for the 637 root collation order. The root locale ordering is tailored in 638 the following ways:</p> 639 <h3>2.1 <a name="grouping_classes_of_characters" href= 640 "#grouping_classes_of_characters" id= 641 "grouping_classes_of_characters">Grouping classes of 642 characters</a></h3> 643 <p>As of Version 6.1.0, the DUCET puts characters into the 644 following ordering:</p> 645 <ul> 646 <li>First "common characters": whitespace, punctuation, 647 general symbols, some numbers, currency symbols, and other 648 numbers.</li> 649 <li>Then "script characters": Latin, Greek, and the rest of 650 the scripts.</li> 651 </ul> 652 <p>(There are a few exceptions to this general ordering.)</p> 653 <p>The CLDR root locale modifies the DUCET tailoring by 654 ordering the common characters more strictly by category:</p> 655 <ul> 656 <li>whitespace, punctuation, general symbols, currency 657 symbols, and numbers.</li> 658 </ul> 659 <p>What the regrouping allows is for users to parametrically 660 reorder the groups. For example, users can reorder numbers 661 after all scripts, or reorder Greek before Latin.</p> 662 <p>The relative order within each of these groups still matches 663 the DUCET. Symbols, punctuation, and numbers that are grouped 664 with a particular script stay with that script. The differences 665 between CLDR and the DUCET order are:</p> 666 <ol> 667 <li>CLDR groups the numbers together after currency symbols, 668 instead of splitting them with some before and some after. 669 Thus the following are put <em>after</em> currencies and just 670 before all the other numbers. 671 <blockquote> 672 <p>U+09F4 ( ৴ ) [No] BENGALI CURRENCY NUMERATOR ONE<br> 673 ...<br> 674 U+1D371 ( ) [No] COUNTING ROD TENS DIGIT NINE</p> 675 </blockquote> 676 </li> 677 <li>CLDR handles a few other characters differently 678 <ol> 679 <li>U+10A7F ( ) [Po] OLD SOUTH ARABIAN NUMERIC 680 INDICATOR is put with punctuation, not symbols</li> 681 <li>U+20A8 ( ₨ ) [Sc] RUPEE SIGN and U+FDFC ( ﷼ ) [Sc] 682 RIAL SIGN are put with currency signs, not with R and 683 REH.</li> 684 </ol> 685 </li> 686 </ol> 687 <h3>2.2 <a name="non_variable_symbols" href= 688 "#non_variable_symbols" id="non_variable_symbols">Non-variable 689 symbols</a></h3> 690 <p>There are multiple <a href= 691 "https://www.unicode.org/reports/tr10/#Variable_Weighting">Variable-Weighting</a> 692 options in the UCA for symbols and punctuation, including 693 <em>non-ignorable</em> and <em>shifted</em>. With the 694 <em>shifted</em> option, almost all symbols and punctuation are 695 ignored—except at a fourth level. The CLDR root locale ordering 696 is modified so that symbols are not affected by the 697 <em>shifted</em> option. That is, by default, symbols are not 698 “variable” in CLDR. So <em>shifted</em> only causes whitespace 699 and punctuation to be ignored, but not symbols (like ♥). The 700 DUCET behavior can be specified with a locale ID using the "kv" 701 keyword, to set the Variable section to include all of the 702 symbols below it, or be set parametrically where 703 implementations allow access.</p> 704 <p>See also:</p> 705 <ul> 706 <li><i>Section 3.3, <a href="#Setting_Options">Setting 707 Options</a></i></li> 708 <li><a href= 709 "https://www.unicode.org/charts/collation/">https://www.unicode.org/charts/collation/</a></li> 710 </ul> 711 <h3>2.3 <a name="tibetan_contractions" href= 712 "#tibetan_contractions" id="tibetan_contractions">Additional 713 contractions for Tibetan</a></h3> 714 <p>Ten contractions are added for Tibetan: Two to fulfill 715 <a href= 716 "https://www.unicode.org/reports/tr10/#WF5">well-formedness 717 condition 5</a>, and eight more to preserve the default order 718 for Tibetan. For details see <i>UTS #10, Section 3.8.2, 719 <a href="https://www.unicode.org/reports/tr10/#Well_Formed_DUCET"> 720 Well-Formedness of the DUCET</a></i>.</p> 721 <h3>2.4 <a name="tailored_noncharacter_weights" href= 722 "#tailored_noncharacter_weights" id= 723 "tailored_noncharacter_weights">Tailored noncharacter 724 weights</a></h3> 725 <p>U+FFFE and U+FFFF have special tailorings:</p> 726 <blockquote> 727 <p><strong>U+FFFF:</strong> This code point is tailored to 728 have a primary weight higher than all other characters. This 729 allows the reliable specification of a range, such as “Sch” ≤ 730 X ≤ “Sch\uFFFF”, to include all strings starting with "sch" 731 or equivalent.</p> 732 <p><strong>U+FFFE:</strong> This code point produces a CE 733 with minimal, unique weights on primary and identical levels. 734 For details see the <i><a href="#Algorithm_FFFE">CLDR 735 Collation Algorithm</a></i> above.</p> 736 </blockquote> 737 <p>UCA (beginning with version 6.3) also maps 738 <strong>U+FFFD</strong> to a special collation element with a 739 very high primary weight, so that it is reliably non-<a href= 740 "https://www.unicode.org/reports/tr10/#Variable_Weighting">variable</a>, 741 for use with <a href= 742 "https://www.unicode.org/reports/tr10/#Handling_Illformed">ill-formed 743 code unit sequences</a>.</p> 744 <p>In CLDR, so as to maintain the special collation elements, 745 <strong>U+FFFD..U+FFFF</strong> are not further tailorable, and 746 nothing can tailor to them. That is, neither can occur in a 747 collation rule. For example, the following rules are 748 illegal:</p> 749 <p><code>&\uFFFF < x</code></p> 750 <p><code>&x <\uFFFF</code><br></p> 751 <p class="note"><b>Note:</b></p> 752 <ul> 753 <li class="note">Java uses an early version of this collation 754 syntax, but has not been updated recently. It does not 755 support any of the syntax marked with [...], and its default 756 table is not the DUCET nor the CLDR root collation.</li> 757 </ul> 758 <h3>2.5 <a name="Root_Data_Files" href="#Root_Data_Files" id= 759 "Root_Data_Files">Root Collation Data Files</a></h3> 760 <p>The CLDR root collation data files are in the CLDR 761 repository and release, under the path <a href= 762 "https://github.com/unicode-org/cldr/tree/latest/common/uca/">common/uca/</a>.</p> 763 <p>For most data files there are <strong>_SHORT</strong> 764 versions available. They contain the same data but only minimal 765 comments, to reduce the file sizes.</p> 766 <p>Comments with DUCET-style weights in files other than 767 allkeys_CLDR.txt and allkeys_DUCET.txt use the weights defined 768 in allkeys_CLDR.txt.</p> 769 <ul> 770 <li><strong>allkeys_CLDR</strong> - A file that provides a 771 remapping of UCA DUCET weights for use with CLDR.</li> 772 <li><strong>allkeys_DUCET</strong> - The same as DUCET 773 allkeys.txt, but in alternate=non-ignorable sort order, for 774 easier comparison with allkeys_CLDR.txt.</li> 775 <li> 776 <strong>FractionalUCA</strong> - A file that provides a 777 remapping of UCA DUCET weights for use with CLDR. The 778 weight values are modified: 779 <ul> 780 <li>The weights have variable length, with 1..4 bytes 781 each. Each secondary or tertiary weight currently uses at 782 most 2 bytes.</li> 783 <li>There are tailoring gaps between adjacent weights, so 784 that a number of characters can be tailored to sort 785 between any two root collation elements.</li> 786 <li>There are collation elements with primary weights at 787 the boundaries between reordering groups and Unicode 788 scripts, so that tailoring around the first or last 789 primary of a group/script results in new collation 790 elements that sort and reorder together with that group 791 or script. These boundary weights also define the primary 792 weight ranges for parametric group and script 793 reordering.</li> 794 </ul>An implementation may modify the weights further to 795 fit the needs of its data structures. 796 </li> 797 <li><strong>UCA_Rules</strong> - A file that specifies the 798 root collation order in the form of <a href= 799 "#Collation_Tailorings">tailoring rules</a>. This is only an 800 approximation of the FractionalUCA data, since the rule 801 syntax cannot express every detail of the collation elements. 802 For example, in the DUCET and in FractionalUCA, tertiary 803 differences are usually expressed with special tertiary 804 weights on all collation elements of an expansion, while a 805 typical from-rules builder will modify the tertiary weight of 806 only one of the collation elements.</li> 807 <li> 808 <strong>CollationTest_CLDR</strong> - The CLDR versions of 809 the CollationTest files, which use the tailorings for CLDR. 810 For information on the format, see <a href= 811 "https://www.unicode.org/Public/UCA/latest/CollationTest.html"> 812 CollationTest.html</a> in the <a href= 813 "https://www.unicode.org/reports/tr10/#Data10">UCA data 814 directory</a>. 815 <ul> 816 <li>CollationTest_CLDR_NON_IGNORABLE.txt</li> 817 <li>CollationTest_CLDR_SHIFTED.txt</li> 818 </ul> 819 </li> 820 </ul> 821 <h3>2.6 <a name="Root_Data_File_Formats" href= 822 "#Root_Data_File_Formats" id="Root_Data_File_Formats">Root 823 Collation Data File Formats</a></h3> 824 <p>The file formats may change between versions of CLDR. The 825 formats for CLDR 23 and beyond are as follows. As usual, text 826 after a # is a comment.</p> 827 <h4>2.6.1 <a name="File_Format_allkeys_CLDR_txt" href= 828 "#File_Format_allkeys_CLDR_txt" id= 829 "File_Format_allkeys_CLDR_txt">allkeys_CLDR.txt</a></h4> 830 <p>This file defines CLDR’s tailoring of the DUCET, as 831 described in <i>Section 2, <a href="#Root_Collation">Root 832 Collation</a></i> .</p> 833 <p>The format is similar to that of <a href= 834 "https://www.unicode.org/reports/tr10/#File_Format">allkeys.txt</a>, 835 although there may be some differences in whitespace.</p> 836 <h4>2.6.2 <a name="File_Format_FractionalUCA_txt" href= 837 "#File_Format_FractionalUCA_txt" id= 838 "File_Format_FractionalUCA_txt">FractionalUCA.txt</a></h4> 839 <p>The format is illustrated by the following sample lines, 840 with commentary afterwards.</p> 841 <pre>[UCA version = 6.0.0]</pre> 842 <blockquote> 843 <p>Provides the version number of the UCA table.</p> 844 </blockquote> 845 <pre> 846 [Unified_Ideograph 4E00..9FCC FA0E..FA0F FA11 FA13..FA14 FA1F FA21 FA23..FA24 FA27..FA29 3400..4DB5 20000..2A6D6 2A700..2B734 2B740..2B81D]</pre> 847 <blockquote> 848 <p>Lists the ranges of Unified_Ideograph characters in 849 collation order. (New in CLDR 24.) They map to collation 850 elements with <a href= 851 "https://www.unicode.org/reports/tr10/#Implicit_Weights">implicit 852 (constructed) primary weights</a>.</p> 853 </blockquote> 854 <pre>[radical 6=⼅亅:亅了-亇予㐧-争亊-事㐨-] 855[radical 210=⿑齊:齊齋䶒䶓齌齍-齎齏-] 856[radical 210'=⻬齐:齐齑] 857[radical end]</pre> 858 <blockquote> 859 <p>Data for Unihan radical-stroke order. (New in CLDR 26.) 860 Following the [Unified_Ideograph] line, a section of 861 <code>[radical ...]</code> lines defines a radical-stroke 862 order of the Unified_Ideograph characters.</p> 863 <p>For Han characters, an implementation may choose either to 864 implement the order defined in the UCA and the 865 [Unified_Ideograph] data, or to implement the order defined 866 by the <code>[radical ...]</code> lines. Beginning with CLDR 867 26, the CJK type="unihan" tailorings assume that the root 868 collation order sorts Han characters in Unihan radical-stroke 869 order according to the <code>[radical ...]</code> data. The 870 CollationTest_CLDR files only contain Han characters that are 871 in the same relative order using implicit weights or the 872 radical-stroke order.</p> 873 <p>The root collation radical-stroke order is derived from 874 the first (normative) values of the <a href= 875 "https://www.unicode.org/reports/tr38/#kRSUnicode">Unihan 876 kRSUnicode</a> field for each Han character. Han characters 877 are ordered by radical, with traditional forms sorting before 878 simplified ones. Characters with the same radical are ordered 879 by residual stroke count. Characters with the same 880 radical-stroke values are ordered by block and code point, as 881 for <a href= 882 "https://www.unicode.org/reports/tr10/#Implicit_Weights">UCA 883 implicit weights</a>.</p> 884 <p>There is one <code>[radical ...]</code> line per radical, 885 in the order of radical numbers. Each line shows the radical 886 number and the representative characters from the <a href= 887 "https://www.unicode.org/reports/tr44/#UCD_Files_Table">UCD 888 file CJKRadicals.txt</a>, followed by a colon (“:”) and the 889 Han characters with that radical in the order as described 890 above. A range like <code>万-丌</code> indicates that the code 891 points in that range sort in code point order.</p> 892 <p>The radical number and characters are informational. The 893 sort order is established only by the order of the 894 <code>[radical ...]</code> lines, and within each line by the 895 characters and ranges between the colon (“:”) and the bracket 896 (“]”).</p> 897 <p>Each Unified_Ideograph occurs exactly once. Only 898 Unified_Ideograph characters are listed on <code>[radical 899 ...]</code> lines.</p> 900 <p>This section is terminated with one <code>[radical 901 end]</code> line.</p> 902 </blockquote> 903 <pre> 904 0000; [,,] # Zyyy Cc [0000.0000.0000] * <NULL></pre> 905 <blockquote> 906 <p>Provides a weight line. The first element (before the ";") 907 is a hex codepoint sequence. The second field is a sequence 908 of collation elements. Each collation element has 3 parts 909 separated by commas: the primary weight, secondary weight, 910 and tertiary weight. The tertiary weight actually consists of 911 two components: the top two bits (0xC0) are used for the 912 <em>case level</em>, and should be masked off where a case 913 level is not used.</p> 914 <p>A weight is either empty (meaning a zero or ignorable 915 weight) or is a sequence of one or more bytes. The bytes are 916 interpreted as a "fraction", meaning that the ordering is 04 917 < 05 05 < 06. The weights are constructed so that no 918 weight is an initial subsequence of another: that is, having 919 both the weights 05 and 05 05 is illegal. The above line 920 consists of all ignorable weights.</p> 921 <p>The vertical bar (“|”) character is used to indicate 922 context, as in:</p> 923 </blockquote> 924 <pre>006C | 00B7; [, DB A9, 05]</pre> 925 <blockquote> 926 This example indicates that if U+00B7 appears immediately 927 after U+006C, it is given the corresponding collation element 928 instead. This syntax is roughly equivalent to the following 929 contraction, but is more efficient. For details see the 930 specification of <i><a href= 931 "#Context_Sensitive_Mappings">Context-Sensitive 932 Mappings</a></i> above. 933 </blockquote> 934 <pre>006C 00B7; <em>CE(006C)</em> [, DB A9, 05]</pre> 935 <blockquote> 936 <p>Single-byte primary weights are given to particularly 937 frequent characters, such as space, digits, and a-z. More 938 frequent characters are given two-byte weights, while 939 relatively infrequent characters are given three-byte 940 weights. For example:</p> 941 </blockquote> 942 <pre>... 9430009; [03 05, 05, 05] # Zyyy Cc [0100.0020.0002] * <CHARACTER TABULATION> 944... 9451B60; [06 14 0C, 05, 05] # Bali Po [0111.0020.0002] * BALINESE PAMENENG 946... 9470031; [14, 05, 05] # Zyyy Nd [149B.0020.0002] * DIGIT ONE</pre> 948 <blockquote> 949 <p>The assignment of 2 vs 3 bytes does not reflect 950 importance, or exact frequency.</p> 951 </blockquote> 952 <pre> 9533041; [76 06, 05, 03] # Hira Lo [3888.0020.000D] * HIRAGANA LETTER SMALL A 9543042; [76 06, 05, 85] # Hira Lo [3888.0020.000E] * HIRAGANA LETTER A 95530A1; [76 06, 05, 10] # Kana Lo [3888.0020.000F] * KATAKANA LETTER SMALL A 95630A2; [76 06, 05, 9E] # Kana Lo [3888.0020.0011] * KATAKANA LETTER A</pre> 957 <blockquote> 958 <p>Beginning with CLDR 27, some primary or secondary 959 collation elements may have below-common tertiary weights 960 (e.g., <code>03</code> ), in particular to allow normal 961 Hiragana letters to have common tertiary weights.</p> 962 </blockquote> 963 <pre># SPECIAL MAX/MIN COLLATION ELEMENTS 964FFFE; [02, 05, 05] # Special LOWEST primary, for merge/interleaving 965FFFF; [EF FE, 05, 05] # Special HIGHEST primary, for ranges</pre> 966 <blockquote> 967 <p>The two tailored noncharacters have their own primary 968 weights.</p> 969 </blockquote> 970 <pre> 971F967; [U+4E0D] # Hani Lo [FB40.0020.0002][CE0D.0000.0000] * CJK COMPATIBILITY IDEOGRAPH-F967 9722F02; [U+4E36, 10] # Hani So [FB40.0020.0004][CE36.0000.0000] * KANGXI RADICAL DOT 9732E80; [U+4E36, 70, 20] # Hani So [FB40.0020.0004][CE36.0000.0000][0000.00FC.0004] * CJK RADICAL REPEAT</pre> 974 <blockquote> 975 <p>Some collation elements are specified by reference to 976 other mappings. This is particularly useful for Han 977 characters which are given implicit/constructed primary 978 weights; the reference to a Unified_Ideograph makes these 979 mappings independent of implementation details. This 980 technique may also be used in other mappings to show the 981 relationship of character variants.</p> 982 <p>The referenced character must have a mapping listed 983 earlier in the file, or the mapping must have been defined 984 via the [Unified_Ideograph] data line. The referenced 985 character must map to exactly one collation element.</p> 986 <p><code>[U+4E0D]</code> copies U+4E0D’s entire collation 987 element. <code>[U+4E36, 10]</code> copies U+4E36’s primary 988 and secondary weights and specifies a different tertiary 989 weight. <code>[U+4E36, 70, 20]</code> only copies U+4E36’s 990 primary weight and specifies other secondary and tertiary 991 weights.</p> 992 <p>FractionalUCA.txt does not have any explicit mappings for 993 implicit weights. Therefore, an implementation is free to 994 choose an algorithm for computing implicit weights according 995 to the principles specified in the UCA.</p> 996 </blockquote> 997 <pre> 998FDD1 20AC; [0D 20 02, 05, 05] # CURRENCY first primary 999FDD1 0034; [0E 02 02, 05, 05] # DIGIT first primary starts new lead byte 1000FDD0 FF21; [26 02 02, 05, 05] # REORDER_RESERVED_BEFORE_LATIN first primary starts new lead byte 1001FDD1 004C; [28 02 02, 05, 05] # LATIN first primary starts new lead byte 1002FDD0 FF3A; [5D 02 02, 05, 05] # REORDER_RESERVED_AFTER_LATIN first primary starts new lead byte 1003FDD1 03A9; [5F 04 02, 05, 05] # GREEK first primary starts new lead byte (compressible) 1004FDD1 03E2; [5F 60 02, 05, 05] # COPTIC first primary (compressible)</pre> 1005 <blockquote> 1006 <p>These are special mappings with primaries at the 1007 boundaries of scripts and reordering groups. They serve as 1008 tailoring boundaries, so that tailoring near the first or 1009 last character of a script or group places the tailored item 1010 into the same group. Beginning with CLDR 24, each of these is 1011 a contraction of U+FDD1 with a character of the corresponding 1012 script (or of the General_Category [Z, P, S, Sc, Nd] 1013 corresponding to a special reordering group), mapping to the 1014 first possible primary weight per script or group. They can 1015 be enumerated for implementations of <a href= 1016 "#Collation_Indexes">Collation Indexes</a>. (Earlier versions 1017 mapped contractions with U+FDD0 to the last primary weights 1018 of each group but not each script.)</p> 1019 <p>Beginning with CLDR 27, these mappings alone define the 1020 boundaries for reordering single scripts. (There are no 1021 mappings for Hrkt, Hans, or Hant because they are not fully 1022 distinct scripts; they share primary weights with other 1023 scripts: Hrkt=Hira=Kana & Hans=Hant=Hani.) There are some 1024 reserved ranges, beginning at boundaries marked with U+FDD0 1025 plus following characters as shown above. The reserved ranges 1026 are not used for collation elements and are not available for 1027 tailoring.</p> 1028 <p>Some primary lead bytes must be reserved so that 1029 reordering of scripts along partial-lead-byte boundaries can 1030 “split” the primary lead byte and use up a reserved byte. 1031 This is for implementations that write sort keys, which must 1032 reorder primary weights by offsetting them by whole lead 1033 bytes. There are reorder-reserved ranges before and after 1034 Latin, so that reordering scripts with few primary lead bytes 1035 relative to Latin can move those scripts into the reserved 1036 ranges without changing the primary weights of any other 1037 script. Each of these boundaries begins with a new two-byte 1038 primary; that is, no two groups/scripts/ranges share the top 1039 16 bits of their primary weights.</p> 1040 </blockquote> 1041 <pre> 1042FDD0 0034; [11, 05, 05] # lead byte for numeric sorting</pre> 1043 <blockquote> 1044 <p>This mapping specifies the lead byte for numeric sorting. 1045 It must be different from the lead byte of any other primary 1046 weight, otherwise numeric sorting would generate ill-formed 1047 collation elements. Therefore, this mapping itself must be 1048 excluded from the set of regular mappings. This value can be 1049 ignored by implementations that do not support numeric 1050 sorting. (Other contractions with U+FDD0 can normally be 1051 ignored altogether.)</p> 1052 </blockquote> 1053 <pre> 1054# HOMELESS COLLATION ELEMENTS 1055FDD0 0063; [, 97, 3D] # [15E4.0020.0004] [1844.0020.0004] [0000.0041.001F] * U+01C6 LATIN SMALL LETTER DZ WITH CARON 1056FDD0 0064; [, A7, 09] # [15D1.0020.0004] [0000.0056.0004] * U+1DD7 COMBINING LATIN SMALL LETTER C CEDILLA 1057FDD0 0065; [, B1, 09] # [1644.0020.0004] [0000.0061.0004] * U+A7A1 LATIN SMALL LETTER G WITH OBLIQUE STROKE</pre> 1058 <blockquote> 1059 <p>The DUCET has some weights that don't correspond directly 1060 to a character. To allow for implementations to have a 1061 mapping for each collation element (necessary for certain 1062 implementations of tailoring), this requires the construction 1063 of special sequences for those weights. These collation 1064 elements can normally be ignored.</p> 1065 </blockquote> 1066 <p>Next, a number of tables are defined. The function of each 1067 of the tables is summarized afterwards.</p> 1068 <pre># VALUES BASED ON UCA 1069... 1070[first regular [0D 0A, 05, 05]] # U+0060 GRAVE ACCENT 1071[last regular [7A FE, 05, 05]] # U+1342E EGYPTIAN HIEROGLYPH AA032 1072[first implicit [E0 04 06, 05, 05]] # CONSTRUCTED 1073[last implicit [E4 DF 7E 20, 05, 05]] # CONSTRUCTED 1074[first trailing [E5, 05, 05]] # CONSTRUCTED 1075[last trailing [E5, 05, 05]] # CONSTRUCTED 1076...</pre> 1077 <blockquote> 1078 <p>This table summarizes ranges of important groups of 1079 characters for implementations.</p> 1080 </blockquote> 1081 <pre># Top Byte => Reordering Tokens 1082[top_byte 00 TERMINATOR ] # [0] TERMINATOR=1 1083[top_byte 01 LEVEL-SEPARATOR ] # [0] LEVEL-SEPARATOR=1 1084[top_byte 02 FIELD-SEPARATOR ] # [0] FIELD-SEPARATOR=1 1085[top_byte 03 SPACE ] # [9] SPACE=1 Cc=6 Zl=1 Zp=1 Zs=1 1086...</pre> 1087 <blockquote> 1088 <p>This table defines the reordering groups, for script 1089 reordering. The table maps from the first bytes of the 1090 fractional weights to a reordering token. The format is 1091 "[top_byte " byte-value reordering-token "COMPRESS"? "]". The 1092 "COMPRESS" value is present when there is only one byte in 1093 the reordering token, and primary-weight compression can be 1094 applied. Most reordering tokens are script values; others are 1095 special-purpose values, such as PUNCTUATION. Beginning with 1096 CLDR 24, this table precedes the regular mappings, so that 1097 parsers can use this information while processing and 1098 optimizing mappings. Beginning with CLDR 27, most of this 1099 data is irrelevant because single scripts can be reordered. 1100 Only the "COMPRESS" data is still useful.</p> 1101 </blockquote> 1102 <pre># Reordering Tokens => Top Bytes 1103[reorderingTokens Arab 61=910 62=910 ] 1104[reorderingTokens Armi 7A=22 ] 1105[reorderingTokens Armn 5F=82 ] 1106[reorderingTokens Avst 7A=54 ] 1107...</pre> 1108 <blockquote> 1109 <p>This table is an inverse mapping from reordering token to 1110 top byte(s). In terms like "61=910", the first value is the 1111 top byte, while the second is informational, indicating the 1112 number of primaries assigned with that top byte.</p> 1113 </blockquote> 1114 <pre># General Categories => Top Byte 1115[categories Cc 03{SPACE}=6 ] 1116[categories Cf 77{Khmr Tale Talu Lana Cham Bali Java Mong Olck Cher Cans Ogam Runr Orkh Vaii Bamu}=2 ] 1117[categories Lm 0D{SYMBOL}=25 0E{SYMBOL}=22 27{Latn}=12 28{Latn}=12 29{Latn}=12 2A{Latn}=12...</pre> 1118 <blockquote> 1119 <p>This table is informational, providing the top bytes, 1120 scripts, and primaries associated with each general category 1121 value.</p> 1122 </blockquote> 1123 <pre># FIXED VALUES 1124[fixed first implicit byte E0] 1125[fixed last implicit byte E4] 1126[fixed first trail byte E5] 1127[fixed last trail byte EF] 1128[fixed first special byte F0] 1129[fixed last special byte FF] 1130 1131[fixed secondary common byte 05] 1132[fixed last secondary common byte 45] 1133[fixed first ignorable secondary byte 80] 1134 1135[fixed tertiary common byte 05] 1136[fixed first ignorable tertiary byte 3C] 1137 </pre> 1138 <blockquote> 1139 <p>The final table gives certain hard-coded byte values. The 1140 "trail" area is provided for implementation of the "trailing 1141 weights" as described in the UCA.</p> 1142 </blockquote> 1143 <p class="note">Note: The particular primary lead bytes for 1144 Hani vs. IMPLICIT vs. TRAILING are only an example. An 1145 implementation is free to move them if it also moves the 1146 explicit TRAILING weights. This affects only a small number of 1147 explicit mappings in FractionalUCA.txt, such as for U+FFFD, 1148 U+FFFF, and the “unassigned first primary”. It is possible to 1149 use no SPECIAL bytes at all, and to use only the one primary 1150 lead byte FF for TRAILING weights.</p> 1151 <h4>2.6.3 <a name="File_Format_UCA_Rules_txt" href= 1152 "#File_Format_UCA_Rules_txt" id= 1153 "File_Format_UCA_Rules_txt">UCA_Rules.txt</a></h4> 1154 <p>The format for this file uses the CLDR collation syntax, see 1155 <i>Section 3, <a href="#Collation_Tailorings">Collation 1156 Tailorings</a></i> .</p> 1157 <h2>3 <a name="Collation_Tailorings" href= 1158 "#Collation_Tailorings" id="Collation_Tailorings">Collation 1159 Tailorings</a></h2> 1160 <p class="dtd"><!ELEMENT collations (alias | 1161 (defaultCollation?, collation*, special*)) ></p> 1162 <p class="dtd"><!ELEMENT defaultCollation ( #PCDATA ) 1163 ></p> 1164 <p>This element of the LDML format contains one or more 1165 <span class="element">collation</span> elements, distinguished 1166 by type. Each <span class="element">collation</span> contains 1167 elements with parametric settings, or rules that specify a 1168 certain sort order, as a tailoring of the root order, or 1169 both.</p> 1170 <p class="note">Note: CLDR collation tailoring data should 1171 follow the <a href= 1172 "http://cldr.unicode.org/index/cldr-spec/collation-guidelines">CLDR 1173 Collation Guidelines</a>.</p> 1174 <h3>3.1 <a name="Collation_Types" href="#Collation_Types" id= 1175 "Collation_Types">Collation Types</a></h3> 1176 <p>Each locale may have multiple sort orders (types). The 1177 <span class="element">defaultCollation</span> element defines 1178 the default tailoring for a locale and its sublocales. For 1179 example:</p> 1180 <ul> 1181 <li>root.xml: 1182 <code><defaultCollation>standard</defaultCollation></code></li> 1183 <li>zh.xml: 1184 <code><defaultCollation>pinyin</defaultCollation></code></li> 1185 <li>zh_Hant.xml: 1186 <code><defaultCollation>stroke</defaultCollation></code></li> 1187 </ul> 1188 <p>To allow implementations in reduced memory environments to 1189 use CJK sorting, there are also short forms of each of these 1190 collation sequences. These provide for the most common 1191 characters in common use, and are marked with <span class= 1192 "attribute">alt</span>="<span class= 1193 "attributeValue">short</span>".</p> 1194 <p>A collation type name that starts with "private-", for 1195 example, "private-kana", indicates an incomplete tailoring that 1196 is only intended for import into one or more other tailorings 1197 (usually for sharing common rules). It does not establish a 1198 complete sort order. An implementation should not build data 1199 tables for a private collation type, and should not include a 1200 private collation type in a list of available types.</p> 1201 <p class="note"><b>Note:</b></p> 1202 <ul> 1203 <li>There is an on-line demonstration of collation at 1204 [<a href="tr35.html#LocaleExplorer">LocaleExplorer</a>] that 1205 uses the same rule syntax. (Pick the locale and scroll to 1206 "Collation Rules", near the end.)</li> 1207 <li class="note">In CLDR 23 and before, LDML collation files 1208 used an XML format. Starting with CLDR 24, the XML collation 1209 syntax is deprecated and no longer used. See the <i><a href= 1210 "https://www.unicode.org/reports/tr35/tr35-31/tr35-collation.html#Collation_Tailorings"> 1211 CLDR 23 version of this document</a></i> for details about 1212 the XML collation syntax.</li> 1213 </ul> 1214 <h4>3.1.1 <a name="Collation_Type_Fallback" href= 1215 "#Collation_Type_Fallback" id= 1216 "Collation_Type_Fallback">Collation Type Fallback</a></h4> 1217 <p>When loading a requested tailoring from its data file and 1218 the parent file chain, use the following type fallback to find 1219 the tailoring.</p> 1220 <ol> 1221 <li>Determine the default type from the 1222 <defaultCollation> element; map the default type to its 1223 alias if one is defined. If there is no 1224 <defaultCollation> element, then use "standard" as the 1225 default type.</li> 1226 <li>If the request language tag specifies the collation type 1227 (keyword "co"), then map it to its alias if one is defined 1228 (e.g., "-co-phonebk" → "phonebook"). If the language tag does 1229 not specify the type, then use the default type.</li> 1230 <li>Use the <collation> element with this type.</li> 1231 <li>If it does not exist, and the type starts with "search" 1232 but is longer, then set the type to "search" and use that 1233 <collation> element. (For example, "searchjl" → 1234 "search".)</li> 1235 <li>If it does not exist, and the type is not the default 1236 type, then set the type to the default type and use that 1237 <collation> element.</li> 1238 <li>If it does not exist, and the type is not "standard", 1239 then set the type to "standard" and use that 1240 <collation> element.</li> 1241 <li>If it does not exist, then use the CLDR root 1242 collation.</li> 1243 </ol> 1244 <p class="note">Note that the CLDR collation/root.xml contains 1245 <defaultCollation>standard</defaultCollation>, 1246 <collation type="standard"> (with an empty tailoring, so 1247 this is the same as the CLDR root collation), and <collation 1248 type="search">.</p> 1249 <p>For example, assume that we have collation data for the 1250 following tailorings. ("da/search" is shorthand for 1251 "da-u-co-search".)</p> 1252 <ul> 1253 <li>root/defaultCollation=standard</li> 1254 <li>root/standard (this is the same as “the CLDR root 1255 collator”)</li> 1256 <li>root/search</li> 1257 <li>da/standard</li> 1258 <li>da/search</li> 1259 <li>el/standard</li> 1260 <li>ko/standard</li> 1261 <li>ko/search</li> 1262 <li>ko/searchjl</li> 1263 <li>zh/defaultCollation=pinyin</li> 1264 <li>zh/pinyin</li> 1265 <li>zh/stroke</li> 1266 <li>zh-Hant/defaultCollation=stroke</li> 1267 </ul> 1268 <table> 1269 <caption> 1270 <a name= 1271 "Sample_requested_and_actual_collation_locales_and_types" 1272 href= 1273 "#Sample_requested_and_actual_collation_locales_and_types" 1274 id= 1275 "Sample_requested_and_actual_collation_locales_and_types">Sample 1276 requested and actual collation locales and types</a> 1277 </caption> 1278 <tr> 1279 <th>requested</th> 1280 <th>actual</th> 1281 <th>comment</th> 1282 </tr> 1283 <tr> 1284 <td>da/phonebook</td> 1285 <td>da/standard</td> 1286 <td>default type for Danish</td> 1287 </tr> 1288 <tr> 1289 <td>zh</td> 1290 <td>zh/pinyin</td> 1291 <td>default type for zh</td> 1292 </tr> 1293 <tr> 1294 <td>zh/standard</td> 1295 <td>root/standard</td> 1296 <td>no "standard" tailoring for zh, falls back to root</td> 1297 </tr> 1298 <tr> 1299 <td>zh/phonebook</td> 1300 <td>zh/pinyin</td> 1301 <td>default type for zh</td> 1302 </tr> 1303 <tr> 1304 <td>zh-Hant/phonebook</td> 1305 <td>zh/stroke</td> 1306 <td>default type for zh-Hant is "stroke"</td> 1307 </tr> 1308 <tr> 1309 <td>da/searchjl</td> 1310 <td>da/search</td> 1311 <td>"search.+" falls back to "search"</td> 1312 </tr> 1313 <tr> 1314 <td>el/search</td> 1315 <td>root/search</td> 1316 <td>no "search" tailoring for Greek</td> 1317 </tr> 1318 <tr> 1319 <td>el/searchjl</td> 1320 <td>root/search</td> 1321 <td>"search.+" falls back to "search", found in root</td> 1322 </tr> 1323 <tr> 1324 <td>ko/searchjl</td> 1325 <td>ko/searchjl</td> 1326 <td>requested data is actually available</td> 1327 </tr> 1328 </table> 1329 <h3>3.2 <a name="Collation_Version" href="#Collation_Version" 1330 id="Collation_Version">Version</a></h3> 1331 <p>The version attribute is used in case a specific version of 1332 the UCA is to be specified. It is optional, and is specified if 1333 the results are to be identical on different systems. If it is 1334 not supplied, then the version is assumed to be the same as the 1335 Unicode version for the system as a whole.</p> 1336 <blockquote> 1337 <p class="note"><b>Note:</b> For version 3.1.1 of the UCA, 1338 the version of Unicode must also be specified with any 1339 versioning information; an example would be "3.1.1/3.2" for 1340 version 3.1.1 of the UCA, for version 3.2 of Unicode. This 1341 was changed by decision of the UTC, so that dual versions 1342 were no longer necessary. So for UCA 4.0 and beyond, the 1343 version just has a single number.</p> 1344 </blockquote> 1345 <h3>3.3 <a name="Collation_Element" href="#Collation_Element" 1346 id="Collation_Element">Collation Element</a></h3> 1347 <p class="dtd"><!ELEMENT collation (alias | (cr*, special*)) 1348 ></p> 1349 <p>The tailoring syntax is designed to be independent of the 1350 actual weights used in any particular UCA table. That way the 1351 same rules can be applied to UCA versions over time, even if 1352 the underlying weights change. The following illustrates the 1353 overall structure of a <span class= 1354 "element">collation</span>:</p> 1355 <pre><collation type="phonebook"> 1356 <cr><![CDATA[ 1357 [caseLevel on] 1358 &c < k 1359 ]]></cr> 1360</collation></pre> 1361 <h3>3.4 <a name="Setting_Options" href="#Setting_Options" id= 1362 "Setting_Options">Setting Options</a></h3> 1363 <p>Parametric settings can be specified in language tags or in 1364 rule syntax (in the form <code>[keyword value]</code> ). For 1365 example, <code>-ks-level2</code> or <code>[strength 2]</code> 1366 will only compare strings based on their primary and secondary 1367 weights.</p> 1368 <p>If a setting is not present, the CLDR default (or the 1369 default for the locale, if there is one) is used. That default 1370 is listed in bold italics. Where there is a UCA default that is 1371 different, it is listed in bold with (<strong>UCA 1372 default</strong>). Note that the default value for a locale may 1373 be different than the normal default value for the setting.</p> 1374 <table> 1375 <caption> 1376 <a name="Collation_Settings" href="#Collation_Settings" id= 1377 "Collation_Settings">Collation Settings</a> 1378 </caption> 1379 <tr> 1380 <th>BCP47 Key</th> 1381 <th>BCP47 Value</th> 1382 <th>Rule Syntax</th> 1383 <th>Description</th> 1384 </tr> 1385 <tr> 1386 <td rowspan="5">ks</td> 1387 <td>level1</td> 1388 <td><code>[strength 1]</code><br> 1389 (primary)</td> 1390 <td rowspan="5">Sets the default strength for comparison, 1391 as described in the [<a href= 1392 "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>]. 1393 <em>Note that a strength setting of greater than 4 may have 1394 the same effect as <strong>identical</strong>, depending on 1395 the locale and implementation.</em></td> 1396 </tr> 1397 <tr> 1398 <td>level2</td> 1399 <td><code>[strength 2]</code><br> 1400 (secondary)</td> 1401 </tr> 1402 <tr> 1403 <td>level3</td> 1404 <td><em><strong><code>[strength 3]</code><br> 1405 (tertiary)</strong></em></td> 1406 </tr> 1407 <tr> 1408 <td>level4</td> 1409 <td><code>[strength 4]</code><br> 1410 (quaternary)</td> 1411 </tr> 1412 <tr> 1413 <td>identic</td> 1414 <td><code>[strength I]</code><br> 1415 (identical)</td> 1416 </tr> 1417 <tr> 1418 <td rowspan="3">ka</td> 1419 <td>noignore</td> 1420 <td><i><strong><code>[alternate 1421 non-ignorable]</code></strong></i><br></td> 1422 <td rowspan="3">Sets alternate handling for variable 1423 weights, as described in [<a href= 1424 "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>], 1425 where "shifted" causes certain characters to be ignored in 1426 comparison. <em>The default for LDML is different than it 1427 is in the UCA. In LDML, the default for alternate handling 1428 is <strong>non-ignorable</strong>, while in UCA it is 1429 <strong>shifted</strong>. In addition, in LDML only 1430 whitespace and punctuation are variable by 1431 default.</em></td> 1432 </tr> 1433 <tr> 1434 <td>shifted</td> 1435 <td><strong><code>[alternate shifted]</code><br> 1436 (UCA default)</strong></td> 1437 </tr> 1438 <tr> 1439 <td><em>n/a</em></td> 1440 <td><i>n/a</i><br> 1441 (blanked)</td> 1442 </tr> 1443 <tr> 1444 <td rowspan="2">kb</td> 1445 <td>true</td> 1446 <td><code>[backwards 2]</code></td> 1447 <td rowspan="2">Sets the comparison for the second level to 1448 be <strong>backwards</strong>, as described in [<a href= 1449 "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>].</td> 1450 </tr> 1451 <tr> 1452 <td>false</td> 1453 <td><i><strong>n/a</strong></i></td> 1454 </tr> 1455 <tr> 1456 <td rowspan="2">kk</td> 1457 <td>true</td> 1458 <td><strong><code>[normalization on]</code><br> 1459 (UCA default)</strong></td> 1460 <td rowspan="2">If <strong>on</strong>, then the normal 1461 [<a href= 1462 "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>] 1463 algorithm is used. If <strong>off</strong>, then most 1464 strings should still sort correctly despite not normalizing 1465 to NFD first.<br> 1466 <em>Note that the default for CLDR locales may be different 1467 than in the UCA. The rules for particular locales have it 1468 set to <strong>on</strong>: those locales whose exemplar 1469 characters (in forms commonly interchanged) would be 1470 affected by normalization.</em></td> 1471 </tr> 1472 <tr> 1473 <td>false</td> 1474 <td><i><strong><code>[normalization 1475 off]</code></strong></i></td> 1476 </tr> 1477 <tr> 1478 <td rowspan="2">kc</td> 1479 <td>true</td> 1480 <td><code>[caseLevel on]</code></td> 1481 <td rowspan="2">If set to <strong>on</strong><i>,</i> a 1482 level consisting only of case characteristics will be 1483 inserted in front of tertiary level, as a "Level 2.5". To 1484 ignore accents but take case into account, set strength to 1485 <strong>primary</strong> and case level to 1486 <strong>on</strong>. For details, see <em>Section 3.14, 1487 <a href="#Case_Parameters">Case Parameters</a></em> .</td> 1488 </tr> 1489 <tr> 1490 <td>false</td> 1491 <td><i><strong><code>[caseLevel 1492 off]</code></strong></i></td> 1493 </tr> 1494 <tr> 1495 <td rowspan="3">kf</td> 1496 <td>upper</td> 1497 <td><code>[caseFirst upper]</code></td> 1498 <td rowspan="3">If set to <strong>upper</strong>, causes 1499 upper case to sort before lower case. If set to 1500 <strong>lower</strong>, causes lower case to sort before 1501 upper case. Useful for locales that have already supported 1502 ordering but require different order of cases. Affects case 1503 and tertiary levels. For details, see <em>Section 3.14, 1504 <a href="#Case_Parameters">Case Parameters</a></em> .</td> 1505 </tr> 1506 <tr> 1507 <td>lower</td> 1508 <td><code>[caseFirst lower]</code></td> 1509 </tr> 1510 <tr> 1511 <td>false</td> 1512 <td><i><strong><code>[caseFirst 1513 off]</code></strong></i></td> 1514 </tr> 1515 <tr> 1516 <td rowspan="2">kh</td> 1517 <td>true<br> 1518 <i><strong>Deprecated:</strong></i> Use rules with 1519 quaternary relations instead.</td> 1520 <td><code>[hiraganaQ on]</code></td> 1521 <td rowspan="2">Controls special treatment of Hiragana code 1522 points on quaternary level. If turned <strong>on</strong>, 1523 Hiragana codepoints will get lower values than all the 1524 other non-variable code points in <strong>shifted</strong>. 1525 That is, the normal Level 4 value for a regular collation 1526 element is FFFF, as described in [<a href= 1527 "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>], 1528 <em>Section 3.6, <a href= 1529 "https://www.unicode.org/reports/tr10/#Variable_Weighting">Variable 1530 Weighting</a></em> . This is changed to FFFE for 1531 [:script=Hiragana:] characters. The strength must be 1532 greater or equal than quaternary if this attribute is to 1533 have any effect.</td> 1534 </tr> 1535 <tr> 1536 <td>false</td> 1537 <td><i><strong><code>[hiraganaQ 1538 off]</code></strong></i></td> 1539 </tr> 1540 <tr> 1541 <td rowspan="2">kn</td> 1542 <td>true</td> 1543 <td><code>[numericOrdering on]</code></td> 1544 <td rowspan="2">If set to <strong>on</strong>, any sequence 1545 of Decimal Digits (General_Category = Nd in the [<a href= 1546 "https://www.unicode.org/reports/tr41/#UAX44">UAX44</a>]) is 1547 sorted at a primary level with its numeric value. For 1548 example, "A-21" < "A-123". The computed primary weights 1549 are all at the start of the <strong>digit</strong> 1550 reordering group. Thus with an untailored UCA table, "a$" 1551 < "a0" < "a2" < "a12" < "a⓪" < "aa".</td> 1552 </tr> 1553 <tr> 1554 <td>false</td> 1555 <td><i><strong><code>[numericOrdering 1556 off]</code></strong></i></td> 1557 </tr> 1558 <tr> 1559 <td>kr</td> 1560 <td>a sequence of one or more reorder codes: <strong>space, 1561 punct, symbol, currency, digit</strong>, or any BCP47 1562 script ID</td> 1563 <td><code>[reorder Grek digit]</code></td> 1564 <td>Specifies a reordering of scripts or other significant 1565 blocks of characters such as symbols, punctuation, and 1566 digits. For the precise meaning and usage of the reorder 1567 codes, see <em>Section 3.13, <a href= 1568 "#Script_Reordering">Collation Reordering</a>.</em></td> 1569 </tr> 1570 <tr> 1571 <td rowspan="4">kv</td> 1572 <td>space</td> 1573 <td><code>[maxVariable space]</code></td> 1574 <td rowspan="4">Sets the variable top to the top of the 1575 specified reordering group. All code points with primary 1576 weights less than or equal to the variable top will be 1577 considered variable, and thus affected by the alternate 1578 handling. Variables are ignorable by default in [<a href= 1579 "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>], but 1580 not in CLDR.</td> 1581 </tr> 1582 <tr> 1583 <td>punct</td> 1584 <td><i><strong><code>[maxVariable 1585 punct]</code></strong></i></td> 1586 </tr> 1587 <tr> 1588 <td>symbol</td> 1589 <td><strong><code>[maxVariable symbol]</code><br> 1590 (UCA default)</strong></td> 1591 </tr> 1592 <tr> 1593 <td>currency</td> 1594 <td><code>[maxVariable currency]</code></td> 1595 </tr> 1596 <tr> 1597 <td>vt</td> 1598 <td>See <i>Part 1 Section 3.6.4, <a href= 1599 "tr35.html#Unicode_Locale_Extension_Data_Files">U Extension 1600 Data Files</a></i>.<br> 1601 <i><strong>Deprecated:</strong></i> Use maxVariable 1602 instead.</td> 1603 <td><code>&\u00XX\uYYYY < [variable top]</code><br> 1604 <br> 1605 (the default is set to the highest punctuation, thus 1606 including spaces and punctuation, but not symbols)</td> 1607 <td> 1608 <p>The BCP47 value is described in <i>Appendix Q: 1609 <a href="tr35.html#Locale_Extension_Key_and_Type_Data">Locale 1610 Extension Keys and Types</a>.</i></p> 1611 <p>Sets the string value for the variable top. All the 1612 code points with primary weights less than or equal to 1613 the variable top will be considered variable, and thus 1614 affected by the alternate handling.<br> 1615 An implementation that supports the variableTop setting 1616 should also support the maxVariable setting, and it 1617 should "pin" ("round up") the variableTop to the top of 1618 the containing reordering group.<br> 1619 Variables are ignorable by default in [<a href= 1620 "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>], 1621 but not in CLDR. See below for more information.</p> 1622 </td> 1623 </tr> 1624 <tr> 1625 <td><em>n/a</em></td> 1626 <td><em>n/a</em></td> 1627 <td><em>n/a</em></td> 1628 <td>match-boundaries: <em><strong>none</strong></em> | 1629 whole-character | whole-word<br> 1630 Defined by <em>Section 8, <a href= 1631 "https://www.unicode.org/reports/tr10/#Searching">Searching 1632 and Matching</a></em> of [<a href= 1633 "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>].</td> 1634 </tr> 1635 <tr> 1636 <td><em>n/a</em></td> 1637 <td><em>n/a</em></td> 1638 <td><em>n/a</em></td> 1639 <td>match-style: <em><strong>minimal</strong></em> | medial 1640 | maximal<br> 1641 Defined by <em>Section 8, <a href= 1642 "https://www.unicode.org/reports/tr10/#Searching">Searching 1643 and Matching</a></em> of [<a href= 1644 "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>].</td> 1645 </tr> 1646 </table> 1647 <h4>3.4.1 <a name="Common_Settings" href="#Common_Settings" id= 1648 "Common_Settings">Common settings combinations</a></h4> 1649 <p>Some commonly used parametric collation settings are 1650 available via combinations of LDML settings attributes:</p> 1651 <ul> 1652 <li>“Ignore accents”: <strong>strength=primary</strong></li> 1653 <li>“Ignore accents” but take case into account: 1654 <strong>strength=primary caseLevel=on</strong></li> 1655 <li>“Ignore case”: <strong>strength=secondary</strong></li> 1656 <li>“Ignore punctuation” (completely): 1657 <strong>strength=tertiary alternate=shifted</strong></li> 1658 <li>“Ignore punctuation” but distinguish among punctuation 1659 marks: <strong>strength=quaternary 1660 alternate=shifted</strong></li> 1661 </ul> 1662 <h4>3.4.2 <a name="Normalization_Setting" href= 1663 "#Normalization_Setting" id="Normalization_Setting">Notes on 1664 the normalization setting</a></h4> 1665 <p>The UCA always normalizes input strings into NFD form before 1666 the rest of the algorithm. However, this results in poor 1667 performance.</p> 1668 <p>With <strong>normalization=off</strong>, strings that are in 1669 [<a href="tr35.html#FCD">FCD</a>] and do not contain Tibetan 1670 precomposed vowels (U+0F73, U+0F75, U+0F81) should sort 1671 correctly. With <strong>normalization=on</strong>, an 1672 implementation that does not normalize to NFD must at least 1673 perform an incremental FCD check and normalize substrings as 1674 necessary. It should also always decompose the Tibetan 1675 precomposed vowels. (Otherwise discontiguous contractions 1676 across their leading components cannot be handled 1677 correctly.)</p> 1678 <p>Another complication for an implementation that does not 1679 always use NFD arises when contraction mappings overlap with 1680 canonical Decomposition_Mapping strings. For example, the 1681 Danish contraction “aa” overlaps with the decompositions of 1682 ‘ä’, ‘å’, and other characters. In the root collation (and in 1683 the DUCET), Cyrillic ‘ӛ’ maps to a single collation element, 1684 which means that its decomposition “ә+◌̈” forms a contraction, 1685 and its second character (U+0308) is the same as the first 1686 character in the Decomposition_Mapping of U+0344 1687 ‘◌̈́’=“◌̈+◌́”.</p> 1688 <p>In order to handle strings with these characters (e.g., “aä” 1689 and “ӛ́” [which are in FCD]) exactly as with prior NFD 1690 normalization, an implementation needs to either add overlap 1691 contractions to its data (e.g., “a+ä” and “ә+◌̈́”), or it needs 1692 to decompose the relevant composites (e.g., ‘ä’ and ‘◌̈́’) as 1693 soon as they are encountered.</p> 1694 <h4>3.4.3 <a name="Variable_Top_Settings" href= 1695 "#Variable_Top_Settings" id="Variable_Top_Settings">Notes on 1696 variable top settings</a></h4> 1697 <p>Users may want to include more or fewer characters as 1698 Variable. For example, someone could want to restrict the 1699 Variable characters to just include space marks. In that case, 1700 maxVariable would be set to "space". (In CLDR 24 and earlier, 1701 the now-deprecated variableTop would be set to U+1680, see the 1702 “Whitespace” <a href="https://unicode.org/charts/collation/">UCA 1703 collation chart</a>). Alternatively, someone could want more of 1704 the Common characters in them, and include characters up to 1705 (but not including) '0', by setting maxVariable to "currency". 1706 (In CLDR 24 and earlier, the now-deprecated variableTop would 1707 be set to U+20BA, see the “Currency-Symbol” collation 1708 chart).</p> 1709 <p>The effect of these settings is to customize to ignore 1710 different sets of characters when comparing strings. For 1711 example, the locale identifier "de-u-ka-shifted-kv-currency" is 1712 requesting settings appropriate for German, including German 1713 sorting conventions, and that currency symbols and characters 1714 sorting below them are ignored in sorting.</p> 1715 <h3>3.5 <a name="Rules" href="#Rules" id="Rules">Collation Rule 1716 Syntax</a></h3> 1717 <p class="dtd"><!ELEMENT cr #PCDATA ></p> 1718 <p>The goal for the collation rule syntax is to have clearly 1719 expressed rules with a concise format. The CLDR rule syntax is 1720 a subset of the [<a href= 1721 "tr35.html#ICUCollation">ICUCollation</a>] syntax.</p> 1722 <p>For the CLDR root collation, the FractionalUCA.txt file 1723 defines all mappings for all of Unicode directly, and it also 1724 provides information about script boundaries, reordering 1725 groups, and other details. For tailorings, this is neither 1726 necessary nor practical. In particular, while the root 1727 collation sort order rarely changes for existing characters, 1728 their numeric collation weights change with every version. If 1729 tailorings also specified numeric weights directly, then they 1730 would have to change with every version, parallel with the root 1731 collation. Instead, for tailorings, mappings are added and 1732 modified relative to the root collation. (There is no syntax to 1733 <i>remove</i> mappings, except via <a href= 1734 "#Special_Purpose_Commands">special [suppressContractions 1735 [...]]</a> .)</p> 1736 <p>The ASCII [:P:] and [:S:] characters are reserved for 1737 collation syntax: <code>[\u0021-\u002F \u003A-\u0040 1738 \u005B-\u0060 \u007B-\u007E]</code></p> 1739 <p>Unicode Pattern_White_Space characters between tokens are 1740 ignored. Unquoted white space terminates reset and relation 1741 strings.</p> 1742 <p>A pair of ASCII apostrophes encloses quoted literal text. 1743 They are normally used to enclose a syntax character or white 1744 space, or a whole reset/relation string containing one or more 1745 such characters, so that those are parsed as part of the 1746 reset/relation strings rather than treated as syntax. A pair of 1747 immediately adjacent apostrophes is used to encode one 1748 apostrophe.</p> 1749 <p>Code points can be escaped with <code>\uhhhh</code> and 1750 <code>\U00hhhhhh</code> escapes, as well as common escapes like 1751 <code>\t</code> and <code>\n</code> . (For details see the 1752 documentation of ICU UnicodeString::unescape().) This is 1753 particularly useful for default-ignorable code points, 1754 combining marks, visually indistinct variants, hard-to-type 1755 characters, etc. These sequences are unescaped before the rules 1756 are parsed; this means that even escaped syntax and white space 1757 characters need to be enclosed in apostrophes. For example: 1758 <code>&'\u0020'='\u3000'</code>. Note: The unescaping is 1759 done by ICU tools (genrb) and demos before passing rule strings 1760 into the ICU library code. The ICU collation API does not 1761 unescape rule strings.</p> 1762 <p>The ASCII double quote must be both escaped (so that the 1763 collation syntax can be enclosed in pairs of double quotes in 1764 programming environments such as ICU resource bundle .txt 1765 files) and quoted. For example: 1766 <code>&'\u0022'<<<x</code></p> 1767 <p>Comments are allowed at the beginning, and after any 1768 complete reset, relation, setting, or command. A comment begins 1769 with a <code>#</code> and extends to the end of the line 1770 (according to the Unicode Newline Guidelines).</p> 1771 <p>The collation syntax is case-sensitive.</p> 1772 <h3>3.6 <a name="Orderings" href="#Orderings" id= 1773 "Orderings">Orderings</a></h3> 1774 <p>The root collation mappings form the initial state. Mappings 1775 are added and removed via a sequence of rule chains. Each 1776 tailoring rule builds on the current state after all of the 1777 preceding rules (and is not affected by any following rules). 1778 Rule chains may alternate with comments, settings, and special 1779 commands.</p> 1780 <p>A rule chain consists of a reset followed by one or more 1781 relations. The reset position is a string which maps to one or 1782 more collation elements according to the current state. A 1783 relation consists of an operator and a string; it maps the 1784 string to the current collation elements, modified according to 1785 the operator.</p> 1786 <table> 1787 <caption> 1788 <a name="Specifying_Collation_Ordering" href= 1789 "#Specifying_Collation_Ordering" id= 1790 "Specifying_Collation_Ordering">Specifying Collation 1791 Ordering</a> 1792 </caption> 1793 <tr> 1794 <th>Relation Operator</th> 1795 <th> Example</th> 1796 <th>Description</th> 1797 </tr> 1798 <tr> 1799 <td><code>&</code></td> 1800 <td><code>& Z</code></td> 1801 <td>Map Z to collation elements according to the current 1802 state. These will be modified according to the following 1803 relation operators and then assigned to the corresponding 1804 relation strings.</td> 1805 </tr> 1806 <tr> 1807 <td><code><</code></td> 1808 <td><code>& a<br> 1809 < b</code></td> 1810 <td>Make 'b' sort after 'a', as a <i>primary</i> 1811 (base-character) difference</td> 1812 </tr> 1813 <tr> 1814 <td><code><<</code></td> 1815 <td><code>& a<br> 1816 << ä</code></td> 1817 <td>Make 'ä' sort after 'a' as a <i>secondary</i> (accent) 1818 difference</td> 1819 </tr> 1820 <tr> 1821 <td><code><<<</code></td> 1822 <td><code>& a<br> 1823 <<< A</code></td> 1824 <td>Make 'A' sort after 'a' as a <i>tertiary</i> 1825 (case/variant) difference</td> 1826 </tr> 1827 <tr> 1828 <td><code><<<<</code></td> 1829 <td><code>& か<br> 1830 <<<< カ</code></td> 1831 <td>Make 'カ' (Katakana Ka) sort after 'か' (Hiragana Ka) as 1832 a <i>quaternary</i> difference</td> 1833 </tr> 1834 <tr> 1835 <td><code>= </code></td> 1836 <td><code>& v<br> 1837 = w </code></td> 1838 <td>Make 'w' sort <i>identically</i> to 'v'</td> 1839 </tr> 1840 </table> 1841 <p>The following shows the result of serially applying three 1842 rules.</p> 1843 <table> 1844 <tr> 1845 <th> </th> 1846 <th>Rules</th> 1847 <th>Result</th> 1848 <th>Comment</th> 1849 </tr> 1850 <tr> 1851 <td>1</td> 1852 <td>& a < g</td> 1853 <td>... a <font color="red"><<sub>1</sub> g</font> 1854 ...</td> 1855 <td>Put g after a.</td> 1856 </tr> 1857 <tr> 1858 <td>2</td> 1859 <td>& a < h < k</td> 1860 <td>... a <font color="red"><<sub>1</sub> h 1861 <<sub>1</sub> k</font> <<sub>1</sub> g ...</td> 1862 <td>Now put h and k after a (inserting before the g).</td> 1863 </tr> 1864 <tr> 1865 <td>3</td> 1866 <td>& h << g</td> 1867 <td>... a <<sub>1</sub> h <font color= 1868 "red"><<sub>1</sub> g</font> <<sub>1</sub> k ...</td> 1869 <td>Now put g after h (inserting before k).</td> 1870 </tr> 1871 </table> 1872 <p>Notice that relation strings can occur multiple times, and 1873 thus override previous rules.</p> 1874 <p>Each relation uses and modifies the collation elements of 1875 the immediately preceding reset position or relation. A rule 1876 chain with two or more relations is equivalent to a sequence of 1877 “atomic rules” where each rule chain has exactly one relation, 1878 and each relation is followed by a reset to this same relation 1879 string.</p> 1880 <p><i>Example:</i></p> 1881 <table> 1882 <tr> 1883 <th>Rules</th> 1884 <th>Equivalent Atomic Rules</th> 1885 </tr> 1886 <tr> 1887 <td>& b < q <<< Q<br> 1888 & a < x <<< X << q <<< Q 1889 < z</td> 1890 <td>& b < q<br> 1891 & q <<< Q<br> 1892 & a < x<br> 1893 & x <<< X<br> 1894 & X << q<br> 1895 & q <<< Q<br> 1896 & Q < z</td> 1897 </tr> 1898 </table> 1899 <p>This is not always possible because prefix and extension 1900 strings can occur in a relation but not in a reset (see 1901 below).</p> 1902 <p>The relation operator <code>=</code> maps its relation 1903 string to the current collation elements. Any other relation 1904 operator modifies the current collation elements as 1905 follows.</p> 1906 <ul> 1907 <li>Find the <i>last</i> collation element whose strength is 1908 at least as great as the strength of the operator. For 1909 example, for <code><<</code> find the last primary or 1910 secondary CE. This CE will be modified; all following CEs 1911 should be removed. If there is no such CE, then reset the 1912 collation elements to a single completely-ignorable CE.</li> 1913 <li>Increment the collation element weight corresponding to 1914 the strength of the operator. For example, for 1915 <code><<</code> increment the secondary weight.</li> 1916 <li>The new weight must be less than the next weight for the 1917 same combination of higher-level weights of any collation 1918 element according to the current state.</li> 1919 <li>Weights must be allocated in accordance with the <a href= 1920 "https://www.unicode.org/reports/tr10/#Well-Formed">UCA 1921 well-formedness conditions</a>.</li> 1922 <li>When incrementing any weight, lower-level weights should 1923 be reset to the “common” values, to help with sort key 1924 compression.</li> 1925 </ul> 1926 <p>In all cases, even for <code>=</code> , the case bits are 1927 recomputed according to <i>Section 3.13, <a href= 1928 "#Case_Parameters">Case Parameters</a></i>. (This can be 1929 skipped if an implementation does not support the caseLevel or 1930 caseFirst settings.)</p> 1931 <p>For example, <code>&ae<x</code> maps ‘x’ to two 1932 collation elements. The first one is the same as for ‘a’, and 1933 the second one has a primary weight between those for ‘e’ and 1934 ‘f’. As a result, ‘x’ sorts between “ae” and “af”. (If the 1935 primary of the first collation element was incremented instead, 1936 then ‘x’ would sort after “az”. While also sorting 1937 primary-after “ae” this would be surprising and 1938 sub-optimal.)</p> 1939 <p>Some additional operators are provided to save space with 1940 large tailorings. The addition of a * to the relation operator 1941 indicates that each of the following single characters are to 1942 be handled as if they were separate relations with the 1943 corresponding strength. Each of the following single characters 1944 must be NFD-inert, that is, it does not have a canonical 1945 decomposition and it does not reorder (ccc=0). This keeps 1946 abbreviated rules unambiguous.</p> 1947 <p>A starred relation operator is followed by a sequence of 1948 characters with the same quoting/escaping rules as normal 1949 relation strings. Such a sequence can also be followed by one 1950 or more pairs of ‘-’ and another sequence of characters. The 1951 single characters adjacent to the ‘-’ establish a code point 1952 order range. The same character cannot be both the end of a 1953 range and the start of another range. (For example, 1954 <code><a-d-g</code> is not allowed.)</p> 1955 <table> 1956 <caption> 1957 <a name="Abbreviating_Ordering_Specifications" href= 1958 "#Abbreviating_Ordering_Specifications" id= 1959 "Abbreviating_Ordering_Specifications">Abbreviating 1960 Ordering Specifications</a> 1961 </caption> 1962 <tr> 1963 <th>Relation Operator</th> 1964 <th>Example</th> 1965 <th>Equivalent</th> 1966 </tr> 1967 <tr> 1968 <td><code><*</code></td> 1969 <td><code>& <span style="color: blue">a</span><br> 1970 <* <span style= 1971 "color: blue">bcd-gp-s</span> </code></td> 1972 <td><code>& <span style="color: blue">a</span><br> 1973 < <span style="color: blue">b</span> < <span style= 1974 "color: blue">c</span> < <span style= 1975 "color: blue">d</span> < <span style= 1976 "color: blue">e</span> < <span style= 1977 "color: blue">f</span> < <span style= 1978 "color: blue">g</span> < <span style= 1979 "color: blue">p</span> < <span style= 1980 "color: blue">q</span> < <span style= 1981 "color: blue">r</span> < <span style= 1982 "color: blue">s</span></code></td> 1983 </tr> 1984 <tr> 1985 <td><code><<*</code></td> 1986 <td><code>& <span style="color: blue">a</span><br> 1987 <<* <span style="color: blue">æᶏɐ</span></code></td> 1988 <td><code>& <span style="color: blue">a</span><br> 1989 << <span style="color: blue">æ</span> << 1990 <span style="color: blue">ᶏ</span> << <span style= 1991 "color: blue">ɐ</span></code></td> 1992 </tr> 1993 <tr> 1994 <td><code><<<*</code></td> 1995 <td><code>& <span style="color: blue">p</span><br> 1996 <<<* <span style= 1997 "color: blue">PpP</span></code></td> 1998 <td><code>& <span style="color: blue">p</span><br> 1999 <<< <span style="color: blue">P</span> 2000 <<< <span style="color: blue">p</span> 2001 <<< <span style="color: blue">P</span></code></td> 2002 </tr> 2003 <tr> 2004 <td><code><<<<*</code></td> 2005 <td><code>& <span style="color: blue">k</span><br> 2006 <<<<* <span style= 2007 "color: blue">qQ</span></code></td> 2008 <td><code>& <span style="color: blue">k</span><br> 2009 <<<< <span style="color: blue">q</span> 2010 <<<< <span style= 2011 "color: blue">Q</span></code></td> 2012 </tr> 2013 <tr> 2014 <td><code>=*</code></td> 2015 <td><code>& <span style="color: blue">v</span><br> 2016 =* <span style="color: blue">VwW</span></code></td> 2017 <td><code>& <span style="color: blue">v</span><br> 2018 = <span style="color: blue">V</span> = <span style= 2019 "color: blue">w</span> = <span style= 2020 "color: blue">W</span></code></td> 2021 </tr> 2022 </table> 2023 <h3>3.7 <a name="Contractions" href="#Contractions" id= 2024 "Contractions">Contractions</a></h3> 2025 <p>A multi-character relation string defines a contraction.</p> 2026 <table> 2027 <caption> 2028 <a name="Specifying_Contractions" href= 2029 "#Specifying_Contractions" id= 2030 "Specifying_Contractions">Specifying Contractions</a> 2031 </caption> 2032 <tr> 2033 <th>Example</th> 2034 <th>Description</th> 2035 </tr> 2036 <tr> 2037 <td><code>& k<br> 2038 < ch</code></td> 2039 <td>Make the sequence 'ch' sort after 'k', as a primary 2040 (base-character) difference</td> 2041 </tr> 2042 </table> 2043 <h3>3.8 <a name="Expansions" href="#Expansions" id= 2044 "Expansions">Expansions</a></h3> 2045 <p>A mapping to multiple collation elements defines an 2046 expansion. This is normally the result of a reset position 2047 (and/or preceding relation) that yields multiple collation 2048 elements, for example <code>&ae<x</code> or 2049 <code>&æ<y</code> .</p> 2050 <p>A relation string can also be followed by <code>/</code> and 2051 an <i>extension string</i>. The extension string is mapped to 2052 collation elements according to the current state, and the 2053 relation string is mapped to the concatenation of the regular 2054 CEs and the extension CEs. The extension CEs are not modified, 2055 not even their case bits. The extension CEs are <i>not</i> 2056 retained for following relations.</p> 2057 <p>For example, <code>&a<z/e</code> maps ‘z’ to an 2058 expansion similar to <code>&ae<x</code> . However, the 2059 first CE of ‘z’ is primary-after that of ‘a’, and the second CE 2060 is exactly that of ‘e’, which yields the order ae < x < 2061 af < ag < ... < az < z < b.</p> 2062 <p>The choice of reset-to-expansion vs. use of an extension 2063 string can be exploited to affect contextual mappings. For 2064 example, <code>&L·=x</code> yields a second CE for ‘x’ 2065 equal to the context-sensitive middle-dot-after-L (which is a 2066 secondary CE in the root collation). On the other hand, 2067 <code>&L=x/·</code> yields a second CE of the middle dot by 2068 itself (which is a primary CE).</p> 2069 <p>The two ways of specifying expansions also differ in how 2070 case bits are computed. When some of the CEs are copied 2071 verbatim from an extension string, then the relation string’s 2072 case bits are distributed over a smaller number of normal CEs. 2073 For example, <code>&aE=Ch</code> yields an uppercase CE and 2074 a lowercase CE, but <code>&a=Ch/E</code> yields a 2075 mixed-case CE (for ‘C’ and ‘h’ together) followed by an 2076 uppercase CE (copied from ‘E’).</p> 2077 <p>In summary, there are two ways of specifying expansions 2078 which produce subtly different mappings. The use of extension 2079 strings is unusual but sometimes necessary.</p> 2080 <h3>3.9 <a name="Context_Before" href="#Context_Before" id= 2081 "Context_Before">Context Before</a></h3> 2082 <p>A relation string can have a prefix (context before) which 2083 makes the mapping from the relation string to its tailored 2084 position conditional on the string occurring after that prefix. 2085 For details see the specification of <i><a href= 2086 "#Context_Sensitive_Mappings">Context-Sensitive 2087 Mappings</a></i>.</p> 2088 <p>For example, suppose that "-" is sorted like the previous 2089 vowel. Then one could have rules that take "a-", "e-", and so 2090 on. However, that means that every time a very common character 2091 (a, e, ...) is encountered, a system will slow down as it looks 2092 for possible contractions. An alternative is to indicate that 2093 when "-" is encountered, and it comes after an 'a', it sorts 2094 like an 'a', and so on.</p> 2095 <table> 2096 <caption> 2097 <a name="Specifying_Previous_Context" href= 2098 "#Specifying_Previous_Context" id= 2099 "Specifying_Previous_Context">Specifying Previous 2100 Context</a> 2101 </caption> 2102 <tr> 2103 <th>Rules</th> 2104 </tr> 2105 <tr> 2106 <td><code>& a <<< a | '-'<br> 2107 & e <<< e | '-'<br> 2108 ...</code></td> 2109 </tr> 2110 </table> 2111 <p>Both the prefix and extension strings can occur in a 2112 relation. For example, the following are allowed:</p> 2113 <ul> 2114 <li><code>< abc | def / ghi</code></li> 2115 <li><code>< def / ghi</code></li> 2116 <li><code>< abc | def</code></li> 2117 </ul> 2118 <h3>3.10 <a name="Placing_Characters_Before_Others" href= 2119 "#Placing_Characters_Before_Others" id= 2120 "Placing_Characters_Before_Others">Placing Characters Before 2121 Others</a></h3> 2122 <p>There are certain circumstances where characters need to be 2123 placed before a given character, rather than after. This is the 2124 case with Pinyin, for example, where certain accented letters 2125 are positioned before the base letter. That is accomplished 2126 with the following syntax.</p> 2127 <pre>&[before 2] a << à</pre> 2128 <p>The before-strength can be 1 (primary), 2 (secondary), or 3 2129 (tertiary).</p> 2130 <p>It is an error if the strength of the reset-before differs 2131 from the strength of the immediately following relation. Thus 2132 the following are errors.</p> 2133 <ul> 2134 <li><code>&[before 2] a < à # error</code></li> 2135 <li><code>&[before 2] a <<< à # 2136 error</code></li> 2137 </ul> 2138 <h3>3.11 <a name="Logical_Reset_Positions" href= 2139 "#Logical_Reset_Positions" id="Logical_Reset_Positions">Logical 2140 Reset Positions</a></h3> 2141 <p>The CLDR table (based on UCA) has the following overall 2142 structure for weights, going from low to high.</p> 2143 <table> 2144 <caption> 2145 <a name="Specifying_Logical_Positions" href= 2146 "#Specifying_Logical_Positions" id= 2147 "Specifying_Logical_Positions">Specifying Logical 2148 Positions</a> 2149 </caption> 2150 <tr> 2151 <th>Name</th> 2152 <th>Description</th> 2153 <th>UCA Examples</th> 2154 </tr> 2155 <tr> 2156 <td>first tertiary ignorable<br> 2157 ...<br> 2158 last tertiary ignorable</td> 2159 <td>p, s, t = ignore</td> 2160 <td>Control Codes<br> 2161 Format Characters<br> 2162 Hebrew Points<br> 2163 Tibetan Signs<br> 2164 ...</td> 2165 </tr> 2166 <tr> 2167 <td>first secondary ignorable<br> 2168 ...<br> 2169 last secondary ignorable</td> 2170 <td>p, s = ignore</td> 2171 <td>None in UCA</td> 2172 </tr> 2173 <tr> 2174 <td>first primary ignorable<br> 2175 ...<br> 2176 last primary ignorable</td> 2177 <td>p = ignore</td> 2178 <td>Most combining marks</td> 2179 </tr> 2180 <tr> 2181 <td>first variable<br> 2182 ...<br> 2183 last variable</td> 2184 <td><i><b>if</b> alternate = non-ignorable<br></i> p != 2185 ignore,<br> 2186 <i><b>if</b> alternate = shifted</i><br> 2187 p, s, t = ignore</td> 2188 <td>Whitespace,<br> 2189 Punctuation</td> 2190 </tr> 2191 <tr> 2192 <td>first regular<br> 2193 ...<br> 2194 last regular</td> 2195 <td>p != ignore</td> 2196 <td>General Symbols<br> 2197 Currency Symbols<br> 2198 Numbers<br> 2199 Latin<br> 2200 Greek<br> 2201 ...</td> 2202 </tr> 2203 <tr> 2204 <td>first implicit<br> 2205 ...<br> 2206 last implicit</td> 2207 <td>p != ignore, assigned automatically</td> 2208 <td>CJK, CJK compatibility (those that are not 2209 decomposed)<br> 2210 CJK Extension A, B, C, ...<br> 2211 Unassigned</td> 2212 </tr> 2213 <tr> 2214 <td>first trailing<br> 2215 ...<br> 2216 last trailing</td> 2217 <td>p != ignore,<br> 2218 used for trailing syllable components</td> 2219 <td>Jamo Trailing<br> 2220 Jamo Leading<br> 2221 U+FFFD<br> 2222 U+FFFF</td> 2223 </tr> 2224 </table> 2225 <p>Each of the above Names can be used with a reset to position 2226 characters relative to that logical position. That allows 2227 characters to be ordered before or after a <i>logical</i> 2228 position rather than a specific character.</p> 2229 <blockquote> 2230 <p class="note"><b>Note:</b> The reason for this is so that 2231 tailorings can be more stable. A future version of the UCA 2232 might add characters at any point in the above list. Suppose 2233 that you set character X to be after Y. It could be that you 2234 want X to come after Y, no matter what future characters are 2235 added; or it could be that you just want Y to come after a 2236 given logical position, for example, after the last primary 2237 ignorable.</p> 2238 </blockquote> 2239 <p>Each of these special reset positions always maps to a 2240 single collation element.</p> 2241 <p>Here is an example of the syntax:</p> 2242 <pre>& [first tertiary ignorable] << à </pre> 2243 <p>For example, to make a character be a secondary ignorable, 2244 one can make it be immediately after (at a secondary level) a 2245 specific character (like a combining diaeresis), or one can 2246 make it be immediately after the last secondary ignorable.</p> 2247 <p>Each special reset position adjusts to the effects of 2248 preceding rules, just like normal reset position strings. For 2249 example, if a tailoring rule creates a new collation element 2250 after <code>&[last variable]</code> (via explicit tailoring 2251 after that, or via tailoring after the relevant character), 2252 then this new CE becomes the new <i>last variable</i> CE, and 2253 is used in following resets to <code>[last variable]</code> 2254 .</p> 2255 <p>[first variable] and [first regular] and [first trailing] 2256 should be the first real such CEs (e.g., CE(U+0060 `)), as 2257 adjusted according to the tailoring, not the boundary CEs (see 2258 the FractionalUCA.txt “first primary” mappings starting with 2259 U+FDD1).</p> 2260 <p><code>[last regular]</code> is not actually the last normal 2261 CE with a primary weight before implicit primaries. It is used 2262 to tailor large numbers of characters, usually CJK, into the 2263 script=Hani range between the last regular script and the first 2264 implicit CE. (The first group of implicit CEs is for Han 2265 characters.) Therefore, <code>[last regular]</code> is set to 2266 the first Hani CE, the artificial script boundary CE at the 2267 beginning of this range. For example: <code>&[last 2268 regular]<*亜唖娃阿...</code></p> 2269 <p>The [last trailing] is the CE of U+FFFF. Tailoring to that 2270 is not allowed.</p> 2271 <p>The <code>[last variable]</code> indicates the "highest" 2272 character that is treated as punctuation with alternate 2273 handling.</p> 2274 <p>The value can be changed by using the maxVariable setting. 2275 This takes effect, however, after the rules have been built, 2276 and does not affect any characters that are reset relative to 2277 the <code>[last variable]</code> value when the rules are being 2278 built. The maxVariable setting might also be changed via a 2279 runtime parameter. That also does not affect the rules.<br> 2280 (In CLDR 24 and earlier, the variable top could also be set by 2281 using a tailoring rule with <code>[variable top]</code> in the 2282 place of a relation string.)</p> 2283 <h3>3.12 <a name="Special_Purpose_Commands" href= 2284 "#Special_Purpose_Commands" id= 2285 "Special_Purpose_Commands">Special-Purpose Commands</a></h3> 2286 <p>The import command imports rules from another collation. 2287 This allows for better maintenance and smaller rule sizes. The 2288 source is a BCP 47 language tag with an optional collation type 2289 but without other extensions. The collation type is the BCP 47 2290 form of the collation type in the source; it defaults to 2291 "standard".</p> 2292 <p><em>Examples:</em></p> 2293 <ul> 2294 <li><code>[import de-u-co-phonebk]</code> (not 2295 "...-co-phonebook")</li> 2296 <li><code>[import und-u-co-search]</code> (not 2297 "root-...")</li> 2298 <li><code>[import ja-u-co-private-kana]</code> 2299 (language "ja" required even when this import itself is in 2300 another "ja" tailoring.)</li> 2301 </ul> 2302 <table> 2303 <caption> 2304 <a name="Special_Purpose_Elements" href= 2305 "#Special_Purpose_Elements" id= 2306 "Special_Purpose_Elements">Special-Purpose Elements</a> 2307 </caption> 2308 <tr> 2309 <th>Rule Syntax</th> 2310 </tr> 2311 <tr> 2312 <td>[suppressContractions [Љ-ґ]]</td> 2313 </tr> 2314 <tr> 2315 <td>[optimize [Ά-ώ]]</td> 2316 </tr> 2317 </table> 2318 <p>The <i>suppress contractions</i> tailoring command turns off 2319 any existing contractions that begin with those characters, as 2320 well as any prefixes for those characters. It is typically used 2321 to turn off the Cyrillic contractions in the UCA, since they 2322 are not used in many languages and have a considerable 2323 performance penalty. The argument is a <a href= 2324 "tr35.html#Unicode_Sets">Unicode Set</a>.</p> 2325 <p>The <i>suppress contractions</i> command has immediate 2326 effect on the current set of mappings, including mappings added 2327 by preceding rules. Following rules are processed after 2328 removing any context-sensitive mappings originating from any of 2329 the characters in the set.</p> 2330 <p>The <i>optimize</i> tailoring command is purely for 2331 performance. It indicates that those characters are 2332 sufficiently common in the target language for the tailoring 2333 that their performance should be enhanced.</p> 2334 <p>The reason that these are not settings is so that their 2335 contents can be arbitrary characters.</p> 2336 <hr width="50%"> 2337 <p><i>Example:</i></p> 2338 <p>The following is a simple example that combines portions of 2339 different tailorings for illustration. For more complete 2340 examples, see the actual locale data: <a href= 2341 "https://github.com/unicode-org/cldr/tree/latest/common/collation/ja.xml"> 2342 Japanese</a>, <a href= 2343 "https://github.com/unicode-org/cldr/tree/latest/common/collation/zh.xml"> 2344 Chinese</a>, <a href= 2345 "https://github.com/unicode-org/cldr/tree/latest/common/collation/sv.xml"> 2346 Swedish</a>, and <a href= 2347 "https://github.com/unicode-org/cldr/tree/latest/common/collation/de.xml"> 2348 German</a> (type="phonebook") are particularly 2349 illustrative.</p> 2350 <pre><collation> 2351 <cr><![CDATA[ 2352 [caseLevel on] 2353 &Z 2354 < æ <<< Æ 2355 < å <<< Å <<< aa <<< aA <<< Aa <<< AA 2356 < ä <<< Ä 2357 < ö <<< Ö << ű <<< Ű 2358 < ő <<< Ő << ø <<< Ø 2359 &V <<<* wW 2360 &Y <<<* üÜ 2361 &[last non-ignorable] 2362 <span style= 2363"color: green"># The following is equivalent to <亜<唖<娃...</span> 2364 <* 亜唖娃阿哀愛挨姶逢葵茜穐悪握渥旭葦芦 2365 <* 鯵梓圧斡扱 2366 ]]></cr> 2367</collation></pre> 2368 <h3>3.13 <a name="Script_Reordering" href="#Script_Reordering" 2369 id="Script_Reordering">Collation Reordering</a></h3> 2370 <p>Collation reordering allows scripts and certain other 2371 defined blocks of characters to be moved relative to each other 2372 parametrically, without changing the detailed rules for all the 2373 characters involved. This reordering is done on top of any 2374 specific ordering rules within the script or block currently in 2375 effect. Reordering can specify groups to be placed at the start 2376 and/or the end of the collation order. For example, to reorder 2377 Greek characters before Latin characters, and digits afterwards 2378 (but before other scripts), the following can be used:</p> 2379 <table> 2380 <tr> 2381 <th>Rule Syntax</th> 2382 <th>Locale Identifier</th> 2383 </tr> 2384 <tr> 2385 <td><code>[reorder Grek Latn digit]</code></td> 2386 <td><code>en-u-kr-grek-latn-digit</code></td> 2387 </tr> 2388 </table> 2389 <p>In each case, a sequence of 2390 <em><strong>reorder_codes</strong></em> is used, separated by 2391 spaces in the settings attribute and in rule syntax, and by 2392 hyphens in locale identifiers.</p> 2393 <p>A <strong><em>reorder_code</em></strong> is any of the 2394 following special codes:</p> 2395 <ol> 2396 <li><strong>space, punct, symbol, currency, digit</strong> - 2397 core groups of characters below 'a'</li> 2398 <li> 2399 <strong>any script code</strong> except 2400 <strong>Common</strong> and <strong>Inherited</strong>. 2401 <ul> 2402 <li>Some pairs of scripts sort primary-equal and always 2403 reorder together. For example, Katakana characters are 2404 are always reordered with Hiragana.</li> 2405 </ul> 2406 </li> 2407 <li><strong>others</strong> - where all codes not explicitly 2408 mentioned should be ordered. The script code 2409 <strong>Zzzz</strong> (Unknown Script) is a synonym for 2410 <strong>others</strong>.</li> 2411 </ol> 2412 <p>It is an error if a code occurs multiple times.</p> 2413 <p>It is an error if the sequence of reorder codes is empty in 2414 the XML attribute or in the locale identifier. Some 2415 implementations may interpret an empty sequence in the 2416 <code>[reorder]</code> rule syntax as a reset to the DUCET 2417 ordering, synonymous with <code>[reorder others]</code> ; other 2418 implementations may forbid an empty sequence in the rule syntax 2419 as well.</p> 2420 <p>Interaction with <strong>alternate=shifted</strong>: Whether 2421 a primary weight is “variable” is determined according to the 2422 “variable top”, before applying script reordering. Once that is 2423 determined, script reordering is applied to the primary weight 2424 regardless of whether it is “regular” (used in the primary 2425 level) or “shifted” (used in the quaternary level).</p> 2426 <h4>3.13.1 <a name="Interpretation_reordering" href= 2427 "#Interpretation_reordering" id= 2428 "Interpretation_reordering">Interpretation of a reordering 2429 list</a></h4> 2430 <p>The reordering list is interpreted as if it were processed 2431 in the following way.</p> 2432 <ol> 2433 <li>If any core code is not present, then it is inserted at 2434 the front of the list in the order given above.</li> 2435 <li>If the <strong>others</strong> code is not present, then 2436 it is inserted at the end of the list.</li> 2437 <li>The <strong>others</strong> code is replaced by the list 2438 of all script codes not explicitly mentioned, in DUCET 2439 order.</li> 2440 <li>The reordering list is now complete, and used to reorder 2441 characters in collation accordingly.</li> 2442 </ol> 2443 <p>The locale data may have a particular ordering. For example, 2444 the Czech locale data could put digits after all letters, with 2445 <code>[reorder others digit]</code> . Any reordering codes 2446 specified on top of that (such as with a bcp47 locale 2447 identifier) completely replace what was there. To specify a 2448 version of collation that completely resets any existing 2449 reordering to the DUCET ordering, the single code 2450 <strong>Zzzz</strong> or <strong>others</strong> can be used, 2451 as below.</p> 2452 <p><em>Examples:</em></p> 2453 <table cellpadding="0" cellspacing="0"> 2454 <tbody> 2455 <tr> 2456 <th>Locale Identifier</th> 2457 <th>Effect</th> 2458 </tr> 2459 <tr> 2460 <td><code>en-u-kr-latn-digit</code></td> 2461 <td>Reorder digits after Latin characters (but before 2462 other scripts like Cyrillic).</td> 2463 </tr> 2464 <tr> 2465 <td><code>en-u-kr-others-digit</code></td> 2466 <td>Reorder digits after all other characters.</td> 2467 </tr> 2468 <tr> 2469 <td><code>en-u-kr-arab-cyrl-others-symbol</code></td> 2470 <td>Reorder Arabic characters first, then Cyrillic, and 2471 put symbols at the end—after all other characters.</td> 2472 </tr> 2473 <tr> 2474 <td><code>en-u-kr-others</code></td> 2475 <td>Remove any locale-specific reordering, and use DUCET 2476 order for reordering blocks.</td> 2477 </tr> 2478 </tbody> 2479 </table> 2480 <p>The default reordering groups are defined by the 2481 FractionalUCA.txt file, based on the primary weights of 2482 associated collation elements. The file contains special 2483 mappings for the start of each group, script, and 2484 reorder-reserved range, see <i>Section 2.6.2, <a href= 2485 "#File_Format_FractionalUCA_txt">FractionalUCA.txt</a></i>.</p> 2486 <p>There are some special cases:</p> 2487 <ul> 2488 <li>The <strong>Hani</strong> group includes implicit weights 2489 for <em>Han characters</em> according to the UCA as well as 2490 any characters tailored relative to a Han character, or after 2491 <code>&[first Hani]</code>.</li> 2492 <li>Implicit weights for <em>unassigned code points</em> 2493 according to the UCA reorder as the last weights in the 2494 <strong>others</strong> (<strong>Zzzz</strong>) group.<br> 2495 There is no script code to explicitly reorder the 2496 unassigned-implicit weights into a particular position. 2497 (Unassigned-implicit weights are used for non-Hani code 2498 points without any mappings. For a given Unicode version they 2499 are the code points with General_Category values Cn, Co, 2500 Cs.)</li> 2501 <li>The TRAILING group, the FIELD-SEPARATOR (associated with 2502 U+FFFE), and collation elements with only zero primary 2503 weights are not reordered.</li> 2504 <li>The TERMINATOR, LEVEL-SEPARATOR, and SPECIAL groups are 2505 never associated with characters.</li> 2506 </ul> 2507 <p>For example, <code>reorder="Hani Zzzz Grek"</code> sorts 2508 Hani, Latin, Cyrillic, ... (all other scripts) ..., unassigned, 2509 Greek, TRAILING.</p> 2510 <p>Notes for implementations that write sort keys:</p> 2511 <ul> 2512 <li>Primaries must always be offset by one or more whole 2513 primary lead bytes. (Otherwise the number of bytes in a 2514 fractional weight may change, compressible scripts may span 2515 multiple lead bytes, or trailing primary bytes may collide 2516 with separators and primary-compression terminators.)</li> 2517 <li>When a script is reordered that does not start and end on 2518 whole-primary-lead-byte boundaries, then the lead byte needs 2519 to be “split”, and a reserved byte is used up. The data 2520 supports this via reorder-reserved ranges of primary weights 2521 that are not used for collation elements.</li> 2522 <li>Primary weights from different original lead bytes can be 2523 reordered to a shared lead byte, as long as they do not 2524 overlap. Primary compression ends when the target lead byte 2525 differs or when the original lead byte of the next primary is 2526 not compressible.</li> 2527 <li>Non-compressible groups and scripts begin or end on 2528 whole-primary-lead-byte boundaries (or both), so that 2529 reordering cannot surround a non-compressible script by two 2530 compressible ones within the same target lead byte. This is 2531 so that primary compression can be terminated reliably 2532 (choosing the low or high terminator byte) simply by 2533 comparing the previous and current primary weights. Otherwise 2534 it would have to also check for another condition (e.g., 2535 equal scripts).</li> 2536 </ul> 2537 <h4>3.13.2 <a name="Reordering_Groups_allkeys" href= 2538 "#Reordering_Groups_allkeys" id= 2539 "Reordering_Groups_allkeys">Reordering Groups for 2540 allkeys.txt</a></h4> 2541 <p>For allkeys_CLDR.txt, the start of each reordering group can 2542 be determined from FractionalUCA.txt, by finding the first real 2543 mapping (after “xyz first primary”) of that group (e.g., 2544 <code>0060; [0D 07, 05, 05] # Zyyy Sk [0312.0020.0002] * GRAVE 2545 ACCENT</code> ), and looking for that mapping's character 2546 sequence ( <code>0060</code> ) in allkeys_CLDR.txt. The comment 2547 in FractionalUCA.txt ( <code>[0312.0020.0002]</code> ) also 2548 shows the allkeys_CLDR.txt collation elements.</p> 2549 <p>The DUCET ordering of some characters is slightly different 2550 from the CLDR root collation order. The reordering groups for 2551 the DUCET are not specified. The following describes how 2552 reordering groups for the DUCET can be derived.</p> 2553 <p>For allkeys_DUCET.txt, the start of each reordering group is 2554 normally the primary weight corresponding to the same character 2555 sequence as for allkeys_CLDR.txt. In a few cases this requires 2556 adjustment, especially for the special reordering groups, due 2557 to CLDR’s ordering the common characters more strictly by 2558 category than the DUCET (as described in <i>Section 2, <a href= 2559 "#Root_Collation">Root Collation</a></i>). The necessary 2560 adjustment would set the start of each allkeys_DUCET.txt 2561 reordering group to the primary weight of the first mapping for 2562 the relevant General_Category for a special reordering group 2563 (for characters that sort before ‘a’), or the primary weight of 2564 the first mapping for the first script (e.g., sc=Grek) of an 2565 “alphabetic” group (for characters that sort at or after 2566 ‘a’).</p> 2567 <p>Note that the following only applies to primary weights 2568 greater than the one for U+FFFE and less than "trailing" 2569 weights.</p> 2570 <p>The special reordering groups correspond to General_Category 2571 values as follows:</p> 2572 <ul> 2573 <li>punct: P</li> 2574 <li>symbol: Sk, Sm, So</li> 2575 <li>space: Z, Cc</li> 2576 <li>currency: Sc</li> 2577 <li>digit: Nd</li> 2578 </ul> 2579 <p>In the DUCET, some characters that sort below ‘a’ and have 2580 other General_Category values not mentioned above (e.g., gc=Lm) 2581 are also grouped with symbols. Variants of numbers (gc=No or 2582 Nl) can be found among punctuation, symbols, and digits.</p> 2583 <p>Each collation element of an expansion may be in a different 2584 reordering group, for example for parenthesized characters.</p> 2585 <h3>3.14 <a name="Case_Parameters" href="#Case_Parameters" id= 2586 "Case_Parameters">Case Parameters</a></h3> 2587 <p>The <strong>case level</strong> is an <em>optional</em> 2588 intermediate level ("2.5") between Level 2 and Level 3 (or 2589 after Level 1, if there is no Level 2 due to strength 2590 settings). The case level is used to support two parametric 2591 features: ignoring non-case variants (Level 3 differences) 2592 except for case, and giving case differences a higher-level 2593 priority than other tertiary differences. Distinctions between 2594 small and large Kana characters are also included as case 2595 differences, to support Japanese collation.</p> 2596 <p>The <strong>case first</strong> parameter controls whether 2597 to swap the order of upper and lowercase. It can be used with 2598 or without the case level.</p> 2599 <p>Importantly, the case parameters have no effect in many 2600 instances. For example, they have no effect on the comparison 2601 of two non-ignorable characters with different primary weights, 2602 or with different secondary weights if the strength = 2603 <strong>secondary (or higher).</strong></p> 2604 <p>When either the <strong>case level</strong> or <strong>case 2605 first</strong> parameters are set, the following describes the 2606 derivation of the modified collation elements. It assumes the 2607 original levels for the code point are [p.s.t] (primary, 2608 secondary, tertiary). This derivation may change in future 2609 versions of LDML, to track the case characteristics more 2610 closely.</p> 2611 <h4>3.14.1 <a name="Case_Untailored" href="#Case_Untailored" 2612 id="Case_Untailored">Untailored Characters</a></h4> 2613 <p>For untailored characters and strings, that is, for mappings 2614 in the root collation, the case value for each collation 2615 element is computed from the tertiary weight listed in 2616 allkeys_CLDR.txt. This is used to modify the collation 2617 element.</p> 2618 <p>Look up a case value for the tertiary weight x of each 2619 collation element:</p> 2620 <ol> 2621 <li>UPPER if x ∈ {08-0C, 0E, 11, 12, 1D}</li> 2622 <li>UNCASED otherwise</li> 2623 <li>FractionalUCA.txt encodes the case information in bits 6 2624 and 7 of the first byte in each tertiary weight. The case 2625 bits are set to 00 for UNCASED and LOWERCASE, and 10 for 2626 UPPER. There is no MIXED case value (01) in the root 2627 collation.</li> 2628 </ol> 2629 <h4>3.14.2 <a name="Case_Weights" href="#Case_Weights" id= 2630 "Case_Weights">Compute Modified Collation Elements</a></h4> 2631 <p>From a computed case value, set a weight <strong>c</strong> 2632 according to the following.</p> 2633 <ol> 2634 <li>If <strong>CaseFirst=UpperFirst</strong>, set 2635 <strong>c</strong> = UPPER ? <strong>1</strong> : MIXED ? 2 : 2636 <strong>3</strong></li> 2637 <li>Otherwise set <strong>c</strong> = UPPER ? 2638 <strong>3</strong> : MIXED ? 2 : <strong>1</strong></li> 2639 </ol> 2640 <p>Compute a new collation element according to the following 2641 table. The notation <em>xt</em> means that the values are 2642 numerically combined into a single level, such that xt < yu 2643 whenever x < y. The fourth level (if it exists) is 2644 unaffected. Note that a secondary CE must have a secondary 2645 weight S which is greater than the secondary weight s of any 2646 primary CE; and a tertiary CE must have a tertiary weight T 2647 which is greater than the tertiary weight t of any primary or 2648 secondary CE ([<a href= 2649 "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>] <a href= 2650 "https://www.unicode.org/reports/tr10/#WF2">WF2</a>).</p> 2651 <div align="center"> 2652 <table> 2653 <tbody> 2654 <tr> 2655 <th>Case Level</th> 2656 <th>Strength</th> 2657 <th>Original CE</th> 2658 <th>Modified CE</th> 2659 <th>Comment</th> 2660 </tr> 2661 <tr> 2662 <td rowspan="5"><strong>on</strong></td> 2663 <td rowspan="2"><strong>primary</strong></td> 2664 <td><code>0.S.t</code></td> 2665 <td><code>0.0</code></td> 2666 <td rowspan="2">ignore case level weights of 2667 primary-ignorable CEs</td> 2668 </tr> 2669 <tr> 2670 <td><code>p.s.t</code></td> 2671 <td><code>p.c</code></td> 2672 </tr> 2673 <tr> 2674 <td rowspan="3"><strong>secondary<br></strong> or 2675 higher</td> 2676 <td><code>0.0.T</code></td> 2677 <td><code>0.0.0.T</code></td> 2678 <td rowspan="3">ignore case level weights of 2679 secondary-ignorable CEs</td> 2680 </tr> 2681 <tr> 2682 <td><code>0.S.t</code></td> 2683 <td><code>0.S.c.t</code></td> 2684 </tr> 2685 <tr> 2686 <td><code>p.s.t</code></td> 2687 <td><code>p.s.c.t</code></td> 2688 </tr> 2689 <tr> 2690 <td rowspan="4"><strong>off</strong></td> 2691 <td rowspan="4">any</td> 2692 <td><code>0.0.0</code></td> 2693 <td><code>0.0.00</code></td> 2694 <td rowspan="4">ignore case level weights of 2695 tertiary-ignorable CEs</td> 2696 </tr> 2697 <tr> 2698 <td><code>0.0.T</code></td> 2699 <td><code>0.0.3T</code></td> 2700 </tr> 2701 <tr> 2702 <td><code>0.S.t</code></td> 2703 <td><code>0.S.ct</code></td> 2704 </tr> 2705 <tr> 2706 <td><code>p.s.t</code></td> 2707 <td><code>p.s.ct</code></td> 2708 </tr> 2709 </tbody> 2710 </table> 2711 </div> 2712 <p>For primary+case, which is used for “ignore accents but not 2713 case” collation, primary ignorables are ignored so that a = ä. 2714 For secondary+case, which would by analogy mean “ignore 2715 variants but not case”, secondary ignorables are ignored for 2716 equivalent behavior.</p> 2717 <p>When using <strong>caseFirst</strong> but not 2718 <strong>caseLevel</strong>, the combined case+tertiary weight 2719 of a tertiary CE must be greater than the combined 2720 case+tertiary weight of any primary or secondary CE so that 2721 [<a href="https://www.unicode.org/reports/tr41/#UTS10">UCA</a>] 2722 <a href= 2723 "https://www.unicode.org/reports/tr10/#WF2">well-formedness 2724 condition 2</a> is fulfilled. Since the tertiary CE’s tertiary 2725 weight T is already greater than any t of primary or secondary 2726 CEs, it is sufficient to set its case weight to UPPER=3. It 2727 must not be affected by <strong>caseFirst=upper</strong>. (The 2728 table uses the constant 3 in this case rather than the computed 2729 c.)</p> 2730 <p>The case weight of a tertiary-ignorable CE must be 0 so that 2731 [<a href="https://www.unicode.org/reports/tr41/#UTS10">UCA</a>] 2732 <a href= 2733 "https://www.unicode.org/reports/tr10/#WF1">well-formedness 2734 condition 1</a> is fulfilled.</p> 2735 <h4>3.14.3 <a name="Case_Tailored" href="#Case_Tailored" id= 2736 "Case_Tailored">Tailored Strings</a></h4> 2737 <p>Characters and strings that are tailored have case values 2738 computed from their root collation case bits.</p> 2739 <ol> 2740 <li>Look up the tailored string’s root CEs. (Ignore any 2741 prefix or extension strings.) N=number of primary root 2742 CEs.</li> 2743 <li>Determine the number and type (primary vs. weaker) of CEs 2744 a tailored string maps to. M=number of primary tailored 2745 CEs.</li> 2746 <li>If N<=M (no more root than tailoring primary CEs): 2747 Copy the root case bits for primary CEs 0..N-1. 2748 <ul> 2749 <li>If N<M (fewer root primary CEs): Clear the case 2750 bits of the remaining tailored primary CEs. 2751 (uncased/lowercase/small Kana)</li> 2752 </ul> 2753 </li> 2754 <li>If N>M (more root primary CEs): Copy the root case 2755 bits for primary CEs 0..M-2. Set the case bits for tailored 2756 primary CE M-1 according to the remaining root primary CEs 2757 M-1..N-1: 2758 <ul> 2759 <li>Set to uncased/lower if all remaining root primary 2760 CEs have uncased/lower.</li> 2761 <li>Set to uppercase if all remaining root primary CEs 2762 have uppercase.</li> 2763 <li>Otherwise, set to mixed.</li> 2764 </ul> 2765 </li> 2766 <li>Clear the case bits for secondary CEs 0.s.t.</li> 2767 <li>Tertiary CEs 0.0.t must get uppercase bits.</li> 2768 <li>Tertiary-ignorable CEs 0.0.0 must get 2769 ignorable-case=lowercase bits.</li> 2770 </ol> 2771 <p class="note">Note: Almost all Cased characters have primary 2772 (non-ignorable) root collation CEs, except for U+0345 Combining 2773 Ypogegrammeni which is Lowercase. All Uppercase characters have 2774 primary root collation CEs.</p> 2775 <h3>3.15 <a name="Visibility" href="#Visibility" id= 2776 "Visibility">Visibility</a></h3> 2777 <p>Collations have external visibility by default, meaning that 2778 they can be displayed in a list of collation options for users 2779 to choose from. A collation whose type name starts with 2780 "private-" is internal and should not be shown in such a list. 2781 Collations are typically internal when they are partial 2782 sequences included in other collations. See <i>Section 3.1, 2783 <a href="#Collation_Types">Collation Types</a></i> .</p> 2784 <h3>3.16 <a name="Collation_Indexes" href="#Collation_Indexes" 2785 id="Collation_Indexes">Collation Indexes</a></h3> 2786 <h4>3.16.1 <a name="Index_Characters" href="#Index_Characters" 2787 id="Index_Characters">Index Characters</a></h4> 2788 <p>The main data includes <exemplarCharacters> for 2789 collation indexes. See <i>Part 2 General, Section 3, <a href= 2790 "tr35-general.html#Character_Elements">Character 2791 Elements</a></i>, for general information about exemplar 2792 characters.</p> 2793 <p>The index characters are a set of characters for use as a UI 2794 "index", that is, a list of clickable characters (or character 2795 sequences) that allow the user to see a segment of a larger 2796 "target" list. Each character corresponds to a bucket in the 2797 target list. One may have different kinds of index lists; one 2798 that produces an index list that is relatively static, and the 2799 other is a list that produces roughly equally-sized buckets. 2800 While CLDR is mostly focused on the first, there is provision 2801 for supporting the second as well.</p> 2802 <p>The index characters need to be used in conjunction with a 2803 collation for the locale, which will determine the order of the 2804 characters. It will also determine which index characters show 2805 up.</p> 2806 <p>The static list would be presented as something like the 2807 following (either vertically or horizontally):</p> 2808 <p align="center">… A B C D E F G H CH I J K L M N O P Q R 2809 S T U V W X Y Z …</p> 2810 <p>In the "A" bucket, you would find all items that are primary 2811 greater than or equal to "A" in collation order, and primary 2812 less than "B". The use of the list requires that the target 2813 list be sorted according to the locale that is used to create 2814 that list. Although we say "character" above, the index 2815 character could be a sequence, like "CH" above. The index 2816 exemplar characters must always be used with a collation 2817 appropriate for the locale. Any characters that do not have 2818 primary differences from others in the set should be 2819 removed.</p> 2820 <p>Details:</p> 2821 <ol> 2822 <li>The primary weight (according to the collation) is used 2823 to determine which bucket a string is in. There are special 2824 buckets for before the first character, between buckets of 2825 different scripts, and after the last bucket (and of a 2826 different script).</li> 2827 <li>Characters in the <em>index characters</em> do not need 2828 to have distinct primary weights. That is, the <em>index 2829 characters</em> are adapted to the underlying collation: 2830 normally Ё is in the Е bucket for Russian, but if someone 2831 used a variant of Russian collation that distinguished them 2832 on a primary level, then Ё would show up as its own 2833 bucket.</li> 2834 <li>If an <em>index character</em> string ends with a single 2835 "*" (U+002A), for example "Sch*" and "St*" in German, then 2836 there will be a separate bucket for the string minus the "*", 2837 for example "Sch" and "St", even if that string does not sort 2838 distinctly.</li> 2839 <li>An <em>index character</em> can have multiple primary 2840 weights, for example "Æ" and "Sch". Names that have the same 2841 initial primary weights sort into this <em>index 2842 character</em>’s bucket. This can be achieved by using an 2843 upper-boundary string that is the concatenation of the 2844 <em>index character</em> and U+FFFF, for example "Æ\uFFFF" 2845 and "Sch\uFFFF". Names that sort greater than this upper 2846 boundary but less than the next index character are 2847 redirected to the last preceding single-primary index 2848 character (A and S for the examples here).</li> 2849 </ol> 2850 <p>For example, for index characters <code>[A Æ B R S {Sch*} 2851 {St*} T]</code> the following sample names are sorted into an 2852 index as shown.</p> 2853 <ul> 2854 <li>A — Adelbert, Afrika</li> 2855 <li>Æ — Æsculap, Aesthet</li> 2856 <li>B — Berlin</li> 2857 <li>R — Rilke</li> 2858 <li>S — Sacher, Seiler, Sultan</li> 2859 <li>Sch — Schiller</li> 2860 <li>St — Steiff</li> 2861 <li>T — Thomas</li> 2862 </ul> 2863 <p>The … items are special: each is a bucket for 2864 everything else, either less or greater. They are inserted at 2865 the start and end of the index list, <em>and</em> on script 2866 boundaries. Each script has its own range, except where scripts 2867 sort primary-equal (e.g., Hira & Kana). All characters that 2868 sort in one of the low reordering groups (whitespace, 2869 punctuation, symbols, currency symbols, digits) are treated as 2870 a single script for this purpose.</p> 2871 <p>If you tailor a Greek character into the Cyrillic script, 2872 that Greek character will be bucketed (and sorted) among the 2873 Cyrillic ones.</p> 2874 <p>Even in an implementation that reorders groups of scripts 2875 rather than single scripts, for example Hebrew together with 2876 Phoenician and Samaritan, the index boundaries are really 2877 script boundaries, <em>not</em> multi-script-group boundaries. 2878 So if you had a collation that reordered Hebrew after Ethiopic, 2879 you would still get index boundaries between the following (and 2880 in that order):</p> 2881 <ol> 2882 <li>Ethiopic</li> 2883 <li>Hebrew</li> 2884 <li>Phoenician<em> // included in the Hebrew reordering 2885 group</em></li> 2886 <li>Samaritan<em> // included in the Hebrew reordering 2887 group</em></li> 2888 <li>Devanagari</li> 2889 </ol> 2890 <p>(Beginning with CLDR 27, single scripts can be 2891 reordered.)</p> 2892 <p>In the UI, an index character could also be omitted or 2893 grayed out if its bucket is empty. For example, if there is 2894 nothing in the bucket for Q, then Q could be omitted. That 2895 would be up to the implementation. Additional buckets could be 2896 added if other characters are present. For example, we might 2897 see something like the following:</p> 2898 <table border="1" cellspacing="0"> 2899 <tbody> 2900 <tr align="center"> 2901 <td> 2902 <div align="center"> 2903 <strong>Sample Greek Index<br></strong> 2904 </div> 2905 </td> 2906 <td><strong>Contents<br></strong></td> 2907 </tr> 2908 <tr align="center"> 2909 <td> 2910 <div align="center"> 2911 Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω 2912 </div> 2913 </td> 2914 <td>With only content beginning with Greek 2915 letters <br></td> 2916 </tr> 2917 <tr align="center"> 2918 <td> 2919 <div align="center"> 2920 … Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ 2921 Ω … 2922 </div> 2923 </td> 2924 <td>With some content before or after</td> 2925 </tr> 2926 <tr align="center"> 2927 <td> 2928 <div align="center"> 2929 … 9 Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ 2930 Ψ Ω … 2931 </div> 2932 </td> 2933 <td>With numbers, and nothing between 9 and Alpha</td> 2934 </tr> 2935 <tr align="center"> 2936 <td> 2937 <div align="center"> 2938 … 9 <em>A-Z</em> Α Β Γ Δ Ε Ζ Η Θ Ι Κ 2939 Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω … 2940 </div> 2941 </td> 2942 <td>With numbers, some Latin</td> 2943 </tr> 2944 </tbody> 2945 </table> 2946 <p>Here is a sample of the XML structure:</p> 2947 <pre> 2948 <exemplarCharacters type="index">[A B C D E F G H I J K L M N O P Q R S T U V W X Y Z]</exemplarCharacters></pre> 2949 <p>The display of the index characters can be modified with the 2950 Index labels elements, discussed in the <i>Part 2 General, 2951 Section 3.3, <a href="tr35-general.html#IndexLabels">Index 2952 Labels</a></i> .</p> 2953 <h4>3.16.2 <a name="CJK_Index_Markers" href= 2954 "#CJK_Index_Markers" id="CJK_Index_Markers">CJK Index 2955 Markers</a></h4> 2956 <p>Special index markers have been added to the CJK collations 2957 for stroke, pinyin, zhuyin, and unihan. These markers allow for 2958 effective and robust use of indexes for these collations.</p> 2959 <p>The per-language index exemplar characters are not useful 2960 for collation indexes for CJK because for each such language 2961 there are multiple sort orders in use (for example, Chinese 2962 pinyin vs. stroke vs. unihan vs. zhuyin), and these sort orders 2963 use very different index characters. In addition, sometimes the 2964 boundary strings are different from the bucket label strings. 2965 For collations that contain index markers, the boundary strings 2966 and bucket labels should be derived from those index markers, 2967 ignoring the index exemplar characters.</p> 2968 <p>For example, near the start of the pinyin tailoring there is 2969 the following:</p> 2970 <p><p> A</p><!-- INDEX A --><br> 2971 <pc>阿呵锕</pc><!-- ā --></p> 2972 <p>…</p> 2973 <p><pc>翶</pc><!-- ao --><br> 2974 <p> B</p><!-- INDEX B --></p> 2975 <p>These indicate the boundaries of "buckets" that can be used 2976 for indexing. They are always two characters starting with the 2977 noncharacter U+FDD0, and thus will not occur in normal text. 2978 For pinyin the second character is A-Z; for unihan it is one of 2979 the radicals; and for stroke it is a character after U+2800 2980 indicating the number of strokes, such as ⠁. For zhuyin the 2981 second character is one of the standard Bopomofo characters in 2982 the range U+3105 through U+3129.</p> 2983 <p>The corresponding bucket label strings are the boundary 2984 strings with the leading U+FDD0 removed. For example, the 2985 Pinyin boundary string "\uFDD0A" yields the label string 2986 "A".</p> 2987 <p>However, for stroke order, the label string is the stroke 2988 count (second character minus U+2800) as a decimal-digit number 2989 followed by 劃 (U+5283). For example, the stroke order boundary 2990 string "\uFDD0\u2805" yields the label string "5劃".</p> 2991 <hr> 2992 <p class="copyright">Copyright © 2001–2020 Unicode, Inc. All 2993 Rights Reserved. The Unicode Consortium makes no expressed or 2994 implied warranty of any kind, and assumes no liability for 2995 errors or omissions. No liability is assumed for incidental and 2996 consequential damages in connection with or arising out of the 2997 use of the information or programs contained or accompanying 2998 this technical report. The Unicode <a href= 2999 "https://unicode.org/copyright.html">Terms of Use</a> apply.</p> 3000 <p class="copyright">Unicode and the Unicode logo are 3001 trademarks of Unicode, Inc., and are registered in some 3002 jurisdictions.</p> 3003 </div> 3004</body> 3005</html> 3006