1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
2"https://www.w3.org/TR/html4/loose.dtd">
3<html>
4<head>
5  <meta name="generator" content=
6  "HTML Tidy for HTML5 for Apple macOS version 5.6.0">
7  <meta http-equiv="Content-Type" content=
8  "text/html; charset=utf-8">
9  <meta http-equiv="Content-Language" content="en-us">
10  <link rel="stylesheet" href=
11  "../reports.css" type="text/css">
12  <title>UTS #35: Unicode LDML: Collation</title>
13  <style type="text/css">
14  <!--
15  .dtd {
16        font-family: monospace;
17        font-size: 90%;
18        background-color: #CCCCFF;
19        border-style: dotted;
20        border-width: 1px;
21  }
22
23  .xmlExample {
24        font-family: monospace;
25        font-size: 80%
26  }
27
28  .blockedInherited {
29        font-style: italic;
30        font-weight: bold;
31        border-style: dashed;
32        border-width: 1px;
33        background-color: #FF0000
34  }
35
36  .inherited {
37        font-weight: bold;
38        border-style: dashed;
39        border-width: 1px;
40        background-color: #00FF00
41  }
42
43  .element {
44        font-weight: bold;
45        color: red;
46  }
47
48  .attribute {
49        font-weight: bold;
50        color: maroon;
51  }
52
53  .attributeValue {
54        font-weight: bold;
55        color: blue;
56  }
57
58  li, p {
59        margin-top: 0.5em;
60        margin-bottom: 0.5em
61  }
62
63  h2, h3, h4, table {
64        margin-top: 1.5em;
65        margin-bottom: 0.5em;
66  }
67  -->
68  </style>
69</head>
70<body>
71  <table class="header" width="100%">
72    <tr>
73      <td class="icon"><a href="https://unicode.org"><img alt=
74      "[Unicode]" src="../logo60s2.gif"
75      width="34" height="33" style=
76      "vertical-align: middle; border-left-width: 0px; border-bottom-width: 0px; border-right-width: 0px; border-top-width: 0px;"></a>&nbsp;
77      <a class="bar" href=
78      "https://www.unicode.org/reports/">Technical Reports</a></td>
79    </tr>
80    <tr>
81      <td class="gray">&nbsp;</td>
82    </tr>
83  </table>
84  <div class="body">
85    <h2 style="text-align: center">Unicode Technical Standard #35</h2>
86    <h1>Unicode Locale Data Markup Language (LDML)<br>
87    Part 5: Collation</h1>
88    <!-- At least the first row of this header table should be identical across the parts of this UTS. -->
89    <table border="1" cellpadding="2" cellspacing="0" class="wide">
90      <tr>
91        <td>Version</td>
92        <td>38</td>
93      </tr>
94      <tr>
95        <td>Editors</td>
96        <td>Markus Scherer (<a href="mailto:markus.icu@gmail.com">markus.icu@gmail.com</a>) and
97        <a href="tr35.html#Acknowledgments">other CLDR committee
98        members</a></td>
99      </tr>
100    </table>
101    <p>For the full header, summary, and status, see <a href=
102    "tr35.html">Part 1: Core</a></p>
103    <h3><i>Summary</i></h3>
104    <p>This document describes parts of an XML format
105    (<i>vocabulary</i>) for the exchange of structured locale data.
106    This format is used in the <a href=
107    "https://unicode.org/cldr/">Unicode Common Locale Data
108    Repository</a>.</p>
109    <p>This is a partial document, describing only those parts of
110    the LDML that are relevant for collation (sorting, searching
111    &amp; grouping). For the other parts of the LDML see the
112    <a href="tr35.html">main LDML document</a> and the links
113    above.</p>
114    <h3><i>Status</i></h3>
115
116    <!-- NOT YET APPROVED
117                <p>
118                                <i class="changed">This is a<b><font color="#ff3333">
119                                draft </font></b>document which may be updated, replaced, or superseded by
120                                other documents at any time. Publication does not imply endorsement
121                                by the Unicode Consortium. This is not a stable document; it is
122                                inappropriate to cite this document as other than a work in
123                                progress.
124                        </i>
125                </p>
126     END NOT YET APPROVED -->
127    <!-- APPROVED -->
128    <p><i>This document has been reviewed by Unicode members and
129    other interested parties, and has been approved for publication
130    by the Unicode Consortium. This is a stable document and may be
131    used as reference material or cited as a normative reference by
132    other specifications.</i></p>
133    <!-- END APPROVED -->
134
135    <blockquote>
136      <p><i><b>A Unicode Technical Standard (UTS)</b> is an
137      independent specification. Conformance to the Unicode
138      Standard does not imply conformance to any UTS.</i></p>
139    </blockquote>
140    <p><i>Please submit corrigenda and other comments with the CLDR
141    bug reporting form [<a href="tr35.html#Bugs">Bugs</a>]. Related
142    information that is useful in understanding this document is
143    found in the <a href="tr35.html#References">References</a>. For
144    the latest version of the Unicode Standard see [<a href=
145    "tr35.html#Unicode">Unicode</a>]. For a list of current Unicode
146    Technical Reports see [<a href=
147    "tr35.html#Reports">Reports</a>]. For more information about
148    versions of the Unicode Standard, see [<a href=
149    "tr35.html#Versions">Versions</a>].</i></p>
150    <h2><a name="Parts" href="#Parts" id="Parts">Parts</a></h2>
151    <!-- This section of Parts should be identical in all of the parts of this UTS. -->
152    <p>The LDML specification is divided into the following
153    parts:</p>
154    <ul class="toc">
155      <li>Part 1: <a href="tr35.html#Contents">Core</a> (languages,
156      locales, basic structure)</li>
157      <li>Part 2: <a href="tr35-general.html#Contents">General</a>
158      (display names &amp; transforms, etc.)</li>
159      <li>Part 3: <a href="tr35-numbers.html#Contents">Numbers</a>
160      (number &amp; currency formatting)</li>
161      <li>Part 4: <a href="tr35-dates.html#Contents">Dates</a>
162      (date, time, time zone formatting)</li>
163      <li>Part 5: <a href=
164      "tr35-collation.html#Contents">Collation</a> (sorting,
165      searching, grouping)</li>
166      <li>Part 6: <a href=
167      "tr35-info.html#Contents">Supplemental</a> (supplemental
168      data)</li>
169      <li>Part 7: <a href=
170      "tr35-keyboards.html#Contents">Keyboards</a> (keyboard
171      mappings)</li>
172    </ul>
173    <h2><a name="Contents" href="#Contents" id="Contents">Contents
174    of Part 5, Collation</a></h2>
175    <!-- START Generated TOC: CheckHtmlFiles -->
176    <ul class="toc">
177      <li>1 <a href="#CLDR_Collation">CLDR Collation</a>
178        <ul class="toc">
179          <li>1.1 <a href="#CLDR_Collation_Algorithm">CLDR
180          Collation Algorithm</a>
181            <ul class="toc">
182              <li>1.1.1 <a href="#Algorithm_FFFE">U+FFFE</a></li>
183              <li>1.1.2 <a href=
184              "#Context_Sensitive_Mappings">Context-Sensitive
185              Mappings</a></li>
186              <li>1.1.3 <a href="#Algorithm_Case">Case
187              Handling</a></li>
188              <li>1.1.4 <a href=
189              "#Algorithm_Reordering_Groups">Reordering
190              Groups</a></li>
191              <li>1.1.5 <a href="#Combining_Rules">Combining
192              Rules</a></li>
193            </ul>
194          </li>
195        </ul>
196      </li>
197      <li>2 <a href="#Root_Collation">Root Collation</a>
198        <ul class="toc">
199          <li>2.1 <a href=
200          "#grouping_classes_of_characters">Grouping classes of
201          characters</a></li>
202          <li>2.2 <a href="#non_variable_symbols">Non-variable
203          symbols</a></li>
204          <li>2.3 <a href="#tibetan_contractions">Additional
205          contractions for Tibetan</a></li>
206          <li>2.4 <a href="#tailored_noncharacter_weights">Tailored
207          noncharacter weights</a></li>
208          <li>2.5 <a href="#Root_Data_Files">Root Collation Data
209          Files</a></li>
210          <li>2.6 <a href="#Root_Data_File_Formats">Root Collation
211          Data File Formats</a>
212            <ul class="toc">
213              <li>2.6.1 <a href=
214              "#File_Format_allkeys_CLDR_txt">allkeys_CLDR.txt</a></li>
215              <li>2.6.2 <a href=
216              "#File_Format_FractionalUCA_txt">FractionalUCA.txt</a></li>
217              <li>2.6.3 <a href=
218              "#File_Format_UCA_Rules_txt">UCA_Rules.txt</a></li>
219            </ul>
220          </li>
221        </ul>
222      </li>
223      <li>3 <a href="#Collation_Tailorings">Collation
224      Tailorings</a>
225        <ul class="toc">
226          <li>3.1 <a href="#Collation_Types">Collation Types</a>
227            <ul class="toc">
228              <li>3.1.1 <a href=
229              "#Collation_Type_Fallback">Collation Type
230              Fallback</a>
231                <ul class="toc">
232                  <li>Table: <a href=
233                  "#Sample_requested_and_actual_collation_locales_and_types">
234                  Sample requested and actual collation locales and
235                  types</a></li>
236                </ul>
237              </li>
238            </ul>
239          </li>
240          <li>3.2 <a href="#Collation_Version">Version</a></li>
241          <li>3.3 <a href="#Collation_Element">Collation
242          Element</a></li>
243          <li>3.4 <a href="#Setting_Options">Setting Options</a>
244            <ul class="toc">
245              <li>Table: <a href="#Collation_Settings">Collation
246              Settings</a></li>
247              <li>3.4.1 <a href="#Common_Settings">Common settings
248              combinations</a></li>
249              <li>3.4.2 <a href="#Normalization_Setting">Notes on
250              the normalization setting</a></li>
251              <li>3.4.3 <a href="#Variable_Top_Settings">Notes on
252              variable top settings</a></li>
253            </ul>
254          </li>
255          <li>3.5 <a href="#Rules">Collation Rule Syntax</a></li>
256          <li>3.6 <a href="#Orderings">Orderings</a>
257            <ul class="toc">
258              <li>Table: <a href=
259              "#Specifying_Collation_Ordering">Specifying Collation
260              Ordering</a></li>
261              <li>Table: <a href=
262              "#Abbreviating_Ordering_Specifications">Abbreviating
263              Ordering Specifications</a></li>
264            </ul>
265          </li>
266          <li>3.7 <a href="#Contractions">Contractions</a>
267            <ul class="toc">
268              <li>Table: <a href=
269              "#Specifying_Contractions">Specifying
270              Contractions</a></li>
271            </ul>
272          </li>
273          <li>3.8 <a href="#Expansions">Expansions</a></li>
274          <li>3.9 <a href="#Context_Before">Context Before</a>
275            <ul class="toc">
276              <li>Table: <a href=
277              "#Specifying_Previous_Context">Specifying Previous
278              Context</a></li>
279            </ul>
280          </li>
281          <li>3.10 <a href=
282          "#Placing_Characters_Before_Others">Placing Characters
283          Before Others</a></li>
284          <li>3.11 <a href="#Logical_Reset_Positions">Logical Reset
285          Positions</a>
286            <ul class="toc">
287              <li>Table: <a href=
288              "#Specifying_Logical_Positions">Specifying Logical
289              Positions</a></li>
290            </ul>
291          </li>
292          <li>3.12 <a href=
293          "#Special_Purpose_Commands">Special-Purpose Commands</a>
294            <ul class="toc">
295              <li>Table: <a href=
296              "#Special_Purpose_Elements">Special-Purpose
297              Elements</a></li>
298            </ul>
299          </li>
300          <li>3.13 <a href="#Script_Reordering">Collation
301          Reordering</a>
302            <ul class="toc">
303              <li>3.13.1 <a href=
304              "#Interpretation_reordering">Interpretation of a
305              reordering list</a></li>
306              <li>3.13.2 <a href=
307              "#Reordering_Groups_allkeys">Reordering Groups for
308              allkeys.txt</a></li>
309            </ul>
310          </li>
311          <li>3.14 <a href="#Case_Parameters">Case Parameters</a>
312            <ul class="toc">
313              <li>3.14.1 <a href="#Case_Untailored">Untailored
314              Characters</a></li>
315              <li>3.14.2 <a href="#Case_Weights">Compute Modified
316              Collation Elements</a></li>
317              <li>3.14.3 <a href="#Case_Tailored">Tailored
318              Strings</a></li>
319            </ul>
320          </li>
321          <li>3.15 <a href="#Visibility">Visibility</a></li>
322          <li>3.16 <a href="#Collation_Indexes">Collation
323          Indexes</a>
324            <ul class="toc">
325              <li>3.16.1 <a href="#Index_Characters">Index
326              Characters</a></li>
327              <li>3.16.2 <a href="#CJK_Index_Markers">CJK Index
328              Markers</a></li>
329            </ul>
330          </li>
331        </ul>
332      </li>
333    </ul><!-- END Generated TOC: CheckHtmlFiles -->
334    <h2>1 <a name="CLDR_Collation" href="#CLDR_Collation" id=
335    "CLDR_Collation">CLDR Collation</a></h2>
336    <p>Collation is the general term for the process and function
337    of determining the sorting order of strings of characters, for
338    example for lists of strings presented to users, or in
339    databases for sorting and selecting records.</p>
340    <p>Collation varies by language, by application (some languages
341    use special phonebook sorting), and other criteria (for
342    example, phonetic vs. visual).</p>
343    <p>CLDR provides collation data for many languages and styles.
344    The data supports not only sorting but also language-sensitive
345    searching and grouping under index headers. All CLDR collations
346    are based on the [<a href=
347    "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>] default
348    order, with common modifications applied in the CLDR root
349    collation, and further tailored for language and style as
350    needed.</p>
351    <h3>1.1 <a name="CLDR_Collation_Algorithm" href=
352    "#CLDR_Collation_Algorithm" id="CLDR_Collation_Algorithm">CLDR
353    Collation Algorithm</a></h3>
354    <p>The CLDR collation algorithm is an extension of the <a href=
355    "https://www.unicode.org/reports/tr10/#Main_Algorithm">Unicode
356    Collation Algorithm</a>.</p>
357    <h4>1.1.1 <a name="Algorithm_FFFE" href="#Algorithm_FFFE" id=
358    "Algorithm_FFFE">U+FFFE</a></h4>
359    <p>U+FFFE maps to a CE with a minimal, unique primary weight.
360    Its primary weight is not "variable": U+FFFE must not become
361    ignorable in alternate handling. On the identical level, a
362    minimal, unique “weight” must be emitted for U+FFFE as well.
363    This allows for <a href=
364    "https://www.unicode.org/reports/tr10/#Merging_Sort_Keys">Merging
365    Sort Keys</a> within code point space.</p>
366    <p>For example, when sorting names in a database, a sortable
367    string can be formed with <em>last_name</em> + '\uFFFE' +
368    <em>first_name</em>. These strings would sort properly, without
369    ever comparing the last part of a last name with the first part
370    of another first name.</p>
371    <p>For backwards secondary level sorting, text <i>segments</i>
372    separated by U+FFFE are processed in forward segment order, and
373    <i>within</i> each segment the secondary weights are compared
374    backwards. This is so that such combined strings are processed
375    consistently with merging their sort keys (for example, by
376    concatenating them level by level with a low separator).</p>
377    <p class="note">Note: With unique, low weights on <i>all</i>
378    levels it is possible to achieve <code>sortkey(str1 + "\uFFFE"
    + str2) == mergeSortkeys(sortkey(str1), sortkey(str2))</code>.
380    When that is not necessary, then code can be a little simpler
381    (no special handling for U+FFFE except for
382    backwards-secondary), sort keys can be a little shorter (when
383    using compressible common non-primary weights for U+FFFE), and
384    another low weight can be used in tailorings.</p>
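    <p>The identity in the note above can be illustrated with a toy
    model. The following Java sketch is not the CLDR or ICU
    implementation: the class <code>MergeSortKeysSketch</code>, its
    two-level keys, and its weight values are assumptions made only
    for illustration. It gives U+FFFE a weight lower than any other
    on every level, so that the key of the joined string equals the
    level-by-level merge of the separate keys.</p>

```java
import java.util.ArrayList;
import java.util.List;

/** Toy two-level sort keys (hypothetical; not the CLDR/ICU implementation). */
final class MergeSortKeysSketch {
    static final char FFFE = (char) 0xFFFE; // the field separator code point
    static final int LOW = 1;               // assumed lower than every other weight

    /** Builds [primary, case] levels; all other characters get weights >= 2. */
    static List<List<Integer>> sortKey(String s) {
        List<Integer> primary = new ArrayList<>();
        List<Integer> caseLevel = new ArrayList<>();
        for (char ch : s.toCharArray()) {
            if (ch == FFFE) {
                primary.add(LOW);
                caseLevel.add(LOW);
            } else {
                primary.add(Character.toLowerCase(ch) + 2);
                caseLevel.add(Character.isUpperCase(ch) ? 3 : 2);
            }
        }
        List<List<Integer>> key = new ArrayList<>();
        key.add(primary);
        key.add(caseLevel);
        return key;
    }

    /** Concatenates two keys level by level with the low separator in between. */
    static List<List<Integer>> merge(List<List<Integer>> a, List<List<Integer>> b) {
        List<List<Integer>> merged = new ArrayList<>();
        for (int level = 0; level < Math.max(a.size(), b.size()); level++) {
            List<Integer> out = new ArrayList<>();
            if (level < a.size()) out.addAll(a.get(level));
            out.add(LOW);
            if (level < b.size()) out.addAll(b.get(level));
            merged.add(out);
        }
        return merged;
    }

    /** Compares keys level by level, each level lexicographically. */
    static int compare(List<List<Integer>> x, List<List<Integer>> y) {
        for (int level = 0; level < Math.min(x.size(), y.size()); level++) {
            List<Integer> a = x.get(level);
            List<Integer> b = y.get(level);
            for (int i = 0; i < Math.min(a.size(), b.size()); i++) {
                int c = Integer.compare(a.get(i), b.get(i));
                if (c != 0) return c;
            }
            if (a.size() != b.size()) return Integer.compare(a.size(), b.size());
        }
        return 0;
    }

    public static void main(String[] args) {
        String joined = "Li" + FFFE + "Zoe";
        System.out.println(sortKey(joined).equals(merge(sortKey("Li"), sortKey("Zoe"))));
        System.out.println(compare(sortKey(joined), sortKey("Lin" + FFFE + "Al")));
    }
}
```

    <p>In this toy model, "Li" + U+FFFE + "Zoe" sorts before "Lin" +
    U+FFFE + "Al", whereas the plain concatenations "LiZoe" and
    "LinAl" would compare the "Z" of a first name with the "n" of a
    last name.</p>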
385    <h4>1.1.2 <a name="Context_Sensitive_Mappings" href=
386    "#Context_Sensitive_Mappings" id=
387    "Context_Sensitive_Mappings">Context-Sensitive
388    Mappings</a></h4>
389    <p>Contraction matching, as in the UCA, starts from the first
390    character of the contraction string. It slows down processing
391    of that first character even when none of its contractions
    matches. In some cases, it is preferable to change such
393    contractions to mappings with a prefix (context before a
394    character), so that complex processing is done only when the
395    less-frequently occurring trailing character is
396    encountered.</p>
397    <p>For example, the DUCET contains contractions for several
398    variants of L· (L followed by middle dot). Collating ASCII text
399    is slowed down by contraction matching starting with L/l. In
400    the CLDR root collation, these contractions are replaced by
401    prefix mappings (L|·) which are triggered only when the middle
402    dot is encountered. CLDR also uses prefix rules in the Japanese
403    tailoring, for processing of Hiragana/Katakana length and
404    iteration marks.</p>
405    <p>The mapping is conditional on the prefix match but does not
406    change the mappings for the preceding text. As a result, a
407    contraction mapping for "px" can be replaced by a prefix rule
408    "p|x" only if px maps to the collation elements for p followed
409    by the collation elements for "x if after p". In the DUCET, L·
410    maps to CE(L) followed by a special secondary CE (which differs
411    from CE(·) when · is not preceded by L). In the CLDR root
412    collation, L has no context-sensitive mappings, but · maps to
413    that special secondary CE if preceded by L.</p>
414    <p>A prefix mapping for p|x behaves mostly like the contraction
415    px, except when there is a contraction that overlaps with the
416    prefix, for example one for "op". A contraction matches only
417    new text (and consumes it), while a prefix matches only
418    already-consumed text.</p>
419    <ul>
420      <li>With mappings for "op" and "px", only the first
421      contraction matches in text "opx". (It consumes the "op"
422      characters, and there is no context-sensitive mapping for
423      x.)</li>
424      <li>With mappings for "op" and "p|x", both the contraction
425      and the prefix rule match in text "opx". (The prefix always
426      matches already-consumed characters, regardless of whether
427      they mapped as part of contractions.)</li>
428    </ul>
429    <p class="note">Note: Matching of discontiguous contractions
430    should be implemented without rewriting the text (unlike in the
431    [<a href="https://www.unicode.org/reports/tr41/#UTS10">UCA</a>]
432    algorithm specification), so that prefix matching is
433    predictable. (It should also help with contraction matching
434    performance.) An implementation that does rewrite the text, as
435    in the UCA, will get different results for some (unusual)
436    combinations of contractions, prefix rules, and input text.</p>
437    <p>Prefix matching uses a simple longest-match algorithm (op|c
438    wins over p|c). It is recommended that prefix rules be limited
439    to mappings where both the prefix string and the mapped string
440    begin with an NFC boundary (that is, with a normalization
441    starter that does not combine backwards). (In op|ch both o and
442    c should be starters (ccc=0) and NFC_QC=Yes.) Otherwise, prefix
443    matching would be affected by canonical reordering and
444    discontiguous matching, like contractions. Prefix matching is
445    thus always contiguous.</p>
446    <p>A character can have mappings with both prefixes (context
447    before) and contraction suffixes. Prefixes are matched first.
448    This is to keep them reasonably implementable: When there is a
449    mapping with both a prefix and a contraction suffix (like in
450    Japanese: ぐ|ゞ), then the matching needs to go in both
451    directions. The contraction might involve discontiguous
452    matching, which needs complex text iteration and handling of
453    skipped combining marks, and will consume the matching suffix.
454    Prefix matching should be first because, regardless of whether
455    there is a match, the implementation will always return to the
456    original text index (right after the prefix) from where it will
457    start to look at all of the contractions for that prefix.</p>
458    <p>If there is a match for a prefix but no match for any of the
459    suffixes for that prefix, then fall back to mappings with the
460    next-longest matching prefix, and so on, ultimately to mappings
461    with no prefix. (Otherwise mappings with longer prefixes would
462    “hide” mappings with shorter prefixes.)</p>
463    <p>Consider the following mappings.</p>
464    <ol>
465      <li>p → CE(p)</li>
466      <li>h → CE(h)</li>
467      <li>c → CE(c)</li>
468      <li>ch → CE(d)</li>
469      <li>p|c → CE(u)</li>
470      <li>p|ci → CE(v)</li>
471      <li>p|ĉ → CE(w)</li>
472      <li>op|ck → CE(x)</li>
473    </ol>
474    <p>With these, text collates like this:</p>
475    <ul>
476      <li>pc → CE(p)CE(u)</li>
477      <li>pci → CE(p)CE(v)</li>
478      <li>pch → CE(p)CE(u)CE(h)</li>
479      <li>pĉ → CE(p)CE(w)</li>
480      <li>pĉ̣ → CE(p)CE(w)CE(U+0323) // discontiguous</li>
481      <li>opck → CE(o)CE(p)CE(x)</li>
482      <li>opch → CE(o)CE(p)CE(u)CE(h)</li>
483    </ul>
484    <p>However, if the mapping p|c → CE(u) is missing, then text
485    "pch" maps to CE(p)CE(d), "opch" maps to CE(o)CE(p)CE(d), and
486    "pĉ̣" maps to CE(p)CE(c)CE(U+0323)CE(U+0302) (because
487    discontiguous contraction matching extends <i>an existing
488    match</i> by one non-starter at a time).</p>
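    <p>The contiguous cases in the example above can be reproduced
    with a small Java sketch. It is a simplified illustration, not
    the CLDR or ICU implementation: the class
    <code>PrefixMatchSketch</code> and its string-keyed map are
    hypothetical, it works on UTF-16 code units, and it performs no
    normalization or discontiguous matching (so the mapping for p|ĉ
    is left out). For each position it tries the longest prefix that
    matches the already-consumed text, then the longest contraction
    for that prefix, and falls back to shorter prefixes when no
    suffix matches.</p>

```java
import java.util.HashMap;
import java.util.Map;

/** Toy contiguous prefix/contraction matcher (hypothetical; no normalization). */
final class PrefixMatchSketch {
    private final Map<String, String> mappings = new HashMap<>();

    /** Registers prefix|string -> ce; pass "" for a mapping without a prefix. */
    PrefixMatchSketch add(String prefix, String string, String ce) {
        mappings.put(prefix + "|" + string, ce);
        return this;
    }

    String collate(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            String ce = null;
            int consumed = 1;
            // Longest prefix first; a prefix is already-consumed text before i.
            for (int plen = i; plen >= 0 && ce == null; plen--) {
                String prefix = text.substring(i - plen, i);
                // For this prefix, the longest contraction starting at i wins.
                for (int slen = text.length() - i; slen >= 1 && ce == null; slen--) {
                    ce = mappings.get(prefix + "|" + text.substring(i, i + slen));
                    if (ce != null) consumed = slen;
                }
            }
            if (ce == null) ce = "CE(" + text.charAt(i) + ")"; // unmapped character
            out.append(ce);
            i += consumed;
        }
        return out.toString();
    }

    /** The ASCII-only mappings from the example, optionally without p|c. */
    static PrefixMatchSketch example(boolean includePrefixPC) {
        PrefixMatchSketch m = new PrefixMatchSketch()
            .add("", "p", "CE(p)").add("", "h", "CE(h)").add("", "c", "CE(c)")
            .add("", "ch", "CE(d)").add("p", "ci", "CE(v)").add("op", "ck", "CE(x)");
        if (includePrefixPC) m.add("p", "c", "CE(u)");
        return m;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"pc", "pci", "pch", "opck", "opch"}) {
            System.out.println(s + " -> " + example(true).collate(s));
        }
        System.out.println("pch -> " + example(false).collate("pch"));
    }
}
```

    <p>Because the prefix is looked up in text that has already been
    consumed, the same sketch also lets a contraction for "op" and a
    prefix rule "p|x" both match in "opx".</p>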
489    <h4>1.1.3 <a name="Algorithm_Case" href="#Algorithm_Case" id=
490    "Algorithm_Case">Case Handling</a></h4>
491    <p>CLDR specifies how to sort lowercase or uppercase first, as
492    a stronger distinction than other tertiary variants
493    (<strong>caseFirst</strong>) or while completely ignoring all
494    other tertiary distinctions (<strong>caseLevel</strong>). See
    <i>Section 3.4 <a href="#Setting_Options">Setting
    Options</a></i> and <i>Section 3.14 <a href=
497    "#Case_Parameters">Case Parameters</a></i>.</p>
498    <h4>1.1.4 <a name="Algorithm_Reordering_Groups" href=
499    "#Algorithm_Reordering_Groups" id=
500    "Algorithm_Reordering_Groups">Reordering Groups</a></h4>
501    <p>CLDR specifies how to do parametric reordering of groups of
502    scripts (e.g., “native script first”) as well as special groups
503    (e.g., “digits after letters”), and provides data for the
504    effective implementation of such reordering.</p>
505    <h4>1.1.5 <a name="Combining_Rules" href="#Combining_Rules" id=
506    "Combining_Rules">Combining Rules</a></h4>
507    <p>Rules from different sources can be combined, with the later
508    rules overriding the earlier ones. The following is an example
509    of how this can be useful.</p>
510    <p>There is a root collation for "emoji" in CLDR. So use of
511    "-u-co-emoji" in a Unicode locale identifier will access that
512    ordering.</p>
513    <p>Example, using ICU:</p>
514    <blockquote>
515      <p>collator =
516      Collator.getInstance(ULocale.forLanguageTag("en-u-co-emoji"));</p>
517    </blockquote>
    <p>However, use of the emoji collation type supplants the
    language's customizations. So the above is the equivalent of:</p>
520    <blockquote>
521      <p>collator =
522      Collator.getInstance(ULocale.forLanguageTag("und-u-co-emoji"));</p>
523    </blockquote>
    <p>The same structure will not work for a language that does
    require customization, like Danish. That is, the following will
    not produce Danish ordering, because the emoji rules replace
    the Danish tailoring.</p>
527    <blockquote>
528      <p>collator =
529      Collator.getInstance(ULocale.forLanguageTag("da-u-co-emoji"));</p>
530    </blockquote>
531    <p>For that, a slightly more cumbersome method needs to be
532    employed, which is to take the rules for Danish, and explicitly
533    add the rules for emoji.</p>
534    <blockquote>
535      <p>RuleBasedCollator collator = new RuleBasedCollator(<br>
536      ((RuleBasedCollator)
537      Collator.getInstance(ULocale.forLanguageTag("da"))).getRules()
538      +<br>
539      ((RuleBasedCollator)
540      Collator.getInstance(ULocale.forLanguageTag("und-u-co-emoji")))<br>
541
542      .getRules());</p>
543    </blockquote>
544    <p>The following table shows the differences. When emoji
545    ordering is supported, the two faces will be adjacent. When
546    Danish ordering is supported, the ü is after the y.</p>
547    <table class='simple'>
548      <tbody>
549        <tr>
550          <td>code point order</td>
551          <td>,</td>
552          <td></td>
553          <td></td>
554          <td>Z</td>
555          <td>a</td>
556          <td>y</td>
557          <td>ü</td>
558          <td>☹️</td>
559          <td>✈️️</td>
560          <td>글</td>
          <td>😀</td>
562        </tr>
563        <tr>
564          <td>en</td>
565          <td>,</td>
566          <td>☹️</td>
567          <td>✈️️</td>
          <td>😀</td>
569          <td>a</td>
570          <td>ü</td>
571          <td>y</td>
572          <td>Z</td>
573          <td>글</td>
574        </tr>
575        <tr>
576          <td>en-u-co-emoji</td>
577          <td>,</td>
          <td>😀</td>
579          <td>☹️</td>
580          <td>✈️️</td>
581          <td>a</td>
582          <td>ü</td>
583          <td>y</td>
584          <td>Z</td>
585          <td>글</td>
586        </tr>
587        <tr>
588          <td>da</td>
589          <td>,</td>
590          <td>☹️</td>
591          <td>✈️️</td>
          <td>😀</td>
593          <td>a</td>
594          <td>y</td>
595          <td><strong><u>ü</u></strong></td>
596          <td>Z</td>
597          <td>글</td>
598        </tr>
599        <tr>
600          <td>da-u-co-emoji</td>
601          <td>,</td>
          <td>😀</td>
603          <td>☹️</td>
604          <td>✈️️</td>
605          <td>a</td>
606          <td><strong><u>ü</u></strong></td>
607          <td>y</td>
608          <td>Z</td>
609          <td>글</td>
610        </tr>
611        <tr>
612          <td>combined rules</td>
613          <td>,</td>
          <td>😀</td>
615          <td>☹️</td>
616          <td>✈️️</td>
617          <td>a</td>
618          <td>y</td>
619          <td><strong><u>ü</u></strong></td>
620          <td>Z</td>
621          <td>글</td>
622        </tr>
623      </tbody>
624    </table><br>
625    <p>&nbsp;</p>
626    <h2>2 <a name="Root_Collation" href="#Root_Collation" id=
627    "Root_Collation">Root Collation</a></h2>
628    <p>The CLDR root collation order is based on the <a href=
629    "https://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table">
630    Default Unicode Collation Element Table (DUCET)</a> defined in
631    <em>UTS #10: Unicode Collation Algorithm</em> [<a href=
632    "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>]. It is
633    used by all other locales by default, or as the base for their
634    tailorings. (For a chart view of the UCA, see Collation Chart
635    [<a href="tr35.html#UCAChart">UCAChart</a>].)</p>
636    <p>Starting with CLDR 1.9, CLDR uses modified tables for the
637    root collation order. The root locale ordering is tailored in
638    the following ways:</p>
639    <h3>2.1 <a name="grouping_classes_of_characters" href=
640    "#grouping_classes_of_characters" id=
641    "grouping_classes_of_characters">Grouping classes of
642    characters</a></h3>
643    <p>As of Version 6.1.0, the DUCET puts characters into the
644    following ordering:</p>
645    <ul>
646      <li>First "common characters": whitespace, punctuation,
647      general symbols, some numbers, currency symbols, and other
648      numbers.</li>
649      <li>Then "script characters": Latin, Greek, and the rest of
650      the scripts.</li>
651    </ul>
652    <p>(There are a few exceptions to this general ordering.)</p>
653    <p>The CLDR root locale modifies the DUCET tailoring by
654    ordering the common characters more strictly by category:</p>
655    <ul>
656      <li>whitespace, punctuation, general symbols, currency
657      symbols, and numbers.</li>
658    </ul>
    <p>The regrouping allows users to parametrically
    reorder the groups. For example, users can reorder numbers
661    after all scripts, or reorder Greek before Latin.</p>
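    <p>The effect of parametric reordering can be sketched with a
    toy comparator. This is only an illustration under simplifying
    assumptions: the class <code>ReorderSketch</code> and its
    <code>Group</code> names are hypothetical, groups are derived
    from Java's general categories and scripts rather than from
    collation weights, and characters within a group are compared by
    code point. The caller passes a complete list of groups in the
    desired order.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy group-based reordering (hypothetical; real collation reorders weights). */
final class ReorderSketch {
    enum Group { SPACE, PUNCTUATION, SYMBOL, CURRENCY, DIGIT, LATIN, GREEK, OTHER }

    static Group groupOf(int cp) {
        if (Character.isWhitespace(cp)) return Group.SPACE;
        switch (Character.getType(cp)) {
            case Character.CURRENCY_SYMBOL:
                return Group.CURRENCY;
            case Character.MATH_SYMBOL:
            case Character.MODIFIER_SYMBOL:
            case Character.OTHER_SYMBOL:
                return Group.SYMBOL;
            case Character.DECIMAL_DIGIT_NUMBER:
            case Character.LETTER_NUMBER:
            case Character.OTHER_NUMBER:
                return Group.DIGIT;
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return Group.PUNCTUATION;
            default:
                break;
        }
        Character.UnicodeScript script = Character.UnicodeScript.of(cp);
        if (script == Character.UnicodeScript.LATIN) return Group.LATIN;
        if (script == Character.UnicodeScript.GREEK) return Group.GREEK;
        return Group.OTHER;
    }

    /** Compares by (group rank, code point); order must list every Group once. */
    static Comparator<String> comparator(List<Group> order) {
        return (a, b) -> {
            int i = 0;
            int j = 0;
            while (i < a.length() && j < b.length()) {
                int ca = a.codePointAt(i);
                int cb = b.codePointAt(j);
                int c = Integer.compare(order.indexOf(groupOf(ca)), order.indexOf(groupOf(cb)));
                if (c == 0) c = Integer.compare(ca, cb);
                if (c != 0) return c;
                i += Character.charCount(ca);
                j += Character.charCount(cb);
            }
            return Integer.compare(a.length() - i, b.length() - j);
        };
    }

    static List<String> sorted(List<String> items, List<Group> order) {
        List<String> copy = new ArrayList<>(items);
        copy.sort(comparator(order));
        return copy;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("\u03B1", "a", "1", "$");
        System.out.println(sorted(items, Arrays.asList(Group.values())));
    }
}
```

    <p>Passing a list with <code>DIGIT</code> moved to the end puts
    numbers after all scripts, and swapping <code>GREEK</code> and
    <code>LATIN</code> puts Greek before Latin, without changing the
    order inside any group.</p>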
662    <p>The relative order within each of these groups still matches
663    the DUCET. Symbols, punctuation, and numbers that are grouped
664    with a particular script stay with that script. The differences
665    between CLDR and the DUCET order are:</p>
666    <ol>
667      <li>CLDR groups the numbers together after currency symbols,
668      instead of splitting them with some before and some after.
669      Thus the following are put <em>after</em> currencies and just
670      before all the other numbers.
671        <blockquote>
672          <p>U+09F4 ( ৴ ) [No] BENGALI CURRENCY NUMERATOR ONE<br>
673          ...<br>
          U+1D371 ( 𝍱 ) [No] COUNTING ROD TENS DIGIT NINE</p>
675        </blockquote>
676      </li>
677      <li>CLDR handles a few other characters differently
678        <ol>
          <li>U+10A7F ( 𐩿 ) [Po] OLD SOUTH ARABIAN NUMERIC
680          INDICATOR is put with punctuation, not symbols</li>
681          <li>U+20A8 ( ₨ ) [Sc] RUPEE SIGN and U+FDFC ( ﷼ ) [Sc]
682          RIAL SIGN are put with currency signs, not with R and
683          REH.</li>
684        </ol>
685      </li>
686    </ol>
687    <h3>2.2 <a name="non_variable_symbols" href=
688    "#non_variable_symbols" id="non_variable_symbols">Non-variable
689    symbols</a></h3>
690    <p>There are multiple <a href=
691    "https://www.unicode.org/reports/tr10/#Variable_Weighting">Variable-Weighting</a>
692    options in the UCA for symbols and punctuation, including
693    <em>non-ignorable</em> and <em>shifted</em>. With the
694    <em>shifted</em> option, almost all symbols and punctuation are
695    ignored—except at a fourth level. The CLDR root locale ordering
696    is modified so that symbols are not affected by the
697    <em>shifted</em> option. That is, by default, symbols are not
698    “variable” in CLDR. So <em>shifted</em> only causes whitespace
699    and punctuation to be ignored, but not symbols (like ♥). The
700    DUCET behavior can be specified with a locale ID using the "kv"
701    keyword, to set the Variable section to include all of the
702    symbols below it, or be set parametrically where
703    implementations allow access.</p>
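    <p>The difference between the two behaviors can be sketched with
    a toy model. The following Java code is an illustration only:
    the class <code>ShiftedSketch</code> and its
    <code>MaxVariable</code> values are hypothetical names, characters
    are classified by Java's general categories rather than by
    collation weights, and "ignored" is modeled by simply dropping
    the character from a primary-level view of the text.</p>

```java
/** Toy model of "shifted" with a configurable highest variable group (hypothetical). */
final class ShiftedSketch {
    enum MaxVariable { SPACE, PUNCT, SYMBOL }

    /** True if the code point would be ignored on the primary level under "shifted". */
    static boolean isVariable(int cp, MaxVariable max) {
        if (Character.isWhitespace(cp)) return true;
        switch (Character.getType(cp)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return max.compareTo(MaxVariable.PUNCT) >= 0;
            case Character.MATH_SYMBOL:
            case Character.MODIFIER_SYMBOL:
            case Character.OTHER_SYMBOL:
                return max == MaxVariable.SYMBOL;
            default:
                return false;
        }
    }

    /** Primary-level view of the text: variable characters are dropped. */
    static String primaryView(String text, MaxVariable max) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().filter(cp -> !isVariable(cp, max)).forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "a-b \u2665c";
        System.out.println(primaryView(text, MaxVariable.PUNCT));  // symbols kept
        System.out.println(primaryView(text, MaxVariable.SYMBOL)); // symbols ignored too
    }
}
```

    <p>With <code>PUNCT</code> as the highest variable group, which
    models the CLDR root behavior described above, ♥ still
    distinguishes strings on the primary level. With
    <code>SYMBOL</code>, which models the DUCET-like behavior, it is
    ignored together with whitespace and punctuation.</p>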
704    <p>See also:</p>
705    <ul>
      <li><i>Section 3.4, <a href="#Setting_Options">Setting
707      Options</a></i></li>
708      <li><a href=
709      "https://www.unicode.org/charts/collation/">https://www.unicode.org/charts/collation/</a></li>
710    </ul>
711    <h3>2.3 <a name="tibetan_contractions" href=
712    "#tibetan_contractions" id="tibetan_contractions">Additional
713    contractions for Tibetan</a></h3>
714    <p>Ten contractions are added for Tibetan: Two to fulfill
715    <a href=
716    "https://www.unicode.org/reports/tr10/#WF5">well-formedness
717    condition 5</a>, and eight more to preserve the default order
718    for Tibetan. For details see <i>UTS #10, Section 3.8.2,
719    <a href="https://www.unicode.org/reports/tr10/#Well_Formed_DUCET">
720    Well-Formedness of the DUCET</a></i>.</p>
721    <h3>2.4 <a name="tailored_noncharacter_weights" href=
722    "#tailored_noncharacter_weights" id=
723    "tailored_noncharacter_weights">Tailored noncharacter
724    weights</a></h3>
725    <p>U+FFFE and U+FFFF have special tailorings:</p>
726    <blockquote>
727      <p><strong>U+FFFF:</strong> This code point is tailored to
728      have a primary weight higher than all other characters. This
729      allows the reliable specification of a range, such as “Sch” ≤
730      X ≤ “Sch\uFFFF”, to include all strings starting with "sch"
731      or equivalent.</p>
732      <p><strong>U+FFFE:</strong> This code point produces a CE
733      with minimal, unique weights on primary and identical levels.
734      For details see the <i><a href="#Algorithm_FFFE">CLDR
735      Collation Algorithm</a></i> above.</p>
736    </blockquote>
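    <p>The range technique for U+FFFF can be sketched in Python as
    follows. This is only an illustration: plain code point order on
    case-folded strings stands in for a real collator (Python's
    built-in string comparison is not the CLDR root collation, and
    U+FFFF is not the highest code point in that order), and
    <code>prefix_range</code> is a hypothetical helper name. The point
    is that appending U+FFFF to a prefix gives an upper bound for the
    strings that start with that prefix.</p>

```python
from bisect import bisect_left, bisect_right


def prefix_range(sorted_keys, prefix):
    """Return the keys k with prefix <= k <= prefix + U+FFFF."""
    lo = bisect_left(sorted_keys, prefix)
    hi = bisect_right(sorted_keys, prefix + "\uFFFF")
    return sorted_keys[lo:hi]


names = ["Sanchez", "Schmidt", "schneider", "Schulz", "Scott", "Sch"]
# casefold() only approximates "or equivalent"; a real implementation
# would compare with a collator (for example at primary strength).
keys = sorted(name.casefold() for name in names)
print(prefix_range(keys, "sch"))
```

    <p>With a CLDR-conformant collator the same bounds would be
    compared using the collator instead of code point order.</p>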
737    <p>UCA (beginning with version 6.3) also maps
738    <strong>U+FFFD</strong> to a special collation element with a
739    very high primary weight, so that it is reliably non-<a href=
740    "https://www.unicode.org/reports/tr10/#Variable_Weighting">variable</a>,
741    for use with <a href=
742    "https://www.unicode.org/reports/tr10/#Handling_Illformed">ill-formed
743    code unit sequences</a>.</p>
    <p>In CLDR, so as to maintain the special collation elements,
    <strong>U+FFFD..U+FFFF</strong> are not further tailorable, and
    nothing can tailor to them. That is, none of these code points
    can occur in a collation rule. For example, the following rules
    are illegal:</p>
749    <p><code>&amp;\uFFFF &lt; x</code></p>
    <p><code>&amp;x &lt;\uFFFF</code></p>
751    <p class="note"><b>Note:</b></p>
752    <ul>
753      <li class="note">Java uses an early version of this collation
754      syntax, but has not been updated recently. It does not
755      support any of the syntax marked with [...], and its default
756      table is not the DUCET nor the CLDR root collation.</li>
757    </ul>
758    <h3>2.5 <a name="Root_Data_Files" href="#Root_Data_Files" id=
759    "Root_Data_Files">Root Collation Data Files</a></h3>
760    <p>The CLDR root collation data files are in the CLDR
761    repository and release, under the path <a href=
762    "https://github.com/unicode-org/cldr/tree/latest/common/uca/">common/uca/</a>.</p>
763    <p>For most data files there are <strong>_SHORT</strong>
764    versions available. They contain the same data but only minimal
765    comments, to reduce the file sizes.</p>
766    <p>Comments with DUCET-style weights in files other than
767    allkeys_CLDR.txt and allkeys_DUCET.txt use the weights defined
768    in allkeys_CLDR.txt.</p>
769    <ul>
770      <li><strong>allkeys_CLDR</strong> - A file that provides a
771      remapping of UCA DUCET weights for use with CLDR.</li>
772      <li><strong>allkeys_DUCET</strong> - The same as DUCET
773      allkeys.txt, but in alternate=non-ignorable sort order, for
774      easier comparison with allkeys_CLDR.txt.</li>
775      <li>
776        <strong>FractionalUCA</strong> - A file that provides a
777        remapping of UCA DUCET weights for use with CLDR. The
778        weight values are modified:
779        <ul>
780          <li>The weights have variable length, with 1..4 bytes
781          each. Each secondary or tertiary weight currently uses at
782          most 2 bytes.</li>
783          <li>There are tailoring gaps between adjacent weights, so
784          that a number of characters can be tailored to sort
785          between any two root collation elements.</li>
786          <li>There are collation elements with primary weights at
787          the boundaries between reordering groups and Unicode
788          scripts, so that tailoring around the first or last
789          primary of a group/script results in new collation
790          elements that sort and reorder together with that group
791          or script. These boundary weights also define the primary
792          weight ranges for parametric group and script
793          reordering.</li>
794        </ul>An implementation may modify the weights further to
795        fit the needs of its data structures.
796      </li>
797      <li><strong>UCA_Rules</strong> - A file that specifies the
798      root collation order in the form of <a href=
799      "#Collation_Tailorings">tailoring rules</a>. This is only an
800      approximation of the FractionalUCA data, since the rule
801      syntax cannot express every detail of the collation elements.
802      For example, in the DUCET and in FractionalUCA, tertiary
803      differences are usually expressed with special tertiary
804      weights on all collation elements of an expansion, while a
805      typical from-rules builder will modify the tertiary weight of
806      only one of the collation elements.</li>
807      <li>
808        <strong>CollationTest_CLDR</strong> - The CLDR versions of
809        the CollationTest files, which use the tailorings for CLDR.
810        For information on the format, see <a href=
811        "https://www.unicode.org/Public/UCA/latest/CollationTest.html">
812        CollationTest.html</a> in the <a href=
813        "https://www.unicode.org/reports/tr10/#Data10">UCA data
814        directory</a>.
815        <ul>
816          <li>CollationTest_CLDR_NON_IGNORABLE.txt</li>
817          <li>CollationTest_CLDR_SHIFTED.txt</li>
818        </ul>
819      </li>
820    </ul>
821    <h3>2.6 <a name="Root_Data_File_Formats" href=
822    "#Root_Data_File_Formats" id="Root_Data_File_Formats">Root
823    Collation Data File Formats</a></h3>
824    <p>The file formats may change between versions of CLDR. The
825    formats for CLDR 23 and beyond are as follows. As usual, text
826    after a # is a comment.</p>
827    <h4>2.6.1 <a name="File_Format_allkeys_CLDR_txt" href=
828    "#File_Format_allkeys_CLDR_txt" id=
829    "File_Format_allkeys_CLDR_txt">allkeys_CLDR.txt</a></h4>
    <p>This file defines CLDR’s tailoring of the DUCET, as
    described in <i>Section 2, <a href="#Root_Collation">Root
    Collation</a></i>.</p>
833    <p>The format is similar to that of <a href=
834    "https://www.unicode.org/reports/tr10/#File_Format">allkeys.txt</a>,
835    although there may be some differences in whitespace.</p>
836    <h4>2.6.2 <a name="File_Format_FractionalUCA_txt" href=
837    "#File_Format_FractionalUCA_txt" id=
838    "File_Format_FractionalUCA_txt">FractionalUCA.txt</a></h4>
839    <p>The format is illustrated by the following sample lines,
840    with commentary afterwards.</p>
841    <pre>[UCA version = 6.0.0]</pre>
842    <blockquote>
843      <p>Provides the version number of the UCA table.</p>
844    </blockquote>
845    <pre>
846    [Unified_Ideograph 4E00..9FCC FA0E..FA0F FA11 FA13..FA14 FA1F FA21 FA23..FA24 FA27..FA29 3400..4DB5 20000..2A6D6 2A700..2B734 2B740..2B81D]</pre>
847    <blockquote>
848      <p>Lists the ranges of Unified_Ideograph characters in
849      collation order. (New in CLDR 24.) They map to collation
850      elements with <a href=
851      "https://www.unicode.org/reports/tr10/#Implicit_Weights">implicit
852      (constructed) primary weights</a>.</p>
853    </blockquote>
854    <pre>[radical 6=⼅亅:亅��了��-��亇��予㐧��-��争����亊��-����事㐨��-��������]
855[radical 210=⿑齊:齊����齋䶒䶓��齌������齍��-��齎����齏��-��]
856[radical 210'=⻬齐:齐齑]
857[radical end]</pre>
858    <blockquote>
859      <p>Data for Unihan radical-stroke order. (New in CLDR 26.)
860      Following the [Unified_Ideograph] line, a section of
861      <code>[radical ...]</code> lines defines a radical-stroke
862      order of the Unified_Ideograph characters.</p>
863      <p>For Han characters, an implementation may choose either to
864      implement the order defined in the UCA and the
865      [Unified_Ideograph] data, or to implement the order defined
866      by the <code>[radical ...]</code> lines. Beginning with CLDR
867      26, the CJK type="unihan" tailorings assume that the root
868      collation order sorts Han characters in Unihan radical-stroke
869      order according to the <code>[radical ...]</code> data. The
870      CollationTest_CLDR files only contain Han characters that are
871      in the same relative order using implicit weights or the
872      radical-stroke order.</p>
873      <p>The root collation radical-stroke order is derived from
874      the first (normative) values of the <a href=
875      "https://www.unicode.org/reports/tr38/#kRSUnicode">Unihan
876      kRSUnicode</a> field for each Han character. Han characters
877      are ordered by radical, with traditional forms sorting before
878      simplified ones. Characters with the same radical are ordered
879      by residual stroke count. Characters with the same
880      radical-stroke values are ordered by block and code point, as
881      for <a href=
882      "https://www.unicode.org/reports/tr10/#Implicit_Weights">UCA
883      implicit weights</a>.</p>
884      <p>There is one <code>[radical ...]</code> line per radical,
885      in the order of radical numbers. Each line shows the radical
886      number and the representative characters from the <a href=
887      "https://www.unicode.org/reports/tr44/#UCD_Files_Table">UCD
888      file CJKRadicals.txt</a>, followed by a colon (“:”) and the
889      Han characters with that radical in the order as described
890      above. A range like <code>万-丌</code> indicates that the code
891      points in that range sort in code point order.</p>
892      <p>The radical number and characters are informational. The
893      sort order is established only by the order of the
894      <code>[radical ...]</code> lines, and within each line by the
895      characters and ranges between the colon (“:”) and the bracket
896      (“]”).</p>
897      <p>Each Unified_Ideograph occurs exactly once. Only
898      Unified_Ideograph characters are listed on <code>[radical
899      ...]</code> lines.</p>
900      <p>This section is terminated with one <code>[radical
901      end]</code> line.</p>
902    </blockquote>
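    <p>A minimal Python sketch of reading one <code>[radical
    ...]</code> line and expanding its ranges is shown below. The
    function name <code>parse_radical_line</code> is hypothetical, and
    the second input line is made up for illustration (it reuses the
    range <code>万-丌</code> mentioned above and is not actual
    FractionalUCA.txt data). The <code>[radical end]</code> terminator
    is left to the caller.</p>

```python
def parse_radical_line(line):
    """Parse '[radical N=REPS:CHARS]' into (number, representatives, chars).

    Ranges such as 'X-Y' are expanded in code point order.
    """
    body = line.strip()
    if not (body.startswith("[radical ") and body.endswith("]")):
        raise ValueError("not a [radical ...] line")
    body = body[len("[radical "):-1]
    number, rest = body.split("=", 1)
    representatives, listed = rest.split(":", 1)
    chars = []
    i = 0
    while i < len(listed):
        ch = listed[i]
        if i + 2 < len(listed) and listed[i + 1] == "-":
            # Range: every code point from ch through the end character.
            end = listed[i + 2]
            chars.extend(chr(cp) for cp in range(ord(ch), ord(end) + 1))
            i += 3
        else:
            chars.append(ch)
            i += 1
    return number, representatives, chars


# Sample line from the text above.
print(parse_radical_line("[radical 210'=⻬齐:齐齑]"))
# Hypothetical line, only to show range expansion.
print(parse_radical_line("[radical 1=⼀一:一万-丌]"))
```

    <p>The radical number and representative characters are returned
    only as information; the sort order comes from the position of
    each character in the expanded list.</p>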
903    <pre>
904    0000; [,,]     # Zyyy Cc       [0000.0000.0000]        * &lt;NULL&gt;</pre>
905    <blockquote>
      <p>Provides a weight line. The first field (before the ";")
      is a hex code point sequence. The second field is a sequence
      of collation elements. Each collation element has 3 parts
      separated by commas: the primary weight, secondary weight,
      and tertiary weight. The tertiary weight actually consists of
      two components: the top two bits (0xC0) are used for the
      <em>case level</em>, and should be masked off where a case
      level is not used; the remaining bits hold the tertiary
      weight proper.</p>
914      <p>A weight is either empty (meaning a zero or ignorable
915      weight) or is a sequence of one or more bytes. The bytes are
916      interpreted as a "fraction", meaning that the ordering is 04
917      &lt; 05 05 &lt; 06. The weights are constructed so that no
918      weight is an initial subsequence of another: that is, having
919      both the weights 05 and 05 05 is illegal. The above line
920      consists of all ignorable weights.</p>
921      <p>The vertical bar (“|”) character is used to indicate
922      context, as in:</p>
923    </blockquote>
924    <pre>006C | 00B7; [, DB A9, 05]</pre>
925    <blockquote>
926      This example indicates that if U+00B7 appears immediately
927      after U+006C, it is given the corresponding collation element
928      instead. This syntax is roughly equivalent to the following
929      contraction, but is more efficient. For details see the
930      specification of <i><a href=
931      "#Context_Sensitive_Mappings">Context-Sensitive
932      Mappings</a></i> above.
933    </blockquote>
934    <pre>006C 00B7; <em>CE(006C)</em> [, DB A9, 05]</pre>
935    <blockquote>
      <p>Single-byte primary weights are given to particularly
      frequent characters, such as space, digits, and a-z. Other
      relatively frequent characters are given two-byte weights,
      while relatively infrequent characters are given three-byte
      weights. For example:</p>
941    </blockquote>
942    <pre>...
9430009; [03 05, 05, 05] # Zyyy Cc       [0100.0020.0002]        * &lt;CHARACTER TABULATION&gt;
944...
9451B60; [06 14 0C, 05, 05]    # Bali Po       [0111.0020.0002]        * BALINESE PAMENENG
946...
9470031; [14, 05, 05]    # Zyyy Nd       [149B.0020.0002]        * DIGIT ONE</pre>
948    <blockquote>
949      <p>The assignment of 2 vs 3 bytes does not reflect
950      importance, or exact frequency.</p>
951    </blockquote>
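    <p>The weight-line format and the "fraction" ordering described
    above can be sketched in Python as follows. This is a simplified
    reader under stated assumptions: it handles only plain lines
    (no <code>|</code> context and no <code>[U+...]</code>
    references), and the names <code>parse_weight_line</code> and
    <code>is_prefix_free</code> are hypothetical. Weights are kept as
    byte strings, so ordinary lexicographic byte comparison gives the
    fractional order.</p>

```python
def parse_weight_line(line):
    """Split 'CODEPOINTS; [p, s, t]... # comment' into (string, [CE, ...]).

    Each CE is a (primary, secondary, tertiary) tuple of bytes objects;
    an empty (ignorable) weight becomes b"".
    """
    data = line.split("#", 1)[0]          # text after '#' is a comment
    cps, ces = data.split(";", 1)
    string = "".join(chr(int(cp, 16)) for cp in cps.split())
    elements = []
    for ce in ces.split("]"):
        ce = ce.strip()
        if not ce:
            continue
        weights = ce.lstrip("[").split(",")
        elements.append(tuple(bytes.fromhex(w.strip()) for w in weights))
    return string, elements


def is_prefix_free(weights):
    """True if no non-empty weight is an initial subsequence of another."""
    ordered = sorted(set(weights))
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]) if a)


# Sample lines from the text above (comments shortened).
SAMPLE = [
    "1B60; [06 14 0C, 05, 05]    # BALINESE PAMENENG",
    "0031; [14, 05, 05]    # DIGIT ONE",
    "0009; [03 05, 05, 05] # CHARACTER TABULATION",
]
parsed = [parse_weight_line(line) for line in SAMPLE]
# Sort by the primary weight of the first collation element.
parsed.sort(key=lambda item: item[1][0][0])
print([hex(ord(s)) for s, _ in parsed])
```

    <p>Because the weights are prefix-free, comparing the byte
    strings never has to decide between a weight and a longer weight
    that starts with the same bytes.</p>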
952    <pre>
9533041; [76 06, 05, 03]   # Hira Lo       [3888.0020.000D]        * HIRAGANA LETTER SMALL A
9543042; [76 06, 05, 85]   # Hira Lo       [3888.0020.000E]        * HIRAGANA LETTER A
95530A1; [76 06, 05, 10]   # Kana Lo       [3888.0020.000F]        * KATAKANA LETTER SMALL A
95630A2; [76 06, 05, 9E]   # Kana Lo       [3888.0020.0011]        * KATAKANA LETTER A</pre>
957    <blockquote>
      <p>Beginning with CLDR 27, some primary or secondary
      collation elements may have below-common tertiary weights
      (e.g., <code>03</code>), in particular to allow normal
      Hiragana letters to have common tertiary weights.</p>
962    </blockquote>
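    <p>As a small illustration of the case bits in these tertiary
    weights, the Python sketch below separates the top two bits
    (0xC0) from the rest of a one-byte tertiary weight, using the four
    sample values above. Handling only single-byte tertiary weights is
    an assumption made to keep the example short, and
    <code>split_tertiary</code> is a hypothetical name.</p>

```python
CASE_MASK = 0xC0      # top two bits: used for the case level
TERTIARY_MASK = 0x3F  # remaining six bits: tertiary weight without case


def split_tertiary(byte):
    """Return (case_bits, tertiary_without_case) for a one-byte weight."""
    return byte & CASE_MASK, byte & TERTIARY_MASK


# Tertiary bytes from the sample lines above.
samples = {
    "HIRAGANA LETTER SMALL A": 0x03,
    "HIRAGANA LETTER A": 0x85,
    "KATAKANA LETTER SMALL A": 0x10,
    "KATAKANA LETTER A": 0x9E,
}
for name, weight in samples.items():
    case, tertiary = split_tertiary(weight)
    print(f"{name}: case={case:#04x} tertiary={tertiary:#04x}")
```

    <p>With the case bits masked off, HIRAGANA LETTER A is left with
    the tertiary byte 05, while HIRAGANA LETTER SMALL A keeps the
    lower value 03.</p>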
963    <pre># SPECIAL MAX/MIN COLLATION ELEMENTS
964FFFE; [02, 05, 05]     # Special LOWEST primary, for merge/interleaving
965FFFF; [EF FE, 05, 05]  # Special HIGHEST primary, for ranges</pre>
966    <blockquote>
967      <p>The two tailored noncharacters have their own primary
968      weights.</p>
969    </blockquote>
970    <pre>
971F967; [U+4E0D]  # Hani Lo       [FB40.0020.0002][CE0D.0000.0000]        * CJK COMPATIBILITY IDEOGRAPH-F967
9722F02; [U+4E36, 10]      # Hani So       [FB40.0020.0004][CE36.0000.0000]        * KANGXI RADICAL DOT
9732E80; [U+4E36, 70, 20]  # Hani So       [FB40.0020.0004][CE36.0000.0000][0000.00FC.0004]        * CJK RADICAL REPEAT</pre>
974    <blockquote>
975      <p>Some collation elements are specified by reference to
976      other mappings. This is particularly useful for Han
977      characters which are given implicit/constructed primary
978      weights; the reference to a Unified_Ideograph makes these
979      mappings independent of implementation details. This
980      technique may also be used in other mappings to show the
981      relationship of character variants.</p>
982      <p>The referenced character must have a mapping listed
983      earlier in the file, or the mapping must have been defined
984      via the [Unified_Ideograph] data line. The referenced
985      character must map to exactly one collation element.</p>
986      <p><code>[U+4E0D]</code> copies U+4E0D’s entire collation
987      element. <code>[U+4E36, 10]</code> copies U+4E36’s primary
988      and secondary weights and specifies a different tertiary
989      weight. <code>[U+4E36, 70, 20]</code> only copies U+4E36’s
990      primary weight and specifies other secondary and tertiary
991      weights.</p>
992      <p>FractionalUCA.txt does not have any explicit mappings for
993      implicit weights. Therefore, an implementation is free to
994      choose an algorithm for computing implicit weights according
995      to the principles specified in the UCA.</p>
996    </blockquote>
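    <p>The three reference forms can be resolved as in the following
    Python sketch. It assumes a table that already maps each
    referenced character to its single collation element; the primary
    weights in the table are placeholders chosen for the example
    (FractionalUCA.txt does not list implicit weights), and
    <code>resolve_reference</code> is a hypothetical name.</p>

```python
def resolve_reference(spec, table):
    """Resolve '[U+XXXX]', '[U+XXXX, t]' or '[U+XXXX, s, t]'.

    table maps a character to its one (primary, secondary, tertiary) CE.
    """
    parts = [p.strip() for p in spec.strip("[] ").split(",")]
    ref = chr(int(parts[0][2:], 16))      # drop the "U+" prefix
    primary, secondary, tertiary = table[ref]
    overrides = [bytes.fromhex(p) for p in parts[1:]]
    if len(overrides) == 1:
        # Copy primary and secondary, replace the tertiary weight.
        tertiary = overrides[0]
    elif len(overrides) == 2:
        # Copy only the primary weight.
        secondary, tertiary = overrides
    return primary, secondary, tertiary


# Placeholder primaries, not real data; 05 is the common byte.
table = {
    "\u4E0D": (b"\xE0\x10\x0D", b"\x05", b"\x05"),
    "\u4E36": (b"\xE0\x10\x36", b"\x05", b"\x05"),
}
for spec in ("[U+4E0D]", "[U+4E36, 10]", "[U+4E36, 70, 20]"):
    print(spec, resolve_reference(spec, table))
```

    <p>Because only the reference is stored, the result follows
    whatever implicit-weight scheme the implementation chooses for the
    referenced character.</p>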
997    <pre>
998FDD1 20AC;      [0D 20 02, 05, 05]      # CURRENCY first primary
999FDD1 0034;      [0E 02 02, 05, 05]      # DIGIT first primary starts new lead byte
1000FDD0 FF21;      [26 02 02, 05, 05]      # REORDER_RESERVED_BEFORE_LATIN first primary starts new lead byte
1001FDD1 004C;      [28 02 02, 05, 05]      # LATIN first primary starts new lead byte
1002FDD0 FF3A;      [5D 02 02, 05, 05]      # REORDER_RESERVED_AFTER_LATIN first primary starts new lead byte
1003FDD1 03A9;      [5F 04 02, 05, 05]      # GREEK first primary starts new lead byte (compressible)
1004FDD1 03E2;      [5F 60 02, 05, 05]      # COPTIC first primary (compressible)</pre>
1005    <blockquote>
1006      <p>These are special mappings with primaries at the
1007      boundaries of scripts and reordering groups. They serve as
1008      tailoring boundaries, so that tailoring near the first or
1009      last character of a script or group places the tailored item
1010      into the same group. Beginning with CLDR 24, each of these is
1011      a contraction of U+FDD1 with a character of the corresponding
1012      script (or of the General_Category [Z, P, S, Sc, Nd]
1013      corresponding to a special reordering group), mapping to the
1014      first possible primary weight per script or group. They can
1015      be enumerated for implementations of <a href=
1016      "#Collation_Indexes">Collation Indexes</a>. (Earlier versions
1017      mapped contractions with U+FDD0 to the last primary weights
1018      of each group but not each script.)</p>
1019      <p>Beginning with CLDR 27, these mappings alone define the
1020      boundaries for reordering single scripts. (There are no
1021      mappings for Hrkt, Hans, or Hant because they are not fully
1022      distinct scripts; they share primary weights with other
1023      scripts: Hrkt=Hira=Kana &amp; Hans=Hant=Hani.) There are some
1024      reserved ranges, beginning at boundaries marked with U+FDD0
1025      plus following characters as shown above. The reserved ranges
1026      are not used for collation elements and are not available for
1027      tailoring.</p>
1028      <p>Some primary lead bytes must be reserved so that
1029      reordering of scripts along partial-lead-byte boundaries can
1030      “split” the primary lead byte and use up a reserved byte.
1031      This is for implementations that write sort keys, which must
1032      reorder primary weights by offsetting them by whole lead
1033      bytes. There are reorder-reserved ranges before and after
1034      Latin, so that reordering scripts with few primary lead bytes
1035      relative to Latin can move those scripts into the reserved
1036      ranges without changing the primary weights of any other
1037      script. Each of these boundaries begins with a new two-byte
1038      primary; that is, no two groups/scripts/ranges share the top
1039      16 bits of their primary weights.</p>
1040    </blockquote>
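    <p>The boundary mappings can be enumerated roughly as in the
    Python sketch below, which collects the U+FDD1 contractions from
    the sample lines above and checks that no two of them share the
    top 16 bits of their primary weights. The parsing is deliberately
    minimal and <code>boundary_primaries</code> is a hypothetical
    name; the U+FDD0 reserved-range lines are skipped here.</p>

```python
def boundary_primaries(lines):
    """Map the second character of each 'FDD1 XXXX' mapping to its primary."""
    result = {}
    for line in lines:
        data = line.split("#", 1)[0]
        cps, ce = data.split(";", 1)
        cps = cps.split()
        if len(cps) == 2 and cps[0] == "FDD1":
            primary = ce.strip().lstrip("[").split(",", 1)[0]
            result[chr(int(cps[1], 16))] = bytes.fromhex(primary)
    return result


# Sample lines from the text above (comments shortened).
SAMPLE = [
    "FDD1 20AC;      [0D 20 02, 05, 05]      # CURRENCY first primary",
    "FDD1 0034;      [0E 02 02, 05, 05]      # DIGIT first primary",
    "FDD0 FF21;      [26 02 02, 05, 05]      # REORDER_RESERVED_BEFORE_LATIN",
    "FDD1 004C;      [28 02 02, 05, 05]      # LATIN first primary",
    "FDD0 FF3A;      [5D 02 02, 05, 05]      # REORDER_RESERVED_AFTER_LATIN",
    "FDD1 03A9;      [5F 04 02, 05, 05]      # GREEK first primary",
    "FDD1 03E2;      [5F 60 02, 05, 05]      # COPTIC first primary",
]
boundaries = boundary_primaries(SAMPLE)
top16 = {primary[:2] for primary in boundaries.values()}
print(len(boundaries), len(top16))
```

    <p>Greek and Coptic share the lead byte 5F in this sample but
    still differ in the second byte, which is what the 16-bit rule
    requires.</p>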
1041    <pre>
1042FDD0 0034;      [11, 05, 05]    # lead byte for numeric sorting</pre>
1043    <blockquote>
1044      <p>This mapping specifies the lead byte for numeric sorting.
1045      It must be different from the lead byte of any other primary
1046      weight, otherwise numeric sorting would generate ill-formed
1047      collation elements. Therefore, this mapping itself must be
1048      excluded from the set of regular mappings. This value can be
1049      ignored by implementations that do not support numeric
1050      sorting. (Other contractions with U+FDD0 can normally be
1051      ignored altogether.)</p>
1052    </blockquote>
1053    <pre>
1054# HOMELESS COLLATION ELEMENTS
1055FDD0 0063; [, 97, 3D]       # [15E4.0020.0004] [1844.0020.0004] [0000.0041.001F]    * U+01C6 LATIN SMALL LETTER DZ WITH CARON
1056FDD0 0064; [, A7, 09]       # [15D1.0020.0004] [0000.0056.0004]     * U+1DD7 COMBINING LATIN SMALL LETTER C CEDILLA
1057FDD0 0065; [, B1, 09]       # [1644.0020.0004] [0000.0061.0004]     * U+A7A1 LATIN SMALL LETTER G WITH OBLIQUE STROKE</pre>
1058    <blockquote>
      <p>The DUCET has some weights that don't correspond directly
      to a character. So that implementations can have a mapping
      for each collation element (necessary for certain
      implementations of tailoring), special sequences are
      constructed for those weights. These collation elements can
      normally be ignored.</p>
1065    </blockquote>
1066    <p>Next, a number of tables are defined. The function of each
1067    of the tables is summarized afterwards.</p>
1068    <pre># VALUES BASED ON UCA
1069...
1070[first regular [0D 0A, 05, 05]] # U+0060 GRAVE ACCENT
1071[last regular [7A FE, 05, 05]] # U+1342E EGYPTIAN HIEROGLYPH AA032
1072[first implicit [E0 04 06, 05, 05]] # CONSTRUCTED
1073[last implicit [E4 DF 7E 20, 05, 05]] # CONSTRUCTED
1074[first trailing [E5, 05, 05]] # CONSTRUCTED
1075[last trailing [E5, 05, 05]] # CONSTRUCTED
1076...</pre>
1077    <blockquote>
1078      <p>This table summarizes ranges of important groups of
1079      characters for implementations.</p>
1080    </blockquote>
1081    <pre># Top Byte =&gt; Reordering Tokens
1082[top_byte     00      TERMINATOR ]    #       [0]     TERMINATOR=1
1083[top_byte     01      LEVEL-SEPARATOR ]       #       [0]     LEVEL-SEPARATOR=1
1084[top_byte     02      FIELD-SEPARATOR ]       #       [0]     FIELD-SEPARATOR=1
1085[top_byte     03      SPACE ] #       [9]     SPACE=1 Cc=6 Zl=1 Zp=1 Zs=1
1086...</pre>
1087    <blockquote>
1088      <p>This table defines the reordering groups, for script
1089      reordering. The table maps from the first bytes of the
1090      fractional weights to a reordering token. The format is
1091      "[top_byte " byte-value reordering-token "COMPRESS"? "]". The
1092      "COMPRESS" value is present when there is only one byte in
1093      the reordering token, and primary-weight compression can be
1094      applied. Most reordering tokens are script values; others are
1095      special-purpose values, such as PUNCTUATION. Beginning with
1096      CLDR 24, this table precedes the regular mappings, so that
1097      parsers can use this information while processing and
1098      optimizing mappings. Beginning with CLDR 27, most of this
1099      data is irrelevant because single scripts can be reordered.
1100      Only the "COMPRESS" data is still useful.</p>
1101    </blockquote>
1102    <pre># Reordering Tokens =&gt; Top Bytes
1103[reorderingTokens     Arab    61=910 62=910 ]
1104[reorderingTokens     Armi    7A=22 ]
1105[reorderingTokens     Armn    5F=82 ]
1106[reorderingTokens     Avst    7A=54 ]
1107...</pre>
1108    <blockquote>
1109      <p>This table is an inverse mapping from reordering token to
1110      top byte(s). In terms like "61=910", the first value is the
1111      top byte, while the second is informational, indicating the
1112      number of primaries assigned with that top byte.</p>
1113    </blockquote>
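    <p>A minimal Python sketch of reading the
    <code>[reorderingTokens ...]</code> lines is given below, using
    the sample lines above. The function name
    <code>parse_reordering_tokens</code> is hypothetical; the second
    number in each term is kept only because it is present, since the
    text describes it as informational.</p>

```python
def parse_reordering_tokens(line):
    """Parse '[reorderingTokens Arab 61=910 62=910 ]'.

    Returns (token, {top_byte: number_of_primaries}).
    """
    fields = line.strip().strip("[]").split()
    if not fields or fields[0] != "reorderingTokens":
        raise ValueError("not a reorderingTokens line")
    token = fields[1]
    top_bytes = {}
    for term in fields[2:]:
        top_byte, count = term.split("=")
        top_bytes[int(top_byte, 16)] = int(count)
    return token, top_bytes


SAMPLE = [
    "[reorderingTokens     Arab    61=910 62=910 ]",
    "[reorderingTokens     Armi    7A=22 ]",
    "[reorderingTokens     Armn    5F=82 ]",
    "[reorderingTokens     Avst    7A=54 ]",
]
# Invert the table again: top byte -> tokens that use it.
by_byte = {}
for line in SAMPLE:
    token, top_bytes = parse_reordering_tokens(line)
    for top_byte in top_bytes:
        by_byte.setdefault(top_byte, []).append(token)
print(by_byte)
```

    <p>In this sample, the top byte 7A is shared by Armi and Avst,
    while Arab spans the two top bytes 61 and 62.</p>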
1114    <pre># General Categories =&gt; Top Byte
1115[categories   Cc      03{SPACE}=6 ]
1116[categories   Cf      77{Khmr Tale Talu Lana Cham Bali Java Mong Olck Cher Cans Ogam Runr Orkh Vaii Bamu}=2 ]
1117[categories   Lm      0D{SYMBOL}=25 0E{SYMBOL}=22 27{Latn}=12 28{Latn}=12 29{Latn}=12 2A{Latn}=12...</pre>
1118    <blockquote>
1119      <p>This table is informational, providing the top bytes,
1120      scripts, and primaries associated with each general category
1121      value.</p>
1122    </blockquote>
1123    <pre># FIXED VALUES
1124[fixed first implicit byte E0]
1125[fixed last implicit byte E4]
1126[fixed first trail byte E5]
1127[fixed last trail byte EF]
1128[fixed first special byte F0]
1129[fixed last special byte FF]
1130
1131[fixed secondary common byte 05]
1132[fixed last secondary common byte 45]
1133[fixed first ignorable secondary byte 80]
1134
1135[fixed tertiary common byte 05]
1136[fixed first ignorable tertiary byte 3C]
1137                </pre>
1138    <blockquote>
1139      <p>The final table gives certain hard-coded byte values. The
1140      "trail" area is provided for implementation of the "trailing
1141      weights" as described in the UCA.</p>
1142    </blockquote>
1143    <p class="note">Note: The particular primary lead bytes for
1144    Hani vs. IMPLICIT vs. TRAILING are only an example. An
1145    implementation is free to move them if it also moves the
1146    explicit TRAILING weights. This affects only a small number of
1147    explicit mappings in FractionalUCA.txt, such as for U+FFFD,
1148    U+FFFF, and the “unassigned first primary”. It is possible to
1149    use no SPECIAL bytes at all, and to use only the one primary
1150    lead byte FF for TRAILING weights.</p>
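    <p>For illustration, the fixed byte ranges listed above can be
    used to classify a primary weight by its lead byte, as in this
    Python sketch. The ranges are copied from the sample table and, as
    the note says, are only an example that an implementation may
    change; the name <code>classify_lead_byte</code> and the label
    "other" are hypothetical.</p>

```python
# Lead-byte ranges from the "FIXED VALUES" sample (inclusive).
RANGES = {
    "implicit": (0xE0, 0xE4),
    "trailing": (0xE5, 0xEF),
    "special": (0xF0, 0xFF),
}


def classify_lead_byte(primary):
    """Return the range name for a non-empty primary weight (bytes)."""
    lead = primary[0]
    for name, (first, last) in RANGES.items():
        if first <= lead <= last:
            return name
    return "other"


for hex_primary in ("7A FE", "E0 04 06", "E5", "EF FE"):
    print(hex_primary, classify_lead_byte(bytes.fromhex(hex_primary)))
```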
1151    <h4>2.6.3 <a name="File_Format_UCA_Rules_txt" href=
1152    "#File_Format_UCA_Rules_txt" id=
1153    "File_Format_UCA_Rules_txt">UCA_Rules.txt</a></h4>
    <p>The format for this file uses the CLDR collation syntax; see
    <i>Section 3, <a href="#Collation_Tailorings">Collation
    Tailorings</a></i>.</p>
1157    <h2>3 <a name="Collation_Tailorings" href=
1158    "#Collation_Tailorings" id="Collation_Tailorings">Collation
1159    Tailorings</a></h2>
1160    <p class="dtd">&lt;!ELEMENT collations (alias |
1161    (defaultCollation?, collation*, special*)) &gt;</p>
1162    <p class="dtd">&lt;!ELEMENT defaultCollation ( #PCDATA )
1163    &gt;</p>
1164    <p>This element of the LDML format contains one or more
1165    <span class="element">collation</span> elements, distinguished
1166    by type. Each <span class="element">collation</span> contains
1167    elements with parametric settings, or rules that specify a
1168    certain sort order, as a tailoring of the root order, or
1169    both.</p>
1170    <p class="note">Note: CLDR collation tailoring data should
1171    follow the <a href=
1172    "http://cldr.unicode.org/index/cldr-spec/collation-guidelines">CLDR
1173    Collation Guidelines</a>.</p>
1174    <h3>3.1 <a name="Collation_Types" href="#Collation_Types" id=
1175    "Collation_Types">Collation Types</a></h3>
1176    <p>Each locale may have multiple sort orders (types). The
1177    <span class="element">defaultCollation</span> element defines
1178    the default tailoring for a locale and its sublocales. For
1179    example:</p>
1180    <ul>
1181      <li>root.xml:
1182      <code>&lt;defaultCollation&gt;standard&lt;/defaultCollation&gt;</code></li>
1183      <li>zh.xml:
1184      <code>&lt;defaultCollation&gt;pinyin&lt;/defaultCollation&gt;</code></li>
1185      <li>zh_Hant.xml:
1186      <code>&lt;defaultCollation&gt;stroke&lt;/defaultCollation&gt;</code></li>
1187    </ul>
    <p>To allow implementations in reduced memory environments to
    use CJK sorting, there are also short forms of each of these
    collation sequences. These cover only the most commonly used
    characters, and are marked with <span class=
    "attribute">alt</span>="<span class=
    "attributeValue">short</span>".</p>
1194    <p>A collation type name that starts with "private-", for
1195    example, "private-kana", indicates an incomplete tailoring that
1196    is only intended for import into one or more other tailorings
1197    (usually for sharing common rules). It does not establish a
1198    complete sort order. An implementation should not build data
1199    tables for a private collation type, and should not include a
1200    private collation type in a list of available types.</p>
1201    <p class="note"><b>Note:</b></p>
1202    <ul>
1203      <li>There is an on-line demonstration of collation at
1204      [<a href="tr35.html#LocaleExplorer">LocaleExplorer</a>] that
1205      uses the same rule syntax. (Pick the locale and scroll to
1206      "Collation Rules", near the end.)</li>
1207      <li class="note">In CLDR 23 and before, LDML collation files
1208      used an XML format. Starting with CLDR 24, the XML collation
1209      syntax is deprecated and no longer used. See the <i><a href=
1210      "https://www.unicode.org/reports/tr35/tr35-31/tr35-collation.html#Collation_Tailorings">
1211      CLDR 23 version of this document</a></i> for details about
1212      the XML collation syntax.</li>
1213    </ul>
1214    <h4>3.1.1 <a name="Collation_Type_Fallback" href=
1215    "#Collation_Type_Fallback" id=
1216    "Collation_Type_Fallback">Collation Type Fallback</a></h4>
1217    <p>When loading a requested tailoring from its data file and
1218    the parent file chain, use the following type fallback to find
1219    the tailoring.</p>
1220    <ol>
1221      <li>Determine the default type from the
1222      &lt;defaultCollation&gt; element; map the default type to its
1223      alias if one is defined. If there is no
1224      &lt;defaultCollation&gt; element, then use "standard" as the
1225      default type.</li>
1226      <li>If the request language tag specifies the collation type
1227      (keyword "co"), then map it to its alias if one is defined
1228      (e.g., "-co-phonebk" → "phonebook"). If the language tag does
1229      not specify the type, then use the default type.</li>
1230      <li>Use the &lt;collation&gt; element with this type.</li>
1231      <li>If it does not exist, and the type starts with "search"
1232      but is longer, then set the type to "search" and use that
1233      &lt;collation&gt; element. (For example, "searchjl" →
1234      "search".)</li>
1235      <li>If it does not exist, and the type is not the default
1236      type, then set the type to the default type and use that
1237      &lt;collation&gt; element.</li>
1238      <li>If it does not exist, and the type is not "standard",
1239      then set the type to "standard" and use that
1240      &lt;collation&gt; element.</li>
1241      <li>If it does not exist, then use the CLDR root
1242      collation.</li>
1243    </ol>
1244    <p class="note">Note that the CLDR collation/root.xml contains
1245    &lt;defaultCollation&gt;standard&lt;/defaultCollation&gt;,
1246    &lt;collation type="standard"&gt; (with an empty tailoring, so
1247    this is the same as the CLDR root collation), and &lt;collation
1248    type="search"&gt;.</p>
1249    <p>For example, assume that we have collation data for the
1250    following tailorings. ("da/search" is shorthand for
1251    "da-u-co-search".)</p>
1252    <ul>
1253      <li>root/defaultCollation=standard</li>
1254      <li>root/standard (this is the same as “the CLDR root
1255      collator”)</li>
1256      <li>root/search</li>
1257      <li>da/standard</li>
1258      <li>da/search</li>
1259      <li>el/standard</li>
1260      <li>ko/standard</li>
1261      <li>ko/search</li>
1262      <li>ko/searchjl</li>
1263      <li>zh/defaultCollation=pinyin</li>
1264      <li>zh/pinyin</li>
1265      <li>zh/stroke</li>
1266      <li>zh-Hant/defaultCollation=stroke</li>
1267    </ul>
1268    <table>
1269      <caption>
1270        <a name=
1271        "Sample_requested_and_actual_collation_locales_and_types"
1272        href=
1273        "#Sample_requested_and_actual_collation_locales_and_types"
1274        id=
1275        "Sample_requested_and_actual_collation_locales_and_types">Sample
1276        requested and actual collation locales and types</a>
1277      </caption>
1278      <tr>
1279        <th>requested</th>
1280        <th>actual</th>
1281        <th>comment</th>
1282      </tr>
1283      <tr>
1284        <td>da/phonebook</td>
1285        <td>da/standard</td>
1286        <td>default type for Danish</td>
1287      </tr>
1288      <tr>
1289        <td>zh</td>
1290        <td>zh/pinyin</td>
1291        <td>default type for zh</td>
1292      </tr>
1293      <tr>
1294        <td>zh/standard</td>
1295        <td>root/standard</td>
1296        <td>no "standard" tailoring for zh, falls back to root</td>
1297      </tr>
1298      <tr>
1299        <td>zh/phonebook</td>
1300        <td>zh/pinyin</td>
1301        <td>default type for zh</td>
1302      </tr>
1303      <tr>
1304        <td>zh-Hant/phonebook</td>
1305        <td>zh/stroke</td>
1306        <td>default type for zh-Hant is "stroke"</td>
1307      </tr>
1308      <tr>
1309        <td>da/searchjl</td>
1310        <td>da/search</td>
1311        <td>"search.+" falls back to "search"</td>
1312      </tr>
1313      <tr>
1314        <td>el/search</td>
1315        <td>root/search</td>
1316        <td>no "search" tailoring for Greek</td>
1317      </tr>
1318      <tr>
1319        <td>el/searchjl</td>
1320        <td>root/search</td>
1321        <td>"search.+" falls back to "search", found in root</td>
1322      </tr>
1323      <tr>
1324        <td>ko/searchjl</td>
1325        <td>ko/searchjl</td>
1326        <td>requested data is actually available</td>
1327      </tr>
1328    </table>
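    <p>The rows of the table above are consistent with a lookup that
    tries the requested type, then "search" for longer "search.+"
    types, then the locale's default type, then "standard". The
    following Python sketch reproduces the table under that reading.
    It is illustrative only: the names <code>TAILORINGS</code>,
    <code>DEFAULT_TYPE</code>, <code>parent_chain</code>,
    <code>find</code>, and <code>resolve</code> are hypothetical, the
    root entries are assumed, and parent locales are found by simple
    truncation of the identifier.</p>

```python
# Available data, modeled on the sample list above.
# Assumption: root provides "standard" and "search" and defaults to "standard".
TAILORINGS = {
    "root": {"standard", "search"},
    "da": {"standard", "search"},
    "el": {"standard"},
    "ko": {"standard", "search", "searchjl"},
    "zh": {"pinyin", "stroke"},
}
DEFAULT_TYPE = {"root": "standard", "zh": "pinyin", "zh-Hant": "stroke"}


def parent_chain(locale):
    """Yield the locale, its truncated parents, and finally root."""
    while locale and locale != "root":
        yield locale
        locale = locale.rpartition("-")[0]
    yield "root"


def find(locale, coll_type):
    """Return the first locale in the chain that has data for coll_type."""
    for loc in parent_chain(locale):
        if coll_type in TAILORINGS.get(loc, ()):
            return loc
    return None


def resolve(requested):
    """Map 'locale' or 'locale/type' to the actual 'locale/type'."""
    locale, _, coll_type = requested.partition("/")
    # The nearest defaultCollation value in the parent chain wins.
    default = next(DEFAULT_TYPE[loc] for loc in parent_chain(locale)
                   if loc in DEFAULT_TYPE)
    candidates = [coll_type or default]
    if coll_type.startswith("search") and coll_type != "search":
        candidates.append("search")  # "search.+" falls back to "search"
    candidates += [default, "standard"]
    for candidate in candidates:
        found = find(locale, candidate)
        if found:
            return f"{found}/{candidate}"
    return "root/standard"


for request in ("da/phonebook", "zh", "zh-Hant/phonebook", "el/searchjl"):
    print(request, "->", resolve(request))
```

    <p>A real implementation would take the parent chain and the
    available types from the CLDR data files rather than from
    hard-coded tables.</p>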
1329    <h3>3.2 <a name="Collation_Version" href="#Collation_Version"
1330    id="Collation_Version">Version</a></h3>
    <p>The version attribute specifies a particular version of the
    UCA. It is optional; supply it when the results need to be
    identical on different systems. If it is not supplied, then the
    version is assumed to be the same as the Unicode version for the
    system as a whole.</p>
1336    <blockquote>
1337      <p class="note"><b>Note:</b> For version 3.1.1 of the UCA,
1338      the version of Unicode must also be specified with any
1339      versioning information; an example would be "3.1.1/3.2" for
1340      version 3.1.1 of the UCA, for version 3.2 of Unicode. This
1341      was changed by decision of the UTC, so that dual versions
1342      were no longer necessary. So for UCA 4.0 and beyond, the
1343      version just has a single number.</p>
1344    </blockquote>
1345    <h3>3.3 <a name="Collation_Element" href="#Collation_Element"
1346    id="Collation_Element">Collation Element</a></h3>
1347    <p class="dtd">&lt;!ELEMENT collation (alias | (cr*, special*))
1348    &gt;</p>
1349    <p>The tailoring syntax is designed to be independent of the
1350    actual weights used in any particular UCA table. That way the
1351    same rules can be applied to UCA versions over time, even if
1352    the underlying weights change. The following illustrates the
1353    overall structure of a <span class=
1354    "element">collation</span>:</p>
1355    <pre>&lt;collation type="phonebook"&gt;
1356  &lt;cr&gt;&lt;![CDATA[
1357    [caseLevel on]
1358    &amp;c &lt; k
1359  ]]&gt;&lt;/cr&gt;
1360&lt;/collation&gt;</pre>
1361    <h3>3.4 <a name="Setting_Options" href="#Setting_Options" id=
1362    "Setting_Options">Setting Options</a></h3>
1363    <p>Parametric settings can be specified in language tags or in
1364    rule syntax (in the form <code>[keyword value]</code> ). For
1365    example, <code>-ks-level2</code> or <code>[strength 2]</code>
1366    will only compare strings based on their primary and secondary
1367    weights.</p>
1368    <p>If a setting is not present, the CLDR default (or the
1369    default for the locale, if there is one) is used. That default
1370    is listed in bold italics. Where there is a UCA default that is
1371    different, it is listed in bold with (<strong>UCA
1372    default</strong>). Note that the default value for a locale may
1373    be different than the normal default value for the setting.</p>
1374    <table>
1375      <caption>
1376        <a name="Collation_Settings" href="#Collation_Settings" id=
1377        "Collation_Settings">Collation Settings</a>
1378      </caption>
1379      <tr>
1380        <th>BCP47 Key</th>
1381        <th>BCP47 Value</th>
1382        <th>Rule Syntax</th>
1383        <th>Description</th>
1384      </tr>
1385      <tr>
1386        <td rowspan="5">ks</td>
1387        <td>level1</td>
1388        <td><code>[strength 1]</code><br>
1389        (primary)</td>
1390        <td rowspan="5">Sets the default strength for comparison,
1391        as described in the [<a href=
1392        "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>].
        <em>Note that a strength setting of greater than 3 may have
        the same effect as <strong>identical</strong>, depending on
        the locale and implementation.</em></td>
1396      </tr>
1397      <tr>
1398        <td>level2</td>
1399        <td><code>[strength 2]</code><br>
1400        (secondary)</td>
1401      </tr>
1402      <tr>
1403        <td>level3</td>
1404        <td><em><strong><code>[strength 3]</code><br>
1405        (tertiary)</strong></em></td>
1406      </tr>
1407      <tr>
1408        <td>level4</td>
1409        <td><code>[strength 4]</code><br>
1410        (quaternary)</td>
1411      </tr>
1412      <tr>
1413        <td>identic</td>
1414        <td><code>[strength I]</code><br>
1415        (identical)</td>
1416      </tr>
1417      <tr>
1418        <td rowspan="3">ka</td>
1419        <td>noignore</td>
1420        <td><i><strong><code>[alternate
1421        non-ignorable]</code></strong></i><br></td>
1422        <td rowspan="3">Sets alternate handling for variable
1423        weights, as described in [<a href=
1424        "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>],
1425        where "shifted" causes certain characters to be ignored in
1426        comparison. <em>The default for LDML is different than it
1427        is in the UCA. In LDML, the default for alternate handling
1428        is <strong>non-ignorable</strong>, while in UCA it is
1429        <strong>shifted</strong>. In addition, in LDML only
1430        whitespace and punctuation are variable by
1431        default.</em></td>
1432      </tr>
1433      <tr>
1434        <td>shifted</td>
1435        <td><strong><code>[alternate shifted]</code><br>
1436        (UCA default)</strong></td>
1437      </tr>
1438      <tr>
1439        <td><em>n/a</em></td>
1440        <td><i>n/a</i><br>
1441        (blanked)</td>
1442      </tr>
1443      <tr>
1444        <td rowspan="2">kb</td>
1445        <td>true</td>
1446        <td><code>[backwards 2]</code></td>
1447        <td rowspan="2">Sets the comparison for the second level to
1448        be <strong>backwards</strong>, as described in [<a href=
1449        "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>].</td>
1450      </tr>
1451      <tr>
1452        <td>false</td>
1453        <td><i><strong>n/a</strong></i></td>
1454      </tr>
1455      <tr>
1456        <td rowspan="2">kk</td>
1457        <td>true</td>
1458        <td><strong><code>[normalization on]</code><br>
1459        (UCA default)</strong></td>
1460        <td rowspan="2">If <strong>on</strong>, then the normal
1461        [<a href=
1462        "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>]
1463        algorithm is used. If <strong>off</strong>, then most
1464        strings should still sort correctly despite not normalizing
1465        to NFD first.<br>
1466        <em>Note that the default for CLDR locales may be different
1467        than in the UCA. The rules for particular locales have it
1468        set to <strong>on</strong>: those locales whose exemplar
1469        characters (in forms commonly interchanged) would be
1470        affected by normalization.</em></td>
1471      </tr>
1472      <tr>
1473        <td>false</td>
1474        <td><i><strong><code>[normalization
1475        off]</code></strong></i></td>
1476      </tr>
1477      <tr>
1478        <td rowspan="2">kc</td>
1479        <td>true</td>
1480        <td><code>[caseLevel on]</code></td>
        <td rowspan="2">If set to <strong>on</strong>, a level
        consisting only of case characteristics will be inserted in
        front of the tertiary level, as a "Level 2.5". To ignore
        accents but take case into account, set strength to
        <strong>primary</strong> and case level to
        <strong>on</strong>. For details, see <em>Section 3.14,
        <a href="#Case_Parameters">Case Parameters</a></em>.</td>
1488      </tr>
1489      <tr>
1490        <td>false</td>
1491        <td><i><strong><code>[caseLevel
1492        off]</code></strong></i></td>
1493      </tr>
1494      <tr>
1495        <td rowspan="3">kf</td>
1496        <td>upper</td>
1497        <td><code>[caseFirst upper]</code></td>
        <td rowspan="3">If set to <strong>upper</strong>, causes
        upper case to sort before lower case. If set to
        <strong>lower</strong>, causes lower case to sort before
        upper case. Useful for locales whose ordering is already
        supported but which require a different order of cases.
        Affects case and tertiary levels. For details, see
        <em>Section 3.14, <a href="#Case_Parameters">Case
        Parameters</a></em>.</td>
1505      </tr>
1506      <tr>
1507        <td>lower</td>
1508        <td><code>[caseFirst lower]</code></td>
1509      </tr>
1510      <tr>
1511        <td>false</td>
1512        <td><i><strong><code>[caseFirst
1513        off]</code></strong></i></td>
1514      </tr>
1515      <tr>
1516        <td rowspan="2">kh</td>
1517        <td>true<br>
        <i><strong>Deprecated:</strong></i> Use rules with
        quaternary relations instead.</td>
1520        <td><code>[hiraganaQ on]</code></td>
        <td rowspan="2">Controls special treatment of Hiragana code
        points on the quaternary level. If turned <strong>on</strong>,
        Hiragana code points will get lower values than all the
        other non-variable code points in <strong>shifted</strong>.
        That is, the normal Level 4 value for a regular collation
        element is FFFF, as described in [<a href=
        "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>],
        <em>Section 3.6, <a href=
        "https://www.unicode.org/reports/tr10/#Variable_Weighting">Variable
        Weighting</a></em>. This is changed to FFFE for
        [:script=Hiragana:] characters. The strength must be
        greater than or equal to quaternary if this attribute is to
        have any effect.</td>
1534      </tr>
1535      <tr>
1536        <td>false</td>
1537        <td><i><strong><code>[hiraganaQ
1538        off]</code></strong></i></td>
1539      </tr>
1540      <tr>
1541        <td rowspan="2">kn</td>
1542        <td>true</td>
1543        <td><code>[numericOrdering on]</code></td>
1544        <td rowspan="2">If set to <strong>on</strong>, any sequence
1545        of Decimal Digits (General_Category = Nd in the [<a href=
1546        "https://www.unicode.org/reports/tr41/#UAX44">UAX44</a>]) is
1547        sorted at a primary level with its numeric value. For
1548        example, "A-21" &lt; "A-123". The computed primary weights
1549        are all at the start of the <strong>digit</strong>
1550        reordering group. Thus with an untailored UCA table, "a$"
1551        &lt; "a0" &lt; "a2" &lt; "a12" &lt; "a⓪" &lt; "aa".</td>
1552      </tr>
1553      <tr>
1554        <td>false</td>
1555        <td><i><strong><code>[numericOrdering
1556        off]</code></strong></i></td>
1557      </tr>
1558      <tr>
1559        <td>kr</td>
1560        <td>a sequence of one or more reorder codes: <strong>space,
1561        punct, symbol, currency, digit</strong>, or any BCP47
1562        script ID</td>
1563        <td><code>[reorder Grek digit]</code></td>
1564        <td>Specifies a reordering of scripts or other significant
1565        blocks of characters such as symbols, punctuation, and
1566        digits. For the precise meaning and usage of the reorder
1567        codes, see <em>Section 3.13, <a href=
1568        "#Script_Reordering">Collation Reordering</a>.</em></td>
1569      </tr>
1570      <tr>
1571        <td rowspan="4">kv</td>
1572        <td>space</td>
1573        <td><code>[maxVariable space]</code></td>
1574        <td rowspan="4">Sets the variable top to the top of the
1575        specified reordering group. All code points with primary
1576        weights less than or equal to the variable top will be
1577        considered variable, and thus affected by the alternate
1578        handling. Variables are ignorable by default in [<a href=
1579        "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>], but
1580        not in CLDR.</td>
1581      </tr>
1582      <tr>
1583        <td>punct</td>
1584        <td><i><strong><code>[maxVariable
1585        punct]</code></strong></i></td>
1586      </tr>
1587      <tr>
1588        <td>symbol</td>
1589        <td><strong><code>[maxVariable symbol]</code><br>
1590        (UCA default)</strong></td>
1591      </tr>
1592      <tr>
1593        <td>currency</td>
1594        <td><code>[maxVariable currency]</code></td>
1595      </tr>
1596      <tr>
1597        <td>vt</td>
1598        <td>See <i>Part 1 Section 3.6.4, <a href=
1599        "tr35.html#Unicode_Locale_Extension_Data_Files">U Extension
1600        Data Files</a></i>.<br>
1601        <i><strong>Deprecated:</strong></i> Use maxVariable
1602        instead.</td>
1603        <td><code>&amp;\u00XX\uYYYY &lt; [variable top]</code><br>
1604        <br>
1605        (the default is set to the highest punctuation, thus
1606        including spaces and punctuation, but not symbols)</td>
1607        <td>
1608          <p>The BCP47 value is described in <i>Appendix Q:
1609          <a href="tr35.html#Locale_Extension_Key_and_Type_Data">Locale
1610          Extension Keys and Types</a>.</i></p>
1611          <p>Sets the string value for the variable top. All the
1612          code points with primary weights less than or equal to
1613          the variable top will be considered variable, and thus
1614          affected by the alternate handling.<br>
1615          An implementation that supports the variableTop setting
1616          should also support the maxVariable setting, and it
1617          should "pin" ("round up") the variableTop to the top of
1618          the containing reordering group.<br>
1619          Variables are ignorable by default in [<a href=
1620          "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>],
1621          but not in CLDR. See below for more information.</p>
1622        </td>
1623      </tr>
1624      <tr>
1625        <td><em>n/a</em></td>
1626        <td><em>n/a</em></td>
1627        <td><em>n/a</em></td>
1628        <td>match-boundaries: <em><strong>none</strong></em> |
1629        whole-character | whole-word<br>
1630        Defined by <em>Section 8, <a href=
1631        "https://www.unicode.org/reports/tr10/#Searching">Searching
1632        and Matching</a></em> of [<a href=
1633        "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>].</td>
1634      </tr>
1635      <tr>
1636        <td><em>n/a</em></td>
1637        <td><em>n/a</em></td>
1638        <td><em>n/a</em></td>
1639        <td>match-style: <em><strong>minimal</strong></em> | medial
1640        | maximal<br>
1641        Defined by <em>Section 8, <a href=
1642        "https://www.unicode.org/reports/tr10/#Searching">Searching
1643        and Matching</a></em> of [<a href=
1644        "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>].</td>
1645      </tr>
1646    </table>
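    <p>As a rough illustration of how the BCP47 and rule-syntax
    columns of the table correspond, the following Python sketch
    converts the collation keywords of a language tag into rule-syntax
    settings. The names <code>RULE_SYNTAX</code> and
    <code>settings_from_tag</code> are hypothetical. The sketch
    assumes that every key is followed by exactly one value subtag and
    that nothing follows the <code>-u-</code> extension; the
    <code>kr</code> and <code>vt</code> keys are left out.</p>

```python
# Transcribed from the Collation Settings table above.
RULE_SYNTAX = {
    "ks": {"level1": "[strength 1]", "level2": "[strength 2]",
           "level3": "[strength 3]", "level4": "[strength 4]",
           "identic": "[strength I]"},
    "ka": {"noignore": "[alternate non-ignorable]",
           "shifted": "[alternate shifted]"},
    "kb": {"true": "[backwards 2]"},
    "kk": {"true": "[normalization on]", "false": "[normalization off]"},
    "kc": {"true": "[caseLevel on]", "false": "[caseLevel off]"},
    "kf": {"upper": "[caseFirst upper]", "lower": "[caseFirst lower]",
           "false": "[caseFirst off]"},
    "kh": {"true": "[hiraganaQ on]", "false": "[hiraganaQ off]"},
    "kn": {"true": "[numericOrdering on]",
           "false": "[numericOrdering off]"},
    "kv": {"space": "[maxVariable space]", "punct": "[maxVariable punct]",
           "symbol": "[maxVariable symbol]",
           "currency": "[maxVariable currency]"},
}


def settings_from_tag(tag):
    """Return the rule-syntax settings requested by a language tag."""
    subtags = tag.lower().split("-")
    if "u" not in subtags:
        return []
    extension = subtags[subtags.index("u") + 1:]
    settings = []
    # Simplification: keys and values strictly alternate.
    for key, value in zip(extension[0::2], extension[1::2]):
        rule = RULE_SYNTAX.get(key, {}).get(value)
        if rule:
            settings.append(rule)
    return settings


print(settings_from_tag("de-u-ka-shifted-kv-currency"))
# ['[alternate shifted]', '[maxVariable currency]']
```

    <p>Keywords that are not collation settings, or values that have
    no rule-syntax equivalent (such as <code>kb-false</code>), are
    silently skipped in this sketch.</p>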
1647    <h4>3.4.1 <a name="Common_Settings" href="#Common_Settings" id=
1648    "Common_Settings">Common settings combinations</a></h4>
1649    <p>Some commonly used parametric collation settings are
1650    available via combinations of LDML settings attributes:</p>
1651    <ul>
1652      <li>“Ignore accents”: <strong>strength=primary</strong></li>
1653      <li>“Ignore accents” but take case into account:
1654      <strong>strength=primary caseLevel=on</strong></li>
1655      <li>“Ignore case”: <strong>strength=secondary</strong></li>
1656      <li>“Ignore punctuation” (completely):
1657      <strong>strength=tertiary alternate=shifted</strong></li>
1658      <li>“Ignore punctuation” but distinguish among punctuation
1659      marks: <strong>strength=quaternary
1660      alternate=shifted</strong></li>
1661    </ul>
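    <p>The first three combinations can be illustrated with a toy
    model of multi-level sort keys. The Python sketch below is not
    the UCA algorithm: the weights are invented, the tertiary level
    carries only case, punctuation and the alternate setting are not
    modeled, and the names <code>toy_weights</code> and
    <code>sort_key</code> are hypothetical.</p>

```python
import unicodedata


def toy_weights(ch):
    """Invented (primary, secondary, case) weights for an accented letter."""
    base, *marks = unicodedata.normalize("NFD", ch)
    return (ord(base.lower()), len(marks), int(base.isupper()))


def sort_key(text, strength, case_level=False):
    """Build a toy sort key with one tuple of weights per compared level."""
    text = unicodedata.normalize("NFC", text)
    weights = [toy_weights(ch) for ch in text]
    key = [tuple(w[0] for w in weights)]         # primary: base letters
    if strength >= 2:
        key.append(tuple(w[1] for w in weights))  # secondary: accents
    if case_level:
        key.append(tuple(w[2] for w in weights))  # case level ("Level 2.5")
    if strength >= 3:
        key.append(tuple(w[2] for w in weights))  # tertiary: case in this toy
    return tuple(key)


# "Ignore accents": strength=primary
print(sort_key("resume", 1) == sort_key("résumé", 1))
# "Ignore accents" but take case into account: strength=primary caseLevel=on
print(sort_key("resume", 1, case_level=True) == sort_key("Resume", 1, case_level=True))
# "Ignore case": strength=secondary
print(sort_key("resume", 2) == sort_key("Resume", 2))
```

    <p>Two strings compare as equal under a given combination exactly
    when their toy sort keys are equal.</p>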
1662    <h4>3.4.2 <a name="Normalization_Setting" href=
1663    "#Normalization_Setting" id="Normalization_Setting">Notes on
1664    the normalization setting</a></h4>
    <p>The UCA always normalizes input strings into NFD form before
    the rest of the algorithm. However, normalizing every input
    string in full results in poor performance.</p>
1668    <p>With <strong>normalization=off</strong>, strings that are in
1669    [<a href="tr35.html#FCD">FCD</a>] and do not contain Tibetan
1670    precomposed vowels (U+0F73, U+0F75, U+0F81) should sort
1671    correctly. With <strong>normalization=on</strong>, an
1672    implementation that does not normalize to NFD must at least
1673    perform an incremental FCD check and normalize substrings as
1674    necessary. It should also always decompose the Tibetan
1675    precomposed vowels. (Otherwise discontiguous contractions
1676    across their leading components cannot be handled
1677    correctly.)</p>
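    <p>A minimal Python sketch of the check described above, assuming
    the usual definition of FCD: for each pair of adjacent characters,
    the canonical combining class at the end of the first character's
    decomposition must not exceed the non-zero combining class at the
    start of the second character's decomposition. The function names
    are hypothetical, and the check is done over the whole string
    rather than incrementally.</p>

```python
import unicodedata

TIBETAN_PRECOMPOSED_VOWELS = {"\u0F73", "\u0F75", "\u0F81"}


def is_fcd(text):
    """Return True if text passes the FCD check."""
    prev_trail = 0
    for ch in text:
        decomposition = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(decomposition[0])
        if lead != 0 and prev_trail > lead:
            return False  # normalization would reorder across this boundary
        prev_trail = unicodedata.combining(decomposition[-1])
    return True


def ok_without_normalization(text):
    """FCD and free of the Tibetan precomposed vowels listed above."""
    return is_fcd(text) and not (set(text) & TIBETAN_PRECOMPOSED_VOWELS)


print(is_fcd("a\u0323\u0301"))  # dot below (ccc=220), then acute (ccc=230)
print(is_fcd("a\u0301\u0323"))  # acute before dot below: out of order
```

    <p>An incremental implementation would run the same comparison
    while iterating over the string and normalize only the substring
    around a failing boundary.</p>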
1678    <p>Another complication for an implementation that does not
1679    always use NFD arises when contraction mappings overlap with
1680    canonical Decomposition_Mapping strings. For example, the
1681    Danish contraction “aa” overlaps with the decompositions of
1682    ‘ä’, ‘å’, and other characters. In the root collation (and in
1683    the DUCET), Cyrillic ‘ӛ’ maps to a single collation element,
1684    which means that its decomposition “ә+◌̈” forms a contraction,
1685    and its second character (U+0308) is the same as the first
1686    character in the Decomposition_Mapping of U+0344
1687    ‘◌̈́’=“◌̈+◌́”.</p>
1688    <p>In order to handle strings with these characters (e.g., “aä”
1689    and “ӛ́” [which are in FCD]) exactly as with prior NFD
1690    normalization, an implementation needs to either add overlap
1691    contractions to its data (e.g., “a+ä” and “ә+◌̈́”), or it needs
1692    to decompose the relevant composites (e.g., ‘ä’ and ‘◌̈́’) as
1693    soon as they are encountered.</p>
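    <p>The overlap can be observed directly with Python's
    <code>unicodedata</code> module. In this sketch (the helper names
    are hypothetical), a contraction is reported as hidden when it
    does not occur in the original string but does occur after NFD
    normalization.</p>

```python
import unicodedata


def nfd(text):
    return unicodedata.normalize("NFD", text)


def hidden_by_composites(text, contraction):
    """True if the contraction matches only after NFD normalization."""
    return contraction not in text and contraction in nfd(text)


# Danish "aa" is hidden inside "a" followed by precomposed U+00E4.
print(hidden_by_composites("a\u00e4", "aa"))
# U+04D9 followed by U+0308 is hidden inside U+04D9 followed by U+0344.
print(hidden_by_composites("\u04d9\u0344", "\u04d9\u0308"))
```

    <p>An implementation that skips NFD could use such a test when
    building its data to decide which overlap contractions to add or
    which composites to decompose on sight.</p>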
1694    <h4>3.4.3 <a name="Variable_Top_Settings" href=
1695    "#Variable_Top_Settings" id="Variable_Top_Settings">Notes on
1696    variable top settings</a></h4>
    <p>Users may want to include more or fewer characters as
    Variable. For example, someone could want to restrict the
    Variable characters to just include space marks. In that case,
    maxVariable would be set to "space". (In CLDR 24 and earlier,
    the now-deprecated variableTop would be set to U+1680; see the
    “Whitespace” <a href="https://unicode.org/charts/collation/">UCA
    collation chart</a>.) Alternatively, someone could want to
    include more of the Common characters, up to (but not
    including) '0', by setting maxVariable to "currency". (In CLDR
    24 and earlier, the now-deprecated variableTop would be set to
    U+20BA; see the “Currency-Symbol” collation chart.)</p>
    <p>These settings customize which sets of characters are
    ignored when comparing strings. For example, the locale
    identifier "de-u-ka-shifted-kv-currency" requests settings
    appropriate for German, including German sorting conventions,
    and requests that currency symbols and characters sorting below
    them be ignored in sorting.</p>
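    <p>The interaction of maxVariable and alternate handling can be
    sketched as follows. This Python fragment is illustrative only:
    it works at the granularity of whole reordering groups rather
    than primary weights, it assumes the groups sort in the order
    space, punct, symbol, currency, and the function names are
    hypothetical.</p>

```python
# Reordering groups that can be variable, in ascending sort order.
REORDERING_GROUPS = ["space", "punct", "symbol", "currency"]


def variable_groups(max_variable="punct"):
    """Groups at or below the variable top; "punct" is the CLDR default."""
    top = REORDERING_GROUPS.index(max_variable)
    return REORDERING_GROUPS[:top + 1]


def is_ignored(group, alternate="non-ignorable", max_variable="punct"):
    """True if characters of this group are ignored under these settings."""
    return alternate == "shifted" and group in variable_groups(max_variable)


# Settings requested by "de-u-ka-shifted-kv-currency":
print(is_ignored("currency", alternate="shifted", max_variable="currency"))
print(is_ignored("digit", alternate="shifted", max_variable="currency"))
```

    <p>With the CLDR defaults (alternate=non-ignorable), no group is
    ignored, whatever the value of maxVariable.</p>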
1715    <h3>3.5 <a name="Rules" href="#Rules" id="Rules">Collation Rule
1716    Syntax</a></h3>
1717    <p class="dtd">&lt;!ELEMENT cr #PCDATA &gt;</p>
1718    <p>The goal for the collation rule syntax is to have clearly
1719    expressed rules with a concise format. The CLDR rule syntax is
1720    a subset of the [<a href=
1721    "tr35.html#ICUCollation">ICUCollation</a>] syntax.</p>
1722    <p>For the CLDR root collation, the FractionalUCA.txt file
1723    defines all mappings for all of Unicode directly, and it also
1724    provides information about script boundaries, reordering
1725    groups, and other details. For tailorings, this is neither
1726    necessary nor practical. In particular, while the root
1727    collation sort order rarely changes for existing characters,
1728    their numeric collation weights change with every version. If
1729    tailorings also specified numeric weights directly, then they
1730    would have to change with every version, parallel with the root
1731    collation. Instead, for tailorings, mappings are added and
1732    modified relative to the root collation. (There is no syntax to
1733    <i>remove</i> mappings, except via <a href=
1734    "#Special_Purpose_Commands">special [suppressContractions
1735    [...]]</a> .)</p>
1736    <p>The ASCII [:P:] and [:S:] characters are reserved for
1737    collation syntax: <code>[\u0021-\u002F \u003A-\u0040
1738    \u005B-\u0060 \u007B-\u007E]</code></p>
1739    <p>Unicode Pattern_White_Space characters between tokens are
1740    ignored. Unquoted white space terminates reset and relation
1741    strings.</p>
1742    <p>A pair of ASCII apostrophes encloses quoted literal text.
1743    They are normally used to enclose a syntax character or white
1744    space, or a whole reset/relation string containing one or more
1745    such characters, so that those are parsed as part of the
1746    reset/relation strings rather than treated as syntax. A pair of
1747    immediately adjacent apostrophes is used to encode one
1748    apostrophe.</p>
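    <p>The quoting rules above can be sketched as a small helper that
    prepares a literal string for use as a reset or relation string.
    This Python sketch is an illustration under stated assumptions,
    not a reference implementation: <code>quote_literal</code> is a
    hypothetical name, and the sketch assumes that a doubled
    apostrophe also encodes one apostrophe inside quoted text.</p>

```python
# ASCII characters reserved for collation syntax, from the ranges above.
SYNTAX_CHARS = {chr(cp) for cp in (*range(0x21, 0x30), *range(0x3A, 0x41),
                                   *range(0x5B, 0x61), *range(0x7B, 0x7F))}
# Unicode Pattern_White_Space code points.
PATTERN_WHITE_SPACE = set("\t\n\x0b\x0c\r \x85\u200e\u200f\u2028\u2029")


def quote_literal(text):
    """Quote text so that it is parsed as literal characters, not syntax."""
    needs_quotes = any(
        ch in PATTERN_WHITE_SPACE or (ch in SYNTAX_CHARS and ch != "'")
        for ch in text)
    escaped = text.replace("'", "''")  # two apostrophes encode one
    return f"'{escaped}'" if needs_quotes else escaped


print("&", quote_literal("a"), "<", quote_literal("-"))
# & a < '-'
```

    <p>Strings without syntax characters or white space are returned
    unchanged, so ordinary letters stay readable in the rules.</p>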
1749    <p>Code points can be escaped with <code>\uhhhh</code> and
1750    <code>\U00hhhhhh</code> escapes, as well as common escapes like
1751    <code>\t</code> and <code>\n</code> . (For details see the
1752    documentation of ICU UnicodeString::unescape().) This is
1753    particularly useful for default-ignorable code points,
1754    combining marks, visually indistinct variants, hard-to-type
1755    characters, etc. These sequences are unescaped before the rules
1756    are parsed; this means that even escaped syntax and white space
1757    characters need to be enclosed in apostrophes. For example:
1758    <code>&amp;'\u0020'='\u3000'</code>. Note: The unescaping is
1759    done by ICU tools (genrb) and demos before passing rule strings
1760    into the ICU library code. The ICU collation API does not
1761    unescape rule strings.</p>
1762    <p>The ASCII double quote must be both escaped (so that the
1763    collation syntax can be enclosed in pairs of double quotes in
1764    programming environments such as ICU resource bundle .txt
1765    files) and quoted. For example:
1766    <code>&amp;'\u0022'&lt;&lt;&lt;x</code></p>
1767    <p>Comments are allowed at the beginning, and after any
1768    complete reset, relation, setting, or command. A comment begins
1769    with a <code>#</code> and extends to the end of the line
1770    (according to the Unicode Newline Guidelines).</p>
1771    <p>The collation syntax is case-sensitive.</p>
1772    <h3>3.6 <a name="Orderings" href="#Orderings" id=
1773    "Orderings">Orderings</a></h3>
1774    <p>The root collation mappings form the initial state. Mappings
1775    are added and removed via a sequence of rule chains. Each
1776    tailoring rule builds on the current state after all of the
1777    preceding rules (and is not affected by any following rules).
1778    Rule chains may alternate with comments, settings, and special
1779    commands.</p>
1780    <p>A rule chain consists of a reset followed by one or more
1781    relations. The reset position is a string which maps to one or
1782    more collation elements according to the current state. A
1783    relation consists of an operator and a string; it maps the
1784    string to the current collation elements, modified according to
1785    the operator.</p>
1786    <table>
1787      <caption>
1788        <a name="Specifying_Collation_Ordering" href=
1789        "#Specifying_Collation_Ordering" id=
1790        "Specifying_Collation_Ordering">Specifying Collation
1791        Ordering</a>
1792      </caption>
1793      <tr>
1794        <th>Relation Operator</th>
1795        <th>&nbsp;Example</th>
1796        <th>Description</th>
1797      </tr>
1798      <tr>
1799        <td><code>&amp;</code></td>
1800        <td><code>&amp; Z</code></td>
1801        <td>Map Z to collation elements according to the current
1802        state. These will be modified according to the following
1803        relation operators and then assigned to the corresponding
1804        relation strings.</td>
1805      </tr>
1806      <tr>
1807        <td><code>&lt;</code></td>
1808        <td><code>&amp; a<br>
1809        &lt; b</code></td>
1810        <td>Make 'b' sort after 'a', as a <i>primary</i>
1811        (base-character) difference</td>
1812      </tr>
1813      <tr>
1814        <td><code>&lt;&lt;</code></td>
1815        <td><code>&amp; a<br>
1816        &lt;&lt; ä</code></td>
1817        <td>Make 'ä' sort after 'a' as a <i>secondary</i> (accent)
1818        difference</td>
1819      </tr>
1820      <tr>
1821        <td><code>&lt;&lt;&lt;</code></td>
1822        <td><code>&amp; a<br>
1823        &lt;&lt;&lt; A</code></td>
1824        <td>Make 'A' sort after 'a' as a <i>tertiary</i>
1825        (case/variant) difference</td>
1826      </tr>
1827      <tr>
1828        <td><code>&lt;&lt;&lt;&lt;</code></td>
1829        <td><code>&amp; か<br>
1830        &lt;&lt;&lt;&lt; カ</code></td>
1831        <td>Make 'カ' (Katakana Ka) sort after 'か' (Hiragana Ka) as
1832        a <i>quaternary</i> difference</td>
1833      </tr>
1834      <tr>
1835        <td><code>=&nbsp;</code></td>
1836        <td><code>&amp; v<br>
1837        = w&nbsp;</code></td>
1838        <td>Make 'w' sort <i>identically</i> to 'v'</td>
1839      </tr>
1840    </table>
1841    <p>The following shows the result of serially applying three
1842    rules.</p>
1843    <table>
1844      <tr>
1845        <th>&nbsp;</th>
1846        <th>Rules</th>
1847        <th>Result</th>
1848        <th>Comment</th>
1849      </tr>
1850      <tr>
1851        <td>1</td>
1852        <td>&amp; a &lt; g</td>
1853        <td>... a <font color="red">&lt;<sub>1</sub> g</font>
1854        ...</td>
1855        <td>Put g after a.</td>
1856      </tr>
1857      <tr>
1858        <td>2</td>
1859        <td>&amp; a &lt; h &lt; k</td>
1860        <td>... a <font color="red">&lt;<sub>1</sub> h
1861        &lt;<sub>1</sub> k</font> &lt;<sub>1</sub> g ...</td>
1862        <td>Now put h and k after a (inserting before the g).</td>
1863      </tr>
1864      <tr>
1865        <td>3</td>
1866        <td>&amp; h &lt;&lt; g</td>
        <td>... a &lt;<sub>1</sub> h <font color=
        "red">&lt;<sub>2</sub> g</font> &lt;<sub>1</sub> k ...</td>
1869        <td>Now put g after h (inserting before k).</td>
1870      </tr>
1871    </table>
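    <p>The serial application of these rules can be modeled with an
    ordered list in which each entry records the level of its
    difference from the previous entry. The Python sketch below is a
    simplified model, not the collation-element algorithm: it handles
    only single strings, and <code>apply_chain</code> and
    <code>show</code> are hypothetical names.</p>

```python
def apply_chain(order, reset, relations):
    """Apply one rule chain to order, a list of [string, level] entries.

    level is the difference from the previous entry (1 = primary,
    2 = secondary, ...); the first entry has level 0.
    """
    anchor = reset
    for level, string in relations:
        # A relation string that already has a mapping is overridden.
        for i, (existing, old_level) in enumerate(order):
            if existing == string:
                del order[i]
                if i < len(order):
                    # The follower keeps the stronger of the two differences.
                    order[i][1] = min(order[i][1], old_level)
                break
        position = [s for s, _ in order].index(anchor) + 1
        # Skip entries that differ from the anchor only at weaker levels.
        while position < len(order) and order[position][1] > level:
            position += 1
        order.insert(position, [string, level])
        anchor = string
    return order


def show(order):
    return " ".join(s if i == 0 else f"<{level} {s}"
                    for i, (s, level) in enumerate(order))


order = [["a", 0]]
apply_chain(order, "a", [(1, "g")])            # & a < g
apply_chain(order, "a", [(1, "h"), (1, "k")])  # & a < h < k
apply_chain(order, "h", [(2, "g")])            # & h << g
print(show(order))  # a <1 h <2 g <1 k
```

    <p>Because rule 3 uses the secondary operator, g ends up as a
    secondary difference from h while k remains a primary
    difference.</p>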
1872    <p>Notice that relation strings can occur multiple times, and
1873    thus override previous rules.</p>
1874    <p>Each relation uses and modifies the collation elements of
1875    the immediately preceding reset position or relation. A rule
1876    chain with two or more relations is equivalent to a sequence of
1877    “atomic rules” where each rule chain has exactly one relation,
1878    and each relation is followed by a reset to this same relation
1879    string.</p>
1880    <p><i>Example:</i></p>
1881    <table>
1882      <tr>
1883        <th>Rules</th>
1884        <th>Equivalent Atomic Rules</th>
1885      </tr>
1886      <tr>
1887        <td>&amp; b &lt; q &lt;&lt;&lt; Q<br>
1888        &amp; a &lt; x &lt;&lt;&lt; X &lt;&lt; q &lt;&lt;&lt; Q
1889        &lt; z</td>
1890        <td>&amp; b &lt; q<br>
1891        &amp; q &lt;&lt;&lt; Q<br>
1892        &amp; a &lt; x<br>
1893        &amp; x &lt;&lt;&lt; X<br>
1894        &amp; X &lt;&lt; q<br>
1895        &amp; q &lt;&lt;&lt; Q<br>
1896        &amp; Q &lt; z</td>
1897      </tr>
1898    </table>
1899    <p>This is not always possible because prefix and extension
1900    strings can occur in a relation but not in a reset (see
1901    below).</p>
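    <p>For chains without prefix or extension strings, the
    decomposition into atomic rules is mechanical. The following
    Python sketch shows it for whitespace-separated tokens; it does
    not handle quoting or escaping, and
    <code>to_atomic_rules</code> is a hypothetical name.</p>

```python
def to_atomic_rules(chain):
    """Split '& reset op string op string ...' into one-relation rules."""
    tokens = chain.split()
    if len(tokens) < 4 or len(tokens) % 2 or tokens[0] != "&":
        raise ValueError("expected '& reset' followed by operator/string pairs")
    reset = tokens[1]
    atomic = []
    for operator, string in zip(tokens[2::2], tokens[3::2]):
        atomic.append(f"& {reset} {operator} {string}")
        reset = string  # the next relation builds on this relation string
    return atomic


for rule in to_atomic_rules("& a < x <<< X << q <<< Q < z"):
    print(rule)
```

    <p>Each relation string becomes the reset position of the next
    atomic rule, which is what makes the two forms equivalent.</p>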
1902    <p>The relation operator <code>=</code> maps its relation
1903    string to the current collation elements. Any other relation
1904    operator modifies the current collation elements as
1905    follows.</p>
1906    <ul>
1907      <li>Find the <i>last</i> collation element whose strength is
1908      at least as great as the strength of the operator. For
1909      example, for <code>&lt;&lt;</code> find the last primary or
1910      secondary CE. This CE will be modified; all following CEs
1911      should be removed. If there is no such CE, then reset the
1912      collation elements to a single completely-ignorable CE.</li>
1913      <li>Increment the collation element weight corresponding to
1914      the strength of the operator. For example, for
1915      <code>&lt;&lt;</code> increment the secondary weight.</li>
1916      <li>The new weight must be less than the next weight for the
1917      same combination of higher-level weights of any collation
1918      element according to the current state.</li>
1919      <li>Weights must be allocated in accordance with the <a href=
1920      "https://www.unicode.org/reports/tr10/#Well-Formed">UCA
1921      well-formedness conditions</a>.</li>
1922      <li>When incrementing any weight, lower-level weights should
1923      be reset to the “common” values, to help with sort key
1924      compression.</li>
1925    </ul>
    <p>In all cases, even for <code>=</code>, the case bits are
    recomputed according to <i>Section 3.14, <a href=
    "#Case_Parameters">Case Parameters</a></i>. (This can be
    skipped if an implementation does not support the caseLevel or
    caseFirst settings.)</p>
1931    <p>For example, <code>&amp;ae&lt;x</code> maps ‘x’ to two
1932    collation elements. The first one is the same as for ‘a’, and
1933    the second one has a primary weight between those for ‘e’ and
    ‘f’. As a result, ‘x’ sorts between “ae” and “af”. (If the
    primary of the first collation element were incremented
    instead, then ‘x’ would sort after “az”. Although ‘x’ would
    still sort primary-after “ae”, this would be surprising and
    sub-optimal.)</p>
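    <p>The modification steps can be illustrated with toy collation
    elements. In the Python sketch below, a CE is a tuple of
    (primary, secondary, tertiary) weights. The weights are invented
    and spaced out so that adding 1 stands in for real weight
    allocation; case bits and well-formedness checks are not
    modeled; and <code>ce_strength</code>, <code>tailor</code>, and
    <code>sort_key</code> are hypothetical names.</p>

```python
COMMON = 5  # toy "common" weight for the secondary and tertiary levels
ROOT = {"a": (100, COMMON, COMMON), "e": (140, COMMON, COMMON),
        "f": (150, COMMON, COMMON), "z": (350, COMMON, COMMON)}


def ce_strength(ce):
    """1 for a primary CE, 2 secondary, 3 tertiary, 4 completely ignorable."""
    for level, weight in enumerate(ce, start=1):
        if weight:
            return level
    return len(ce) + 1


def tailor(ces, op_level):
    """CEs for a relation string placed after ces at op_level (1 to 3)."""
    # Find the last CE whose strength is at least that of the operator.
    index = max((i for i, ce in enumerate(ces)
                 if ce_strength(ce) <= op_level), default=None)
    if index is None:
        kept, last = [], (0, 0, 0)  # a single completely-ignorable CE
    else:
        kept, last = list(ces[:index]), ces[index]  # following CEs dropped
    weights = list(last)
    weights[op_level - 1] += 1  # stand-in for allocating the next weight
    for lower in range(op_level, len(weights)):
        weights[lower] = COMMON  # reset lower levels to the common weight
    return kept + [tuple(weights)]


def sort_key(ces):
    """Non-zero weights, level by level."""
    return tuple(tuple(ce[level] for ce in ces if ce[level])
                 for level in range(3))


x = tailor([ROOT["a"], ROOT["e"]], 1)  # &ae<x
print(x)  # [(100, 5, 5), (141, 5, 5)]
print(sort_key([ROOT["a"], ROOT["e"]]) < sort_key(x)
      < sort_key([ROOT["a"], ROOT["f"]]))  # True
```

    <p>Only the second collation element changes, so in this model
    ‘x’ lands between “ae” and “af” rather than after “az”.</p>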
1939    <p>Some additional operators are provided to save space with
    large tailorings. The addition of a * to the relation operator
    indicates that each of the following single characters is to
    be handled as if it were a separate relation with the
    corresponding strength. Each of the following single characters
1944    must be NFD-inert, that is, it does not have a canonical
1945    decomposition and it does not reorder (ccc=0). This keeps
1946    abbreviated rules unambiguous.</p>
1947    <p>A starred relation operator is followed by a sequence of
1948    characters with the same quoting/escaping rules as normal
1949    relation strings. Such a sequence can also be followed by one
1950    or more pairs of ‘-’ and another sequence of characters. The
1951    single characters adjacent to the ‘-’ establish a code point
1952    order range. The same character cannot be both the end of a
    range and the start of another range. (For example,
    <code>&lt;*a-d-g</code> is not allowed.)</p>
1955    <table>
1956      <caption>
1957        <a name="Abbreviating_Ordering_Specifications" href=
1958        "#Abbreviating_Ordering_Specifications" id=
1959        "Abbreviating_Ordering_Specifications">Abbreviating
1960        Ordering Specifications</a>
1961      </caption>
1962      <tr>
1963        <th>Relation Operator</th>
1964        <th>Example</th>
1965        <th>Equivalent</th>
1966      </tr>
1967      <tr>
1968        <td><code>&lt;*</code></td>
1969        <td><code>&amp; <span style="color: blue">a</span><br>
1970        &lt;* <span style=
1971        "color: blue">bcd-gp-s</span>&nbsp;</code></td>
1972        <td><code>&amp; <span style="color: blue">a</span><br>
1973        &lt; <span style="color: blue">b</span> &lt; <span style=
1974        "color: blue">c</span> &lt; <span style=
1975        "color: blue">d</span> &lt; <span style=
1976        "color: blue">e</span> &lt; <span style=
1977        "color: blue">f</span> &lt; <span style=
1978        "color: blue">g</span> &lt; <span style=
1979        "color: blue">p</span> &lt; <span style=
1980        "color: blue">q</span> &lt; <span style=
1981        "color: blue">r</span> &lt; <span style=
1982        "color: blue">s</span></code></td>
1983      </tr>
1984      <tr>
1985        <td><code>&lt;&lt;*</code></td>
1986        <td><code>&amp; <span style="color: blue">a</span><br>
1987        &lt;&lt;* <span style="color: blue">æᶏɐ</span></code></td>
1988        <td><code>&amp; <span style="color: blue">a</span><br>
1989        &lt;&lt; <span style="color: blue">æ</span> &lt;&lt;
1990        <span style="color: blue">ᶏ</span> &lt;&lt; <span style=
1991        "color: blue">ɐ</span></code></td>
1992      </tr>
1993      <tr>
1994        <td><code>&lt;&lt;&lt;*</code></td>
1995        <td><code>&amp; <span style="color: blue">p</span><br>
1996        &lt;&lt;&lt;* <span style=
1997        "color: blue">PpP</span></code></td>
1998        <td><code>&amp; <span style="color: blue">p</span><br>
1999        &lt;&lt;&lt; <span style="color: blue">P</span>
2000        &lt;&lt;&lt; <span style="color: blue">p</span>
2001        &lt;&lt;&lt; <span style="color: blue">P</span></code></td>
2002      </tr>
2003      <tr>
2004        <td><code>&lt;&lt;&lt;&lt;*</code></td>
2005        <td><code>&amp; <span style="color: blue">k</span><br>
2006        &lt;&lt;&lt;&lt;* <span style=
2007        "color: blue">qQ</span></code></td>
2008        <td><code>&amp; <span style="color: blue">k</span><br>
2009        &lt;&lt;&lt;&lt; <span style="color: blue">q</span>
2010        &lt;&lt;&lt;&lt; <span style=
2011        "color: blue">Q</span></code></td>
2012      </tr>
2013      <tr>
2014        <td><code>=*</code></td>
2015        <td><code>&amp; <span style="color: blue">v</span><br>
2016        =* <span style="color: blue">VwW</span></code></td>
2017        <td><code>&amp; <span style="color: blue">v</span><br>
2018        = <span style="color: blue">V</span> = <span style=
2019        "color: blue">w</span> = <span style=
2020        "color: blue">W</span></code></td>
2021      </tr>
2022    </table>
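    <p>The expansions in the table above can be sketched in code.
    The following Python function is an illustration only, not part
    of this specification: the name <code>expand_starred</code> is
    hypothetical, and the sketch assumes that the character sequence
    has already been unquoted and unescaped. It expands each range
    in code point order and rejects a character that would be both
    the end of one range and the start of another.</p>

```python
def expand_starred(operator, chars):
    """Expand a starred relation into a chain of single relations.

    operator: the relation without the star, for example "<" or "=".
    chars: the already-unquoted character sequence, e.g. "bcd-gp-s".
    """
    items = []
    previous_ended_range = False
    i = 0
    while i != len(chars):
        ch = chars[i]
        if ch == "-":
            # A range needs a start before '-' and an end after it, and
            # the start must not itself be the end of an earlier range.
            if not items or i + 1 == len(chars) or previous_ended_range:
                raise ValueError("malformed range in " + repr(chars))
            start, end = ord(items[-1]), ord(chars[i + 1])
            if end <= start:
                raise ValueError("range is not in code point order")
            items.extend(chr(cp) for cp in range(start + 1, end + 1))
            previous_ended_range = True
            i += 2
        else:
            items.append(ch)
            previous_ended_range = False
            i += 1
    return " ".join(operator + " " + item for item in items)


print(expand_starred("<", "bcd-gp-s"))
```

    <p>Applied to the first row of the table, the sketch yields the
    chain of primary relations shown in the Equivalent column; a
    sequence such as <code>a-d-g</code> raises an error.</p>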
2023    <h3>3.7 <a name="Contractions" href="#Contractions" id=
2024    "Contractions">Contractions</a></h3>
2025    <p>A multi-character relation string defines a contraction.</p>
2026    <table>
2027      <caption>
2028        <a name="Specifying_Contractions" href=
2029        "#Specifying_Contractions" id=
2030        "Specifying_Contractions">Specifying Contractions</a>
2031      </caption>
2032      <tr>
2033        <th>Example</th>
2034        <th>Description</th>
2035      </tr>
2036      <tr>
2037        <td><code>&amp; k<br>
2038        &lt; ch</code></td>
2039        <td>Make the sequence 'ch' sort after 'k', as a primary
2040        (base-character) difference</td>
2041      </tr>
2042    </table>
2043    <h3>3.8 <a name="Expansions" href="#Expansions" id=
2044    "Expansions">Expansions</a></h3>
2045    <p>A mapping to multiple collation elements defines an
2046    expansion. This is normally the result of a reset position
2047    (and/or preceding relation) that yields multiple collation
2048    elements, for example <code>&amp;ae&lt;x</code> or
    <code>&amp;æ&lt;y</code>.</p>
2050    <p>A relation string can also be followed by <code>/</code> and
2051    an <i>extension string</i>. The extension string is mapped to
2052    collation elements according to the current state, and the
2053    relation string is mapped to the concatenation of the regular
2054    CEs and the extension CEs. The extension CEs are not modified,
2055    not even their case bits. The extension CEs are <i>not</i>
2056    retained for following relations.</p>
2057    <p>For example, <code>&amp;a&lt;z/e</code> maps ‘z’ to an
    expansion similar to <code>&amp;ae&lt;x</code>. However, the
2059    first CE of ‘z’ is primary-after that of ‘a’, and the second CE
2060    is exactly that of ‘e’, which yields the order ae &lt; x &lt;
2061    af &lt; ag &lt; ... &lt; az &lt; z &lt; b.</p>
2062    <p>The choice of reset-to-expansion vs. use of an extension
2063    string can be exploited to affect contextual mappings. For
2064    example, <code>&amp;L·=x</code> yields a second CE for ‘x’
2065    equal to the context-sensitive middle-dot-after-L (which is a
2066    secondary CE in the root collation). On the other hand,
2067    <code>&amp;L=x/·</code> yields a second CE of the middle dot by
2068    itself (which is a primary CE).</p>
2069    <p>The two ways of specifying expansions also differ in how
2070    case bits are computed. When some of the CEs are copied
2071    verbatim from an extension string, then the relation string’s
2072    case bits are distributed over a smaller number of normal CEs.
2073    For example, <code>&amp;aE=Ch</code> yields an uppercase CE and
2074    a lowercase CE, but <code>&amp;a=Ch/E</code> yields a
2075    mixed-case CE (for ‘C’ and ‘h’ together) followed by an
2076    uppercase CE (copied from ‘E’).</p>
2077    <p>In summary, there are two ways of specifying expansions
2078    which produce subtly different mappings. The use of extension
2079    strings is unusual but sometimes necessary.</p>
2080    <h3>3.9 <a name="Context_Before" href="#Context_Before" id=
2081    "Context_Before">Context Before</a></h3>
2082    <p>A relation string can have a prefix (context before) which
2083    makes the mapping from the relation string to its tailored
2084    position conditional on the string occurring after that prefix.
2085    For details see the specification of <i><a href=
2086    "#Context_Sensitive_Mappings">Context-Sensitive
2087    Mappings</a></i>.</p>
    <p>For example, suppose that "-" is sorted like the previous
    vowel. One could write contraction rules for "a-", "e-", and so
    on. However, every time a very common character (a, e, ...) is
    then encountered, the system slows down as it looks for
    possible contractions. An alternative is to specify that when
    "-" is encountered after an 'a', it sorts like an 'a', and so
    on.</p>
2095    <table>
2096      <caption>
2097        <a name="Specifying_Previous_Context" href=
2098        "#Specifying_Previous_Context" id=
2099        "Specifying_Previous_Context">Specifying Previous
2100        Context</a>
2101      </caption>
2102      <tr>
2103        <th>Rules</th>
2104      </tr>
2105      <tr>
2106        <td><code>&amp; a &lt;&lt;&lt; a | '-'<br>
2107        &amp; e &lt;&lt;&lt; e | '-'<br>
2108        ...</code></td>
2109      </tr>
2110    </table>
2111    <p>Both the prefix and extension strings can occur in a
2112    relation. For example, the following are allowed:</p>
2113    <ul>
2114      <li><code>&lt; abc | def / ghi</code></li>
2115      <li><code>&lt; def / ghi</code></li>
2116      <li><code>&lt; abc | def</code></li>
2117    </ul>
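    <p>As an illustration of the three allowed forms, the following
    Python sketch splits a relation string into its prefix, main
    string, and extension. The function name
    <code>split_relation_string</code> is hypothetical, and the
    sketch assumes that neither <code>|</code> nor <code>/</code>
    occurs quoted or escaped in the input; a real rule parser has to
    honor the quoting rules.</p>

```python
def split_relation_string(text):
    """Return (prefix, string, extension) for 'prefix | string / extension'.

    Missing parts are returned as empty strings. Quoting and escaping
    are not handled in this sketch.
    """
    before_bar, bar, after_bar = text.partition("|")
    if bar:
        prefix, rest = before_bar, after_bar
    else:
        # No '|' present: there is no context-before prefix.
        prefix, rest = "", text
    string, _, extension = rest.partition("/")
    return prefix.strip(), string.strip(), extension.strip()


for example in ("abc | def / ghi", "def / ghi", "abc | def"):
    print(split_relation_string(example))
```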
2118    <h3>3.10 <a name="Placing_Characters_Before_Others" href=
2119    "#Placing_Characters_Before_Others" id=
2120    "Placing_Characters_Before_Others">Placing Characters Before
2121    Others</a></h3>
2122    <p>There are certain circumstances where characters need to be
2123    placed before a given character, rather than after. This is the
2124    case with Pinyin, for example, where certain accented letters
2125    are positioned before the base letter. That is accomplished
2126    with the following syntax.</p>
2127    <pre>&amp;[before 2] a &lt;&lt; à</pre>
2128    <p>The before-strength can be 1 (primary), 2 (secondary), or 3
2129    (tertiary).</p>
2130    <p>It is an error if the strength of the reset-before differs
2131    from the strength of the immediately following relation. Thus
2132    the following are errors.</p>
2133    <ul>
2134      <li><code>&amp;[before 2] a &lt; à # error</code></li>
2135      <li><code>&amp;[before 2] a &lt;&lt;&lt; à #
2136      error</code></li>
2137    </ul>
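    <p>The strength check can be sketched as follows. This is a
    minimal Python illustration, not a normative algorithm; the
    names <code>RELATION_STRENGTH</code> and
    <code>reset_before_is_valid</code> are hypothetical, and the
    sketch assumes that the relation operator has already been
    isolated from the rule text.</p>

```python
# Strength of each relation operator that can follow a reset-before.
RELATION_STRENGTH = {"<": 1, "<<": 2, "<<<": 3}


def reset_before_is_valid(before_strength, relation):
    """Check that the relation right after &[before n] has strength n.

    relation may be a starred operator such as "<<*"; the star does
    not change the strength. Operators without a strength of 1 to 3
    (such as "=") never match.
    """
    operator = relation.rstrip("*")
    return RELATION_STRENGTH.get(operator) == before_strength


print(reset_before_is_valid(2, "<<"))   # the valid Pinyin-style rule
print(reset_before_is_valid(2, "<"))    # strength mismatch
```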
2138    <h3>3.11 <a name="Logical_Reset_Positions" href=
2139    "#Logical_Reset_Positions" id="Logical_Reset_Positions">Logical
2140    Reset Positions</a></h3>
2141    <p>The CLDR table (based on UCA) has the following overall
2142    structure for weights, going from low to high.</p>
2143    <table>
2144      <caption>
2145        <a name="Specifying_Logical_Positions" href=
2146        "#Specifying_Logical_Positions" id=
2147        "Specifying_Logical_Positions">Specifying Logical
2148        Positions</a>
2149      </caption>
2150      <tr>
2151        <th>Name</th>
2152        <th>Description</th>
2153        <th>UCA Examples</th>
2154      </tr>
2155      <tr>
2156        <td>first tertiary ignorable<br>
2157        ...<br>
2158        last tertiary ignorable</td>
2159        <td>p, s, t = ignore</td>
2160        <td>Control Codes<br>
2161        Format Characters<br>
2162        Hebrew Points<br>
2163        Tibetan Signs<br>
2164        ...</td>
2165      </tr>
2166      <tr>
2167        <td>first secondary ignorable<br>
2168        ...<br>
2169        last secondary ignorable</td>
2170        <td>p, s = ignore</td>
2171        <td>None in UCA</td>
2172      </tr>
2173      <tr>
2174        <td>first primary ignorable<br>
2175        ...<br>
2176        last primary ignorable</td>
2177        <td>p = ignore</td>
2178        <td>Most combining marks</td>
2179      </tr>
2180      <tr>
2181        <td>first variable<br>
2182        ...<br>
2183        last variable</td>
2184        <td><i><b>if</b> alternate = non-ignorable<br></i> p !=
2185        ignore,<br>
2186        <i><b>if</b> alternate = shifted</i><br>
2187        p, s, t = ignore</td>
2188        <td>Whitespace,<br>
2189        Punctuation</td>
2190      </tr>
2191      <tr>
2192        <td>first regular<br>
2193        ...<br>
2194        last regular</td>
2195        <td>p != ignore</td>
2196        <td>General Symbols<br>
2197        Currency Symbols<br>
2198        Numbers<br>
2199        Latin<br>
2200        Greek<br>
2201        ...</td>
2202      </tr>
2203      <tr>
2204        <td>first implicit<br>
2205        ...<br>
2206        last implicit</td>
2207        <td>p != ignore, assigned automatically</td>
2208        <td>CJK, CJK compatibility (those that are not
2209        decomposed)<br>
2210        CJK Extension A, B, C, ...<br>
2211        Unassigned</td>
2212      </tr>
2213      <tr>
2214        <td>first trailing<br>
2215        ...<br>
2216        last trailing</td>
2217        <td>p != ignore,<br>
2218        used for trailing syllable components</td>
2219        <td>Jamo Trailing<br>
2220        Jamo Leading<br>
2221        U+FFFD<br>
2222        U+FFFF</td>
2223      </tr>
2224    </table>
2225    <p>Each of the above Names can be used with a reset to position
2226    characters relative to that logical position. That allows
2227    characters to be ordered before or after a <i>logical</i>
2228    position rather than a specific character.</p>
2229    <blockquote>
2230      <p class="note"><b>Note:</b> The reason for this is so that
2231      tailorings can be more stable. A future version of the UCA
2232      might add characters at any point in the above list. Suppose
2233      that you set character X to be after Y. It could be that you
2234      want X to come after Y, no matter what future characters are
    added; or it could be that you just want X to come after a
2236      given logical position, for example, after the last primary
2237      ignorable.</p>
2238    </blockquote>
2239    <p>Each of these special reset positions always maps to a
2240    single collation element.</p>
2241    <p>Here is an example of the syntax:</p>
2242    <pre>&amp; [first tertiary ignorable] &lt;&lt; à </pre>
2243    <p>For example, to make a character be a secondary ignorable,
2244    one can make it be immediately after (at a secondary level) a
2245    specific character (like a combining diaeresis), or one can
2246    make it be immediately after the last secondary ignorable.</p>
2247    <p>Each special reset position adjusts to the effects of
2248    preceding rules, just like normal reset position strings. For
2249    example, if a tailoring rule creates a new collation element
2250    after <code>&amp;[last variable]</code> (via explicit tailoring
2251    after that, or via tailoring after the relevant character),
2252    then this new CE becomes the new <i>last variable</i> CE, and
    is used in following resets to <code>[last variable]</code>.</p>
    <p><code>[first variable]</code>, <code>[first regular]</code>, and
    <code>[first trailing]</code> should be the first real such CEs
    (e.g., CE(U+0060 `)), as
2257    adjusted according to the tailoring, not the boundary CEs (see
2258    the FractionalUCA.txt “first primary” mappings starting with
2259    U+FDD1).</p>
2260    <p><code>[last regular]</code> is not actually the last normal
2261    CE with a primary weight before implicit primaries. It is used
2262    to tailor large numbers of characters, usually CJK, into the
2263    script=Hani range between the last regular script and the first
2264    implicit CE. (The first group of implicit CEs is for Han
2265    characters.) Therefore, <code>[last regular]</code> is set to
2266    the first Hani CE, the artificial script boundary CE at the
2267    beginning of this range. For example: <code>&amp;[last
2268    regular]&lt;*亜唖娃阿...</code></p>
    <p><code>[last trailing]</code> is the CE of U+FFFF. Tailoring to that
2270    is not allowed.</p>
2271    <p>The <code>[last variable]</code> indicates the "highest"
2272    character that is treated as punctuation with alternate
2273    handling.</p>
2274    <p>The value can be changed by using the maxVariable setting.
2275    This takes effect, however, after the rules have been built,
2276    and does not affect any characters that are reset relative to
2277    the <code>[last variable]</code> value when the rules are being
2278    built. The maxVariable setting might also be changed via a
2279    runtime parameter. That also does not affect the rules.<br>
2280    (In CLDR 24 and earlier, the variable top could also be set by
2281    using a tailoring rule with <code>[variable top]</code> in the
2282    place of a relation string.)</p>
2283    <h3>3.12 <a name="Special_Purpose_Commands" href=
2284    "#Special_Purpose_Commands" id=
2285    "Special_Purpose_Commands">Special-Purpose Commands</a></h3>
2286    <p>The import command imports rules from another collation.
2287    This allows for better maintenance and smaller rule sizes. The
2288    source is a BCP 47 language tag with an optional collation type
2289    but without other extensions. The collation type is the BCP 47
2290    form of the collation type in the source; it defaults to
2291    "standard".</p>
2292    <p><em>Examples:</em></p>
2293    <ul>
2294      <li><code>[import de-u-co-phonebk]</code> &nbsp; (not
2295      "...-co-phonebook")</li>
2296      <li><code>[import und-u-co-search]</code> &nbsp; (not
2297      "root-...")</li>
2298      <li><code>[import ja-u-co-private-kana]</code> &nbsp;
2299      (language "ja" required even when this import itself is in
2300      another "ja" tailoring.)</li>
2301    </ul>
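    <p>To illustrate how the source of an import is read, the
    following Python sketch splits an import command into the
    language tag and the collation type, applying the "standard"
    default. The function name <code>parse_import</code> is
    hypothetical, and the sketch assumes a well-formed tag whose only
    extension, if any, is <code>-u-co-</code>.</p>

```python
def parse_import(command):
    """Return (language_tag, collation_type) for an [import ...] command."""
    body = command.strip()
    if not (body.startswith("[import ") and body.endswith("]")):
        raise ValueError("not an import command: " + repr(command))
    tag = body[len("[import "):-1].strip()
    language, separator, collation_type = tag.partition("-u-co-")
    if not separator:
        # No collation type given: it defaults to "standard".
        collation_type = "standard"
    if not language or not collation_type:
        raise ValueError("incomplete import source: " + repr(tag))
    return language, collation_type


print(parse_import("[import de-u-co-phonebk]"))
print(parse_import("[import ja-u-co-private-kana]"))
```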
2302    <table>
2303      <caption>
2304        <a name="Special_Purpose_Elements" href=
2305        "#Special_Purpose_Elements" id=
2306        "Special_Purpose_Elements">Special-Purpose Elements</a>
2307      </caption>
2308      <tr>
2309        <th>Rule Syntax</th>
2310      </tr>
2311      <tr>
2312        <td>[suppressContractions [Љ-ґ]]</td>
2313      </tr>
2314      <tr>
2315        <td>[optimize [Ά-ώ]]</td>
2316      </tr>
2317    </table>
2318    <p>The <i>suppress contractions</i> tailoring command turns off
2319    any existing contractions that begin with those characters, as
2320    well as any prefixes for those characters. It is typically used
2321    to turn off the Cyrillic contractions in the UCA, since they
2322    are not used in many languages and have a considerable
2323    performance penalty. The argument is a <a href=
2324    "tr35.html#Unicode_Sets">Unicode Set</a>.</p>
2325    <p>The <i>suppress contractions</i> command has immediate
2326    effect on the current set of mappings, including mappings added
2327    by preceding rules. Following rules are processed after
2328    removing any context-sensitive mappings originating from any of
2329    the characters in the set.</p>
2330    <p>The <i>optimize</i> tailoring command is purely for
2331    performance. It indicates that those characters are
2332    sufficiently common in the target language for the tailoring
2333    that their performance should be enhanced.</p>
2334    <p>The reason that these are not settings is so that their
2335    contents can be arbitrary characters.</p>
2336    <hr width="50%">
2337    <p><i>Example:</i></p>
2338    <p>The following is a simple example that combines portions of
2339    different tailorings for illustration. For more complete
2340    examples, see the actual locale data: <a href=
2341    "https://github.com/unicode-org/cldr/tree/latest/common/collation/ja.xml">
2342    Japanese</a>, <a href=
2343    "https://github.com/unicode-org/cldr/tree/latest/common/collation/zh.xml">
2344    Chinese</a>, <a href=
2345    "https://github.com/unicode-org/cldr/tree/latest/common/collation/sv.xml">
2346    Swedish</a>, and <a href=
2347    "https://github.com/unicode-org/cldr/tree/latest/common/collation/de.xml">
2348    German</a> (type="phonebook") are particularly
2349    illustrative.</p>
2350    <pre>&lt;collation&gt;
2351  &lt;cr&gt;&lt;![CDATA[
2352    [caseLevel on]
2353    &amp;Z
2354    &lt; æ &lt;&lt;&lt; Æ
2355    &lt; å &lt;&lt;&lt; Å &lt;&lt;&lt; aa &lt;&lt;&lt; aA &lt;&lt;&lt; Aa &lt;&lt;&lt; AA
2356    &lt; ä &lt;&lt;&lt; Ä
2357    &lt; ö &lt;&lt;&lt; Ö &lt;&lt; ű &lt;&lt;&lt; Ű
2358    &lt; ő &lt;&lt;&lt; Ő &lt;&lt; ø &lt;&lt;&lt; Ø
2359    &amp;V &lt;&lt;&lt;* wW
2360    &amp;Y &lt;&lt;&lt;* üÜ
    &amp;[last regular]
2362    <span style=
2363"color: green"># The following is equivalent to &lt;亜&lt;唖&lt;娃...</span>
2364    &lt;* 亜唖娃阿哀愛挨姶逢葵茜穐悪握渥旭葦芦
2365    &lt;* 鯵梓圧斡扱
2366  ]]&gt;&lt;/cr&gt;
2367&lt;/collation&gt;</pre>
2368    <h3>3.13 <a name="Script_Reordering" href="#Script_Reordering"
2369    id="Script_Reordering">Collation Reordering</a></h3>
2370    <p>Collation reordering allows scripts and certain other
2371    defined blocks of characters to be moved relative to each other
2372    parametrically, without changing the detailed rules for all the
2373    characters involved. This reordering is done on top of any
2374    specific ordering rules within the script or block currently in
2375    effect. Reordering can specify groups to be placed at the start
2376    and/or the end of the collation order. For example, to reorder
2377    Greek characters before Latin characters, and digits afterwards
2378    (but before other scripts), the following can be used:</p>
2379    <table>
2380      <tr>
2381        <th>Rule Syntax</th>
2382        <th>Locale Identifier</th>
2383      </tr>
2384      <tr>
2385        <td><code>[reorder Grek Latn digit]</code></td>
2386        <td><code>en-u-kr-grek-latn-digit</code></td>
2387      </tr>
2388    </table>
2389    <p>In each case, a sequence of
2390    <em><strong>reorder_codes</strong></em> is used, separated by
2391    spaces in the settings attribute and in rule syntax, and by
2392    hyphens in locale identifiers.</p>
2393    <p>A <strong><em>reorder_code</em></strong> is any of the
2394    following special codes:</p>
2395    <ol>
2396      <li><strong>space, punct, symbol, currency, digit</strong> -
2397      core groups of characters below 'a'</li>
2398      <li>
2399        <strong>any script code</strong> except
2400        <strong>Common</strong> and <strong>Inherited</strong>.
2401        <ul>
2402          <li>Some pairs of scripts sort primary-equal and always
          reorder together. For example, Katakana characters are
          always reordered with Hiragana.</li>
2405        </ul>
2406      </li>
2407      <li><strong>others</strong> - where all codes not explicitly
2408      mentioned should be ordered. The script code
2409      <strong>Zzzz</strong> (Unknown Script) is a synonym for
2410      <strong>others</strong>.</li>
2411    </ol>
2412    <p>It is an error if a code occurs multiple times.</p>
2413    <p>It is an error if the sequence of reorder codes is empty in
2414    the XML attribute or in the locale identifier. Some
2415    implementations may interpret an empty sequence in the
2416    <code>[reorder]</code> rule syntax as a reset to the DUCET
    ordering, synonymous with <code>[reorder others]</code>; other
2418    implementations may forbid an empty sequence in the rule syntax
2419    as well.</p>
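    <p>The two notations and the error conditions above can be
    sketched as follows. This Python illustration is not normative;
    the function names are hypothetical, duplicates are detected
    case-insensitively (an assumption of this sketch), and an empty
    sequence is rejected in both notations, which corresponds to the
    stricter of the two behaviors permitted for rule syntax.</p>

```python
def check_reorder_codes(codes):
    """Raise ValueError for an empty sequence or a repeated code."""
    if not codes:
        raise ValueError("the sequence of reorder codes is empty")
    seen = set()
    for code in codes:
        key = code.lower()
        if key in seen:
            raise ValueError("reorder code occurs multiple times: " + code)
        seen.add(key)


def reorder_rule(codes):
    """Rule syntax: codes separated by spaces."""
    check_reorder_codes(codes)
    return "[reorder " + " ".join(codes) + "]"


def reorder_locale_id(language, codes):
    """Locale identifier: lowercase codes separated by hyphens."""
    check_reorder_codes(codes)
    return language + "-u-kr-" + "-".join(code.lower() for code in codes)


codes = ["Grek", "Latn", "digit"]
print(reorder_rule(codes))
print(reorder_locale_id("en", codes))
```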
2420    <p>Interaction with <strong>alternate=shifted</strong>: Whether
2421    a primary weight is “variable” is determined according to the
2422    “variable top”, before applying script reordering. Once that is
2423    determined, script reordering is applied to the primary weight
2424    regardless of whether it is “regular” (used in the primary
2425    level) or “shifted” (used in the quaternary level).</p>
2426    <h4>3.13.1 <a name="Interpretation_reordering" href=
2427    "#Interpretation_reordering" id=
2428    "Interpretation_reordering">Interpretation of a reordering
2429    list</a></h4>
2430    <p>The reordering list is interpreted as if it were processed
2431    in the following way.</p>
2432    <ol>
2433      <li>If any core code is not present, then it is inserted at
2434      the front of the list in the order given above.</li>
2435      <li>If the <strong>others</strong> code is not present, then
2436      it is inserted at the end of the list.</li>
2437      <li>The <strong>others</strong> code is replaced by the list
2438      of all script codes not explicitly mentioned, in DUCET
2439      order.</li>
2440      <li>The reordering list is now complete, and used to reorder
2441      characters in collation accordingly.</li>
2442    </ol>
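    <p>The steps above can be sketched in Python as follows. The
    function name <code>complete_reordering</code> is hypothetical,
    and the default script order passed in the example is a short
    made-up stand-in for the real DUCET order, which is much longer
    and is derived from the collation data.</p>

```python
CORE_CODES = ["space", "punct", "symbol", "currency", "digit"]


def complete_reordering(codes, default_script_order):
    """Complete a reordering list following the four steps above."""
    # Zzzz is a synonym for others.
    result = ["others" if code == "Zzzz" else code for code in codes]
    # Step 1: insert missing core codes at the front, in the order given.
    missing_core = [code for code in CORE_CODES if code not in result]
    result = missing_core + result
    # Step 2: append others if it is not present.
    if "others" not in result:
        result.append("others")
    # Step 3: replace others by all scripts not explicitly mentioned.
    mentioned = set(result)
    rest = [s for s in default_script_order if s not in mentioned]
    position = result.index("others")
    # Step 4: the list is now complete.
    return result[:position] + rest + result[position + 1:]


# Hypothetical, heavily shortened default order for illustration only.
SAMPLE_ORDER = ["Latn", "Grek", "Cyrl", "Arab", "Hani"]
print(complete_reordering(["Grek", "Latn", "digit"], SAMPLE_ORDER))
```

    <p>With this sample order, <code>[reorder Grek Latn digit]</code>
    keeps the four remaining core groups first, then Greek, Latin,
    and digits, followed by the scripts that were not mentioned.</p>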
2443    <p>The locale data may have a particular ordering. For example,
2444    the Czech locale data could put digits after all letters, with
    <code>[reorder others digit]</code>. Any reordering codes
    specified on top of that (such as with a BCP 47 locale
2447    identifier) completely replace what was there. To specify a
2448    version of collation that completely resets any existing
2449    reordering to the DUCET ordering, the single code
2450    <strong>Zzzz</strong> or <strong>others</strong> can be used,
2451    as below.</p>
2452    <p><em>Examples:</em></p>
2453    <table cellpadding="0" cellspacing="0">
2454      <tbody>
2455        <tr>
2456          <th>Locale Identifier</th>
2457          <th>Effect</th>
2458        </tr>
2459        <tr>
2460          <td><code>en-u-kr-latn-digit</code></td>
2461          <td>Reorder digits after Latin characters (but before
2462          other scripts like Cyrillic).</td>
2463        </tr>
2464        <tr>
2465          <td><code>en-u-kr-others-digit</code></td>
2466          <td>Reorder digits after all other characters.</td>
2467        </tr>
2468        <tr>
2469          <td><code>en-u-kr-arab-cyrl-others-symbol</code></td>
2470          <td>Reorder Arabic characters first, then Cyrillic, and
2471          put symbols at the end—after all other characters.</td>
2472        </tr>
2473        <tr>
2474          <td><code>en-u-kr-others</code></td>
2475          <td>Remove any locale-specific reordering, and use DUCET
2476          order for reordering blocks.</td>
2477        </tr>
2478      </tbody>
2479    </table>
2480    <p>The default reordering groups are defined by the
2481    FractionalUCA.txt file, based on the primary weights of
2482    associated collation elements. The file contains special
2483    mappings for the start of each group, script, and
    reorder-reserved range; see <i>Section 2.6.2, <a href=
2485    "#File_Format_FractionalUCA_txt">FractionalUCA.txt</a></i>.</p>
2486    <p>There are some special cases:</p>
2487    <ul>
2488      <li>The <strong>Hani</strong> group includes implicit weights
2489      for <em>Han characters</em> according to the UCA as well as
2490      any characters tailored relative to a Han character, or after
2491      <code>&amp;[first Hani]</code>.</li>
2492      <li>Implicit weights for <em>unassigned code points</em>
2493      according to the UCA reorder as the last weights in the
2494      <strong>others</strong> (<strong>Zzzz</strong>) group.<br>
2495      There is no script code to explicitly reorder the
2496      unassigned-implicit weights into a particular position.
2497      (Unassigned-implicit weights are used for non-Hani code
2498      points without any mappings. For a given Unicode version they
2499      are the code points with General_Category values Cn, Co,
2500      Cs.)</li>
2501      <li>The TRAILING group, the FIELD-SEPARATOR (associated with
2502      U+FFFE), and collation elements with only zero primary
2503      weights are not reordered.</li>
2504      <li>The TERMINATOR, LEVEL-SEPARATOR, and SPECIAL groups are
2505      never associated with characters.</li>
2506    </ul>
2507    <p>For example, <code>reorder="Hani Zzzz Grek"</code> sorts
2508    Hani, Latin, Cyrillic, ... (all other scripts) ..., unassigned,
2509    Greek, TRAILING.</p>
2510    <p>Notes for implementations that write sort keys:</p>
2511    <ul>
2512      <li>Primaries must always be offset by one or more whole
2513      primary lead bytes. (Otherwise the number of bytes in a
2514      fractional weight may change, compressible scripts may span
2515      multiple lead bytes, or trailing primary bytes may collide
2516      with separators and primary-compression terminators.)</li>
2517      <li>When a script is reordered that does not start and end on
2518      whole-primary-lead-byte boundaries, then the lead byte needs
2519      to be “split”, and a reserved byte is used up. The data
2520      supports this via reorder-reserved ranges of primary weights
2521      that are not used for collation elements.</li>
2522      <li>Primary weights from different original lead bytes can be
2523      reordered to a shared lead byte, as long as they do not
2524      overlap. Primary compression ends when the target lead byte
2525      differs or when the original lead byte of the next primary is
2526      not compressible.</li>
2527      <li>Non-compressible groups and scripts begin or end on
2528      whole-primary-lead-byte boundaries (or both), so that
2529      reordering cannot surround a non-compressible script by two
2530      compressible ones within the same target lead byte. This is
2531      so that primary compression can be terminated reliably
2532      (choosing the low or high terminator byte) simply by
2533      comparing the previous and current primary weights. Otherwise
2534      it would have to also check for another condition (e.g.,
2535      equal scripts).</li>
2536    </ul>
2537    <h4>3.13.2 <a name="Reordering_Groups_allkeys" href=
2538    "#Reordering_Groups_allkeys" id=
2539    "Reordering_Groups_allkeys">Reordering Groups for
2540    allkeys.txt</a></h4>
2541    <p>For allkeys_CLDR.txt, the start of each reordering group can
2542    be determined from FractionalUCA.txt, by finding the first real
2543    mapping (after “xyz first primary”) of that group (e.g.,
2544    <code>0060; [0D 07, 05, 05] # Zyyy Sk [0312.0020.0002] * GRAVE
    ACCENT</code>), and looking for that mapping's character
    sequence (<code>0060</code>) in allkeys_CLDR.txt. The comment
    in FractionalUCA.txt (<code>[0312.0020.0002]</code>) also
2548    shows the allkeys_CLDR.txt collation elements.</p>
2549    <p>The DUCET ordering of some characters is slightly different
2550    from the CLDR root collation order. The reordering groups for
2551    the DUCET are not specified. The following describes how
2552    reordering groups for the DUCET can be derived.</p>
2553    <p>For allkeys_DUCET.txt, the start of each reordering group is
2554    normally the primary weight corresponding to the same character
2555    sequence as for allkeys_CLDR.txt. In a few cases this requires
2556    adjustment, especially for the special reordering groups, due
2557    to CLDR’s ordering the common characters more strictly by
2558    category than the DUCET (as described in <i>Section 2, <a href=
2559    "#Root_Collation">Root Collation</a></i>). The necessary
2560    adjustment would set the start of each allkeys_DUCET.txt
2561    reordering group to the primary weight of the first mapping for
2562    the relevant General_Category for a special reordering group
2563    (for characters that sort before ‘a’), or the primary weight of
2564    the first mapping for the first script (e.g., sc=Grek) of an
2565    “alphabetic” group (for characters that sort at or after
2566    ‘a’).</p>
2567    <p>Note that the following only applies to primary weights
2568    greater than the one for U+FFFE and less than "trailing"
2569    weights.</p>
2570    <p>The special reordering groups correspond to General_Category
2571    values as follows:</p>
2572    <ul>
2573      <li>punct: P</li>
2574      <li>symbol: Sk, Sm, So</li>
2575      <li>space: Z, Cc</li>
2576      <li>currency: Sc</li>
2577      <li>digit: Nd</li>
2578    </ul>
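    <p>The correspondence above can be sketched with Python's
    standard <code>unicodedata</code> module. The function name
    <code>special_group</code> is hypothetical. The sketch looks only
    at the General_Category of a single character; it does not check
    that the character actually sorts below ‘a’, and it does not
    model the DUCET exceptions for other General_Category values.</p>

```python
import unicodedata


def special_group(ch):
    """Map a character to a special reordering group by General_Category.

    Returns None when the category is not covered by the list above.
    """
    gc = unicodedata.category(ch)
    if gc.startswith("P"):
        return "punct"
    if gc in ("Sk", "Sm", "So"):
        return "symbol"
    if gc.startswith("Z") or gc == "Cc":
        return "space"
    if gc == "Sc":
        return "currency"
    if gc == "Nd":
        return "digit"
    return None


for ch in ("!", "`", " ", "$", "7", "a"):
    print(repr(ch), special_group(ch))
```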
2579    <p>In the DUCET, some characters that sort below ‘a’ and have
2580    other General_Category values not mentioned above (e.g., gc=Lm)
2581    are also grouped with symbols. Variants of numbers (gc=No or
2582    Nl) can be found among punctuation, symbols, and digits.</p>
2583    <p>Each collation element of an expansion may be in a different
2584    reordering group, for example for parenthesized characters.</p>
2585    <h3>3.14 <a name="Case_Parameters" href="#Case_Parameters" id=
2586    "Case_Parameters">Case Parameters</a></h3>
2587    <p>The <strong>case level</strong> is an <em>optional</em>
2588    intermediate level ("2.5") between Level 2 and Level 3 (or
2589    after Level 1, if there is no Level 2 due to strength
2590    settings). The case level is used to support two parametric
2591    features: ignoring non-case variants (Level 3 differences)
2592    except for case, and giving case differences a higher-level
2593    priority than other tertiary differences. Distinctions between
2594    small and large Kana characters are also included as case
2595    differences, to support Japanese collation.</p>
2596    <p>The <strong>case first</strong> parameter controls whether
2597    to swap the order of upper and lowercase. It can be used with
2598    or without the case level.</p>
2599    <p>Importantly, the case parameters have no effect in many
2600    instances. For example, they have no effect on the comparison
2601    of two non-ignorable characters with different primary weights,
2602    or with different secondary weights if the strength =
2603    <strong>secondary (or higher).</strong></p>
2604    <p>When either the <strong>case level</strong> or <strong>case
2605    first</strong> parameters are set, the following describes the
2606    derivation of the modified collation elements. It assumes the
2607    original levels for the code point are [p.s.t] (primary,
2608    secondary, tertiary). This derivation may change in future
2609    versions of LDML, to track the case characteristics more
2610    closely.</p>
2611    <h4>3.14.1 <a name="Case_Untailored" href="#Case_Untailored"
2612    id="Case_Untailored">Untailored Characters</a></h4>
2613    <p>For untailored characters and strings, that is, for mappings
2614    in the root collation, the case value for each collation
2615    element is computed from the tertiary weight listed in
2616    allkeys_CLDR.txt. This is used to modify the collation
2617    element.</p>
2618    <p>Look up a case value for the tertiary weight x of each
2619    collation element:</p>
2620    <ol>
2621      <li>UPPER if x ∈ {08-0C, 0E, 11, 12, 1D}</li>
2622      <li>UNCASED otherwise</li>
2623      <li>FractionalUCA.txt encodes the case information in bits 6
2624      and 7 of the first byte in each tertiary weight. The case
2625      bits are set to 00 for UNCASED and LOWERCASE, and 10 for
2626      UPPER. There is no MIXED case value (01) in the root
2627      collation.</li>
2628    </ol>
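    <p>The lookup of the case value from an allkeys_CLDR.txt tertiary
    weight can be sketched as follows. The names
    <code>UPPER_TERTIARIES</code> and <code>case_value</code> are
    hypothetical, and the sketch assumes that the set above lists
    hexadecimal tertiary weights, with 08-0C denoting an inclusive
    range.</p>

```python
# Tertiary weights (hexadecimal) whose case value is UPPER.
UPPER_TERTIARIES = set(range(0x08, 0x0C + 1)) | {0x0E, 0x11, 0x12, 0x1D}


def case_value(tertiary):
    """Return the case value for one collation element's tertiary weight."""
    return "UPPER" if tertiary in UPPER_TERTIARIES else "UNCASED"


print(case_value(0x08))  # a weight in the UPPER set
print(case_value(0x02))  # any other weight
```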
2629    <h4>3.14.2 <a name="Case_Weights" href="#Case_Weights" id=
2630    "Case_Weights">Compute Modified Collation Elements</a></h4>
2631    <p>From a computed case value, set a weight <strong>c</strong>
2632    according to the following.</p>
2633    <ol>
2634      <li>If <strong>CaseFirst=UpperFirst</strong>, set
2635      <strong>c</strong> = UPPER ? <strong>1</strong> : MIXED ? 2 :
2636      <strong>3</strong></li>
2637      <li>Otherwise set <strong>c</strong> = UPPER ?
2638      <strong>3</strong> : MIXED ? 2 : <strong>1</strong></li>
2639    </ol>
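    <p>As a small sketch, the two rules above can be written as a
    single function. The string constants and the function name
    are illustrative only.</p>

```python
def case_weight(case: str, upper_first: bool) -> int:
    """Map a case value (UPPER, MIXED, or anything else) to the weight c."""
    if upper_first:
        # CaseFirst=UpperFirst: uppercase gets the smallest weight.
        return 1 if case == "UPPER" else 2 if case == "MIXED" else 3
    # Otherwise: uppercase gets the largest weight.
    return 3 if case == "UPPER" else 2 if case == "MIXED" else 1
```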
2640    <p>Compute a new collation element according to the following
2641    table. The notation <em>xt</em> means that the values are
2642    numerically combined into a single level, such that xt &lt; yu
2643    whenever x &lt; y. The fourth level (if it exists) is
2644    unaffected. Note that a secondary CE must have a secondary
2645    weight S which is greater than the secondary weight s of any
2646    primary CE; and a tertiary CE must have a tertiary weight T
2647    which is greater than the tertiary weight t of any primary or
2648    secondary CE ([<a href=
2649    "https://www.unicode.org/reports/tr41/#UTS10">UCA</a>] <a href=
2650    "https://www.unicode.org/reports/tr10/#WF2">WF2</a>).</p>
2651    <div align="center">
2652      <table>
2653        <tbody>
2654          <tr>
2655            <th>Case Level</th>
2656            <th>Strength</th>
2657            <th>Original CE</th>
2658            <th>Modified CE</th>
2659            <th>Comment</th>
2660          </tr>
2661          <tr>
2662            <td rowspan="5"><strong>on</strong></td>
2663            <td rowspan="2"><strong>primary</strong></td>
2664            <td><code>0.S.t</code></td>
2665            <td><code>0.0</code></td>
2666            <td rowspan="2">ignore case level weights of
2667            primary-ignorable CEs</td>
2668          </tr>
2669          <tr>
2670            <td><code>p.s.t</code></td>
2671            <td><code>p.c</code></td>
2672          </tr>
2673          <tr>
2674            <td rowspan="3"><strong>secondary<br></strong> or
2675            higher</td>
2676            <td><code>0.0.T</code></td>
2677            <td><code>0.0.0.T</code></td>
2678            <td rowspan="3">ignore case level weights of
2679            secondary-ignorable CEs</td>
2680          </tr>
2681          <tr>
2682            <td><code>0.S.t</code></td>
2683            <td><code>0.S.c.t</code></td>
2684          </tr>
2685          <tr>
2686            <td><code>p.s.t</code></td>
2687            <td><code>p.s.c.t</code></td>
2688          </tr>
2689          <tr>
2690            <td rowspan="4"><strong>off</strong></td>
2691            <td rowspan="4">any</td>
2692            <td><code>0.0.0</code></td>
2693            <td><code>0.0.00</code></td>
2694            <td rowspan="4">ignore case level weights of
2695            tertiary-ignorable CEs</td>
2696          </tr>
2697          <tr>
2698            <td><code>0.0.T</code></td>
2699            <td><code>0.0.3T</code></td>
2700          </tr>
2701          <tr>
2702            <td><code>0.S.t</code></td>
2703            <td><code>0.S.ct</code></td>
2704          </tr>
2705          <tr>
2706            <td><code>p.s.t</code></td>
2707            <td><code>p.s.ct</code></td>
2708          </tr>
2709        </tbody>
2710      </table>
2711    </div>
2712    <p>For primary+case, which is used for “ignore accents but not
2713    case” collation, primary ignorables are ignored so that a = ä.
2714    For secondary+case, which would by analogy mean “ignore
2715    variants but not case”, secondary ignorables are ignored for
2716    equivalent behavior.</p>
2717    <p>When using <strong>caseFirst</strong> but not
2718    <strong>caseLevel</strong>, the combined case+tertiary weight
2719    of a tertiary CE must be greater than the combined
2720    case+tertiary weight of any primary or secondary CE so that
2721    [<a href="https://www.unicode.org/reports/tr41/#UTS10">UCA</a>]
2722    <a href=
2723    "https://www.unicode.org/reports/tr10/#WF2">well-formedness
2724    condition 2</a> is fulfilled. Since the tertiary CE’s tertiary
2725    weight T is already greater than any t of primary or secondary
2726    CEs, it is sufficient to set its case weight to UPPER=3. It
2727    must not be affected by <strong>caseFirst=upper</strong>. (The
2728    table uses the constant 3 in this case rather than the computed
2729    c.)</p>
2730    <p>The case weight of a tertiary-ignorable CE must be 0 so that
2731    [<a href="https://www.unicode.org/reports/tr41/#UTS10">UCA</a>]
2732    <a href=
2733    "https://www.unicode.org/reports/tr10/#WF1">well-formedness
2734    condition 1</a> is fulfilled.</p>
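    <p>The table and the two well-formedness notes above can be
    sketched as one function. This is an illustration under stated
    assumptions, not a reference implementation: the combined
    level <em>xt</em> is modeled as <code>(x &lt;&lt; 8) | t</code>,
    which assumes tertiary weights fit in 8 bits; strength is
    passed as a string; and at primary strength every
    primary-ignorable CE (not only 0.S.t) is mapped to 0.0. All
    names are hypothetical.</p>

```python
def combine(x: int, t: int) -> int:
    """Combine x and t into one level so that xt < yu whenever x < y.

    Assumption: t fits in 8 bits.
    """
    return (x << 8) | t


def modified_ce(p, s, t, c, case_level, strength="tertiary"):
    """Return the modified CE as a tuple of levels, following the table."""
    if case_level:
        if strength == "primary":
            # Primary-ignorable CEs lose their case level weight.
            return (0, 0) if p == 0 else (p, c)
        if p == 0 and s == 0:
            # Secondary-ignorable CEs get case weight 0.
            return (0, 0, 0, t)
        return (p, s, c, t)
    if p == 0 and s == 0:
        # Tertiary CEs use the constant 3 (not c); 0.0.0 keeps case weight 0.
        return (0, 0, combine(3, t) if t != 0 else 0)
    return (p, s, combine(c, t))
```

    <p>Using the constant 3 for tertiary CEs keeps their combined
    weight above that of any primary or secondary CE regardless of
    the <strong>caseFirst</strong> setting.</p>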
2735    <h4>3.14.3 <a name="Case_Tailored" href="#Case_Tailored" id=
2736    "Case_Tailored">Tailored Strings</a></h4>
2737    <p>Characters and strings that are tailored have case values
2738    computed from their root collation case bits.</p>
2739    <ol>
2740      <li>Look up the tailored string’s root CEs. (Ignore any
2741      prefix or extension strings.) N=number of primary root
2742      CEs.</li>
2743      <li>Determine the number and type (primary vs. weaker) of CEs
2744      a tailored string maps to. M=number of primary tailored
2745      CEs.</li>
2746      <li>If N&lt;=M (no more root than tailoring primary CEs):
2747      Copy the root case bits for primary CEs 0..N-1.
2748        <ul>
2749          <li>If N&lt;M (fewer root primary CEs): Clear the case
2750          bits of the remaining tailored primary CEs.
2751          (uncased/lowercase/small Kana)</li>
2752        </ul>
2753      </li>
2754      <li>If N&gt;M (more root primary CEs): Copy the root case
2755      bits for primary CEs 0..M-2. Set the case bits for tailored
2756      primary CE M-1 according to the remaining root primary CEs
2757      M-1..N-1:
2758        <ul>
2759          <li>Set to uncased/lower if all remaining root primary
2760          CEs have uncased/lower.</li>
2761          <li>Set to uppercase if all remaining root primary CEs
2762          have uppercase.</li>
2763          <li>Otherwise, set to mixed.</li>
2764        </ul>
2765      </li>
2766      <li>Clear the case bits for secondary CEs 0.s.t.</li>
2767      <li>Tertiary CEs 0.0.t must get uppercase bits.</li>
2768      <li>Tertiary-ignorable CEs 0.0.0 must get
2769      ignorable-case=lowercase bits.</li>
2770    </ol>
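    <p>The steps above can be sketched as follows. The sketch
    assumes that the root primary CEs are given as a list of case
    values ("LOWER" standing for uncased/lowercase, or "UPPER"),
    and that the tailored CEs are given as a list of kinds
    ("primary", "secondary", "tertiary", "ignorable"). These
    representations and the function name are hypothetical.</p>

```python
def tailored_case_bits(root_primary_cases, tailored_kinds):
    """Return one case value per tailored CE, following steps 3 to 7."""
    n = len(root_primary_cases)                              # root primary CEs
    m = sum(1 for k in tailored_kinds if k == "primary")     # tailored primary CEs
    result = []
    i = 0  # index among the tailored primary CEs
    for kind in tailored_kinds:
        if kind == "primary":
            if i < n and (n <= m or i < m - 1):
                # Copy the root case bits one-to-one.
                case = root_primary_cases[i]
            elif n > m and i == m - 1:
                # Last tailored primary CE summarizes the remaining root CEs.
                rest = set(root_primary_cases[m - 1:])
                case = rest.pop() if len(rest) == 1 else "MIXED"
            else:
                # N < M: clear the case bits of the extra tailored CEs.
                case = "LOWER"
            i += 1
        elif kind == "tertiary":
            case = "UPPER"   # tertiary CEs 0.0.t get uppercase bits
        else:
            case = "LOWER"   # secondary and tertiary-ignorable CEs are cleared
        result.append(case)
    return result
```

    <p>For instance, a tailored string with one primary CE whose
    root mapping has an uppercase CE followed by a lowercase CE
    ends up with the mixed case value.</p>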
2771    <p class="note">Note: Almost all Cased characters have primary
2772    (non-ignorable) root collation CEs, except for U+0345 Combining
2773    Ypogegrammeni which is Lowercase. All Uppercase characters have
2774    primary root collation CEs.</p>
2775    <h3>3.15 <a name="Visibility" href="#Visibility" id=
2776    "Visibility">Visibility</a></h3>
2777    <p>Collations have external visibility by default, meaning that
2778    they can be displayed in a list of collation options for users
2779    to choose from. A collation whose type name starts with
2780    "private-" is internal and should not be shown in such a list.
2781    Collations are typically internal when they are partial
2782    sequences included in other collations. See <i>Section 3.1,
    <a href="#Collation_Types">Collation Types</a></i>.</p>
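    <p>A user-facing list of collation options could apply this
    rule with a simple prefix filter. The following is a minimal
    sketch; the function name and the sample type names are
    illustrative only.</p>

```python
def visible_collation_types(type_names):
    """Keep only collation types that may be shown to users."""
    return [name for name in type_names if not name.startswith("private-")]
```

    <p>Given the sample input <code>["standard", "private-pinyin",
    "search"]</code>, only the first and last entries remain.</p>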
2784    <h3>3.16 <a name="Collation_Indexes" href="#Collation_Indexes"
2785    id="Collation_Indexes">Collation Indexes</a></h3>
2786    <h4>3.16.1 <a name="Index_Characters" href="#Index_Characters"
2787    id="Index_Characters">Index Characters</a></h4>
2788    <p>The main data includes &lt;exemplarCharacters&gt; for
2789    collation indexes. See <i>Part 2 General, Section 3, <a href=
2790    "tr35-general.html#Character_Elements">Character
2791    Elements</a></i>, for general information about exemplar
2792    characters.</p>
2793    <p>The index characters are a set of characters for use as a UI
2794    "index", that is, a list of clickable characters (or character
2795    sequences) that allow the user to see a segment of a larger
    "target" list. Each character corresponds to a bucket in the
    target list. There are two kinds of index lists: one that is
    relatively static, and one that produces roughly equally sized
    buckets. While CLDR is mostly focused on the first, there is
    provision for supporting the second as well.</p>
2802    <p>The index characters need to be used in conjunction with a
2803    collation for the locale, which will determine the order of the
2804    characters. It will also determine which index characters show
2805    up.</p>
2806    <p>The static list would be presented as something like the
2807    following (either vertically or horizontally):</p>
2808    <p align="center">…&nbsp;A B C D E F G H CH I J K L M N O P Q R
2809    S T U V W X Y Z&nbsp;…</p>
2810    <p>In the "A" bucket, you would find all items that are primary
2811    greater than or equal to "A" in collation order, and primary
2812    less than "B". The use of the list requires that the target
2813    list be sorted according to the locale that is used to create
2814    that list. Although we say "character" above, the index
2815    character could be a sequence, like "CH" above. The index
2816    exemplar characters must always be used with a collation
2817    appropriate for the locale. Any characters that do not have
2818    primary differences from others in the set should be
2819    removed.</p>
2820    <p>Details:</p>
2821    <ol>
      <li>The primary weight (according to the collation) is used
      to determine which bucket a string is in. There are special
      buckets for strings that sort before the first index
      character, between buckets of different scripts, and after
      the last bucket (when they belong to a different
      script).</li>
2827      <li>Characters in the <em>index characters</em> do not need
2828      to have distinct primary weights. That is, the <em>index
2829      characters</em> are adapted to the underlying collation:
2830      normally Ё is in the Е bucket for Russian, but if someone
2831      used a variant of Russian collation that distinguished them
2832      on a primary level, then Ё would show up as its own
2833      bucket.</li>
2834      <li>If an <em>index character</em> string ends with a single
2835      "*" (U+002A), for example "Sch*" and "St*" in German, then
2836      there will be a separate bucket for the string minus the "*",
2837      for example "Sch" and "St", even if that string does not sort
2838      distinctly.</li>
2839      <li>An <em>index character</em> can have multiple primary
2840      weights, for example "Æ" and "Sch". Names that have the same
2841      initial primary weights sort into this <em>index
2842      character</em>’s bucket. This can be achieved by using an
2843      upper-boundary string that is the concatenation of the
2844      <em>index character</em> and U+FFFF, for example "Æ\uFFFF"
2845      and "Sch\uFFFF". Names that sort greater than this upper
2846      boundary but less than the next index character are
2847      redirected to the last preceding single-primary index
2848      character (A and S for the examples here).</li>
2849    </ol>
2850    <p>For example, for index characters <code>[A Æ B R S {Sch*}
2851    {St*} T]</code> the following sample names are sorted into an
2852    index as shown.</p>
2853    <ul>
2854      <li>A — Adelbert, Afrika</li>
2855      <li>Æ — Æsculap, Aesthet</li>
2856      <li>B — Berlin</li>
2857      <li>R — Rilke</li>
2858      <li>S — Sacher, Seiler, Sultan</li>
2859      <li>Sch — Schiller</li>
2860      <li>St — Steiff</li>
2861      <li>T — Thomas</li>
2862    </ul>
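    <p>The bucketing in this example can be sketched as follows.
    The sketch does not use a real collator: it assumes a toy
    primary key (case-folded letters, with "æ" treated as "ae") in
    place of the locale's collation, and it treats a key of more
    than one letter as "multiple primary weights". Only the
    leading "…" bucket is modeled; script boundaries and the
    trailing "…" bucket are not. All names are hypothetical.</p>

```python
import bisect


def toy_primary_key(s):
    """Stand-in for a real primary sort key (assumption, not CLDR data)."""
    folded = s.casefold().replace("æ", "ae")
    return "".join(ch for ch in folded if ch.isalpha())


def build_index(index_chars):
    """Return (key, label) pairs sorted by key; a trailing '*' is dropped."""
    labels = [ic.rstrip("*") for ic in index_chars]
    return sorted((toy_primary_key(label), label) for label in labels)


def bucket_for(name, entries):
    """Return the label of the bucket that the name sorts into."""
    keys = [k for k, _ in entries]
    key = toy_primary_key(name)
    i = bisect.bisect_right(keys, key) - 1
    if i < 0:
        return "…"  # sorts before the first index character
    # Beyond the upper boundary (label + U+FFFF) of a multi-primary label:
    # redirect to the last preceding single-primary index character.
    if len(keys[i]) > 1 and key > keys[i] + "\uffff":
        while i > 0 and len(keys[i]) > 1:
            i -= 1
    return entries[i][1]
```

    <p>With the index characters from the example, "Afrika" and
    "Seiler" sort past the upper boundaries of Æ and Sch and are
    redirected to A and S respectively.</p>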
2863    <p>The&nbsp;…&nbsp;items are special: each is a bucket for
2864    everything else, either less or greater. They are inserted at
2865    the start and end of the index list, <em>and</em> on script
2866    boundaries. Each script has its own range, except where scripts
2867    sort primary-equal (e.g., Hira &amp; Kana). All characters that
2868    sort in one of the low reordering groups (whitespace,
2869    punctuation, symbols, currency symbols, digits) are treated as
2870    a single script for this purpose.</p>
2871    <p>If you tailor a Greek character into the Cyrillic script,
2872    that Greek character will be bucketed (and sorted) among the
2873    Cyrillic ones.</p>
2874    <p>Even in an implementation that reorders groups of scripts
2875    rather than single scripts, for example Hebrew together with
2876    Phoenician and Samaritan, the index boundaries are really
2877    script boundaries, <em>not</em> multi-script-group boundaries.
2878    So if you had a collation that reordered Hebrew after Ethiopic,
2879    you would still get index boundaries between the following (and
2880    in that order):</p>
2881    <ol>
2882      <li>Ethiopic</li>
2883      <li>Hebrew</li>
2884      <li>Phoenician<em>&nbsp;// included in the Hebrew reordering
2885      group</em></li>
2886      <li>Samaritan<em>&nbsp;// included in the Hebrew reordering
2887      group</em></li>
2888      <li>Devanagari</li>
2889    </ol>
2890    <p>(Beginning with CLDR 27, single scripts can be
2891    reordered.)</p>
2892    <p>In the UI, an index character could also be omitted or
2893    grayed out if its bucket is empty. For example, if there is
2894    nothing in the bucket for Q, then Q could be omitted. That
2895    would be up to the implementation. Additional buckets could be
2896    added if other characters are present. For example, we might
2897    see something like the following:</p>
2898    <table border="1" cellspacing="0">
2899      <tbody>
2900        <tr align="center">
2901          <td>
2902            <div align="center">
2903              <strong>Sample Greek Index<br></strong>
2904            </div>
2905          </td>
2906          <td><strong>Contents<br></strong></td>
2907        </tr>
2908        <tr align="center">
2909          <td>
2910            <div align="center">
2911              &nbsp;Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω
2912            </div>
2913          </td>
2914          <td>With only content beginning with Greek
2915          letters&nbsp;<br></td>
2916        </tr>
2917        <tr align="center">
2918          <td>
2919            <div align="center">
2920              &nbsp;… Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ
2921              Ω …
2922            </div>
2923          </td>
2924          <td>With some content before or after</td>
2925        </tr>
2926        <tr align="center">
2927          <td>
2928            <div align="center">
2929              &nbsp;… 9 Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ
2930              Ψ Ω …
2931            </div>
2932          </td>
2933          <td>With numbers, and nothing between 9 and Alpha</td>
2934        </tr>
2935        <tr align="center">
2936          <td>
2937            <div align="center">
2938              &nbsp; … 9&nbsp;<em>A-Z</em>&nbsp;Α Β Γ Δ Ε Ζ Η Θ Ι Κ
2939              Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω …
2940            </div>
2941          </td>
2942          <td>With numbers, some Latin</td>
2943        </tr>
2944      </tbody>
2945    </table>
2946    <p>Here is a sample of the XML structure:</p>
2947    <pre>
2948    &lt;exemplarCharacters type="index"&gt;[A B C D E F G H I J K L M N O P Q R S T U V W X Y Z]&lt;/exemplarCharacters&gt;</pre>
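    <p>For simple index exemplar sets like the one above, the
    items can be split with a small helper. This sketch handles
    only single characters and brace-delimited strings separated
    by spaces; it does not implement the full UnicodeSet syntax
    (ranges, escapes, properties). The function name is
    hypothetical.</p>

```python
import re


def parse_index_exemplars(text):
    """Split a simple set such as '[A B {Sch*}]' into its items."""
    inner = text.strip()
    if inner.startswith("[") and inner.endswith("]"):
        inner = inner[1:-1]
    items = []
    # Either a brace-delimited string, or a single non-space character.
    for match in re.finditer(r"\{([^}]*)\}|(\S)", inner):
        items.append(match.group(1) if match.group(1) is not None else match.group(2))
    return items
```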
    <p>The display of the index characters can be modified with the
    index label elements, discussed in <i>Part 2 General,
    Section 3.3, <a href="tr35-general.html#IndexLabels">Index
    Labels</a></i>.</p>
2953    <h4>3.16.2 <a name="CJK_Index_Markers" href=
2954    "#CJK_Index_Markers" id="CJK_Index_Markers">CJK Index
2955    Markers</a></h4>
2956    <p>Special index markers have been added to the CJK collations
2957    for stroke, pinyin, zhuyin, and unihan. These markers allow for
2958    effective and robust use of indexes for these collations.</p>
2959    <p>The per-language index exemplar characters are not useful
2960    for collation indexes for CJK because for each such language
2961    there are multiple sort orders in use (for example, Chinese
2962    pinyin vs. stroke vs. unihan vs. zhuyin), and these sort orders
2963    use very different index characters. In addition, sometimes the
2964    boundary strings are different from the bucket label strings.
2965    For collations that contain index markers, the boundary strings
2966    and bucket labels should be derived from those index markers,
2967    ignoring the index exemplar characters.</p>
2968    <p>For example, near the start of the pinyin tailoring there is
2969    the following:</p>
    <p>&lt;p&gt;\uFDD0A&lt;/p&gt;&lt;!-- INDEX A --&gt;<br>
    &lt;pc&gt;阿呵𥥩锕𠼞𨉚&lt;/pc&gt;&lt;!-- ā --&gt;</p>
    <p>…</p>
    <p>&lt;pc&gt;翶&lt;/pc&gt;&lt;!-- ao --&gt;<br>
    &lt;p&gt;\uFDD0B&lt;/p&gt;&lt;!-- INDEX B --&gt;</p>
2975    <p>These indicate the boundaries of "buckets" that can be used
2976    for indexing. They are always two characters starting with the
2977    noncharacter U+FDD0, and thus will not occur in normal text.
2978    For pinyin the second character is A-Z; for unihan it is one of
2979    the radicals; and for stroke it is a character after U+2800
2980    indicating the number of strokes, such as ⠁. For zhuyin the
2981    second character is one of the standard Bopomofo characters in
2982    the range U+3105 through U+3129.</p>
2983    <p>The corresponding bucket label strings are the boundary
2984    strings with the leading U+FDD0 removed. For example, the
2985    Pinyin boundary string "\uFDD0A" yields the label string
2986    "A".</p>
2987    <p>However, for stroke order, the label string is the stroke
2988    count (second character minus U+2800) as a decimal-digit number
2989    followed by 劃 (U+5283). For example, the stroke order boundary
2990    string "\uFDD0\u2805" yields the label string "5劃".</p>
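    <p>The label derivation described above can be sketched as
    follows. The function name and the use of the collation type
    name as a string argument are assumptions made for this
    illustration.</p>

```python
INDEX_MARKER = "\uFDD0"  # noncharacter that starts every boundary string


def bucket_label(boundary, collation_type):
    """Derive the bucket label from a two-character CJK boundary string."""
    if len(boundary) != 2 or boundary[0] != INDEX_MARKER:
        raise ValueError("expected U+FDD0 followed by one character")
    second = boundary[1]
    if collation_type == "stroke":
        # Stroke count is the offset from U+2800, followed by U+5283.
        return f"{ord(second) - 0x2800}\u5283"
    # pinyin, zhuyin, unihan: the label is the boundary without U+FDD0.
    return second
```

    <p>Under these assumptions, <code>bucket_label("\uFDD0A",
    "pinyin")</code> gives "A" and
    <code>bucket_label("\uFDD0\u2805", "stroke")</code> gives
    "5劃", matching the two examples in the text.</p>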
2991    <hr>
2992    <p class="copyright">Copyright © 2001–2020 Unicode, Inc. All
2993    Rights Reserved. The Unicode Consortium makes no expressed or
2994    implied warranty of any kind, and assumes no liability for
2995    errors or omissions. No liability is assumed for incidental and
2996    consequential damages in connection with or arising out of the
2997    use of the information or programs contained or accompanying
2998    this technical report. The Unicode <a href=
2999    "https://unicode.org/copyright.html">Terms of Use</a> apply.</p>
3000    <p class="copyright">Unicode and the Unicode logo are
3001    trademarks of Unicode, Inc., and are registered in some
3002    jurisdictions.</p>
3003  </div>
3004</body>
3005</html>
3006