# IDNA CONTEXT RULES (including BIDI)
# Mark Davis
# Provides a table-based mechanism for determining whether a label is a U-Label or not.

# For testing, this turns on a verbose mode that displays the resolved sets and rules.

VERBOSE:true

# If any of the following regular expressions is found in the label, then the label is not a valid U-Label.
# These rules provide a machine-readable way to test that. This is intended as a reference (test) version;
# implementations would typically use hand-coded versions that are much more optimized.
# The context rules are derived from http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#appendix-A
# However, they do contain quite a number of corrections and proposed changes.

# FILE FORMAT

# Everything at and after # is a comment and is ignored.
# Blank lines are ignored.
# Leading and trailing spaces are ignored.
# There are 3 kinds of lines: titles, rules, and variable definitions.

# A variable is defined with a line of the form $X = <unicodeSet>
# Each variable is a single unicodeSet (character range) according to http://www.unicode.org/reports/tr18/
# These variables are substituted into the rules before evaluation.

# Rules have the following formats:
# <before>; <at>; <result>
# <before> ; <result>
# Key:
#   <before> and <at> are both regular expressions
#   <result> is "fail", "next", or "next2"
# Every other kind of line is an error.

# A title is of the form "Title: ...". It is just informational, but allows the test to show why a character causes a failure.
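# For example, the three kinds of lines fit together as in the following fragment, which repeats the
# "Leading Combining Marks" section found later in this file (shown here as comments so that it is not parsed twice):
#   Title: 4.2.3.2. Leading Combining Marks
#   $M = [:M:]
#   ^ ; $M ; fail
# The title labels the rule for diagnostics, $M is a variable holding a unicodeSet, and the rule
# becomes  ^ ; [:M:] ; fail  once $M has been substituted.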

# Function
# Logically, a label is processed by iterating through its character positions.
# In each iteration, each rule is checked.
# If <before> and <at> both match, then the result is applied as follows:
#   fail: stop, the label is invalid
#   next: skip to the next rule that has a "next" result (skipping any "fail" or "next2" results)
#   next2: skip to the next rule that has a "next" or "next2" result (skipping any "fail" results)
# If the processing reaches the end of the string, then the label is valid.
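# As an illustration (a hand-worked sketch, not output from the test tool), take the rule
#   ^ ; $M ; fail
# where $M is [:M:], and a label consisting of U+0301 COMBINING ACUTE ACCENT followed by "a".
# At position 0, <before> (^) matches the empty prefix and <at> ($M) matches U+0301,
# so the result "fail" is applied and the label is invalid.
# For a label consisting of "a" followed by U+0301, <at> does not match at position 0 and <before>
# does not match at any later position, so this rule never fires.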

# The regular expressions use Java / Perl syntax, with Unicode properties.
# If the regex engine does not support Unicode properties (for the latest version of Unicode!), then explicit ranges can be substituted.
# For example, for [:bc=nsm:], use the set listed at http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:bc=nsm:]
# Interior spaces are ignored, and may be used for readability.

# The expressions are limited to a basic format that should work in any regex engine (perhaps with some syntax tweaks).
# In particular, the <before> and <at> split is used to avoid lookbehind, whose results can vary between regex engines.
# <before> is matched before the current position. Logically, it is equivalent to matching label[0,n] against /.*<before>/
# <at> is matched at the current position. Logically, it is equivalent to matching label[n,end] against /<at>.*/
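# For example (hand-worked), with the MIDDLE DOT rule that appears later in this file
#   l ; \u00B7 l ; next
# and the label "l·l" (l, U+00B7, l), the check at position n=1 matches label[0,1] = "l" against /.*l/
# and label[1,end] = "·l" against /\u00B7l.*/. Both match, so the result "next" is applied at that position.
# No lookbehind is needed: the text before the position and the text from the position onward
# are simply matched as two separate strings.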

# ===================================
# Mapping would be done before this table, and only for Lookup
# ===================================

Title: 4.2.2. Rejection of Characters that are not Permitted: fail if DISALLOWED or UNASSIGNED

# Tables

# 2.1. LetterDigits (A)

$LetterDigits = [[:Ll:] [:Lu:] [:Lo:] [:Nd:] [:Lm:] [:Mn:] [:Mc:]]

# 2.2. Unstable (B)

$Unstable = [:^nfkc_casefolded:]

# 2.3. IgnorableProperties (C) // default ignorable or whitespace or nonchar

$IgnorableProperties = [[:di:] [:WSpace:] [:NChar:]]

# 2.4. IgnorableBlocks (D)

$IgnorableBlocks = [[:block=Combining_Diacritical_Marks_For_Symbols:] [:block=Musical_Symbols:] [:block=Ancient_Greek_Musical_Notation:]]

# 2.5. LDH (E)

$LDH = [\u002D\u0030-\u0039\u0061-\u007A]

# 2.6. Exceptions (F)
# Note: added Tatweel to these, since that seems to be the current consensus.

$ExceptionDisallowed = [\u302E \u302F \u0640]
$ExceptionPvalid = [\u00DF \u03C2 \u06FD \u06FE \u0F0B \u3007]
$ExceptionContexto = [\u002D \u00B7 \u02B9 \u0375 \u0483 \u05F3 \u05F4 \u0660 \u0661 \u0662 \u0663 \u0664 \u0665 \u0666 \u0667 \u0668 \u0669 \u06F0 \u06F1 \u06F2 \u06F3 \u06F4 \u06F5 \u06F6 \u06F7 \u06F8 \u06F9 \u3005 \u303B \u30FB]

# 2.7. BackwardCompatible (G)

$BackwardCompatibleDisallowed = []
$BackwardCompatiblePvalid = []
$BackwardCompatibleContexto = []

# 2.8. JoinControl (H)

$JoinControl = [:JoinControl:]

# 2.9. OldHangulJamo (I)

$OldHangulJamo = [[:HST=L:] [:HST=V:] [:HST=T:]]

# 2.10. Unassigned (J)

$Unassigned = [[:unassigned:] - [:nchar:]]

# If .cp. .in.  Exceptions Then Exceptions(cp);
# Else If .cp. .in.  BackwardCompatible Then BackwardCompatible(cp);
# Else If .cp. .in.  Unassigned Then UNASSIGNED;
# Else If .cp. .in.  LDH Then PVALID;
# Else If .cp. .in.  JoinControl Then CONTEXTJ;
# Else If .cp. .in.  Unstable Then DISALLOWED;
# Else If .cp. .in.  IgnorableProperties Then DISALLOWED;
# Else If .cp. .in.  IgnorableBlocks Then DISALLOWED;
# Else If .cp. .in.  OldHangulJamo Then DISALLOWED;
# Else If .cp. .in.  LetterDigits Then PVALID;
# Else DISALLOWED;

# We compute which characters are invalid.
# There is no functional difference between DISALLOWED and UNASSIGNED: we can just call them invalid.

# The rules in Tables obfuscate the true situation, which is that:
#   A. some characters are always valid
#   B. otherwise LetterDigits are valid - with some subtractions
#   C. otherwise everything is invalid

# There is no functional difference between CONTEXTJ and CONTEXTO, and none at this point from PVALID either.
# We record all of them for later

$Context = [$ExceptionContexto $BackwardCompatibleContexto $JoinControl]

$ValidAlways = [$ExceptionPvalid $BackwardCompatiblePvalid $LDH]

$InvalidLetterDigits = [$ExceptionDisallowed $BackwardCompatibleDisallowed $Unassigned $Unstable $IgnorableProperties $IgnorableBlocks $OldHangulJamo]

$Valid = [$ValidAlways $Context [$LetterDigits - $InvalidLetterDigits]]
$Valid2 = [[:nfkc_casefolded:]-[:c:]-[:z:]-[:s:]-[:p:]-[:nl:]-[:no:]-[:me:]-[:di:]-[:HST=L:]-[:HST=V:]-[:HST=T:]-[:block=Combining_Diacritical_Marks_For_Symbols:]-[:block=Musical_Symbols:]-[:block=Ancient_Greek_Musical_Notation:]-[\u0640\u07FA\u302E\u302F\u3031-\u3035\u303B][:JoinControl:][\u00DF\u03C2\u06FD\u06FE\u0F0B\u3007][\u002D\u00B7\u0375\u05F3\u05F4\u30FB]]

$Invalid = [^ $Valid]

# ===================================

# At this point, we have the final list of everything that is invalid, so there is a single test

$Invalid ; fail

# ===================================

Title: 4.2.3.1. Rejection of Hyphen Sequences in U-labels // just xx--

^.. ; -- ; fail

# ===================================

Title: 4.2.3.2. Leading Combining Marks

$M = [:M:]

^ ; $M ; fail

# ===================================

# CONTEXT: 4.2.3.3. Contextual Rules
# The details here are all from Tables

Title: Appendix A.1. HYPHEN-MINUS - Can't be at start or end; that is, ok only if medial

. ; -. ; next
- ; fail

# See comment in http://www.alvestrand.no/pipermail/idna-update/2008-November/003021.html

# ==========================

# The ZWNJ and ZWJ rules are the trickiest section.
# Tables is all messed up, so this follows UAX 31 instead, from which these are derived:
# http://www.unicode.org/reports/tr31/#Layout_and_Format_Control_Characters

# There are two different kinds of rules that have to be combined.
# We don't try a script test for Arabic, because it is not needed (and is complicated)

# Variables for Arabic

$T = [:Joining_Type=Transparent:]
$R = [[:Joining_Type=Dual_Joining:][:Joining_Type=Right_Joining:]]
$L = [[:Joining_Type=Dual_Joining:][:Joining_Type=Left_Joining:]]

# Appendix A.2. ZERO WIDTH NON-JOINER

Title: A1. Allow ZWNJ in the following context: /$L $T* ZWNJ $T* $R/

$L $T* ; \u200C $T* $R ; next

# Variables for Indic

$Lt = [:General_Category=Letter:]
$V = [:Canonical_Combining_Class=Virama:]

# $deva = [:sc=deva:]
# $Ndeva = [^$deva]
# $beng = [:sc=beng:]
# $Nbeng = [^$beng]
# $guru = [:sc=guru:]
# $Nguru = [^$guru]
# $noVirama = [^ $deva $beng $guru]

# WARNING: There is a nasty bug in Java regex before 1.6; the work-around is to do the negations
# above, instead of in the regexes.

# To do the script test for both ZWNJ and ZWJ, use the following.
# The first line lists all the acceptable scripts

Title: ZWJ/ZWNJ apply to letter+virama
# WARNING: rule must come after Arabic!

# $noVirama ; [\u200C\u200D] ; fail

# The remainder makes sure that each script is paired
# Subsequent rules will make sure that there are two characters

# $Ndeva $deva; [\u200C\u200D] ; fail
# $Nbeng $beng; [\u200C\u200D] ; fail
# $Nguru $guru; [\u200C\u200D] ; fail

# Now we do the script-independent rules

Title: A2. Allow ZWNJ in the following context: /$Lt $V ZWNJ/

$Lt $V ; \u200C ; next2
\u200C ; fail

# Appendix A.3. ZERO WIDTH JOINER

Title: B. Allow ZWJ (U+200D) in the following context: /$Lt $V ZWJ/

$Lt $V ; \u200D ; next2
\u200D ; fail

# ==========================

Title: Appendix A.4. MIDDLE DOT

l ; \u00B7 l ; next
\u00B7 ; fail

Title: Appendix A.5. GREEK LOWER NUMERAL SIGN (KERAIA) - \u0375 - The script of the following character MUST be Greek.

$grek = [:script=greek:]
\u0375 ; $grek ; next
\u0375 ; fail

Title: Appendix A.6. HEBREW PUNCTUATION GERESH - \u05F3 - The script of the preceding character MUST be Hebrew.

$hebr = [:script=hebrew:]
$hebr ; \u05F3 ; next
\u05F3 ; fail

Title: Appendix A.7. HEBREW PUNCTUATION GERSHAYIM - \u05F4 - The script of the preceding character MUST be Hebrew.

$hebr ; \u05F4 ; next
\u05F4 ; fail

Title: Appendix A.12. KATAKANA MIDDLE DOT - \u30FB - Adjacent characters MUST be Hiragana, Katakana, or Han.
$han = [[:script=Hiragana:][:script=Katakana:][:script=Han:]]
$han ; \u30FB $han; next
\u30FB ; fail

Title: Appendix A.13. ARABIC-INDIC DIGITS - 0660..0669
# The rules in Tables are broken, since they simply forbid these digits entirely.
# Rewritten to exclude mixing with Western (ASCII) or Extended Arabic-Indic digits.

$WD = [0-9]
$AD = [\u0660-\u0669]
$EAD = [\u06F0-\u06F9]

[$WD $EAD].*$AD ; fail
$AD.*[$WD $EAD] ; fail

Title: Appendix A.14. EXTENDED ARABIC-INDIC DIGITS
# The rules in Tables are broken, since they simply forbid these digits entirely.
# Rewritten to exclude mixing with Western (ASCII) or Arabic-Indic digits.

$EAD .* [$WD $AD] ; fail
[$WD $AD] .* $EAD ; fail
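# For example (hand-worked, and assuming the label has already passed the earlier rules):
#   U+0661 U+0662    only Arabic-Indic digits: none of the four digit rules above matches
#   U+0661 "2"       $AD followed by $WD: matches  $AD.*[$WD $EAD] ; fail
#   "1" U+06F2       $WD followed by $EAD: matches  [$WD $AD] .* $EAD ; fail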

# ==========================

# BIDI Rules: 4.2.3.4. Labels Containing Characters Written Right to Left
# The details here are all from http://tools.ietf.org/html/draft-ietf-idnabis-bidi-03

# Note that $NSM != Non-spacing Marks in the general sense
# See http://unicode.org/cldr/utility/unicodeset.jsp?a=[:bc=nsm:]&b=[[:me:][:mn:]]

$NSM = [:bc=NSM:]
$ESON = [[:bc=ES:][:bc=ON:]]
$ENAN = [[:bc=EN:][:bc=AN:]]
$RALAN = [[:bc=R:][:bc=AL:][:bc=AN:]]
$BCL = [:bc=L:]
$BDisallowed = [^[:bc=L:][:bc=R:][:bc=AL:][:bc=AN:][:bc=EN:][:bc=ES:][:bc=BN:][:bc=ON:][:bc=NSM:]]

# Note:
# The only tables-valid [:bc=ES:] character is: -
# The only tables-valid [:bc=BN:] characters are: [\u200C \u200D]
# The only tables-valid [:bc=ON:] characters are: [・ · ʹ ͵ ʺ ˆ-ˏ ˬ ꜗ-ꜟ ꞈ ⸯ ꙿ]

Title: 1.  Only characters with the BIDI properties L, R, AL, AN, EN, ES, BN, ON and NSM are allowed.

# No rules really necessary, since anything else is excluded by Tables

$BDisallowed ; fail

Title: 2.  ES and ON are not allowed in the first position

^ ; $ESON .* $RALAN ; fail

Title: 3.  ES and ON, followed by zero or more NSM, is not allowed in the last position

$RALAN .* ; $ESON $NSM* $ ; fail

Title: 4.  If an R, AL or AN is present, no L may be present.

$RALAN .* $BCL ; fail
$BCL .* $RALAN ; fail
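# For example (hand-worked): in a label made of "a" (bc=L) followed by U+05D0 HEBREW LETTER ALEF (bc=R),
# the second rule matches, because a $BCL character is followed by a $RALAN character, so the label fails.
# A label of U+05D0 followed by U+05D1 HEBREW LETTER BET contains no bc=L character, so neither rule matches.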

Title: 5.  If an EN is present, no AN may be present, and vice versa.

# Overlaps with A.13/14 above, not necessary to restate

Title: 6.  The first character may not be an NSM.

^ ; $NSM ; fail

Title: 7.  The first character may not be an EN (European Number) or an AN (Arabic Number).

^ ; $ENAN .* $RALAN ; fail

# NOTE: all of the "not allowed in first position" rules could be combined together

# ===================================

# 4.2.4. Registration Validation Summary: if at least one non-ASCII then <= 59 bytes of Punycode
# Needs to be done outside of this table