# IDNA CONTEXT RULES (including BIDI)
# Mark Davis
# Provides a table-based mechanism for determining whether a label is a U-Label or not.

# For testing, this turns on a verbose mode that displays the resolved sets and rules.

VERBOSE:true

# If any of the following regex expressions is found in the label, then the label is not a valid U-Label.
# These rules provide a machine-readable way to test that. This is intended for a reference (test) version;
# implementations would typically use hand-coded versions that would be much more optimized.
# The context rules are derived from http://tools.ietf.org/html/draft-ietf-idnabis-tables-05#appendix-A
# However, they do contain quite a number of corrections and proposed changes.

# FILE FORMAT

# Everything at and after # is a comment, and ignored.
# Blank lines are ignored.
# Leading and trailing spaces are ignored.
# There are 3 kinds of lines: titles, rules and variable definitions

# A variable is defined with a line of the form $X = <unicodeSet>
# Each is a single unicodeSet (character range) according to http://www.unicode.org/reports/tr18/
# These variables are substituted in the rules before evaluation

# Rules have the following formats:
#   <before>; <at>; <result>
#   <before> ; <result>
# Key:
#   <before> and <at> are both regex expressions
#   <result> is one of "fail", "next", or "next2"
# Every other kind of line is an error

# A title is of the form "Title: ...". It is just informational, but allows the test to show why a character causes a failure.

# Function
# Logically, a label is processed by iterating through its character positions
# In each iteration, each rule is checked.
# If <before> and <at> both match, then the result is applied as follows:
#   fail: stop, the label is invalid
#   next: skip to the next rule that has a "next" result (skipping any "fail" or "next2" results)
#   next2: skip to the next rule that has a "next" or "next2" result (skipping any "fail" results)
# If the processing reaches the end of the string, then the label is valid.

# The regex expressions use Java / Perl syntax, with Unicode properties.
# If the regex engine does not support Unicode properties (the latest version!), then explicit ranges can be substituted.
# For example, for [:bc=nsm:], use the set on http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:bc=nsm:]
# Interior spaces are ignored, and may be used for readability.

# The expressions are limited to a basic format which should work in any regex engine (perhaps with some syntax tweaks).
# In particular, the <before> and <at> split is used to avoid lookbehind, which can vary in results depending on regex engines.
# <before> is matched before the current position. Logically, it is equivalent to matching label[0,n] against /.*<before>/
# <at> is matched at the current position. Logically, it is equivalent to matching label[n,end] against /<at>.*/

# ===================================
# Mapping would be done before this table, and only for Lookup
# ===================================

Title: 4.2.2. Rejection of Characters that are not Permitted: fail if DISALLOWED or UNASSIGNED

# Tables

# 2.1. LetterDigits (A)

$LetterDigits = [[:Ll:] [:Lu:] [:Lo:] [:Nd:] [:Lm:] [:Mn:] [:Mc:]]

# 2.2. Unstable (B)

$Unstable = [:^nfkc_casefolded:]

# 2.3. IgnorableProperties (C) // default ignorable or whitespace or nonchar

$IgnorableProperties = [[:di:] [:WSpace:] [:NChar:]]

# 2.4. IgnorableBlocks (D)

$IgnorableBlocks = [[:block=Combining_Diacritical_Marks_For_Symbols:] [:block=Musical_Symbols:] [:block=Ancient_Greek_Musical_Notation:]]

# 2.5. LDH (E)

$LDH = [\u002D\u0030-\u0039\u0061-\u007A]

# 2.6. Exceptions (F)
# Note: added Tatweel to these, since that seems to be the current consensus.

$ExceptionDisallowed = [\u302E \u302F \u0640]
$ExceptionPvalid = [\u00DF \u03C2 \u06FD \u06FE \u0F0B \u3007]
$ExceptionContexto = [\u002D \u00B7 \u02B9 \u0375 \u0483 \u05F3 \u05F4 \u0660 \u0661 \u0662 \u0663 \u0664 \u0665 \u0666 \u0667 \u0668 \u0669 \u06F0 \u06F1 \u06F2 \u06F3 \u06F4 \u06F5 \u06F6 \u06F7 \u06F8 \u06F9 \u3005 \u303B \u30FB]

# 2.7. BackwardCompatible (G)

$BackwardCompatibleDisallowed = []
$BackwardCompatiblePvalid = []
$BackwardCompatibleContexto = []

# 2.8. JoinControl (H)

$JoinControl = [:JoinControl:]

# 2.9. OldHangulJamo (I)

$OldHangulJamo = [[:HST=L:] [:HST=V:] [:HST=T:]]

# 2.10. Unassigned (J)

$Unassigned = [[:unassigned:] - [:nchar:]]

# If .cp. .in. Exceptions Then Exceptions(cp);
# Else If .cp. .in. BackwardCompatible Then BackwardCompatible(cp);
# Else If .cp. .in. Unassigned Then UNASSIGNED;
# Else If .cp. .in. LDH Then PVALID;
# Else If .cp. .in. JoinControl Then CONTEXTJ;
# Else If .cp. .in. Unstable Then DISALLOWED;
# Else If .cp. .in. IgnorableProperties Then DISALLOWED;
# Else If .cp. .in. IgnorableBlocks Then DISALLOWED;
# Else If .cp. .in. OldHangulJamo Then DISALLOWED;
# Else If .cp. .in. LetterDigits Then PVALID;
# Else DISALLOWED;

# We compute when invalid.
# There is no functional difference between DISALLOWED and UNASSIGNED: we can just call them invalid.

# The rules in Tables obfuscate the true situation, which is that:
#   A. some characters are always valid
#   B. otherwise LetterDigits are valid - with some subtractions
#   C. otherwise everything is invalid

# There is no functional difference between CONTEXTJ and CONTEXTO, and none at this point from PVALID either
# We record all of them for later

$Context = [$ExceptionContexto $BackwardCompatibleContexto $JoinControl]

$ValidAlways = [$ExceptionPvalid $BackwardCompatiblePvalid $LDH]

$InvalidLetterDigits = [$ExceptionDisallowed $BackwardCompatibleDisallowed $Unassigned $Unstable $IgnorableProperties $IgnorableBlocks $OldHangulJamo]

$Valid = [$ValidAlways $Context [$LetterDigits - $InvalidLetterDigits]]
$Valid2 = [[:nfkc_casefolded:]-[:c:]-[:z:]-[:s:]-[:p:]-[:nl:]-[:no:]-[:me:]-[:di:]-[:HST=L:]-[:HST=V:]-[:HST=T:]-[:block=Combining_Diacritical_Marks_For_Symbols:]-[:block=Musical_Symbols:]-[:block=Ancient_Greek_Musical_Notation:]-[\u0640\u07FA\u302E\u302F\u3031-\u3035\u303B][:JoinControl:][\u00DF\u03C2\u06FD\u06FE\u0F0B\u3007][\u002D\u00B7\u0375\u05F3\u05F4\u30FB]]

$Invalid = [^ $Valid]

# ===================================

# At this point, we have the final list of everything that is invalid, so there is a single test

$Invalid ; fail

# ===================================

Title: 4.2.3.1. Rejection of Hyphen Sequences in U-labels // just xx--

^.. ; -- ; fail

# ===================================

Title: 4.2.3.2. Leading Combining Marks

$M = [:M:]

^ ; $M ; fail

# ===================================

# CONTEXT: 4.2.3.3. Contextual Rules
# The details here are all from Tables

Title: Appendix A.1. HYPHEN-MINUS - Can't be at start or end; that is, ok only if medial

. ; -. ; next
- ; fail

# See comment in http://www.alvestrand.no/pipermail/idna-update/2008-November/003021.html

# ==========================

# ZWNJ and ZWJ are the trickiest section
# Tables is all messed up. Following UAX 31 instead, from which these are derived
# http://www.unicode.org/reports/tr31/#Layout_and_Format_Control_Characters

# There are two different kinds of rules that have to be combined.
# We don't try a script test for Arabic, because it is not needed (and is complicated)

# Variables for Arabic

$T = [:Joining_Type=Transparent:]
$R = [[:Joining_Type=Dual_Joining:][:Joining_Type=Right_Joining:]]
$L = [[:Joining_Type=Dual_Joining:][:Joining_Type=Left_Joining:]]

# Appendix A.2. ZERO WIDTH NON-JOINER

Title: A1. Allow ZWNJ in the following context: /$L $T* ZWNJ $T* $R/

$L $T* ; \u200C $T* $R ; next

# Variables for Indic

$Lt = [:General_Category=Letter:]
$V = [:Canonical_Combining_Class=Virama:]

# $deva = [:sc=deva:]
# $Ndeva = [^$deva]
# $beng = [:sc=beng:]
# $Nbeng = [^$beng]
# $guru = [:sc=guru:]
# $Nguru = [^$guru]
# $noVirama = [^ $deva $beng $guru]

# WARNING: There is a nasty bug in Java regex before 1.6; the work-around is to do the negations
# above, instead of in the regexes.

# To do the script test for both ZWNJ and ZWJ, use the following.
# The first line lists all the acceptable scripts

Title: ZWJ/ZWNJ apply to letter+virama
# WARNING: rule must come after Arabic!

# $noVirama ; [\u200C\u200D] ; fail

# The remainder makes sure that each script is paired
# Subsequent rules will make sure that there are two characters

# $Ndeva $deva; [\u200C\u200D] ; fail
# $Nbeng $beng; [\u200C\u200D] ; fail
# $Nguru $guru; [\u200C\u200D] ; fail

# Now we do the script-independent rules

Title: A2. Allow ZWNJ in the following context: /$Lt $V ZWNJ/

$Lt $V ; \u200C ; next2
\u200C ; fail

# Appendix A.3. ZERO WIDTH JOINER

Title: B. Allow ZWJ (U+200D) in the following context: /$Lt $V ZWJ/

$Lt $V ; \u200D ; next2
\u200D ; fail

# ==========================

Title: Appendix A.4. MIDDLE DOT

l ; \u00B7 l ; next
\u00B7 ; fail

# Appendix A.5. GREEK LOWER NUMERAL SIGN (KERAIA)
Title: The script of the following character MUST be Greek.
$grek = [:script=greek:]
\u0375 ; $grek ; next
\u0375 ; fail

Title: Appendix A.6. HEBREW PUNCTUATION GERESH - \u05F3 - The script of the preceding character MUST be Hebrew.

$hebr = [:script=hebrew:]
$hebr ; \u05F3 ; next
\u05F3 ; fail

Title: Appendix A.7. HEBREW PUNCTUATION GERSHAYIM - \u05F4 - The script of the preceding character MUST be Hebrew.

$hebr ; \u05F4 ; next
\u05F4 ; fail

Title: Appendix A.12. KATAKANA MIDDLE DOT - \u30FB - Adjacent characters MUST be Hiragana, Katakana, or Han.
$han = [[:script=Hiragana:][:script=Katakana:][:script=Han:]]
$han ; \u30FB $han; next
\u30FB ; fail

Title: Appendix A.13. ARABIC-INDIC DIGITS - 0660..0669
# Rules broken, since they simply forbid them entirely
# Rewrite to exclude mixing with western (ASCII) or extended Arabic digits

$WD = [0-9]
$AD = [\u0660-\u0669]
$EAD = [\u06F0-\u06F9]

[$WD $EAD].*$AD ; fail
$AD.*[$WD $EAD] ; fail

Title: Appendix A.14. EXTENDED ARABIC-INDIC DIGITS
# Rules broken, since they simply forbid them entirely
# Rewrite to exclude mixing with western (ASCII) or Arabic digits

$EAD .* [$WD $AD] ; fail
[$WD $AD] .* $EAD ; fail

# ==========================

# BIDI Rules: 4.2.3.4. Labels Containing Characters Written Right to Left
# The details here are all from http://tools.ietf.org/html/draft-ietf-idnabis-bidi-03

# Note that $NSM != Non-spacing Marks in the general sense
# See http://unicode.org/cldr/utility/unicodeset.jsp?a=[:bc=nsm:]&b=[[:me:][:mn:]]

$NSM = [:bc=NSM:]
$ESON = [[:bc=ES:][:bc=ON:]]
$ENAN = [[:bc=EN:][:bc=AN:]]
$RALAN = [[:bc=R:][:bc=AL:][:bc=AN:]]
$BCL = [:bc=L:]
$BDisallowed = [^[:bc=L:][:bc=R:][:bc=AL:][:bc=AN:][:bc=EN:][:bc=ES:][:bc=BN:][:bc=ON:][:bc=NSM:]]

# Note:
# The only tables-valid [:bc=ES:] character is: -
# The only tables-valid [:bc=BN:] characters are: [\u200C \u200D]
# The only tables-valid [:bc=ON:] characters are: [・ · ʹ ͵ ʺ ˆ-ˏ ˬ ꜗ-ꜟ ꞈ ⸯ ꙿ]

Title: 1. Only characters with the BIDI properties L, R, AL, AN, EN, ES, BN, ON and NSM are allowed.

# No rules really necessary, since anything else is excluded by Tables

$BDisallowed ; fail

Title: 2. ES and ON are not allowed in the first position

^ ; $ESON .* $RALAN ; fail

Title: 3. ES and ON, followed by zero or more NSM, is not allowed in the last position

$RALAN .* ; $ESON $NSM* $ ; fail

Title: 4. If an R, AL or AN is present, no L may be present.

$RALAN .* $BCL ; fail
$BCL .* $RALAN ; fail

Title: 5. If an EN is present, no AN may be present, and vice versa.

# Overlaps with A.13/14 above, not necessary to restate

Title: 6. The first character may not be an NSM.

^ ; $NSM ; fail

Title: 7. The first character may not be an EN (European Number) or an AN (Arabic Number).

^ ; $ENAN .* $RALAN ; fail

# NOTE: all of the "not allowed in first position" rules could be combined together

# ===================================

# 4.2.4. Registration Validation Summary: if at least one non-ASCII then <= 59 bytes of PunyCode
# Needs to be done outside of this table