   WHATWG

   HTML 5

   Draft Recommendation, 7 February 2009
8
12    8.2.4 Tokenization
13
14   Implementations must act as if they used the following state machine to
15   tokenise HTML. The state machine must start in the data state. Most
   states consume a single character, which may have various side-effects,
   and then either switch the state machine to a new state to reconsume the
   same character, or switch it to a new state (to consume the next
   character), or repeat the same state (to consume the next character).
20   Some states have more complicated behavior and can consume several
21   characters before switching to another state.
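
   As an illustration only, the three kinds of transition described above
   (reconsume in a new state, move to a new state, repeat the same state)
   can be sketched in Python. The EOF sentinel, the handler names, and the
   reduced three-state table are assumptions made for this sketch; they
   are toy stand-ins, not the states defined in the subsections below.

```python
EOF = None  # hypothetical sentinel standing in for the end-of-file "character"


def data_state(ch):
    # Toy data state: only '<' and EOF are treated specially here.
    if ch is EOF:
        return "done", False
    if ch == "<":
        return "tag open", False
    return "data", False


def tag_open_state(ch):
    # Toy tag open state: an ASCII letter starts a tag; anything else is
    # handed back to the data state (reconsume=True).
    if ch is not EOF and ch.isascii() and ch.isalpha():
        return "tag name", False
    return "data", True


def tag_name_state(ch):
    # Toy tag name state: repeat the same state until '>' or EOF.
    if ch is EOF:
        return "data", True
    if ch == ">":
        return "data", False
    return "tag name", False


HANDLERS = {
    "data": data_state,
    "tag open": tag_open_state,
    "tag name": tag_name_state,
}


def run(text, start="data"):
    """Return the (state, character) pairs visited while scanning text."""
    state, pos, trace = start, 0, []
    while state != "done":
        ch = text[pos] if pos < len(text) else EOF
        trace.append((state, ch))
        state, reconsume = HANDLERS[state](ch)
        if not reconsume and ch is not EOF:
            pos += 1  # advance only when the character was really consumed
    return trace
```

   In this sketch each handler returns the next state together with a
   reconsume flag, so the driver loop is the only place that advances the
   input position.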
22
23   The exact behavior of certain states depends on a content model flag
24   that is set after certain tokens are emitted. The flag has several
25   states: PCDATA, RCDATA, CDATA, and PLAINTEXT. Initially it must be in
26   the PCDATA state. In the RCDATA and CDATA states, a further escape flag
27   is used to control the behavior of the tokeniser. It is either true or
   false, and initially must be set to the false state. The insertion mode
   and the stack of open elements also affect tokenization.
30
31   The output of the tokenization step is a series of zero or more of the
32   following tokens: DOCTYPE, start tag, end tag, comment, character,
33   end-of-file. DOCTYPE tokens have a name, a public identifier, a system
34   identifier, and a force-quirks flag. When a DOCTYPE token is created,
35   its name, public identifier, and system identifier must be marked as
36   missing (which is a distinct state from the empty string), and the
37   force-quirks flag must be set to off (its other state is on). Start and
38   end tag tokens have a tag name, a self-closing flag, and a list of
39   attributes, each of which has a name and a value. When a start or end
40   tag token is created, its self-closing flag must be unset (its other
41   state is that it be set), and its attributes list must be empty.
42   Comment and character tokens have data.
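
   One possible way to model these tokens, shown here only as a sketch, is
   with Python dataclasses. The class and field names are assumptions of
   this example. None is used for "missing" so that it stays distinct from
   the empty string, and the attribute list is kept as a list of
   (name, value) pairs.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Doctype:
    # None models "missing", which is distinct from the empty string "".
    name: Optional[str] = None
    public_id: Optional[str] = None
    system_id: Optional[str] = None
    force_quirks: bool = False  # False = off, True = on


@dataclass
class Tag:
    kind: str  # "start" or "end"
    name: str = ""
    self_closing: bool = False  # unset when the token is created
    # default_factory gives every token its own, initially empty, list.
    attributes: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class Comment:
    data: str = ""


@dataclass
class Character:
    data: str
```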
43
44   When a token is emitted, it must immediately be handled by the tree
45   construction stage. The tree construction stage can affect the state of
46   the content model flag, and can insert additional characters into the
47   stream. (For example, the script element can result in scripts
48   executing and using the dynamic markup insertion APIs to insert
49   characters into the stream being tokenised.)
50
51   When a start tag token is emitted with its self-closing flag set, if
52   the flag is not acknowledged when it is processed by the tree
53   construction stage, that is a parse error.
54
55   When an end tag token is emitted, the content model flag must be
56   switched to the PCDATA state.
57
58   When an end tag token is emitted with attributes, that is a parse
59   error.
60
61   When an end tag token is emitted with its self-closing flag set, that
62   is a parse error.
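
   The three rules for end tag tokens can be summarised in a short sketch.
   The function name and the error strings below are hypothetical; the
   only behavior taken from the text above is the switch to PCDATA and the
   two parse error conditions.

```python
def emit_end_tag(attributes, self_closing):
    """Return (content_model_flag, parse_errors) after emitting an end tag.

    attributes is the token's attribute list; self_closing is its flag.
    """
    errors = []
    if attributes:
        errors.append("end tag with attributes")
    if self_closing:
        errors.append("end tag with self-closing flag set")
    # Emitting any end tag token switches the content model flag to PCDATA.
    return "PCDATA", errors
```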
63
64   Before each step of the tokeniser, the user agent must first check the
65   parser pause flag. If it is true, then the tokeniser must abort the
66   processing of any nested invocations of the tokeniser, yielding control
   back to the caller. If it is false, then the user agent may check
   whether either one of the scripts in the list of scripts that will
   execute as soon as possible, or the first script in the list of scripts
   that will execute asynchronously, has completed loading. If one has,
   then it must be executed and removed from its list.
72
73   The tokeniser state machine consists of the states defined in the
74   following subsections.
75
76      8.2.4.1 Data state
77
78   Consume the next input character:
79
80   U+0026 AMPERSAND (&)
81          When the content model flag is set to one of the PCDATA or
82          RCDATA states and the escape flag is false: switch to the
83          character reference data state.
84          Otherwise: treat it as per the "anything else" entry below.
85
86   U+002D HYPHEN-MINUS (-)
87          If the content model flag is set to either the RCDATA state or
88          the CDATA state, and the escape flag is false, and there are at
89          least three characters before this one in the input stream, and
90          the last four characters in the input stream, including this
91          one, are U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D
92          HYPHEN-MINUS, and U+002D HYPHEN-MINUS ("<!--"), then set the
93          escape flag to true.
94
95          In any case, emit the input character as a character token. Stay
96          in the data state.
97
98   U+003C LESS-THAN SIGN (<)
99          When the content model flag is set to the PCDATA state: switch
100          to the tag open state.
101          When the content model flag is set to either the RCDATA state or
102          the CDATA state, and the escape flag is false: switch to the tag
103          open state.
104          Otherwise: treat it as per the "anything else" entry below.
105
106   U+003E GREATER-THAN SIGN (>)
107          If the content model flag is set to either the RCDATA state or
108          the CDATA state, and the escape flag is true, and the last three
109          characters in the input stream including this one are U+002D
110          HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN
111          ("-->"), set the escape flag to false.
112
113          In any case, emit the input character as a character token. Stay
114          in the data state.
115
116   EOF
117          Emit an end-of-file token.
118
119   Anything else
120          Emit the input character as a character token. Stay in the data
121          state.
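
   The interaction between the escape flag and the "<!--" and "-->"
   sequences is easiest to see on a concrete input. The following is a
   simplified sketch, not a full data state: it scans a string, tracks the
   escape flag as described above, and reports the indexes of the "<"
   characters that would switch to the tag open state. The function name
   is hypothetical, and the sketch assumes that every character is seen by
   the data state, ignoring whatever the later states would consume.

```python
def tag_open_positions(text, content_model="RCDATA"):
    """Return indexes of '<' characters that would switch to tag open."""
    escape = False
    hits = []
    raw_text = content_model in ("RCDATA", "CDATA")
    for i, ch in enumerate(text):
        if ch == "-":
            # The last four characters, including this one, must be "<!--".
            if raw_text and not escape and text[max(0, i - 3):i + 1] == "<!--":
                escape = True
        elif ch == "<":
            if content_model == "PCDATA" or (raw_text and not escape):
                hits.append(i)
        elif ch == ">":
            # The last three characters, including this one, must be "-->".
            if raw_text and escape and text[max(0, i - 2):i + 1] == "-->":
                escape = False
    return hits
```

   With the content model flag in the RCDATA state, the "</title>" inside
   the escaped region of "<!-- </title> --> </title>" is treated as
   character data, and only the "<" characters at indexes 0 and 18 are
   candidates for the tag open state.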
122
123      8.2.4.2 Character reference data state
124
125   (This cannot happen if the content model flag is set to the CDATA
126   state.)
127
128   Attempt to consume a character reference, with no additional allowed
129   character.
130
131   If nothing is returned, emit a U+0026 AMPERSAND character token.
132
133   Otherwise, emit the character token that was returned.
134
135   Finally, switch to the data state.
136
137      8.2.4.3 Tag open state
138
139   The behavior of this state depends on the content model flag.
140
141   If the content model flag is set to the RCDATA or CDATA states
142          Consume the next input character. If it is a U+002F SOLIDUS (/)
143          character, switch to the close tag open state. Otherwise, emit a
144          U+003C LESS-THAN SIGN character token and reconsume the current
145          input character in the data state.
146
147   If the content model flag is set to the PCDATA state
148          Consume the next input character:
149
150        U+0021 EXCLAMATION MARK (!)
151                Switch to the markup declaration open state.
152
153        U+002F SOLIDUS (/)
154                Switch to the close tag open state.
155
156        U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL
157                LETTER Z
158                Create a new start tag token, set its tag name to the
159                lowercase version of the input character (add 0x0020 to
160                the character's code point), then switch to the tag name
161                state. (Don't emit the token yet; further details will be
162                filled in before it is emitted.)
163
164        U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
165                Create a new start tag token, set its tag name to the
166                input character, then switch to the tag name state. (Don't
167                emit the token yet; further details will be filled in
168                before it is emitted.)
169
170        U+003E GREATER-THAN SIGN (>)
171                Parse error. Emit a U+003C LESS-THAN SIGN character token
172                and a U+003E GREATER-THAN SIGN character token. Switch to
173                the data state.
174
175        U+003F QUESTION MARK (?)
176                Parse error. Switch to the bogus comment state.
177
178        Anything else
179                Parse error. Emit a U+003C LESS-THAN SIGN character token
180                and reconsume the current input character in the data
181                state.
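
   For the PCDATA case, the dispatch on the consumed character can be
   sketched as a single function. The name tag_open_pcdata and the shape
   of the return value are assumptions of this example. The sketch does
   not model the character tokens that are emitted, nor the difference
   between switching to the data state and reconsuming in it.

```python
def tag_open_pcdata(ch):
    """Return (next_state, initial_tag_name, parse_error); ch None means EOF."""
    if ch == "!":
        return ("markup declaration open", None, False)
    if ch == "/":
        return ("close tag open", None, False)
    if ch is not None and "A" <= ch <= "Z":
        # Lowercase by adding 0x0020 to the character's code point.
        return ("tag name", chr(ord(ch) + 0x20), False)
    if ch is not None and "a" <= ch <= "z":
        return ("tag name", ch, False)
    if ch == "?":
        return ("bogus comment", None, True)
    # '>' and anything else: parse error, continue in the data state.
    return ("data", None, True)
```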
182
183      8.2.4.4 Close tag open state
184
185   If the content model flag is set to the RCDATA or CDATA states but no
186   start tag token has ever been emitted by this instance of the tokeniser
187   (fragment case), or, if the content model flag is set to the RCDATA or
188   CDATA states and the next few characters do not match the tag name of
189   the last start tag token emitted (compared in an ASCII case-insensitive
190   manner), or if they do but they are not immediately followed by one of
191   the following characters:
192     * U+0009 CHARACTER TABULATION
193     * U+000A LINE FEED (LF)
194     * U+000C FORM FEED (FF)
195     * U+0020 SPACE
196     * U+003E GREATER-THAN SIGN (>)
197     * U+002F SOLIDUS (/)
198     * EOF
199
200   ...then emit a U+003C LESS-THAN SIGN character token, a U+002F SOLIDUS
201   character token, and switch to the data state to process the next input
202   character.
203
204   Otherwise, if the content model flag is set to the PCDATA state, or if
205   the next few characters do match that tag name, consume the next input
206   character:
207
208   U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
209          Create a new end tag token, set its tag name to the lowercase
210          version of the input character (add 0x0020 to the character's
211          code point), then switch to the tag name state. (Don't emit the
212          token yet; further details will be filled in before it is
213          emitted.)
214
215   U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
216          Create a new end tag token, set its tag name to the input
217          character, then switch to the tag name state. (Don't emit the
218          token yet; further details will be filled in before it is
219          emitted.)
220
221   U+003E GREATER-THAN SIGN (>)
222          Parse error. Switch to the data state.
223
224   EOF
225          Parse error. Emit a U+003C LESS-THAN SIGN character token and a
226          U+002F SOLIDUS character token. Reconsume the EOF character in
227          the data state.
228
229   Anything else
230          Parse error. Switch to the bogus comment state.
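
   The check performed in the RCDATA and CDATA cases can be sketched as a
   predicate over the input that follows "</". The function names are
   hypothetical. A last start tag name of None stands for the fragment
   case in which no start tag token has been emitted, and reaching the end
   of the string stands for EOF.

```python
TERMINATORS = {"\t", "\n", "\f", " ", ">", "/"}


def ascii_lower(s):
    """Lowercase only A-Z, leaving every other character untouched."""
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s)


def is_matching_end_tag(rest, last_start_tag_name):
    """rest is the input after "</"; return True if an end tag may start."""
    if last_start_tag_name is None:
        return False  # fragment case: no start tag token emitted yet
    n = len(last_start_tag_name)
    if ascii_lower(rest[:n]) != ascii_lower(last_start_tag_name):
        return False
    # The name must be followed by a terminator character or by EOF.
    return len(rest) == n or rest[n] in TERMINATORS
```

   When the predicate is false, the tokeniser would emit "<" and "/" as
   character tokens and continue in the data state.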
231
232      8.2.4.5 Tag name state
233
234   Consume the next input character:
235
236   U+0009 CHARACTER TABULATION
237   U+000A LINE FEED (LF)
238   U+000C FORM FEED (FF)
239   U+0020 SPACE
240          Switch to the before attribute name state.
241
242   U+002F SOLIDUS (/)
243          Switch to the self-closing start tag state.
244
245   U+003E GREATER-THAN SIGN (>)
246          Emit the current tag token. Switch to the data state.
247
248   U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
249          Append the lowercase version of the current input character (add
250          0x0020 to the character's code point) to the current tag token's
251          tag name. Stay in the tag name state.
252
253   EOF
254          Parse error. Emit the current tag token. Reconsume the EOF
255          character in the data state.
256
257   Anything else
258          Append the current input character to the current tag token's
259          tag name. Stay in the tag name state.
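
   A compact sketch of the tag name state follows. It assumes the first
   character of the name has already been handled by the tag open state.
   The function name and the return tuple are assumptions of this example,
   and emitting the token is only indicated in comments.

```python
WHITESPACE = "\t\n\f "


def read_tag_name(text, pos, first):
    """Continue a tag name that starts with first; pos is the next index.

    Returns (name, next_pos, next_state, parse_error).
    """
    name = [first]
    while pos < len(text):
        ch = text[pos]
        pos += 1
        if ch in WHITESPACE:
            return "".join(name), pos, "before attribute name", False
        if ch == "/":
            return "".join(name), pos, "self-closing start tag", False
        if ch == ">":
            # The current tag token would be emitted here.
            return "".join(name), pos, "data", False
        if "A" <= ch <= "Z":
            ch = chr(ord(ch) + 0x20)  # add 0x0020 to lowercase
        name.append(ch)
    # EOF: parse error; the token is emitted and EOF is reconsumed.
    return "".join(name), pos, "data", True
```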
260
261      8.2.4.6 Before attribute name state
262
263   Consume the next input character:
264
265   U+0009 CHARACTER TABULATION
266   U+000A LINE FEED (LF)
267   U+000C FORM FEED (FF)
268   U+0020 SPACE
269          Stay in the before attribute name state.
270
271   U+002F SOLIDUS (/)
272          Switch to the self-closing start tag state.
273
274   U+003E GREATER-THAN SIGN (>)
275          Emit the current tag token. Switch to the data state.
276
277   U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
278          Start a new attribute in the current tag token. Set that
279          attribute's name to the lowercase version of the current input
280          character (add 0x0020 to the character's code point), and its
281          value to the empty string. Switch to the attribute name state.
282
283   U+0022 QUOTATION MARK (")
284   U+0027 APOSTROPHE (')
285   U+003D EQUALS SIGN (=)
286          Parse error. Treat it as per the "anything else" entry below.
287
288   EOF
289          Parse error. Emit the current tag token. Reconsume the EOF
290          character in the data state.
291
292   Anything else
293          Start a new attribute in the current tag token. Set that
294          attribute's name to the current input character, and its value
295          to the empty string. Switch to the attribute name state.
296
297      8.2.4.7 Attribute name state
298
299   Consume the next input character:
300
301   U+0009 CHARACTER TABULATION
302   U+000A LINE FEED (LF)
303   U+000C FORM FEED (FF)
304   U+0020 SPACE
305          Switch to the after attribute name state.
306
307   U+002F SOLIDUS (/)
308          Switch to the self-closing start tag state.
309
310   U+003D EQUALS SIGN (=)
311          Switch to the before attribute value state.
312
313   U+003E GREATER-THAN SIGN (>)
314          Emit the current tag token. Switch to the data state.
315
316   U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
317          Append the lowercase version of the current input character (add
318          0x0020 to the character's code point) to the current attribute's
319          name. Stay in the attribute name state.
320
321   U+0022 QUOTATION MARK (")
322   U+0027 APOSTROPHE (')
323          Parse error. Treat it as per the "anything else" entry below.
324
325   EOF
326          Parse error. Emit the current tag token. Reconsume the EOF
327          character in the data state.
328
329   Anything else
330          Append the current input character to the current attribute's
331          name. Stay in the attribute name state.
332
333   When the user agent leaves the attribute name state (and before
334   emitting the tag token, if appropriate), the complete attribute's name
335   must be compared to the other attributes on the same token; if there is
336   already an attribute on the token with the exact same name, then this
337   is a parse error and the new attribute must be dropped, along with the
338   value that gets associated with it (if any).
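
   The duplicate check can be sketched as follows. For brevity this
   example adds the attribute once its value is known, whereas the text
   above performs the comparison on leaving the attribute name state; the
   observable result (the first attribute wins, the later one is dropped
   with its value) is the same. The function name is hypothetical.

```python
def add_attribute(attributes, name, value):
    """Append (name, value) unless name is already present.

    Returns True if a parse error occurred (duplicate attribute name).
    """
    if any(existing == name for existing, _ in attributes):
        return True  # parse error: drop the new attribute and its value
    attributes.append((name, value))
    return False
```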
339
340      8.2.4.8 After attribute name state
341
342   Consume the next input character:
343
344   U+0009 CHARACTER TABULATION
345   U+000A LINE FEED (LF)
346   U+000C FORM FEED (FF)
347   U+0020 SPACE
348          Stay in the after attribute name state.
349
350   U+002F SOLIDUS (/)
351          Switch to the self-closing start tag state.
352
353   U+003D EQUALS SIGN (=)
354          Switch to the before attribute value state.
355
356   U+003E GREATER-THAN SIGN (>)
357          Emit the current tag token. Switch to the data state.
358
359   U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
360          Start a new attribute in the current tag token. Set that
361          attribute's name to the lowercase version of the current input
362          character (add 0x0020 to the character's code point), and its
363          value to the empty string. Switch to the attribute name state.
364
365   U+0022 QUOTATION MARK (")
366   U+0027 APOSTROPHE (')
367          Parse error. Treat it as per the "anything else" entry below.
368
369   EOF
370          Parse error. Emit the current tag token. Reconsume the EOF
371          character in the data state.
372
373   Anything else
374          Start a new attribute in the current tag token. Set that
375          attribute's name to the current input character, and its value
376          to the empty string. Switch to the attribute name state.
377
378      8.2.4.9 Before attribute value state
379
380   Consume the next input character:
381
382   U+0009 CHARACTER TABULATION
383   U+000A LINE FEED (LF)
384   U+000C FORM FEED (FF)
385   U+0020 SPACE
386          Stay in the before attribute value state.
387
388   U+0022 QUOTATION MARK (")
389          Switch to the attribute value (double-quoted) state.
390
391   U+0026 AMPERSAND (&)
392          Switch to the attribute value (unquoted) state and reconsume
393          this input character.
394
395   U+0027 APOSTROPHE (')
396          Switch to the attribute value (single-quoted) state.
397
398   U+003E GREATER-THAN SIGN (>)
399          Parse error. Emit the current tag token. Switch to the data
400          state.
401
402   U+003D EQUALS SIGN (=)
403          Parse error. Treat it as per the "anything else" entry below.
404
405   EOF
          Parse error. Emit the current tag token. Reconsume the EOF
          character in the data state.
408
409   Anything else
410          Append the current input character to the current attribute's
411          value. Switch to the attribute value (unquoted) state.
412
413      8.2.4.10 Attribute value (double-quoted) state
414
415   Consume the next input character:
416
417   U+0022 QUOTATION MARK (")
418          Switch to the after attribute value (quoted) state.
419
420   U+0026 AMPERSAND (&)
421          Switch to the character reference in attribute value state, with
422          the additional allowed character being U+0022 QUOTATION MARK
423          (").
424
425   EOF
          Parse error. Emit the current tag token. Reconsume the EOF
          character in the data state.
428
429   Anything else
430          Append the current input character to the current attribute's
431          value. Stay in the attribute value (double-quoted) state.
432
433      8.2.4.11 Attribute value (single-quoted) state
434
435   Consume the next input character:
436
437   U+0027 APOSTROPHE (')
438          Switch to the after attribute value (quoted) state.
439
440   U+0026 AMPERSAND (&)
441          Switch to the character reference in attribute value state, with
442          the additional allowed character being U+0027 APOSTROPHE (').
443
444   EOF
          Parse error. Emit the current tag token. Reconsume the EOF
          character in the data state.
447
448   Anything else
449          Append the current input character to the current attribute's
450          value. Stay in the attribute value (single-quoted) state.
451
452      8.2.4.12 Attribute value (unquoted) state
453
454   Consume the next input character:
455
456   U+0009 CHARACTER TABULATION
457   U+000A LINE FEED (LF)
458   U+000C FORM FEED (FF)
459   U+0020 SPACE
460          Switch to the before attribute name state.
461
462   U+0026 AMPERSAND (&)
463          Switch to the character reference in attribute value state, with
464          no additional allowed character.
465
466   U+003E GREATER-THAN SIGN (>)
467          Emit the current tag token. Switch to the data state.
468
469   U+0022 QUOTATION MARK (")
470   U+0027 APOSTROPHE (')
471   U+003D EQUALS SIGN (=)
472          Parse error. Treat it as per the "anything else" entry below.
473
474   EOF
          Parse error. Emit the current tag token. Reconsume the EOF
          character in the data state.
477
478   Anything else
479          Append the current input character to the current attribute's
480          value. Stay in the attribute value (unquoted) state.
481
482      8.2.4.13 Character reference in attribute value state
483
484   Attempt to consume a character reference.
485
486   If nothing is returned, append a U+0026 AMPERSAND character to the
487   current attribute's value.
488
489   Otherwise, append the returned character token to the current
490   attribute's value.
491
   Finally, switch back to the attribute value state that you were in when
   you were switched into this state.
494
495      8.2.4.14 After attribute value (quoted) state
496
497   Consume the next input character:
498
499   U+0009 CHARACTER TABULATION
500   U+000A LINE FEED (LF)
501   U+000C FORM FEED (FF)
502   U+0020 SPACE
503          Switch to the before attribute name state.
504
505   U+002F SOLIDUS (/)
506          Switch to the self-closing start tag state.
507
508   U+003E GREATER-THAN SIGN (>)
509          Emit the current tag token. Switch to the data state.
510
511   EOF
512          Parse error. Emit the current tag token. Reconsume the EOF
513          character in the data state.
514
515   Anything else
516          Parse error. Reconsume the character in the before attribute
517          name state.
518
519      8.2.4.15 Self-closing start tag state
520
521   Consume the next input character:
522
523   U+003E GREATER-THAN SIGN (>)
524          Set the self-closing flag of the current tag token. Emit the
525          current tag token. Switch to the data state.
526
527   EOF
528          Parse error. Emit the current tag token. Reconsume the EOF
529          character in the data state.
530
531   Anything else
532          Parse error. Reconsume the character in the before attribute
533          name state.
534
535      8.2.4.16 Bogus comment state
536
537   (This can only happen if the content model flag is set to the PCDATA
538   state.)
539
540   Consume every character up to and including the first U+003E
541   GREATER-THAN SIGN character (>) or the end of the file (EOF), whichever
542   comes first. Emit a comment token whose data is the concatenation of
543   all the characters starting from and including the character that
544   caused the state machine to switch into the bogus comment state, up to
545   and including the character immediately before the last consumed
546   character (i.e. up to the character just before the U+003E or EOF
547   character). (If the comment was started by the end of the file (EOF),
548   the token is empty.)
549
550   Switch to the data state.
551
552   If the end of the file was reached, reconsume the EOF character.
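
   The bogus comment state amounts to a scan for the first ">". A minimal
   sketch, with a hypothetical function name, where start is the index of
   the character that caused the switch into this state:

```python
def bogus_comment(text, start):
    """Return (comment_data, next_pos) for a bogus comment.

    start is the index of the character that caused the switch into the
    bogus comment state; it may equal len(text) if that was EOF.
    """
    end = text.find(">", start)
    if end == -1:
        # EOF reached first: everything left is comment data.
        return text[start:], len(text)
    # The '>' is consumed but is not part of the comment data.
    return text[start:end], end + 1
```

   For the input "<?php echo 1 ?>x" the switch is caused by the "?" at
   index 1, so the comment data is "?php echo 1 ?" and tokenization
   resumes in the data state at the "x".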
553
554      8.2.4.17 Markup declaration open state
555
556   (This can only happen if the content model flag is set to the PCDATA
557   state.)
558
559   If the next two characters are both U+002D HYPHEN-MINUS (-) characters,
560   consume those two characters, create a comment token whose data is the
561   empty string, and switch to the comment start state.
562
563   Otherwise, if the next seven characters are an ASCII case-insensitive
564   match for the word "DOCTYPE", then consume those characters and switch
565   to the DOCTYPE state.
566
567   Otherwise, if the insertion mode is "in foreign content" and the
568   current node is not an element in the HTML namespace and the next seven
569   characters are an ASCII case-sensitive match for the string "[CDATA["
570   (the five uppercase letters "CDATA" with a U+005B LEFT SQUARE BRACKET
571   character before and after), then consume those characters and switch
572   to the CDATA section state (which is unrelated to the content model
573   flag's CDATA state).
574
575   Otherwise, this is a parse error. Switch to the bogus comment state.
576   The next character that is consumed, if any, is the first character
577   that will be in the comment.
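
   The lookahead in this state can be sketched as below. The function
   name, the return tuple, and the boolean parameter that stands in for
   "the insertion mode is in foreign content and the current node is not
   an element in the HTML namespace" are all assumptions of this example.

```python
def ascii_lower(s):
    """Lowercase only A-Z, leaving every other character untouched."""
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s)


def markup_declaration_open(text, pos, foreign_non_html=False):
    """pos is the index just after "<!".

    Returns (next_state, next_pos, parse_error).
    """
    if text.startswith("--", pos):
        return ("comment start", pos + 2, False)
    if ascii_lower(text[pos:pos + 7]) == "doctype":
        return ("DOCTYPE", pos + 7, False)
    # "[CDATA[" is matched case-sensitively, and only in foreign content.
    if foreign_non_html and text.startswith("[CDATA[", pos):
        return ("CDATA section", pos + 7, False)
    # Nothing is consumed: the next character starts the bogus comment.
    return ("bogus comment", pos, True)
```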
578
579      8.2.4.18 Comment start state
580
581   Consume the next input character:
582
583   U+002D HYPHEN-MINUS (-)
584          Switch to the comment start dash state.
585
586   U+003E GREATER-THAN SIGN (>)
587          Parse error. Emit the comment token. Switch to the data state.
588
589   EOF
590          Parse error. Emit the comment token. Reconsume the EOF character
591          in the data state.
592
593   Anything else
594          Append the input character to the comment token's data. Switch
595          to the comment state.
596
597      8.2.4.19 Comment start dash state
598
599   Consume the next input character:
600
601   U+002D HYPHEN-MINUS (-)
          Switch to the comment end state.
603
604   U+003E GREATER-THAN SIGN (>)
605          Parse error. Emit the comment token. Switch to the data state.
606
607   EOF
608          Parse error. Emit the comment token. Reconsume the EOF character
609          in the data state.
610
611   Anything else
612          Append a U+002D HYPHEN-MINUS (-) character and the input
613          character to the comment token's data. Switch to the comment
614          state.
615
616      8.2.4.20 Comment state
617
618   Consume the next input character:
619
620   U+002D HYPHEN-MINUS (-)
          Switch to the comment end dash state.
622
623   EOF
624          Parse error. Emit the comment token. Reconsume the EOF character
625          in the data state.
626
627   Anything else
628          Append the input character to the comment token's data. Stay in
629          the comment state.
630
631      8.2.4.21 Comment end dash state
632
633   Consume the next input character:
634
635   U+002D HYPHEN-MINUS (-)
          Switch to the comment end state.
637
638   EOF
639          Parse error. Emit the comment token. Reconsume the EOF character
640          in the data state.
641
642   Anything else
643          Append a U+002D HYPHEN-MINUS (-) character and the input
644          character to the comment token's data. Switch to the comment
645          state.
646
647      8.2.4.22 Comment end state
648
649   Consume the next input character:
650
651   U+003E GREATER-THAN SIGN (>)
652          Emit the comment token. Switch to the data state.
653
654   U+002D HYPHEN-MINUS (-)
655          Parse error. Append a U+002D HYPHEN-MINUS (-) character to the
656          comment token's data. Stay in the comment end state.
657
658   EOF
659          Parse error. Emit the comment token. Reconsume the EOF character
660          in the data state.
661
662   Anything else
663          Parse error. Append two U+002D HYPHEN-MINUS (-) characters and
664          the input character to the comment token's data. Switch to the
665          comment state.
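
   The five comment states (comment start through comment end) can be
   combined into one small loop. This is a sketch under stated
   assumptions: the function name and return tuple are hypothetical, pos
   is taken to be the index just after "<!--", and parse errors are only
   counted rather than reported.

```python
def parse_comment(text, pos):
    """Return (comment_data, next_pos, parse_error_count)."""
    data, errors, state = [], 0, "comment start"
    while True:
        if pos >= len(text):
            # EOF in any comment state: parse error, emit the comment.
            return "".join(data), pos, errors + 1
        ch = text[pos]
        pos += 1
        if state == "comment start":
            if ch == "-":
                state = "comment start dash"
            elif ch == ">":
                return "".join(data), pos, errors + 1
            else:
                data.append(ch)
                state = "comment"
        elif state == "comment start dash":
            if ch == "-":
                state = "comment end"
            elif ch == ">":
                return "".join(data), pos, errors + 1
            else:
                data.append("-" + ch)
                state = "comment"
        elif state == "comment":
            if ch == "-":
                state = "comment end dash"
            else:
                data.append(ch)
        elif state == "comment end dash":
            if ch == "-":
                state = "comment end"
            else:
                data.append("-" + ch)
                state = "comment"
        else:  # comment end
            if ch == ">":
                return "".join(data), pos, errors
            errors += 1
            if ch == "-":
                data.append("-")
            else:
                data.append("--" + ch)
                state = "comment"
```

   For example, "<!--a--b-->" yields the data "a--b" with one parse error,
   because the "--" inside the comment is followed by "b" rather than ">".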
666
667      8.2.4.23 DOCTYPE state
668
669   Consume the next input character:
670
671   U+0009 CHARACTER TABULATION
672   U+000A LINE FEED (LF)
673   U+000C FORM FEED (FF)
674   U+0020 SPACE
675          Switch to the before DOCTYPE name state.
676
677   Anything else
678          Parse error. Reconsume the current character in the before
679          DOCTYPE name state.
680
681      8.2.4.24 Before DOCTYPE name state
682
683   Consume the next input character:
684
685   U+0009 CHARACTER TABULATION
686   U+000A LINE FEED (LF)
687   U+000C FORM FEED (FF)
688   U+0020 SPACE
689          Stay in the before DOCTYPE name state.
690
691   U+003E GREATER-THAN SIGN (>)
692          Parse error. Create a new DOCTYPE token. Set its force-quirks
693          flag to on. Emit the token. Switch to the data state.
694
695   U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
696          Create a new DOCTYPE token. Set the token's name to the
697          lowercase version of the input character (add 0x0020 to the
698          character's code point). Switch to the DOCTYPE name state.
699
700   EOF
701          Parse error. Create a new DOCTYPE token. Set its force-quirks
702          flag to on. Emit the token. Reconsume the EOF character in the
703          data state.
704
705   Anything else
706          Create a new DOCTYPE token. Set the token's name to the current
707          input character. Switch to the DOCTYPE name state.
708
709      8.2.4.25 DOCTYPE name state
710
711   Consume the next input character:
712
713   U+0009 CHARACTER TABULATION
714   U+000A LINE FEED (LF)
715   U+000C FORM FEED (FF)
716   U+0020 SPACE
717          Switch to the after DOCTYPE name state.
718
719   U+003E GREATER-THAN SIGN (>)
720          Emit the current DOCTYPE token. Switch to the data state.
721
722   U+0041 LATIN CAPITAL LETTER A through to U+005A LATIN CAPITAL LETTER Z
723          Append the lowercase version of the input character (add 0x0020
724          to the character's code point) to the current DOCTYPE token's
725          name. Stay in the DOCTYPE name state.
726
727   EOF
728          Parse error. Set the DOCTYPE token's force-quirks flag to on.
729          Emit that DOCTYPE token. Reconsume the EOF character in the data
730          state.
731
732   Anything else
733          Append the current input character to the current DOCTYPE
734          token's name. Stay in the DOCTYPE name state.
735
736      8.2.4.26 After DOCTYPE name state
737
738   Consume the next input character:
739
740   U+0009 CHARACTER TABULATION
741   U+000A LINE FEED (LF)
742   U+000C FORM FEED (FF)
743   U+0020 SPACE
744          Stay in the after DOCTYPE name state.
745
746   U+003E GREATER-THAN SIGN (>)
747          Emit the current DOCTYPE token. Switch to the data state.
748
749   EOF
750          Parse error. Set the DOCTYPE token's force-quirks flag to on.
751          Emit that DOCTYPE token. Reconsume the EOF character in the data
752          state.
753
754   Anything else
755          If the six characters starting from the current input character
756          are an ASCII case-insensitive match for the word "PUBLIC", then
757          consume those characters and switch to the before DOCTYPE public
758          identifier state.
759
760          Otherwise, if the six characters starting from the current input
761          character are an ASCII case-insensitive match for the word
762          "SYSTEM", then consume those characters and switch to the before
763          DOCTYPE system identifier state.
764
          Otherwise, this is a parse error. Set the DOCTYPE token's
          force-quirks flag to on. Switch to the bogus DOCTYPE state.
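
   The "anything else" branch of this state can be sketched as a keyword
   check. The function name and return tuple are assumptions of this
   example. pos is the index of the current input character, which in the
   failure case has already been consumed by this state.

```python
def ascii_lower(s):
    """Lowercase only A-Z, leaving every other character untouched."""
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s)


def after_doctype_name_keyword(text, pos):
    """Return (next_state, next_pos, force_quirks) for the keyword check."""
    word = ascii_lower(text[pos:pos + 6])
    if word == "public":
        return ("before DOCTYPE public identifier", pos + 6, False)
    if word == "system":
        return ("before DOCTYPE system identifier", pos + 6, False)
    # Parse error: only the current character is consumed, force-quirks on.
    return ("bogus DOCTYPE", pos + 1, True)
```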
767
768      8.2.4.27 Before DOCTYPE public identifier state
769
770   Consume the next input character:
771
772   U+0009 CHARACTER TABULATION
773   U+000A LINE FEED (LF)
774   U+000C FORM FEED (FF)
775   U+0020 SPACE
776          Stay in the before DOCTYPE public identifier state.
777
778   U+0022 QUOTATION MARK (")
779          Set the DOCTYPE token's public identifier to the empty string
780          (not missing), then switch to the DOCTYPE public identifier
781          (double-quoted) state.
782
783   U+0027 APOSTROPHE (')
784          Set the DOCTYPE token's public identifier to the empty string
785          (not missing), then switch to the DOCTYPE public identifier
786          (single-quoted) state.
787
788   U+003E GREATER-THAN SIGN (>)
789          Parse error. Set the DOCTYPE token's force-quirks flag to on.
790          Emit that DOCTYPE token. Switch to the data state.
791
792   EOF
793          Parse error. Set the DOCTYPE token's force-quirks flag to on.
794          Emit that DOCTYPE token. Reconsume the EOF character in the data
795          state.
796
797   Anything else
798          Parse error. Set the DOCTYPE token's force-quirks flag to on.
799          Switch to the bogus DOCTYPE state.
800
801      8.2.4.28 DOCTYPE public identifier (double-quoted) state
802
803   Consume the next input character:
804
805   U+0022 QUOTATION MARK (")
806          Switch to the after DOCTYPE public identifier state.
807
808   U+003E GREATER-THAN SIGN (>)
809          Parse error. Set the DOCTYPE token's force-quirks flag to on.
810          Emit that DOCTYPE token. Switch to the data state.
811
812   EOF
813          Parse error. Set the DOCTYPE token's force-quirks flag to on.
814          Emit that DOCTYPE token. Reconsume the EOF character in the data
815          state.
816
817   Anything else
818          Append the current input character to the current DOCTYPE
819          token's public identifier. Stay in the DOCTYPE public identifier
820          (double-quoted) state.
821
822      8.2.4.29 DOCTYPE public identifier (single-quoted) state
823
824   Consume the next input character:
825
826   U+0027 APOSTROPHE (')
827          Switch to the after DOCTYPE public identifier state.
828
829   U+003E GREATER-THAN SIGN (>)
830          Parse error. Set the DOCTYPE token's force-quirks flag to on.
831          Emit that DOCTYPE token. Switch to the data state.
832
833   EOF
834          Parse error. Set the DOCTYPE token's force-quirks flag to on.
835          Emit that DOCTYPE token. Reconsume the EOF character in the data
836          state.
837
838   Anything else
839          Append the current input character to the current DOCTYPE
840          token's public identifier. Stay in the DOCTYPE public identifier
841          (single-quoted) state.
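
   The two quoted states above differ only in the closing quote
   character, so a single helper can model both. The following is a
   sketch under that assumption; read_quoted_identifier and its outcome
   labels ("closed", "abrupt", "eof") are hypothetical names chosen for
   illustration. The DOCTYPE system identifier (double-quoted) and
   (single-quoted) states have the same shape.

```python
from typing import Tuple


def read_quoted_identifier(text: str, pos: int, quote: str) -> Tuple[str, int, str]:
    """Run a quoted DOCTYPE identifier state starting at text[pos].

    pos is the index just after the opening quote. Returns
    (identifier, new_pos, outcome), where outcome is "closed",
    "abrupt" (a '>' ended the DOCTYPE early) or "eof".
    """
    chars = []
    while pos < len(text):
        ch = text[pos]
        pos += 1
        if ch == quote:
            # Matching quote: move on to the "after ... identifier" state.
            return "".join(chars), pos, "closed"
        if ch == ">":
            # Parse error: force-quirks on, emit, switch to the data state.
            return "".join(chars), pos, "abrupt"
        # Anything else is appended, including the other kind of quote.
        chars.append(ch)
    # Parse error: force-quirks on, emit; EOF is reconsumed in the data state.
    return "".join(chars), pos, "eof"
```

   For example, read_quoted_identifier("a'b>", 0, '"') returns
   ("a'b", 4, "abrupt"): the apostrophe is ordinary data inside a
   double-quoted identifier, and the U+003E GREATER-THAN SIGN ends the
   DOCTYPE.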
842
843      8.2.4.30 After DOCTYPE public identifier state
844
845   Consume the next input character:
846
847   U+0009 CHARACTER TABULATION
848   U+000A LINE FEED (LF)
849   U+000C FORM FEED (FF)
850   U+0020 SPACE
851          Stay in the after DOCTYPE public identifier state.
852
853   U+0022 QUOTATION MARK (")
854          Set the DOCTYPE token's system identifier to the empty string
855          (not missing), then switch to the DOCTYPE system identifier
856          (double-quoted) state.
857
858   U+0027 APOSTROPHE (')
859          Set the DOCTYPE token's system identifier to the empty string
860          (not missing), then switch to the DOCTYPE system identifier
861          (single-quoted) state.
862
863   U+003E GREATER-THAN SIGN (>)
864          Emit the current DOCTYPE token. Switch to the data state.
865
866   EOF
867          Parse error. Set the DOCTYPE token's force-quirks flag to on.
868          Emit that DOCTYPE token. Reconsume the EOF character in the data
869          state.
870
871   Anything else
872          Parse error. Set the DOCTYPE token's force-quirks flag to on.
873          Switch to the bogus DOCTYPE state.
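
   Because each branch of this state depends only on the current input
   character, it can also be written as a lookup table. The sketch below
   is one possible encoding; Transition, STAY, AFTER_PUBLIC_IDENTIFIER
   and ANYTHING_ELSE are hypothetical names, and None again stands in for
   EOF.

```python
from typing import NamedTuple, Optional

EOF = None  # hypothetical end-of-input sentinel


class Transition(NamedTuple):
    next_state: str
    parse_error: bool = False
    force_quirks: bool = False     # turn the force-quirks flag on
    start_system_id: bool = False  # set the system identifier to ""
    emit: bool = False
    reconsume: bool = False


STAY = Transition("after DOCTYPE public identifier")
AFTER_PUBLIC_IDENTIFIER = {
    "\t": STAY, "\n": STAY, "\f": STAY, " ": STAY,
    '"': Transition("DOCTYPE system identifier (double-quoted)",
                    start_system_id=True),
    "'": Transition("DOCTYPE system identifier (single-quoted)",
                    start_system_id=True),
    ">": Transition("data", emit=True),  # not a parse error in this state
    EOF: Transition("data", parse_error=True, force_quirks=True,
                    emit=True, reconsume=True),
}
ANYTHING_ELSE = Transition("bogus DOCTYPE", parse_error=True, force_quirks=True)


def after_public_identifier(ch: Optional[str]) -> Transition:
    """Look up the transition for one input character."""
    return AFTER_PUBLIC_IDENTIFIER.get(ch, ANYTHING_ELSE)
```

   The table makes one difference from the preceding states easy to see:
   here a U+003E GREATER-THAN SIGN emits the token without a parse error,
   since a DOCTYPE with a public identifier and no system identifier is
   acceptable to the tokeniser.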
874
875      8.2.4.31 Before DOCTYPE system identifier state
876
877   Consume the next input character:
878
879   U+0009 CHARACTER TABULATION
880   U+000A LINE FEED (LF)
881   U+000C FORM FEED (FF)
882   U+0020 SPACE
883          Stay in the before DOCTYPE system identifier state.
884
885   U+0022 QUOTATION MARK (")
886          Set the DOCTYPE token's system identifier to the empty string
887          (not missing), then switch to the DOCTYPE system identifier
888          (double-quoted) state.
889
890   U+0027 APOSTROPHE (')
891          Set the DOCTYPE token's system identifier to the empty string
892          (not missing), then switch to the DOCTYPE system identifier
893          (single-quoted) state.
894
895   U+003E GREATER-THAN SIGN (>)
896          Parse error. Set the DOCTYPE token's force-quirks flag to on.
897          Emit that DOCTYPE token. Switch to the data state.
898
899   EOF
900          Parse error. Set the DOCTYPE token's force-quirks flag to on.
901          Emit that DOCTYPE token. Reconsume the EOF character in the data
902          state.
903
904   Anything else
905          Parse error. Set the DOCTYPE token's force-quirks flag to on.
906          Switch to the bogus DOCTYPE state.
907
908      8.2.4.32 DOCTYPE system identifier (double-quoted) state
909
910   Consume the next input character:
911
912   U+0022 QUOTATION MARK (")
913          Switch to the after DOCTYPE system identifier state.
914
915   U+003E GREATER-THAN SIGN (>)
916          Parse error. Set the DOCTYPE token's force-quirks flag to on.
917          Emit that DOCTYPE token. Switch to the data state.
918
919   EOF
920          Parse error. Set the DOCTYPE token's force-quirks flag to on.
921          Emit that DOCTYPE token. Reconsume the EOF character in the data
922          state.
923
924   Anything else
925          Append the current input character to the current DOCTYPE
926          token's system identifier. Stay in the DOCTYPE system identifier
927          (double-quoted) state.
928
929      8.2.4.33 DOCTYPE system identifier (single-quoted) state
930
931   Consume the next input character:
932
933   U+0027 APOSTROPHE (')
934          Switch to the after DOCTYPE system identifier state.
935
936   U+003E GREATER-THAN SIGN (>)
937          Parse error. Set the DOCTYPE token's force-quirks flag to on.
938          Emit that DOCTYPE token. Switch to the data state.
939
940   EOF
941          Parse error. Set the DOCTYPE token's force-quirks flag to on.
942          Emit that DOCTYPE token. Reconsume the EOF character in the data
943          state.
944
945   Anything else
946          Append the current input character to the current DOCTYPE
947          token's system identifier. Stay in the DOCTYPE system identifier
948          (single-quoted) state.
949
950      8.2.4.34 After DOCTYPE system identifier state
951
952   Consume the next input character:
953
954   U+0009 CHARACTER TABULATION
955   U+000A LINE FEED (LF)
956   U+000C FORM FEED (FF)
957   U+0020 SPACE
958          Stay in the after DOCTYPE system identifier state.
959
960   U+003E GREATER-THAN SIGN (>)
961          Emit the current DOCTYPE token. Switch to the data state.
962
963   EOF
964          Parse error. Set the DOCTYPE token's force-quirks flag to on.
965          Emit that DOCTYPE token. Reconsume the EOF character in the data
966          state.
967
968   Anything else
969          Parse error. Switch to the bogus DOCTYPE state. (This does not
970          set the DOCTYPE token's force-quirks flag to on.)
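
   The handling of the force-quirks flag in this state is easy to get
   wrong, so a small sketch may help. The helper below is hypothetical
   (after_system_identifier is not a name used by this specification); to
   stay runnable it also folds in the bogus DOCTYPE state, which only
   skips ahead to the next U+003E GREATER-THAN SIGN or EOF.

```python
from typing import Tuple

WHITESPACE = "\t\n\f "


def after_system_identifier(tail: str, force_quirks: bool = False) -> Tuple[bool, bool, int]:
    """Run from the after DOCTYPE system identifier state to emission.

    tail is the input following the closing quote of the system
    identifier. Returns (force_quirks, parse_error, consumed).
    """
    parse_error = False
    bogus = False
    for i, ch in enumerate(tail):
        if ch == ">":
            return force_quirks, parse_error, i + 1
        if not bogus and ch not in WHITESPACE:
            # Anything else: parse error, but force-quirks is left alone.
            parse_error = True
            bogus = True
    if not bogus:
        # EOF seen while still in the after DOCTYPE system identifier
        # state: parse error, and force-quirks is turned on.
        parse_error = True
        force_quirks = True
    # EOF seen in the bogus DOCTYPE state emits the token unchanged.
    return force_quirks, parse_error, len(tail)
```

   With this sketch, after_system_identifier(" junk>rest") returns
   (False, True, 6): trailing junk is reported as a parse error, yet the
   token is emitted with the force-quirks flag still off.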
971
972      8.2.4.35 Bogus DOCTYPE state
973
974   Consume the next input character:
975
976   U+003E GREATER-THAN SIGN (>)
977          Emit the DOCTYPE token. Switch to the data state.
978
979   EOF
980          Emit the DOCTYPE token. Reconsume the EOF character in the data
981          state.
982
983   Anything else
984          Stay in the bogus DOCTYPE state.
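
   In effect, the bogus DOCTYPE state discards input up to and including
   the next U+003E GREATER-THAN SIGN. A minimal sketch, assuming the input
   is held in a string and using the hypothetical name
   skip_bogus_doctype:

```python
from typing import Tuple


def skip_bogus_doctype(text: str, pos: int) -> Tuple[int, bool]:
    """Return (new_pos, hit_eof) after running the bogus DOCTYPE state.

    The DOCTYPE token is emitted in both cases; this state never
    changes the token and never reports a parse error of its own.
    """
    end = text.find(">", pos)
    if end == -1:
        # EOF: the caller reconsumes EOF in the data state.
        return len(text), True
    return end + 1, False
```

   Note that quotation marks have no special meaning here:
   skip_bogus_doctype('junk "x">after', 0) returns (9, False), stopping
   at the first U+003E GREATER-THAN SIGN regardless of quoting.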
985
986      8.2.4.36 CDATA section state
987
988   (This can only happen if the content model flag is set to the PCDATA
989   state, and is unrelated to the content model flag's CDATA state.)
990
991   Consume every character up to the next occurrence of the three
992   character sequence U+005D RIGHT SQUARE BRACKET U+005D RIGHT SQUARE
993   BRACKET U+003E GREATER-THAN SIGN (]]>), or the end of the file (EOF),
994   whichever comes first. Emit a series of character tokens consisting of
995   all the characters consumed except the matching three character
996   sequence at the end (if one was found before the end of the file).
997
998   Switch to the data state.
999
1000   If the end of the file was reached, reconsume the EOF character.
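
   The CDATA section state amounts to a search for the first "]]>". The
   following sketch assumes the input is a string and that pos is the
   index just after the "<![CDATA[" opener; consume_cdata_section is a
   hypothetical name, and the returned string stands for the series of
   character tokens to emit.

```python
from typing import Tuple


def consume_cdata_section(text: str, pos: int) -> Tuple[str, int, bool]:
    """Return (characters, new_pos, hit_eof) for a CDATA section.

    characters excludes the terminating "]]>" when one is found.
    """
    end = text.find("]]>", pos)
    if end == -1:
        # No terminator: everything up to EOF is character data, and
        # the caller reconsumes EOF in the data state.
        return text[pos:], len(text), True
    return text[pos:end], end + 3, False
```

   A lone "]]" that is not followed by U+003E GREATER-THAN SIGN is
   ordinary data: for the input "a < b ]] c]]>tail" the sketch returns
   the characters "a < b ]] c" and leaves "tail" for the data state.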
1001
1002      8.2.4.37 Tokenizing character references
1003
1004   This section defines how to consume a character reference. This
1005   definition is used when parsing character references in text and in
1006   attributes.
1007
1008   The behavior depends on the identity of the next character (the one
1009   immediately after the U+0026 AMPERSAND character):
1010
1011   U+0009 CHARACTER TABULATION
1012   U+000A LINE FEED (LF)
1013   U+000C FORM FEED (FF)
1014   U+0020 SPACE
1015   U+003C LESS-THAN SIGN
1016   U+0026 AMPERSAND
1017   EOF
1018   The additional allowed character, if there is one
1019          Not a character reference. No characters are consumed, and
1020          nothing is returned. (This is not an error, either.)
1021
1022   U+0023 NUMBER SIGN (#)
1023          Consume the U+0023 NUMBER SIGN.
1024
1025          The behavior further depends on the character after the U+0023
1026          NUMBER SIGN:
1027
1028        U+0078 LATIN SMALL LETTER X
1029        U+0058 LATIN CAPITAL LETTER X
1030                Consume the X.
1031
1032                Follow the steps below, but using the range of characters
1033                U+0030 DIGIT ZERO through to U+0039 DIGIT NINE, U+0061
1034                LATIN SMALL LETTER A through to U+0066 LATIN SMALL LETTER
1035                F, and U+0041 LATIN CAPITAL LETTER A, through to U+0046
1036                LATIN CAPITAL LETTER F (in other words, 0-9, A-F, a-f).
1037
1038                When it comes to interpreting the number, interpret it as
1039                a hexadecimal number.
1040
1041        Anything else
1042                Follow the steps below, but using the range of characters
1043                U+0030 DIGIT ZERO through to U+0039 DIGIT NINE (i.e. just
1044                0-9).
1045
1046                When it comes to interpreting the number, interpret it as
1047                a decimal number.
1048
1049          Consume as many characters as match the range of characters
1050          given above.
1051
1052          If no characters match the range, then don't consume any
1053          characters (and unconsume the U+0023 NUMBER SIGN character and,
1054          if appropriate, the X character). This is a parse error; nothing
1055          is returned.
1056
1057          Otherwise, if the next character is a U+003B SEMICOLON, consume
1058          that too. If it isn't, there is a parse error.
1059
1060          If one or more characters match the range, then take them all
1061          and interpret the string of characters as a number (either
1062          hexadecimal or decimal as appropriate).
1063
1064          If that number is one of the numbers in the first column of the
1065          following table, then this is a parse error. Find the row with
1066          that number in the first column, and return a character token
1067          for the Unicode character given in the second column of that
1068          row.
1069
1070          Number                   Unicode character
1071          0x0D   U+000A LINE FEED (LF)
1072          0x80   U+20AC EURO SIGN ('€')
1073          0x81   U+FFFD REPLACEMENT CHARACTER
1074          0x82   U+201A SINGLE LOW-9 QUOTATION MARK ('‚')
1075          0x83   U+0192 LATIN SMALL LETTER F WITH HOOK ('ƒ')
1076          0x84   U+201E DOUBLE LOW-9 QUOTATION MARK ('„')
1077          0x85   U+2026 HORIZONTAL ELLIPSIS ('…')
1078          0x86   U+2020 DAGGER ('†')
1079          0x87   U+2021 DOUBLE DAGGER ('‡')
1080          0x88   U+02C6 MODIFIER LETTER CIRCUMFLEX ACCENT ('ˆ')
1081          0x89   U+2030 PER MILLE SIGN ('‰')
1082          0x8A   U+0160 LATIN CAPITAL LETTER S WITH CARON ('Š')
1083          0x8B   U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK ('‹')
1084          0x8C   U+0152 LATIN CAPITAL LIGATURE OE ('Œ')
1085          0x8D   U+FFFD REPLACEMENT CHARACTER
1086          0x8E   U+017D LATIN CAPITAL LETTER Z WITH CARON ('Ž')
1087          0x8F   U+FFFD REPLACEMENT CHARACTER
1088          0x90   U+FFFD REPLACEMENT CHARACTER
1089          0x91   U+2018 LEFT SINGLE QUOTATION MARK ('‘')
1090          0x92   U+2019 RIGHT SINGLE QUOTATION MARK ('’')
1091          0x93   U+201C LEFT DOUBLE QUOTATION MARK ('“')
1092          0x94   U+201D RIGHT DOUBLE QUOTATION MARK ('”')
1093          0x95   U+2022 BULLET ('•')
1094          0x96   U+2013 EN DASH ('–')
1095          0x97   U+2014 EM DASH ('—')
1096          0x98   U+02DC SMALL TILDE ('˜')
1097          0x99   U+2122 TRADE MARK SIGN ('™')
1098          0x9A   U+0161 LATIN SMALL LETTER S WITH CARON ('š')
1099          0x9B   U+203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK ('›')
1100          0x9C   U+0153 LATIN SMALL LIGATURE OE ('œ')
1101          0x9D   U+FFFD REPLACEMENT CHARACTER
1102          0x9E   U+017E LATIN SMALL LETTER Z WITH CARON ('ž')
1103          0x9F   U+0178 LATIN CAPITAL LETTER Y WITH DIAERESIS ('Ÿ')
1104
1105          Otherwise, if the number is in the range 0x0000 to 0x0008,
1106          0x000E to 0x001F, 0x007F to 0x009F, 0xD800 to 0xDFFF, 0xFDD0 to
1107          0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF,
1108          0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE,
1109          0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF,
1110          0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE,
1111          0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF,
          0x10FFFE, or 0x10FFFF, or is higher than 0x10FFFF, then this is
          a parse error; return a character token for U+FFFD REPLACEMENT
          CHARACTER instead.
1115
1116          Otherwise, return a character token for the Unicode character
1117          whose code point is that number.
1118
1119   Anything else
1120          Consume the maximum number of characters possible, with the
1121          consumed characters matching one of the identifiers in the first
1122          column of the named character references table (in a
1123          case-sensitive manner).
1124
1125          If no match can be made, then this is a parse error. No
1126          characters are consumed, and nothing is returned.
1127
1128          If the last character matched is not a U+003B SEMICOLON (;),
1129          there is a parse error.
1130
1131          If the character reference is being consumed as part of an
1132          attribute, and the last character matched is not a U+003B
1133          SEMICOLON (;), and the next character is in the range U+0030
1134          DIGIT ZERO to U+0039 DIGIT NINE, U+0041 LATIN CAPITAL LETTER A
1135          to U+005A LATIN CAPITAL LETTER Z, or U+0061 LATIN SMALL LETTER A
1136          to U+007A LATIN SMALL LETTER Z, then, for historical reasons,
1137          all the characters that were matched after the U+0026 AMPERSAND
1138          (&) must be unconsumed, and nothing is returned.
1139
1140          Otherwise, return a character token for the character
1141          corresponding to the character reference name (as given by the
1142          second column of the named character references table).
1143
          If the markup contains "I'm &notit; I tell you", the character
          reference is parsed as "not", as in "I'm ¬it; I tell you". But
          if the markup were "I'm &notin; I tell you", the character
          reference would be parsed as "notin;", resulting in "I'm ∉ I
          tell you".
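
   The longest-match rule for named character references, together with
   the attribute exception, can be sketched as follows. This is an
   illustration only: NAMED is a hypothetical seven-entry excerpt
   standing in for the named character references table, and
   consume_named_reference is a name invented for this sketch. It covers
   only the "Anything else" branch and expects pos to be the index just
   after the U+0026 AMPERSAND.

```python
from typing import Optional, Tuple

# Hypothetical excerpt; the real data is the named character references table.
NAMED = {"amp": "&", "amp;": "&", "lt": "<", "lt;": "<",
         "not": "\u00ac", "not;": "\u00ac", "notin;": "\u2209"}
MAX_NAME = max(len(name) for name in NAMED)


def consume_named_reference(text: str, pos: int, in_attribute: bool = False
                            ) -> Tuple[Optional[str], int, bool]:
    """Return (character or None, new_pos, parse_error)."""
    # Longest match first: try the longest candidate and shrink.
    for length in range(min(MAX_NAME, len(text) - pos), 0, -1):
        name = text[pos:pos + length]
        if name in NAMED:  # case-sensitive lookup
            break
    else:
        return None, pos, True  # no match: parse error, nothing consumed
    end = pos + length
    parse_error = not name.endswith(";")
    if in_attribute and parse_error and end < len(text):
        nxt = text[end]
        if nxt.isascii() and nxt.isalnum():
            # Historical rule: unconsume everything after the ampersand.
            return None, pos, parse_error
    return NAMED[name], end, parse_error
```

   Applied to the example above, consume_named_reference("notit; I tell
   you", 0) returns ("¬", 3, True), while the same input with
   in_attribute=True returns (None, 0, True) because the character after
   "not" is a letter.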
1148
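   The U+0023 NUMBER SIGN branch above can be sketched in the same style.
   The names C1_TABLE, is_disallowed and consume_numeric_reference are
   assumptions made for this illustration; C1_TABLE transcribes the
   second column of the table above for 0x80 to 0x9F, and pos is the
   index of the U+0023 NUMBER SIGN, just after the U+0026 AMPERSAND.

```python
from typing import Optional, Tuple

C1_TABLE = (
    "\u20ac\ufffd\u201a\u0192\u201e\u2026\u2020\u2021"  # 0x80-0x87
    "\u02c6\u2030\u0160\u2039\u0152\ufffd\u017d\ufffd"  # 0x88-0x8F
    "\ufffd\u2018\u2019\u201c\u201d\u2022\u2013\u2014"  # 0x90-0x97
    "\u02dc\u2122\u0161\u203a\u0153\ufffd\u017e\u0178"  # 0x98-0x9F
)


def is_disallowed(n: int) -> bool:
    """True for numbers that must be replaced by U+FFFD."""
    return (n <= 0x08 or n == 0x0B or 0x0E <= n <= 0x1F
            or 0x7F <= n <= 0x9F or 0xD800 <= n <= 0xDFFF
            or 0xFDD0 <= n <= 0xFDEF
            or (n & 0xFFFE) == 0xFFFE  # 0xFFFE, 0xFFFF, ..., 0x10FFFF
            or n > 0x10FFFF)


def consume_numeric_reference(text: str, pos: int) -> Tuple[Optional[str], int, bool]:
    """Return (character or None, new_pos, parse_error)."""
    i = pos + 1  # consume the '#'
    if i < len(text) and text[i] in "xX":
        i += 1   # consume the X
        digits, base = "0123456789abcdefABCDEF", 16
    else:
        digits, base = "0123456789", 10
    start = i
    while i < len(text) and text[i] in digits:
        i += 1
    if i == start:
        return None, pos, True  # unconsume '#' (and X); parse error
    if i < len(text) and text[i] == ";":
        end, parse_error = i + 1, False
    else:
        end, parse_error = i, True  # missing semicolon
    number = int(text[start:i], base)
    if number == 0x0D:
        return "\n", end, True
    if 0x80 <= number <= 0x9F:
        return C1_TABLE[number - 0x80], end, True
    if is_disallowed(number):
        return "\ufffd", end, True
    return chr(number), end, parse_error
```

   For instance, consume_numeric_reference("#x41;", 0) returns
   ("A", 5, False), and consume_numeric_reference("#128;", 0) returns
   the EURO SIGN together with a parse error.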