Lines Matching full:the

14    Implementations must act as if they used the following state machine to
15 tokenise HTML. The state machine must start in the data state. Most
17 and either switches the state machine to a new state to reconsume the
18 same character, or switches it to a new state (to consume the next
19 character), or repeats the same state (to consume the next character).
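
The three kinds of transition described above (reconsume the same character in a new state, switch and consume the next character, or stay and consume the next character) can be illustrated with a toy driver loop. This is a minimal sketch rather than the full tokeniser: the name `run_sketch`, the tuple token shapes, and the use of `None` for EOF are assumptions made for illustration. Only three states are modelled, parse errors are not reported, and every non-letter after "<" is treated as "anything else".

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # hypothetical token shape: (kind, data)


def run_sketch(text: str) -> List[Token]:
    """Drive a toy three-state machine over text and return emitted tokens."""
    tokens: List[Token] = []
    state = "data"          # the machine starts in the data state
    pos = 0
    tag_name = ""
    while pos <= len(text):
        # None stands in for the EOF character in this sketch.
        ch: Optional[str] = text[pos] if pos < len(text) else None
        if state == "data":
            if ch is None:
                break
            if ch == "<":
                state = "tag open"          # switch; next character is consumed
            else:
                tokens.append(("char", ch))  # stay in the data state
            pos += 1
        elif state == "tag open":
            if ch is not None and ch.isascii() and ch.isalpha():
                tag_name = ch.lower()
                state = "tag name"
                pos += 1
            else:
                tokens.append(("char", "<"))
                state = "data"              # reconsume: pos is not advanced
        elif state == "tag name":
            if ch is None:
                tokens.append(("start", tag_name))
                state = "data"              # reconsume EOF in the data state
            elif ch == ">":
                tokens.append(("start", tag_name))
                state = "data"
                pos += 1
            else:
                tag_name += ch.lower() if "A" <= ch <= "Z" else ch
                pos += 1                    # stay in the tag name state
    return tokens
```

Reconsuming is modelled simply by changing `state` without advancing `pos`, so the same character is seen again on the next loop iteration.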
23 The exact behavior of certain states depends on a content model flag
24 that is set after certain tokens are emitted. The flag has several
26 the PCDATA state. In the RCDATA and CDATA states, a further escape flag
27 is used to control the behavior of the tokeniser. It is either true or
28 false, and initially must be set to the false state. The insertion mode
29 and the stack of open elements also affect tokenization.
31 The output of the tokenization step is a series of zero or more of the
36 missing (which is a distinct state from the empty string), and the
44 When a token is emitted, it must immediately be handled by the tree
45 construction stage. The tree construction stage can affect the state of
46 the content model flag, and can insert additional characters into the
47 stream. (For example, the script element can result in scripts
48 executing and using the dynamic markup insertion APIs to insert
49 characters into the stream being tokenised.)
52 the flag is not acknowledged when it is processed by the tree
55 When an end tag token is emitted, the content model flag must be
56 switched to the PCDATA state.
64 Before each step of the tokeniser, the user agent must first check the
65 parser pause flag. If it is true, then the tokeniser must abort the
66 processing of any nested invocations of the tokeniser, yielding control
67 back to the caller. If it is false, then the user agent may then check
68 to see if either one of the scripts in the list of scripts that will
69 execute as soon as possible or the first script in the list of scripts
73 The tokeniser state machine consists of the states defined in the
78 Consume the next input character:
81 When the content model flag is set to one of the PCDATA or
82 RCDATA states and the escape flag is false: switch to the
84 Otherwise: treat it as per the "anything else" entry below.
87 If the content model flag is set to either the RCDATA state or
88 the CDATA state, and the escape flag is false, and there are at
89 least three characters before this one in the input stream, and
90 the last four characters in the input stream, including this
92 HYPHEN-MINUS, and U+002D HYPHEN-MINUS ("<!--"), then set the
95 In any case, emit the input character as a character token. Stay
96 in the data state.
99 When the content model flag is set to the PCDATA state: switch
100 to the tag open state.
101 When the content model flag is set to either the RCDATA state or
102 the CDATA state, and the escape flag is false: switch to the tag
104 Otherwise: treat it as per the "anything else" entry below.
107 If the content model flag is set to either the RCDATA state or
108 the CDATA state, and the escape flag is true, and the last three
109 characters in the input stream including this one are U+002D
111 ("-->"), set the escape flag to false.
113 In any case, emit the input character as a character token. Stay
114 in the data state.
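
The escape flag rules for the RCDATA and CDATA content models can be traced character by character. The following is a sketch under stated assumptions: the function name `escape_flag_trace` is hypothetical, the whole input is assumed to be processed under an RCDATA or CDATA content model, and only the flag is tracked (no tokens are emitted).

```python
from typing import List


def escape_flag_trace(text: str) -> List[bool]:
    """Return the escape flag value after each character has been processed."""
    escape = False  # the flag is initially false
    trace: List[bool] = []
    for i, ch in enumerate(text):
        if ch == "-" and not escape and i >= 3 and text[i - 3:i + 1] == "<!--":
            # At least three characters precede this one and the last
            # four characters, including this one, spell "<!--".
            escape = True
        elif ch == ">" and escape and i >= 2 and text[i - 2:i + 1] == "-->":
            # The last three characters, including this one, spell "-->".
            escape = False
        trace.append(escape)
    return trace
```

While the flag is true, a "<" in such content would not open a tag, which is why the data state checks the flag before switching to the tag open state.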
120 Emit the input character as a character token. Stay in the data
125 (This cannot happen if the content model flag is set to the CDATA
133 Otherwise, emit the character token that was returned.
135 Finally, switch to the data state.
139 The behavior of this state depends on the content model flag.
141 If the content model flag is set to the RCDATA or CDATA states
142 Consume the next input character. If it is a U+002F SOLIDUS (/)
143 character, switch to the close tag open state. Otherwise, emit a
144 U+003C LESS-THAN SIGN character token and reconsume the current
145 input character in the data state.
147 If the content model flag is set to the PCDATA state
148 Consume the next input character:
151 Switch to the markup declaration open state.
154 Switch to the close tag open state.
158 Create a new start tag token, set its tag name to the
159 lowercase version of the input character (add 0x0020 to
160 the character's code point), then switch to the tag name
161 state. (Don't emit the token yet; further details will be
165 Create a new start tag token, set its tag name to the
166 input character, then switch to the tag name state. (Don't
167 emit the token yet; further details will be filled in
173 the data state.
176 Parse error. Switch to the bogus comment state.
180 and reconsume the current input character in the data
185 If the content model flag is set to the RCDATA or CDATA states but no
186 start tag token has ever been emitted by this instance of the tokeniser
187 (fragment case), or, if the content model flag is set to the RCDATA or
188 CDATA states and the next few characters do not match the tag name of
189 the last start tag token emitted (compared in an ASCII case-insensitive
191 the following characters:
201 character token, and switch to the data state to process the next input
204 Otherwise, if the content model flag is set to the PCDATA state, or if
205 the next few characters do match that tag name, consume the next input
209 Create a new end tag token, set its tag name to the lowercase
210 version of the input character (add 0x0020 to the character's
211 code point), then switch to the tag name state. (Don't emit the
216 Create a new end tag token, set its tag name to the input
217 character, then switch to the tag name state. (Don't emit the
222 Parse error. Switch to the data state.
226 U+002F SOLIDUS character token. Reconsume the EOF character in
227 the data state.
230 Parse error. Switch to the bogus comment state.
234 Consume the next input character:
240 Switch to the before attribute name state.
243 Switch to the self-closing start tag state.
246 Emit the current tag token. Switch to the data state.
249 Append the lowercase version of the current input character (add
250 0x0020 to the character's code point) to the current tag token's
251 tag name. Stay in the tag name state.
254 Parse error. Emit the current tag token. Reconsume the EOF
255 character in the data state.
258 Append the current input character to the current tag token's
259 tag name. Stay in the tag name state.
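
The repeated instruction "add 0x0020 to the character's code point" amounts to ASCII-only lowercasing. A minimal sketch follows; the helper name `ascii_lower` is hypothetical, and it assumes the rule is applied only to the uppercase letters A to Z, leaving every other character unchanged.

```python
def ascii_lower(ch: str) -> str:
    """Lowercase a single character by adding 0x0020, for A-Z only."""
    if "A" <= ch <= "Z":
        return chr(ord(ch) + 0x0020)
    return ch
```

Unlike `str.lower()`, this leaves non-ASCII letters untouched, so tag and attribute names are only normalised within the ASCII range.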
263 Consume the next input character:
269 Stay in the before attribute name state.
272 Switch to the self-closing start tag state.
275 Emit the current tag token. Switch to the data state.
278 Start a new attribute in the current tag token. Set that
279 attribute's name to the lowercase version of the current input
280 character (add 0x0020 to the character's code point), and its
281 value to the empty string. Switch to the attribute name state.
286 Parse error. Treat it as per the "anything else" entry below.
289 Parse error. Emit the current tag token. Reconsume the EOF
290 character in the data state.
293 Start a new attribute in the current tag token. Set that
294 attribute's name to the current input character, and its value
295 to the empty string. Switch to the attribute name state.
299 Consume the next input character:
305 Switch to the after attribute name state.
308 Switch to the self-closing start tag state.
311 Switch to the before attribute value state.
314 Emit the current tag token. Switch to the data state.
317 Append the lowercase version of the current input character (add
318 0x0020 to the character's code point) to the current attribute's
319 name. Stay in the attribute name state.
323 Parse error. Treat it as per the "anything else" entry below.
326 Parse error. Emit the current tag token. Reconsume the EOF
327 character in the data state.
330 Append the current input character to the current attribute's
331 name. Stay in the attribute name state.
333 When the user agent leaves the attribute name state (and before
334 emitting the tag token, if appropriate), the complete attribute's name
335 must be compared to the other attributes on the same token; if there is
336 already an attribute on the token with the exact same name, then this
337 is a parse error and the new attribute must be dropped, along with the
342 Consume the next input character:
348 Stay in the after attribute name state.
351 Switch to the self-closing start tag state.
354 Switch to the before attribute value state.
357 Emit the current tag token. Switch to the data state.
360 Start a new attribute in the current tag token. Set that
361 attribute's name to the lowercase version of the current input
362 character (add 0x0020 to the character's code point), and its
363 value to the empty string. Switch to the attribute name state.
367 Parse error. Treat it as per the "anything else" entry below.
370 Parse error. Emit the current tag token. Reconsume the EOF
371 character in the data state.
374 Start a new attribute in the current tag token. Set that
375 attribute's name to the current input character, and its value
376 to the empty string. Switch to the attribute name state.
380 Consume the next input character:
386 Stay in the before attribute value state.
389 Switch to the attribute value (double-quoted) state.
392 Switch to the attribute value (unquoted) state and reconsume
396 Switch to the attribute value (single-quoted) state.
399 Parse error. Emit the current tag token. Switch to the data
403 Parse error. Treat it as per the "anything else" entry below.
406 Parse error. Emit the current tag token. Reconsume the character
407 in the data state.
410 Append the current input character to the current attribute's
411 value. Switch to the attribute value (unquoted) state.
415 Consume the next input character:
418 Switch to the after attribute value (quoted) state.
421 Switch to the character reference in attribute value state, with
422 the additional allowed character being U+0022 QUOTATION MARK
426 Parse error. Emit the current tag token. Reconsume the character
427 in the data state.
430 Append the current input character to the current attribute's
431 value. Stay in the attribute value (double-quoted) state.
435 Consume the next input character:
438 Switch to the after attribute value (quoted) state.
441 Switch to the character reference in attribute value state, with
442 the additional allowed character being U+0027 APOSTROPHE (').
445 Parse error. Emit the current tag token. Reconsume the character
446 in the data state.
449 Append the current input character to the current attribute's
450 value. Stay in the attribute value (single-quoted) state.
454 Consume the next input character:
460 Switch to the before attribute name state.
463 Switch to the character reference in attribute value state, with
467 Emit the current tag token. Switch to the data state.
472 Parse error. Treat it as per the "anything else" entry below.
475 Parse error. Emit the current tag token. Reconsume the character
476 in the data state.
479 Append the current input character to the current attribute's
480 value. Stay in the attribute value (unquoted) state.
486 If nothing is returned, append a U+0026 AMPERSAND character to the
489 Otherwise, append the returned character token to the current
492 Finally, switch back to the attribute value state that you were in when
497 Consume the next input character:
503 Switch to the before attribute name state.
506 Switch to the self-closing start tag state.
509 Emit the current tag token. Switch to the data state.
512 Parse error. Emit the current tag token. Reconsume the EOF
513 character in the data state.
516 Parse error. Reconsume the character in the before attribute
521 Consume the next input character:
524 Set the self-closing flag of the current tag token. Emit the
525 current tag token. Switch to the data state.
528 Parse error. Emit the current tag token. Reconsume the EOF
529 character in the data state.
532 Parse error. Reconsume the character in the before attribute
537 (This can only happen if the content model flag is set to the PCDATA
540 Consume every character up to and including the first U+003E
541 GREATER-THAN SIGN character (>) or the end of the file (EOF), whichever
542 comes first. Emit a comment token whose data is the concatenation of
543 all the characters starting from and including the character that
544 caused the state machine to switch into the bogus comment state, up to
545 and including the character immediately before the last consumed
546 character (i.e. up to the character just before the U+003E or EOF
547 character). (If the comment was started by the end of the file (EOF),
548 the token is empty.)
550 Switch to the data state.
552 If the end of the file was reached, reconsume the EOF character.
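
The bogus comment state's scan can be sketched as a single search for the first ">". This is an illustration under assumptions: the function name `bogus_comment` and its return shape (comment data plus the next position) are hypothetical, and `start` is taken to be the index of the character that caused the switch into this state.

```python
from typing import Tuple


def bogus_comment(text: str, start: int) -> Tuple[str, int]:
    """Return (comment data, next position) for a bogus comment at start."""
    end = text.find(">", start)
    if end == -1:
        # EOF came first: the data runs to the end of the input, and the
        # caller would then reconsume EOF in the data state.
        return text[start:], len(text)
    # The ">" is consumed but is not part of the comment data.
    return text[start:end], end + 1
```

If the state is entered at the very end of the input, the slice is empty, which matches the note that a comment started by EOF yields an empty token.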
556 (This can only happen if the content model flag is set to the PCDATA
559 If the next two characters are both U+002D HYPHEN-MINUS (-) characters,
560 consume those two characters, create a comment token whose data is the
561 empty string, and switch to the comment start state.
563 Otherwise, if the next seven characters are an ASCII case-insensitive
564 match for the word "DOCTYPE", then consume those characters and switch
565 to the DOCTYPE state.
567 Otherwise, if the insertion mode is "in foreign content" and the
568 current node is not an element in the HTML namespace and the next seven
569 characters are an ASCII case-sensitive match for the string "[CDATA["
570 (the five uppercase letters "CDATA" with a U+005B LEFT SQUARE BRACKET
572 to the CDATA section state (which is unrelated to the content model
575 Otherwise, this is a parse error. Switch to the bogus comment state.
576 The next character that is consumed, if any, is the first character
577 that will be in the comment.
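
The lookahead performed in the markup declaration open state can be sketched as a series of prefix checks. The names below are hypothetical, and the boolean `cdata_allowed` is an assumption that stands in for the combined condition on the insertion mode and the current node's namespace, which belongs to tree construction and is not modelled here.

```python
from typing import Tuple


def _ascii_lower(s: str) -> str:
    """ASCII-only lowercasing, so non-ASCII lookalikes never match."""
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s)


def markup_declaration_open(
    text: str, pos: int, cdata_allowed: bool = False
) -> Tuple[str, int]:
    """Return (next state, next position) for the text following "<!"."""
    if text.startswith("--", pos):
        return "comment start", pos + 2
    if _ascii_lower(text[pos:pos + 7]) == "doctype":   # case-insensitive
        return "DOCTYPE", pos + 7
    if cdata_allowed and text.startswith("[CDATA[", pos):  # case-sensitive
        return "CDATA section", pos + 7
    # Parse error: nothing is consumed, so the next character read
    # becomes the first character of the bogus comment.
    return "bogus comment", pos
```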
581 Consume the next input character:
584 Switch to the comment start dash state.
587 Parse error. Emit the comment token. Switch to the data state.
590 Parse error. Emit the comment token. Reconsume the EOF character
591 in the data state.
594 Append the input character to the comment token's data. Switch
595 to the comment state.
599 Consume the next input character:
602 Switch to the comment end state.
605 Parse error. Emit the comment token. Switch to the data state.
608 Parse error. Emit the comment token. Reconsume the EOF character
609 in the data state.
612 Append a U+002D HYPHEN-MINUS (-) character and the input
613 character to the comment token's data. Switch to the comment
618 Consume the next input character:
621 Switch to the comment end dash state.
624 Parse error. Emit the comment token. Reconsume the EOF character
625 in the data state.
628 Append the input character to the comment token's data. Stay in
629 the comment state.
633 Consume the next input character:
636 Switch to the comment end state.
639 Parse error. Emit the comment token. Reconsume the EOF character
640 in the data state.
643 Append a U+002D HYPHEN-MINUS (-) character and the input
644 character to the comment token's data. Switch to the comment
649 Consume the next input character:
652 Emit the comment token. Switch to the data state.
655 Parse error. Append a U+002D HYPHEN-MINUS (-) character to the
656 comment token's data. Stay in the comment end state.
659 Parse error. Emit the comment token. Reconsume the EOF character
660 in the data state.
664 the input character to the comment token's data. Switch to the
669 Consume the next input character:
675 Switch to the before DOCTYPE name state.
678 Parse error. Reconsume the current character in the before
683 Consume the next input character:
689 Stay in the before DOCTYPE name state.
693 flag to on. Emit the token. Switch to the data state.
696 Create a new DOCTYPE token. Set the token's name to the
697 lowercase version of the input character (add 0x0020 to the
698 character's code point). Switch to the DOCTYPE name state.
702 flag to on. Emit the token. Reconsume the EOF character in the
706 Create a new DOCTYPE token. Set the token's name to the current
707 input character. Switch to the DOCTYPE name state.
711 Consume the next input character:
717 Switch to the after DOCTYPE name state.
720 Emit the current DOCTYPE token. Switch to the data state.
723 Append the lowercase version of the input character (add 0x0020
724 to the character's code point) to the current DOCTYPE token's
725 name. Stay in the DOCTYPE name state.
728 Parse error. Set the DOCTYPE token's force-quirks flag to on.
729 Emit that DOCTYPE token. Reconsume the EOF character in the data
733 Append the current input character to the current DOCTYPE
734 token's name. Stay in the DOCTYPE name state.
738 Consume the next input character:
744 Stay in the after DOCTYPE name state.
747 Emit the current DOCTYPE token. Switch to the data state.
750 Parse error. Set the DOCTYPE token's force-quirks flag to on.
751 Emit that DOCTYPE token. Reconsume the EOF character in the data
755 If the six characters starting from the current input character
756 are an ASCII case-insensitive match for the word "PUBLIC", then
757 consume those characters and switch to the before DOCTYPE public
760 Otherwise, if the six characters starting from the current input
761 character are an ASCII case-insensitive match for the word
762 "SYSTEM", then consume those characters and switch to the before
765 Otherwise, this is a parse error. Set the DOCTYPE token's
766 force-quirks flag to on. Switch to the bogus DOCTYPE state.
770 Consume the next input character:
776 Stay in the before DOCTYPE public identifier state.
779 Set the DOCTYPE token's public identifier to the empty string
780 (not missing), then switch to the DOCTYPE public identifier
784 Set the DOCTYPE token's public identifier to the empty string
785 (not missing), then switch to the DOCTYPE public identifier
789 Parse error. Set the DOCTYPE token's force-quirks flag to on.
790 Emit that DOCTYPE token. Switch to the data state.
793 Parse error. Set the DOCTYPE token's force-quirks flag to on.
794 Emit that DOCTYPE token. Reconsume the EOF character in the data
798 Parse error. Set the DOCTYPE token's force-quirks flag to on.
799 Switch to the bogus DOCTYPE state.
803 Consume the next input character:
806 Switch to the after DOCTYPE public identifier state.
809 Parse error. Set the DOCTYPE token's force-quirks flag to on.
810 Emit that DOCTYPE token. Switch to the data state.
813 Parse error. Set the DOCTYPE token's force-quirks flag to on.
814 Emit that DOCTYPE token. Reconsume the EOF character in the data
818 Append the current input character to the current DOCTYPE
819 token's public identifier. Stay in the DOCTYPE public identifier
824 Consume the next input character:
827 Switch to the after DOCTYPE public identifier state.
830 Parse error. Set the DOCTYPE token's force-quirks flag to on.
831 Emit that DOCTYPE token. Switch to the data state.
834 Parse error. Set the DOCTYPE token's force-quirks flag to on.
835 Emit that DOCTYPE token. Reconsume the EOF character in the data
839 Append the current input character to the current DOCTYPE
840 token's public identifier. Stay in the DOCTYPE public identifier
845 Consume the next input character:
851 Stay in the after DOCTYPE public identifier state.
854 Set the DOCTYPE token's system identifier to the empty string
855 (not missing), then switch to the DOCTYPE system identifier
859 Set the DOCTYPE token's system identifier to the empty string
860 (not missing), then switch to the DOCTYPE system identifier
864 Emit the current DOCTYPE token. Switch to the data state.
867 Parse error. Set the DOCTYPE token's force-quirks flag to on.
868 Emit that DOCTYPE token. Reconsume the EOF character in the data
872 Parse error. Set the DOCTYPE token's force-quirks flag to on.
873 Switch to the bogus DOCTYPE state.
877 Consume the next input character:
883 Stay in the before DOCTYPE system identifier state.
886 Set the DOCTYPE token's system identifier to the empty string
887 (not missing), then switch to the DOCTYPE system identifier
891 Set the DOCTYPE token's system identifier to the empty string
892 (not missing), then switch to the DOCTYPE system identifier
896 Parse error. Set the DOCTYPE token's force-quirks flag to on.
897 Emit that DOCTYPE token. Switch to the data state.
900 Parse error. Set the DOCTYPE token's force-quirks flag to on.
901 Emit that DOCTYPE token. Reconsume the EOF character in the data
905 Parse error. Set the DOCTYPE token's force-quirks flag to on.
906 Switch to the bogus DOCTYPE state.
910 Consume the next input character:
913 Switch to the after DOCTYPE system identifier state.
916 Parse error. Set the DOCTYPE token's force-quirks flag to on.
917 Emit that DOCTYPE token. Switch to the data state.
920 Parse error. Set the DOCTYPE token's force-quirks flag to on.
921 Emit that DOCTYPE token. Reconsume the EOF character in the data
925 Append the current input character to the current DOCTYPE
926 token's system identifier. Stay in the DOCTYPE system identifier
931 Consume the next input character:
934 Switch to the after DOCTYPE system identifier state.
937 Parse error. Set the DOCTYPE token's force-quirks flag to on.
938 Emit that DOCTYPE token. Switch to the data state.
941 Parse error. Set the DOCTYPE token's force-quirks flag to on.
942 Emit that DOCTYPE token. Reconsume the EOF character in the data
946 Append the current input character to the current DOCTYPE
947 token's system identifier. Stay in the DOCTYPE system identifier
952 Consume the next input character:
958 Stay in the after DOCTYPE system identifier state.
961 Emit the current DOCTYPE token. Switch to the data state.
964 Parse error. Set the DOCTYPE token's force-quirks flag to on.
965 Emit that DOCTYPE token. Reconsume the EOF character in the data
969 Parse error. Switch to the bogus DOCTYPE state. (This does not
970 set the DOCTYPE token's force-quirks flag to on.)
974 Consume the next input character:
977 Emit the DOCTYPE token. Switch to the data state.
980 Emit the DOCTYPE token. Reconsume the EOF character in the data
984 Stay in the bogus DOCTYPE state.
988 (This can only happen if the content model flag is set to the PCDATA
989 state, and is unrelated to the content model flag's CDATA state.)
991 Consume every character up to the next occurrence of the three
993 BRACKET U+003E GREATER-THAN SIGN (]]>), or the end of the file (EOF),
995 all the characters consumed except the matching three character
996 sequence at the end (if one was found before the end of the file).
998 Switch to the data state.
1000 If the end of the file was reached, reconsume the EOF character.
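
The CDATA section state's scan for "]]>" can be sketched in the same style. The function name `cdata_section` and the tuple token shape are assumptions for illustration; the sketch emits one character token per consumed character and drops the closing three-character sequence when it is found.

```python
from typing import List, Tuple


def cdata_section(text: str, pos: int) -> Tuple[List[Tuple[str, str]], int]:
    """Return (character tokens, next position) for a CDATA section at pos."""
    end = text.find("]]>", pos)
    if end == -1:
        # EOF came first: everything up to the end becomes character tokens.
        chars, next_pos = text[pos:], len(text)
    else:
        # The "]]>" terminator is consumed but not emitted.
        chars, next_pos = text[pos:end], end + 3
    return [("char", c) for c in chars], next_pos
```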
1008 The behavior depends on the identity of the next character (the one
1009 immediately after the U+0026 AMPERSAND character):
1018 The additional allowed character, if there is one
1023 Consume the U+0023 NUMBER SIGN.
1025 The behavior further depends on the character after the U+0023
1030 Consume the X.
1032 Follow the steps below, but using the range of characters
1038 When it comes to interpreting the number, interpret it as
1042 Follow the steps below, but using the range of characters
1046 When it comes to interpreting the number, interpret it as
1049 Consume as many characters as match the range of characters
1052 If no characters match the range, then don't consume any
1053 characters (and unconsume the U+0023 NUMBER SIGN character and,
1054 if appropriate, the X character). This is a parse error; nothing
1057 Otherwise, if the next character is a U+003B SEMICOLON, consume
1060 If one or more characters match the range, then take them all
1061 and interpret the string of characters as a number (either
1064 If that number is one of the numbers in the first column of the
1065 following table, then this is a parse error. Find the row with
1066 that number in the first column, and return a character token
1067 for the Unicode character given in the second column of that
1105 Otherwise, if the number is in the range 0x0000 to 0x0008,
1113 a parse error; return a character token for the U+FFFD
1116 Otherwise, return a character token for the Unicode character
1120 Consume the maximum number of characters possible, with the
1121 consumed characters matching one of the identifiers in the first
1122 column of the named character references table (in a
1128 If the last character matched is not a U+003B SEMICOLON (;),
1131 If the character reference is being consumed as part of an
1132 attribute, and the last character matched is not a U+003B
1133 SEMICOLON (;), and the next character is in the range U+0030
1137 all the characters that were matched after the U+0026 AMPERSAND
1140 Otherwise, return a character token for the character
1141 corresponding to the character reference name (as given by the
1142 second column of the named character references table).
1144 If the markup contains I'm &notit; I tell you, the character
1146 the markup was I'm &notin; I tell you, the character reference