# WebM Parser {#mainpage}

# Introduction

This WebM parser is a C++11-based parser that aims to be a safe and complete
parser for WebM. It supports all WebM elements (from the old deprecated ones to
the newest ones like `Colour`), including recursive elements like `ChapterAtom`
and `SimpleTag`. It supports incremental parsing; parsing may be stopped at any
point and resumed later as needed. It also supports starting at an arbitrary
WebM element, so parsing need not start from the beginning of the file.

The parser (`WebmParser`) works by being fed input data from a data source (an
instance of `Reader`) that represents a WebM file. The parser will parse the
WebM data into various data structures that represent the encoded WebM elements,
and then call corresponding `Callback` event methods as the data structures are
parsed.

# Building

CMake support has been added to the root libwebm `CMakeLists.txt` file. Simply
enable the `ENABLE_WEBM_PARSER` feature if using the interactive CMake builder,
or alternatively pass the `-DENABLE_WEBM_PARSER:BOOL=ON` flag from the command
line. By default, this parser is not enabled when building libwebm, so you must
explicitly enable it.

Alternatively, the following illustrates the minimal commands necessary to
compile the code into a static library without CMake:

```.sh
c++ -Iinclude -I. -std=c++11 -c src/*.cc
ar rcs libwebm.a *.o
```

# Using the parser

The parser has three basic components: `Reader`, `Callback`, and
`WebmParser`.

## `Reader`

The `Reader` interface acts as a data source for the parser. You may subclass it
and implement your own data source if you wish. Alternatively, use the
`FileReader`, `IstreamReader`, or `BufferReader` if you wish to read from a
`FILE*`, `std::istream`, or `std::vector<std::uint8_t>`, respectively.

The parser supports `Reader` implementations that do short reads. If
`Reader::Skip()` or `Reader::Read()` do a partial read (returning
`Status::kOkPartial`), the parser will call them again in an attempt to read
more data. If no data is available, the `Reader` may return some other status
(like `Status::kWouldBlock`) to indicate that no data is available. In this
situation, the parser will stop parsing and return the status it received.
Parsing may be resumed later when more data is available.

When the `Reader` has reached the end of the WebM document and no more data is
available, it should return `Status::kEndOfFile`. This will cause parsing to
stop. If the file ends at a valid location (that is, there aren't any elements
that have specified a size that indicates the file ended prematurely), the
parser will translate `Status::kEndOfFile` into `Status::kOkCompleted` and
return it. If the file ends prematurely, the parser will return
`Status::kEndOfFile` to indicate that.

Note that if the WebM file contains elements that have an unknown size (or a
seek has been performed and the parser doesn't know the size of the root
element(s)), and the parser is parsing them and hits end-of-file, the parser may
still call `Reader::Read()`/`Reader::Skip()` multiple times (even though they've
already reported `Status::kEndOfFile`) as nested parsers terminate parsing.
Because of this, `Reader::Read()`/`Reader::Skip()` implementations should be
able to handle being called multiple times after the file's end has been
reached, and they should consistently return `Status::kEndOfFile`.

The three provided readers (`FileReader`, `IstreamReader`, and `BufferReader`)
are blocking implementations (they won't return `Status::kWouldBlock`), so if
you're using them the parser will run until it entirely consumes all their data
(unless, of course, you request the parser to stop via `Callback`... see the
next section).

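As a rough illustration of the short-read and end-of-file behavior described in
this section, the sketch below shows a data source that deliberately does short
reads and keeps reporting end-of-file once its data is exhausted. It is not part
of the library: `ChunkedReader` is a hypothetical name, and the `Status` and
`Reader` definitions are simplified stand-ins (placeholder status values,
assumed method signatures, and none of the other members the real interface may
declare) so that the block compiles on its own. With the real library you would
derive from `webm::Reader` instead and check the exact signatures in its header.

```.cc
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Simplified stand-ins so this sketch compiles on its own. The enum values are
// placeholders, and the method signatures are assumed, not copied from the
// library.
struct Status {
  enum Code { kOkCompleted, kOkPartial, kWouldBlock, kEndOfFile };
  Code code;
};

class Reader {
 public:
  virtual ~Reader() = default;
  virtual Status Read(std::size_t num_to_read, std::uint8_t* buffer,
                      std::uint64_t* num_actually_read) = 0;
  virtual Status Skip(std::uint64_t num_to_skip,
                      std::uint64_t* num_actually_skipped) = 0;
};

// Hypothetical in-memory source that hands out at most `max_chunk` bytes per
// call, forcing the caller to cope with short reads.
class ChunkedReader : public Reader {
 public:
  ChunkedReader(std::vector<std::uint8_t> data, std::uint64_t max_chunk)
      : data_(std::move(data)), max_chunk_(max_chunk) {
    assert(max_chunk_ > 0);
  }

  Status Read(std::size_t num_to_read, std::uint8_t* buffer,
              std::uint64_t* num_actually_read) override {
    assert(buffer != nullptr);
    return Consume(num_to_read, buffer, num_actually_read);
  }

  Status Skip(std::uint64_t num_to_skip,
              std::uint64_t* num_actually_skipped) override {
    return Consume(num_to_skip, nullptr, num_actually_skipped);
  }

 private:
  // Consumes up to `requested` bytes, copying them to `buffer` if it is set.
  Status Consume(std::uint64_t requested, std::uint8_t* buffer,
                 std::uint64_t* consumed) {
    assert(requested > 0 && consumed != nullptr);
    const std::uint64_t remaining = data_.size() - position_;
    *consumed = std::min({requested, remaining, max_chunk_});
    if (*consumed == 0) {
      // Out of data. Every later call lands here too, so the status stays
      // consistent no matter how often the parser asks again.
      return Status{Status::kEndOfFile};
    }
    if (buffer != nullptr) {
      std::memcpy(buffer, data_.data() + position_,
                  static_cast<std::size_t>(*consumed));
    }
    position_ += *consumed;
    return Status{*consumed == requested ? Status::kOkCompleted
                                         : Status::kOkPartial};
  }

  std::vector<std::uint8_t> data_;
  std::uint64_t max_chunk_;
  std::uint64_t position_ = 0;
};
```

A non-blocking source would follow the same shape but return
`Status::kWouldBlock` when it is temporarily out of data rather than at the true
end of the file.
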
## `Callback`

As the parser progresses through the file, it builds objects (see
`webm/dom_types.h`) that represent parsed data structures. The parser then
notifies the `Callback` implementation as objects complete parsing. For some
data structures (like frames or Void elements), the parser notifies the
`Callback` and requests it to consume the data directly from the `Reader` (this
is done for structures that can be large/frequent binary blobs in order to allow
you to read the data directly into the object/type of your choice, rather than
just reading them into a `std::vector<std::uint8_t>` and making you copy it into
a different object if you wanted to work with something other than
`std::vector<std::uint8_t>`).

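To make the idea of consuming frame data directly from the `Reader` more
concrete, here is a minimal sketch of a helper that a frame-handling callback
might delegate to. It assumes the callback is handed the `Reader` together with
a count of the frame bytes still unread, and that a partial read always delivers
at least one byte. `ReadFrameInto` and `TrickleReader` are hypothetical names,
and `Status` and `Reader` are simplified stand-ins with placeholder values so
the block compiles on its own.

```.cc
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Simplified stand-ins; the real types live in the library's headers.
struct Status {
  enum Code { kOkCompleted, kOkPartial, kWouldBlock, kEndOfFile };
  Code code;
};

class Reader {
 public:
  virtual ~Reader() = default;
  virtual Status Read(std::size_t num_to_read, std::uint8_t* buffer,
                      std::uint64_t* num_actually_read) = 0;
};

// Appends the frame's unread bytes to `out`. `bytes_remaining` is decremented
// as data arrives, so a later call (for example after kWouldBlock) picks up
// where this one stopped.
Status ReadFrameInto(Reader* reader, std::uint64_t* bytes_remaining,
                     std::vector<std::uint8_t>* out) {
  assert(reader != nullptr && bytes_remaining != nullptr && out != nullptr);
  while (*bytes_remaining > 0) {
    const std::size_t old_size = out->size();
    const std::size_t want = static_cast<std::size_t>(*bytes_remaining);
    out->resize(old_size + want);
    std::uint64_t got = 0;
    const Status status = reader->Read(want, out->data() + old_size, &got);
    // Keep only the bytes that were actually delivered.
    out->resize(old_size + static_cast<std::size_t>(got));
    *bytes_remaining -= got;
    if (status.code != Status::kOkCompleted &&
        status.code != Status::kOkPartial) {
      return status;  // Let the caller decide whether to resume later.
    }
  }
  return Status{Status::kOkCompleted};
}

// Toy data source used to exercise the helper: one byte per call at most.
class TrickleReader : public Reader {
 public:
  explicit TrickleReader(std::vector<std::uint8_t> data)
      : data_(std::move(data)) {}

  Status Read(std::size_t num_to_read, std::uint8_t* buffer,
              std::uint64_t* num_actually_read) override {
    *num_actually_read = 0;
    if (next_ == data_.size()) return Status{Status::kEndOfFile};
    if (num_to_read == 0) return Status{Status::kOkCompleted};
    buffer[0] = data_[next_++];
    *num_actually_read = 1;
    return Status{num_to_read == 1 ? Status::kOkCompleted
                                   : Status::kOkPartial};
  }

 private:
  std::vector<std::uint8_t> data_;
  std::size_t next_ = 0;
};
```
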
The parser was designed to parse the data into objects that are small enough
that the `Callback` can be quickly and frequently notified as soon as the object
is ready, but large enough that the objects received by the `Callback` are still
useful. Having `Callback` events for every tiny integer/float/string/etc.
element would require too much assembly and work to be useful to most users, and
parsing the file into a single DOM tree (or a small handful of large conglomerate
structures) would unnecessarily delay video playback or consume too much memory
on smaller devices.

The parser may call the following methods nearly anywhere in the file:

-   `Callback::OnElementBegin()`: This is called for every element that the
    parser encounters. This is primarily useful if you want to skip some
    elements or build a map of every element in the file.
-   `Callback::OnUnknownElement()`: This is called when an element is either not
    a valid/recognized WebM element, or it is a WebM element but is improperly
    nested (e.g. an EBMLVersion element inside of a Segment element). The parser
    doesn't know how to handle the element; it could just skip it but instead
    defers to the `Callback` to decide how it should be handled. The default
    implementation just skips the element.
-   `Callback::OnVoid()`: Void elements can appear anywhere in any master
    element. This method will be called to handle the Void element.

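The element-skipping use of `Callback::OnElementBegin()` mentioned in the list
above could look roughly like the following sketch. It assumes that
`Action::kSkip` exists as the counterpart of `Action::kRead`; the types here are
cut-down stand-ins so the block compiles on its own, and `SkippingCallback` is a
hypothetical name. With the real library the class would derive from
`webm::Callback` and mark the method `override`.

```.cc
#include <cassert>
#include <cstdint>

// Stand-ins so the sketch compiles on its own. `Action::kSkip` is an
// assumption, and the real metadata carries more than the element ID.
enum class Action { kRead, kSkip };
struct Status {
  enum Code { kOkCompleted };
  Code code;
};
struct ElementMetadata {
  std::uint32_t id;
};

// Element IDs as defined by the Matroska/WebM specification.
constexpr std::uint32_t kCuesId = 0x1C53BB6B;
constexpr std::uint32_t kTagsId = 0x1254C367;

// Hypothetical callback that skips the Cues and Tags elements but reads
// everything else.
class SkippingCallback {
 public:
  Status OnElementBegin(const ElementMetadata& metadata, Action* action) {
    assert(action != nullptr);
    const bool skip = metadata.id == kCuesId || metadata.id == kTagsId;
    *action = skip ? Action::kSkip : Action::kRead;
    if (skip) ++skipped_count_;
    return Status{Status::kOkCompleted};
  }

  int skipped_count() const { return skipped_count_; }

 private:
  int skipped_count_ = 0;
};
```
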
The parser may call the following methods in the proper nesting order, as shown
in the list. A `*Begin()` method will always be matched up with its
corresponding `*End()` method (unless a seek has been performed). The parser
will only call the methods in the proper nesting order as specified in the WebM
DOM. For example, `Callback::OnEbml()` will never be called in between
`Callback::OnSegmentBegin()`/`Callback::OnSegmentEnd()` (since the EBML element
is not a child of the Segment element), and `Callback::OnTrackEntry()` will only
ever be called in between
`Callback::OnSegmentBegin()`/`Callback::OnSegmentEnd()` (since the TrackEntry
element is a (grand-)child of the Segment element and must be contained by a
Segment element). `Callback::OnFrame()` is listed twice because it will be
called to handle frames contained in both SimpleBlock and Block elements.

-   `Callback::OnEbml()`
-   `Callback::OnSegmentBegin()`
    -   `Callback::OnSeek()`
    -   `Callback::OnInfo()`
    -   `Callback::OnClusterBegin()`
        -   `Callback::OnSimpleBlockBegin()`
            -   `Callback::OnFrame()`
        -   `Callback::OnSimpleBlockEnd()`
        -   `Callback::OnBlockGroupBegin()`
            -   `Callback::OnBlockBegin()`
                -   `Callback::OnFrame()`
            -   `Callback::OnBlockEnd()`
        -   `Callback::OnBlockGroupEnd()`
    -   `Callback::OnClusterEnd()`
    -   `Callback::OnTrackEntry()`
    -   `Callback::OnCuePoint()`
    -   `Callback::OnEditionEntry()`
    -   `Callback::OnTag()`
-   `Callback::OnSegmentEnd()`

Only `Callback::OnFrame()` (and no other `Callback` methods) will be called in
between `Callback::OnSimpleBlockBegin()`/`Callback::OnSimpleBlockEnd()` or
`Callback::OnBlockBegin()`/`Callback::OnBlockEnd()`, since the SimpleBlock and
Block elements are not master elements and only contain frames.

Note that seeking into the middle of the file may cause the parser to skip some
`*Begin()` methods. For example, if a seek is performed to a SimpleBlock
element, `Callback::OnSegmentBegin()` and `Callback::OnClusterBegin()` will not
be called. In this situation, the full sequence of callback events would be
(assuming the file ended after the SimpleBlock):
`Callback::OnSimpleBlockBegin()`, `Callback::OnFrame()` (for every frame in the
SimpleBlock), `Callback::OnSimpleBlockEnd()`, `Callback::OnClusterEnd()`, and
`Callback::OnSegmentEnd()`. Since the Cluster and Segment elements were skipped,
the `Cluster` DOM object may have some members marked as absent, and the
`*End()` events for the Cluster and Segment elements will have metadata with
unknown header position, header length, and body size (see `kUnknownHeaderSize`,
`kUnknownElementSize`, and `kUnknownElementPosition`).

When a `Callback` method has completed, it should return `Status::kOkCompleted`
to allow parsing to continue. If you would like parsing to stop, return any
other status code (except `Status::kEndOfFile`, since that's treated somewhat
specially and is intended for `Reader`s to use), which the parser will return.
If you return a non-parsing-error status code (e.g. `Status::kOkPartial`,
`Status::kWouldBlock`, etc. or your own status code with a value > 0), parsing
may be resumed again. When parsing is resumed, the parser will call the same
callback method again (and once again, you may return `Status::kOkCompleted` to
let parsing continue or some other value to stop parsing).

You may subclass the `Callback` class and override the methods for which you
are interested in receiving events. By default, methods taking an `Action`
parameter will set it to `Action::kRead` so the entire file is parsed. The
`Callback::OnFrame()` method will just skip over the frame bytes by default.

## `WebmParser`

The actual parsing work is done with `WebmParser`. Simply construct a
`WebmParser` and call `WebmParser::Feed()` (providing it a `Callback` and
`Reader` instance) to parse a file. It will return `Status::kOkCompleted` when
the entire file has been successfully parsed. `WebmParser::Feed()` doesn't store
any internal references to the `Callback` or `Reader`.

If you wish to start parsing from the middle of a file, call
`WebmParser::DidSeek()` before calling `WebmParser::Feed()` to prepare the
parser to receive data starting at an arbitrary point in the file. When seeking,
you should seek to the beginning of a WebM element; seeking to a location that
is not the start of a WebM element (e.g. seeking to a frame, rather than its
containing SimpleBlock/Block element) will cause parsing to fail. Calling
`WebmParser::DidSeek()` will reset the state of the parser and clear any
internal errors, so a `WebmParser` instance may be reused (even if it has
previously failed to parse a file).

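Because parsing can stop with `Status::kWouldBlock` and be resumed later, a
caller with a non-blocking `Reader` typically ends up calling `Feed()` in a
loop. The sketch below shows one possible shape for such a loop. The
`FeedUntilDone` helper, its `wait_for_data` hook, and `ScriptedParser` are
hypothetical, and `Status` is a stand-in with placeholder values so the block
compiles without the library.

```.cc
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the library's status type; the values are placeholders.
struct Status {
  enum Code { kOkCompleted, kWouldBlock, kEndOfFile };
  Code code;
  bool completed_ok() const { return code == kOkCompleted; }
};

// Calls Feed() until it returns something other than kWouldBlock. The
// `wait_for_data` callable is a hypothetical hook that should block (or poll)
// until the data source has more bytes to offer.
template <typename Parser, typename Callback, typename Reader, typename Wait>
Status FeedUntilDone(Parser* parser, Callback* callback, Reader* reader,
                     Wait wait_for_data) {
  assert(parser != nullptr);
  Status status = parser->Feed(callback, reader);
  while (status.code == Status::kWouldBlock) {
    wait_for_data();
    // The parser kept its state, so feeding again resumes where it stopped.
    status = parser->Feed(callback, reader);
  }
  return status;
}

// Scripted parser used only to exercise the helper: each Feed() call returns
// the next status from `script`.
struct ScriptedParser {
  std::vector<Status::Code> script;
  std::size_t calls = 0;

  Status Feed(void* /*callback*/, void* /*reader*/) {
    assert(calls < script.size());
    return Status{script[calls++]};
  }
};
```

With the real library, the parser, callback, and reader arguments would be a
`webm::WebmParser`, a `webm::Callback` subclass, and your own non-blocking
`Reader`, and the helper would return `webm::Status`.
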
## Building your program

The following small program completely parses a file from stdin:

```.cc
#include <cstdio>

#include <webm/callback.h>
#include <webm/file_reader.h>
#include <webm/webm_parser.h>

int main() {
  webm::Callback callback;
  webm::FileReader reader(std::freopen(nullptr, "rb", stdin));
  webm::WebmParser parser;
  parser.Feed(&callback, &reader);
}
```

It completely parses the input file, but we need to make a new class that
derives from `Callback` if we want to receive any parsing events. So if we
change it to:

```.cc
#include <cstdint>
#include <cstdio>
#include <iomanip>
#include <iostream>

#include <webm/callback.h>
#include <webm/file_reader.h>
#include <webm/status.h>
#include <webm/webm_parser.h>

class MyCallback : public webm::Callback {
 public:
  webm::Status OnElementBegin(const webm::ElementMetadata& metadata,
                              webm::Action* action) override {
    std::cout << "Element ID = 0x"
              << std::hex << static_cast<std::uint32_t>(metadata.id);
    std::cout << std::dec;  // Reset to decimal mode.
    std::cout << " at position ";
    if (metadata.position == webm::kUnknownElementPosition) {
      // The position will only be unknown if we've done a seek. But since we
      // aren't seeking in this demo, this will never be the case. However, this
      // if-statement is included for completeness.
      std::cout << "<unknown>";
    } else {
      std::cout << metadata.position;
    }
    std::cout << " with header size ";
    if (metadata.header_size == webm::kUnknownHeaderSize) {
      // The header size will only be unknown if we've done a seek. But since we
      // aren't seeking in this demo, this will never be the case. However, this
      // if-statement is included for completeness.
      std::cout << "<unknown>";
    } else {
      std::cout << metadata.header_size;
    }
    std::cout << " and body size ";
    if (metadata.size == webm::kUnknownElementSize) {
      // WebM master elements may have an unknown size, though this is rare.
      std::cout << "<unknown>";
    } else {
      std::cout << metadata.size;
    }
    std::cout << '\n';

    *action = webm::Action::kRead;
    return webm::Status(webm::Status::kOkCompleted);
  }
};

int main() {
  MyCallback callback;
  webm::FileReader reader(std::freopen(nullptr, "rb", stdin));
  webm::WebmParser parser;
  webm::Status status = parser.Feed(&callback, &reader);
  if (status.completed_ok()) {
    std::cout << "Parsing successfully completed\n";
  } else {
    std::cout << "Parsing failed with status code: " << status.code << '\n';
  }
}
```

This will output information about every element in the entire file: its ID,
position, header size, and body size. The status of the parse is also checked
and reported.

For a more complete example, see `demo/demo.cc`, which parses an entire file and
prints out all of its information. That example overrides every `Callback`
method to show exactly what information is available while parsing and how to
access it. The example is verbose, but that's primarily due to pretty-printing
and string formatting operations.

When compiling your program, add the `include` directory to your compiler's
header search paths and link to the compiled library. Be sure your compiler has
C++11 mode enabled (`-std=c++11` in clang++ or g++).

# Testing

Unit tests are located in the `tests` directory. Google Test and Google Mock are
used as testing frameworks. Building and running the tests will be supported in
the upcoming CMake scripts, but they can currently be built and run by manually
compiling them (and linking to Google Test and Google Mock).

# Fuzzing

The parser has been fuzzed with [AFL](http://lcamtuf.coredump.cx/afl/) and
[libFuzzer](http://llvm.org/docs/LibFuzzer.html). If you wish to fuzz the parser
with AFL or libFuzzer but don't want to write an executable that exercises the
parsing API, you may use `fuzzing/webm_fuzzer.cc`.

When compiling for fuzzing, define the macro
`WEBM_FUZZER_BYTE_ELEMENT_SIZE_LIMIT` to be some integer in order to limit the
maximum size of ASCII/UTF-8/binary elements. It's too easy for the fuzzer to
generate elements that claim to have a ridiculously massive size, which will
cause allocations to fail or the program to allocate too much memory. AFL will
terminate the process if it allocates too much memory (by default, 50 MB), and
the [Address Sanitizer doesn't throw `std::bad_alloc` when an allocation
fails](https://github.com/google/sanitizers/issues/295). Defining
`WEBM_FUZZER_BYTE_ELEMENT_SIZE_LIMIT` to a low number (say, 1024) will cause the
ASCII/UTF-8/binary element parsers to return `Status::kNotEnoughMemory` if the
element's size exceeds `WEBM_FUZZER_BYTE_ELEMENT_SIZE_LIMIT`, which will avoid
false positives when fuzzing. The parser expects `std::string` and `std::vector`
to throw `std::bad_alloc` when an allocation fails, which doesn't necessarily
happen due to the fuzzers' limitations.

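The effect of the size limit can be pictured with the following sketch. It is
not the library's actual code: `ReserveElementBody` is a hypothetical helper,
`Status` is a stand-in with placeholder values, and the sketch gives the macro a
default of 1024 purely so the block compiles on its own (the paragraph above
only describes what happens when you define it yourself).

```.cc
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Default chosen for this sketch only, so that it builds without extra flags.
#ifndef WEBM_FUZZER_BYTE_ELEMENT_SIZE_LIMIT
#define WEBM_FUZZER_BYTE_ELEMENT_SIZE_LIMIT 1024
#endif

// Stand-in for the library's status type; the values are placeholders.
struct Status {
  enum Code { kOkCompleted, kNotEnoughMemory };
  Code code;
};

// Hypothetical helper: sizes `body` for an element that claims to contain
// `element_size` bytes, refusing claims above the fuzzing limit.
Status ReserveElementBody(std::uint64_t element_size,
                          std::vector<std::uint8_t>* body) {
  assert(body != nullptr);
  if (element_size > WEBM_FUZZER_BYTE_ELEMENT_SIZE_LIMIT) {
    // Reject oversized claims up front instead of relying on the allocator
    // to throw, which may not happen under a fuzzer or sanitizer.
    return Status{Status::kNotEnoughMemory};
  }
  try {
    body->resize(static_cast<std::size_t>(element_size));
  } catch (const std::bad_alloc&) {
    return Status{Status::kNotEnoughMemory};
  }
  return Status{Status::kOkCompleted};
}
```
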
You may also define the macro `WEBM_FUZZER_SEEK_FIRST` to have
`fuzzing/webm_fuzzer.cc` call `WebmParser::DidSeek()` before doing any parsing.
This will test the seeking code paths.