1<?xml version="1.0"?>
2<!--
3
4   Licensed to the Apache Software Foundation (ASF) under one or more
5   contributor license agreements.  See the NOTICE file distributed with
6   this work for additional information regarding copyright ownership.
7   The ASF licenses this file to You under the Apache License, Version 2.0
8   (the "License"); you may not use this file except in compliance with
9   the License.  You may obtain a copy of the License at
10
11       http://www.apache.org/licenses/LICENSE-2.0
12
13   Unless required by applicable law or agreed to in writing, software
14   distributed under the License is distributed on an "AS IS" BASIS,
15   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
16   See the License for the specific language governing permissions and
17   limitations under the License.
18
19-->
20<document>
21  <properties>
22    <title>Commons Compress ZIP package</title>
23    <author email="dev@commons.apache.org">Commons Documentation Team</author>
24  </properties>
25  <body>
26    <section name="The ZIP package">
27
28      <p>The ZIP package provides features not found
29        in <code>java.util.zip</code>:</p>
30
31      <ul>
        <li>Support for encodings other than UTF-8 for filenames and
          comments.  Starting with Java 7 this is supported
          by <code>java.util.zip</code> as well.</li>
        <li>Access to internal and external attributes (which are used
          to store Unix permissions by some zip implementations).</li>
37        <li>Structured support for extra fields.</li>
38      </ul>
39
      <p>In addition to the information stored
        in <code>ArchiveEntry</code>, a <code>ZipArchiveEntry</code>
        stores internal and external attributes as well as extra
        fields.  These may contain information such as Unix
        permissions, the platform the entry was created on, its
        last modification time and an optional comment.</p>
46
47      <subsection name="ZipArchiveInputStream vs ZipFile">
48
        <p>ZIP archives store archive entries in sequence and
          contain a registry of all entries at the very end of the
          archive.  It is acceptable for an archive to contain several
          entries of the same name and have the registry (called the
          central directory) decide which entry is actually to be used
          (if any).</p>
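
        <p>The position of the registry can be made visible with a
          few lines of plain Java.  The following is a minimal sketch
          that uses only <code>java.util.zip</code>, not Commons
          Compress; the class and method names are made up for
          illustration.  It assumes the archive carries no archive
          comment, in which case the "end of central directory" record
          occupies the last 22 bytes and holds the total number of
          entries at offset 10.</p>

        <source><![CDATA[
```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Illustrative helper, not part of Commons Compress.
final class EndOfCentralDirectoryDemo {

    /** Signature that starts the "end of central directory" record. */
    static final int EOCD_SIGNATURE = 0x06054b50;

    /** The record is 22 bytes long when the archive comment is empty. */
    static final int EOCD_LENGTH = 22;

    /** Builds a small in-memory archive with one entry per name. */
    static byte[] writeArchive(String... names) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (String name : names) {
                out.putNextEntry(new ZipEntry(name));
                out.write(name.getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Little-endian view of the trailing record (assumes no archive comment). */
    private static ByteBuffer trailer(byte[] archive) {
        return ByteBuffer.wrap(archive, archive.length - EOCD_LENGTH, EOCD_LENGTH)
                .slice()
                .order(ByteOrder.LITTLE_ENDIAN);
    }

    /** Reads the four signature bytes at the start of the trailing record. */
    static int signature(byte[] archive) {
        return trailer(archive).getInt(0);
    }

    /** Reads the total number of entries recorded in the central directory. */
    static int entryCount(byte[] archive) {
        return trailer(archive).getShort(10) & 0xFFFF;
    }

    public static void main(String[] args) {
        byte[] archive = writeArchive("a.txt", "b.txt", "c.txt");
        System.out.println("signature ok: " + (signature(archive) == EOCD_SIGNATURE));
        System.out.println("entries in central directory: " + entryCount(archive));
    }
}
```
]]></source>

        <p>A reader that can seek starts from this trailing record; a
          reader that consumes a stream from the front only gets to it
          after every entry has already passed by.</p>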
55
        <p>In addition the ZIP format stores certain information only
          inside the central directory but not together with the entry
          itself:</p>
59
60        <ul>
61          <li>internal and external attributes</li>
62          <li>different or additional extra fields</li>
63        </ul>
64
65        <p>This means the ZIP format cannot really be parsed
66          correctly while reading a non-seekable stream, which is what
67          <code>ZipArchiveInputStream</code> is forced to do.  As a
68          result <code>ZipArchiveInputStream</code></p>
69        <ul>
70          <li>may return entries that are not part of the central
71            directory at all and shouldn't be considered part of the
72            archive.</li>
73          <li>may return several entries with the same name.</li>
74          <li>will not return internal or external attributes.</li>
75          <li>may return incomplete extra field data.</li>
76          <li>may return unknown sizes and CRC values for entries
77          until the next entry has been reached if the archive uses
78          the data descriptor feature (see below).</li>
79        </ul>
80
81        <p><code>ZipArchiveInputStream</code> shares these limitations
82          with <code>java.util.zip.ZipInputStream</code>.</p>
83
84        <p><code>ZipFile</code> is able to read the central directory
85          first and provide correct and complete information on any
86          ZIP archive.</p>
87
        <p>ZIP archives support a feature called the data descriptor,
          which is a way to store an entry's size and CRC after the
          entry's data.  This can only work reliably if the size
          information can be taken from the central directory or the
          data itself can signal it is complete, which is true for
          data that is compressed using the DEFLATED compression
          algorithm.</p>
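
        <p>The effect of the data descriptor on a streaming reader
          can be observed without Commons Compress.  The sketch below
          relies on <code>java.util.zip</code> only, whose
          <code>ZipInputStream</code> is subject to the same streaming
          limitations; the class and method names are hypothetical.
          It assumes that <code>java.util.zip.ZipOutputStream</code>
          writes a data descriptor for a DEFLATED entry whose size has
          not been set in advance, and reports the size of the entry
          before and after its data has been read.</p>

        <source><![CDATA[
```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Illustrative helper, not part of Commons Compress.
final class DataDescriptorDemo {

    /** Writes one DEFLATED entry without announcing its size up front. */
    static byte[] writeArchive(byte[] payload) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry("hello.txt")); // size and CRC unknown here
            out.write(payload);
            out.closeEntry(); // sizes and CRC are written after the data
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Returns { size seen right after getNextEntry(), size seen after draining the entry }. */
    static long[] sizesBeforeAndAfter(byte[] archive) {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry = in.getNextEntry();
            long before = entry.getSize(); // -1 means "unknown"
            byte[] buffer = new byte[4096];
            while (in.read(buffer) != -1) {
                // discard the data, we only care about the metadata
            }
            long after = entry.getSize();
            return new long[] { before, after };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = "hello, zip".getBytes(StandardCharsets.UTF_8);
        long[] sizes = sizesBeforeAndAfter(writeArchive(payload));
        System.out.println("size before reading: " + sizes[0]);
        System.out.println("size after reading:  " + sizes[1]);
    }
}
```
]]></source>

        <p>The size only becomes available once the compressed data
          has signalled its own end, which is why the approach breaks
          down for STORED entries.</p>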
94
95        <p><code>ZipFile</code> has access to the central directory
96          and can extract entries using the data descriptor reliably.
97          The same is true for <code>ZipArchiveInputStream</code> as
98          long as the entry is DEFLATED.  For STORED
99          entries <code>ZipArchiveInputStream</code> can try to read
100          ahead until it finds the next entry, but this approach is
101          not safe and has to be enabled by a constructor argument
102          explicitly.</p>
103
104        <p>If possible, you should always prefer <code>ZipFile</code>
105          over <code>ZipArchiveInputStream</code>.</p>
106
107        <p><code>ZipFile</code> requires a
108        <code>SeekableByteChannel</code> that will be obtained
109        transparently when reading from a file. The class
110        <code>org.apache.commons.compress.utils.SeekableInMemoryByteChannel</code>
111        allows you to read from an in-memory archive.</p>
112
113      </subsection>
114
115      <subsection name="ZipArchiveOutputStream" id="ZipArchiveOutputStream">
        <p><code>ZipArchiveOutputStream</code> has three constructors:
        one takes a <code>File</code> argument, one a
        <code>SeekableByteChannel</code> and the last an
        <code>OutputStream</code>.  The <code>File</code> version will
        try to use <code>SeekableByteChannel</code> and fall back to
        using a <code>FileOutputStream</code> internally if that
        fails.</p>
123
124        <p>If <code>ZipArchiveOutputStream</code> can
125          use <code>SeekableByteChannel</code> it can employ some
126          optimizations that lead to smaller archives.  It also makes
127          it possible to add uncompressed (<code>setMethod</code> used
128          with <code>STORED</code>) entries of unknown size when
129          calling <code>putArchiveEntry</code> - this is not allowed
130          if <code>ZipArchiveOutputStream</code> has to use
131          an <code>OutputStream</code>.</p>
132
133        <p>If you know you are writing to a file, you should always
134        prefer the <code>File</code>- or
135        <code>SeekableByteChannel</code>-arg constructors.  The class
136        <code>org.apache.commons.compress.utils.SeekableInMemoryByteChannel</code>
137        allows you to write to an in-memory archive.</p>
138
139      </subsection>
140
141      <subsection name="Extra Fields">
142
143        <p>Inside a ZIP archive, additional data can be attached to
144          each entry.  The <code>java.util.zip.ZipEntry</code> class
145          provides access to this via the <code>get/setExtra</code>
146          methods as arrays of <code>byte</code>s.</p>
147
        <p>The extra data is actually supposed to be more structured
          than that, and Compress' ZIP package provides access to the
          structured data as <code>ZipExtraField</code> instances.
          Only a subset of all defined extra field formats is
          supported by the package; any other extra field will be
          stored as <code>UnrecognizedExtraField</code>.</p>
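
        <p>To give an idea of the structure involved: according to
          the ZIP specification the extra data is a sequence of
          records, each consisting of a two-byte header ID, a two-byte
          length and the payload, with both numbers stored in
          little-endian order.  The following sketch splits a raw
          byte array such as the one returned by
          <code>java.util.zip.ZipEntry.getExtra</code> into these
          records.  It is not how Compress implements its parsing, and
          the class and method names are invented for this
          example.</p>

        <source><![CDATA[
```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative helper, not part of Commons Compress.
final class ExtraFieldWalker {

    /**
     * Splits raw extra data into header ID -> payload, in order of appearance.
     * Stops silently at the first record that claims more bytes than are left.
     */
    static Map<Integer, byte[]> split(byte[] extra) {
        Map<Integer, byte[]> fields = new LinkedHashMap<>();
        ByteBuffer buffer = ByteBuffer.wrap(extra).order(ByteOrder.LITTLE_ENDIAN);
        while (buffer.remaining() >= 4) {
            int headerId = buffer.getShort() & 0xFFFF; // unsigned 16 bit
            int length = buffer.getShort() & 0xFFFF;   // unsigned 16 bit
            if (length > buffer.remaining()) {
                break; // truncated or unstructured data
            }
            byte[] payload = new byte[length];
            buffer.get(payload);
            fields.put(headerId, payload);
        }
        return fields;
    }

    public static void main(String[] args) {
        byte[] extra = {
            0x55, 0x54, 0x05, 0x00, 0x01, 0, 0, 0, 0, // ID 0x5455, five bytes of payload
            (byte) 0xFE, (byte) 0xCA, 0x00, 0x00      // ID 0xCAFE, empty payload
        };
        split(extra).forEach((id, payload) ->
            System.out.println(String.format("0x%04X: %d byte(s)", id, payload.length)));
    }
}
```
]]></source>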
154
155        <p>Prior to version 1.1 of this library trying to read an
156          archive with extra fields that didn't follow the recommended
157          structure for those fields would cause Compress to throw an
158          exception.  Starting with version 1.1 these extra fields
159          will now be read
160          as <code>UnparseableExtraFieldData</code>.</p>
161
162      </subsection>
163
164      <subsection name="Encoding" id="encoding">
165
        <p>Traditionally the ZIP archive format uses CodePage 437 as
          the encoding for file names, which is not sufficient for
          many international character sets.</p>
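
        <p>Whether a given name fits into CodePage 437 can be checked
          with the JDK's charset support.  This is a small sketch
          only: it assumes the runtime ships the extended charset
          registered under the name <code>IBM437</code>, and the class
          and method names are hypothetical.</p>

        <source><![CDATA[
```java
import java.nio.charset.Charset;

// Illustrative helper, not part of Commons Compress.
final class Cp437Check {

    /** Name under which the JDK registers code page 437 (assumed to be available). */
    static final String CP437 = "IBM437";

    /** Returns true if every character of the name can be encoded in code page 437. */
    static boolean fitsCp437(String name) {
        return Charset.forName(CP437).newEncoder().canEncode(name);
    }

    public static void main(String[] args) {
        String[] names = { "readme.txt", "r\u00e9sum\u00e9.txt", "\u65e5\u672c\u8a9e.txt" };
        for (String name : names) {
            System.out.println(name + " fits CodePage 437: " + fitsCp437(name));
        }
    }
}
```
]]></source>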
169
        <p>Over time different archivers have chosen different ways to
          work around the limitation - the <code>java.util.zip</code>
          package simply uses UTF-8 as its encoding, for example.</p>
173
174        <p>Ant has been offering the encoding attribute of the zip and
175          unzip task as a way to explicitly specify the encoding to
176          use (or expect) since Ant 1.4.  It defaults to the
177          platform's default encoding for zip and UTF-8 for jar and
178          other jar-like tasks (war, ear, ...) as well as the unzip
179          family of tasks.</p>
180
181        <p>More recent versions of the ZIP specification introduce
182          something called the &quot;language encoding flag&quot;
183          which can be used to signal that a file name has been
184          encoded using UTF-8.  All ZIP-archives written by Compress
185          will set this flag, if the encoding has been set to UTF-8.
186          Our interoperability tests with existing archivers didn't
187          show any ill effects (in fact, most archivers ignore the
188          flag to date), but you can turn off the "language encoding
189          flag" by setting the attribute
190          <code>useLanguageEncodingFlag</code> to <code>false</code> on the
191          <code>ZipArchiveOutputStream</code> if you should encounter
192          problems.</p>
193
194        <p>The <code>ZipFile</code>
195          and <code>ZipArchiveInputStream</code> classes will
196          recognize the language encoding flag and ignore the encoding
197          set in the constructor if it has been found.</p>
198
        <p>The InfoZIP developers have introduced new ZIP extra fields
          that can be used to add an additional UTF-8 encoded file
          name to the entry's metadata.  Most archivers ignore these
          extra fields.  <code>ZipArchiveOutputStream</code> supports
          an option <code>createUnicodeExtraFields</code> which makes
          it write these extra fields either for all entries
          ("always") or only those whose name cannot be encoded using
          the specified encoding ("not encodeable").  It defaults to
          "never" since the extra fields create bigger archives.</p>
208
209        <p>The fallbackToUTF8 attribute
210          of <code>ZipArchiveOutputStream</code> can be used to create
211          archives that use the specified encoding in the majority of
212          cases but UTF-8 and the language encoding flag for filenames
213          that cannot be encoded using the specified encoding.</p>
214
215        <p>The <code>ZipFile</code>
216          and <code>ZipArchiveInputStream</code> classes recognize the
217          Unicode extra fields by default and read the file name
218          information from them, unless you set the constructor parameter
219          <code>scanForUnicodeExtraFields</code> to false.</p>
220
221        <h4>Recommendations for Interoperability</h4>
222
223        <p>The optimal setting of flags depends on the archivers you
224          expect as consumers/producers of the ZIP archives.  Below
225          are some test results which may be superseded with later
226          versions of each tool.</p>
227
228        <ul>
          <li>Prior to Java 7 the java.util.zip package used by the jar
            executable or to read jars from your CLASSPATH reads and
            writes UTF-8 names; it doesn't set or recognize any flags
            or Unicode extra fields.</li>

          <li>Starting with Java 7 <code>java.util.zip</code> writes
            UTF-8 by default and uses the language encoding flag.  It
            is possible to specify a different encoding when
            reading/writing ZIPs via new constructors.  The package
            now recognizes the language encoding flag when reading and
            ignores the Unicode extra fields.</li>
240
241          <li>7Zip writes CodePage 437 by default but uses UTF-8 and
242            the language encoding flag when writing entries that
243            cannot be encoded as CodePage 437 (similar to the zip task
244            with fallbacktoUTF8 set to true).  It recognizes the
245            language encoding flag when reading and ignores the
246            Unicode extra fields.</li>
247
248          <li>WinZIP writes CodePage 437 and uses Unicode extra fields
249            by default.  It recognizes the Unicode extra field and the
250            language encoding flag when reading.</li>
251
          <li>Windows' "compressed folder" feature doesn't recognize
            any flag or extra field and creates archives using the
            platform's default encoding - and expects archives to be in
            that encoding when reading them.</li>

          <li>InfoZIP based tools can recognize and write both; this is
            a compile time option and depends on the platform so your
            mileage may vary.</li>
260
261          <li>PKWARE zip tools recognize both and prefer the language
262            encoding flag.  They create archives using CodePage 437 if
263            possible and UTF-8 plus the language encoding flag for
264            file names that cannot be encoded as CodePage 437.</li>
265        </ul>
266
267        <p>So, what to do?</p>
268
        <p>If you are creating jars, then java.util.zip is your main
          consumer.  We recommend you set the encoding to UTF-8 and
          keep the language encoding flag enabled.  The flag won't
          help or hurt java.util.zip prior to Java 7 but archivers that
          support it will show the correct file names.</p>
274
275        <p>For maximum interop it is probably best to set the encoding
276          to UTF-8, enable the language encoding flag and create
277          Unicode extra fields when writing ZIPs.  Such archives
278          should be extracted correctly by java.util.zip, 7Zip,
279          WinZIP, PKWARE tools and most likely InfoZIP tools.  They
280          will be unusable with Windows' "compressed folders" feature
281          and bigger than archives without the Unicode extra fields,
282          though.</p>
283
284        <p>If Windows' "compressed folders" is your primary consumer,
285          then your best option is to explicitly set the encoding to
286          the target platform.  You may want to enable creation of
287          Unicode extra fields so the tools that support them will
288          extract the file names correctly.</p>
289      </subsection>
290
291      <subsection name="Encryption and Alternative Compression Algorithms"
292                  id="encryption">
293
        <p>In most cases entries of an archive are not encrypted and
        are either not compressed at all or use the DEFLATE
        algorithm; Commons Compress' ZIP archiver will handle them
        just fine.  As of version 1.7, Commons Compress can also
        decompress entries compressed with the legacy SHRINK and
        IMPLODE algorithms of PKZIP 1.x.  Version 1.11 of Commons
        Compress adds read-only support for BZIP2.  Version 1.16 adds
        read-only support for DEFLATE64 - also known as "enhanced DEFLATE".</p>
302
        <p>The ZIP specification allows for various other compression
        algorithms and also supports several different ways of
        encrypting archive contents.  Apart from the algorithms
        mentioned above, none of these is currently supported by
        Commons Compress, and any such entry cannot be extracted by
        the archiving code.</p>
308
        <p><code>ZipFile</code>'s and
        <code>ZipArchiveInputStream</code>'s
        <code>canReadEntryData</code> methods will return false for
        encrypted entries or entries using an unsupported compression
        mechanism.  Using this method it is possible to at least
        detect and skip the entries that cannot be extracted.</p>
315
316        <table>
317          <thead>
318            <tr>
319              <th>Version of Apache Commons Compress</th>
320              <th>Supported Compression Methods</th>
321              <th>Supported Encryption Methods</th>
322            </tr>
323          </thead>
324          <tbody>
325            <tr>
326              <td>1.0 to 1.6</td>
327              <td>STORED, DEFLATE</td>
328              <td>-</td>
329            </tr>
330            <tr>
331              <td>1.7 to 1.10</td>
332              <td>STORED, DEFLATE, SHRINK, IMPLODE</td>
333              <td>-</td>
334            </tr>
335            <tr>
336              <td>1.11 to 1.15</td>
337              <td>STORED, DEFLATE, SHRINK, IMPLODE, BZIP2</td>
338              <td>-</td>
339            </tr>
340            <tr>
341              <td>1.16 and later</td>
342              <td>STORED, DEFLATE, SHRINK, IMPLODE, BZIP2, DEFLATE64
343              (enhanced deflate)</td>
344              <td>-</td>
345            </tr>
346          </tbody>
347        </table>
348
349      </subsection>
350
351      <subsection name="Zip64 Support" id="zip64">
        <p>The traditional ZIP format is limited to archive sizes of
          four gibibyte (actually 2<sup>32</sup> - 1 bytes &#x2248;
          4.3 GB) and 65535 entries, where each individual entry is
          limited to four gibibyte as well.  These limits seemed
          excessive in the 1980s.</p>
357
        <p>Version 4.5 of the ZIP specification introduced the
          so-called "Zip64 extensions", which raise these limits.
          Compressed and uncompressed entry sizes may reach 16 exbibyte
          (actually 2<sup>64</sup> - 1 bytes &#x2248; 18.4 EB, i.e.
          18.4 x 10<sup>18</sup> bytes) in archives that themselves
          can take up to 16 exbibyte and contain more than
          18 x 10<sup>18</sup> entries.</p>
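
        <p>The numbers behind these limits are easy to reproduce.  The
          sketch below spells them out and adds a small check for
          whether an entry size or entry count exceeds the traditional
          limits; the class and its members are illustrative only and
          do not mirror any Commons Compress API.</p>

        <source><![CDATA[
```java
import java.math.BigInteger;

// Illustrative helper, not part of Commons Compress.
final class ZipLimits {

    /** Largest size that fits the traditional 32 bit size fields: 4294967295. */
    static final long MAX_CLASSIC_SIZE = (1L << 32) - 1;

    /** Largest entry count that fits the traditional 16 bit count field: 65535. */
    static final int MAX_CLASSIC_ENTRIES = (1 << 16) - 1;

    /** Largest value of a 64 bit Zip64 field: 18446744073709551615. */
    static final BigInteger MAX_ZIP64 =
        BigInteger.ONE.shiftLeft(64).subtract(BigInteger.ONE);

    /** Returns true if either value no longer fits the traditional format. */
    static boolean exceedsClassicLimits(long entrySize, long entryCount) {
        return entrySize > MAX_CLASSIC_SIZE || entryCount > MAX_CLASSIC_ENTRIES;
    }

    public static void main(String[] args) {
        System.out.println("classic size limit:  " + MAX_CLASSIC_SIZE);
        System.out.println("classic entry limit: " + MAX_CLASSIC_ENTRIES);
        System.out.println("Zip64 field limit:   " + MAX_ZIP64);
        System.out.println("5 GiB entry needs Zip64: "
            + exceedsClassicLimits(5L * 1024 * 1024 * 1024, 1));
    }
}
```
]]></source>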
365
366        <p>Apache Commons Compress 1.2 and earlier do not support
367          Zip64 extensions at all.</p>
368
369        <p>Starting with Apache Commons Compress
370          1.3 <code>ZipArchiveInputStream</code>
371          and <code>ZipFile</code> transparently support Zip64
372          extensions.  By default <code>ZipArchiveOutputStream</code>
373          supports them transparently as well (i.e. it adds Zip64
374          extensions if needed and doesn't use them for
375          entries/archives that don't need them) if the compressed and
376          uncompressed sizes of the entry are known
377          when <code>putArchiveEntry</code> is called
378          or <code>ZipArchiveOutputStream</code>
379          uses <code>SeekableByteChannel</code>
380          (see <a href="#ZipArchiveOutputStream">above</a>).  If only
381          the uncompressed size is
382          known <code>ZipArchiveOutputStream</code> will assume the
383          compressed size will not be bigger than the uncompressed
384          size.</p>
385
386        <p><code>ZipArchiveOutputStream</code>'s
387          <code>setUseZip64</code> can be used to control the behavior.
388          <code>Zip64Mode.AsNeeded</code> is the default behavior
389          described in the previous paragraph.</p>
390
        <p>If <code>ZipArchiveOutputStream</code> is writing to a
          non-seekable stream it has to decide whether to use Zip64
          extensions or not before it starts writing the entry data.
          This means that if the size of the entry is unknown
          when <code>putArchiveEntry</code> is called it doesn't have
          anything to base the decision on.  By default it will not
          use Zip64 extensions in order to create archives that can be
          extracted by older archivers (it will later throw an
          exception in <code>closeEntry</code> if it detects Zip64
          extensions would have been needed).  It is possible to
          instruct <code>ZipArchiveOutputStream</code> to always
          create Zip64 extensions by calling
          <code>setUseZip64</code> with an argument
          of <code>Zip64Mode.Always</code>; use this if you are
          writing entries of unknown size to a stream and expect some
          of them to be too big to fit into the traditional
          limits.</p>
408
409        <p><code>Zip64Mode.Always</code> creates archives that use
410          Zip64 extensions for all entries, even those that don't
411          require them.  Such archives will be slightly bigger than
412          archives created with one of the other modes and not be
413          readable by unarchivers that don't support Zip64
414          extensions.</p>
415
        <p><code>Zip64Mode.Never</code> will not use any Zip64
          extensions at all and may cause
          a <code>Zip64RequiredException</code> to be thrown
          if <code>ZipArchiveOutputStream</code> detects that one of
          the format's limits is exceeded.  Archives created in this
          mode will be readable by all unarchivers; they may be
          slightly smaller than archives created
          with <code>SeekableByteChannel</code>
          in <code>Zip64Mode.AsNeeded</code> mode if some of the
          entries had unknown sizes.</p>
426
        <p>The <code>java.util.zip</code> package and the
          <code>jar</code> command of Java 5 and earlier cannot read
          Zip64 extensions and will fail if the archive contains any.
          So if you intend to create archives that Java 5 can consume
          you must set the mode to <code>Zip64Mode.Never</code>.</p>
432
433        <h4>Known Limitations</h4>
434
        <p>Some of the theoretical limits of the format are not
          reached because of Apache Commons Compress' own API
          (<code>ArchiveEntry</code>'s size information uses
          a <code>long</code>) or its internal use of Java collections
          or <code>SeekableByteChannel</code>.  The table
          below shows the theoretical limits supported by Apache
          Commons Compress.  In practice it is very likely that you'd
          run out of memory or your file system won't allow files that
          big long before you reach either limit.</p>
444
445        <table>
446          <thead>
447            <tr>
448              <th/>
449              <th>Max. Size of Archive</th>
450              <th>Max. Compressed/Uncompressed Size of Entry</th>
451              <th>Max. Number of Entries</th>
452            </tr>
453          </thead>
454          <tbody>
455            <tr>
456              <td>ZIP Format Without Zip 64 Extensions</td>
457              <td>2<sup>32</sup> - 1 bytes &#x2248; 4.3 GB</td>
458              <td>2<sup>32</sup> - 1 bytes &#x2248; 4.3 GB</td>
459              <td>65535</td>
460            </tr>
461            <tr>
462              <td>ZIP Format using Zip 64 Extensions</td>
              <td>2<sup>64</sup> - 1 bytes &#x2248; 18.4 EB</td>
              <td>2<sup>64</sup> - 1 bytes &#x2248; 18.4 EB</td>
              <td>2<sup>64</sup> - 1 &#x2248; 18.4 x 10<sup>18</sup></td>
466            </tr>
467            <tr>
468              <td>Commons Compress 1.2 and earlier</td>
469              <td>unlimited in <code>ZipArchiveInputStream</code>
470                and <code>ZipArchiveOutputStream</code> and
471                2<sup>32</sup> - 1 bytes &#x2248; 4.3 GB
472                in <code>ZipFile</code>.</td>
473              <td>2<sup>32</sup> - 1 bytes &#x2248; 4.3 GB</td>
474              <td>unlimited in <code>ZipArchiveInputStream</code>,
475                65535 in <code>ZipArchiveOutputStream</code>
476                and <code>ZipFile</code>.</td>
477            </tr>
478            <tr>
479              <td>Commons Compress 1.3 and later</td>
480              <td>unlimited in <code>ZipArchiveInputStream</code>
481                and <code>ZipArchiveOutputStream</code> and
482                2<sup>63</sup> - 1 bytes &#x2248; 9.2 EB
483                in <code>ZipFile</code>.</td>
484              <td>2<sup>63</sup> - 1 bytes &#x2248; 9.2 EB</td>
485              <td>unlimited in <code>ZipArchiveInputStream</code>,
486                2<sup>31</sup> - 1 &#x2248; 2.1 billion
487                in <code>ZipArchiveOutputStream</code>
488                and <code>ZipFile</code>.</td>
489            </tr>
490          </tbody>
491        </table>
492
493        <h4>Known Interoperability Problems</h4>
494
        <p>The <code>java.util.zip</code> package of OpenJDK7 supports
        Zip 64 extensions but its <code>ZipInputStream</code> and
        <code>ZipFile</code> classes will be unable to extract
        archives created with Commons Compress 1.3's
        <code>ZipArchiveOutputStream</code> if the archive contains
        entries that use the data descriptor, are smaller than 4 GiB
        and have Zip 64 extensions enabled.  I.e. the classes in
        OpenJDK currently only support archives that use Zip 64
        extensions when they are actually needed.  These classes
        are used to load JAR files and are the base for the
        <code>jar</code> command line utility as well.</p>
506      </subsection>
507
508      <subsection name="Consuming Archives Completely">
509
510        <p>Prior to version 1.5 <code>ZipArchiveInputStream</code>
511        would return null from <code>getNextEntry</code> or
512        <code>getNextZipEntry</code> as soon as the first central
513        directory header of the archive was found, leaving the whole
514        central directory itself unread inside the stream.  Starting
515        with version 1.5 <code>ZipArchiveInputStream</code> will try
516        to read the archive up to and including the "end of central
517        directory" record effectively consuming the archive
518        completely.</p>
519
520      </subsection>
521
522      <subsection name="Symbolic Links" id="symlinks">
523
524        <p>Starting with Compress 1.5 <code>ZipArchiveEntry</code>
525        recognizes Unix Symbolic Link entries written by InfoZIP's
526        zip.</p>
527
528        <p>The <code>ZipFile</code> class contains a convenience
529        method to read the link name of an entry.  Basically all it
530        does is read the contents of the entry and convert it to
531        a string using the given file name encoding of the
532        archive.</p>
533
534      </subsection>
535
536      <subsection name="Parallel zip creation" id="parallel">
537
        <p>Starting with Compress 1.10 there is built-in support for
          parallel creation of zip archives.</p>
540
        <p>Multiple threads can write
          to their own <code>ScatterZipOutputStream</code>
          instance that is backed by a file or by some user-implemented form of
          storage (implementing <code>ScatterGatherBackingStore</code>).</p>
545
        <p>When the threads finish, they can join these streams together
          to a complete zip file using the <code>writeTo</code> method
          that will write a single <code>ScatterZipOutputStream</code> to a target
          <code>ZipArchiveOutputStream</code>.</p>
550
        <p>To assist this process, clients can use
          <code>ParallelScatterZipCreator</code>, which handles thread
          pools and correct memory model consistency so the client
          can avoid these issues.  Please note that when writing well-formed
          Zip files this way, it is usually necessary to keep a
          separate <code>ScatterZipOutputStream</code> that receives all directories
          and to write it to the target <code>ZipArchiveOutputStream</code> before
          the ones created through <code>ParallelScatterZipCreator</code>.
          This is the responsibility of the client.</p>
559
560          <p>There is no guarantee of order of the entries when writing a Zip
561          file with <code>ParallelScatterZipCreator</code>.</p>
562
        <p>See the examples section for a code sample demonstrating how to make a zip file.</p>
564      </subsection>
565
566    </section>
567  </body>
568</document>
569