<?xml version="1.0"?>
<!--

   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.

-->
<document>
  <properties>
    <title>Commons Compress ZIP package</title>
    <author email="dev@commons.apache.org">Commons Documentation Team</author>
  </properties>
  <body>
    <section name="The ZIP package">

      <p>The ZIP package provides features not found
        in <code>java.util.zip</code>:</p>

      <ul>
        <li>Support for encodings other than UTF-8 for filenames and
          comments.
          Starting with Java7 this is supported
          by <code>java.util.zip</code> as well.</li>
        <li>Access to internal and external attributes (which are used
          to store Unix permissions by some zip implementations).</li>
        <li>Structured support for extra fields.</li>
      </ul>

      <p>In addition to the information stored
        in <code>ArchiveEntry</code> a <code>ZipArchiveEntry</code>
        stores internal and external attributes as well as extra
        fields which may contain information like Unix permissions,
        information about the platform they've been created on, their
        last modification time and an optional comment.</p>

      <subsection name="ZipArchiveInputStream vs ZipFile">

        <p>ZIP archives store archive entries in sequence and
          contain a registry of all entries at the very end of the
          archive.  It is acceptable for an archive to contain several
          entries of the same name and have the registry (called the
          central directory) decide which entry is actually to be used
          (if any).</p>

        <p>In addition the ZIP format stores certain information only
          inside the central directory but not together with the entry
          itself, namely:</p>

        <ul>
          <li>internal and external attributes</li>
          <li>different or additional extra fields</li>
        </ul>

        <p>This means the ZIP format cannot really be parsed
          correctly while reading a non-seekable stream, which is what
          <code>ZipArchiveInputStream</code> is forced to do.
          As a
          result <code>ZipArchiveInputStream</code></p>
        <ul>
          <li>may return entries that are not part of the central
            directory at all and shouldn't be considered part of the
            archive.</li>
          <li>may return several entries with the same name.</li>
          <li>will not return internal or external attributes.</li>
          <li>may return incomplete extra field data.</li>
          <li>may return unknown sizes and CRC values for entries
            until the next entry has been reached if the archive uses
            the data descriptor feature (see below).</li>
        </ul>

        <p><code>ZipArchiveInputStream</code> shares these limitations
          with <code>java.util.zip.ZipInputStream</code>.</p>

        <p><code>ZipFile</code> is able to read the central directory
          first and provide correct and complete information on any
          ZIP archive.</p>

        <p>ZIP archives support a feature called the data descriptor,
          which is a way to store an entry's length after the entry's
          data.  This can only work reliably if the size information
          can be taken from the central directory or the data itself
          can signal it is complete, which is true for data that is
          compressed using the DEFLATED compression algorithm.</p>

        <p><code>ZipFile</code> has access to the central directory
          and can extract entries using the data descriptor reliably.
          The same is true for <code>ZipArchiveInputStream</code> as
          long as the entry is DEFLATED.  For STORED
          entries <code>ZipArchiveInputStream</code> can try to read
          ahead until it finds the next entry, but this approach is
          not safe and has to be enabled explicitly by a constructor
          argument.</p>

        <p>If possible, you should always prefer <code>ZipFile</code>
          over <code>ZipArchiveInputStream</code>.</p>

        <p><code>ZipFile</code> requires a
          <code>SeekableByteChannel</code> that will be obtained
          transparently when reading from a file.
          The class
          <code>org.apache.commons.compress.utils.SeekableInMemoryByteChannel</code>
          allows you to read from an in-memory archive.</p>

      </subsection>

      <subsection name="ZipArchiveOutputStream" id="ZipArchiveOutputStream">
        <p><code>ZipArchiveOutputStream</code> has three constructors:
          one takes a <code>File</code> argument, one a
          <code>SeekableByteChannel</code> and the last an
          <code>OutputStream</code>.  The <code>File</code> version will
          try to use <code>SeekableByteChannel</code> and fall back to
          using a <code>FileOutputStream</code> internally if that
          fails.</p>

        <p>If <code>ZipArchiveOutputStream</code> can
          use <code>SeekableByteChannel</code> it can employ some
          optimizations that lead to smaller archives.  It also makes
          it possible to add uncompressed (<code>setMethod</code> used
          with <code>STORED</code>) entries of unknown size when
          calling <code>putArchiveEntry</code> - this is not allowed
          if <code>ZipArchiveOutputStream</code> has to use
          an <code>OutputStream</code>.</p>

        <p>If you know you are writing to a file, you should always
          prefer the <code>File</code>- or
          <code>SeekableByteChannel</code>-arg constructors.  The class
          <code>org.apache.commons.compress.utils.SeekableInMemoryByteChannel</code>
          allows you to write to an in-memory archive.</p>

      </subsection>

      <subsection name="Extra Fields">

        <p>Inside a ZIP archive, additional data can be attached to
          each entry.  The <code>java.util.zip.ZipEntry</code> class
          provides access to this via the <code>get/setExtra</code>
          methods as arrays of <code>byte</code>s.</p>

        <p>The extra data is actually supposed to be more structured
          than that, and Compress' ZIP package provides access to the
          structured data as <code>ZipExtraField</code> instances.
          Only
          a subset of all defined extra field formats is supported by
          the package; any other extra field will be stored
          as <code>UnrecognizedExtraField</code>.</p>

        <p>Prior to version 1.1 of this library trying to read an
          archive with extra fields that didn't follow the recommended
          structure for those fields would cause Compress to throw an
          exception.  Starting with version 1.1 these extra fields
          are read
          as <code>UnparseableExtraFieldData</code>.</p>

      </subsection>

      <subsection name="Encoding" id="encoding">

        <p>Traditionally the ZIP archive format uses CodePage 437 as
          encoding for file names, which is not sufficient for many
          international character sets.</p>

        <p>Over time different archivers have chosen different ways to
          work around the limitation - the <code>java.util.zip</code>
          package simply uses UTF-8 as its encoding, for example.</p>

        <p>Ant has been offering the encoding attribute of the zip and
          unzip task as a way to explicitly specify the encoding to
          use (or expect) since Ant 1.4.  It defaults to the
          platform's default encoding for zip and UTF-8 for jar and
          other jar-like tasks (war, ear, ...) as well as the unzip
          family of tasks.</p>

        <p>More recent versions of the ZIP specification introduce
          something called the "language encoding flag"
          which can be used to signal that a file name has been
          encoded using UTF-8.  All ZIP archives written by Compress
          will set this flag if the encoding has been set to UTF-8.
          Our interoperability tests with existing archivers didn't
          show any ill effects (in fact, most archivers ignore the
          flag to date), but you can turn off the "language encoding
          flag" by setting the attribute
          <code>useLanguageEncodingFlag</code> to <code>false</code> on the
          <code>ZipArchiveOutputStream</code> if you should encounter
          problems.</p>

        <p>The <code>ZipFile</code>
          and <code>ZipArchiveInputStream</code> classes will
          recognize the language encoding flag and ignore the encoding
          set in the constructor if it has been found.</p>

        <p>The InfoZIP developers have introduced new ZIP extra fields
          that can be used to add an additional UTF-8 encoded file
          name to the entry's metadata.  Most archivers ignore these
          extra fields.  <code>ZipArchiveOutputStream</code> supports
          an option <code>createUnicodeExtraFields</code> which makes
          it write these extra fields either for all entries
          ("always") or only for those whose name cannot be encoded using
          the specified encoding ("not-encodeable"); it defaults to
          "never" since the extra fields create bigger archives.</p>

        <p>The <code>fallbackToUTF8</code> attribute
          of <code>ZipArchiveOutputStream</code> can be used to create
          archives that use the specified encoding in the majority of
          cases but UTF-8 and the language encoding flag for filenames
          that cannot be encoded using the specified encoding.</p>

        <p>The <code>ZipFile</code>
          and <code>ZipArchiveInputStream</code> classes recognize the
          Unicode extra fields by default and read the file name
          information from them, unless you set the constructor parameter
          <code>scanForUnicodeExtraFields</code> to false.</p>

        <h4>Recommendations for Interoperability</h4>

        <p>The optimal setting of flags depends on the archivers you
          expect as consumers/producers of the ZIP archives.
          Below
          are some test results which may be superseded by later
          versions of each tool.</p>

        <ul>
          <li>The java.util.zip package used by the jar executable or
            to read jars from your CLASSPATH reads and writes UTF-8
            names; it doesn't set or recognize any flags or Unicode
            extra fields.</li>

          <li>Starting with Java7 <code>java.util.zip</code> writes
            UTF-8 by default and uses the language encoding flag.  It
            is possible to specify a different encoding when
            reading/writing ZIPs via new constructors.  The package
            now recognizes the language encoding flag when reading and
            ignores the Unicode extra fields.</li>

          <li>7Zip writes CodePage 437 by default but uses UTF-8 and
            the language encoding flag when writing entries that
            cannot be encoded as CodePage 437 (similar to the zip task
            with fallbacktoUTF8 set to true).  It recognizes the
            language encoding flag when reading and ignores the
            Unicode extra fields.</li>

          <li>WinZIP writes CodePage 437 and uses Unicode extra fields
            by default.  It recognizes the Unicode extra field and the
            language encoding flag when reading.</li>

          <li>Windows' "compressed folder" feature doesn't recognize
            any flag or extra field and creates archives using the
            platform's default encoding - and expects archives to be in
            that encoding when reading them.</li>

          <li>InfoZIP based tools can recognize and write both; this is
            a compile-time option and depends on the platform, so your
            mileage may vary.</li>

          <li>PKWARE zip tools recognize both and prefer the language
            encoding flag.  They create archives using CodePage 437 if
            possible and UTF-8 plus the language encoding flag for
            file names that cannot be encoded as CodePage 437.</li>
        </ul>

        <p>So, what to do?</p>

        <p>If you are creating jars, then java.util.zip is your main
          consumer.
          We recommend you set the encoding to UTF-8 and
          keep the language encoding flag enabled.  The flag won't
          help or hurt java.util.zip prior to Java7 but archivers that
          support it will show the correct file names.</p>

        <p>For maximum interop it is probably best to set the encoding
          to UTF-8, enable the language encoding flag and create
          Unicode extra fields when writing ZIPs.  Such archives
          should be extracted correctly by java.util.zip, 7Zip,
          WinZIP, PKWARE tools and most likely InfoZIP tools.  They
          will be unusable with Windows' "compressed folders" feature
          and bigger than archives without the Unicode extra fields,
          though.</p>

        <p>If Windows' "compressed folders" is your primary consumer,
          then your best option is to explicitly set the encoding to
          that of the target platform.  You may want to enable creation of
          Unicode extra fields so the tools that support them will
          extract the file names correctly.</p>
      </subsection>

      <subsection name="Encryption and Alternative Compression Algorithms"
                  id="encryption">

        <p>In most cases entries of an archive are not encrypted and
          are either not compressed at all or use the DEFLATE
          algorithm; Commons Compress' ZIP archiver will handle them
          just fine.  As of version 1.7, Commons Compress can also
          decompress entries compressed with the legacy SHRINK and
          IMPLODE algorithms of PKZIP 1.x.  Version 1.11 of Commons
          Compress adds read-only support for BZIP2.  Version 1.16 adds
          read-only support for DEFLATE64 - also known as "enhanced DEFLATE".</p>

        <p>The ZIP specification allows for various other compression
          algorithms and also supports several different ways of
          encrypting archive contents.
          Neither is
          currently supported by Commons Compress, and any such entry
          cannot be extracted by the archiving code.</p>

        <p><code>ZipFile</code>'s and
          <code>ZipArchiveInputStream</code>'s
          <code>canReadEntryData</code> methods will return false for
          encrypted entries or entries using an unsupported compression
          method.  Using this method it is possible to at least
          detect and skip the entries that cannot be extracted.</p>

        <table>
          <thead>
            <tr>
              <th>Version of Apache Commons Compress</th>
              <th>Supported Compression Methods</th>
              <th>Supported Encryption Methods</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>1.0 to 1.6</td>
              <td>STORED, DEFLATE</td>
              <td>-</td>
            </tr>
            <tr>
              <td>1.7 to 1.10</td>
              <td>STORED, DEFLATE, SHRINK, IMPLODE</td>
              <td>-</td>
            </tr>
            <tr>
              <td>1.11 to 1.15</td>
              <td>STORED, DEFLATE, SHRINK, IMPLODE, BZIP2</td>
              <td>-</td>
            </tr>
            <tr>
              <td>1.16 and later</td>
              <td>STORED, DEFLATE, SHRINK, IMPLODE, BZIP2, DEFLATE64
                (enhanced deflate)</td>
              <td>-</td>
            </tr>
          </tbody>
        </table>

      </subsection>

      <subsection name="Zip64 Support" id="zip64">
        <p>The traditional ZIP format is limited to archive sizes of
          four gibibyte (actually 2<sup>32</sup> - 1 bytes ≈
          4.3 GB) and 65535 entries, where each individual entry is
          limited to four gibibyte as well.
          These limits seemed
          excessive in the 1980s.</p>

        <p>Version 4.5 of the ZIP specification introduced the
          so-called "Zip64 extensions", which raise those limits to
          compressed or uncompressed sizes of up to 16 exbibyte
          (actually 2<sup>64</sup> - 1 bytes ≈ 18.5 EB, i.e.
          18.5 x 10<sup>18</sup> bytes) in archives that can themselves
          take up to 16 exbibyte and contain more than
          18 x 10<sup>18</sup> entries.</p>

        <p>Apache Commons Compress 1.2 and earlier do not support
          Zip64 extensions at all.</p>

        <p>Starting with Apache Commons Compress
          1.3 <code>ZipArchiveInputStream</code>
          and <code>ZipFile</code> transparently support Zip64
          extensions.  By default <code>ZipArchiveOutputStream</code>
          supports them transparently as well (i.e. it adds Zip64
          extensions if needed and doesn't use them for
          entries/archives that don't need them) if the compressed and
          uncompressed sizes of the entry are known
          when <code>putArchiveEntry</code> is called
          or <code>ZipArchiveOutputStream</code>
          uses <code>SeekableByteChannel</code>
          (see <a href="#ZipArchiveOutputStream">above</a>).  If only
          the uncompressed size is
          known <code>ZipArchiveOutputStream</code> will assume the
          compressed size will not be bigger than the uncompressed
          size.</p>

        <p><code>ZipArchiveOutputStream</code>'s
          <code>setUseZip64</code> can be used to control the behavior.
          <code>Zip64Mode.AsNeeded</code> is the default behavior
          described in the previous paragraph.</p>

        <p>If <code>ZipArchiveOutputStream</code> is writing to a
          non-seekable stream it has to decide whether to use Zip64
          extensions or not before it starts writing the entry data.
          This means that if the size of the entry is unknown
          when <code>putArchiveEntry</code> is called it doesn't have
          anything to base the decision on.
          By default it will not
          use Zip64 extensions in order to create archives that can be
          extracted by older archivers (it will later throw an
          exception in <code>closeEntry</code> if it detects that Zip64
          extensions would have been needed).  It is possible to
          instruct <code>ZipArchiveOutputStream</code> to always
          create Zip64 extensions by calling
          <code>setUseZip64</code> with an argument
          of <code>Zip64Mode.Always</code>; use this if you are
          writing entries of unknown size to a stream and expect some
          of them to be too big to fit into the traditional
          limits.</p>

        <p><code>Zip64Mode.Always</code> creates archives that use
          Zip64 extensions for all entries, even those that don't
          require them.  Such archives will be slightly bigger than
          archives created with one of the other modes and will not be
          readable by unarchivers that don't support Zip64
          extensions.</p>

        <p><code>Zip64Mode.Never</code> will not use any Zip64
          extensions at all and may cause
          a <code>Zip64RequiredException</code> to be thrown
          if <code>ZipArchiveOutputStream</code> detects that one of
          the format's limits is exceeded.  Archives created in this
          mode will be readable by all unarchivers; they may be
          slightly smaller than archives created
          with <code>SeekableByteChannel</code>
          in <code>Zip64Mode.AsNeeded</code> mode if some of the
          entries had unknown sizes.</p>

        <p>The <code>java.util.zip</code> package and the
          <code>jar</code> command of Java5 and earlier cannot read
          Zip64 extensions and will fail if the archive contains any.
          So if you intend to create archives that Java5 can consume
          you must set the mode to <code>Zip64Mode.Never</code>.</p>

        <h4>Known Limitations</h4>

        <p>Some of the theoretical limits of the format are not
          reached because of Apache Commons Compress' own API
          (<code>ArchiveEntry</code>'s size information uses
          a <code>long</code>) or its usage of Java collections
          or <code>SeekableByteChannel</code> internally.  The table
          below shows the theoretical limits supported by Apache
          Commons Compress.  In practice it is very likely that you'd
          run out of memory or your file system won't allow files that
          big long before you reach either limit.</p>

        <table>
          <thead>
            <tr>
              <th/>
              <th>Max. Size of Archive</th>
              <th>Max. Compressed/Uncompressed Size of Entry</th>
              <th>Max. Number of Entries</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>ZIP Format Without Zip 64 Extensions</td>
              <td>2<sup>32</sup> - 1 bytes ≈ 4.3 GB</td>
              <td>2<sup>32</sup> - 1 bytes ≈ 4.3 GB</td>
              <td>65535</td>
            </tr>
            <tr>
              <td>ZIP Format using Zip 64 Extensions</td>
              <td>2<sup>64</sup> - 1 bytes ≈ 18.5 EB</td>
              <td>2<sup>64</sup> - 1 bytes ≈ 18.5 EB</td>
              <td>2<sup>64</sup> - 1 ≈ 18.5 x 10<sup>18</sup></td>
            </tr>
            <tr>
              <td>Commons Compress 1.2 and earlier</td>
              <td>unlimited in <code>ZipArchiveInputStream</code>
                and <code>ZipArchiveOutputStream</code> and
                2<sup>32</sup> - 1 bytes ≈ 4.3 GB
                in <code>ZipFile</code>.</td>
              <td>2<sup>32</sup> - 1 bytes ≈ 4.3 GB</td>
              <td>unlimited in <code>ZipArchiveInputStream</code>,
                65535 in <code>ZipArchiveOutputStream</code>
                and <code>ZipFile</code>.</td>
            </tr>
            <tr>
              <td>Commons Compress 1.3 and later</td>
              <td>unlimited in <code>ZipArchiveInputStream</code>
                and <code>ZipArchiveOutputStream</code> and
                2<sup>63</sup> - 1 bytes ≈ 9.2 EB
                in <code>ZipFile</code>.</td>
              <td>2<sup>63</sup> - 1 bytes ≈ 9.2
                EB</td>
              <td>unlimited in <code>ZipArchiveInputStream</code>,
                2<sup>31</sup> - 1 ≈ 2.1 billion
                in <code>ZipArchiveOutputStream</code>
                and <code>ZipFile</code>.</td>
            </tr>
          </tbody>
        </table>

        <h4>Known Interoperability Problems</h4>

        <p>The <code>java.util.zip</code> package of OpenJDK7 supports
          Zip 64 extensions but its <code>ZipInputStream</code> and
          <code>ZipFile</code> classes will be unable to extract
          archives created with Commons Compress 1.3's
          <code>ZipArchiveOutputStream</code> if the archive contains
          entries that use the data descriptor, are smaller than 4 GiB
          and have Zip 64 extensions enabled.  In other words, the
          classes in OpenJDK currently support only archives that use
          Zip 64 extensions when they are actually needed.  These classes
          are used to load JAR files and are the base for the
          <code>jar</code> command line utility as well.</p>
      </subsection>

      <subsection name="Consuming Archives Completely">

        <p>Prior to version 1.5 <code>ZipArchiveInputStream</code>
          would return null from <code>getNextEntry</code> or
          <code>getNextZipEntry</code> as soon as the first central
          directory header of the archive was found, leaving the whole
          central directory itself unread inside the stream.  Starting
          with version 1.5 <code>ZipArchiveInputStream</code> will try
          to read the archive up to and including the "end of central
          directory" record, effectively consuming the archive
          completely.</p>

      </subsection>

      <subsection name="Symbolic Links" id="symlinks">

        <p>Starting with Compress 1.5 <code>ZipArchiveEntry</code>
          recognizes Unix Symbolic Link entries written by InfoZIP's
          zip.</p>

        <p>The <code>ZipFile</code> class contains a convenience
          method to read the link name of an entry.
          Basically all it
          does is read the contents of the entry and convert it to
          a string using the given file name encoding of the
          archive.</p>

      </subsection>

      <subsection name="Parallel zip creation" id="parallel">

        <p>Starting with Compress 1.10 there is built-in support for
          parallel creation of zip archives.</p>

        <p>Multiple threads can write
          to their own <code>ScatterZipOutputStream</code>
          instance that is backed by a file or by some user-implemented form of
          storage (implementing <code>ScatterGatherBackingStore</code>).</p>

        <p>When the threads finish, they can join these streams together
          into a complete zip file using the <code>writeTo</code> method
          that will write a single <code>ScatterZipOutputStream</code> to a target
          <code>ZipArchiveOutputStream</code>.</p>

        <p>To assist this process, clients can use
          <code>ParallelScatterZipCreator</code> that will handle thread
          pools and correct memory model consistency so the client
          can avoid these issues.  Please note that when writing well-formed
          Zip files this way, it is usually necessary to keep a
          separate <code>ScatterZipOutputStream</code> that receives all directories
          and write it to the target <code>ZipArchiveOutputStream</code> before
          the ones created through <code>ParallelScatterZipCreator</code>.  This is the responsibility of the client.</p>

        <p>There is no guarantee of order of the entries when writing a Zip
          file with <code>ParallelScatterZipCreator</code>.</p>

        <p>See the examples section for a code sample demonstrating how to make a zip file.</p>
      </subsection>

    </section>
  </body>
</document>