Commons Compress calls all formats that compress a single stream of data compressor formats while all formats that collect multiple entries inside a single (potentially compressed) archive are archiver formats.
The compressor formats supported are gzip, bzip2, xz, lzma, Pack200, DEFLATE, Brotli, DEFLATE64, ZStandard and Z; the archiver formats are 7z, ar, arj, cpio, dump, tar and zip. Pack200 is a special case as it can only compress JAR files.
We currently only provide read support for arj, dump, Brotli, DEFLATE64 and Z. arj can only read uncompressed archives. 7z can read archives using many of the compression and encryption algorithms supported by 7z but doesn't support encryption when writing archives.
The stream classes all wrap around streams provided by the calling code and work on them directly without any additional buffering. On the other hand, most of them will benefit from buffering, so it is highly recommended that users wrap their streams in Buffered(In|Out)putStreams before using the Commons Compress API.
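The wrapping recommended above can be sketched with JDK classes only. The helper class BufferedStreams and its method names are hypothetical and not part of Commons Compress; the streams it returns are what you would hand to a Commons Compress stream class or factory.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper, not part of Commons Compress.
final class BufferedStreams {
    private BufferedStreams() {}

    /** Adds buffering to an input stream unless it is already a BufferedInputStream. */
    static InputStream buffered(InputStream in) {
        return in instanceof BufferedInputStream ? in : new BufferedInputStream(in);
    }

    /** Adds buffering to an output stream unless it is already a BufferedOutputStream. */
    static OutputStream buffered(OutputStream out) {
        return out instanceof BufferedOutputStream ? out : new BufferedOutputStream(out);
    }
}
```

The instanceof check only avoids stacking a second buffer on a stream that is already buffered; wrapping unconditionally would work as well.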
Compress provides factory methods to create input/output streams based on the names of the compressor or archiver format as well as factory methods that try to guess the format of an input stream.
To create a compressor writing to a given output by using the algorithm name:
Make the factory guess the input format for a given archiver stream:
Make the factory guess the input format for a given compressor stream:
Note that there is no way to detect the lzma or Brotli formats, so only the two-arg version of createCompressorInputStream can be used for them. Prior to Compress 1.9 the .Z format was not auto-detected either.
Starting with Compress 1.14, CompressorStreamFactory has an optional constructor argument that can be used to set an upper limit on the memory that may be used while decompressing or compressing a stream. As of 1.14 this setting only affects decompressing Z, XZ and LZMA compressed streams.
For the Snappy and LZ4 formats the amount of memory used during compression is directly proportional to the window size.
Starting with Compress 1.17, most of the CompressorInputStream implementations as well as ZipArchiveInputStream and all streams returned by ZipFile.getInputStream implement the InputStreamStatistics interface. SevenZFile provides statistics for the current entry via the getStatisticsForCurrentEntry method. This interface can be used to track progress while extracting a stream or to detect potential zip bombs when the compression ratio becomes suspiciously large.
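The zip bomb check mentioned above boils down to comparing two counters. A minimal sketch, assuming the counters are read from InputStreamStatistics (its getCompressedCount and getUncompressedCount methods); the class CompressionRatioGuard and any threshold you pass to it are hypothetical and not part of Commons Compress.

```java
// Hypothetical helper, not part of Commons Compress.
final class CompressionRatioGuard {
    private final double maxRatio;

    /** @param maxRatio largest uncompressed/compressed ratio still considered plausible */
    CompressionRatioGuard(double maxRatio) {
        if (maxRatio <= 0) {
            throw new IllegalArgumentException("maxRatio must be positive");
        }
        this.maxRatio = maxRatio;
    }

    /**
     * @param compressedCount   bytes consumed so far (assumed to come from getCompressedCount)
     * @param uncompressedCount bytes produced so far (assumed to come from getUncompressedCount)
     * @return true if the ratio observed so far exceeds the configured limit
     */
    boolean isSuspicious(long compressedCount, long uncompressedCount) {
        if (compressedCount <= 0) {
            // No compressed input consumed yet, so there is no ratio to judge.
            return false;
        }
        return (double) uncompressedCount / compressedCount > maxRatio;
    }
}
```

You would call isSuspicious periodically inside your read loop and abort extraction when it returns true. What counts as a suspicious ratio depends on your data, so the limit is left to the caller.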
Many of the supported formats have developed different dialects and extensions and some formats allow for features (not yet) supported by Commons Compress.
The ArchiveInputStream class provides a method canReadEntryData that will return false if Commons Compress can detect that an archive uses a feature that is not supported by the current implementation. If it returns false you should not try to read the entry but skip over it.
All archive formats provide metadata about the individual archive entries via instances of ArchiveEntry (or rather subclasses of it). When reading from an archive, the information provided by the getName method is the raw name as stored inside the archive. There is no guarantee that the name represents a relative file name or even a valid file name on your target operating system at all. You should double check the outcome when you try to create file names from entry names.
Apart from 7z, all formats provide a subclass of ArchiveInputStream that can be used to read an archive. For 7z, SevenZFile provides a similar API that does not represent a stream, as our implementation requires random access to the input and cannot be used for general streams. The ZIP implementation can benefit a lot from random access as well; see the zip page for details.
Assuming you want to extract an archive to a target directory, you'd call getNextEntry, verify the entry can be read, construct a sane file name from the entry's name, create a file and write all contents to it - here IOUtils.copy may come in handy. You do so for every entry until getNextEntry returns null.
A skeleton might look like:
where the hypothetical fileName method is written by you and provides the absolute name for the file that is going to be written on disk. Here you should perform checks that ensure the resulting file name actually is a valid file name on your operating system or belongs to a file inside of targetDir when using the entry's name as input.
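One way such a fileName method could be written is sketched below, using only JDK classes. It is a sketch rather than a complete validation: it takes the entry's name (what getName returns) instead of the entry itself, resolves it against the target directory and rejects anything that ends up outside of it. The class name EntryNames, the parameter types and the choice of exception are assumptions.

```java
import java.nio.file.Path;

// Hypothetical helper, not part of Commons Compress.
final class EntryNames {
    private EntryNames() {}

    /**
     * Returns the absolute file name for an entry extracted below targetDir.
     *
     * @throws IllegalArgumentException if the name is not a valid path on this
     *         platform or would resolve to a location outside of targetDir
     */
    static String fileName(Path targetDir, String entryName) {
        Path base = targetDir.toAbsolutePath().normalize();
        // resolve() returns entryName itself when it is absolute;
        // normalize() collapses "." and ".." segments.
        Path resolved = base.resolve(entryName).normalize();
        if (!resolved.startsWith(base)) {
            throw new IllegalArgumentException(
                    "entry is outside of the target directory: " + entryName);
        }
        return resolved.toString();
    }
}
```

Names such as "../../etc/passwd" or absolute names are rejected by the startsWith check, while names containing characters that are illegal on the current platform already fail inside resolve with an InvalidPathException, which is a subclass of IllegalArgumentException.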
If you want to combine an archive format with a compression format - like when reading a "tar.gz" file - you wrap the ArchiveInputStream around a CompressorInputStream, for example:
Apart from 7z, all formats that support writing provide a subclass of ArchiveOutputStream that can be used to create an archive. For 7z, SevenZOutputFile provides a similar API that does not represent a stream, as our implementation requires random access to the output and cannot be used for general streams. The ZipArchiveOutputStream class benefits from random access as well. It can also be used for non-seekable streams, but then not all features will be available and the archive size might be slightly bigger; see the zip page for details.
Assuming you want to add a collection of files to an archive, you can first use createArchiveEntry for each file. In general this will set a few flags (usually the last modified time, the size and the information whether this is a file or directory) based on the File instance. Alternatively you can create the ArchiveEntry subclass corresponding to your format directly. Often you may want to set additional flags like file permissions or owner information before adding the entry to the archive.
Next you use putArchiveEntry in order to add the entry and then start using write to add the content of the entry - here IOUtils.copy may come in handy. Finally you invoke closeArchiveEntry once you've written all content and before you add the next entry.
Once all entries have been added you'd invoke finish and finally close the stream.
A skeleton might look like:
where the hypothetical entryName method is written by you and provides the name for the entry as it is going to be written to the archive.
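A possible entryName method is sketched below with JDK classes only. It assumes you want entry names relative to a base directory and joined with '/' regardless of the platform's own separator. The class name ArchiveEntryNames and the two-Path signature are assumptions.

```java
import java.nio.file.Path;

// Hypothetical helper, not part of Commons Compress.
final class ArchiveEntryNames {
    private ArchiveEntryNames() {}

    /**
     * Builds an entry name for file relative to baseDir, using '/' as separator.
     *
     * @throws IllegalArgumentException if file is not located below baseDir
     */
    static String entryName(Path baseDir, Path file) {
        Path base = baseDir.toAbsolutePath().normalize();
        Path target = file.toAbsolutePath().normalize();
        if (target.equals(base) || !target.startsWith(base)) {
            throw new IllegalArgumentException(file + " is not inside " + baseDir);
        }
        StringBuilder name = new StringBuilder();
        // Iterating a Path yields its name elements one by one.
        for (Path element : base.relativize(target)) {
            if (name.length() > 0) {
                name.append('/');
            }
            name.append(element);
        }
        return name.toString();
    }
}
```

Storing relative names keeps details of your local directory layout out of the archive and makes it easier for whoever extracts it to stay inside their target directory.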
If you want to combine an archive format with a compression format - like when creating a "tar.gz" file - you wrap the ArchiveOutputStream around a CompressorOutputStream, for example:
Note that Commons Compress currently only supports a subset of compression and encryption algorithms used for 7z archives. For writing only uncompressed entries, LZMA, LZMA2, BZIP2 and Deflate are supported - in addition to those reading supports AES-256/SHA-256 and DEFLATE64.
Multipart archives are not supported at all.
7z archives can use multiple compression and encryption methods as well as filters combined as a pipeline of methods for their entries. Prior to Compress 1.8 you could only specify a single method when creating archives - reading archives using more than one method has been possible before. Starting with Compress 1.8 it is possible to configure the full pipeline using the setContentMethods method of SevenZOutputFile. Methods are specified in the order they appear inside the pipeline when creating the archive. You can also specify certain parameters for some of the methods - see the Javadocs of SevenZMethodConfiguration for details.
When reading entries from an archive, the getContentMethods method of SevenZArchiveEntry will properly represent the compression/encryption/filter methods but may fail to determine the configuration options used. As of Compress 1.8 only the dictionary size used for LZMA2 can be read.
Currently solid compression - compressing multiple files as a single block to benefit from patterns repeating across files - is only supported when reading archives. This also means the compression ratio will likely be worse when using Commons Compress compared to the native 7z executable.
Reading or writing requires a SeekableByteChannel that will be obtained transparently when reading from or writing to a file. The class org.apache.commons.compress.utils.SeekableInMemoryByteChannel allows you to read from or write to an in-memory archive.
Adding an entry to a 7z archive:
Uncompressing a given 7z archive (you would certainly add exception handling and make sure all streams get closed properly):
Uncompressing a given in-memory 7z archive:
Currently Compress supports reading but not writing of encrypted archives. When reading an encrypted archive, a password has to be provided to one of SevenZFile's constructors. If you try to read an encrypted archive without specifying a password, a PasswordRequiredException (a subclass of IOException) will be thrown.
When specifying the password as a byte[], one common mistake is to use the wrong encoding when creating the byte[] from a String. The SevenZFile class expects the bytes to correspond to the UTF-16LE encoding of the password. An example of reading an encrypted archive is:
Starting with Compress 1.17, new constructors have been added that accept the password as char[] rather than a byte[]. We recommend you use these in order to avoid the problem above.
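If you do have to pass the password as a byte[], for example on a version older than 1.17, the conversion can be sketched with the JDK alone. The helper class SevenZPasswords is hypothetical; only the UTF-16LE requirement comes from the text above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Commons Compress.
final class SevenZPasswords {
    private SevenZPasswords() {}

    /** Encodes the password as UTF-16LE: two bytes per char, low byte first, no byte order mark. */
    static byte[] toPasswordBytes(String password) {
        return password.getBytes(StandardCharsets.UTF_16LE);
    }
}
```

Calling getBytes() without a charset, or with UTF-8, yields a different byte sequence for the same password, which is the mistake described above.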
In addition to the information stored in ArchiveEntry, an ArArchiveEntry stores information about the owner user and group as well as Unix permissions.
Adding an entry to an ar archive:
Reading entries from an ar archive:
Traditionally the AR format doesn't allow file names longer than 16 characters. There are two variants that circumvent this limitation in different ways: the GNU/SVR4 and the BSD variant. Commons Compress 1.0 to 1.2 can only read archives using the GNU/SVR4 variant; support for the BSD variant has been added in Commons Compress 1.3. Commons Compress 1.3 also optionally supports writing archives with file names longer than 16 characters using the BSD dialect; writing the GNU/SVR4 dialect is not supported.
Version of Apache Commons Compress | Support for Traditional AR Format | Support for GNU/SVR4 Dialect | Support for BSD Dialect |
---|---|---|---|
1.0 to 1.2 | read/write | read | - |
1.3 and later | read/write | read | read/write |
It is not possible to detect the end of an AR archive in a reliable way, so ArArchiveInputStream will read until it reaches the end of the stream or fails to parse the stream's content as AR entries.
Note that Commons Compress doesn't support compressed, encrypted or multi-volume ARJ archives, yet.
Uncompressing a given arj archive (you would certainly add exception handling and make sure all streams get closed properly):
In addition to the information stored in ArchiveEntry, a CpioArchiveEntry stores various attributes including information about the original owner and permissions.
The cpio package supports the "new portable" as well as the "old" format of CPIO archives in their binary, ASCII and "with CRC" variants.
Adding an entry to a cpio archive:
Reading entries from a cpio archive:
Traditionally CPIO archives are written in blocks of 512 bytes - the block size is a configuration parameter of the Cpio*Stream's constructors. Starting with version 1.5, CpioArchiveInputStream will consume the padding written to fill the current block when the end of the archive is reached. Unfortunately many CPIO implementations use larger block sizes, so there may be more zero-byte padding left inside the original input stream after the archive has been consumed completely.
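The arithmetic behind the leftover padding can be sketched as follows. The class CpioPadding is hypothetical, and the 5120-byte block size used in the note below is only an assumed example of a "larger" block size.

```java
// Hypothetical helper, not part of Commons Compress.
final class CpioPadding {
    private CpioPadding() {}

    /** Number of zero bytes needed to fill the last block after bytesWritten bytes. */
    static long paddingFor(long bytesWritten, int blockSize) {
        if (bytesWritten < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("negative length or non-positive block size");
        }
        long remainder = bytesWritten % blockSize;
        return remainder == 0 ? 0 : blockSize - remainder;
    }
}
```

For an archive whose data ends after 1000 bytes, paddingFor(1000, 512) is 24 while paddingFor(1000, 5120) is 4120. A reader that assumes 512-byte blocks would therefore leave 4096 zero bytes unread in a stream that was written with 5120-byte blocks.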
In general, JAR archives are ZIP files, so the JAR package supports all options provided by the ZIP package.
To be interoperable JAR archives should always be created using the UTF-8 encoding for file names (which is the default).
Archives created using JarArchiveOutputStream will implicitly add a JarMarker extra field to the very first archive entry of the archive, which will make Solaris recognize them as Java archives and allows them to be used as executables.
Note that ArchiveStreamFactory doesn't distinguish ZIP archives from JAR archives, so if you use the one-argument createArchiveInputStream method on a JAR archive, it will still return the more generic ZipArchiveInputStream.
The JarArchiveEntry class contains fields for certificates and attributes that are planned to be supported in the future but are not supported as of Compress 1.0.
Adding an entry to a jar archive:
Reading entries from a jar archive:
In addition to the information stored in ArchiveEntry, a DumpArchiveEntry stores various attributes including information about the original owner and permissions.
As of Commons Compress 1.3 only dump archives using the new-fs format - this is the most common variant - are supported. Right now this library supports uncompressed and ZLIB compressed archives and cannot write archives at all.
Reading entries from a dump archive:
Prior to version 1.5, DumpArchiveInputStream would close the original input once it had read the last record. Starting with version 1.5 it will not close the stream implicitly.
The TAR package has a dedicated documentation page.
Adding an entry to a tar archive:
Reading entries from a tar archive:
The ZIP package has a dedicated documentation page.
Adding an entry to a zip archive:
ZipArchiveOutputStream can use some internal optimizations exploiting SeekableByteChannel if it knows it is writing to a seekable output rather than a non-seekable stream. If you are writing to a file, you should use the constructor that accepts a File or SeekableByteChannel argument rather than the one using an OutputStream or the factory method in ArchiveStreamFactory.
Reading entries from a zip archive:
Reading entries from a zip archive using the recommended ZipFile class:
Reading entries from an in-memory zip archive using SeekableInMemoryByteChannel and the ZipFile class:
Creating a zip file with multiple threads:
A simple implementation to create a zip file might look like this:
For the bzip2, gzip and xz formats as well as the framed lz4 format, a single compressed file may actually consist of several streams that will be concatenated by the command line utilities when decompressing them. Starting with Commons Compress 1.4 the *CompressorInputStreams for these formats support concatenating streams as well, but they won't do so by default. You must use the two-arg constructor and explicitly enable the support.
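To illustrate what a file consisting of several streams looks like, the following sketch builds one in memory. It deliberately uses the JDK's java.util.zip classes rather than Commons Compress, so it only demonstrates the data layout; the class ConcatenatedGzipDemo and its methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; uses the JDK's gzip streams, not Commons Compress.
final class ConcatenatedGzipDemo {
    private ConcatenatedGzipDemo() {}

    /** Compresses text into one complete, self-contained gzip stream. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Appends one stream to another, like "cat a.gz b.gz" on the command line. */
    static byte[] concat(byte[] first, byte[] second) {
        byte[] all = new byte[first.length + second.length];
        System.arraycopy(first, 0, all, 0, first.length);
        System.arraycopy(second, 0, all, first.length, second.length);
        return all;
    }

    /** Decompresses the data, continuing past the end of the first stream. */
    static String gunzipAll(byte[] data) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The JDK's GZIPInputStream continues with the next stream on its own. With the Commons Compress classes the same input yields only the content of the first stream unless you pass true as the second constructor argument, as described above.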
The implementation of this package is provided by the Google Brotli dec library.
Uncompressing a given Brotli compressed file (you would certainly add exception handling and make sure all streams get closed properly):
Note that BZip2CompressorOutputStream keeps hold of some big data structures in memory. While it is recommended for any stream that you close it as soon as you no longer need it, this is even more important for BZip2CompressorOutputStream.
Uncompressing a given bzip2 compressed file (you would certainly add exception handling and make sure all streams get closed properly):
Compressing a given file using bzip2 (you would certainly add exception handling and make sure all streams get closed properly):
The implementation of the DEFLATE/INFLATE code used by this package is provided by the java.util.zip package of the Java class library.
Uncompressing a given DEFLATE compressed file (you would certainly add exception handling and make sure all streams get closed properly):
Compressing a given file using DEFLATE (you would certainly add exception handling and make sure all streams get closed properly):
Uncompressing a given DEFLATE64 compressed file (you would certainly add exception handling and make sure all streams get closed properly):
The implementation of the DEFLATE/INFLATE code used by this package is provided by the java.util.zip package of the Java class library.
Uncompressing a given gzip compressed file (you would certainly add exception handling and make sure all streams get closed properly):
Compressing a given file using gzip (you would certainly add exception handling and make sure all streams get closed properly):
There are two different "formats" used for lz4. The format called "block format" only contains the raw compressed data while the other provides a higher level "frame format" - Commons Compress offers two different stream classes for reading or writing either format.
Uncompressing a given frame LZ4 file (you would certainly add exception handling and make sure all streams get closed properly):
Compressing a given file using the LZ4 frame format (you would certainly add exception handling and make sure all streams get closed properly):
The implementation of this package is provided by the public domain XZ for Java library.
Uncompressing a given lzma compressed file (you would certainly add exception handling and make sure all streams get closed properly):
Compressing a given file using lzma (you would certainly add exception handling and make sure all streams get closed properly):
The Pack200 package has a dedicated documentation page.
The implementation of this package is provided by the java.util.jar package of the Java class library.
Uncompressing a given pack200 compressed file (you would certainly add exception handling and make sure all streams get closed properly):
Compressing a given jar using pack200 (you would certainly add exception handling and make sure all streams get closed properly):
There are two different "formats" used for Snappy, one only contains the raw compressed data while the other provides a higher level "framing format" - Commons Compress offers two different stream classes for reading either format.
Starting with 1.12 we've added support for different dialects of the framing format that can be specified when constructing the stream. The STANDARD dialect follows the "framing format" specification while the IWORK_ARCHIVE dialect can be used to parse IWA files that are part of Apple's iWork 13 format. If no dialect has been specified, STANDARD is used. Only the STANDARD format can be detected by CompressorStreamFactory.
Uncompressing a given framed Snappy file (you would certainly add exception handling and make sure all streams get closed properly):
Compressing a given file using framed Snappy (you would certainly add exception handling and make sure all streams get closed properly):
The implementation of this package is provided by the public domain XZ for Java library.
When you try to open an XZ stream for reading using CompressorStreamFactory, Commons Compress will check whether the XZ for Java library is available. Starting with Compress 1.9 the result of this check will be cached unless Compress finds OSGi classes in its classpath. You can use XZUtils#setCacheXZAvailability to override this default behavior.
Uncompressing a given XZ compressed file (you would certainly add exception handling and make sure all streams get closed properly):
Compressing a given file using XZ (you would certainly add exception handling and make sure all streams get closed properly):
Uncompressing a given Z compressed file (you would certainly add exception handling and make sure all streams get closed properly):
The implementation of this package is provided by the Zstandard JNI library.
Uncompressing a given Zstandard compressed file (you would certainly add exception handling and make sure all streams get closed properly):
Compressing a given file using the Zstandard format (you would certainly add exception handling and make sure all streams get closed properly):
Starting in release 1.13, it is possible to add Compressor- and ArchiverStream implementations using Java's ServiceLoader mechanism.
To provide your own compressor, you must make available on the classpath a file called META-INF/services/org.apache.commons.compress.compressors.CompressorStreamProvider.
This file MUST contain one fully-qualified class name per line.
For example:
org.apache.commons.compress.compressors.TestCompressorStreamProvider
This class MUST implement the Commons Compress interface org.apache.commons.compress.compressors.CompressorStreamProvider.
To provide your own archiver, you must make available on the classpath a file called META-INF/services/org.apache.commons.compress.archivers.ArchiveStreamProvider.
This file MUST contain one fully-qualified class name per line.
For example:
org.apache.commons.compress.archivers.TestArchiveStreamProvider
This class MUST implement the Commons Compress interface org.apache.commons.compress.archivers.ArchiveStreamProvider.