<?xml version="1.0"?>
<!--

   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.

-->
<document>
  <properties>
    <title>Commons Compress TAR package</title>
    <author email="dev@commons.apache.org">Commons Documentation Team</author>
  </properties>
  <body>
    <section name="The TAR package">

      <p>In addition to the information stored
      in <code>ArchiveEntry</code>, a <code>TarArchiveEntry</code>
      stores various attributes including information about the
      original owner and permissions.</p>

      <p>There are several different dialects of the TAR format, maybe
      even different TAR formats. The tar package contains special
      cases in order to read many of the existing dialects and will by
      default try to create archives in the original format (often
      called "ustar"). This original format didn't support file names
      longer than 100 characters or files bigger than 8 GiB, and the tar
      package will by default fail if you try to write an entry that
      goes beyond those limits. "ustar" is the common denominator of
      all the existing tar dialects and is understood by most of the
      existing tools.</p>
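
      <p>As a rough illustration of where these limits come from: the
      traditional header stores the name in a 100-byte field and the
      size as 11 octal digits in a 12-byte field.  The sketch below is
      not part of Commons Compress; the class and method names are
      hypothetical, it assumes ASCII names, and the exact boundary
      handling inside the library may differ.</p>

<source><![CDATA[
```java
// Hypothetical helper for illustration, not part of Commons Compress.
final class UstarLimits {
    // Length of the name field in a traditional tar header, in bytes.
    static final int NAME_FIELD_LENGTH = 100;
    // The size field holds 11 octal digits plus one terminator byte.
    static final int SIZE_OCTAL_DIGITS = 11;
    // Largest size that fits: 8^11 - 1 bytes, one byte short of 8 GiB.
    static final long MAX_SIZE = (1L << (3 * SIZE_OCTAL_DIGITS)) - 1;

    // True if name and size both fit into the traditional header fields.
    // Assumes an ASCII name, so one character occupies one byte.
    static boolean fitsTraditionalHeader(String name, long size) {
        return name.length() <= NAME_FIELD_LENGTH
            && size >= 0
            && size <= MAX_SIZE;
    }

    public static void main(String[] args) {
        long eightGiB = 8L * 1024 * 1024 * 1024;
        System.out.println(Long.toOctalString(MAX_SIZE));
        System.out.println(fitsTraditionalHeader("docs/readme.txt", 1024L));
        System.out.println(fitsTraditionalHeader("big.iso", eightGiB));
    }
}
```
]]></source>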

      <p>The tar package does not support the full POSIX tar standard
      nor the more modern GNU extensions of said standard.</p>

      <subsection name="Long File Names">

        <p>The <code>longFileMode</code> option of
        <code>TarArchiveOutputStream</code> controls how files with
        names longer than 100 characters are handled.  The possible
        choices are:</p>

        <ul>
          <li><code>LONGFILE_ERROR</code>: throw an exception if such a
          file is added.  This is the default.</li>
          <li><code>LONGFILE_TRUNCATE</code>: truncate such names.</li>
          <li><code>LONGFILE_GNU</code>: use the GNU tar variant, now
          referred to as "oldgnu", for storing such names.  If you choose
          the GNU tar option, the archive cannot be extracted using
          many other tar implementations like the ones of OpenBSD,
          Solaris or MacOS X.</li>
          <li><code>LONGFILE_POSIX</code>: use a PAX <a
          href="http://pubs.opengroup.org/onlinepubs/009695399/utilities/pax.html#tag_04_100_13_03">extended
          header</a> as defined by POSIX 1003.1.  Most modern tar
          implementations are able to extract such archives. <em>Since
          Commons Compress 1.4.</em></li>
        </ul>
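
        <p>To give an idea of what the POSIX variant stores, here is a
        minimal sketch of how a single extended header record for a
        long name could be built.  A record has the form
        <code>length keyword=value</code> followed by a newline, where
        the length counts the bytes of the whole record including its
        own digits.  <code>PaxRecord</code> is a hypothetical name used
        only for this example; it is not the library's
        implementation.</p>

<source><![CDATA[
```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration, not part of Commons Compress.
final class PaxRecord {
    // Builds one record of the form "<length> <keyword>=<value>\n".
    // <length> is the decimal byte count of the whole record, including
    // the digits of <length> itself, so it is found by iteration.
    static String format(String keyword, String value) {
        String body = " " + keyword + "=" + value + "\n";
        int bodyLength = body.getBytes(StandardCharsets.UTF_8).length;
        int total = bodyLength;
        while (true) {
            // Adding the digits may push the total past a power of ten,
            // which adds another digit, so repeat until it is stable.
            int candidate = bodyLength + Integer.toString(total).length();
            if (candidate == total) {
                break;
            }
            total = candidate;
        }
        return total + body;
    }

    public static void main(String[] args) {
        StringBuilder name = new StringBuilder("some/deeply/nested/dir/");
        while (name.length() <= 100) {
            name.append("sub/");
        }
        name.append("file.txt");
        System.out.print(format("path", name.toString()));
    }
}
```
]]></source>

        <p>The value is encoded as UTF-8 here, so the length is a byte
        count rather than a character count.</p>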

        <p><code>TarArchiveInputStream</code> recognizes the GNU
        tar as well as the POSIX extensions (starting with Commons
        Compress 1.2) for long file names and reads the longer names
        transparently.</p>
      </subsection>

      <subsection name="Big Numeric Values">

        <p>The <code>bigNumberMode</code> option of
        <code>TarArchiveOutputStream</code> controls how files larger
        than 8 GiB or with other big numeric values that can't be
        encoded in traditional header fields are handled.  The
        possible choices are:</p>

        <ul>
          <li><code>BIGNUMBER_ERROR</code>: throw an exception if such an
          entry is added.  This is the default.</li>
          <li><code>BIGNUMBER_STAR</code>: use a variant first
          introduced by J&#xf6;rg Schilling's <a
          href="http://developer.berlios.de/projects/star">star</a>
          and later adopted by GNU and BSD tar.  This method is not
          supported by all implementations.</li>
          <li><code>BIGNUMBER_POSIX</code>: use a PAX <a
          href="http://pubs.opengroup.org/onlinepubs/009695399/utilities/pax.html#tag_04_100_13_03">extended
          header</a> as defined by POSIX 1003.1.  Most modern tar
          implementations are able to extract such archives.</li>
        </ul>
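
        <p>The star variant replaces the octal text in a header field
        with a binary number: the most significant bit of the first
        byte is set as a marker and the remaining bytes hold the value
        in big-endian order.  The following sketch illustrates the
        idea for non-negative values only.
        <code>Base256Field</code> and its methods are hypothetical
        names for this example and do not reflect how the library
        implements the mode.</p>

<source><![CDATA[
```java
// Hypothetical helper for illustration, not part of Commons Compress.
final class Base256Field {
    // Writes a non-negative value into a header field as binary:
    // big-endian bytes, with the top bit of the first byte set to
    // mark the field as "not octal text".
    static byte[] encode(long value, int fieldLength) {
        if (value < 0 || fieldLength < 2) {
            throw new IllegalArgumentException("not handled in this sketch");
        }
        byte[] field = new byte[fieldLength];
        long remaining = value;
        for (int i = fieldLength - 1; i >= 1; i--) {
            field[i] = (byte) (remaining & 0xff);
            remaining >>>= 8;
        }
        if (remaining != 0) {
            throw new IllegalArgumentException("value does not fit");
        }
        field[0] = (byte) 0x80;
        return field;
    }

    // Reads such a field back. Assumes the value fits into a long.
    static long decode(byte[] field) {
        if ((field[0] & 0x80) == 0) {
            throw new IllegalArgumentException("field is not binary");
        }
        long value = field[0] & 0x7f;
        for (int i = 1; i < field.length; i++) {
            value = (value << 8) | (field[i] & 0xff);
        }
        return value;
    }

    public static void main(String[] args) {
        long eightGiB = 8L * 1024 * 1024 * 1024;
        byte[] sizeField = encode(eightGiB, 12);
        System.out.println(decode(sizeField) == eightGiB);
    }
}
```
]]></source>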

        <p>Starting with Commons Compress 1.4,
        <code>TarArchiveInputStream</code> recognizes the star as
        well as the POSIX extensions for big numeric values and reads them
        transparently.</p>
      </subsection>

      <subsection name="File Name Encoding">
        <p>The original ustar format only supports 7-bit ASCII file
        names; later implementations use the platform's default
        encoding to encode file names.  The POSIX standard recommends
        using PAX extension headers for non-ASCII file names
        instead.</p>

        <p>Commons Compress 1.1 to 1.3 assumed file names would be
        encoded using ISO-8859-1.  Starting with Commons Compress 1.4
        you can specify the encoding to expect when reading as a
        parameter to <code>TarArchiveInputStream</code>, and the
        encoding to use when writing as a parameter to
        <code>TarArchiveOutputStream</code>.  Both now default to the
        platform's default encoding.</p>
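
        <p>Why the expected encoding matters can be shown without any
        tar code at all.  The sketch below decodes the same raw name
        bytes twice, once with the encoding they were written in and
        once with ISO-8859-1.  <code>NameEncodingDemo</code> is a
        hypothetical class for this example only.</p>

<source><![CDATA[
```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo for illustration, not part of Commons Compress.
final class NameEncodingDemo {
    // Decodes the raw bytes of a name field with the given charset,
    // as a reader would after being told which encoding to expect.
    static String decodeName(byte[] rawName, Charset charset) {
        return new String(rawName, charset);
    }

    public static void main(String[] args) {
        // A name containing two non-ASCII characters, stored as UTF-8.
        byte[] raw = "r\u00e9sum\u00e9.txt".getBytes(StandardCharsets.UTF_8);
        // Correct encoding: the original name comes back.
        System.out.println(decodeName(raw, StandardCharsets.UTF_8));
        // Wrong encoding: each two-byte sequence turns into two characters.
        System.out.println(decodeName(raw, StandardCharsets.ISO_8859_1));
    }
}
```
]]></source>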

        <p>Since Commons Compress 1.4, another optional parameter of
        <code>TarArchiveOutputStream</code>,
        <code>addPaxHeadersForNonAsciiNames</code>, controls whether PAX
        extension headers will be written for non-ASCII file names.
        By default they will not be written, to save space.
        <code>TarArchiveInputStream</code> will read them
        transparently if present.</p>
      </subsection>

      <subsection name="Sparse files">

        <p><code>TarArchiveInputStream</code> will recognize sparse
        file entries stored using the "oldgnu" format
        (<code>-&#x2d;sparse-version=0.0</code> in GNU tar) but is not
        able to extract them correctly.  <a href="#Unsupported
        Features"><code>canReadEntryData</code></a> will return false
        on such entries.  The other variants of sparse files currently
        cannot be detected at all.</p>
      </subsection>

      <subsection name="Consuming Archives Completely">

        <p>The end of a tar archive is signalled by two consecutive
        records of all zeros.  Unfortunately not all tar
        implementations adhere to this and some only write one record
        to end the archive.  Commons Compress will always write two
        records but stops reading an archive as soon as it finds one
        record of all zeros.</p>

        <p>Prior to version 1.5 this could leave the second EOF record
        inside the stream when <code>getNextEntry</code> or
        <code>getNextTarEntry</code> returned <code>null</code>.
        Starting with version 1.5 <code>TarArchiveInputStream</code>
        will try to read a second record as well if present,
        effectively consuming the archive completely.</p>
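
        <p>The behaviour described above can be sketched on a plain
        byte array, assuming 512-byte records: one all-zero record is
        required to end the archive, and a second one is consumed only
        if it is present.  <code>EndOfArchive</code> and its methods
        are hypothetical names; the library works on streams and its
        actual code is not shown here.</p>

<source><![CDATA[
```java
// Hypothetical helper for illustration, not part of Commons Compress.
final class EndOfArchive {
    // Size of one tar record in bytes (assumed for this sketch).
    static final int RECORD_SIZE = 512;

    // True if a complete record of all zeros starts at offset.
    static boolean isZeroRecord(byte[] data, int offset) {
        if (offset < 0 || offset + RECORD_SIZE > data.length) {
            return false;
        }
        for (int i = offset; i < offset + RECORD_SIZE; i++) {
            if (data[i] != 0) {
                return false;
            }
        }
        return true;
    }

    // Returns the position just after the end-of-archive marker that
    // starts at offset. One zero record is required; a second one is
    // skipped as well if it is there.
    static int skipEndMarker(byte[] data, int offset) {
        if (!isZeroRecord(data, offset)) {
            throw new IllegalArgumentException("no zero record at offset");
        }
        int next = offset + RECORD_SIZE;
        return isZeroRecord(data, next) ? next + RECORD_SIZE : next;
    }

    public static void main(String[] args) {
        System.out.println(skipEndMarker(new byte[2 * RECORD_SIZE], 0));
        System.out.println(skipEndMarker(new byte[RECORD_SIZE], 0));
    }
}
```
]]></source>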

      </subsection>

      <subsection name="PAX Extended Header">
        <p>The tar package has supported reading PAX extended headers
        since 1.3 for local headers and 1.11 for global headers. The
        following entries of PAX headers are applied when reading:</p>

        <dl>
          <dt>path</dt>
          <dd>set the entry's name</dd>

          <dt>linkpath</dt>
          <dd>set the entry's link name</dd>

          <dt>gid</dt>
          <dd>set the entry's group id</dd>

          <dt>gname</dt>
          <dd>set the entry's group name</dd>

          <dt>uid</dt>
          <dd>set the entry's user id</dd>

          <dt>uname</dt>
          <dd>set the entry's user name</dd>

          <dt>size</dt>
          <dd>set the entry's size</dd>

          <dt>mtime</dt>
          <dd>set the entry's modification time</dd>

          <dt>SCHILY.devminor</dt>
          <dd>set the entry's minor device number</dd>

          <dt>SCHILY.devmajor</dt>
          <dd>set the entry's major device number</dd>
        </dl>

        <p>In addition, some fields that GNU tar and star use to
        signal sparse entries are supported and are used for the
        <code>is*GNUSparse</code> and <code>isStarSparse</code>
        methods.</p>

        <p>Some PAX extra headers may be set when writing archives,
        for example for non-ASCII names or big numeric values. This
        depends on various settings of the output stream; see the
        previous sections.</p>

        <p>Since 1.15 you can directly access all PAX extension
        headers that have been found when reading an entry or specify
        extra headers to be written to a (local) PAX extended header
        entry.</p>
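
        <p>For readers who want to see what such headers look like on
        disk, the following sketch parses the body of an extended
        header, a sequence of <code>length keyword=value</code> records
        each ending in a newline, into a map.
        <code>PaxHeaderParser</code> is a hypothetical name and this is
        not the parser used by the library; it only illustrates the
        record format.</p>

<source><![CDATA[
```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration, not part of Commons Compress.
final class PaxHeaderParser {
    // Parses "<length> <keyword>=<value>\n" records into an ordered map.
    static Map<String, String> parse(byte[] data) {
        Map<String, String> headers = new LinkedHashMap<>();
        int pos = 0;
        while (pos < data.length) {
            // The decimal length runs up to the first space.
            int space = pos;
            while (space < data.length && data[space] != ' ') {
                space++;
            }
            if (space >= data.length) {
                throw new IllegalArgumentException("missing record length");
            }
            int length = Integer.parseInt(
                new String(data, pos, space - pos, StandardCharsets.US_ASCII));
            int end = pos + length; // first byte after this record
            if (length <= space - pos + 1 || end > data.length
                    || data[end - 1] != '\n') {
                throw new IllegalArgumentException("bad record length");
            }
            // Text between the space and the trailing newline.
            String record = new String(
                data, space + 1, end - space - 2, StandardCharsets.UTF_8);
            int eq = record.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("missing '=' in record");
            }
            headers.put(record.substring(0, eq), record.substring(eq + 1));
            pos = end;
        }
        return headers;
    }

    public static void main(String[] args) {
        byte[] body = "9 path=a\n15 uname=j\u00f6rg\n"
            .getBytes(StandardCharsets.UTF_8);
        System.out.println(parse(body));
    }
}
```
]]></source>

        <p>The record length is a byte count, so the value is sliced
        from the raw bytes before it is decoded as UTF-8.</p>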

        <p>Some hints if you try to set extended headers:</p>

        <ul>
          <li>PAX header keywords should be ASCII.  star/gnutar
          (SCHILY.xattr.*) do not check for this.  libarchive/bsdtar
          (LIBARCHIVE.xattr.*) uses URL encoding.</li>
          <li>PAX header values should be encoded as UTF-8 characters
          (including trailing <code>\0</code>).  star/gnutar
          (SCHILY.xattr.*) do not check for this.  libarchive/bsdtar
          (LIBARCHIVE.xattr.*) encode values using Base64.</li>
          <li>libarchive/bsdtar will read SCHILY.xattr headers, but
          will not generate them.</li>
          <li>gnutar will complain about LIBARCHIVE.xattr (and any
          other unknown) headers and will neither encode nor decode
          them.</li>
        </ul>
      </subsection>

    </section>
  </body>
</document>
