# -*- coding: utf-8 -*-
# Copyright 2012 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Additional help about using gsutil for production tasks."""

from __future__ import absolute_import

from gslib.help_provider import HelpProvider

_DETAILED_HELP_TEXT = ("""
<B>OVERVIEW</B>
  If you use gsutil in large production tasks (such as uploading or
  downloading many GiBs of data each night), there are a number of things
  you can do to help ensure success. Specifically, this section discusses
  how to script large production tasks around gsutil's resumable transfer
  mechanism.


<B>BACKGROUND ON RESUMABLE TRANSFERS</B>
  First, it's helpful to understand gsutil's resumable transfer mechanism,
  and how your script needs to be implemented around this mechanism to work
  reliably. gsutil uses resumable transfer support when you attempt to upload
  or download a file larger than a configurable threshold (by default, this
  threshold is 2 MiB). When a transfer fails partway through (e.g., because of
  an intermittent network problem), gsutil uses a truncated randomized binary
  exponential backoff-and-retry strategy that by default retries transfers up
  to 23 times over a 10-minute period (see "gsutil help retries" for
  details). If the transfer fails on each of these attempts with no intervening
  progress, gsutil gives up on the transfer, but keeps a "tracker" file for
  it in a configurable location (the default location is ~/.gsutil/, in a file
  named by a combination of the SHA1 hash of the name of the bucket and object
  being transferred and the last 16 characters of the file name). When transfers
  fail in this fashion, you can rerun gsutil at some later time (e.g., after
  the networking problem has been resolved), and the resumable transfer picks
  up where it left off.
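
  As a rough illustration of the retry schedule described above, the
  following Python sketch computes the waits for a truncated randomized
  binary exponential backoff. It is not gsutil's actual implementation: the
  function name backoff_delays and the 60 second cap are assumptions made for
  this example (see "gsutil help retries" for the real settings).

```python
import random


def backoff_delays(num_retries=23, max_delay=60.0, rng=random):
  # Hypothetical helper: returns one randomized wait (in seconds) per retry.
  # The defaults are illustrative assumptions, not values read from gsutil.
  delays = []
  for attempt in range(num_retries):
    # Binary exponential growth (1, 2, 4, ...), truncated at max_delay.
    ceiling = min(2.0 ** attempt, max_delay)
    # Randomization: wait a uniform random time between 0 and the ceiling.
    delays.append(rng.uniform(0, ceiling))
  return delays


delays = backoff_delays()
print('retries: {0}, total wait: {1:.0f} seconds'.format(
    len(delays), sum(delays)))
```

  With these assumed values the waits sum to at most 1083 seconds, and to
  about half of that on average, which is roughly in line with the 10-minute
  figure mentioned above.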


<B>SCRIPTING DATA TRANSFER TASKS</B>
  To script large production data transfer tasks around this mechanism,
  you can implement a script that runs periodically, determines which file
  transfers have not yet succeeded, and runs gsutil to copy them. Below,
  we offer a number of suggestions about how this type of scripting should
  be implemented:

  1. When resumable transfers fail without any progress 23 times in a row
     over the course of up to 10 minutes, it probably won't work to simply
     retry the transfer immediately. A more successful strategy would be to
     have a cron job that runs every 30 minutes, determines which transfers
     need to be run, and runs them. If the network experiences intermittent
     problems, the script picks up where it left off and will eventually
     succeed (once the network problem has been resolved).

  2. If your business depends on timely data transfer, you should consider
     implementing some network monitoring. For example, you can implement
     a task that attempts a small download every few minutes (or more or less
     frequently, depending on your requirements) and raises an alert if
     several attempts in a row fail, so that your IT staff can investigate
     problems promptly. As usual with monitoring implementations, you should
     experiment with the alerting thresholds, to avoid false-positive alerts
     that cause your staff to begin ignoring the alerts.

  3. There are a variety of ways you can determine which files remain to be
     transferred. We recommend that you avoid attempting to get a complete
     listing of a bucket containing many objects (e.g., tens of thousands
     or more). One strategy is to structure your object names in a way that
     represents your transfer process, and use gsutil prefix wildcards to
     request partial bucket listings. For example, if your periodic process
     involves downloading the current day's objects, you could name objects
     using a year-month-day-object-ID format and then find today's objects by
     using a command like gsutil ls "gs://bucket/2011-09-27-*". Note that it
     is more efficient to have a non-wildcard prefix like this than to use
     something like gsutil ls "gs://bucket/*-2011-09-27". The latter command
     actually requests a complete bucket listing and then filters in gsutil,
     while the former asks Google Cloud Storage to return the subset of
     objects whose names start with everything up to the "*".

     For data uploads, another technique would be to move local files from a
     "to be processed" area to a "done" area as your script successfully copies
     files to the cloud. You can do this in parallel batches by using a command
     like:

       gsutil -m cp -r to_upload/subdir_$i gs://bucket/subdir_$i

     where i is a shell loop variable. Make sure to check that the shell exit
     status ($? in sh and bash, $status in csh and tcsh) is 0 after each
     gsutil cp command, to detect if some of the copies failed, and rerun the
     affected copies.

     With this strategy, the file system keeps track of all remaining work to
     be done.

  4. If you have very large numbers of objects in a single bucket
     (say, hundreds of thousands or more), you should consider tracking your
     objects in a database instead of using bucket listings to enumerate
     the objects. For example, this database could track the state of your
     downloads, so you can determine which objects need to be downloaded by
     your periodic download script by querying the database locally instead
     of performing a bucket listing.

  5. Make sure you don't delete partially downloaded temporary files after a
     transfer fails: gsutil picks up where it left off (and performs a hash
     of the final downloaded content to ensure data integrity), so deleting
     partially transferred files will cause you to lose progress and make
     more wasteful use of your network.

  6. If you have a fast network connection, you can speed up the transfer of
     large numbers of files by using the gsutil -m (multi-threading /
     multi-processing) option. Be aware, however, that gsutil doesn't attempt to
     keep track of which files were downloaded successfully in cases where some
     files failed to download. For example, if you use multi-threaded transfers
     to download 100 files and 3 failed to download, it is up to your scripting
     process to determine which transfers didn't succeed, and retry them. A
     periodic check-and-run approach like the one outlined earlier would handle
     this case.

     If you use parallel transfers (gsutil -m) you might want to experiment with
     the number of threads being used (via the parallel_thread_count setting
     in the .boto config file). By default, gsutil uses 10 threads on Linux
     and 24 threads on other operating systems. Depending on your network
     speed, available memory, CPU load, and other conditions, this may or may
     not be optimal. Try experimenting with higher or lower numbers of threads
     to find the best number of threads for your environment.
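
  The periodic check-and-run approach suggested above, combined with moving
  uploaded files from a "to be processed" area to a "done" area, could be
  sketched in Python roughly as follows. This is only an illustration under
  stated assumptions: the function name upload_pending, its parameters, and
  the directory layout are hypothetical, and both directories are assumed to
  exist already.

```python
import os
import shutil
import subprocess


def upload_pending(todo_dir, done_dir, bucket_url, run=subprocess.call):
  # Hypothetical helper: copies each file in todo_dir to bucket_url and
  # moves it to done_dir only if gsutil exits with status 0.
  failed = []
  for name in sorted(os.listdir(todo_dir)):
    src = os.path.join(todo_dir, name)
    if not os.path.isfile(src):
      continue
    # The run argument is injectable so the loop can be tried without gsutil.
    status = run(['gsutil', 'cp', src, bucket_url + '/' + name])
    if status == 0:
      shutil.move(src, os.path.join(done_dir, name))
    else:
      # Leave the file in place so the next periodic run retries it.
      failed.append(name)
  return failed
```

  A cron job could then call, for example,
  upload_pending('to_upload', 'uploaded', 'gs://bucket') every 30 minutes.
  Files whose copy failed stay in the to_upload directory and are retried on
  the next run, so the file system keeps track of the remaining work.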
""")


class CommandOptions(HelpProvider):
  """Additional help about using gsutil for production tasks."""

  # Help specification. See help_provider.py for documentation.
  help_spec = HelpProvider.HelpSpec(
      help_name='prod',
      help_name_aliases=[
          'production', 'resumable', 'resumable upload', 'resumable transfer',
          'resumable download', 'scripts', 'scripting'],
      help_type='additional_help',
      help_one_line_summary='Scripting Production Transfers',
      help_text=_DETAILED_HELP_TEXT,
      subcommand_help_text={},
  )