[tahoe-lafs-trac-stream] [tahoe-lafs] #600: storage: maybe store buckets as files, not directories

tahoe-lafs trac at tahoe-lafs.org
Wed Jul 17 13:45:33 UTC 2013


#600: storage: maybe store buckets as files, not directories
------------------------------+------------------------------------------------
     Reporter:  warner        |      Owner:  warner
         Type:  enhancement   |     Status:  new
     Priority:  minor         |  Milestone:  undecided
    Component:  code-storage  |    Version:  1.2.0
   Resolution:                |   Keywords:  storage disk-backend performance
                              |  migration crawlers brians-opinion-needed
Launchpad Bug:                |
------------------------------+------------------------------------------------
Changes (by daira):

 * keywords:  storage disk-backend performance migration crawlers =>
     storage disk-backend performance migration crawlers
     brians-opinion-needed
 * owner:   => warner


Old description:

> Our current storage-server backend share-file format defines a "bucket"
> for each storage index, into which some quantity of numbered "shares"
> are placed. The "buckets" are each represented as a directory (named
> with the base32 representation of the storage index), and the shares
> are files inside that directory. To make ext3 happier, these bucket
> directories are contained in a series of "prefix directories", one for
> each two-letter base32-alphabet string. So, if we are storing both
> share 0 and share 5 of storage index "aktyxrieysdumjed2hoynwpnl4", they
> would be located in:
>
> {{{
> NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4/0
> NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4/5
> }}}
>
> (There are two ways this makes ext3 happier: ext3 cannot have more than
> 32000 subdirectories in a single directory, and very large directories
> (lots of child files or subdirectories) have very slow lookup times.)
>
> There is a certain amount of metadata associated with each bucket. For
> mutable files, this includes the write-enabler. Both mutable and
> immutable files contain lease information. To make share-migration
> easier, we decided to make the share files stand alone, by placing this
> metadata inside the share files themselves, even though the metadata is
> really attached to the bucket. This unfortunately creates a danger for
> mutable files: some of the metadata is located at the end of the share,
> and when the share is enlarged, the server must copy the metadata to a
> new location within the file, creating a window during which it might
> be shut down, and the metadata lost.
>
> Since we might want to add even more metadata (the other-share-location
> hints, described in #599), perhaps we should consider moving this
> metadata to a separate file, so there would be one copy per bucket,
> rather than one copy per share. One approach might be to place a
> non-numeric "metadata" file in each bucket directory, so:
>
> {{{
> NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4/metadata
> NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4/0
> NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4/5
> }}}
>
> Another approach would be to stop using subdirectories for buckets
> altogether, and include the share numbers in the metadata file:
>
> {{{
> NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4.metadata
> NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4.0
> NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4.5
> }}}
>
> In this latter approach, the {{{get_buckets}}} query would be processed
> by looking for an "$SI.metadata" file. If present, the file is opened
> and a list of share numbers read out of it (as well as other metadata).
> Those share numbers are then used to compute the filenames of the
> shares themselves, and those files can then be opened.
>
> The first approach (SI/metadata) adds an extra inode and an extra block
> to the total disk used per SI (probably 8kB). The second approach
> removes a directory and adds a file, so the disk space use is probably
> neutral, except that there are now multiple copies of the (long)
> SI-based filename, which must be stored in the prefix directory's
> dnode. This approach also at least doubles the number of children kept
> in each prefix directory, although they will all be file children
> rather than subdir children, and ext3 does not appear to have an
> arbitrary limit on the number of file children that a single directory
> can hold (at least, not a small arbitrary limit like 32000).
>
> Both of these approaches make an offline share-migration tool slightly
> tougher: the tool must copy two files to a new server, not just one.
> The second approach is doubly tricky, because the metadata file must be
> modified (if, say, the sh0+sh5 pair is split up, the new metadata file
> must only reference the share that actually lives next to it). On the
> other hand, since metadata files will contain leases that are specific
> to a given server, they will likely need to be rewritten anyway.
>
> The main benefit of moving the metadata to a separate file is to reduce
> the complexity of the lease-maintenance code, by removing redundancy.
> With the current scheme, the code that walks buckets (looking for
> expired leases, etc.) must really walk shares.

New description:

 Our current storage-server backend share-file format defines a "bucket"
 for each storage index, into which some quantity of numbered "shares"
 are placed. The "buckets" are each represented as a directory (named
 with the base32 representation of the storage index), and the shares
 are files inside that directory. To make ext3 happier, these bucket
 directories are contained in a series of "prefix directories", one for
 each two-letter base32-alphabet string. So, if we are storing both
 share 0 and share 5 of storage index "aktyxrieysdumjed2hoynwpnl4", they
 would be located in:

 {{{
 NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4/0
 NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4/5
 }}}

 (There are two ways this makes ext3 happier: ext3 cannot have more than
 32000 subdirectories in a single directory, and very large directories
 (lots of child files or subdirectories) have very slow lookup times.)
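
 For concreteness, here is a minimal sketch (a hypothetical helper, not
 the actual storage-server code) of how a share's on-disk path is derived
 under this layout:

 {{{
 import os

 def share_path(nodedir, si_b32, shnum):
     """Sketch only: map a base32 storage index and share number to the
     current bucket-directory layout."""
     prefix = si_b32[:2]  # two-letter prefix directory, e.g. "ak"
     return os.path.join(nodedir, "storage", "shares",
                         prefix, si_b32, str(shnum))

 # share_path("NODEDIR", "aktyxrieysdumjed2hoynwpnl4", 0)
 #   -> "NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4/0"
 }}}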

 There is a certain amount of metadata associated with each bucket. For
 mutable files, this includes the write-enabler. [edit: Both mutable and
 immutable container files used to also contain lease information at the
 end of the file, but that is no longer true on the leasedb branch, which
 will be merged soon.]

 To make share-migration easier, we originally decided to make the share
 files stand alone, by placing this metadata inside the share files
 themselves, even though the metadata is really attached to the bucket.
 ~~This unfortunately creates a danger for mutable files: some of the
 metadata is located at the end of the share, and when the share is
 enlarged, the server must copy the metadata to a new location within
 the file, creating a window during which it might be shut down, and the
 metadata lost.~~

 Since we might want to add even more metadata (the other-share-location
 hints, described in #599), perhaps we should consider moving this
 metadata to a separate file, so there would be one copy per bucket,
 rather than one copy per share. One approach might be to place a
 non-numeric "metadata" file in each bucket directory, so:

 {{{
 NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4/metadata
 NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4/0
 NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4/5
 }}}
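
 A rough sketch of how a server might enumerate a bucket under this first
 approach (the helper below is an assumption for illustration, not real
 code): share numbers are still found by listing the bucket directory,
 skipping the non-numeric "metadata" entry.

 {{{
 import os

 def list_shares(bucket_dir):
     """Sketch only: return the share numbers stored in one bucket
     directory, ignoring the per-bucket "metadata" file."""
     return sorted(int(name) for name in os.listdir(bucket_dir)
                   if name != "metadata")
 }}}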

 Another approach would be to stop using subdirectories for buckets
 altogether, and include the share numbers in the metadata file:

 {{{
 NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4.metadata
 NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4.0
 NODEDIR/storage/shares/ak/aktyxrieysdumjed2hoynwpnl4.5
 }}}

 In this latter approach, the {{{get_buckets}}} query would be processed
 by looking for an "$SI.metadata" file. If present, the file is opened
 and a list of share numbers read out of it (as well as other metadata).
 Those share numbers are then used to compute the filenames of the
 shares themselves, and those files can then be opened.
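
 A minimal sketch of that lookup (the helper name and the
 one-share-number-per-line metadata format are assumptions, not a real
 file format):

 {{{
 import os

 def get_bucket_shares(shares_dir, si_b32):
     """Sketch only: find the shares for one storage index when buckets
     are flat "$SI.metadata" / "$SI.N" files in the prefix directory."""
     prefix_dir = os.path.join(shares_dir, si_b32[:2])
     metadata_path = os.path.join(prefix_dir, si_b32 + ".metadata")
     if not os.path.exists(metadata_path):
         return {}  # no bucket for this storage index
     with open(metadata_path) as f:
         shnums = [int(line) for line in f if line.strip()]
     return dict((shnum, os.path.join(prefix_dir, "%s.%d" % (si_b32, shnum)))
                 for shnum in shnums)
 }}}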

 The first approach (SI/metadata) adds an extra inode and an extra block
 to the total disk used per SI (probably 8kB). The second approach
 removes a directory and adds a file, so the disk space use is probably
 neutral, except that there are now multiple copies of the (long)
 SI-based filename, which must be stored in the prefix directory's
 dnode. This approach also at least doubles the number of children kept
 in each prefix directory, although they will all be file children
 rather than subdir children, and ext3 does not appear to have an
 arbitrary limit on the number of file children that a single directory
 can hold (at least, not a small arbitrary limit like 32000).

 Both of these approaches make an offline share-migration tool slightly
 tougher: the tool must copy two files to a new server, not just one.
 The second approach is doubly tricky, because the metadata file must be
 modified (if, say, the sh0+sh5 pair is split up, the new metadata file
 must only reference the share that actually lives next to it). On the
 other hand, since metadata files will contain leases that are specific
 to a given server, they will likely need to be rewritten anyway.

 The main benefit of moving the metadata to a separate file is to reduce
 the complexity of the lease-maintenance code, by removing redundancy.
 With the current scheme, the code that walks buckets (looking for
 expired leases, etc.) must really walk shares.

--

Comment:

 I'm not sure this ticket is still relevant for the leasedb branch.
 Brian?

-- 
Ticket URL: <https://tahoe-lafs.org/trac/tahoe-lafs/ticket/600#comment:2>
tahoe-lafs <https://tahoe-lafs.org>
secure decentralized storage

