[tahoe-dev] TBFS: The Tar-Baby Filesystem [was: RFC: So, what *shall* we do about st_birthtime, anyway?]

Tom Christiansen tchrist at perl.com
Fri Feb 27 19:39:28 PST 2009


PREFACE:

  *Please* consider this email public and freely redistributable.
  Please do pass it along to any and all concerned parties.

  I *REALLY* wonder what Kirk (or Linus; or Margo; or Ouster; or Ken or
  Dennis or Rob Pike) would say about all this?  My dim guess is that
  Ken&Rob or Linus might be the most sensitive to it, for historical
  reasons.  I'd dig into the Proceedings from the long years of the
  filesystem wars, but I don't know that the following troubles were yet
  raised then and there.  Jeff may recall better than I, as he was once
  pretty heavily involved in at least a couple of the areas of the dread
  problems facing you that I herebelow sketch some pieces of.

SUMMARY:

  We should discuss this all at some length and probably in person, but if
  you respond briefly with some of your tenets, goals, and requirements,
  after having given due consideration to the many matters I raise, I'll
  try to provide brief demonstration cases to illuminate where the holes
  are.  But these depend on concerns you now seem to be unaware of.

  In short: I'm afraid I'm about to pop your blissful-ignorance
  cherry-bubble.  Do please forgive me, and may Amber console you! :-)

On Fri, 27 Feb 2009 13:27:29 MST you wrote:

>>> Heh heh heh.  Coincidentally, we've been wrestling with both
>>> file timestamps [1] and character encodings [2] on the tahoe-
>>> dev list in the last couple of weeks.  :-)

    [1] http://allmydata.org/pipermail/tahoe-dev/2009-February/001194.html
    [2] http://allmydata.org/pipermail/tahoe-dev/2009-February/001343.html

>> I begin to think you can't escape another metadata tag for the
>> filename itself, which is just totally hosed.

> I don't understand this sentence.  I've just made a proposal for how
> to handle these things in the tahoe tool, linked in my previous
> mail.  Hopefully you can infer enough of the context.

>   http://allmydata.org/pipermail/tahoe-dev/2009-February/001343.html

Ok, I read it.  Twice or thrice.

Dear Bryce, I am sorry to say that what you there propose for your
filesystem to me appears woefully far from sufficient for the matter.
While this may be because I am speaking from limited or incomplete
knowledge, which I am, I assure you I mean no offence whatsoever.

I had *thought* we'd discussed these matters before in person,
but perhaps not.  I guess maybe it was Jeff Haemer I went through
all this with, not with you.  Or maybe it was Todd.  ENOMEM.

While I do agree that UTF-8 encoding is a *start* of a solution,
I cringe, stricken with horror at the notion of a Latin1 fallback.
It can only exacerbate a problem you cannot escape, one that I
shall begin to outline just a wee bit further below.

Plus, we've had some pretty rough times with Perl's own Unicode design,
development, and implementation: many stemming from that sort of tactic.
And so alas I speak not from a position of pure theory here, but with
experience's scars yet unfaded.

The compass of your designs will depend on just *which* of many problems,
some interconnected, others not, you're trying or hoping to solve:

  --> Kernel treatment vs userland treatment, and any necessary or
      desired interactions between these two.

  --> Think about non-linear namei(9) resolutions to cope with large
      directories; meaning, those w/many files.  Some form of hashing
      (or occasionally, splay trees) is typically employed for this
      trouble, but one must be FAR more clever than normal in selecting
      one's hashing algorithm to avoid degrading to worst-case, full-scan
      linear lookup.  Fortunately, this is the least of your troubles, as
      it is not a terribly complex subject and well studied, *and* I have
      good data and testing for it which you may use to evaluate how good
      a job you're doing.

  --> Creating a wholly new scheme for one's pet filesystem's filenaming
      metadata.  You might think this is the best way--and it is--but
      it still lies fraught--plagued even--with NEARLY IRRESOLUBLE troubles.

      Keep reading.

  --> Even UCS-2 can be pretty bad.  People jumped way too fast on it,
      including the Java and M$FT folks.  It is limited to Plane-0 unless
      one sneaks off into nasty surrogates.  Plus unlike UTF-8, it is
      *VERY* slow to recover from midstream errors.  Ken&Rob's UTF-8,
      in contrast, recovers nearly immediately.  ('Twas Andrew Hume who
      explained and contrasted this deficiency to me some months back at
      the previous LISA.)
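
      A minimal sketch of the recovery difference, using core Perl only.
      The helper name utf8_resync and the sample text are invented for
      illustration, and big-endian UCS-2 is assumed:

      ```perl
      use strict;
      use warnings;

      # The same ten characters in both encodings.
      my $text = "na\x{EF}ve caf\x{E9}";
      my $utf8 = $text;
      utf8::encode($utf8);                                 # 12 octets
      my $ucs2 = pack "n*", map { ord } split //, $text;   # 20 octets

      # UTF-8 continuation octets always look like 10xxxxxx, so a reader
      # dropped at an arbitrary offset skips at most three octets.
      sub utf8_resync {
          my ($octets, $pos) = @_;
          $pos++ while $pos < length($octets)
                    && (ord(substr($octets, $pos, 1)) & 0xC0) == 0x80;
          return $pos;
      }

      # Offset 3 is the second octet of the two-octet sequence C3.AF.
      my $resumed = substr($utf8, utf8_resync($utf8, 3));
      utf8::decode($resumed);
      printf "UTF-8 resumes with: %s\n",
          join ".", map { sprintf "%04X", ord } split //, $resumed;

      # UCS-2 carries no such marker: start one octet late and every
      # 16-bit unit from there on is paired up wrongly.
      my @shifted = unpack "n*", substr($ucs2, 1);
      printf "UCS-2 misreads as:  %s\n",
          join ".", map { sprintf "%04X", $_ } @shifted;
      ```

      The UTF-8 reader is back in step after skipping a single octet; the
      UCS-2 reader has nothing in the data to tell it that it is off by one.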

  --> Compatibility concerns with old datasets.  That is, coping with
      legacy systems' filename data, a partially intractable problem due
      to the old 8-bit encodings of ISO-8859-X, or legacy CJK(V) 16-bit
      encodings like the many different flavors of Shift-JIS or the whole
      EUC fiasco; even UCS-2.

      Let's suppose you want to upgrade an old filesystem's style to your
      unified style.  Should be easy, right?  Well, it is not.  And it's
      not even merely "not-easy"; it may well be DOWN-RIGHT IMPOSSIBLE,
      excepting only if *ALL* users shall have used the same encoding.

      Why?

      Consider the problem of an old-style filesystem where people made
      filenames, but EACH according to their own POSIX locale.  You cannot
      distinguish from the bytes which they meant, because these are all
      8-bit encodings.  But to upconvert to Unicode, you have to know their
      origin locales.  ARGH!

          ISO-8859-1      Latin1, "Western European"
          ISO-8859-2      Latin2, "Central European"
          ISO-8859-5      ASCII Latin (low bytes) + Cyrillic (high bytes)
          ISO-8859-7      ASCII Latin (low bytes) + Greek (high bytes)
          ISO-8859-14     Latin8, for Celtic tongues: Welsh, Gaelic, Breton
          ISO-8859-15     Latin9, kinda like Latin1 but w/ Euro sym and others
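
      A small sketch of that ambiguity, assuming the standard Encode
      module; the helper name codepoint_in and the choice of octet 0xF1
      are mine, picked only for illustration:

      ```perl
      use strict;
      use warnings;
      use Encode qw(decode);

      # Decode a single octet under the named charset and return the
      # resulting Unicode code point.
      sub codepoint_in {
          my ($charset, $octet) = @_;
          return ord decode($charset, $octet);
      }

      # One octet, four incompatible readings.  Nothing in the octet
      # itself says which locale its author was working in.
      for my $charset (qw(iso-8859-1 iso-8859-2 iso-8859-5 iso-8859-7)) {
          printf "%-11s 0xF1 => U+%04X\n",
              $charset, codepoint_in($charset, "\xF1");
      }
      ```

      The four results are n-tilde, n-acute, Cyrillic io, and Greek rho:
      the same stored byte, four different filenames.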

      And what about the subfenestrated vulgate?  *They* have their own
      8-bit encodings, you know, such as

          Code Page 1250  East European Latin
          Code Page 1252  West European Latin
          Code Page 1253  Greek
          Code Page 1251  Cyrillic

      and many others, but these do *not* map to the ISO-8859 standards,
      much as they pretend to.  Time and again I get mail
      marked ISO-8859-1 but which is really CP 1252.  That means it won't
      display right.  Drives me nuts.
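
      To illustrate the mislabelling, a minimal sketch assuming the
      standard Encode module; the sample octets are a made-up example of
      the curly quotes that CP 1252 keeps at 0x93 and 0x94:

      ```perl
      use strict;
      use warnings;
      use Encode qw(decode);

      my $octets = "\x93quoted\x94";

      my $as_labelled = decode("iso-8859-1", $octets);  # what the header claims
      my $as_written  = decode("cp1252",     $octets);  # what the sender meant

      printf "ISO-8859-1: U+%04X ... U+%04X\n",
          ord($as_labelled), ord(substr($as_labelled, -1));
      printf "CP 1252:    U+%04X ... U+%04X\n",
          ord($as_written),  ord(substr($as_written,  -1));
      ```

      Read as ISO-8859-1, those two octets are C1 control characters;
      read as CP 1252, they are the left and right double quotation marks.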

  --> Support for case-ignorant filesystems, as well as support for
      case-insensitive access to case-sensitive ones.  These truly are
      amongst *the* most terrible things you can conceive of!  I am
      almost certain you do not even BEGIN to realize how bad it truly
      is, for if you did, I'd expect you to have a very chunky bottle
      of powerful benzodiazepines on hand, and insofar as I am aware,
      you do not.

      One mere tip of this multi-peaked iceberg is exposed here:

            $ perl -le 'for(@ARGV) {
                $_ = oct if /^0/;
                printf "uc of chr %d (U+%04X) is U+%v04X\n",
                    $_, $_, uc pack U => $_;
              }' 0xB5 0xDF 0xFF

            uc of chr 181 (U+00B5) is U+039C
            uc of chr 223 (U+00DF) is U+0053.0053
            uc of chr 255 (U+00FF) is U+0178

      That is:

            MICRO SIGN
                => GREEK CAPITAL LETTER MU
            LATIN SMALL LETTER SHARP S
                => LATIN CAPITAL LETTER S . LATIN CAPITAL LETTER S
            LATIN SMALL LETTER Y WITH DIAERESIS
                => LATIN CAPITAL LETTER Y WITH DIAERESIS

      Even worse: the character length *and/or* the byte length
      (based on encoding, but assuming UTF-8) both change with
      these.  Isn't that... charming?
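
      A short sketch that measures those changes, assuming Perl 5.12 or
      later so that uc follows Unicode rules on every string.  The helper
      name lengths and the extra sample code points (U+0131, U+0149,
      U+FB00) are additions for illustration:

      ```perl
      use v5.12;       # turns on strict and unicode_strings
      use warnings;

      # Length of a character string in characters and in UTF-8 octets.
      sub lengths {
          my ($string) = @_;
          my $octets = $string;
          utf8::encode($octets);
          return (length $string, length $octets);
      }

      for my $cp (0xDF, 0x131, 0x149, 0xFB00) {
          my $lower = chr $cp;
          my $upper = uc $lower;
          printf "U+%04X: %d char/%d octets => U+%v04X: %d char/%d octets\n",
              $cp, lengths($lower), $upper, lengths($upper);
      }
      ```

      Any code that sizes a buffer, a hash key, or a directory entry from
      the length of the original name is therefore on thin ice once case
      mapping enters the picture.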

  --> Do you *really* expect that you would DARE allow someone
      to create multiple files in a directory *ALL* named émigré?

      You could do so in no fewer than 5 ways, and probably more.
      The first is the Latin1 version, in hex bytes, which is:

          E9.6D.69.67.72.E9

      While that is a fair description of the string's Unicode code
      points, it is **NOT** UTF-8 (which I shall reveal below)!

      First, you must consider how either "é" might be

          U+00E9  LATIN SMALL LETTER E WITH ACUTE

      or the equally valid

          U+0065  LATIN SMALL LETTER E
          U+0301  COMBINING ACUTE ACCENT

      So given:

         ($e1, $e2) = ("\x{E9}", "\x{65}\x{301}");

      You could spell thus the very same word, émigré, in the
      following ways in Unicode:

        "${e1}migr${e1}"
            In Unicode chars: E9.6D.69.67.72.E9
            In UTF-8 octets:  C3.A9.6D.69.67.72.C3.A9
        "${e1}migr${e2}"
            In Unicode chars: E9.6D.69.67.72.65.301
            In UTF-8 octets:  C3.A9.6D.69.67.72.65.CC.81
        "${e2}migr${e1}"
            In Unicode chars: 65.301.6D.69.67.72.E9
            In UTF-8 octets:  65.CC.81.6D.69.67.72.C3.A9
        "${e2}migr${e2}"
            In Unicode chars: 65.301.6D.69.67.72.65.301
            In UTF-8 octets:  65.CC.81.6D.69.67.72.65.CC.81
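
      Those four listings can be regenerated with a few lines of core
      Perl.  This is only a sketch; the helper names codepoints and
      utf8_octets are invented for the purpose:

      ```perl
      use strict;
      use warnings;

      my ($e1, $e2) = ("\x{E9}", "\x{65}\x{301}");

      # Dotted hex dump of a string's code points.
      sub codepoints {
          return join ".", map { sprintf "%X", ord } split //, $_[0];
      }

      # Dotted hex dump of the same string once encoded as UTF-8.
      sub utf8_octets {
          my $octets = $_[0];
          utf8::encode($octets);
          return join ".", map { sprintf "%02X", ord } split //, $octets;
      }

      for my $first ($e1, $e2) {
          for my $last ($e1, $e2) {
              my $word = "${first}migr${last}";
              printf "In Unicode chars: %s\nIn UTF-8 octets:  %s\n\n",
                  codepoints($word), utf8_octets($word);
          }
      }
      ```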

      Thus you would have to consider the possibility of having all *FIVE*
      of these byte-sequences for filenames coëxisting in the very *same*
      directory, and they would *all* appear to be the same word when
      printed (well, normally not the first, but your "fall back to Latin1"
      clause makes this also so):

             6 bytes: E9.6D.69.67.72.E9
             8 bytes: C3.A9.6D.69.67.72.C3.A9
             9 bytes: C3.A9.6D.69.67.72.65.CC.81
             9 bytes: 65.CC.81.6D.69.67.72.C3.A9
            10 bytes: 65.CC.81.6D.69.67.72.65.CC.81

      They all say émigré, you know!

      Are you TRULY prepared to allow that??

        ± If you are, *how on earth* are users to distinguish these,
          and what does this mean for the kernel implementation?

        ± If you are not, what are *you* going to do about it, and
          where?  It has to be a kernel issue.  I'm truly sorry.
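
      Wherever the check ends up living, its shape would be something
      like the following sketch.  It assumes NFC as the canonical form
      and reproduces the proposed Latin1 fallback; canonical_name is a
      hypothetical helper, not anything tahoe provides:

      ```perl
      use strict;
      use warnings;
      use Unicode::Normalize qw(NFC);

      # Map raw filename octets to one canonical character string.
      # Octets that are not valid UTF-8 are left as they are, which
      # amounts to reading them as Latin1; the result is composed (NFC).
      sub canonical_name {
          my ($octets) = @_;
          my $chars = $octets;
          utf8::decode($chars);    # a no-op when the octets are not UTF-8
          return NFC($chars);
      }

      my @on_disk = (
          "\xE9migr\xE9",
          "\xC3\xA9migr\xC3\xA9",
          "\xC3\xA9migre\xCC\x81",
          "e\xCC\x81migr\xC3\xA9",
          "e\xCC\x81migre\xCC\x81",
      );

      my %seen;
      $seen{ canonical_name($_) }++ for @on_disk;
      printf "%d byte sequences, %d distinct name(s)\n",
          scalar @on_disk, scalar keys %seen;
      ```

      All five sequences collapse to a single name, so a directory that
      stores raw octets but compares canonical names has to refuse four
      of them, and one that compares raw octets has to show five files
      that look identical.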

  --> I'm quite certain the case-ignorant people (and filesystems?)
      would fully expect émigré and ÉMIGRÉ to be the same.  Quite
      possibly, they'd also expect emigre and Émigré to be the same.

      Both lead down smelly ratholes from which few ever return, and
      none that do come back unscathed and smelling sweet.  This
      owes its noxiousness in part to the point raised in the previous
      --> just explained, but more because it extends beyond even this
      into an area that is all but certain to turn your nose.
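
      Both expectations can be expressed with the comparison levels of
      Unicode::Collate.  A minimal sketch, using the module's default
      collation table with no language tailoring at all:

      ```perl
      use strict;
      use warnings;
      use Unicode::Collate;

      my $ignore_case    = Unicode::Collate->new(level => 2);  # accents count
      my $ignore_accents = Unicode::Collate->new(level => 1);  # base letters only

      my ($lower, $upper, $mixed, $bare) =
          ("\x{E9}migr\x{E9}", "\x{C9}MIGR\x{C9}", "\x{C9}migr\x{E9}", "emigre");

      printf "level 2: lower eq upper? %s\n",
          $ignore_case->eq($lower, $upper)   ? "yes" : "no";
      printf "level 2: bare  eq mixed? %s\n",
          $ignore_case->eq($bare, $mixed)    ? "yes" : "no";
      printf "level 1: bare  eq mixed? %s\n",
          $ignore_accents->eq($bare, $mixed) ? "yes" : "no";
      ```

      Which level a filesystem should use is exactly the policy question
      at issue; the code only shows that the two answers differ.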

      In many languages, including the Romance tongues, "e" and "é" are not
      held to be different letters.  In French, they may sort slightly
      differently (backwards to frontwards, actually), but in Spanish they
      do not.  And in languages like Icelandic, your "é" and "e" *are*
      completely different letters.  In French or Portuguese, "ç" is not a
      letter distinct from "c"; yet in Catalan, it *is*.  In Welsh or pre-
      1997 Spanish, "ch" and "ll" are their own "letters".  Additionally in
      Welsh, "dd" is a letter different from "d", as is "ff" from "f"; plus
      "ph", "rh", "ng", and "th" all get their own places in the Welsh
      alphabet, each digraph deemed a single, separate letter!

      Thus, whether something is a letter or not, and whether it's the
      same as another letter, case insensitively or otherwise, DEPENDS UPON
      THE LANGUAGE NOT THE DATA!  UTF-8 does not address this, and cannot.
      You need more.
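
      A sketch of that dependence, assuming Unicode::Collate::Locale
      (bundled with Perl since 5.14) and its "es" and "es__traditional"
      tailorings; the word list and the helper name sorted_in are made up
      for the demo:

      ```perl
      use strict;
      use warnings;
      use Unicode::Collate::Locale;

      # Sort a word list under the collation rules of the given locale.
      sub sorted_in {
          my ($locale, @words) = @_;
          my $collator = Unicode::Collate::Locale->new(locale => $locale);
          return $collator->sort(@words);
      }

      my @words = qw(cuba chico llama luz);

      for my $locale ("es", "es__traditional") {
          printf "%-16s %s\n", $locale, join " ", sorted_in($locale, @words);
      }
      ```

      The data never changes between the two runs; only the declared
      language does, and with it the answer to "which name comes first?"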

      Just look at the variant Nordic treatments of ø for more examples,
      some of which have even changed over the years by fiat, as Spanish
      did in 1997.

      Plus, what to do about letters that aren't even *in* their language?

***NOW*** do you begin to see why I quailed at the thought?  It's
also why I initially said that I begin to think you can't escape
another metadata tag for the filename itself, and that at some levels,
this very idea is itself all just totally hosed to start with.

Why, you might need more than one metatag, even!

I can see one for the character set *and* the character encoding
(which are NOT THE SAME THING!!).  That means ASCII, Unicode, Shift-
JIS, Latin1, Latin7, ISO 2022-JP, CP1252, EUC-JP, UCS-2, GB18030,
etc on old systems.  However, *if* you can somehow upgrade
everything to UTF-8, you wouldn't need that.

You would still appear to need a metatag for LANG, though.  This thought
distresses me a great deal.  Yet without it, or at least respect for the
current user's setting of such (maybe both, with an override), you can't
even know whether you have case-sensitivity collisions or not.

It's a TERRIBLE problem you've only scratched the surface of, much being
matters you were presumably previously unaware of, prior to this message.
And that's just part of it.  There's a ***WHOLE*** bunch more.  I nearly
shudder to dwell upon it.

BEST TO MY MIND IS: avoid any metatags, go all UTF-8, screw backwards
compat, keep the kernel ignorant, but rely on userspace for LANG stuff.

And even still, just what, mon cher ami, *are* you going to do about the
4+ valid ways to spell émigré?  The kernel must (well, probably should)
make decisions here, and this is *not* a good thing.  It *smells* wrong,
offends some internalized sense of cleanliness and propriety in me.

But what other choice have you?

Moi, [j'ne] sais pas.

We should discuss this all at some length and probably in person, but if
you respond briefly with some of your tenets, goals, and requirements,
I'll try to provide demos that shed light where holes are.  But doing so
properly depends first on my knowing which parts of the problemspace you
are aware of, and which ones you aren't.

OK?

--tom

PS:  I'm virtually certain Rob has recent (w/i ~decade) papers
     on using UTF-8 in filesystems.  If I were you, I'd try to
     dig these up to see whether he addresses any of this.

PPS: The program below should provide some useful insights
     about rushing into that place where angels fear to tread.
     (That is, the first DATA line is LA's true name, being
     virtually unpronounceable by Anglos, even Angelinos. :-)

--



#!/usr/bin/perl
#
# es-sort: sort Spanish-language(s) city-data according
#          to the imprimatured Unicode collation technique,
#          slow, painful, and flexible though it be--and is.
#
# Tom Christiansen
# tchrist at perl.com

use 5.010_000;

use Unicode::Collate;

$es = Unicode::Collate->new( entry => <<'ENTRY',
       0063 0068 ; [.1000.0020.0002.0063] # ch
       0043 0068 ; [.1000.0020.0007.0043] # Ch
       0043 0048 ; [.1000.0020.0008.0043] # CH
       006C 006C ; [.10F5.0020.0002.006C] # ll
       004C 006C ; [.10F5.0020.0007.004C] # Ll
       004C 004C ; [.10F5.0020.0008.004C] # LL
       00E7      ; [.0FFC.0020.0002.0063] # c-cedilla
       0063 0327 ; [.0FFC.0020.0002.0063] # c-cedilla
       00C7      ; [.0FFC.0020.0002.0043] # C-cedilla
       0043 0327 ; [.0FFC.0020.0002.0043] # C-cedilla
       00F1      ; [.112B.0020.0002.00F1] # n-tilde
       006E 0303 ; [.112B.0020.0002.00F1] # n-tilde
       00D1      ; [.112B.0020.0008.00D1] # N-tilde
       004E 0303 ; [.112B.0020.0008.00D1] # N-tilde
ENTRY
       upper_before_lower => 1,
       normalization => "NFKD",
       preprocess => sub {
          local $_ = shift;   # lexical "my $_" was removed in Perl 5.24
          # 1st strip leading articles
          s/^L'//;    # Catalan
          s{ ^
            (?:
        # Castilian
                El
              | Los
              | La
              | Las
              | Lo

        # Catalan
              | Els
              | Les
              | Sa
              | Es

        # Gallego / Portuguese
              | O
              | Os
              | A
              | As
            )
            \s+
          }{}x;
          # 2nd strip interior particles
          s/\b[dl]'//g;   # Catalan
          s{
            \b
            (?:
                el  | los | la | las | de  | del | y          # ES
              | els | les | i  | sa | es | dels               # CA
              | o   | os  | a  | as  | do  | da | dos | das   # GAL
            )
            \b
          }{}gx;
          return $_;
       },
      ) || die;

binmode(DATA,":encoding(latin1)")|| die;

chomp(@words = <DATA>);

@swords = $es->sort(@words);

for $word (@swords) {
    say $word;
    next;
    # printf "%-12s %s\n", $word, $es->viewSortKey($word);
}

__END__
La Ciudad de Nuestra Señora la Reina de Los Ángeles de Porciúncula, California Alta
Ratón, New Mexico
Monterey, California
Trinidad, Colorado
Cortez, Colorado
Los Gatos, California
Alamosa, Colorado
Monterrey, Mexico
Durango, Colorado
Peña Blvd, Denver
Piñon, Colorado
El Paso, Texas
Cañon City, Colorado
Villaviciosa
San Martín del Rey Aurelio
Viñuela
Villaconejos
Vilafranca del Penedès
Alaraz
Gallegos de Argañán
Monterrubio de Armuña
Llanes
Morille
Terrassa
Agallas
Vilanova del Camí
Cabana de Bergantiños
Alcalá del Río
Peñarrubia
Cádiz
Moriles
Villanueva del Conde
As Pontes de García Rodríguez
Pas de la Casa
Alameda del Valle
La Coruña
Muxía
Alanís
Castellar del Riu
Albacete
Rozas de Puerto Real
Aguadulce
Torrelles de Llobregat
Griñón
Tarragonès
Buenavista
Villaviciosa de Odón
Campo Real
Montesquiu
Ourense
Òrrius
Lleida
Avilés
Villaseco de los Reyes
Carabaña
Casariche
Miranda de Azán
Aiguafreda
Bóveda del Río Almar
Montejo
Islas Baleares
Villafranca de Córdoba
Sabadell
Dos Torres
Vilarmaior
Isla Mayor
La Lantejuela
Tavèrnoles
Tarragona
Vilafranca de Bonany
Peñaflor
Pajares de la Laguna
Lliçà d'Amunt
Vilanova i la Geltrú
Lena
Montejaque
Castellar del Vallès
Écija
Galisancho
Baix Llobregat
Vilanova de Sau
Castellar de n'Hug
Vilanova del Vallès
Borredà
Navarredonda de la Rinconada
Béjar
Òdena
Paradinas de San Juan
Cardeña
Llafranc
Rubió
Bigues i Riells
Villacarriedo
Lluçà
Villamanrique de la Condesa
Alba de Tormes
Cànoves i Samalús
Logroño
Villanueva de la Cañada
Lliçà de Vall
Manacor
Rubí
Aguilar de Segarra
Llinars del Vallès
Málaga
La Llagosta
Villaverde del Río
Pacs del Penedès
Vizcaya
L'Hospitalet de Llobregat
Hinojosa de Duero
Torresmenudas
Asturias
Alt Penedès
Florida de Liébana
El Álamo
Navarredonda y San Mamés
Dios le Guarde
Oviedo
Moriscos
Cabanas
Tavertet
Molinillo
Tardáguila
San Martín de la Vega
Maó
Aldehuela de Yeltes
Dosrius
Villarmuerto
As Somozas
Villaviciosa de Córdoba
Terradillos
Cáceres
Encinas de Abajo
Villamayor
Aguilar de la Frontera
Les Cabanyes
Brión
Ruiloba
Els Hostalets de Pierola
Brunete
Parres
Mañón
Bañobárez
Els Prats de Rei
Brincones
Candamo
Palaciosrubios
Villa del Río
Gallegos de Solmirón
Girona
Teruel
Huévar del Aljarafe
Aldehuela de la Bóveda
Molins de Rei
Torrelles de Foix
Herrera
Monterrubio de la Sierra
Torrelodones
Vilasantar
San Martín de Oscos
El Pont de Vilomara i Rocafort
Hinojosa del Duque
Villaescusa
Torrelaguna
Cabanillas de la Sierra
Palacios del Arzobispo
Figaró-Montmany
Alcalá de Henares
Montellano
Villalbilla
Torelló
Marbella
Doñinos de Salamanca
Herrerías
Villasrubias
Ávila
Xixón
Salamanca
Tarazona de Guareña
Navarra
Jaén
Camaleño
Villar de Ciervo
Cardona
Villa del Prado
Palenciana
Getxo
Laredo
León
Castell de l'Areny
Piélagos
Árchez
Encinas Reales
Les Bons
Casares
Montemayor del Río
O Pino
Ruesga
Pastores
Gallifa
Irún
Castañeda
Miranda del Castañar
Villanueva de Córdoba
Fogars de Montclús
Dos Hermanas
Vilaller
Villamanrique de Tajo
Peñarandilla
Baix Penedès
La Llacuna
Sitges
Álora
Manzanares el Real
Hermandad de Campo de Suso
Navarcles
Casarrubuelos
Alt Empordà
Palencia de Negrilla
A Coruña
Palau-solità i Plegamans
Larrodrigo
Bilbao
Gironès
Yecla de Yeltes
Doñinos de Ledesma
San Martín del Castañar
Álava
Torremolinos
San Martín de Valdeiglesias
Cortes de la Frontera
Córdoba
Almodóvar del Río
Vilalba Sasserra
Villalba de los Llanos
Villamanta
Gijón
El Real de la Jara
Fogars de la Selva
Doña Mencía
Peñaranda de Bracamonte
Parets del Vallès

