Opened at 2007-05-25T20:02:53Z
Last modified at 2012-02-16T16:30:01Z
#57 new enhancement
logging: compress/truncate/encode/decode/format for human readability and privacy
Reported by: | zooko | Owned by: | warner |
---|---|---|---|
Priority: | minor | Milestone: | undecided |
Component: | code-nodeadmin | Version: | 0.2.0 |
Keywords: | logging privacy anonymity | Cc: | |
Launchpad Bug: |
Description
Mojo Nation's and Mnet's logging system had a humanreadable function that would be used to compress, truncate, encode, decode and format data before it was interpolated into the format string. For example, 20-byte binary identifiers can be represented as "<abcde>", and 64,000 byte strings can be represented as "abcdefg ...63,986 bytes omitted... 4567890".
This humanreadability scheme derived from the Python Standard Library's repr module.
One might ask: why should humanreadability be integrated into the logging system? Why shouldn't people write:
log("We received this string: %s, but it wasn't a substring of %s as we expected.", humanreadable(thisstring), humanreadable(thatstring))
The answer is found in ticket #56 -- laziness. Also in the fact that you sometimes want to log an object, and you don't know if that object should be turned into a humanreadable form by calling .str() on it or by passing it to humanreadable().
Although perhaps this could be both lazy and humanreadable by dint of some sort of lazy Adaptor pattern?
Change History (9)
comment:1 Changed at 2007-08-14T18:57:48Z by warner
- Component changed from code to code-nodeadmin
comment:2 Changed at 2007-10-10T23:17:19Z by warner
- Milestone set to undecided
comment:3 Changed at 2007-11-27T19:50:17Z by zooko
comment:4 follow-up: ↓ 7 Changed at 2007-11-29T20:33:52Z by warner
the foolscap logging work I've been doing this week is leading me to log calls that look like the following:
log.msg("beginning upload of SI=%s: %d peers available", storage_index, num_peers)
or maybe
log.msg("beginning upload of SI=%(si)s: %(numpeers)d peers available", si=storage_index, numpeers=num_peers)
(which is easier to implement but harder to use).
I want to keep the machine-readable data in the log events, so e.g. I want to record the num_peers integer rather than mostly-irrevocably squashing it into a string right away (which is useful for a human but more of a hassle for a log viewer that wants to apply a filter to show all begin-upload messages for which num_peers<5).
OTOH, I don't want to have the log events retain references to large or mutable objects. What I'm planning is to pickle a dictionary that includes the format string and the arguments.
Anyways, this gives us the opportunity to ask those objects how they want to be represented, and we could apply a humanreadable()-type transformation on them at that time. Note that this could occur well after the log message was first recorded: for example, a post-mortem log display tool might apply humanreadable() only if the message is actually being displayed to a human. Many uses of the logfiles will display only a tiny fraction of the log events to a human, so doing lazy conversion seems to be a good idea.
I'm not entirely convinced that this will help us hide private information from the logs: either you explicitly add code to the log.msg() call to wrap the arguments (in which case you make it a point to log e.g. the storage index instead of the full URI), or the code is blindly throwing its arguments into the log message (log.msg("callback fired: %s %s", args, kwargs), in which case the logger or humanreadable-izer or privacy-filterer needs to recursively traverse through the arguments and apply the filter to everything inside them.
Hmm. If I use foolscap serialization when recording these arguments, I get more control over what Slicers get used for each object. We could set up a context for that serialization that says "ask each object how it wants to be serialized, but remind them that we don't want to record any private information".
I still can't really see how to use that easily or safely, though.
comment:5 Changed at 2008-06-01T20:53:24Z by warner
- Milestone changed from eventually to undecided
comment:6 Changed at 2009-12-13T02:14:37Z by davidsarah
- Keywords logging added
comment:7 in reply to: ↑ 4 Changed at 2009-12-13T03:26:47Z by davidsarah
- Keywords privacy added
Replying to warner:
I'm not entirely convinced that this will help us hide private information from the logs: either you explicitly add code to the log.msg() call to wrap the arguments (in which case you make it a point to log e.g. the storage index instead of the full URI), or the code is blindly throwing its arguments into the log message (log.msg("callback fired: %s %s", args, kwargs), in which case the logger or humanreadable-izer or privacy-filterer needs to recursively traverse through the arguments and apply the filter to everything inside them.
I'm also skeptical, but adding the privacy keyword anyway.
comment:8 Changed at 2009-12-13T04:09:52Z by zooko
There are three potentially private types of data in logs: caps, filenames, and IP addresses. I've had one user decline, at first, to send me his logs, and when I persuaded him that we weren't logging any filenames or caps then he went and ran some sort of sed script on the twistd.log to censor his IP address, and then when he realized that he couldn't use sed to do that to the foolscap logs (they are in a binary pickle format) then he went ahead and sent the foolscap logs even though they had his IP address in them.
I'm not sure what the moral of the story is, other than that there are only three specific types that (as far as I know) anybody cares to withhold when sharing their logs.
comment:9 Changed at 2012-02-16T16:30:01Z by davidsarah
- Keywords anonymity added
- Summary changed from logging: compress/truncate/encode/decode/format for human readability to logging: compress/truncate/encode/decode/format for human readability and privacy
Another feature that this could implement is hiding private information from logs.
Capabilities, for example, could be securely hashed, or even just truncated, so that the logs don't contain the authorities to a bunch of stuff. (Actually the truncation for human-readability might well accomplish this already.)