Regarding how file paths are written in manifest files, the spec states the following:
If a filepath includes a Line Feed (LF), a Carriage Return (CR), a Carriage-Return Line Feed (CRLF), or a percent sign (%), those characters (and only those) MUST be percent-encoded following [RFC3986].
My reading of the intent of the spec is for the manifest files to be usable by Unix checksum utilities. However, this percent-encoding requirement breaks that compatibility. While CR and LF are rare in file paths, the requirement becomes a problem because it also necessitates encoding %. It is fairly common to percent-encode a file name if you're worried about special characters. Per the spec, these percent-encoded file names would then be double-encoded when written to the manifest, making the manifest unusable by checksum utilities.
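To make the double-encoding concrete, here is a minimal sketch of my reading of the quoted requirement. The function names `encode_manifest_path` and `decode_manifest_path` are my own and do not come from any BagIt library; treat this as an illustration rather than a reference implementation.

```python
import re


def encode_manifest_path(path: str) -> str:
    """Percent-encode only %, CR, and LF, per my reading of the quoted spec text."""
    # Encode % first so the %0D / %0A produced below are not re-encoded.
    return path.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


# Only the three sequences the encoder can produce are decoded.
_DECODE = {"%25": "%", "%0D": "\r", "%0A": "\n"}


def decode_manifest_path(encoded: str) -> str:
    """Reverse encode_manifest_path in a single pass (hex digits case-insensitive)."""
    return re.sub(
        r"%(?:25|0D|0A)",
        lambda m: _DECODE[m.group(0).upper()],
        encoded,
        flags=re.IGNORECASE,
    )


# A file whose name is already percent-encoded on disk gets encoded again.
print(encode_manifest_path("my%20file.txt"))  # my%2520file.txt
```

A checksum utility reading that manifest line would look for a file literally named `my%2520file.txt`, which does not exist on disk.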
I have browsed a large number of the existing BagIt implementations on GitHub, and I have yet to find a single one that implements this requirement correctly. Implementations either 1) do no encoding or 2) encode only CR and LF and do not encode %. The first behavior is broken for file names that contain an LF or CR, and the second is broken for file names that are naturally percent-encoded. Both are also incompatible with an implementation that actually follows the spec.
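One way to see why the second behavior is broken: if % is left alone, two different file names can produce the same manifest entry. The sketch below is a hypothetical model of that behavior (`encode_partial` is my own name and is not taken from any particular implementation).

```python
def encode_partial(path: str) -> str:
    """Hypothetical model of behavior 2: encode CR and LF, leave % untouched."""
    return path.replace("\r", "%0D").replace("\n", "%0A")


with_newline = "new\nline"       # contains a real LF
already_encoded = "new%0Aline"   # contains the literal characters %, 0, A

# Both names map to the same manifest entry, so a reader cannot tell them apart.
collision = encode_partial(with_newline) == encode_partial(already_encoded)
print(collision)  # True
```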
I am currently working on yet another implementation, and it's hard to decide what to do here. If I implement the spec as written, my bags will be unusable by any other implementation. This seems to suggest that I should ignore the spec and not encode anything, which is the more prevalent behavior and is less broken than doing the partial encoding.
Unix checksum utilities use an entirely different mechanism to handle newlines within file names. When there is a newline in a file name, the newline is represented as \n and a \ is added to the beginning of the line. Additionally, literal \ characters are escaped with another \, and in that case a \ is also added to the beginning of the line.
For example, suppose we have a file named new\nline (important: this must be an actual newline and not the characters \ and n) and one named back\slash, and then execute the following:
# On linux
$ sha256sum *
\e6f805fa5fc041ab4bb7aa119641f77ac3e9f42106bc9f92354080692736c8de back\\slash
\7ba826f0c347f6adc4686c8d1f61aeb2e2e98322749cd4f82204c926f4022cee new\nline
# On mac
$ shasum -a 256 *
\e6f805fa5fc041ab4bb7aa119641f77ac3e9f42106bc9f92354080692736c8de back\\slash
\7ba826f0c347f6adc4686c8d1f61aeb2e2e98322749cd4f82204c926f4022cee new\nline
This seems like a much more reasonable encoding to support, though it is a shame that the output of these utilities is not codified.
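For comparison, the escaping described above can be sketched in a few lines. This is only my approximation of the observed output, not code from the utilities themselves. `escape_filename` and `manifest_line` are hypothetical names, and the two-space separator is what GNU sha256sum prints in text mode rather than anything the BagIt spec mandates.

```python
import hashlib


def escape_filename(name: str):
    """Return (escaped_name, was_escaped) using the backslash scheme described above."""
    # Escape backslashes first so the backslash introduced for \n is not doubled.
    escaped = name.replace("\\", "\\\\").replace("\n", "\\n")
    return escaped, escaped != name


def manifest_line(name: str, data: bytes) -> str:
    """Format one checksum line; a leading backslash flags an escaped file name."""
    digest = hashlib.sha256(data).hexdigest()
    escaped, was_escaped = escape_filename(name)
    prefix = "\\" if was_escaped else ""
    # Separator assumption: two spaces, as in GNU sha256sum text-mode output.
    return f"{prefix}{digest}  {escaped}"


print(escape_filename("new\nline")[0])    # new\nline  (backslash + n, not a newline)
print(escape_filename("back\\slash")[0])  # back\\slash
```

Unlike the percent-encoding rule, this scheme leaves % alone, so file names that are already percent-encoded pass through unchanged.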