Normalize ingested API data to unicode NFC format
Some Unicode characters can be represented in either "NFC" or "NFD"
form. The C and D stand for composed and decomposed. Composed means a
character like the umlauted "u" is a single code point, whereas
decomposed means a plain "u" followed by a combining diaeresis code
point, which renders as a single character even though it is actually
two code points.
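
As a small illustration of the two forms (not part of this commit), the
following sketch compares a composed and a decomposed "ü" using Ruby's
built-in String#unicode_normalize. The variable names are made up for
the example.

```ruby
# The same visible character "ü" in both normalization forms.
composed   = "\u00FC"   # NFC: U+00FC LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"  # NFD: "u" followed by U+0308 COMBINING DIAERESIS

composed_length   = composed.length    # 1 code point
decomposed_length = decomposed.length  # 2 code points

# The strings render identically but do not compare equal.
equal_as_written = (composed == decomposed)  # false

# Normalizing either one to the other's form makes them equal.
nfd_matches = (composed.unicode_normalize(:nfd) == decomposed)  # true
nfc_matches = (decomposed.unicode_normalize(:nfc) == composed)  # true
```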

In the HFS+ era of macOS, file names were stored in NFD form which,
while valid, is not what Linux-based systems typically use: there, file
names are conventionally NFC. As of APFS, either form is accepted and
the filesystem will not modify/normalize the written paths.

However, in the case of mktorrent (probably because mktorrent uses
libraries built into macOS), paths read from the APFS disk (which are
written in NFC) are actually normalized back to NFD when mktorrent
generates the encoded data for the torrent.

This means that a torrent built with mktorrent on macOS will fail to
verify when loaded in a torrent client on a Linux machine whenever a
path within that torrent contains a string whose NFC and NFD forms
differ: the NFD path recorded in the torrent no longer matches the NFC
name on disk byte for byte.
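
A rough sketch of how to spot affected paths, assuming a hypothetical
helper named normalization_sensitive? (it does not exist in this
codebase) and a made-up example path:

```ruby
# Hypothetical helper: true when the NFC and NFD encodings of a path
# differ, i.e. the path is one that could trigger the mismatch.
def normalization_sensitive?(path)
  path.unicode_normalize(:nfc) != path.unicode_normalize(:nfd)
end

on_disk    = "Mot\u00F6rhead/01 Overkill.flac"  # NFC, as written to disk
in_torrent = on_disk.unicode_normalize(:nfd)    # what an NFD-normalizing tool would record

# Same text on screen, different bytes, so a byte-wise path lookup misses.
same_bytes = (on_disk.bytes == in_torrent.bytes)  # false

affected   = normalization_sensitive?(on_disk)                      # true
unaffected = normalization_sensitive?("Overkill/01 Overkill.flac")  # false
```

Pure-ASCII paths are identical in both forms, so only paths with
composable characters are at risk.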

This is not something I can fix here (despite this commit making sure we
write data in NFC format). There is a GitHub issue open on the mktorrent
repo: pobrn/mktorrent#14

For now I need to make sure that it is clear that we need to generate
the torrents on a Linux box until we can confirm that mktorrent makes
the right normalization decisions.
taylorthurlow committed Feb 6, 2022
1 parent f8d0bd6 commit 84fd404
Showing 3 changed files with 22 additions and 1 deletion.
1 change: 1 addition & 0 deletions lib/redacted_better/group.rb
@@ -37,6 +37,7 @@ class Group
# JSON API
def initialize(data_hash)
data_hash = Utils.deep_unescape_html(data_hash)
data_hash = Utils.deep_unicode_normalize(data_hash)

@id = data_hash["id"]
@name = data_hash["name"]
1 change: 1 addition & 0 deletions lib/redacted_better/torrent.rb
@@ -48,6 +48,7 @@ class Torrent
# @param download_directory [String] the path to the download directory
def initialize(data_hash, group, download_directory)
data_hash = Utils.deep_unescape_html(data_hash)
data_hash = Utils.deep_unicode_normalize(data_hash)

@group = group
@id = data_hash["id"]
21 changes: 20 additions & 1 deletion lib/redacted_better/utils.rb
@@ -1,9 +1,11 @@
require "open3"

module RedactedBetter
class Utils
def self.deep_unescape_html(data)
case data
when Hash
- data.map { |k, v| [k, deep_unescape_html(v)] }.to_h
+ data.transform_values { |v| deep_unescape_html(v) }
when Array
data.map { |e| deep_unescape_html(e) }
when String
@@ -12,5 +14,22 @@ def self.deep_unescape_html(data)
data
end
end

# @param normalization_format [Symbol] one of the normalization format
# symbols accepted by String#unicode_normalize
#
# @return [String]
def self.deep_unicode_normalize(data, normalization_format: :nfc)
case data
when Hash
data.transform_values { |v| deep_unicode_normalize(v) }
when Array
data.map { |e| deep_unicode_normalize(e) }
when String
data.unicode_normalize(normalization_format)
else
data
end
end
end
end
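
A usage sketch for the new helper. As shown in the diff, the recursive
calls do not pass normalization_format along, so a non-default form
would seemingly only apply to a top-level String. The variant below
forwards the keyword. The module name NormalizeSketch and the sample
payload are assumptions made for illustration, not code from this
repository.

```ruby
module NormalizeSketch
  # Recursively normalizes every String value inside nested Hashes and
  # Arrays. Hash keys and non-String values are returned untouched.
  def self.deep_unicode_normalize(data, normalization_format: :nfc)
    case data
    when Hash
      data.transform_values do |v|
        deep_unicode_normalize(v, normalization_format: normalization_format)
      end
    when Array
      data.map do |e|
        deep_unicode_normalize(e, normalization_format: normalization_format)
      end
    when String
      data.unicode_normalize(normalization_format)
    else
      data
    end
  end
end

# Made-up API-like payload containing decomposed (NFD) strings.
payload = { "name" => "Mo\u0308tley", "files" => ["u\u0308ber.flac"], "id" => 42 }

normalized = NormalizeSketch.deep_unicode_normalize(payload)
# normalized["name"]  is now "M\u00F6tley" (NFC)
# normalized["files"] is now ["\u00FCber.flac"]
# normalized["id"]    is still 42
```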
