Normalize ingested API data to unicode NFC format

Some Unicode characters can be represented in either "NFC" or "NFD" form. The C and D stand for composed and decomposed. Composed means a character like the umlauted "u" is a single code point, whereas decomposed means a plain "u" followed by a combining diaeresis code point, which together are displayed as a single character even though they are actually two code points.

In the HFS+ era of macOS, all file names were stored in NFD form which, while valid, is not what Linux-based systems typically do: they use NFC. As of APFS, either form is accepted and the filesystem does not modify or normalize the written paths.

However, in the case of mktorrent (probably because mktorrent uses libraries built in to macOS), paths read from the APFS disk (which were written in NFC) are normalized back to NFD when mktorrent generates the encoded data for the torrent. This means that a torrent built with mktorrent on macOS will fail to verify when loaded in a torrent client on a Linux machine whenever a path within that torrent contains a string that differs between the NFC and NFD normalization forms.

This is not something I can fix here, even though this commit makes sure we write data in NFC form. There is a GitHub issue open on the mktorrent repo: pobrn/mktorrent#14. For now I need to make it clear that torrents must be generated on a Linux box until we can confirm that mktorrent makes the right normalization decisions.
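
The composed/decomposed distinction described above can be reproduced in a few lines of Ruby. This is a minimal sketch, not part of the commit: the variable names are illustrative, and it only relies on `String#unicode_normalize` from the standard library.

```ruby
# Both strings render as "ü" but are built from different code points.
composed   = "\u00FC"   # U+00FC LATIN SMALL LETTER U WITH DIAERESIS (NFC form)
decomposed = "u\u0308"  # "u" followed by U+0308 COMBINING DIAERESIS (NFD form)

puts composed == decomposed   # false: the underlying bytes differ
puts composed.length          # 1 code point
puts decomposed.length        # 2 code points
puts composed.bytesize        # 2 bytes in UTF-8
puts decomposed.bytesize      # 3 bytes in UTF-8

# Normalizing converts one form into the other.
puts decomposed.unicode_normalize(:nfc) == composed    # true
puts composed.unicode_normalize(:nfd) == decomposed    # true
```

Because the two forms have different bytes, a path encoded in one form will not match the same path encoded in the other when compared byte for byte, which is consistent with the verification failure described in the message.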
taylorthurlow · Feb 6, 2022 · 84fd404
1 parent f8d0bd6
commit 84fd404
Showing 3 changed files with 22 additions and 1 deletion.
diff --git a/lib/redacted_better/group.rb b/lib/redacted_better/group.rb
@@ -37,6 +37,7 @@ class Group
     #   JSON API
     def initialize(data_hash)
       data_hash = Utils.deep_unescape_html(data_hash)
+      data_hash = Utils.deep_unicode_normalize(data_hash)
 
       @id = data_hash["id"]
       @name = data_hash["name"]

diff --git a/lib/redacted_better/torrent.rb b/lib/redacted_better/torrent.rb
@@ -48,6 +48,7 @@ class Torrent
     # @param download_directory [String] the path to the download directory
     def initialize(data_hash, group, download_directory)
       data_hash = Utils.deep_unescape_html(data_hash)
+      data_hash = Utils.deep_unicode_normalize(data_hash)
 
       @group = group
       @id = data_hash["id"]

diff --git a/lib/redacted_better/utils.rb b/lib/redacted_better/utils.rb
@@ -1,9 +1,11 @@
+require "open3"
+
 module RedactedBetter
   class Utils
     def self.deep_unescape_html(data)
       case data
       when Hash
-        data.map { |k, v| [k, deep_unescape_html(v)] }.to_h
+        data.transform_values { |v| deep_unescape_html(v) }
       when Array
         data.map { |e| deep_unescape_html(e) }
       when String
@@ -12,5 +14,22 @@ def self.deep_unescape_html(data)
         data
       end
     end
+
+    # @param normalization_format [Symbol] one of the normalization format
+    #   symbols accepted by String#unicode_normalize
+    #
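+    # @return [Object] the input with every nested String value normalized
+    def self.deep_unicode_normalize(data, normalization_format: :nfc)
+      case data
+      when Hash
+        data.transform_values { |v| deep_unicode_normalize(v, normalization_format: normalization_format) }
+      when Array
+        data.map { |e| deep_unicode_normalize(e, normalization_format: normalization_format) }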
+      when String
+        data.unicode_normalize(normalization_format)
+      else
+        data
+      end
+    end
   end
 end
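
A rough usage sketch of the recursive normalization, for illustration only. The method below is a standalone restatement of the helper (outside the `RedactedBetter::Utils` class) that forwards `normalization_format` in its recursive calls, and the payload and its field names are hypothetical, not taken from the real API.

```ruby
# Standalone restatement of the helper so the example runs on its own.
def deep_unicode_normalize(data, normalization_format: :nfc)
  case data
  when Hash
    # Only values are normalized; hash keys are left untouched.
    data.transform_values { |v| deep_unicode_normalize(v, normalization_format: normalization_format) }
  when Array
    data.map { |e| deep_unicode_normalize(e, normalization_format: normalization_format) }
  when String
    data.unicode_normalize(normalization_format)
  else
    # Integers, nil, booleans, etc. pass through unchanged.
    data
  end
end

# Hypothetical payload containing decomposed (NFD) strings at several depths.
payload = {
  "id" => 1,
  "name" => "Mu\u0308nchen",
  "files" => ["Bjo\u0308rk.flac", { "path" => "Cafe\u0301" }],
}

normalized = deep_unicode_normalize(payload)

puts normalized["name"] == "M\u00FCnchen"             # true
puts normalized["files"][0] == "Bj\u00F6rk.flac"      # true
puts normalized["files"][1]["path"] == "Caf\u00E9"    # true
puts normalized["id"]                                 # 1

# The keyword is honored at every depth when it is forwarded.
nfd = deep_unicode_normalize(normalized, normalization_format: :nfd)
puts nfd == payload                                   # true
```

Note that `transform_values` leaves hash keys as they are, so a key containing a decomposed character would not be normalized by this approach.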