Implement charset handling in WebRequestConcern #950

knu · 2015-08-01T08:40:32Z

- The `force_encoding` and `unzip` options in WebsiteAgent are moved to
  WebRequestConcern so that other users of the concern, such as
  RssAgent, can benefit from them.
- WebRequestConcern detects a charset specified in the Content-Type
  header to decode the content properly. If the charset is missing,
  the content is assumed to be encoded in UTF-8 unless it has a binary
  MIME type. Not all Faraday adapters handle character encodings, and
  Faraday passes through whatever the backend returns, so we need to
  do this on our own. (cf. lostisland/faraday#139, "response.body is
  ASCII-8BIT when Content-Type is text/xml; charset=utf-8")
- WebRequestConcern now converts text contents to UTF-8, so agents can
  handle non-UTF-8 data without having to deal with encodings
  themselves. Previously, WebsiteAgent in "json"/"text" modes and
  RssAgent would suffer from encoding errors when dealing with
  non-UTF-8 contents. WebsiteAgent in "html"/"xml" modes did not have
  this problem because Nokogiri always returns results in UTF-8
  regardless of the input encoding.

This should fix #608.
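
For readers who want to see the decoding rule in isolation, here is a minimal sketch of the behavior described above. It is not the code from this pull request: `decode_body` and `BINARY_TYPE` are hypothetical names, and the list of MIME types treated as binary is an assumption made for illustration only.

```ruby
# Hypothetical sketch; not the implementation in WebRequestConcern.
# Assumed set of "binary" MIME types, for illustration only.
BINARY_TYPE = %r{\A(?:image|audio|video)/|\Aapplication/(?:octet-stream|zip|pdf)\z}i

# Returns the body as UTF-8 text, or as a binary string for binary MIME types.
# Raises an Encoding error if the bytes are not valid in the detected charset.
def decode_body(body, content_type)
  header    = content_type.to_s
  mime_type = header[/\A[^;\s]+/].to_s
  charset   = header[/;\s*charset\s*=\s*"?([^\s;"]+)/i, 1]

  if charset.nil? && mime_type.match?(BINARY_TYPE)
    # No charset and a binary type: leave the bytes untouched.
    return body.dup.force_encoding(Encoding::BINARY)
  end

  encoding =
    begin
      charset ? Encoding.find(charset) : Encoding::UTF_8
    rescue ArgumentError
      Encoding::UTF_8 # unknown charset name: fall back to UTF-8
    end

  # Label the raw bytes with the detected encoding, then transcode to UTF-8.
  body.dup.force_encoding(encoding).encode(Encoding::UTF_8)
end
```

With this sketch, `decode_body(raw, "text/plain; charset=ISO-8859-1")` returns a UTF-8 string, while a body served as `image/png` without a charset stays binary.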

cantino · 2015-08-01T19:20:09Z

app/concerns/web_request_concern.rb

+          # Not all Faraday adapters support automatic charset
+          # detection, so we do that.
+          case env[:response_headers][:content_type]
+          when /;\s*charset\s*=\s*([^()<>@,;:\\\"\/\[\]?={}\s]+)/i


Would https://github.com/cantino/guess_html_encoding be useful here?

I think so. More detection logic can be added later: BOM, XML declaration, HTML <meta> elements, etc.
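
As an illustration of the first of those follow-ups, BOM detection could look roughly like the sketch below. This is not part of the pull request; `BOMS` and `detect_bom` are hypothetical names, and the caller would still need to decide how a BOM interacts with a charset given in the Content-Type header.

```ruby
# Hypothetical sketch of BOM-based detection; not code from this pull request.
# Order matters: the UTF-32LE BOM begins with the UTF-16LE BOM, so the longer
# signatures are listed before the shorter ones.
BOMS = {
  "\xEF\xBB\xBF".b     => Encoding::UTF_8,
  "\xFF\xFE\x00\x00".b => Encoding::UTF_32LE,
  "\x00\x00\xFE\xFF".b => Encoding::UTF_32BE,
  "\xFF\xFE".b         => Encoding::UTF_16LE,
  "\xFE\xFF".b         => Encoding::UTF_16BE,
}.freeze

# Returns the Encoding indicated by a leading byte order mark, or nil if the
# body does not start with one.
def detect_bom(body)
  bytes = body.b # compare raw bytes regardless of the string's current label
  BOMS.each do |bom, encoding|
    return encoding if bytes.start_with?(bom)
  end
  nil
end
```

One known ambiguity of this approach: UTF-16LE text whose first character after the BOM is U+0000 is indistinguishable from a UTF-32LE BOM, so a real implementation may want additional checks.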

cantino · 2015-08-01T19:21:11Z

This looks really good!

knu · 2015-08-03T13:28:26Z

@cantino Please feel free to improve the charset detection part!

knu force-pushed the web_content_charset branch from 2b5c7dd to c047ed8 on August 1, 2015 11:25

knu force-pushed the web_content_charset branch from c047ed8 to 6f667a4 on August 1, 2015 11:26

cantino reviewed Aug 1, 2015
knu added a commit that referenced this pull request Aug 3, 2015

Merge pull request #950 from cantino/web_content_charset

d14027c

Implement charset handling in WebRequestConcern

knu merged commit d14027c into master Aug 3, 2015

knu deleted the web_content_charset branch August 3, 2015 13:27
