switch syntax highlighting to Linguist w/ auto-detection #28

Closed
wants to merge 2 commits into from
35 changes: 22 additions & 13 deletions README.md
@@ -45,31 +45,40 @@ output to the next filter's input. So if you wanted to have content be
filtered through Markdown and be syntax highlighted, you can create the
following pipeline:

```ruby
<pre lang=rb>
pipeline = HTML::Pipeline.new [
HTML::Pipeline::MarkdownFilter,
HTML::Pipeline::SyntaxHighlightFilter
]
result = pipeline.call <<CODE
This is *great*:

``` ruby
some_code(:first)
```
result = pipeline.call &lt;&lt;-MDOWN
Contributor: Why use `&lt;` here? Does it render incorrectly on GitHub otherwise?

Contributor Author: Probably not, but just to be on the safe side. `<pre>` allows HTML tags, so anything that's not a tag but starts with `<` might confuse the parser.

Contributor: Good point.
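
A minimal sketch of the escaping discussed in this thread, not code from the PR: inside a raw `<pre>` element an HTML parser may read a bare `<` as the start of a tag, so the heredoc operator is written with entities. The helper name `escape_for_pre` is hypothetical; `ERB::Util.html_escape` is from Ruby's standard library.

```ruby
require 'erb'

# Hypothetical helper (not part of html-pipeline): escape text so it can sit
# inside a raw <pre> element without being mistaken for markup.
def escape_for_pre(text)
  ERB::Util.html_escape(text)
end

line = 'result = pipeline.call <<-MDOWN'
puts escape_for_pre(line)
# prints: result = pipeline.call &lt;&lt;-MDOWN
```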

Language defined *explicitly*:

CODE
result[:output].to_s
```ruby
5.times { puts "Odelay!" }
```

Language *auto-detected* from code:

5.times { puts "Odelay!" }

MDOWN

puts result[:output].to_s
</pre>

Prints:

```html
<p>This is <em>great</em>:</p>
<p>Language defined <em>explicitly</em>:</p>

<div class="highlight">
<pre><span class="n">some_code</span><span class="p">(</span><span class="ss">:first</span><span class="p">)</span>
</pre>
</div>
<div class="highlight"><pre><span class="mi">5</span><span class="o">.</span><span class="n">times</span> <span class="p">{</span> <span class="nb">print</span> <span class="s2">"Odelay!"</span> <span class="p">}</span>
</pre></div>

<p>Language <em>auto-detected</em> from code:</p>

<div class="highlight"><pre><span class="mi">5</span><span class="o">.</span><span class="nb">times</span> <span class="p">{</span> <span class="k">print</span> <span class="s">"Odelay!"</span> <span class="p">}</span>
</pre></div>
```

Some filters take an optional **context** and/or **result** hash. These are
2 changes: 1 addition & 1 deletion html-pipeline.gemspec
@@ -19,7 +19,7 @@ Gem::Specification.new do |gem|
   gem.add_dependency "nokogiri", "~> 1.4"
   gem.add_dependency "github-markdown", "~> 0.5"
   gem.add_dependency "sanitize", "~> 2.0"
-  gem.add_dependency "pygments.rb", ">= 0.2.13"
+  gem.add_dependency "github-linguist", "~> 2.1"
   gem.add_dependency "rinku", "~> 1.7"
   gem.add_dependency "escape_utils", "~> 0.2"
   gem.add_dependency "activesupport", ">= 2"
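
For readers unfamiliar with the two constraint styles in this hunk, here is a small sketch using RubyGems' own `Gem::Requirement` and `Gem::Version` classes. The helper `allowed?` and the sample version numbers are illustrative only. `~> 2.1` admits any 2.x release from 2.1 upward but not 3.0, whereas `>= 0.2.13` has no upper bound.

```ruby
require 'rubygems'

# "~> 2.1" is the pessimistic constraint: >= 2.1 and < 3.0.
linguist = Gem::Requirement.new('~> 2.1')
# ">= 0.2.13" (the pygments.rb constraint) has no upper bound.
pygments = Gem::Requirement.new('>= 0.2.13')

# Hypothetical helper for this example only.
def allowed?(requirement, version)
  requirement.satisfied_by?(Gem::Version.new(version))
end

puts allowed?(linguist, '2.3.4')  # true
puts allowed?(linguist, '3.0.0')  # false
puts allowed?(linguist, '2.0.9')  # false
puts allowed?(pygments, '1.0.0')  # true
```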
67 changes: 55 additions & 12 deletions lib/html/pipeline/syntax_highlight_filter.rb
@@ -1,26 +1,69 @@
-require 'pygments'
+require 'linguist'

 module HTML
   class Pipeline
-    # HTML Filter that syntax highlights code blocks wrapped
-    # in <pre lang="...">.
+    # HTML Filter that syntax highlights code blocks wrapped in <pre> tags.
+    #
+    # If a <pre> has a "lang" attribute, it is taken as the language
+    # identifier. Otherwise, the language is auto-detected from the contents of
+    # the code block. Pass in `:detect_syntax => false` to disable this.
+    # You can also disable language detection per code block by assigning a
+    # value to the "lang" attribute such as "plain".
+    #
+    # Language detection is done with GitHub Linguist. Note that some popular
+    # languages that Linguist 2.3.4 isn't yet taught to detect are:
+    # ActionScript, C#, Common Lisp, CSS, Erlang, Haskell, HTML, Lua, SQL.
     class SyntaxHighlightFilter < Filter
       def call
-        doc.search('pre').each do |node|
-          next unless lang = node['lang']
-          next unless lexer = Pygments::Lexer[lang]
-          text = node.inner_text
+        doc.search('pre').each do |pre|
+          code = pre.inner_text

-          html = highlight_with_timeout_handling(lexer, text)
-          next if html.nil?
+          if language_name = language_name_from_node(pre)
+            language = lookup_language(language_name)
+          elsif detect_language?
+            detected = detect_languages(code).first
+            language = detected && lookup_language(detected[0])
+          end

-          node.replace(html)
+          if html = language && colorize(language, code)
+            pre.replace(html)
+          end
         end
         doc
       end

-      def highlight_with_timeout_handling(lexer, text)
-        lexer.highlight(text)
+      def detect_language?
+        context[:detect_syntax] != false
+      end
+
+      def language_name_from_node node
+        node['lang']
+      end
+
+      def lookup_language name
+        Linguist::Language[name]
+      end
+
+      def detect_languages code
+        Linguist::Classifier.classify(classifier_db, code, possible_languages)
+      end
+
+      def classifier_db() Linguist::Samples::DATA end
+
+      def possible_languages
+        popular_language_names & sampled_languages
+      end
+
+      def popular_language_names
+        Linguist::Language.popular.map {|lang| lang.name }
+      end
+
+      def sampled_languages
+        classifier_db['languages'].keys
+      end
+
+      def colorize language, code
+        language.colorize code
       rescue Timeout::Error => boom
         nil
43 changes: 38 additions & 5 deletions test/html/pipeline/syntax_highlight_filter_test.rb
@@ -8,18 +8,51 @@ def filter(*args)
   end

   def test_unchanged
-    html = "<pre>plain</pre>"
+    html = %(<pre lang="plain">I am a poem</pre>)
     assert_equal html, filter(html).to_s
   end

   def test_syntax_highlighting
     html = "<pre lang=rb>a = 1</pre>"
     assert_equal_html <<-RESULT, filter(html).to_s
-      <div class="highlight">
-      <pre>
+      <div class="highlight"><pre>
       <span class="n">a</span> <span class="o">=</span> <span class="mi">1</span>
-      </pre>
-      </div>
+      </pre></div>
     RESULT
   end
+
+  def test_explicit_lang_skips_detection
+    html = "<pre lang=rb>var a = null</pre>"
+    assert_equal_html <<-RESULT, filter(html).to_s
+      <div class="highlight"><pre>
+      <span class="n">var</span> <span class="n">a</span>
+      <span class="o">=</span> <span class="n">null</span>
+      </pre></div>
+    RESULT
+  end
+
+  def test_detects_ruby
+    html = "<pre>def foo; end</pre>"
+    assert_equal_html <<-RUBY, filter(html).to_s
+      <div class="highlight"><pre>
+      <span class="k">def</span> <span class="nf">foo</span><span class="p">;</span>
+      <span class="k">end</span>
+      </pre></div>
+    RUBY
+  end
+
+  def test_detects_javascript
+    html = "<pre>var a = null</pre>"
+    assert_equal_html <<-RESULT, filter(html).to_s
+      <div class="highlight"><pre>
+      <span class="kd">var</span> <span class="nx">a</span>
+      <span class="o">=</span> <span class="kc">null</span>
+      </pre></div>
+    RESULT
+  end
+
+  def test_disable_detect
+    html = "<pre>def foo; end</pre>"
+    assert_equal html, filter(html, :detect_syntax => false).to_s
+  end
 end
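
The behaviours these tests pin down can be sketched without Linguist: an explicit `lang` wins, detection runs only when `lang` is absent, an unknown `lang` such as "plain" leaves the block untouched, and `:detect_syntax => false` switches detection off. This is a simplified illustration, not the PR's implementation. `KNOWN_LANGUAGES`, `toy_detect`, and `choose_language` are hypothetical names, and the keyword matching merely stands in for Linguist's classifier.

```ruby
# Hypothetical lookup table standing in for Linguist::Language[name].
KNOWN_LANGUAGES = { 'rb' => 'Ruby', 'ruby' => 'Ruby', 'js' => 'JavaScript' }.freeze

# Hypothetical detector standing in for Linguist::Classifier; it only
# recognises two leading keywords and is not how Linguist classifies code.
def toy_detect(code)
  return 'ruby' if code.start_with?('def ')
  return 'js'   if code.start_with?('var ')
  nil
end

# Mirrors the branch order in SyntaxHighlightFilter#call: an explicit lang
# attribute always wins, and detection runs only when no lang is given.
def choose_language(lang_attr, code, context = {})
  if lang_attr
    KNOWN_LANGUAGES[lang_attr]
  elsif context[:detect_syntax] != false
    detected = toy_detect(code)
    detected && KNOWN_LANGUAGES[detected]
  end
end

p choose_language('rb', 'var a = null')                          # "Ruby"
p choose_language(nil, 'var a = null')                           # "JavaScript"
p choose_language('plain', 'I am a poem')                        # nil
p choose_language(nil, 'def foo; end', :detect_syntax => false)  # nil
```

A `nil` result corresponds to the filter leaving the `<pre>` as it was, which is why `lang="plain"` works as a per-block opt-out: the attribute is present, so detection is skipped, but the lookup finds no language to colorize with.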