Change comment directive parsing #1149

tompng · 2024-08-04T19:34:01Z

Fix comment directive parsing problem

Problem of comment parsing

The main problem is that @preprocess.handle parses the comment, removes directives, and processes the code object all at the same time.
This pull request changes RDoc to parse the comment and extract directives first, and then apply the directives to the code object.

Flow of legacy RDoc parsing method

For example, consider parsing this code:

class A
  # :yields: x, y
  # :args:   a, b
  # :call-seq: 
  #--
  # :not-new:
  # :category: foobar
  #++
  #   initialize(x, y, z)
  def initialize(*args, &block); end
end

Step 1

RDoc performs @preprocess.handle on RDoc::NormalClass.

:category: is applied to klass and replaced with a blank line.
:not-new: and :yields: are replaced with blank lines. This may be a bug.
:args: a, b is replaced with :args: a, b

Step 2

RDoc performs @preprocess.handle on RDoc::AnyMethod.
:args: a, b is applied to meth.params.

Step 3

RDoc removes the private section that starts with #-- and ends with #++.

Step 4

RDoc normalizes the comment by removing # and indentation.

Step 5

RDoc extracts ":call-seq:\n initialize(x, y, z)" from the comment and applies it to the method object.

Problems

RDoc removes directives and expands :include: twice in some cases, and once in other cases.
To avoid having all directives removed in the first @preprocess.handle, preprocess needs a directive-replace mechanism, which makes things complex.

Private sections and call-seq are processed later. This makes RDoc accept weird comments, such as a directive inside a private section or a private section inside a call-seq.

Handling metaprogramming methods is also hard.
@preprocess.handle(comment, code_object) requires the code object to be already created.
We need to parse the comment to know the code object type (method or attribute). Only after that can we finally parse the comment with the code object.

C comments are also complicated. :include: can include text containing */.
Removing directive lines and private sections from the comment might remove /* and */, which makes normalize_comment fail.
The original implementation avoided this by using a different processing order than the Ruby parser. This is not consistent.

Solution

We need to parse the comment first, and only once, to extract directives.
Expanding :include:, reading directives (including :call-seq:), and removing private sections all happen at the same time.
The comment parser should return the normalized comment text and the directives as an attribute hash. Each directive should also carry its line number.
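The single-pass idea can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: parse_comment_sketch, DIRECTIVE_LINE, and the [param, line_no] value shape are hypothetical, and :include: expansion and multiline directives are left out.

```ruby
# Hypothetical sketch: names and the return shape are illustrative, not RDoc's API.
DIRECTIVE_LINE = /\A:(?<name>[\w-]+):(?<param>.*)\z/

def parse_comment_sketch(text, start_line = 1)
  directives = {}
  body = []
  in_private = false

  text.each_line(chomp: true).with_index(start_line) do |raw, line_no|
    line = raw.sub(/\A\s*# ?/, '') # normalize: drop the leading "#" and one space
    if in_private
      in_private = false if line.match?(/\A\+{2}\z/) # "++" closes the private section
    elsif line.match?(/\A-{2,}\z/)
      in_private = true # "--" opens a private section
    elsif (m = DIRECTIVE_LINE.match(line))
      directives[m[:name]] = [m[:param].strip, line_no] # remember where it was written
    else
      body << line
    end
  end

  [body.join("\n"), directives]
end

text, directives = parse_comment_sketch(<<~COMMENT, 10)
  # :yields: x, y
  # Creates a new object.
  #--
  # :not-new:
  #++
  # See also Foo.
COMMENT
```

Here text holds only the two documentation lines, and directives holds the :yields: entry with line number 10. The :not-new: line is dropped because it sits inside the private section, so no code object is needed while parsing.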

Changed things

:call-seq:

A new type of directive called "multiline directive" is introduced to make :call-seq: also a directive.

# :multiline-directive:
#   html
#     head
#       title
#
#     body
#       header
#       footer

A multiline directive ends with a blank line. This restriction is for compatibility with old RDoc.
Some invalid multiline directives (unindented, or ended by another directive) are also accepted with a warning.
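As a rough illustration of the blank-line rule, the sketch below collects the value of a multiline directive from already-normalized lines. take_multiline_value_sketch is a hypothetical helper, and it deliberately ignores the lenient cases (unindented values, values ended by another directive) mentioned above.

```ruby
# Hypothetical helper, not RDoc's implementation: collect the value of a
# multiline directive from the lines that follow the directive line.
def take_multiline_value_sketch(lines, base_indent)
  value = lines.take_while do |line|
    # Stop at the first blank line, or at a line that is not indented
    # deeper than the directive itself.
    !line.strip.empty? && line[/\A */].size > base_indent
  end
  min_indent = value.map { |line| line[/\A */].size }.min || 0
  value.map { |line| line[min_indent..] } # strip the common indentation
end

lines = [
  "  html",
  "    head",
  "      title",
  "",
  "    body",
]
value = take_multiline_value_sketch(lines, 0)
```

With this input, value contains the html, head, and title lines with their relative indentation preserved, and the body line after the blank line is not part of the directive.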

The result of parsing this call-seq has changed. I think it is better now.

# :call-seq:
#   STDIN.getc()     -> string # Only this line was call-seq
#
#   STDIN.getc(a)    -> string
#
#   STDIN.getc(a, b) -> string
#   $stdin.getc(c)   -> string # It's now call-seq until this line
#
# :other:

Private section

Previously, #----foobar was accepted as a private section start.
Previously, #++++foobar was decomposed into #++ (private end) and ++foobar (normal comment).
The start is now /^#-{2,}$/ (two or more -) and the end is now /^#\+{2}$/ (exactly two +).
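The two patterns quoted above can be checked directly against the example lines. This is only a demonstration of the regexps as written in this description; the variable names are made up for the example.

```ruby
private_start = /^#-{2,}$/ # two or more "-"
private_end   = /^#\+{2}$/ # exactly two "+"

checks = {
  "#--"         => private_start,
  "#----"       => private_start,
  "#----foobar" => private_start,
  "#++"         => private_end,
  "#++++foobar" => private_end,
}
# Map each sample line to whether its pattern matches.
results = checks.to_h { |line, regexp| [line, line.match?(regexp)] }
```

Only #--, #----, and #++ match. The lines with trailing text no longer start or end a private section.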

Unhandled directives

In old RDoc, an unhandled directive such as # :unknown: foo remained in the normal comment.
Now it is removed just like other directives, and is appended to the code object's metadata. It does not make sense to leave metadata in the comment. I think this was just a side effect of avoiding the double parsing problem.

Normalize and remove private section

Everything is now done in the parse phase.

C and Simple parser

C used to accept /*\n# :directive:\n*/ but now only accepts * :directive:.
The changes to call-seq, private sections, and unhandled directives described above also apply to the C and Simple parsers.

Old comment parsing

RDoc::Markup::PreProcess#handle, RDoc::Comment#extract_call_seq, and RDoc::Comment#remove_private are now only used by RDoc::Parser::Ruby. We can remove them in the future.

Diff (updated: 2025/02/02)

I compared the generated HTML files for rdoc itself and for ruby/ruby.

HTML meta tag content (ruby/ruby)

Files:

Date/Error.html
Enumerator/Generator.html
Enumerator/Producer.html
Enumerator/Yielder.html
Fiddle/Pointer.html
UnicodeNormalize.html

Example diff

<meta name="description" content="class Date::Error: Exception for invalid date/time ">
↓
<meta name="description" content="class Date::Error: Exception for invalid date/time">

OpenSSL/Timestamp/Factory.html (ruby/ruby)

This invalid document is parsed differently.

/* Document-class: OpenSSL::Timestamp::Factory
 * Document for default_policy_id
 * call-seq:
 *       factory.default_policy_id = "string" -> string
 * Document for serial_number
 * call-seq:
 *       factory.serial_number = number -> number
 * Document for gen_time
 * call-seq:
 *       factory.gen_time = Time -> Time
 */

Win32.html (ruby/ruby, RDOC_USE_PRISM_PARSER)

This will no longer be considered a private section (an invisible comment surrounded by -- and ++).

--- info
--- num_keys

History_rdoc.html (ruby/rdoc)

Parsing this part is improved.

* Bug fixes
  * `ri []` and other special methods now work properly.  Issue #52 by
    ddebernardy.
  * `ri` now has space between class comments from multiple files.
  * :stopdoc: no longer creates Object references.  Issue #55 by Simon Chiang
  * :nodoc: works on class aliases now.  Issue #51 by Steven G. Harms
  * Remove tokenizer restriction on header lengths for verbatim sections.
    Issue #49 by trans

In the current document, it looks like * :stopdoc: and * :nodoc: were processed as directives.

lib/rdoc/markdown_kpeg.html (ruby/rdoc)

Maybe it shouldn't be documented.
https://ruby.github.io/rdoc/lib/rdoc/markdown_kpeg.html

RDoc/MarkupReference.html (ruby/rdoc, RDOC_USE_PRISM_PARSER)

<pre>:call-seq: → <pre>:call-seq: (trailing space removed)

RDoc/Parser/Ruby.html (ruby/rdoc, RDOC_USE_PRISM_PARSER)

Escaping, as in # \:method: or :attr: directives in +comment+., now works.
Note that this is related to an old bug in the master branch:

class Foo
  # A string constant with
  # \:nodoc: (this is documented. :nodoc: is escaped)
  A = ':nodoc:'
  # Prints the word
  # \:nodoc: (this method is not documented. :nodoc: is not escaped)
  def print_colon_nodoc = puts(':nodoc:')
end

st0012 · 2024-08-23T21:46:03Z

Thanks for the PR! But because 1) the new maintainers are still learning the codebase, and 2) this PR touches a core part of RDoc, I'll hold off merging it until version 6.8.0 series is released 🙏

st0012 · 2025-02-16T12:43:20Z

lib/rdoc/comment.rb

+        value_lines = lines.take_while do |l|
+          l.rstrip[indent_regexp].size > base_indent_size
+        end
+        min_indent = value_lines.map { |l| l[indent_regexp].size }.min


Can this be:

Suggested change

-        min_indent = value_lines.map { |l| l[indent_regexp].size }.min
+        min_indent = value_lines.min_by { |l| l[indent_regexp].size }.size

It can be written like this:

value_lines = ['    arg2,', '    arg3,', '  )']
indent_regexp = /^\s*/
min_indent = value_lines.min_by { |l| l[indent_regexp].size }[indent_regexp].size #=> 2

but I think map{}.min is better.

st0012 · 2025-02-16T12:44:48Z

lib/rdoc/comment.rb

+      when :c
+        private_start_regexp = /^(\s*\*)?-{2,}$/
+        private_end_regexp = /^(\s*\*)?\+{2}$/
+        indent_regexp = /^\s*(\/\*+|\*)?\s*/


Can we store these static regexp objects in constants instead?

It already has an explicit name, private_start_regexp, and I don't see much benefit in making it a constant.

C_PRIVATE_START_REGEXP = /regexp/
private_constant :C_PRIVATE_START_REGEXP, :C_...
...
when :c
  private_start_regexp = C_PRIVATE_START_REGEXP

But why can't we directly refer to C_PRIVATE_START_REGEXP without defining locals for them 🤔

Because these locals are a layer that absorbs the differences between types:

when :ruby
  private_start_regexp = /^-{2,}$/
when :c
  private_start_regexp = /^(\s*\*)?-{2,}$/
when :simple
  private_start_regexp = /^-{2}$/
end
...
line.match?(private_start_regexp)
...
value_lines = take_multiline_directive_value_lines(..., indent_regexp, ...)

I think it could be simplified to:

Remove # or * depending on the type first

Use the same regexp for all types

But it will introduce incompatibility.

horizontal divider between two paragraphs in :simple
---
--
only two dashes starts private section in :simple
++
#-----
# Two or more dashes after # are private section start in :ruby
# Tested in RDocCommentTest#test_remove_private_long
#++

Removing spaces before * instead of replacing * with a space (needed to distinguish *--- from * ---) also changes the result for a C comment where the * are not aligned. (I think this is a small problem, though.)

/*
* :call-seq:
*   foo(bar)
*/

st0012 · 2025-02-16T12:45:37Z

lib/rdoc/comment.rb

+        line = lines.shift
+        read_lines = 1
+        if in_private
+          in_private = false if line.match?(private_end_regexp)


Why do we need to flip in_private here? Can we add a comment explaining it?

private_start_regexp (matches --) begins a private section (in_private = true)
private_end_regexp (matches ++) ends the private section (in_private = false)

Comment added.

st0012 · 2025-02-16T12:57:02Z

lib/rdoc/comment.rb

+        prefix_indent = ' ' * prefix.size
+        line = line.byteslice(prefix.bytesize..)
+        /\A(?<colon>\\?:|:?)(?<directive>[\w-]+):(?<param>.*)/ =~ line
+


Let's add param.strip! here so we don't need to call param.strip in multiple branches?

Good catch 👍
Changed to:

raw_param = directive_match[:param]
param = raw_param.strip

The unstripped raw_param is also used to reject text like :toto::, as the old implementation in pre_process.rb did.

rdoc/lib/rdoc/markup/pre_process.rb

Line 112 in a253a8d

# skip something like ':toto::'
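To make the raw_param point concrete, here is a small sketch built around the directive regexp from the diff above. split_directive_sketch and its return shape are hypothetical, and the rejection rule is an assumption based on the old pre_process.rb comment: a raw param that begins with ":" means the text was something like :toto:: rather than a directive.

```ruby
# The regexp is copied from the diff above; everything else is an illustrative sketch.
DIRECTIVE_REGEXP = /\A(?<colon>\\?:|:?)(?<directive>[\w-]+):(?<param>.*)/

def split_directive_sketch(line)
  m = DIRECTIVE_REGEXP.match(line)
  return nil unless m

  raw_param = m[:param]
  # Assumption: ":toto::" leaves ":" as the raw param, so it is not a directive.
  # This check has to use the unstripped value.
  return nil if raw_param.start_with?(':')

  { name: m[:directive], param: raw_param.strip, escaped: m[:colon] == '\\:' }
end

args    = split_directive_sketch(':args:   a, b')
toto    = split_directive_sketch(':toto::')
escaped = split_directive_sketch('\\:nodoc:')
```

Stripping once right after the match keeps the later branches free of repeated param.strip calls, while the unstripped value stays available for the :toto:: check.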

nobu · 2025-03-09T07:04:15Z

lib/rdoc/comment.rb

+      blank_line_regexp = /\A\s*\z/
+      min_spaces = lines.map do |l|
+        l[/\A */].size unless l.match?(blank_line_regexp)
+      end.compact.min


Suggested change

-      blank_line_regexp = /\A\s*\z/
-      min_spaces = lines.map do |l|
-        l[/\A */].size unless l.match?(blank_line_regexp)
-      end.compact.min
+      min_spaces = lines.map do |l|
+        l.match(/\A *(?=\S)/)&.end(0)
+      end.compact.min

nobu · 2025-03-09T07:05:53Z

lib/rdoc/comment.rb

+      lines.shift while lines.first&.match?(blank_line_regexp)
+      lines.pop while lines.last&.match?(blank_line_regexp)


Can't the blank lines be stripped first?

nobu · 2025-03-09T07:09:03Z

lib/rdoc/comment.rb

+    private def take_multiline_directive_value_lines(directive, filename, line_no, lines, base_indent_size, indent_regexp, has_param)
+      return [] if lines.empty?
+
+      first_indent_size = lines.first[indent_regexp].size


Suggested change

-      first_indent_size = lines.first[indent_regexp].size
+      first_indent_size = lines.first.match(indent_regexp)&.end(0)

This avoids creating a string each time.
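The two suggestions above rely on MatchData#end returning the offset where the match stops, which equals the matched length when the match starts at the beginning of the line. A small sketch with made-up sample lines:

```ruby
indent_regexp = /^\s*(\/\*+|\*)?\s*/ # the :c indent regexp from the diff above
line = "   *   foo(bar)"

by_slice = line[indent_regexp].size          # builds the matched substring first
by_match = line.match(indent_regexp)&.end(0) # reads the end offset of the match

# With a lookahead for a non-space character, blank lines produce nil
# and are dropped by compact, so no separate blank-line regexp is needed.
lines = ["    a", "", "  b", "   "]
min_spaces = lines.map { |l| l.match(/\A *(?=\S)/)&.end(0) }.compact.min
```

Both by_slice and by_match describe the same indent width here. The safe-navigation operator matters in the second form because match returns nil when a line does not match at all.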

st0012 added this to the v7.0.0 milestone Oct 17, 2024

st0012 self-requested a review January 28, 2025 14:26

tompng force-pushed the comment_parsing branch 2 times, most recently from 178917a to b17e948 Compare January 31, 2025 13:43

tompng mentioned this pull request Feb 2, 2025

Reduce document difference between RDoc::Parser::Ruby and RDoc::Parser::PrismRuby #1284

Merged

tompng force-pushed the comment_parsing branch from b17e948 to 1b4b08d Compare February 2, 2025 14:11

tompng mentioned this pull request Feb 3, 2025

Indent multiline call-seq comment ruby/bigdecimal#311

Open

tompng force-pushed the comment_parsing branch from 1b4b08d to 15fa499 Compare February 11, 2025 18:40

st0012 reviewed Feb 16, 2025

nobu reviewed Mar 9, 2025

tompng added 5 commits March 16, 2025 03:38

Change comment directive parsing

7e2061c

Escape directive-like document content

1646924

Extract directive regexp to constant

50a3900

Extract normalizing comment indent to a method

92d8fc6

Use an efficient way to calc length of regexp matched string

07a0678

tompng force-pushed the comment_parsing branch from 2d9be0f to 07a0678 Compare March 15, 2025 18:44
