WIP: Initial proposal for JMarkdown design specification #9

Closed
wants to merge 6 commits into from

chrisjsewell
Member

@chrisjsewell chrisjsewell commented Nov 14, 2019

Following discussion with @choldgraf @jstac and @mmcky, and based on my experience developing IPyPublish, this PR presents an initial design proposal for a JMarkdown (or IMarkdown) design specification.

Aim

Have a pure-text representation of a Jupyter Notebook that can be used to create technical/scientific documents, in particular for conversion to PDF (via TeX) and HTML.

Key Design Goals

For me the key design goals are that it should:

  • Support key syntax features, required to produce a technical/scientific document:
    • Standard text markup (bold, italic, etc), and block elements (headers, lists, etc)
    • Math
    • Internal labelling and referencing of figures, tables and equations, including those output from code cells.
    • Referencing of citations, from a bibtex file(s)
    • Referencing of acronyms/definitions would also be desirable, like I have implemented in sphinx_ext_bibgloss
    • Optional formatting of notebook cells and code cell outputs, via cell metadata.
  • Be easy to read/use/adopt by users
  • Be easy to do round trip conversions with a Notebook, e.g. via jupytext
  • Be easy to parse into and manipulate with a document converter, e.g. pandoc
  • Be easy to write an LSP extension for, to provide cross-editor language support. Note that LSP is supported by VS Code, PyCharm, Atom, Sublime, Emacs and Vim, and is currently being implemented in JupyterLab (see "Adopt the language server protocol", jupyterlab/jupyterlab#2163). A good example of what an LSP can provide can be seen in LaTeX-Workshop, including:
    • General Markdown syntax highlighting, validation and error highlighting
    • Autocompletion of internal references and ‘jump-to-definition’
    • Autocompletion of citations (by linking to a bibtex file) and hover preview of citation metadata
    • Hover previews of LaTeX math
    • Metadata validation, autocompletion, and error highlighting
    • Common snippets insertion

Implementation Considerations

To support these syntax features, the three main options are LaTeX, RST and Markdown.
LaTeX supports all of them; however, its drawbacks are that (a) it is not intrinsically supported by Jupyter Notebooks, (b) it is not as easy to convert to HTML etc., and (c) it's not as user-friendly as Markdown. (a) and (c) also apply to RST, which leaves us with Markdown.

RMarkdown would then be the next natural consideration, however, this also has some issues:

  • It doesn't have a comprehensive design specification. A good example of such a specification is the CommonMark-Spec.
  • It doesn't intrinsically support markdown/raw cell metadata
  • The code cell metadata format (```{python a=1, b="c"}) isn't 'official' pandoc syntax (so isn't automatically handled by it), and isn't ideal for usability/readability or parsing.
  • It doesn't have first-class support for labelling, referencing, etc. (second-class support is provided by Bookdown)
  • It's pretty closely tied to RStudio and Knitr

This is where JMarkdown comes in. It is similar to RMarkdown, but:

  1. Stores metadata for all cells in YAML metadata code blocks
  2. Will have a comprehensive design specification, that can be used by parsers and LSP, including JSON schema for notebook and cell level metadata (like what is implemented in IPyPublish's document schema and cell schema).
  3. Will eventually provide first-class support for output agnostic labelling, referencing, etc, probably through an amalgamation of the (similar) syntaxes derived in bookdown and IPyPublish.

Also, where possible, the syntax will use the basic Pandoc Markdown elements, and keep boilerplate elements to a minimum.

Proof of Concept Implementation

In this PR I have created a small python package, as a proof-of-concept for:

  • Conversion from a Notebook
  • Parsing to Pandoc AST
  • Processing of the AST, including code cell execution.
  • Converting to HTML

Conversion from a Notebook

The initial notebook is imd_test.ipynb. With imd_poc.nb2imd.parse_nb2imd, I then parse this notebook to test_nb2imd.md (and the preview format), which demonstrates the basic format of JMarkdown. Note:

  • Metadata are stored in code blocks, with the identifier metadata
  • Raw cells are stored in code blocks, with the identifier raw-cell
  • Code cells are stored in code blocks, with the identifiers code-cell and the language, obtained from the notebook metadata. Unfortunately, GitHub only syntax highlights the shortcut form; ```python, which I originally used, but this could be an issue if someone wanted to write python code in a markdown cell, so ```{.python .code-cell} is more robust.
  • Metadata is not required for every cell. If the metadata is empty, then a block will not be created.
  • If a Markdown cell with no metadata follows another Markdown cell, then a Pandoc Span is inserted with the new-cell class: []{.new-cell}. This ensures a parser can identify the start of all new cells (in the current jupytext RMarkdown converter, multiple markdown cells are merged during the round-trip conversion).
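
The rules in the list above can be sketched as a small function. This is not the imd_poc.nb2imd code: the function names are hypothetical, metadata is written as JSON (which is also valid YAML) so that only the standard library is needed, the language is read from the notebook's language_info metadata, and the new-cell span is assumed to apply when a metadata-free Markdown cell follows another Markdown cell.

```python
import json

FENCE = "`" * 3  # built programmatically so this example nests cleanly


def metadata_block(metadata):
    """Render cell metadata as a fenced block with the 'metadata' identifier."""
    # Assumption: JSON stands in for YAML here (JSON is a subset of YAML).
    body = json.dumps(metadata, indent=2, sort_keys=True)
    return f"{FENCE}metadata\n{body}\n{FENCE}"


def nb_to_jmd(nb):
    """Convert a notebook (as a plain dict) to JMarkdown-like text."""
    language = nb.get("metadata", {}).get("language_info", {}).get("name", "python")
    chunks = []
    prev_type = None
    for cell in nb["cells"]:
        source = cell["source"]
        if isinstance(source, list):  # nbformat allows a list of lines
            source = "".join(source)
        meta = cell.get("metadata", {})
        if meta:
            # Metadata is optional: a block is only written when non-empty.
            chunks.append(metadata_block(meta))
        elif cell["cell_type"] == "markdown" and prev_type == "markdown":
            # Mark the boundary so adjacent Markdown cells are not merged.
            chunks.append("[]{.new-cell}")
        if cell["cell_type"] == "code":
            chunks.append(f"{FENCE}{{.{language} .code-cell}}\n{source}\n{FENCE}")
        elif cell["cell_type"] == "raw":
            chunks.append(f"{FENCE}raw-cell\n{source}\n{FENCE}")
        else:
            chunks.append(source)
        prev_type = cell["cell_type"]
    return "\n\n".join(chunks) + "\n"


example_nb = {
    "metadata": {"language_info": {"name": "python"}},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# Title"},
        {"cell_type": "markdown", "metadata": {}, "source": "Second cell"},
        {"cell_type": "code", "metadata": {"caption": "A table"}, "source": "print(1)"},
    ],
}
print(nb_to_jmd(example_nb))
```

A real converter would more likely emit block-style YAML and would also need to handle attachments and outputs, which this sketch ignores.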

TODO handling cell attachments

Parsing to Pandoc AST

In imd_poc.imd2pandoc, test_nb2imd.md is parsed directly into pandoc AST (using panflute), then some initial processing is done to convert the document to a more 'computationally friendly' format (inserting missing metadata, wrapping metadata & cell content in divs, and numbering cells). Converting back to Markdown, we obtain test_imd2pandoc.md.
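
To illustrate the kind of normalisation described above (attaching metadata to the cell that follows it, filling in missing metadata, and numbering cells), here is a rough sketch that uses a plain line scanner rather than pandoc/panflute. All names are hypothetical, and it assumes the metadata blocks contain JSON-compatible YAML so that json.loads can stand in for a YAML parser.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so this example nests cleanly
FENCE_OPEN = re.compile(r"^`{3}\s*(?P<info>\S.*)$")
NEW_CELL = "[]{.new-cell}"


def parse_jmd(text):
    """Split JMarkdown-like text into numbered cells with metadata."""
    cells, md_lines, pending_meta = [], [], {}

    def flush_markdown():
        # Emit the accumulated Markdown lines as one cell, if any.
        nonlocal pending_meta
        source = "\n".join(md_lines).strip()
        md_lines.clear()
        if source:
            cells.append({"cell_type": "markdown", "metadata": pending_meta, "source": source})
            pending_meta = {}

    lines = iter(text.splitlines())
    for line in lines:
        match = FENCE_OPEN.match(line)
        if match is None:
            if line.strip() == NEW_CELL:
                flush_markdown()  # explicit Markdown cell boundary
            else:
                md_lines.append(line)
            continue
        info = match.group("info").strip()
        body = []
        for inner in lines:  # consume lines up to the closing fence
            if inner.strip() == FENCE:
                break
            body.append(inner)
        content = "\n".join(body)
        if info == "metadata":
            flush_markdown()
            pending_meta = json.loads(content)  # applies to the next cell
        elif info == "raw-cell" or ".code-cell" in info:
            flush_markdown()
            cell = {"metadata": pending_meta, "source": content}
            if info == "raw-cell":
                cell["cell_type"] = "raw"
            else:
                classes = re.findall(r"\.([\w-]+)", info)
                cell["cell_type"] = "code"
                cell["language"] = next((c for c in classes if c != "code-cell"), None)
            cells.append(cell)
            pending_meta = {}
        else:
            # An ordinary fenced block stays part of the Markdown cell.
            md_lines.extend([line, *body, FENCE])
    flush_markdown()
    for number, cell in enumerate(cells, start=1):
        cell["number"] = number
    return cells


example = "\n".join([
    "# Title", "", NEW_CELL, "", "Second cell", "",
    FENCE + "metadata", '{"caption": "A table"}', FENCE, "",
    FENCE + "{.python .code-cell}", "print(1)", FENCE,
])
for cell in parse_jmd(example):
    print(cell["number"], cell["cell_type"], cell["metadata"])
```

Working on the pandoc AST, as the PoC does, avoids the edge cases a line scanner misses (indented fences, unlabelled fences, longer fence markers).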

Code cell execution and output to HTML

In imd_poc.pandoc_exec, I iterate through the pandoc AST, and execute all the code cells, saving the outputs as YAML. Converting back to Markdown we get: test_pandoc_exec.md.
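
As a rough illustration of this execution step, the sketch below runs the code cells in order in a shared namespace and stores the captured output on each cell. It is a stand-in under stated assumptions, not the imd_poc.pandoc_exec code: it only handles Python, runs the code in-process with exec instead of talking to a Jupyter kernel, captures only stdout, and the cell dictionaries and the execute_cells name are hypothetical.

```python
import contextlib
import io
import traceback


def execute_cells(cells):
    """Execute code cells in document order, sharing one namespace."""
    namespace = {}
    for cell in cells:
        if cell["cell_type"] != "code":
            continue
        stdout = io.StringIO()
        error = None
        try:
            with contextlib.redirect_stdout(stdout):
                exec(cell["source"], namespace)
        except Exception:
            # Keep going after a failure, but record the traceback.
            error = traceback.format_exc()
        cell["outputs"] = {"stdout": stdout.getvalue(), "error": error}
    return cells


cells = [
    {"cell_type": "code", "source": "x = 2"},
    {"cell_type": "markdown", "source": "Some text"},
    {"cell_type": "code", "source": "print(x * 21)"},
]
for cell in execute_cells(cells):
    print(cell["cell_type"], cell.get("outputs"))
```

The outputs dictionary could then be serialised next to the cell, which is roughly the role the YAML output blocks play in test_pandoc_exec.md.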

In imd_poc.pandoc2html, I build on this to process references and create a basic HTML document: test_pandoc2html.html

image

Note how we've now utilized the metadata to colour a note block, place a numbered caption under the generated pandas table and added an anchor link to it, and turned the reference into a hyperlink with the same name as the table caption.
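
The numbering and linking described above could be sketched along these lines. The metadata keys (label, caption), the bookdown-style \@ref(...) reference syntax and all function names are assumptions made for illustration; the PoC's actual syntax is not shown in this thread.

```python
import html
import re

# Assumption: references are written bookdown-style, e.g. \@ref(tbl:results).
REF = re.compile(r"\\@ref\((?P<label>[\w:-]+)\)")


def number_tables(cells):
    """Map each label to (number, caption), numbered in document order."""
    numbered = {}
    for cell in cells:
        meta = cell.get("metadata", {})
        if "label" in meta:
            numbered[meta["label"]] = (len(numbered) + 1, meta.get("caption", ""))
    return numbered


def caption_html(label, numbered):
    """Numbered caption carrying the anchor that references link to."""
    number, caption = numbered[label]
    anchor = html.escape(label, quote=True)
    return f'<p class="caption" id="{anchor}">Table {number}: {html.escape(caption)}</p>'


def resolve_references(text, numbered):
    """Replace known references with hyperlinks; leave unknown ones untouched."""
    def replace(match):
        label = match.group("label")
        if label not in numbered:
            return match.group(0)
        number, _ = numbered[label]
        return f'<a href="#{html.escape(label, quote=True)}">Table {number}</a>'
    return REF.sub(replace, text)


cells = [{"metadata": {"label": "tbl:results", "caption": "Results"}}]
numbered = number_tables(cells)
print(caption_html("tbl:results", numbered))
print(resolve_references(r"See \@ref(tbl:results).", numbered))
```

Because the whole document is processed in one pass, the numbering stays consistent across cells, which is harder to achieve when each Markdown cell is converted separately.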

You'll also note that I haven't used the nbconvert conversion mechanism. My approach has a number of benefits:

  1. You don't need to convert the document back to a notebook before conversion. You might want to use jupytext to sync with it though, if you wanted to access pre-computed code cell outputs.
  2. You only need to run pandoc once on the entire document. This is a big win, because with nbconvert, you have to call it separately for every markdown cell, which is not particularly efficient, and makes it more difficult to deal with document wide aspects, like figure numbering.
  3. In my opinion the nbconvert process is overly complex, because of (a) the Traitlets configuration (which they are already discussing replacing), and (b) the Jinja templates. Writing Jinja templates is essentially like writing in a completely different programming language. This might be fine, if the language editor support was anywhere close to the level of python, which it isn't. Because of this, it makes them tedious to write, test and maintain.

Finally, to create HTML, it is better to use Sphinx (as I already do in IPyPublish).
This would either be via the creation of an intermediate RST representation, or you might even be able to convert the pandoc AST directly to the docutils AST (which is what Sphinx uses at a lower level).

I think that's about it for now!

@chrisjsewell
Member Author

Hey @mwouts thanks for the mwouts/jupytext#383 fix :) Thought you might like to give your take on the format proposed above. If you can't be bothered to read it all, just have a look at the basic format in test_nb2imd.md.

@choldgraf
Member

This looks really promising - thanks @chrisjsewell for taking the time to think through it and share your ideas. I'll also take a deeper pass through this. Is there any particular feedback that you're looking for right now? Particular areas that you're unsure etc?

also, ping @emdupre who I think would be quite interested in this conversation as well.

@chrisjsewell
Member Author

Is there any particular feedback that you're looking for right now? Particular areas that you're unsure etc?

Thanks @choldgraf, I wouldn't say there's anything I'm unsure of. As you can see, I have quite strong ideas about this lol! The main thing is to get back critique, and answer the question: Could you see this being adopted by the community and users?

As we've discussed before, RMarkdown has a decent user base, which would take time to replicate. But for the technical/scientific document use case, and for compatibility with Jupyter notebooks in general, I really feel it has some key disadvantages. I hope I've demonstrated here how the few tweaks made for JMarkdown make it much simpler to implement the more complex document elements that we are trying to achieve, without diminishing the readability and user experience (and hopefully enhancing it).

@jlperla

jlperla commented Nov 14, 2019

@chrisjsewell is https://raw.githubusercontent.com/chrisjsewell/meta/imarkdown/tests/test_nb2imd/test_nb2imd.md representative of what a content author would write directly?

@chrisjsewell
Member Author

@chrisjsewell is https://raw.githubusercontent.com/chrisjsewell/meta/imarkdown/tests/test_nb2imd/test_nb2imd.md representative of what a content author would write directly?

Yes, hopefully with the aid of an LSP, akin to LaTeX-Workshop.

@chrisjsewell
Member Author

This is a rough example of how you might write content:

jmarkdown_edit_example

@jlperla

jlperla commented Nov 14, 2019

@chrisjsewell As a jupyter user, I would be strongly supportive of this format and it would make me more likely to use jupyter for "real" work (e.g. the builtin git diffs in something like github, along with discussions on PRs, become possible whereas they are not right now). Right now I have all of my interactive code in Weave.jl or Pweave notebooks so I can coordinate software development with RAs.

In fact, I would love to have an option to natively store in that format instead of the current JSON one! In that case, the tooling isn't a problem since the jupyter notebook/lab interface gives the tools you need.

In the design, it seems that everything is very cell-centric (which makes sense for jupyter but not for HTML or PDF) rather than being a general markup format where jupyter is just one output type. It also seems like it can faithfully reflect the metadata format of jupyter ipynb itself, which is a major advantage for software engineering reproducibility.

But... as a content author for executable books, it is significantly more verbose than both the current RST based approach and bookdown/RMarkdown. This isn't because there are other approaches to representing jupyter in text (what you propose makes complete sense) but rather that the use cases for authoring books with multiple output formats are different from those of authoring jupyter notebooks.

In particular, my gut tells me that the emphasis on the metadata blocks suggests that it would be very difficult to author this as a minimal, clean document markup file in the way authors want. As a user of this system, my hope was that we could move away from the sort of complex formatting that is in RST towards something more clean and minimal.

As an author, the other advantage of a minimal metadata based specification is that it makes it easier to bring on coauthors. RMarkdown, Weave, etc. files have been proven to be within the grasp of people with minimal programming experience, whereas a verbose format makes it harder to attract contributors.

@chrisjsewell
Member Author

Thanks for the comments @jlperla. Here's my defence lol:

it is significantly more verbose ... it seems that everything is very cell-centric ... the emphasis on the metadata blocks suggests that it would be very difficult to author this as a minimal, clean document markup file

As I've mentioned, the document does not have to contain any metadata blocks, if you don't require any particular formatting of the cells or code outputs. My PoC parser would happily work with

# Header

Just writing some text, no need for metadata.

```{.python .code-cell}
print("I'm some generically formatted output")
```

Here the author doesn't really need to have any concept of 'cells'.

Utilising metadata in the YAML format, I think is quite user friendly (that's exactly why it was designed, as an alternative to JSON), and because it is a standard format, it is very easy to implement editor support for things like syntax highlighting, autocompletion and validation:

jmarkdown_metdata2

I think this makes it considerably easier for the user to grasp/work with, without the need to be constantly consulting the documentation for what options are available.

@chrisjsewell
Member Author

chrisjsewell commented Nov 14, 2019

Also:

it seems that everything is very cell-centric (which makes sense for jupyter but not for HTML or PDF)

Ermm, well everything in HTML is essentially cell centric, they're just referred to as divs instead of cells 😜. At the end of the day, you are always going to need a way to segregate your document into different blocks of content, whether by cell or div or section headings, etc.

@jlperla

jlperla commented Nov 14, 2019

the document does not have to contain any metadata blocks

But for a book written with this format, it would. Take one of the existing quantecon RST files and represent all of the variations on settings/etc. into your format. It is going to be full of metadata cells and impossible to write cleanly by hand. Half of the document might end up yaml blocks. Note that Rmd/bookdown only has metadata where it is really needed because the markup language is designed to have direct features for authoring content and doesn't try to be general. Similarly, RST has custom directives which jupinx exploits.

One issue is that there are multiple outputs, but Jupyter has the least "information" in it, and is not designed as a higher-level DSL. If you try to use a general purpose format like jupyter for the content, the only way to store all of that stuff is a pile of metadata with every setting and permutation of the output, because the jupyter format isn't designed as a general document markup for direct authoring of interconnected content (unlike sphinx or rmarkdown/bookdown).

Here's my defence lol:

I don't think you need to worry about defending it at all! What you have written is the best possible representation of a jupyter notebook in a text format. I don't think there is any suggestion I could make that would be an improvement, and I would love to use it.

My point here is that I would rather use the existing RST or an RMarkdown clone as a markup language for writing online book content, because I don't think jupyter has the specialized semantics required for content authoring of this sort.

Combining a few examples from Rmd, you can see

  • If I want to make a citation in Rmd, all I do is write @mycite directly in the text.
  • If I want to have a labelled figure, I could just do {python, out.width='25%', fig.align='center', fig.cap='...'} and then have the code inside.
  • If I want to have a section that I can reference, I just write:
# Introduction {#intro}
``{theorem}
My theorem text....

\begin{equation}
a + b x = c  (\#eq:linear)
\end{equation}
``

``{python cars-plot out.width='25%', fig.align='center', fig.cap='My figure'}
plot(x,y)
``

Some inline math $x = 2$

This is Chapter \@ref(intro)
See Figure \@ref(fig:cars-plot)
See Equation \@ref(eq:linear)
See @Aiyigari1994  as a citation

and all of the other stuff in https://bookdown.org/yihui/rmarkdown/bookdown-markdown.html#special-headers including theorems and proof environment.

I think that is pretty tough to beat for user friendliness. The jupinx features are also excellent as a higher-level language, even if RST itself is verbose.

Again, to reiterate: I think that your format is something that I would love to use for storing/working directly with jupyter notebooks that are not an online book. For a book, I would prefer higher-level semantics in the markup language itself.

@chrisjsewell
Member Author

chrisjsewell commented Nov 14, 2019

Firstly I'd stress that I personally wouldn't want to write in any of these formats as they stand, because none has a decent LSP extension. If someone wants to write in a 'plain' text editor, fine, but they're living in the dark ages lol.

To that end, it wouldn't really bother me if "half the document was metadata" because, you just have an option to 'collapse' them all.

image

Take one of the existing quantecon RST files

It might be helpful if you could attach one of these files here, for me to have a look at.
The format I specify here though, to my mind, closely mirrors RST directives. Instead of:

.. my-directive::
   :option1: abc
   :option2: efg

   Some content

You are writing something like:

```metadata
my-directive:
  option1: abc
  option2: efg
```

Some content
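
To make the correspondence concrete, here is a small sketch that renders the same (name, options, content) triple in both forms. The function names are hypothetical, and the option values are written out naively, which is only safe for simple scalar values like the ones above.

```python
FENCE = "`" * 3  # built programmatically so this example nests cleanly


def to_rst_directive(name, options, content):
    """Render an RST directive with its options and indented content."""
    lines = [f".. {name}::"]
    lines += [f"   :{key}: {value}" for key, value in options.items()]
    lines += ["", *("   " + line for line in content.splitlines())]
    return "\n".join(lines)


def to_metadata_block(name, options, content):
    """Render the same information as a metadata block followed by content."""
    lines = [FENCE + "metadata", f"{name}:"]
    lines += [f"  {key}: {value}" for key, value in options.items()]
    lines += [FENCE, "", content]
    return "\n".join(lines)


options = {"option1": "abc", "option2": "efg"}
print(to_rst_directive("my-directive", options, "Some content"))
print()
print(to_metadata_block("my-directive", options, "Some content"))
```

The main structural difference is that the RST content is nested inside the directive by indentation, whereas the metadata block is a sibling of the content it applies to.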

Note that both RMarkdown and JMarkdown don't inherently support nested blocks, like RST does. In some respects I would be happy to just use RST, and IPyPublish already has relatively decent Notebook/RMarkdown -> RST -> HTML support (see Writing Markdown and sphinx.ext.notebook). However, as we've said, it's a bit verbose, not necessarily the most user-friendly, and you would need to implement an RST <-> Notebook round trip converter.

Jupyter has the least "information" in it

I'm not sure I understand you here. Notebooks are just JSON blobs, so surely they can store arbitrary amounts of information.

the only way to store all of that stuff is a pile of metadata with every setting and permutation of the output

Again, I'm not sure what additional metadata you are referring to, that does not already need to be stored in an RST/RMarkdown file?

if I want to make a citation in Rmd all I do is @mycite directly in the text

That's exactly what I was already proposing anyway, and is already implemented in IPyPublish?

If I want to have a labelled figure, I could just do {python, out.width='25%', fig.align='center', fig.cap='...'} and then have the code inside.

As I mentioned before, my only 'gripe' with this format is that (a) it's not great if you want to set a lot of options, as you are going to get a very long line, and (b) it's a fairly bespoke standard, which is an added burden for parsers to support, e.g. if I feed this through pandoc, everything within the ```s is just treated as raw text (the same with the ```theorem block). But functionally they are exactly the same: you've just taken the metadata block, condensed it to one line, and placed it inside the code cell.

Also, I've just noted, that it doesn't appear that Bookdown has much support for code generated tables (e.g. pandas.DataFrame) or equations (e.g. sympy )?

@jstac
Member

jstac commented Nov 14, 2019

Here's a typical rst file from the QuantEcon lectures, as requested:

https://raw.githubusercontent.com/QuantEcon/lecture-source-py/master/source/rst/finite_markov.rst

It's compiled to ipynb using Jupinx but we have no round trip support. I would love to have that because I want to jump backwards and forwards between text source and ipynb.

I also think that we need to put a lot of emphasis on people who start out editing purely in Jupyter. That will probably be the most common use case. As their book expands, they will see the need to use the text representation more. So, again, the mapping from text source to ipynb and back is critical.

I think the YAML approach will work with sufficient editor support.

@mmcky
Member

mmcky commented Nov 14, 2019

metadata

I agree that the metadata implementation we run with should be as concise as possible while retaining flexibility (a lot of trade-offs to explore here). When we compare and contrast, it would be helpful to explore ways to minimise metadata in the yaml blocks by using contextual information from single documents (within larger book style projects), in addition to ways to have default settings that are overridden only when needed. For example, it would be nice to have general figure size settings for generated images from code, and then override them in the metadata if this needs to change.
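
One possible way to get this kind of layered default is a lookup chain in which cell metadata overrides document defaults, which in turn override project defaults. The sketch below uses collections.ChainMap; the key names (fig_width, fig_align) and the effective_metadata function are hypothetical, and the merge is shallow (nested dictionaries are not combined).

```python
from collections import ChainMap


def effective_metadata(cell_meta, document_defaults, project_defaults):
    """Resolve settings, with the earliest mapping taking precedence."""
    return dict(ChainMap(cell_meta, document_defaults, project_defaults))


project_defaults = {"fig_width": "80%", "fig_align": "center"}
document_defaults = {"fig_width": "60%"}

# A cell with no metadata picks up the defaults.
print(effective_metadata({}, document_defaults, project_defaults))
# A cell only states what differs from the defaults.
print(effective_metadata({"fig_width": "25%"}, document_defaults, project_defaults))
```

With such a scheme most cells would need no metadata block at all, and the blocks that remain would hold only the overrides.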

  • I think one of the next steps in this discussion of formats is to actually write some documents in RST, JMarkdown etc. and compare and contrast markups. I think this will help in the iteration of design.

@jlperla

jlperla commented Nov 14, 2019

If someone wants to write in a 'plain' text editor, fine, but they're living in the dark ages lol.

Of course I am talking about using things like vscode, but I think you overestimate the role of tooling. When I type in snippets in vscode's latex extension (which is what I use day to day), it is filling in the high-level latex language... not dropping all sorts of metadata and boilerplate into the code. The tool that is ideal for editing the format you proposed is Jupyter, not a vscode extension.

To that end, it wouldn't really bother me if "half the document was metadata" because, you just have an option to 'collapse' them all.

If tooling was enough to overcome verbose boilerplate, then Java might have remained the world's most popular language. Instead, its verbosity is killing it in the long run, no matter how good the tooling has become. People love (and intro users can handle) Rmd because of its simplicity and lack of noise.

I'm not sure I understand you here. Notebooks are just JSON blobs, so surely they can store arbitrary amounts of information.

Sorry, I meant semantic information. Jupyter as the output format (as used by end-users directly, not thinking about it as an editor for content which is separate) has the least amount of formatting and special cases compared to pdf and html.

Again, I'm not sure what additional metadata you are referring to, that does not already need to be stored in an RST/RMarkdown file.

Rmd/bookdown is a specification written to expose a lot of these concepts through a top-level DSL for content creation, rather than as a generic file format. If you do not have top-level commands and constructions along the same lines as Rmd, then you have to represent all of that in JSON metadata.

I also think that we need to put a lot of emphasis on people who start out editing purely in Jupyter. That will probably be the most common use case.

I don't know. Rmd has been a massive success, in part because it is a focused language which anyone feels they can read/write as text. I would go so far as to say that it is one of the key reasons for R's success.

But let's say you are right that people would love to use jupyter as an editor. If you want jupyter as a key way for people to write content, that doesn't mean you need to roundtrip. Jupyter files could have a one-directional transformation to a specialized markdown that is done within the build process (i.e. it is one-way, not one-time). It is only when you need to use advanced features that you switch to generating those files and editing them directly. Everyone wins. Those who want to use jupyter can do so, and those who want an Rmd style markdown with high-level editing semantics win as well.

But as soon as you round-trip, you end up having to use something compatible with JSON metadata in Jupyter and/or have fragile generated HTML or other markup inside of the jupyter blocks.

To give an analogy: have you ever used Scientific WorkPlace or LyX? In theory, they both generate pure latex that is editable in vscode without the GUI (i.e. a roundtrip), but in reality they create so much metadata and so many fragile comments that you can't really read the output on its own, and if you edit it directly there is a good chance you will break the roundtrip compatibility.

But I will stay out of it now that I have made my plea as a user not to have to edit a file filled with mounds of yaml blocks. I would rather use the current RST format, which at least allows us to have a DSL made from high-level specialized directives.

@chrisjsewell
Member Author

But I will stay out of it now

No please keep opinionating, thanks 😀 It's great to hear another side of the argument.

@jlperla

jlperla commented Nov 15, 2019

No please keep opinionating

I will wait until you guys have some specifications you want feedback on. Otherwise too much opinionating could easily get in the way of progress!

And, just in case it wasn't sufficiently clear, I really would love to use your jmarkdown specification, essentially as is, as an alternative to the ipynb format. It would make my life easier for non-book editing tasks. A good analogy to me is that for mathematica, I always write .m files rather than .nb files (and still use the mathematica GUI) in order to have the source in a sane github setup.

@choldgraf
Member

hey folks - I will read through this and digest a bit. Just want to say thanks very much to @chrisjsewell for fleshing out his ideas, and thanks to @jlperla for providing thoughtful perspective and feedback, I'm excited that we'll make some big progress on this!

@mwouts

mwouts commented Nov 15, 2019

Hello @chrisjsewell! This is a super interesting conversation and I've learnt a lot by reading the comments above. Now I'll try to contribute back a few (opinionated 😄 ) thoughts...

  • I think JMarkdown should support round trips with Jupyter notebooks. Because 1) It increases your user base - everyone has notebooks and 2) it's probably how you will execute the code anyway
  • I agree 100% that notes, captions, theorems etc are missing in the notebook. I'd be glad to have them in your format, but... I also want them in Jupyter
  • I hold R Markdown in very high esteem. It is both simple and powerful. Did you know that they support child documents? That cell options can be R variables or expressions? One year ago their implementation of HTML slides looked more advanced than Jupyter's one - at least I could include JS plots in this example. Also their system of cache and dependencies between cells is interesting. But sure, it's hard to market for Python users, because 1) it has to be compiled with R and 2) people are more used to Jupyter.
  • Now tell me what JMarkdown stands for: Jupyter Markdown? (not Julia, right)?
  • And then, sorry for that question... Do we really need a third Markdown representation of notebooks, beside Jupytext's Markdown and Pandoc's Jupyter Markdown? Already it is a pity that pandoc's Markdown representation of Jupyter notebooks is so different from Jupytext's one. Note that I couldn't evolve Jupytext's one (September, 2018) to pandoc's one (February, 2019) as in Jupytext Markdown, we try to get a Markdown file that looks nice on many viewers (GitHub, VS Code, PyCharm), and this is not possible if we use divs like pandoc.
  • Note, still, that I am open to evolving the Jupytext Markdown format. Thanks to @choldgraf, writing metadata or explicit cell boundaries will be much simpler in Jupytext 1.3. In other words, please let me know what you don't like in Jupytext's Markdown!
  • Thanks for the link to jupinx, it's very interesting for me to discover it. Can it turn arbitrary rST files into meaningful notebooks? What do we lose when we do rST -> (jupinx) -> ipynb -> (jupytext) -> md?
  • I love the idea of writing a plugin for VS Code. Note that RStudio does offer metadata completion in R Markdown:
    image
    Do you think I could reuse what you've done for "Visual studio extension for Jupytext" (mwouts/jupytext#143)? To begin with, I'd like to execute jupytext --sync on save.

@chrisjsewell
Member Author

Thanks for the feedback @mwouts, I've certainly learnt a lot as well!
Well, it looks like we're sticking with RMarkdown 😂

At this point, I find myself playing devil's advocate, and asking: If packages like jupyter-book, jupinx and IPyPublish are, in some respect, trying to replicate functionality that is already in Bookdown/Knitr, why not just use Bookdown/Knitr?

@choldgraf sorry this is probably a complete misrepresentation of jupyter-book, but given the aim:

"Jupyter Books lets you build an online book using a collection of Jupyter Notebooks and Markdown files. Its output is similar to the excellent Bookdown tool, and adds extra functionality for people running a Jupyter stack."

and current pipeline:

RMarkdown (jupytext)-> Jupyter Notebook (nbconvert/Pandoc)-> HTML/LaTeX

(Note the base nbconvert.HTMLExporter template uses mistune for Markdown->HTML, but if you want to add in all the functionality of Bookdown and also LaTeX writing, this will likely require pandoc)

Would it not be better to do:

Jupyter Notebook (jupytext)-> RMarkdown (Bookdown/Knitr)-> HTML/LaTeX

Is it worth the 'extra functionality', to try to replicate Bookdown/Knitr in nbconvert/Pandoc, or would it not be better to add this functionality by making PRs to Bookdown/Knitr, or creating a separate fork of Bookdown that is more specialized for the Jupyter stack?

@mwouts

mwouts commented Nov 15, 2019

Well, it looks like we're sticking with RMarkdown 😂

Well I rather see RMarkdown as a great challenger. You'll want to compare your stack to theirs. But there's definitely room for a Python stack. People using Python/Jupyter will find Python tools easier to install. Not even mentioning maintenance... contributing to Knitr or Bookdown may not be very easy unless you have much experience with R!

For the same reason, unless you want to open your notebooks in RStudio, I think you probably want to use Jupytext to convert your notebooks to Jupytext Markdown (.md) rather than R Markdown (.Rmd).

@chrisjsewell
Member Author

chrisjsewell commented Nov 15, 2019

Contributing to Knitr or Bookdown may not be very easy unless you have much experience with R

That's the crux of the matter. I certainly don't want to have to start programming in R! But are 'we' (the python/jupyter community) willing to create/maintain a proper python alternative to Knitr/Bookdown? For reasons I've already mentioned, at least in its current state, I don't think nbconvert is the tool to achieve this.

For the same reason, unless you want to open your notebooks in RStudio, I think you probably want to use Jupytext to convert your notebooks to Jupytext Markdown (.md) rather than R Markdown (.Rmd).

Jupytext Markdown is certainly helpful, but it doesn't inherently support any of the syntax required to write technical/scientific documentation à la Bookdown. Again, this is why I think it would be a big win to write an LSP/VS Code extension for RMarkdown, so that you weren't restricted to only using it in RStudio.
FYI I've previously written a VS Code Extension; vscode-moose. So if we were to go down the RMarkdown route, I'd be happy to help write this (given time/funding of course!).

@choldgraf
Member

choldgraf commented Nov 15, 2019

Hey all - I am going to pick a few ideas to respond to here, to avoid this becoming a wall of text in an already very-long thread :-)

I'll refer to Chris' idea for a "Jupyter Markdown" as "Jmd" for the rest of the post.

from @chrisjsewell / @jlperla

If tooling was enough to overcome verbose boilerplate, then Java might have remained the world's most popular language

My intuition is that the language will be most useful and most widely adopted if it is pleasant and easy to look at without any extra tooling. It should work nicely with syntax highlighters, code folders, etc., but it can't only look nice if you have those tools, or I suspect it'll be a non-starter for many people. (That said, I think YAML header blocks could look quite nice.)

I agree with @mmcky that it would be helpful to have a representative "page of content" that we can write in each of the formats we're considering, so that we can make comparisons about the simplicity and structure of each.

You'll also note that I haven't used the nbconvert conversion mechanism.

I think it'd be great if there were a separate Python package that goes from jmd to HTML, similar to the RMarkdown package. If that existed, it would be fairly trivial to implement a processor for nbconvert that utilized this package. That feels like the most modular "do one thing and do it well" solution, no?

to create HTML, it is better to use Sphinx

Just so I understand - right now ipypublish converts everything to rST via nbconvert, and then to HTML via Sphinx, right? Would this proposal be to go directly from notebooks -> sphinx? And if so, I wonder if we can upstream this as an improvement to another project like nbsphinx.

Also, another note is that I'd love for this project to result in upstream changes and improvements to tools like nbconvert. I agree it is clunky and difficult to work with, but this is in many ways because it's had few resources over the years. There's renewed interest in improving it and I'd love to be a part of that effort as well (for me, this project is about making upstream contributions as well as building a publishing tool).

from @mwouts

we try to get a Markdown file that looks nice on many viewers (GitHub, VS Code, PyCharm), and this is not possible if we use divs like pandoc

I think you and @chrisjsewell agree on this goal. Something I like about Jmd is that it behaves nicely with pre-existing syntax highlighting in most editors, similar to jupytext md.

I agree 100% that notes, captions, theorems etc are missing in the notebook. I'd be glad to have them in your format, but... I also want them in Jupyter

In my mind, there are two goals to this piece of the project:

  1. Agree on the syntax we'll use for within-cell functionality like citations (do we use @?), equation references (e.g. :eq: or @) etc.
  2. Agree on the syntax we'll use to structure content into cells along with metadata (e.g. do we use yaml headers in code blocks, or JSON in-line with the triple backticks, etc)

I think no. 1 is something we can build into Jupyter's interfaces via an extension, maybe an LSP, or conversion tools (be it nbconvert, sphinx, pandoc, etc). No. 2 is something that we can solve independently of no. 1, and should be done with the stakeholders who are already working in the space of "Jupyter notebooks as text files" - @mwouts I'd consider you one of the leading voices in this space!

Do we really need a third Markdown representation of notebook, beside Jupytext's Markdown and Pandoc's Jupyter Markdown

Perhaps as a start, if a "Jmd" flavor exists that breaks from jupytext markdown (e.g. YAML metadata blocks), then we can upstream these to Jupytext and work to try and unify the two in the medium-term. Don't forget that this project can be in a "research and prototyping" phase for a while before we start "officially" shipping things. I'm a fan of exploring what's possible and walking down paths a bit before we make strong decisions.

I hold R Markdown in very high esteem. It is both simple and powerful. Did you know that they support child documents? That cell options can be R variables or expressions?

To me these are all great things that we want to support on the Python / Jupyter side. It would also be great to be able to do things like interpolate variable values into markdown cells (e.g. I could write something like "the mean in group 1 was significantly greater than zero (p=={{ my_p_value }})")
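To make the interpolation idea concrete, here is a minimal sketch in Python. It is only an illustration: the `{{ name }}` placeholder syntax follows the example sentence above, while the `interpolate` helper and the idea of passing in a plain dictionary of variables are assumptions, not part of any existing Jupyter tool.

```python
import re

# Hypothetical placeholder syntax: {{ variable_name }}
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def interpolate(markdown: str, namespace: dict) -> str:
    """Replace each {{ name }} with str(namespace[name]).

    Unknown names are left untouched so a missing variable is easy to spot.
    """

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name in namespace:
            return str(namespace[name])
        return match.group(0)

    return PLACEHOLDER.sub(substitute, markdown)


text = "The mean in group 1 was greater than zero (p={{ my_p_value }})."
print(interpolate(text, {"my_p_value": 0.012}))
```

In a real tool the namespace would presumably come from the running kernel rather than a hand-written dictionary.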

And one more from @chrisjsewell since it's a longer answer:

Is it worth the 'extra functionality', to try to replicate Bookdown/Knitr in nbconvert/Pandoc, or would it not be better to add this functionality by making PRs to Bookdown/Knitr, or creating a separate fork of Bookdown that is more specialized for the Jupyter stack?

This could be a reasonable approach to take. The reason that I didn't do this at the beginning of Jupyter Book, and why I'd probably still not do this now, is:

  1. I don't know R very well, and when I have used it, haven't found it a very compelling tool for software development (but maybe that's my own ignorance)
  2. It's unclear to me what kind of governance bookdown/knitr/etc have. The R-community (and I guess bookdown/knitr) feels very driven by RStudio (which is an amazing company in many ways, I am just not sure what implication this has for open source governance and decision-making...but I probably spend more time than most worrying about this kinda thing).
  3. I suspect that for many users, saying "to use this tool you'll need to install R" will be a non-starter. And we have the benefit that if people have jupyter installed, they'll also already have Python (and that's a lot of people)

As an aside, the thing I love about Jupyter Book is that it's still a very simple project. Most of the heavy lifting is in the CSS / Javascript, and other than this, it's a fairly lightweight wrapper that orchestrates several other tools for doing things:

  • executing notebooks: nbconvert (I'd like to use papermill)
  • converting text files to notebooks: jupytext
  • converting notebooks to HTML: nbconvert (the html template is extremely simple, and the preprocessors are almost totally "out of the box")
  • converting multiple HTML pages into a book: Jekyll (I really want this to be something else, I was planning on Hugo but am now interested in Sphinx after conversations with y'all)

I suspect that the reasons many people like Jupyter Book are:

  1. They can use jupyter notebooks as inputs (and with pre-populated outputs so they don't need to run them each time)
  2. The CLI is pretty simple to use, and the "out of the box" result works well enough (I think this is quite important, us power-users want to configure everything, but most users just want to run a single command or two and have it look great)
  3. The final output looks nice and has some nifty features (I basically just tried to copy as much from the tufte document philosophy as I could)
  4. It has the word "jupyter" in the project name and it's under the jupyter github org (I know this sounds unimportant, but it's surprisingly impactful at getting adoption)

IMO we could totally rip out the underlying build system for Jupyter Book and still satisfy all of these things (full disclosure: I'd love for this grant to do this). The main things that I want to keep are:

  1. The command-line interface should "just work" with only a folder of notebooks (optionally, a configuration file, optionally, a table of contents file)
  2. The general out-of-the-box look and feel shouldn't be too different from where it is now
  3. As much of the work as possible should be done with other, modular tools (ideally, tools that already existed before Jupyter Book existed), and any changes to those tools are made by upstreaming improvements
  4. The tool should be easy to understand and maintain by newcomers to the project without much supervision from us.

OK well this has become a wall of text anyway :-P I'll stop there, but would love to hear what other folks think about any/all of this.

@chrisjsewell
Member Author

chrisjsewell commented Nov 15, 2019

Thanks @choldgraf

Just to pick on a few points:

converting notebooks to HTML: nbconvert (the html template is extremely simple, and the preprocessors are almost totally "out of the box")

Yes for now... But trust me, they won't be so simple when you start wanting to add in all the Bookdown-esque features. For example, I would point you towards ipy-sphinx.yaml.j2. Also, if you want to convert to other document types like LaTeX, you have to create a separate template for each type.

What I like about pandoc, is that it treats the document as "a document", with a set of document elements (as outlined in the panflute API). Knitr then essentially adds a single additional element to this API in the form of chunks, which are a specialized version of CodeBlocks.
In this system a Jupyter Notebook code cell is just one of an arbitrary number of chunk types, and a raw cell is just a RawBlock.
You then achieve modularity by walking through the document, and when you find a chunk, you check its type, then pass its metadata and contents to a 'converter' assigned to that type, which converts the chunk to regular pandoc elements.
With this API it's easy to keep the document output-agnostic for as long as possible, before having to deal with any output-specific aspects that pandoc doesn't already take care of.
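A rough sketch of that walk-and-dispatch pattern, in plain Python. Everything here is a stand-in: `Para`, `RawBlock` and `Chunk` are simplified dataclasses rather than the real panflute classes, and the chunk kinds `"code-cell"` and `"raw-cell"` and the converter functions are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union


@dataclass
class Para:
    text: str


@dataclass
class RawBlock:
    text: str
    format: str


@dataclass
class Chunk:
    """A specialized code block: a kind, its source, and its metadata."""
    kind: str
    source: str
    metadata: dict = field(default_factory=dict)


Block = Union[Para, RawBlock, Chunk]
Converter = Callable[[Chunk], List[Block]]


def convert_code_cell(chunk: Chunk) -> List[Block]:
    # A real converter would execute the code and build figures, tables, etc.
    blocks: List[Block] = [Para(f"[code] {chunk.source}")]
    caption = chunk.metadata.get("caption")
    if caption:
        blocks.append(Para(f"[caption] {caption}"))
    return blocks


def convert_raw_cell(chunk: Chunk) -> List[Block]:
    return [RawBlock(chunk.source, chunk.metadata.get("format", "html"))]


# One converter per chunk kind; new kinds are added by registering here.
CONVERTERS: Dict[str, Converter] = {
    "code-cell": convert_code_cell,
    "raw-cell": convert_raw_cell,
}


def resolve_chunks(doc: List[Block], converters: Dict[str, Converter]) -> List[Block]:
    """Walk the document and replace every chunk with regular elements."""
    resolved: List[Block] = []
    for block in doc:
        if isinstance(block, Chunk):
            converter = converters.get(block.kind)
            if converter is None:
                raise ValueError(f"no converter registered for chunk kind {block.kind!r}")
            resolved.extend(converter(block))
        else:
            resolved.append(block)
    return resolved


doc = [
    Para("Some prose."),
    Chunk("code-cell", "plot(x, y)", {"caption": "A plot"}),
    Chunk("raw-cell", "<hr/>", {"format": "html"}),
]
for block in resolve_chunks(doc, CONVERTERS):
    print(block)
```

The point of the sketch is that the output format never appears in the walk itself; only the elements produced by the converters would need to be rendered by a writer later on.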

nbconvert effectively inverts this system to say that there are only three top-level elements: markdown, code and raw cells. What this means in practice is that you end up having to treat each markdown cell as a separate document (that you apply pandoc/knitr to), and the code and raw cells as special elements, and finally you have to combine these three into a single document. So basically what you have done is just create a more complex system on top of the system that you have to use anyway lol.

Just so I understand - right now ipypublish converts everything to rST via nbconvert, and then to HTML via Sphinx, right? Would this proposal be to go directly from notebooks -> sphinx? And if so, I wonder if we can upstream this as an improvement to another project like nbsphinx.

That's correct.

The genesis of sphinx.ext.notebook was a direct copy of nbsphinx, which I then hooked into the main IPyPublish conversion class, adding all the extra functionality for cell output referencing, captions, etc.

I had to do this because nbsphinx 'hardwires' in a lot of the nbconvert aspects, like an RST template, and the use of the ExecutePreprocessor. I'd note also that nbsphinx uses a single RST file to include HTML and LaTeX output, using the sphinx .. only:: directive. However, at least in Sphinx v1.8, this wasn't compatible with figure referencing.

You might be able to go directly from panflute.Doc to a docutils.document (the API that sphinx uses), but it's probably easier to go via RST.

@choldgraf
Member

Just a quick response for now:

Yes for now... But trust me, they won't be so simple when you start wanting to add in all the Bookdown-esque features.

I totally agree - I think Jupyter Book has a ceiling on its book-like functionality until we adopt a different kind of build system (or do some major overhauls to nbconvert etc). I'm definitely open to using pandoc, in fact it was one of the very first issues in jupyter book

@jlperla

jlperla commented Nov 15, 2019

I created a list of some examples of more "book" style markdown I have found invaluable and posted it in #11. Feel free to ignore, or ask me for any clarifications if you wish. I tried to map everything to Rmd/bookdown, and most of the sphinx/jupinx stuff maps well.

A few comments on stuff above (without providing a specific critique, as I promised above)

I love the idea of writing a plugin for VS Code. Note that RStudio does offer metadata completion in R Markdown:

@mwouts The GUI you show in RStudio is for syntax completion rather than metadata completion. I think the difference is crucial. R Markdown/Bookdown is a syntax for a specialized documentation and publishing language that also includes higher-level publishing features (some through variations on chunk options), not a generic set of whitespace-sensitive metadata definitions for a cell-based notebook format.

Thanks for the link to jupinx, that's very interesting for me to discover it. Can it turn arbitrary rST files into meaningful notebooks? What do we lose when we do rST -> (jupinx) -> ipynb -> (jupytext) -> md?

As a user of jupinx, let me give one opinion:  Jupinx is defining a book/documentation layout format in the spirit of sphinx, not a notebook format.  Turning arbitrary RST into meaningful notebooks wouldn't make sense with that use-case (i.e. it is based on sphinx and is not notebook centric).  There are plenty of features for designing an online textbook that don't necessarily make any sense in a notebook format.  I also feel that the fact that it uses jupyter in the middle of the current build process is an irrelevant implementation detail.

To me, what I like about jupinx is not the RST part, but rather that it is a tool designed for books with multiple outputs (where jupyter is just one of them). If you replaced the jupinx RST markup feature-by-feature with a markdown version, I would be even happier. After looking at bookdown, I think that it provides a close-enough approximation.

Contributing to Knitr or Bookdown may not be very easy unless you have much experience with R
That's the crux of the matter. I certainly don't want to have to start programming in R! 

I think there is somewhat of a strawman here in thinking about Rmd/bookdown as a format vs. the R/knitr/rstudio toolchain. Certainly no one here actually wants to use R for the build implementation, contribute code to an R project, or deal with coordinating with a commercial company.

But if you look at Rmd/Bookdown, there is very little that is R-specific about it. To me, the real question is whether an "as compatible as possible" format with the same syntax and semantics as Rmd/bookdown should be implemented as a language- and editor-independent format (i.e. I mostly use Julia myself).

The main advantage of the existing RStudio/bookdown is that it provides a coherent design, a reference parser, and a set of unit tests for validating a Python-based parser/backend replacement. The other advantage is that there are complementarities with the RStudio/bookdown crew on contributions to pandoc, which I think it uses in the toolchain (as it is for them). A clone with the same open-source transformations in the toolchain might significantly speed up implementation and testing.

Make sure to look at Weave.jl (and PWeave, though I do none of my "real" work in Python so can't comment on its features). I use .jmd as my primary format for generating jupyter notebooks (note: it does not have the features of a book markdown specification), but it has essentially just ported the Rmd code-chunk specification and is a subset of Rmd, as far as I can tell. No actual R in sight.

Also, note that there is a https://atom.io/packages/language-weave  package which I use as the main GUI for editing.  This isn't a complete port of Rmd (since it doesn't include the things necessary for typesetting a book) but it shows the point.  In fact, I am pretty sure that if a more complete Rmd port was made, everyone in the julia community would use it, replacing Weave completely, and would contribute to tooling in Atom/vscode editors.

Again, this is why I think it would be a big win to write an LSP/VS Code extension for RMarkdown, so that you weren't restricted to only using it in RStudio.

@chrisjsewell 100% on this.  A language (and RStudio) agnostic implementation of the bookdown/Rmd format with a nice vscode extension (and Hydrogen/Weave+Juno style inline code-execution for editing) would be my dreamworld.  There may be a few features to sneak in that I love in jupinx, but otherwise I think that Rmd/bookdown has figured everything out in a very clean way - and (surprisingly?) kept things independent of R in the specification.

But just to make sure we are talking about the same thing: when I talk about an LSP/VS Code extension for RMarkdown, I am thinking about virtually all of the Rmd/bookdown features, not just the subset that maps cleanly to a cell-based notebook format.

@mwouts

mwouts commented Nov 16, 2019

Thanks @jlperla for #11, that is very instructive. Especially I liked the part on testing! Possibly in Rmd you could use the error option to see the error as the output of the test command rather than in the book itself.

Also that made me think that we may want to compile the book in different contexts, and execute only a subset of cells depending on the context. That is something one can do in R Markdown with eval=condition, where condition is some R code. It also raises the question of how to compile a book in a given environment (i.e. global variables or parameters).
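As a sketch of what context-dependent execution could look like on the Python side: the cell dictionaries, the `eval` option name (borrowed from the R Markdown option mentioned above) and the `should_run` / `select_cells` helpers below are all hypothetical. Note that Python's `eval` is not a sandbox, so this assumes the conditions are written by the book's own authors.

```python
def should_run(cell_options: dict, context: dict) -> bool:
    """Decide whether a cell executes in the given build context.

    The 'eval' option is either a boolean or a Python expression (as a
    string) that is evaluated against the context variables.
    """
    condition = cell_options.get("eval", True)
    if isinstance(condition, bool):
        return condition
    # Trusted input only: eval() is used for brevity, it is not a sandbox.
    return bool(eval(condition, {"__builtins__": {}}, dict(context)))


def select_cells(cells: list, context: dict) -> list:
    """Return only the cells that should be executed in this context."""
    return [cell for cell in cells if should_run(cell.get("options", {}), context)]


cells = [
    {"source": "import numpy as np"},
    {"source": "slow_simulation()", "options": {"eval": "target == 'pdf'"}},
    {"source": "interactive_plot()", "options": {"eval": "target == 'html'"}},
    {"source": "debug_dump()", "options": {"eval": False}},
]

for cell in select_cells(cells, {"target": "html"}):
    print(cell["source"])
```

The `context` dictionary plays the role of the "given environment" here: the same source could be compiled with different parameters for each output target.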

Now I have another question... we have described what the author writes, but we've not mentioned how the code should be executed. Do you have plans on this? If I understand correctly, you don't want to have to turn the document into a Jupyter notebook, yet maybe we could still use the Jupyter kernels for this?

In my experience, two years ago when RStudio had not yet released reticulate, R Markdown was not able to preserve the variables between cells in languages other than R, and because of that it was not really usable outside of R. Now it works well for both R and Python (text and plots are OK, but interactive outputs like Javascript plots are supported only for R). Do you know what the status of Julia is?

@jlperla

jlperla commented Nov 16, 2019

Possibly in Rmd you could use the error option to see the error as the output of the test command rather than in the book itself.

Yeah, I think so. Something along those lines would be perfect. With the quantecon https://julia.quantecon.org/status.html I think we do something very similar with post-processing the output with a test = true flag set. But the test=true style flag isn't really necessary if code chunks can be completely hidden from the display in the absence of errors. It is better to always run the tests.

The jldoctest style stuff is really the state-of-the-art, though. It is what all julia documentation uses (e.g. https://github.com/JuliaLang/julia/edit/master/doc/src/manual/mathematical-operations.md). The key is the utility to automatically replace the stored output blocks from any code to make updating regression tests easier, because otherwise getting the output embedded is too fragile.
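To illustrate the "check stored output, optionally rewrite it" workflow in Python terms, here is a small sketch. It is only loosely analogous to jldoctest: the `Example` class, `run_code` and `check_examples` are hypothetical names, and outputs are compared as captured stdout only.

```python
import contextlib
import io
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    code: str       # source of the code block
    expected: str   # output stored alongside it in the document


def run_code(code: str) -> str:
    """Execute a code block in a fresh namespace and capture its stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue()


def check_examples(examples: List[Example], fix: bool = False) -> List[int]:
    """Return the indices of examples whose stored output is stale.

    With fix=True, stale outputs are overwritten with the fresh output
    instead of being reported, which keeps regression updates cheap.
    """
    failures = []
    for index, example in enumerate(examples):
        actual = run_code(example.code)
        if actual != example.expected:
            if fix:
                example.expected = actual
            else:
                failures.append(index)
    return failures


examples = [
    Example("print(1 + 1)", "2\n"),
    Example("print(2 * 3)", "5\n"),  # deliberately stale stored output
]
print(check_examples(examples))            # report stale examples
print(check_examples(examples, fix=True))  # rewrite them in place
```

In CI one would run the check without `fix`, so that a contribution that changes results fails the build, and use the fix mode locally when the change is intended.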

To me, the key benefit of having better testing integration is that it opens up collaboration and CI integration. Otherwise, you have no idea if someone contributing a tweak to the code has made it generate the wrong results.

we have described what the author writes, but we've not mentioned how the code should be executed.

This I will stay out of. I am only trying to give the perspective of an "executable book" author, drawing on my experience with 2 books and helping manage multiple RAs on the production side.

@mmcky could provide details on how jupinx executes code, but my (personal) feeling is that it is an implementation detail that could be swapped out in a redesigned system.

In my experience, two years ago when RStudio had not yet released... Do you know what is the status of Julia?

I don't, but I also really wouldn't want to use RStudio for this. I was simply reading the Rmd/bookdown specification and relating it to my experiences with Jupinx.

But to me, I like the fact that Rmd has been battle-tested to produce production-quality printed/online/executable books. The importance of that sort of practical experience in building a specification cannot be overstated. The sphinx/jupinx features are similarly battle-tested, but are much weaker on production-quality PDFs than Rmd.

For Julia notebooks rather than textbooks, I use and love the Weave.jl and Juno tooling (i.e. an Atom extension for development which is excellent). I didn't even realize at the time that Weave's jmd format was effectively a subset of Rmd. It is an indicator to me that people are coordinating on the Rmd chunk specification, even for things that have nothing to do with R.
