WIP: Initial proposal for JMarkdown design specification #9
Conversation
Hey @mwouts thanks for the mwouts/jupytext#383 fix :) Thought you might like to give your take on the format proposed above. If you can't be bothered to read it all, just have a look at the basic format in test_nb2imd.md. |
This looks really promising - thanks @chrisjsewell for taking the time to think through it and share your ideas. I'll also take a deeper pass through this. Is there any particular feedback that you're looking for right now? Any particular areas that you're unsure about? Also, ping @emdupre who I think would be quite interested in this conversation as well. |
Thanks @choldgraf, I wouldn't say there's anything I'm unsure of. As you can see, I have quite strong ideas about this lol! The main thing is to get back critique, and answer the question: Could you see this being adopted by the community and users? As we've discussed before, RMarkdown has a decent user base, which would take time to replicate. But for the technical/scientific document use case, and for compatibility with Jupyter notebooks in general, I really feel it has some key disadvantages. I hope I've demonstrated here how, with the few tweaks made for JMarkdown, it makes it much simpler to implement the more complex document elements that we are trying to achieve here, without diminishing the readability and user experience (and hopefully enhancing it). |
@chrisjsewell is https://raw.githubusercontent.com/chrisjsewell/meta/imarkdown/tests/test_nb2imd/test_nb2imd.md representative of what a content author would write directly? |
Yes. Hopefully, with the aid of an LSP, akin to LaTeX-Workshop |
@chrisjsewell As a jupyter user, I would be strongly supportive of this format and it would make me more likely to use jupyter for "real" work (e.g. the builtin git diffs in something like github, along with discussions on PRs become possible whereas they are not right now). Right now I have all of my interactive code in In fact, I would love to have an option to natively store in that format instead of the current JSON one! In that case, the tooling isn't a problem since the jupyter notebook/lab interface gives the tools you need. In the design, it seems that everything is very cell-centric (which makes sense for jupyter but not for HTML or PDF) rather than being a general markup format where jupyter is just one output type. It also seems like it can faithfully reflect the metadata format of jupyter ipynb itself, which is a major advantage for software engineering reproducibility. But... as a content author for executable books, it is significantly more verbose than both the current RST based approach and bookdown/RMarkdown. This isn't because there are other approaches to representing jupyter in text (what you propose makes complete sense) but rather that the use cases for authoring books with multiple output formats are different from those of authoring jupyter notebooks. In particular, my gut tells me that the emphasis on the As an author, the other advantage of a minimal metadata based specification is that it makes it easier to bring on coauthors. RMarkdown, Weave, etc. files have been proven to be within the grasp of people with minimal programming experience, whereas a verbose format makes it harder to attract contributors. |
Thanks for the comments @jlperla. Here's my defence lol:
As I've mentioned, the document does not have to contain any metadata blocks, if you don't require any particular formatting of the cells or code outputs. My PoC parser would happily work with
Here the author doesn't really need to have any concept of 'cells'. Utilising metadata in the YAML format, I think is quite user friendly (that's exactly why it was designed, as an alternative to JSON), and because it is a standard format, it is very easy to implement editor support for things like syntax highlighting, autocompletion and validation: I think this makes it considerably easier for the user to grasp/work with, without the need to be constantly consulting the documentation for what options are available. |
Also:
Ermm, well everything in HTML is essentially cell centric, they're just referred to as |
But for a book written with this format, it would. Take one of the existing quantecon RST files and represent all of the variations on settings/etc. into your format. It is going to be full of metadata cells and impossible to write cleanly by hand. Half of the document might end up yaml blocks. Note that Rmd/bookdown only has metadata where it is really needed because the markup language is designed to have direct features for authoring content and doesn't try to be general. Similarly, RST has custom directives which jupinx exploits. One issue is that there are multiple outputs, but Jupyter has the least "information" in it, and which is not designed as a higher-level DSL. If you try to use a general purpose format like jupyter for the content, the only way to store all of that stuff is a pile of metadata with every setting and permutation of the output, because the jupyter format isn't designed as a general document markup for direct authoring of interconnected content (unlike sphinx or rmarkdown/bookdown).
I don't think you need to worry about defending it at all! What you have written is the best possible representation of a jupyter notebook in a text format. I don't think there is any suggestion I could make that would be an improvement, and I would love to use it. My statement here is that I would rather use the existing RST or a RMarkdown-clone as a markup language for writing online book content because I don't think jupyter has the specialized semantics required for content authoring of this sort. Combining a few examples from Rmd, you can see
and all of the other stuff in https://bookdown.org/yihui/rmarkdown/bookdown-markdown.html#special-headers including theorems and proof environment. I think that is pretty tough to beat for user friendliness. The jupinx features are also excellent as a higher-level language, even if RST itself is verbose. Again, to reiterate: I think that your format is something that I would love to use for storing/working with jupyter notebooks directly that are not an online book. For that, I would prefer higher-level semantics in the markup language itself. |
Firstly I'd stress that I personally wouldn't want to write in any of these formats as they stand, because none has a decent LSP extension. If someone wants to write in a 'plain' text editor, fine, but they're living in the dark ages lol. To that end, it wouldn't really bother me if "half the document was metadata" because, you just have an option to 'collapse' them all.
It might be helpful if you could attach one of these files here, for me to have a look at.
    .. my-directive::
        :option1: abc
        :option2: efg

        Some content
You are writing something like:
Note that both RMarkdown and JMarkdown don't inherently support nested blocks, like RST does. In some respects I would be happy to just use RST, and IPyPublish already has relatively decent Notebook/RMarkdown -> RST -> HTML support (see Writing Markdown and sphinx.ext.notebook). However, as we've said, it's a bit verbose, not necessarily the most user-friendly, and you would need to implement an RST <-> Notebook round trip converter.
I'm not sure I understand you here. Notebooks are just JSON blobs, so surely they can store arbitrary amounts of information.
Again, I'm not sure what additional metadata you are referring to, that does not already need to be stored in an RST/RMarkdown file?
That's exactly what I was already proposing anyway, and is already implemented in IPyPublish?
As I mentioned before, my only 'gripe' with this format is that (a) it's not great if you want to set a lot of options, you are going to get a very long line, (b) it's a fairly bespoke standard, that is an added burden to support by parsers, e.g. if I feed this through pandoc, everything within the Also, I've just noted, that it doesn't appear that Bookdown has much support for code generated tables (e.g. |
Here's a typical rst file from the QuantEcon lectures, as requested: https://raw.githubusercontent.com/QuantEcon/lecture-source-py/master/source/rst/finite_markov.rst It's compiled to ipynb using Jupinx but we have no round trip support. I would love to have that because I want to jump backwards and forwards between text source and ipynb. I also think that we need to put a lot of emphasis on people who start out editing purely in Jupyter. That will probably be the most common use case. As their book expands, they will see the need to use the text representation more. So, again, the mapping from text source to ipynb and back is critical. I think the YAML approach will work with sufficient editor support. |
I agree that the metadata implementation we run with should be as concise as possible while retaining flexibility (lot of trade-offs to explore here). When we compare and contrast -- It would be helpful to explore ways to minimise metadata in the yaml blocks by using contextual information of single documents (within larger book style projects), in addition to ways to have default settings that can be overridden only when needed. For example, it would be nice to have general figure size settings for generated images from code, and then override them in the metadata if this needs to change.
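One way to picture the "defaults that can be overridden only when needed" idea is a layered lookup: project-wide settings first, then document-level settings, then per-cell metadata. The sketch below is only an illustration of that layering; the setting names (`figure_width`, `figure_height`, `show_code`) and the function `resolve_settings` are hypothetical and are not part of any existing specification.

```python
from collections import ChainMap

# Hypothetical project-wide defaults, for illustration only.
PROJECT_DEFAULTS = {"figure_width": 6.0, "figure_height": 4.0, "show_code": True}


def resolve_settings(cell_meta, document_meta=None, project_defaults=PROJECT_DEFAULTS):
    """Return the effective settings for one cell.

    Lookup order: cell metadata, then document metadata, then project defaults.
    """
    layers = [cell_meta or {}, document_meta or {}, project_defaults]
    return dict(ChainMap(*layers))


# The document hides code everywhere; one cell asks for a wider figure.
document_meta = {"show_code": False}
plain_cell = resolve_settings({}, document_meta)
wide_cell = resolve_settings({"figure_width": 10.0}, document_meta)
```

With this kind of layering, a cell that needs no special treatment carries no metadata block at all, and only the exceptional cell states its override.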
|
Of course I am talking about using things like vscode, but I think you overestimate the role of tooling. When I type in snippets in vscode's latex extension (which is what I use day to day), it is filling in the high-level latex language... not dropping all sorts of metadata and boilerplate into the code. The tool that is ideal for editing the format you proposed is Jupyter, not a vscode extension.
If tooling was enough to overcome verbose boilerplate, then Java might have remained the world's most popular language. Instead, its verbosity is killing it in the long run, no matter how good the tooling has become. People love (and intro users can handle) Rmd because of its simplicity and lack of noise.
Sorry, I meant semantic information. Jupyter as the output format (as used by end-users directly, not thinking about it as an editor for content which is separate) has the least amount of formatting and special cases compared to pdf and html.
Rmd/bookdown is a specification written to make a lot of the concepts as a top-level DSL for content creation rather than as a generic file format. If you do not have top-level commands and constructions along the same lines as Rmd then you have to represent all of that in json metadata.
I don't know. Rmd has been a massive success, in part because it is a focused language which anyone feels they can read/write as text. I would go so far as to say that it is one of the key reasons for R's success. But let's say you are right that people would love to use jupyter as an editor. If you want jupyter as a key way for people to write content, that doesn't mean you need to roundtrip. Jupyter files could have a one-directional transformation to a specialized markdown that is done within the build process (i.e. it is one-way, not one-time). It is only when you need to use advanced features that you switch to generate those files and edit them directly. Everyone wins. Those who want to use jupyter can do so, and those who want a Rmd style markdown with high-level editing semantics win as well. But as soon as you round-trip, you end up having to use something compatible with JSON metadata in Jupyter and/or have fragile generated HTML or other markup inside of the jupyter blocks. To give an analogy: have you ever used Scientific WorkPlace or LyX? In theory, they both generate pure latex that is editable in vscode without the GUI (i.e. a roundtrip) but in reality, they create so much metadata and fragile comments that you can't really read the output on its own, and if you edit it directly there is a good chance you will break the roundtrip compatibility. But I will stay out of it now that I have made my plea as a user not to have to edit a file filled with mounds of yaml blocks. I would rather use the current RST format, which at least allows us to have a DSL made from high-level specialized directives. |
No please keep opinionating, thanks 😀 It's great to hear another side of the argument. |
I will wait until you guys have some specifications you want feedback on. Otherwise too much opinionating could easily get in the way of progress! And, just in case it wasn't sufficiently clear, I really would love to use your jmarkdown specification.... essentially as is... available as an alternative to the |
hey folks - I will read through this and digest a bit. Just want to say thanks very much to @chrisjsewell for fleshing out his ideas, and thanks to @jlperla for providing thoughtful perspective and feedback, I'm excited that we'll make some big progress on this! |
Hello @chrisjsewell! This is a super interesting conversation and I've learnt a lot by reading the comments above. Now I'll try to contribute back a few (opinionated 😄 ) thoughts...
|
Thanks for the feedback @mwouts, I've certainly learnt a lot as well! At this point, I find myself playing devil's advocate, and asking: If packages like jupyter-book, jupinx and IPyPublish are, in some respect, trying to replicate functionality that is already in Bookdown/Knitr, why not just use Bookdown/Knitr? @choldgraf sorry this is probably a complete misrepresentation of jupyter-book, but given the aim:
and current pipeline:
(Note the base Would it not be better to do:
Is it worth the 'extra functionality', to try to replicate Bookdown/Knitr in nbconvert/Pandoc, or would it not be better to add this functionality by making PRs to Bookdown/Knitr, or creating a separate fork of Bookdown that is more specialized for the Jupyter stack? |
Well I rather see RMarkdown as a great challenger. You'll want to compare your stack to theirs. But there's definitely room for a Python stack. People using Python/Jupyter will find Python tools easier to install. Not even mentioning maintenance... contributing to Knitr or Bookdown may not be very easy unless you have much experience with R! For the same reason, unless you want to open your notebooks in RStudio, I think you probably want to use Jupytext to convert your notebooks to Jupytext Markdown (.md) rather than R Markdown (.Rmd). |
That's the crux of the matter. I certainly don't want to have to start programming in R! But are 'we' (the python/jupyter community) willing to create/maintain a proper python alternative to Knitr/Bookdown? For reasons I've already mentioned, at least in its current state, I don't think nbconvert is the tool to achieve this.
Jupytext Markdown is certainly helpful, but it doesn't inherently support any of the syntax required to write technical/scientific documentation à la Bookdown. Again, this is why I think it would be a big win to write an LSP/VS Code extension for RMarkdown, so that you weren't restricted to only using it in RStudio. |
Hey all - I am going to pick a few ideas to respond to here, to avoid this becoming a wall of text in an already very-long thread :-) I'll refer to Chris' idea for a "Jupyter Markdown" as "Jmd" for the rest of the post. from @chrisjsewell / @jlperla
My intuition is that the language will be most-useful and most widely-adopted if it is pleasant and easy to look at without any extra tooling. It should work nicely with syntax highlighters, code folders, etc...but it can't only look nice if you have those tools or I suspect it'll be a non-starter for many people. (that said, I think yaml header blocks could look quite nice). I agree with @mmcky that it would be helpful to have a representative "page of content" that we can write in each of the formats we're considering, so that we can make comparisons about the simplicity and structure of each.
I think it'd be great if there were a separate Python package that goes from
Just so I understand - right now ipypublish converts everything to rST via nbconvert, and then to HTML via Sphinx, right? Would this proposal be to go directly from notebooks -> sphinx? And if so, I wonder if we can upstream this as an improvement to another project like Also, another note is that I'd love for this project to result in upstream changes and improvements to tools like nbconvert. I agree it is clunky and difficult to work with, but this is in many ways because it's had little resources over the years. There's renewed interest in improving it and I'd love to be a part of that effort as well (for me, this project is about making upstream contributions as well as building a publishing tool) from @mwouts
I think you and @chrisjsewell agree on this goal, something I like about Jmd is that it behaves nicely with pre-existing syntax highlighting in most editors, similar to jupytext md.
In my mind, there are two goals to this piece of the project:
I think no. 1 is something we can build in to Jupyter's interfaces via an extension, maybe a LSP, or conversion tools (be it nbconvert, sphinx, pandoc, etc). No. 2 is something that we can solve independent of no. 1, and should be done with the stakeholders who are already working in the space of "Jupyter notebooks as text files" - @mwouts I'd consider you one of the leading voices in this space!
Perhaps as a start, if a "Jmd" flavor exists that breaks from jupytext markdown (e.g. YAML metadata blocks), then we can upstream these to Jupytext and work to try and unify the two in the medium-term. Don't forget that this project can be in a "research and prototyping" phase for a while before we start "officially" shipping things. I'm a fan of exploring what's possible and walking down paths a bit before we make strong decisions.
To me these are all great things that we want to support on the Python / Jupyter side. It would also be great to be able to do things like interpolate variable values into markdown cells (e.g. I could write something like "the mean in group 1 was significantly greater than zero (p== And one more from @chrisjsewell since it's a longer answer:
This could be a reasonable approach to take. The reason that I didn't do this at the beginning of Jupyter Book, and why I'd probably still not do this now, is:
As an aside, the thing I love about Jupyter Book is that it's still a very simple project. Most of the heavy lifting is in the CSS / Javascript, and other than this, it's a fairly lightweight wrapper that orchestrates several other tools for doing things:
I suspect that many people like Jupyter Book because:
IMO we could totally rip out the underlying build system for Jupyter Book and still satisfy all three things (full disclosure: I'd love for this grant to do this). The main things that I want to keep are:
OK well this has become a wall of text anyway :-P I'll stop there, but would love to hear what other folks think about any/all of this. |
Thanks @choldgraf Just to pick on a few points:
Yes for now... But trust me, they won't be so simple when you start wanting to add in all the Bookdown-esque features. For example, I would point you towards ipy-sphinx.yaml.j2. Also, if you want to convert to other document types like LaTeX, you have to create a separate template for each type. What I like about pandoc, is that it treats the document as "a document", with a set of document elements (as outlined in the panflute API). Knitr then essentially adds a single additional element to this API in the form of chunks, which are a specialized version of nbconvert effectively inverts this system to say that there are only three top-level elements; markdown, code and raw cells. What this means in practice is that you end up having to treat each markdown cell as a separate document (that you apply pandoc/knitr to), and the code and raw cells as special elements, and finally you have to combine these three into a single document. So basically what you have done, is just create a more complex system on top of the system that you have to use anyway lol.
That's correct. The genesis of sphinx.ext.notebook, was a direct copy of I had to do this because nbsphinx 'hardwires' in a lot of the nbconvert aspects, like an RST template, and the use of the You might be able to go directly from panflute.Doc to a docutils.document (the API that sphinx uses), but probably it's easier to go via an RST. |
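To make the "three top-level elements" point concrete, here is a rough sketch of the cell-by-cell model: each cell is rendered on its own, dispatching on `cell_type`, and the fragments are stitched together at the end. This is not nbconvert's actual code (which is driven by Jinja templates); the functions `render_cell` and `notebook_to_document` are hypothetical, and the notebook is a hand-written stand-in that only carries the nbformat v4 fields used here.

```python
import json

FENCE = "`" * 3  # a markdown code fence, built this way to keep the example readable

# A tiny hand-written notebook in the nbformat v4 JSON layout.
NOTEBOOK_JSON = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n", "Some text."]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print(1 + 1)"]},
        {"cell_type": "raw", "metadata": {}, "source": ["\\newpage"]},
    ],
})


def render_cell(cell):
    """Render one cell to a text fragment, dispatching on its type."""
    source = cell["source"]
    if isinstance(source, list):  # nbformat allows a string or a list of lines
        source = "".join(source)
    if cell["cell_type"] == "markdown":
        # In a real pipeline this is where each cell gets its own converter call.
        return source
    if cell["cell_type"] == "code":
        return FENCE + "python\n" + source + "\n" + FENCE
    if cell["cell_type"] == "raw":
        return source
    raise ValueError("unknown cell type: {!r}".format(cell["cell_type"]))


def notebook_to_document(nb_json):
    """Combine the separately rendered cells into a single document string."""
    notebook = json.loads(nb_json)
    return "\n\n".join(render_cell(cell) for cell in notebook["cells"])


document = notebook_to_document(NOTEBOOK_JSON)
```

The contrast with the pandoc view is that here nothing ever sees the document as a whole until the final join, so anything that spans cells (numbering, cross-references) has to be bolted on afterwards.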
Just a quick response for now:
I totally agree - I think Jupyter Book has a ceiling on its book-like functionality until we adopt a different kind of build system (or do some major overhauls to nbconvert etc). I'm definitely open to using pandoc, in fact it was one of the very first issues in jupyter book |
I created a list of some examples of more "book" style markdown I have found invaluable and posted it in #11 Feel free to ignore, or ask me for any clarifications if you wish. I tried to map everything to Rmd/bookdown, and most of the sphinx/jupinx stuff maps well. A few comments on stuff above (without providing a specific critique, as I promised above)
@mwouts The gui you show in RStudio is for syntax completion rather than metadata completion. I think the difference is crucial. R Markdown/Bookdown is a syntax for a specialized documentation and publishing language which also includes higher-level publishing features (some through variations on chunk options), not a generic set of white-space sensitive metadata definitions for a cell based notebook format.
As a user of jupinx, let me give one opinion: Jupinx is defining a book/documentation layout format in the spirit of sphinx, not a notebook format. Turning arbitrary RST into meaningful notebooks wouldn't make sense with that use-case (i.e. it is based on sphinx and is not notebook centric). There are plenty of features for designing an online textbook that don't necessarily make any sense in a notebook format. I also feel that the fact that it uses jupyter in the middle of the current build process is an irrelevant implementation detail. To me, what I like about jupinx is not the RST part, but rather that it is a tool designed for books with multiple outputs (where jupyter is just one of them). If you replaced the jupinx RST markup feature-by-feature with a markdown version, I would be even happier. After looking at bookdown, I think that it provides a close-enough approximation.
I think there is somewhat of a strawman here in thinking about Rmd/bookdown as a format vs. the R/knitr/rstudio toolchain. Certainly no one here actually wants to use R for the build implementation, contribute code to an R project, or deal with coordinating with a commercial company. But if you look at Rmd/Bookdown, there is very little R specific about it. To me, the real question is whether a "as compatible as possible" format with the same syntax and semantics as Rmd/bookdown should be implemented as a language- and editor-independent format (i.e. I mostly use Julia myself). The main advantage of the existing RStudio/bookdown is that it provides a coherent design and a reference parser and set of unit tests for validating a python based parser/backend replacement. The other advantage is that there are complementarities with the RStudio/bookdown crew on contributions to pandoc which I think it uses in the toolchain (as it is for that them). A clone with the same open-source transformations in the toolchain might significantly speed up implementation and testing. Make sure to look at Also, note that there is a https://atom.io/packages/language-weave package which I use as the main GUI for editing. This isn't a complete port of Rmd (since it doesn't include the things necessary for typesetting a book) but it shows the point. In fact, I am pretty sure that if a more complete Rmd port was made, everyone in the julia community would use it, replacing Weave completely, and would contribute to tooling in Atom/vscode editors.
@chrisjsewell 100% on this. A language (and RStudio) agnostic implementation of the bookdown/Rmd format with a nice vscode extension (and Hydrogen/Weave+Juno style inline code-execution for editing) would be my dreamworld. There may be a few features to sneak in that I love in jupinx, but otherwise I think that Rmd/bookdown has figured everything out in a very clean way - and (surprisingly?) kept things independent of R in the specification. But just to make sure we are talking about the same thing. When I talk about a LSP/VS Code extension for RMarkdown, I am thinking about virtually all of the Rmd/bookdown features - not just the subset that map cleanly to a cell-based notebook format. |
Thanks @jlperla for #11, that is very instructive. Especially I liked the part on testing! Possibly in Rmd you could use the Also that made me think that we may want to compile the book in different contexts, and execute only a subset of cells depending on the context. Something that one can do in R Markdown with Now I have another question... we have described what the author writes, but we've not mentioned how the code should be executed. Do you have plans on this? If I understand correctly, you don't want to have to turn the document to a Jupyter notebook, yet maybe we could still use the Jupyter kernels for this? In my experience, two years ago when RStudio had not yet released |
Yeah, I think so. Something along those lines would be perfect. With the quantecon https://julia.quantecon.org/status.html I think we do something very similar with post-processing the output with a The To me, the key benefit of having better testing integration is that it opens up collaboration and CI integration. Otherwise, you have no idea if someone contributing a tweak to the code has made it generate the wrong results.
This I will stay out of. I am only trying to give the perspective of a "executable book" author and relate to my experiences on 2 books and help manage multiple RAs on the production side. @mmcky could provide details on how jupinx executes code, but my (personal) feeling is that it is an implementation detail that could be swapped out in a redesigned system.
I don't, but I also really wouldn't want to use RStudio for this. I was simply reading the Rmd/bookdown specification and relating it to my experiences with Jupinx. But to me, I like the fact that Rmd has been battle tested to produce production quality printed/online/executable books. The importance of that sort of practical experience in building a specification cannot be overstated. The sphinx/jupinx features are similarly battle-tested, but are much weaker on the production quality PDFs than Rmd. For Julia notebooks rather than textbooks, I use and love the |
Following discussion with @choldgraf @jstac and @mmcky, and based on my experience developing IPyPublish, this PR presents an initial design proposal for a JMarkdown (or IMarkdown) design specification.
Aim
Have a pure text representation of Jupyter Notebooks, that can be used to create technical/scientific documents. In particular, for conversion to PDF (via TeX) and HTML.
Key Design Goals
For me the key design goals are that it should:
Implementation Considerations
To support the syntax features, the three main options are: LaTeX, RST and Markdown.
LaTeX supports all of these syntaxes, however, the drawbacks are that (a) it is not intrinsically supported by Jupyter Notebooks, (b) It is not as easy to convert to HTML etc, (c) It's not as user-friendly as Markdown. (a) and (c) also apply to RST, so that leaves us with Markdown.
RMarkdown would then be the next natural consideration, however, this also has some issues:
(`` ```{python a=1, b="c"} ``), isn't 'official' pandoc syntax (so isn't automatically handled by it), and isn't ideal for usability/readability or parsing.
This is where JMarkdown comes in. It is similar to RMarkdown, but:
metadata
code blocks
Also, where possible, the syntax will use the basic Pandoc Markdown elements, and keep boilerplate elements to a minimum.
Proof of Concept Implementation
In this PR I have created a small python package, as a proof-of-concept for:
Conversion from a Notebook
The initial notebook is imd_test.ipynb. With imd_poc.nb2imd.parse_nb2imd, I then parse this notebook to test_nb2imd.md (and the preview format), which demonstrates the basic format of JMarkdown. Note:
metadata
raw-cell
code-cell and the language, obtained from the notebook metadata. Unfortunately, GitHub only syntax highlights the shortcut form, `` ```python ``, which I originally used, but this could be an issue if someone wanted to write python code in a markdown cell, so `` ```{.python .code-cell} `` is more robust.
Span is inserted with the new-cell class: `[]{.new-cell}`. This ensures a parser can identify the start of all new cells (in the current jupytext RMarkdown converter, multiple markdown cells are merged, during the round-trip conversion).
TODO handling cell attachments
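As a rough illustration of why the `[]{.new-cell}` marker helps a round trip, the sketch below splits a run of markdown text back into separate cells at each marker. It is a deliberate simplification: it works on plain strings, whereas the proof of concept parses to the pandoc AST, and a plain split would also (wrongly) trigger on a marker quoted inside a code block. The function name `split_markdown_cells` is hypothetical.

```python
NEW_CELL_MARKER = "[]{.new-cell}"


def split_markdown_cells(text):
    """Split markdown text into cell sources at each new-cell marker.

    Empty fragments (e.g. before a leading marker) are dropped.
    """
    fragments = text.split(NEW_CELL_MARKER)
    return [fragment.strip() for fragment in fragments if fragment.strip()]


text = "[]{.new-cell}\n# First\n\nText.\n\n[]{.new-cell}\nSecond cell."
cells = split_markdown_cells(text)
```

Without the marker, the two cells above would be indistinguishable from a single markdown cell containing a heading and two paragraphs, which is the merging behaviour noted for the round-trip conversion.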
Parsing to Pandoc AST
In imd_poc.imd2pandoc, test_nb2imd.md is parsed directly into pandoc AST (using panflute), then some initial processing is done, to convert the document to a more 'computationally friendly' format (inserting missing metadata, wrapping metadata & cell content in divs, and numbering cells). Converting back to Markdown, we obtain test_imd2pandoc.md.
Code cell execution and output to HTML
In imd_poc.pandoc_exec, I iterate through the pandoc AST, and execute all the code cells, saving the outputs as YAML. Converting back to Markdown we get: test_pandoc_exec.md.
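The execution step can be pictured with the following sketch, which runs a list of code-cell sources in order, in one shared namespace, and records what each cell prints. It is an assumption-laden stand-in: it uses in-process `exec` rather than a Jupyter kernel, it only captures stdout (not rich outputs such as tables or images), and `execute_cells` is a hypothetical name, not a function from the proof-of-concept package.

```python
import contextlib
import io


def execute_cells(sources):
    """Execute code-cell sources in order and return the text each one printed.

    All cells share one namespace, so later cells can use earlier variables.
    """
    namespace = {}
    outputs = []
    for source in sources:
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, namespace)  # illustration only: no sandboxing
        outputs.append(buffer.getvalue())
    return outputs


outputs = execute_cells(["x = 2", "print(x * 21)"])
```

The captured outputs could then be attached to each cell and serialised next to the source, which is the role the YAML output blocks play in the description above.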
In imd_poc.pandoc2html, I build on this to process references and create a basic HTML document: test_pandoc2html.html
Note how we've now utilized the metadata to colour a note block, place a numbered caption under the generated pandas table and added an anchor link to it, and turned the reference into a hyperlink with the same name as the table caption.
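The reference handling described here can be sketched as two passes: number the labelled captions in document order, then replace each reference with a hyperlink whose text is the caption. The reference syntax `[@label]`, the label `tbl:data`, and both function names are assumptions made for this illustration; the proof of concept may use a different syntax and operates on the pandoc AST rather than on strings.

```python
import html
import re


def number_captions(labelled_captions):
    """Map each label to (number, caption), numbering in document order."""
    return {
        label: (number, caption)
        for number, (label, caption) in enumerate(labelled_captions, start=1)
    }


def resolve_references(text, numbered):
    """Replace [@label] references with links to the labelled element."""

    def replace(match):
        label = match.group(1)
        if label not in numbered:
            return match.group(0)  # leave unknown references untouched
        number, caption = numbered[label]
        return '<a href="#{}">Table {}: {}</a>'.format(
            html.escape(label, quote=True), number, html.escape(caption)
        )

    return re.sub(r"\[@([\w:-]+)\]", replace, text)


numbered = number_captions([("tbl:data", "Example data")])
resolved = resolve_references("See [@tbl:data] and [@missing].", numbered)
```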
You'll also note that I haven't used the nbconvert conversion mechanism. My approach has a number of benefits:
Finally, to create HTML, it is better to use Sphinx (as I already do in IPyPublish).
This would either be via the creation of an intermediate RST representation, or you might even be able to convert the pandoc AST directly to the docutils AST (which is what Sphinx uses at a lower level).
I think that's about it for now!