Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the CLI's documentation of the remap subcommand #89

Closed
felixhekhorn opened this issue Nov 17, 2021 · 31 comments
Closed

Improve the CLI's documentation of the remap subcommand #89

felixhekhorn opened this issue Nov 17, 2021 · 31 comments
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@felixhekhorn
Copy link
Contributor

To continue the APPL comparison I will need to renormalize the grids - I saw that the remap command seems to be able to do such a thing, but I don't know how to use it; i.e. what is the "remap string" - can you provide us with an example?

@cschwan
Copy link
Contributor

cschwan commented Nov 17, 2021

The primary purpose of remap is to map a one-dimensional distribution to an n-dimensional one (many Monte Carlo programs only produce one-dimensional distributions, and remap is our way to work around it). I'm not quite sure this is the proper use in this case, but let's try it anyway to see what you need.

You can find an example here:

https://github.com/NNPDF/runcards/blob/a48e5d4c7c6d2dae36f0a0228a11e4739f87f7be/nnpdf31_proc/ATLAS_DY_8TEV_2D_0116_0200/postrun.sh#L3

What it does is to completely ignore whatever bin limits you previously set, and apply new ones, based on a cartesian product. In the case above, '116,150,200,...;0,0.2,0.4,...' produces the two-dimensional bin limits in the following order:

  • from 116 to 150 for the first observable, and from 0.0 to 0.2 for the second observable,
  • 116-150 and 0.2-0.4,
  • ...
  • 150-200 and 0.0-0.2,

The pipe | is used to modify the limits for the second observable for specific values of the first observable. If you count the number of fields divided by the pipe you'll find five, one field for each of the five bins of the first observable. An empty field just means the field previously specified. In the example above that means specifically

  • 300-500 and 0.0-0.4,
  • 300-500 and 0.4-0.8,
  • ...
  • 500-1500 and 0.0-0.4,
  • 500-1500 and 0.4-0.8,
  • ...

By default remap divides the result of each bin by the bin-width (for each observable). If you don't want that, you can disable it for specific observables using --ignore-obs-norm index where index is the index of the observable starting from 0. An example for that you can find here,

https://github.com/NNPDF/runcards/blob/master/nnpdf31_proc/ATLAS_DY_8TEV_3D_046080_0007/postrun.sh

where only the bin width for the first (0) observable is taken into account.

Finally, you can use --norm factor, which applies factor multiplicatively for all bins (for example for the femtobarn/picobarn conversion).

@cschwan cschwan changed the title CLI Remap example Improve documentation of the remap subcommand of the CLI Nov 17, 2021
@cschwan cschwan changed the title Improve documentation of the remap subcommand of the CLI Improve the CLI's documentation of the remap subcommand Nov 17, 2021
@alecandido
Copy link
Member

Maybe @cschwan it might be an idea to extend CLI help with examples, at some point.

It's fine that help is the only docs for a CLI (unless you want to write a man page, clap-rs/clap#552), but I'd say examples saved my day so many times...

You could also consider to add a couple to https://tldr.sh/

@cschwan cschwan added the enhancement New feature or request label Nov 19, 2021
@cschwan cschwan added this to the v0.5.0 milestone Dec 31, 2021
@cschwan
Copy link
Contributor

cschwan commented Dec 31, 2021

Commit 25fc84a adds additional documentation and even a few basic examples. Let me know what you think!

@alecandido
Copy link
Member

Sorry not to have replied before. It looks good, just a small observation:

  • it seems like you amended the commit or something like that, so the hash is actually c3197dd (the one you provided isn't belonging to any branch at the moment)
  • (and of course I don't like that the thing needed to be duplicated, but we've already discussed this)

@cschwan
Copy link
Contributor

cschwan commented Jan 3, 2022

The commit probably doesn't belong to any branch because I've merged it; what is being duplicated, where is the discussion?

@alecandido
Copy link
Member

The duplication is simply in the test: since you edited the message, you had to duplicate the string in the test.
Since this requires manual sync at any update, it is fragile (#95 (comment)).

@cschwan
Copy link
Contributor

cschwan commented Jan 3, 2022

I think the main Issue here has been addressed, therefore I'll close it.

@cschwan cschwan closed this as completed Jan 3, 2022
@felixhekhorn
Copy link
Contributor Author

@cschwan
Copy link
Contributor

cschwan commented Jan 7, 2022

Thanks for reviewing it, @felixhekhorn!

The | operation is exactly the + operation you described 😄! But it's an operation that's needed because there's no combination of the other operations that lets you mimick |. Does that make sense, @felixhekhorn?

@felixhekhorn
Copy link
Contributor Author

fine! the operations is to be read as "OR" ... maybe you could give this specific example?

@cschwan
Copy link
Contributor

cschwan commented Jan 7, 2022

What I understand is that the documentation needs to improved for |, so I'm opening this Issue again.

I wouldn't call the | operation OR because the number of fields is fixed by the previous dimensions. For instance, if you have the following prescription:

x0,x1,x2,...,xn;y0,y1,...,ym;bls1|bls2|...|blsp

there must be p = n * m bin limit specifications (bls) separated by |, as in bls1|bls2|...|blsp. If you have more dimensions before a specification containing | there there must be as many bin limit specifications separated by | as the product of the number of bins for each dimension coming before |.

@cschwan cschwan reopened this Jan 7, 2022
@cschwan
Copy link
Contributor

cschwan commented Jan 7, 2022

You can also think of | as making the bin limits dependent on the previous dimensions. Without it the cartesian product makes sure that all the bin limits are independent of each other. A good example of that is ATLAS_DY_8TEV_3D, which has three dimensions,

  • invariant mass of the lepton pair,
  • cosine of the Catani-Seymour angle
  • and rapidity of the lepton pair,

but the bins of rapidity depend on the invariant mass and the angle (the previous dimensions). The further you go away from the Z peak at roughly 90.0 and the more extreme the angle is, the more rapidity bins are 'cut out' of the cartesian product 'cube'.

@alecandido
Copy link
Member

alecandido commented Jan 7, 2022

I'm still missing something: in the example there is written

'0,1,2;0,1,2|0,2,4' expects a grid with a total of 4 bins and produces the following bin limits: (0-1;0-1), (0-1;1-2), (1-2;0-2), (1-2;2-4)

so I get that the 0,1,2 before | is referred to the 0,1 in the first section (before ;) and 0,2,4 is related to 1,2 in the first section.

Later on, there is the other example on syntactic sugar for repetition:

for six bins '0,1,2;0,1,2|0,1,2|0,2,4' can be more succinctly written as '0,1,2;0,1,2||0,2,4'

but in the case there are 2 bins in the first dimension (before ;), i.e. 0-1 and 1-2, while they're mapped over three different slices in the second dimension.

I'm confused, I'd have expected something like '0,1,2,3;0,1,2|0,1,2|0,2,4'...

@alecandido
Copy link
Member

alecandido commented Jan 7, 2022

About the | operator: maybe you don't care that much, but in Python they assigned a meaning to that operator in the specific context of sets. It means union.

(Very recently they started using that operator even in the context of types, like in Typescript, but the one for actual sets existed at least from 3.0, if not before...)

@cschwan
Copy link
Contributor

cschwan commented Jan 7, 2022

I'm still missing something: in the example there is written

'0,1,2;0,1,2|0,2,4' expects a grid with a total of 4 bins and produces the following bin limits: (0-1;0-1), (0-1;1-2), (1-2;0-2), (1-2;2-4)

so I get that the 0,1,2 before | is referred to the 0,1 in the first section (before ;) and 0,2,4 is related to 1,2 in the first section.

Later on, there is the other example on syntactic sugar for repetition:

for six bins '0,1,2;0,1,2|0,1,2|0,2,4' can be more succinctly written as '0,1,2;0,1,2||0,2,4'

but in the case there are 2 bins in the first dimension (before ;), i.e. 0-1 and 1-2, while they're mapped over three different slices in the second dimension.

I'm confused, I'd have expected something like '0,1,2,3;0,1,2|0,1,2|0,2,4'...

You're right, the example is clearly wrong, good job for spotting the mistake!

@felixhekhorn
Copy link
Contributor Author

About the | operator: maybe you don't care that much, but in Python they assigned a meaning to that operator in the specific context of sets. It means union.

That was actually what I was thinking about: "OR" in a bitwise sense, really meaning UNION (as implemented in C).

You can also think of | as making the bin limits dependent on the previous dimensions.

however, as @cschwan said, this is not really what the operator is doing ... "THEN" would be a better name, if I understood correctly

there must be p = n * m bin limit specifications (bls) separated by |, as in bls1|bls2|...|blsp

@felixhekhorn
Copy link
Contributor Author

there must be p = n * m bin limit specifications (bls) separated by |, as in bls1|bls2|...|blsp

and this should really be mentioned in the docs

@alecandido
Copy link
Member

alecandido commented Jan 7, 2022

I guess there is no default name for that operator. @cschwan actually needs a separator further than ,, ;, and :, but unfortunately all common punctuation characters have some meaning as operators.

Most likely | is good enough: / might also be fine (and it also has a well defined, different meaning), and \ is bad, since it's used for escaping and would give plenty of issues.
All the others ($, %, @, #, !, &, ", ', =, ?, ^) are meaningless, and/or huge, and/or dangerous, and the other common separator - would be even more unexpected in this context.

For me, the three good ones are |, /, and _ (this last sometimes used as digit separator, without extra meaning). And maybe | is actually the best...

EDIT: of course I forgot +, the one you used in your first example, but it really looks weird as a separator, and , should have precedence over +. There is really no clean solution...

@felixhekhorn
Copy link
Contributor Author

The operator | is perfectly fine - I think, this is just a documentation issue; how to explain the operation in a proper way ... (actually + would have been also a good choice, but let's stick to what we have )

@alecandido
Copy link
Member

The operator | is perfectly fine - I think, this is just a documentation issue; how to explain the operation in a proper way ... (actually + would have been also a good choice, but let's stick to what we have )

Yeah, compare to other choices + is not that bad, you're right. But | is good enough, I pointed out the other meaning just to keep it mind while explaining the local one.
And as you said, the union is somehow the refinement of the OR meaning, even though it's a different one. But being "compatible" it has been a good choice, unfortunately here we don't have such an option...

@cschwan
Copy link
Contributor

cschwan commented Jan 7, 2022

I'd say + is confusing, because you'd assume it to be part of a number, for instance as in +1,2;-1,2 (technically not needed, but instinctively I think I would read it like that). It would also be confusing if you have negative bin limits, for instance -1,1+-2,2. With the pipe this is clearly better: -1,1|-2,2. I agree with @felixhekhorn that this really is a documentation problem.

@cschwan
Copy link
Contributor

cschwan commented Jan 7, 2022

Commit f90f3ca improves the description. Please let me know what you think.

@felixhekhorn
Copy link
Contributor Author

Definitely better. Maybe we can improve even more? How much white space or space in general can we use? how about going more into actual Python doc syntax?

- ',': The comma ',' constructs 1-dimensional bin limits (1DBL).
The bin limits are listed in a consecutive order and separated by a single ','.

Examples
--------------
- '0,0.2,0.4,0.6,1' expects the grid to have 4 bins whose bin limits will be (0-0.2), (0.2-0.4),
(0.4-0.6) and (0.6,1)


- ';': The semicolon ';' constructs n-dimensional bin limits (NDBL) from a cartesian product of 1DBL.
The 1DBL are listed by in a consecutive dimensional order separated by a single ';'.
 
Examples
-------------
-  '0,0.5,1;0,1,2' expects the grid to have 4 bins, whose 2DBL will be are: (0-0.5;0-1), (0-0.5;1-2), (0.5-1;0-1) and
(0.5-1;1-2)

- '|': The pipe '|' constructs NDBL which change by a step in the previous dimension.
The previous operators are enough to contruct NDBL with differently-sized bins, but they can
not construct the following bin limits: (0-1;0-1), (0-1;1-2), (1-2;0-2), (1-2;2-4), (1-2;4-6); here
the 1DBL for the second dimension depend on the first dimension and also have a different number of
bins. For the first two bins the 1DBL is '0,1,2', but for the last three bins the 1DBL are
'0,2,4,6'. This can be achieved using the following remapping string: '0,1,2;0,1,2|0,2,4,6'. Note
that there have to be two 1DBL separated by '|', because the first dimension has two bins. If there
are more dimensions and/or bins, the number of 1DBL separated by '|' must match this number
accordingly. [some adjustments needed here, maybe]

Examples
-------------
'0,1,2;-2,0,2;0,1,2|1,2,3|2,3,4|3,4,5|4,5,6|5,6,7'. Here the third dimension has 6 1DBL separated
by '|' because the first dimension has 2 bins and the second dimension has 3 bins, so `6 = 2 * 3`.
If the 1DBL is an empty string, the previous 1DBL is repeated, for
example '0,1,2;0,1,2;0,1,2||0,2,4' is shorthand for '0,1,2;0,1,2;0,1,2|0,1,2|0,2,4'

[etc.]

@alecandido
Copy link
Member

For me, it's perfect: the concept is a bit tough, but there is nothing we can do, except explaining as good as possible, and I'd say it's well explained.

Unfortunately now it's a whole bunch of text, so I'd like to have more formatting options (like markdown) to make it more readable.
clap doesn't offer anything like that, so the only other option might be to move it to an external source (most likely the best would be extremely simple as md, or readable as html), with the option to keep it embedded (e.g. like rust docs in rustup), or move somewhere else (e.g. online and link it).

@cschwan
Copy link
Contributor

cschwan commented Jan 7, 2022

Separating the examples from the description is a very good idea and formatting it better also. As far as I understand the text is simply dumped after the help, but I'm looking into it the clap documentation to see if there's anything better...

@alecandido
Copy link
Member

Separating the examples from the description is a very good idea and formatting it better also. As far as I understand the text is simply dumped after the help, but I'm looking into it the clap documentation to see if there's anything better...

I already had a look some weeks ago, but that's all they offer (and as you said it's simply a print after the generated help, or before with the partner function).
There are libraries that output all termcodes for formatting, like https://github.com/Canop/termimad (in python we use https://github.com/Textualize/rich), but I don't know if it's a good idea to pass a string with termcodes to clap.

@felixhekhorn
Copy link
Contributor Author

felixhekhorn commented Jan 7, 2022

or move somewhere else (e.g. online and link it)

that might actually be the best option ... since online we can have infinite format

PS: and it would solve the duplication issue 🙃

@alecandido
Copy link
Member

alecandido commented Jan 7, 2022

PS: and it would solve the duplication issue upside_down_face

Duplication issue still remains, because it was mainly testing the part generated by clap.
To be honest, I like a lot that with rustup you can bring up docs offline, but maybe the solution would be to simply use docstrings (and document pineappl_cli crate), and then we can both link https://docs.rs, and you can also obtain offline with cargo doc --open.
Still, having it in terminal is much more direct...

@cschwan
Copy link
Contributor

cschwan commented Jan 7, 2022

We probably want this: clap-rs/clap#2389. I can try the verbatim_doc_comment attribute in the meantime. Moving the documentation to somewhere else online might also be a good idea.

@alecandido
Copy link
Member

I believe we already got something pretty good, what do you think is still missing for this? @cschwan

We could do a much better job if clap exposed more tools, but given the current status I believe is fine.

The other option of documenting pineappl_cli crate (and publish documentation) is always available, though I consider that it is not needed specifically for this issue (but we could want it in general).
In case, let's open another issue, but I'm not advocating for it: for me it's fine as it is.

@cschwan
Copy link
Contributor

cschwan commented Jan 31, 2022

@alecandido I agree, let's close this Issue.

@cschwan cschwan closed this as completed Jan 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants