Conversation
@agoodm commented Sep 19, 2016

What is this PR for?

One feature built into Jupyter's ipykernel that is currently missing in Zeppelin is pre/post-execute "callbacks". This PR makes it possible to register code that the interpreter executes before the code entered into the paragraph (pre-execute) and after it (post-execute). In addition to saving users the effort of writing code that they want executed in every paragraph, this paves the way for more advanced features coming soon, such as automatically displaying matplotlib figures without an explicit call to a show() function. See the JIRA issue for more details.

The core implementation is primarily contained in a new class called InterpreterCallbackRegistry, which heavily mimics the Angular display system (e.g., AngularObjectRegistry). While this interface alone should suffice for interpreter maintainers on the Java/JVM side, I thought it might also be useful to expose the system directly to users in their notebooks. Hence, an implementation that works with the spark and pyspark interpreters (via the ZeppelinContext class) is also included here.
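To make the registry idea concrete, here is a minimal sketch of a per-event callback registry keyed by REPL name. This is illustrative only, not the actual Zeppelin API: the class and method names are assumptions, and the real `InterpreterCallbackRegistry` is Java and communicates over thrift.

```python
# Illustrative sketch only (not the actual Zeppelin code): a registry keyed
# by (replName, event) that stores code strings to run around paragraph code.

class CallbackRegistry:
    VALID_EVENTS = ("pre_exec", "post_exec")

    def __init__(self):
        self._registry = {}

    def register(self, repl_name, event, cmd):
        """Store a code string to be executed at the given event for a REPL."""
        if event not in self.VALID_EVENTS:
            raise ValueError("unknown event: %s" % event)
        self._registry[(repl_name, event)] = cmd

    def get(self, repl_name, event):
        """Return the registered code string, or None if nothing is set."""
        return self._registry.get((repl_name, event))

    def unregister(self, repl_name, event):
        """Remove a callback; a no-op if none was registered."""
        self._registry.pop((repl_name, event), None)
```

The `register`/`get`/`unregister` trio mirrors the three user-facing methods described later in this PR (`registerCallback()`, `getCallback()`, `unregisterCallback()`).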

What type of PR is it?

New Feature

Todos

- [ ] Add unit tests
- [ ] Add new documentation
- [ ] (Maybe) Find a better way to check for errors in pre-execute, as this can brick the REPL (see more info below)

What is the Jira issue?

ZEPPELIN-1423

How should this be tested?

In a new note, add the following lines of code to a paragraph:

```python
%pyspark
z.registerCallback("pre_exec", "print 'This code should be executed before the paragraph code!'")
z.registerCallback("post_exec", "print 'This code should be executed after the paragraph code!'")
```

Then run any other paragraph in the note containing some other code, eg

```python
%pyspark
print "This code should be entered into the paragraph by the user!"
```

The output should be:

```
This code should be executed before the paragraph code!
This code should be entered into the paragraph by the user!
This code should be executed after the paragraph code!
```

You should also test out the other two methods (getCallback() and unregisterCallback()) specified in ZeppelinContext.java.

One final caveat should be mentioned: if there are errors in the code you specify for a pre-execute event, the interpreter becomes useless, since the current implementation prepends the code specified in `pre_exec` directly to the paragraph code before calling `interpret()`. The current workaround is to either restart the interpreter group or call `unregisterCallback()` via a different REPL within the interpreter group (e.g., `z.unregisterCallback("pre_exec", "pyspark")` from the spark interpreter). I would appreciate it if anyone is willing to share a better approach here.
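The failure mode above follows directly from how the callbacks are applied: the registered strings are concatenated around the paragraph source before a single `interpret()` call, so a broken `pre_exec` snippet is re-run on every subsequent paragraph. A rough sketch, with hypothetical names (the real code path is inside the Java interpreter machinery):

```python
# Rough sketch (hypothetical function name) of why a broken pre_exec snippet
# bricks the REPL: the registered strings are simply concatenated around the
# paragraph source, so bad pre_exec code is re-prepended on every run.

def apply_callbacks(paragraph_code, registry, repl_name):
    pre = registry.get((repl_name, "pre_exec"))
    post = registry.get((repl_name, "post_exec"))
    parts = [p for p in (pre, paragraph_code, post) if p]
    # this combined string is what ultimately gets passed to interpret()
    return "\n".join(parts)
```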

Questions:

  • Do the license files need an update? No
  • Are there breaking changes for older versions? No
  • Does this need documentation? Yes

```
* Autogenerated by Thrift Compiler (0.9.2)
* Autogenerated by Thrift Compiler (0.9.3)
```
@AhyoungRyu (Contributor)
Hi @agoodm :)
Could you regenerate these thrift-generated files with 0.9.2, as the other thrift files did?

@agoodm (Member Author)
@AhyoungRyu Done.

@agoodm commented Sep 20, 2016

Ok, TODOs are now done. @AhyoungRyu @Leemoonsoo @bzz please review. I am unsure whether some of the errors I am seeing in the Travis CI builds are related to these changes.

```python
checkedIterables = self.z.checkbox(name, defaultCheckedIterables, optionIterables)
return gateway.jvm.scala.collection.JavaConversions.asJavaCollection(checkedIterables)

def registerCallback(self, event, cmd, replName="pyspark"):
```
@felixcheung (Member)
Style? Space in `replName = "pyspark"`?
Also, I'm not sure we should hardcode 'pyspark' here; this is configurable.

@agoodm (Member Author)
@felixcheung The space is omitted in the optional argument in order to comply with PEP 8, see here.

The main purpose of defaulting this argument to "pyspark" is solely convenience, i.e., so you can call registerCallback() without specifying the REPL name manually every time.

@felixcheung

That's great! Would you like to update the rest of the file to make the style consistent?

I understand the default value; I'm only pointing out that there are many users and use cases out there that do not use "pyspark" as the name.

@agoodm commented Sep 23, 2016

@felixcheung Sure thing, I can do that in a bit.

As for the issue of the default value for replName, are you referring to the use case of different interpreters, or to variants of pyspark itself (e.g., %<something>.pyspark)? If it's the latter, then I agree there should be a more elegant way to pick the default value in case the spark interpreter group is not the default, so any good ideas on how to go about this are appreciated. If you are just talking about how this is implemented on the scala (%spark) side, I implemented a similar convenience method in the original ZeppelinContext class, which instead assumes "spark" as the default replName.

@felixcheung

Right, I was referring to when the interpreter group is not the default, or when there are multiple, etc.

@agoodm commented Sep 23, 2016

@felixcheung Ok, I have changed the default behavior to now automatically detect the REPL name from the paragraph input text. The Travis CI build isn't running after I rebased my commits to resolve a merge conflict.
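A minimal sketch of the automatic detection described here might read the `%magic` on the first line of the paragraph and fall back to a default. The function name, signature, and fallback value are assumptions for illustration, not the actual implementation:

```python
import re

# Hedged sketch: infer the target REPL from the %magic on the first line of
# the paragraph text, falling back to a default when no magic is present.
# Names and the fallback value are illustrative only.

def detect_repl_name(paragraph_text, default="spark"):
    stripped = paragraph_text.lstrip()
    first_line = stripped.splitlines()[0] if stripped else ""
    m = re.match(r"%([\w.]+)", first_line)
    return m.group(1) if m else default
```

This also naturally handles dotted variants such as `%spark.sql`, which relates to the multiple-interpreter use case @felixcheung raised.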

@agoodm agoodm closed this Sep 24, 2016
@agoodm agoodm reopened this Sep 24, 2016
@doanduyhai (Contributor)

Question: how can interpreters other than Spark leverage this new feature? I mean, you added a new registerCallback() function to the ZeppelinContext object, but this context is not exposed for the shell or angular interpreter, for example. How can people use the feature in this case?

@agoodm commented Sep 24, 2016

@doanduyhai To be precise, I designed the callback registry system with two different types of use cases in mind. The first is for interpreter maintainers to implement in their individual subclasses of Interpreter, for use cases specific to better integrating certain features/libraries of that interpreter with Zeppelin. Right now a prime example of this is matplotlib integration with python/pyspark, as I have discussed in more detail in the JIRA issue.

The second set of use cases are those in which the API is directly exposed to notebook users. Right now this is indeed only fully available in spark / pyspark via the ZeppelinContext class. The reason for this is that getting this sort of interface to work with every interpreter would require some sort of language specific API that can be used to expose the JVM (as py4j does with python), hence one reason I labeled it as experimental in the docs. It's a similar limitation to the angularBind() and angularWatch() API. Keep in mind though that the registry can still be applied for other interpreters if the replName is manually specified, eg:

```scala
%spark
z.registerCallback("post_exec", "ls -l", "sh")
```

Does this help clarify things?

@doanduyhai

Yes, it's very clear. Indeed, a bunch of internal features (Angular registry, callback registry, ...) are exposed to end users only via the Spark interpreter. I guess each interpreter implementor has to find a way to expose these features using their own DSL.

@Leemoonsoo commented Sep 25, 2016

@agoodm Thanks for the contribution.

While the documentation says "Interpreter Execution Callbacks" is experimental (which I agree with), I think the @Experimental annotation should be used for the API instead of @ZeppelinApi.

The implementation keeps the callback registry on the ZeppelinServer side. But since it is volatile and per-interpreter, I wonder if the callback registry could instead be placed inside the interpreter process. That might reduce complexity (no thrift communication) and increase flexibility (able to register an actual function, not only strings). What do you think? Could you elaborate on these two choices?

@agoodm commented Sep 25, 2016

@Leemoonsoo Thanks for the feedback.

Indeed I think your first point on the Experimental annotation made sense, so I have changed it.

As for your second point, I am not really sure how best to go about it myself. When I first created the JIRA issue, I came up with a quick implementation that kept separate registries for each Interpreter instance. It was certainly simpler, but the main problem I ran into was that I couldn't figure out how to make sure the registry changes entered by the user were broadcast to the ZeppelinServer side. I studied the implementation of AngularObjectRegistry to understand how this could be done in Zeppelin, and that is ultimately what led me to the current implementation, which uses thrift. I am still in the process of learning the core Zeppelin codebase, so I would appreciate any specific ideas.

Also, can you be more specific about how this change would make it possible to pass actual references to functions instead of code strings (hence why I put "callbacks" in quotes in the original PR)? I don't consider this too important, but am just curious.

Finally, can you provide some insight for some of the odd CI build failures I have been experiencing with more recent commits?

Thanks!

@Leemoonsoo

When the callback registry lives in the ZeppelinServer process and we want to let users register callback routines from the interpreter by exposing a user callback registry API, the callback routine needs to be serialized to be sent from the Interpreter to the ZeppelinServer. That's why, I think, you send a string representation of the code (the callback routine) through a thrift message.

If the callback registry lives inside the Interpreter instance, the Interpreter implementation can access the registry directly, so the callback routine does not need to be serialized. That means not only a string representation of code but also a Java object (e.g., a Runnable) can be registered. RemoteInterpreterServer can then run the callback routines before and after it invokes the Interpreter.interpret() method.

So ZeppelinServer doesn't really need to know about the callback registry, I think. Or are there reasons and use cases where changes to the callback registry need to be broadcast to ZeppelinServer?
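The in-process design @Leemoonsoo describes can be sketched as follows: because the registry lives inside the interpreter process, entries need no serialization, so they can be real function objects as well as code strings. All names here are illustrative, not Zeppelin's actual classes:

```python
# Sketch of an in-process registry: entries may be callables (invoked
# directly, no serialization or thrift round-trip) or code strings
# (handed to an interpret function). Names are illustrative only.

class InProcessHookRegistry:
    def __init__(self):
        self._hooks = {}

    def register(self, event, hook):
        self._hooks[event] = hook  # a callable or a code string

    def run(self, event, interpret):
        hook = self._hooks.get(event)
        if hook is None:
            return  # nothing registered for this event
        if callable(hook):
            hook()            # direct invocation of an actual function object
        else:
            interpret(hook)   # fall back to interpreting a code string
```

In Zeppelin the `interpret` argument would correspond to RemoteInterpreterServer calling `Interpreter.interpret()` around the paragraph execution.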

@agoodm commented Sep 26, 2016

Thank you very much for the detailed explanation. To be precise, I made the registered cmd argument a String solely because Interpreter.interpret() accepts a String representation of the code rather than actual callable objects. I think actually being able to do that (e.g., use an API like z.registerCallback("post_exec", callback_func) rather than z.registerCallback("post_exec", "callback_func()")) would depend on the language being interpreted and what libraries are available to connect it to the JVM (e.g., py4j for Python).

Nevertheless, I think you have made a very compelling point. The problem with my first implementation of the registry system using the Interpreter class was that I only considered implementing pre/post execute from a Paragraph execution perspective. That is why I thought I needed to broadcast changes in the registry to the ZeppelinServer: I processed the registry from within the Paragraph class rather than from the RemoteInterpreterServer class. Thanks to your explanation I understand everything much better now, so I'll go ahead and work on an implementation tomorrow that works exclusively through the interpreter process.

@agoodm commented Sep 29, 2016

Going to close this for now since I have made another implementation on a separate branch.

@agoodm agoodm closed this Sep 29, 2016
asfgit pushed a commit that referenced this pull request Oct 17, 2016
…erpreters

### What is this PR for?
See #1439. This is a second attempt based on prior feedback, particularly from Leemoonsoo who advised that it would be better to contain all of the code inside of the Interpreter Process. Aside from this, the major changes since the previous PR are:

- The terminology "hooks" is used instead of "callbacks". Since the information being passed to the interpreter isn't actually a "callback", strictly speaking, the more general term "hook" is probably a better choice. All variable / class names were updated correspondingly.
- The registry is now keyed by class name rather than the `replName` specified by the magic on top of the paragraph. The interface provided by `ZeppelinContext` allows for a `replName` to be converted to `className` through an internal mapping.
- Two new event codes, `PRE_EXEC_DEV` and `POST_EXEC_DEV` have been added. This is primarily intended to separate use cases for the interpreter maintainers (via subclasses of `Interpreter`) and the notebook users (via the `ZeppelinContext` class), as otherwise user attempts at registering a hook would overwrite that set by the interpreter maintainer.
- Global scope for hook registration is supported for the developer use cases.
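The v2 scheme in the bullets above can be sketched as a registry keyed by interpreter class name, with separate `*_dev` event codes so user registrations cannot overwrite hooks set by interpreter maintainers. The mapping and names below are assumptions for illustration, not the actual Zeppelin code:

```python
# Hedged sketch of the v2 design: hooks keyed by interpreter class name,
# with distinct developer event codes (*_dev) kept alongside user events.
# REPL_TO_CLASS is an illustrative stand-in for ZeppelinContext's mapping.

class HookRegistryV2:
    EVENTS = ("pre_exec", "post_exec", "pre_exec_dev", "post_exec_dev")
    REPL_TO_CLASS = {"spark": "SparkInterpreter", "pyspark": "PySparkInterpreter"}

    def __init__(self):
        self._hooks = {}

    def register(self, class_name, event, cmd):
        if event not in self.EVENTS:
            raise ValueError("unknown event: %s" % event)
        self._hooks[(class_name, event)] = cmd

    def register_by_repl(self, repl_name, event, cmd):
        # the ZeppelinContext-style convenience path: replName -> className
        self.register(self.REPL_TO_CLASS[repl_name], event, cmd)

    def hooks_for(self, class_name, phase):
        # developer hooks run alongside user hooks instead of being clobbered
        keys = ((class_name, phase + "_dev"), (class_name, phase))
        return [self._hooks[k] for k in keys if k in self._hooks]
```

Keying by class name rather than paragraph magic means `%pyspark` and any aliased REPL name configured for the same interpreter class resolve to the same hooks.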

### What type of PR is it?
New Feature

### What is the Jira issue?
[ZEPPELIN-1423](https://issues.apache.org/jira/browse/ZEPPELIN-1423)

### How should this be tested?
In a new note, add the following lines of code to a paragraph:
```python
%pyspark
z.registerHook("pre_exec", "print 'This code should be executed before the paragraph code!'")
z.registerHook("post_exec", "print 'This code should be executed after the paragraph code!'")
```

Then run any other paragraph in the note containing some other code, eg
```python
%pyspark
print "This code should be entered into the paragraph by the user!"
```

The output should be:
```
This code should be executed before the paragraph code!
This code should be entered into the paragraph by the user!
This code should be executed after the paragraph code!
```

You should also test out the other two methods (`getHook()` and `unregisterHook()`) specified in `ZeppelinContext.java`.

One final caveat should be mentioned: if there are errors in the code you specify for a pre-execute event, the interpreter becomes useless, since the current implementation prepends the code specified in `pre_exec` directly to the paragraph code before calling `interpret()`. The current workaround is to either restart the interpreter group or call `unregisterHook()` via a different REPL within the interpreter group (e.g., `z.unregisterHook("pre_exec", "pyspark")` from the spark interpreter). I would appreciate it if anyone is willing to share a better approach here.

### Questions:
* Do the license files need an update? No
* Are there breaking changes for older versions? No
* Does this need documentation? Yes

Author: Alex Goodman <[email protected]>

Closes #1470 from agoodm/ZEPPELIN-1423v2 and squashes the following commits:

56ede60 [Alex Goodman] Automatically detect default interpreter for registerHook()
044a99d [Alex Goodman] Ensure that registered hooks are applied after call to open()
1331fe1 [Alex Goodman] Update interpreters.md
07cac65 [Alex Goodman] Implemented user-defined hook registry system for spark/pyspark interpreters
8fad936 [Alex Goodman] Added Interpreter Hooks to Interpreter Process
darionyaphet pushed a commit to darionyaphet/zeppelin that referenced this pull request Oct 27, 2016
pedrozatta pushed a commit to pedrozatta/zeppelin that referenced this pull request Oct 27, 2016