Skip to content

Conversation

@agoodm
Copy link
Member

@agoodm agoodm commented Oct 18, 2016

What is this PR for?

This PR is the first of two major steps needed to improve matplotlib integration in Zeppelin (ZEPPELIN-1344). The latter, which is a plotting backend with fully interactive tools enabled, will be done afterwards in a separate PR. This PR specifically for automatically displaying output from calls to matplotlib plotting functions inline with each paragraph. Thanks to the addition of post-execute hooks (ZEPPELIN-1423), there is no need to call any show() function to display an inline plot, just like in Jupyter.

What type of PR is it?

Improvement

Todos

The main code has been written and anyone who reads this is encouraged to test it, but there are a few minor todos:

  • - Add unit tests
  • - Add documentation
  • - Add screenshot showing iterative plotting with angular mode

What is the Jira issue?

ZEPPELIN-1345

How should this be tested?

In a pyspark or python paragraph, enter and run

import matplotlib.pyplot as plt
plt.plot([1, 2, 3])

The plot should be displayed automatically without calling any show() function whatsoever. A special method called configure_mpl() can also be used to modify the inline plotting behavior. For example,

z.configure_mpl(close=False, angular=True)
plt.plot([1, 2, 3])

allows for iterative updates to the plot provided you have PY4J installed for your python installation (which of course is always the case if you use pypsark). To clarify, this feature only currently works with pyspark (not python as there are no angularBind() and angularUnbind() methods yet). Doing something like:

plt.plot([3, 2, 1])

will update the plot that was generated by the previous paragraph by leveraging Zeppelin's Angular Display System. However, by setting close=False, matplotlib will no longer automatically close figures so it is now up to the user to explicitly close each figure instance they create. There's quite a bit more options for z.configure_mpl(), but I will save that discussion for the documentation.

Screenshots (if appropriate)

img

Questions:

  • Does the licenses files need update? No
  • Is there breaking changes for older versions? No
  • Does this needs documentation? Yes

@agoodm agoodm changed the title [WIP]ZEPPELIN-1345 - Create a custom matplotlib backend that natively supports inline plotting in a python interpreter cell ZEPPELIN-1345 - Create a custom matplotlib backend that natively supports inline plotting in a python interpreter cell Nov 2, 2016
@agoodm
Copy link
Member Author

agoodm commented Nov 2, 2016

Ok, that should do it for now. @Leemoonsoo @bzz @felixcheung please feel free to review.

@Leemoonsoo
Copy link
Member

Thanks @agoodm for great contribution.
I have tested this and it works really well on both %pyspark and %python.
Also verified /lib/python dir is packaged with -Pbuild-distr flag.

One thing is, with the tutorial note /notebook/2BQA35CJZ, Zeppelin fails to start with error

Caused by: java.lang.NullPointerException
    at org.apache.zeppelin.notebook.Notebook.loadNoteFromRepo(Notebook.java:447)
    at org.apache.zeppelin.notebook.Notebook.loadAllNotes(Notebook.java:469)
    at org.apache.zeppelin.notebook.Notebook.<init>(Notebook.java:123)
    at org.apache.zeppelin.server.ZeppelinServer.<init>(ZeppelinServer.java:107)
    ... 29 more

Without tutorial note notebook/2BQA35CJZ, Zeppelin starts without problem.
Do you experience the same problem with tutorial note?

@agoodm
Copy link
Member Author

agoodm commented Nov 2, 2016

@Leemoonsoo
Yep, I also had trouble starting Zeppelin. I can run it on my main development directory, but couldn't when I cloned the current master branch to an empty directory and patched my commits manually. I confirmed this by deleting the note directory and then restarting Zeppelin, after which there were no problems.

@agoodm
Copy link
Member Author

agoodm commented Nov 2, 2016

@Leemoonsoo Try it now, I think it should work. The main thing I changed was printing the last image as HTML instead of ANGULAR (for the purpose of saving the image in the notebook). I also imported the old note into a new one (hence the directory name changing) from the UI before making these changes, then deleted the old note in place of this new one.

@Leemoonsoo
Copy link
Member

@agoodm Thanks for the quick update. It works well! Let's see if CI goes green.

@felixcheung
Copy link
Member

very cool! could you add some documentation on how html or angular output can plug in to this?
would it be possible to add the same tests to Spark pyspark interpreter too?

@agoodm
Copy link
Member Author

agoodm commented Nov 2, 2016

@felixcheung The main idea is that the output is set to HTML if angular=False (default). If angular=True, the figure data gets bound to the angular display system by figure id and the output is set to angular. This is essentially explained already in the tutorial notebook. Would you like to see some separate documentation somewhere else? If so, where do you think I should put it?

As for unit tests for pyspark, I believe this would be easy enough to do but in that case I would need to exclude them from the main test suite as we do with python because matplotlib is required. We should probably have python, matplotlib, pandas, and pandasql automatically downloaded for our CI tests, most likely using conda since those packages are in binary form.

@felixcheung
Copy link
Member

@agoodm how about a bit on the output in docs/interpreter/python.md? I do see your reference to the notebook but it is a bit harder for general user to access that way (and it is not readily readable off github either)

great point on more packages for python and more tests 😄 perhaps you could help later?

@felixcheung
Copy link
Member

about the /lib directory, would it make sense if it's under the interpreter directory (like interpreter/python/lib)? seems like that would be consistent with everything interpreter is isolated

@agoodm
Copy link
Member Author

agoodm commented Nov 3, 2016

@felixcheung
I updated the documentation and also added the example gif shown in this PR. In regards to getting python dependencies installed for our CI tests, I would be glad to help with that. With that in mind though I think I'll hold off on committing a unit test for pyspark until then since I'll probably get around to doing this anyway right after this PR gets merged, so not much point in going through the trouble of excluding the test in the pom.xml.

The reason I put the backend files in /lib and not in /interpreter/python/lib is because I didn't want to worry about editing multiple versions of the same file if these were to be used by another interpreter (eg, I would also want to have them in /interpreter/spark/lib). I suppose it may be possible to just have a top-level lib/python get copied directly to both directories through maven, but I didn't see the added complexity being worth it.

Also on a final note @Leemoonsoo , the previous CI failures have alerted me to another issue: With %pyspark, the output from the last statement in a paragraph no longer gets automatically printed, eg

%pyspark
x = 1
x

This is because the do-nothing displayhook() call is now always the last statement thanks to the hook registry. This is only a problem with %pyspark, but not in python. I don't know enough about the internals of the former that would explain the difference, but it might be something we should look into later.

@felixcheung
Copy link
Member

Great.

I think it will be better to have interpreter/lib - it might be easier to set ACL on a common interpreter root in the event there are user downloaded interpreters for examples.

@agoodm
Copy link
Member Author

agoodm commented Nov 4, 2016

@felixcheung
In that case, I think that should be fine. Will do shortly.

@Leemoonsoo
The latest commit should fix the issue I just mentioned. Two other important things to mention:

  1. Any print statements made prior to printing the %html or %angular won't show up in the results. This means that when close=False and when any plots are open, print statements won't display since the output type magic will always come after user entered print-statements. This can be fixed for the matplotlib backend specifically by adding a pre-execute hook that prints the magic of the currently open plot, but it's not really ideal.

  2. Earlier we both had issues loading the tutorial notebook. As it turns out, the reason appears to be due the angular display system. This happens whenever I stop Zeppelin whenever variables are still bound to the angular display system. If I make sure to unbind everything first, then I can safely start Zeppelin again. I am pretty sure this issue was introduced in more recent commits since I didn't have this problem when I first made the tutorial notebook. It was not until after I rebased my branch to reproduce the problem when you initially discovered it. This is actually a serious issue, and in this case it would be bad to leave it unaddressed with this PR since that would mean anyone that runs the last example in the tutorial notebook without calling plt.close() at the end will brick their zeppelin installation until removing the notebook. Seems like this was fixed as of [ZEPPELIN-1612] Fix NPE when initializing Notebook #1590.

How do you think we should proceed?

@Leemoonsoo
Copy link
Member

@agoodm
I think placement of common python lib and problem printing output when close=False can be handled in the separate issues. Do you mind create issues?

I have tested last commit and found that if i run %pyspark paragraph when spark interpreter is not initialized (i.e. right after Zeppelin start or spark interpreter restart), i'm seeing following error.

ERROR [2016-11-04 11:32:09,138] ({pool-1-thread-2} PySparkInterpreter.java[open]:168) - Error
java.lang.NullPointerException
    at org.apache.zeppelin.spark.PySparkInterpreter.createGatewayServerAndStartScript(PySparkInterpreter.java:197)
    at org.apache.zeppelin.spark.PySparkInterpreter.open(PySparkInterpreter.java:166)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.open(RemoteInterpreterServer.java:250)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Processor$open.getResult(RemoteInterpreterService.java:1621)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Processor$open.getResult(RemoteInterpreterService.java:1606)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
ERROR [2016-11-04 11:32:09,140] ({pool-1-thread-2} TThreadPoolServer.java[run]:296) - Error occurred during processing of message.
org.apache.zeppelin.interpreter.InterpreterException: java.lang.NullPointerException
    at org.apache.zeppelin.spark.PySparkInterpreter.open(PySparkInterpreter.java:169)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.open(RemoteInterpreterServer.java:250)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Processor$open.getResult(RemoteInterpreterService.java:1621)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Processor$open.getResult(RemoteInterpreterService.java:1606)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at org.apache.zeppelin.spark.PySparkInterpreter.createGatewayServerAndStartScript(PySparkInterpreter.java:197)
    at org.apache.zeppelin.spark.PySparkInterpreter.open(PySparkInterpreter.java:166)
    ... 10 more

Do you see the same error?

@agoodm
Copy link
Member Author

agoodm commented Nov 4, 2016

@Leemoonsoo Sure, I will do that after this is merged.

As for your error, I cannot reproduce it. I tested both the latest code from master as well as my fork after rebasing. How are you performing the maven build? I use:

mvn clean package -Pbuild-distr -Ppyspark -DskipTests

without modifying any of the configuration files.

@Leemoonsoo
Copy link
Member

@agoodm It's my bad, i was testing it with #1596 merged.
Tested just last commit in the branch again, and the error has gone.

@agoodm
Copy link
Member Author

agoodm commented Nov 4, 2016

@Leemoonsoo Yep, I can confirm... right before you posted that comment, I manually patched in the changes from that PR to my branch and got the exact same stacktrace as you in my interpreter log file after running a %pyspark paragraph.

@Leemoonsoo
Copy link
Member

@agoodm now CI has fixed on master. Could you rebase and see if CI test goes green?

@agoodm
Copy link
Member Author

agoodm commented Nov 5, 2016

@Leemoonsoo Tests are done, the only failure seems to be from a known flaky test (ZEPPELIN-1623). Are we good to merge?

@Leemoonsoo
Copy link
Member

LGTM and merge to master if there's no further discussion.
Thanks @agoodm for the great contribution.

@felixcheung
Copy link
Member

How do we think about this #1534 (comment)

@agoodm
Copy link
Member Author

agoodm commented Nov 6, 2016

@felixcheung done, I have tested with the tutorial notebook and it seems to work as well as before. Good to merge now?

@felixcheung
Copy link
Member

LGTM.
Is there any known issue with SELENIUM tests? from a quick glance it seems to have failed a few times here.

@agoodm
Copy link
Member Author

agoodm commented Nov 6, 2016

There is a JIRA issue: (ZEPPELIN-1623)

@felixcheung
Copy link
Member

Great, thank you for digging it through.

Any more comment?

@Leemoonsoo
Copy link
Member

Looks good to me!

@Leemoonsoo
Copy link
Member

Merge to master if there's no further discussion.

@bzz
Copy link
Member

bzz commented Nov 7, 2016

Looks great to me, 👍 for extra tests. Let's merge

@asfgit asfgit closed this in 438dbca Nov 8, 2016
@agoodm agoodm deleted the ZEPPELIN-1345 branch November 8, 2016 20:59
@bzz
Copy link
Member

bzz commented Nov 13, 2016

@agoodm sorry for digging this out, but I have just realized that this PR changes only Python PyZeppelinContext and Pyspark PyZeppelinContext but does not touch Python Py4jZeppelinContext - wich is used for dynamic forms, in case py4j is installed on the system for things like

z.select("Choose a letter", 
    ([a,"a"], [b,"b"], [c,"c"]

In latter case Python interpreter becomes not useable - any line results in

x = 1

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Py4jZeppelinContext' object has no attribute '_displayhook'

Update: logged it under https://issues.apache.org/jira/browse/ZEPPELIN-1655

asfgit pushed a commit that referenced this pull request Nov 13, 2016
### What is this PR for?
After #1534 , Dynamic Forms were no longer working in the python interpreter. This is because the `Py4jZeppelinContext` constructor did not initialize the `_displayhook` which is always called on post-execute.

### What type of PR is it?
Bug Fix

### What is the Jira issue?
[ZEPPELIN-1655](https://issues.apache.org/jira/browse/ZEPPELIN-1655)

### How should this be tested?
Run the following `%python` paragraph, being sure that Py4j is installed:
```python
%python
a, b, c = (1, 2, 3)
z.select("Choose a letter", ([a,"a"], [b,"b"], [c,"c"] ))
```

### Questions:
* Does the licenses files need update? No
* Is there breaking changes for older versions? No
* Does this needs documentation? No

Author: Alex Goodman <[email protected]>

Closes #1626 from agoodm/ZEPPELIN-1655 and squashes the following commits:

2e4ee2d [Alex Goodman] Make sure _displayhook is initialized in Py4jZeppelinContext
asfgit pushed a commit that referenced this pull request Nov 14, 2016
### What is this PR for?
There have been reports of #1534 causing the python interpreter to always show an error because `z` is not being set. As it turns out this is a result of improperly handling the case when matplotlib isn't found when initializing the interpreter.

### What type of PR is it?
Bug Fix

### What is the Jira issue?
[ZEPPELIN-1656](https://issues.apache.org/jira/browse/ZEPPELIN-1656)

### How should this be tested?
Run any simple python paragraph.

### Questions:
* Does the licenses files need update? No
* Is there breaking changes for older versions? No
* Does this needs documentation? No

Author: Alex Goodman <[email protected]>

Closes #1628 from agoodm/patch-1 and squashes the following commits:

67f2ad5 [Alex Goodman] python interpeter should work when matplotlib is not installed
0a7a9d7 [Alex Goodman] Fix indent in bootstrap.py
@ghost
Copy link

ghost commented May 7, 2017

I can't make work the plot update from a different cell in a %python interpreter. I've described the problem in details in ZEPPELIN-2511.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants