NPE LivySparkSQLInterpreter thrown with %livy.sql interpreter function #1503

gss2002 · 2016-10-11T04:21:25Z

What is this PR for?

The LivySparkSQLInterpreter class does not correctly process the userSessionMap throwing back an NPE when using %livy.sql or %sql when livy is default binding. This prevents correct sharing between Spark Sessions and SparkSQL Sessions. I have attached a fix that implements seperate static single instance classes that manage userSessionMaps.

What type of PR is it?

Bug Fix/Improvement

Todos

Documentation/Unit Tests

What is the Jira issue?

https://issues.apache.org/jira/browse/ZEPPELIN-1516

How should this be tested?

Run %sql or %livy.sql against a sql enabled dataset such as hive metastore.

Questions:

Does the licenses files need update? n
Is there breaking changes for older versions? n
Does this needs documentation? m

gss2002 · 2016-10-11T04:24:05Z

Attempted to use the work around (protected static Map<String, Integer> userSessionMap = new HashMap<>(); ) provided by @zjffdu it did not work. So I went with the more involved re-write and it has solved our problems in our large scale environment and provides the correct sharing between session contexts.

zjffdu · 2016-10-11T04:52:51Z

@gss2002 I have to admit there's lots of issues in livy interpreter. Your changes seems a little big for this fix. Would use ConcurrentHashMap directly work for you ? rather than declare LivyPySparkSessionMap and LivySparkSessionMap

gss2002 · 2016-10-11T13:40:06Z

@zjffdu I tried to do that initially problem is instantiation of the shared object. I can probably consolidate down to one SessionMap but I think if there is sharing needed between sparkSQL and Spark it's best to do it this way minus other solutions. I'll ping @vinayshukla I was discussing this issue with him at Hadoop World a few weeks ago. Maybe we can setup a call to discuss?

gss2002 · 2016-10-14T00:12:43Z

Also Static HashMaps are not threadsafe with multiple puts in the case of multiple users. Hence why I built out Singletons to keep track of these objects to guarantee only one instance gets created. this also allowed for the corrupt/lost session fix to be built out. Basically I propose to merge ZEPPELIN-1516 and ZEPPELIN-1546 and to solve the last remaining issues.

zjffdu · 2016-10-14T01:45:26Z

@gss2002 Do you think this change work for you ? Even this is a temp solution IMO. As I mentioned before, there's several issues in livy interpreter. What I plan is to do lots of improvements and refactoring, but before that, I'd like to merge the livy integration test #1462. Otherwise with more changes on the code, livy interpreter may become more fragile. What do you think ?

gss2002 · 2016-10-14T01:57:44Z

@zjffdu I tried that same code multiple times and it did not work to be completely honest.. I have an email out to a JDK Developer/Debugger for one of the large JDK vendors to find out why this isn't specifically working. The reason I moved to the Singleton model for those classes is to prevent some bad stuff from happening as there is no guarantee when you have multiple requests coming in at the same time that you won't end up with multiple maps. Also I opened ZEPPELIN-1546 to try to solve the 404 problem which I was able to solve. If you want go ahead and merge. I will refactor the code after the fact but honestly I would try to move to the singleton model if multiple request and users will be hitting the Interpreter. I do understand trying to keep this code simple but unfortunately this is not a simple function. if you would like to discuss this further offline shoot me an email.

I based my implementation off of this and some issues I hit supporting some large scale apps over the years.
http://stackoverflow.com/questions/11165852/java-singleton-and-synchronization

zjffdu · 2016-10-14T02:06:17Z

@gss2002 Isn't above change use Initialization on Demand mentioned in the stackoverflow post ? And could you paste the error you see of this approach. Regarding ZEPPELIN-1546, I think it make sense. \cc @prabhjyotsingh (original author of livy interpreter) for more comments.

gss2002 · 2016-10-14T02:08:51Z

@zjffdu testing the changes again with another build. I see a slight difference between your's and mine. Testing again ill advise shortly

gss2002 · 2016-10-14T02:26:05Z

@zjffdu The change you made solved the issue. I see difference now. Your quick workaround is good to go. I am going to fix the LivyHelper to handle the 404 with NestedException if that works under ZEPPELIN-1546. cc @prabhjyotsingh

gss2002 · 2016-10-14T02:44:49Z

@zjffdu Also those new maps added with ZEPPELIN-1430 those should be changed to ConcurrentHashMap. Basically any map that is static with gets/puts/removes etc should be ConcurrentHashMap if not trouble will ensue - https://dzone.com/articles/java-7-hashmap-vs

https://issues.apache.org/jira/browse/ZEPPELIN-1293 - The code in this needs to be adjust to check for null also. If json is null create new session as other errors could occur and cause issues as I saw during debug the last few days. I will adjust the ZEPPELIN-1546 JIRA to just handle the NestedRuntimeException.

zjffdu · 2016-10-14T02:51:56Z

Thanks @gss2002 for checking ZEPPELIN-1430, I will update that in another PR.

gss2002 · 2016-10-14T02:56:30Z

@zjffdu Not a problem. Also I assume we will want to fix SparkR and PySpark also to use ConcurrentHashMaps since they could have multiuser's hitting it. I noticed some of the other interpreters use ConcurrentHashmaps.

zjffdu · 2016-10-14T03:14:54Z

That's should be fine to use HashMap for SparkR and PySpark, as their scheduler is FIFOScheduler which only one thread will access the map at one time. But it would be better to use ConcurrentHashMap in case we share sparkContext across all the interpreters.

gss2002 · 2016-10-14T03:20:14Z

@zjffdu this may be me misunderstanding the code. but if you have 10 users in Zeppelin all accessing PySpark or SparkR there is no possibility of multiple users stepping on the map? So you are saying only 1 thread to interpret is available?

zjffdu · 2016-10-14T04:21:01Z

That's correct for LivySparkInterpreter, LivyPySparkInterpreter and LivySparkRInterpreter. Because they use FIFOScheduler. If you check the implenentaion of FIFOScheduler, there's only one thread in it, and each job will run in FIFO order. But this is different for LivySparkSQLInterpreter as it can use ParallelScheduler if set zeppelin.livy.concurrentSQL as true. This is why I mention before there're many issues in LivyInterpreter, it doesn't make sense to use FIFOScheduler for LivySparkInterpreter, LivyPySparkInterpreter and LivySparkRInterpreter. Because each user's sessions are isolated in livy.

zjffdu · 2016-12-23T00:38:56Z

@gss2002 Livy interpreter has been refactored, this issue has been fixed. Mind to close it ?

zjffdu · 2017-03-07T03:55:41Z

@gss2002 Would you mind to close it ? It is resolved in ZEPPELIN-2116

gss2002 · 2017-03-07T20:52:47Z

@zjffdu all set

ZEPPELIN-1516 Static/Singleton for userSessionMaps

11bcd2d

zjffdu mentioned this pull request Oct 14, 2016

ZEPPELIN-1516. NPE LivySparkSQLInterpreter thrown with %livy.sql interpreter #1514

Closed

1 task

gss2002 closed this Mar 7, 2017

NPE LivySparkSQLInterpreter thrown with %livy.sql interpreter function #1503

NPE LivySparkSQLInterpreter thrown with %livy.sql interpreter function #1503

Uh oh!

Conversation

gss2002 commented Oct 11, 2016

What is this PR for?

What type of PR is it?

Todos

What is the Jira issue?

How should this be tested?

Questions:

Uh oh!

gss2002 commented Oct 11, 2016

Uh oh!

zjffdu commented Oct 11, 2016

Uh oh!

gss2002 commented Oct 11, 2016

Uh oh!

gss2002 commented Oct 14, 2016

Uh oh!

zjffdu commented Oct 14, 2016

Uh oh!

gss2002 commented Oct 14, 2016

Uh oh!

zjffdu commented Oct 14, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gss2002 commented Oct 14, 2016

Uh oh!

gss2002 commented Oct 14, 2016

Uh oh!

gss2002 commented Oct 14, 2016

Uh oh!

zjffdu commented Oct 14, 2016

Uh oh!

gss2002 commented Oct 14, 2016

Uh oh!

zjffdu commented Oct 14, 2016

Uh oh!

gss2002 commented Oct 14, 2016

Uh oh!

zjffdu commented Oct 14, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zjffdu commented Dec 23, 2016

Uh oh!

zjffdu commented Mar 7, 2017

Uh oh!

gss2002 commented Mar 7, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

zjffdu commented Oct 14, 2016 •

edited

Loading

zjffdu commented Oct 14, 2016 •

edited

Loading