Skip to content

Conversation

@gss2002
Copy link

@gss2002 gss2002 commented Oct 11, 2016

What is this PR for?

The LivySparkSQLInterpreter class does not correctly process the userSessionMap throwing back an NPE when using %livy.sql or %sql when livy is default binding. This prevents correct sharing between Spark Sessions and SparkSQL Sessions. I have attached a fix that implements seperate static single instance classes that manage userSessionMaps.

What type of PR is it?

Bug Fix/Improvement

Todos

Documentation/Unit Tests

What is the Jira issue?

https://issues.apache.org/jira/browse/ZEPPELIN-1516

How should this be tested?

Run %sql or %livy.sql against a sql enabled dataset such as hive metastore.

Questions:

  • Does the licenses files need update? n
  • Is there breaking changes for older versions? n
  • Does this needs documentation? m

@gss2002
Copy link
Author

gss2002 commented Oct 11, 2016

Attempted to use the work around (protected static Map<String, Integer> userSessionMap = new HashMap<>(); ) provided by @zjffdu it did not work. So I went with the more involved re-write and it has solved our problems in our large scale environment and provides the correct sharing between session contexts.

@zjffdu
Copy link
Contributor

zjffdu commented Oct 11, 2016

@gss2002 I have to admit there's lots of issues in livy interpreter. Your changes seems a little big for this fix. Would use ConcurrentHashMap directly work for you ? rather than declare LivyPySparkSessionMap and LivySparkSessionMap

@gss2002
Copy link
Author

gss2002 commented Oct 11, 2016

@zjffdu I tried to do that initially problem is instantiation of the shared object. I can probably consolidate down to one SessionMap but I think if there is sharing needed between sparkSQL and Spark it's best to do it this way minus other solutions. I'll ping @vinayshukla I was discussing this issue with him at Hadoop World a few weeks ago. Maybe we can setup a call to discuss?

@gss2002
Copy link
Author

gss2002 commented Oct 14, 2016

Also Static HashMaps are not threadsafe with multiple puts in the case of multiple users. Hence why I built out Singletons to keep track of these objects to guarantee only one instance gets created. this also allowed for the corrupt/lost session fix to be built out. Basically I propose to merge ZEPPELIN-1516 and ZEPPELIN-1546 and to solve the last remaining issues.

@zjffdu
Copy link
Contributor

zjffdu commented Oct 14, 2016

@gss2002 Do you think this change work for you ? Even this is a temp solution IMO. As I mentioned before, there's several issues in livy interpreter. What I plan is to do lots of improvements and refactoring, but before that, I'd like to merge the livy integration test #1462. Otherwise with more changes on the code, livy interpreter may become more fragile. What do you think ?

@gss2002
Copy link
Author

gss2002 commented Oct 14, 2016

@zjffdu I tried that same code multiple times and it did not work to be completely honest.. I have an email out to a JDK Developer/Debugger for one of the large JDK vendors to find out why this isn't specifically working. The reason I moved to the Singleton model for those classes is to prevent some bad stuff from happening as there is no guarantee when you have multiple requests coming in at the same time that you won't end up with multiple maps. Also I opened ZEPPELIN-1546 to try to solve the 404 problem which I was able to solve. If you want go ahead and merge. I will refactor the code after the fact but honestly I would try to move to the singleton model if multiple request and users will be hitting the Interpreter. I do understand trying to keep this code simple but unfortunately this is not a simple function. if you would like to discuss this further offline shoot me an email.

I based my implementation off of this and some issues I hit supporting some large scale apps over the years.
http://stackoverflow.com/questions/11165852/java-singleton-and-synchronization

@zjffdu
Copy link
Contributor

zjffdu commented Oct 14, 2016

@gss2002 Isn't above change use Initialization on Demand mentioned in the stackoverflow post ? And could you paste the error you see of this approach. Regarding ZEPPELIN-1546, I think it make sense. \cc @prabhjyotsingh (original author of livy interpreter) for more comments.

@gss2002
Copy link
Author

gss2002 commented Oct 14, 2016

@zjffdu testing the changes again with another build. I see a slight difference between your's and mine. Testing again ill advise shortly

@gss2002
Copy link
Author

gss2002 commented Oct 14, 2016

@zjffdu The change you made solved the issue. I see difference now. Your quick workaround is good to go. I am going to fix the LivyHelper to handle the 404 with NestedException if that works under ZEPPELIN-1546. cc @prabhjyotsingh

@gss2002
Copy link
Author

gss2002 commented Oct 14, 2016

@zjffdu Also those new maps added with ZEPPELIN-1430 those should be changed to ConcurrentHashMap. Basically any map that is static with gets/puts/removes etc should be ConcurrentHashMap if not trouble will ensue - https://dzone.com/articles/java-7-hashmap-vs

https://issues.apache.org/jira/browse/ZEPPELIN-1293 - The code in this needs to be adjust to check for null also. If json is null create new session as other errors could occur and cause issues as I saw during debug the last few days. I will adjust the ZEPPELIN-1546 JIRA to just handle the NestedRuntimeException.

@zjffdu
Copy link
Contributor

zjffdu commented Oct 14, 2016

Thanks @gss2002 for checking ZEPPELIN-1430, I will update that in another PR.

@gss2002
Copy link
Author

gss2002 commented Oct 14, 2016

@zjffdu Not a problem. Also I assume we will want to fix SparkR and PySpark also to use ConcurrentHashMaps since they could have multiuser's hitting it. I noticed some of the other interpreters use ConcurrentHashmaps.

@zjffdu
Copy link
Contributor

zjffdu commented Oct 14, 2016

That's should be fine to use HashMap for SparkR and PySpark, as their scheduler is FIFOScheduler which only one thread will access the map at one time. But it would be better to use ConcurrentHashMap in case we share sparkContext across all the interpreters.

@gss2002
Copy link
Author

gss2002 commented Oct 14, 2016

@zjffdu this may be me misunderstanding the code. but if you have 10 users in Zeppelin all accessing PySpark or SparkR there is no possibility of multiple users stepping on the map? So you are saying only 1 thread to interpret is available?

@zjffdu
Copy link
Contributor

zjffdu commented Oct 14, 2016

That's correct for LivySparkInterpreter, LivyPySparkInterpreter and LivySparkRInterpreter. Because they use FIFOScheduler. If you check the implenentaion of FIFOScheduler, there's only one thread in it, and each job will run in FIFO order. But this is different for LivySparkSQLInterpreter as it can use ParallelScheduler if set zeppelin.livy.concurrentSQL as true. This is why I mention before there're many issues in LivyInterpreter, it doesn't make sense to use FIFOScheduler for LivySparkInterpreter, LivyPySparkInterpreter and LivySparkRInterpreter. Because each user's sessions are isolated in livy.

@zjffdu
Copy link
Contributor

zjffdu commented Dec 23, 2016

@gss2002 Livy interpreter has been refactored, this issue has been fixed. Mind to close it ?

@zjffdu
Copy link
Contributor

zjffdu commented Mar 7, 2017

@gss2002 Would you mind to close it ? It is resolved in ZEPPELIN-2116

@gss2002
Copy link
Author

gss2002 commented Mar 7, 2017

@zjffdu all set

@gss2002 gss2002 closed this Mar 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants