When using Spark with external resources such as a database, a somewhat common pattern is to share the database client between tasks so that the connection pool is shared. Otherwise, with a large number of tasks/threads, the database connections get exhausted, which leads to issues when scaling.
This raises some complications when using such an object, as it must be implemented as some kind of singleton that is shared between the threads receiving the serialized objects.
Any idea on how to do this with MacWire? Any pattern that can be used?
A simple example:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.scalatest.funspec.AnyFunSpec
import org.scalatest.matchers.must.Matchers.{be, convertToAnyMustWrapper}

class ModuleWithSparkSpec extends AnyFunSpec {
  it("runs module with spark") {
    val parallelism = 4
    val module = new Module {
      override lazy val connectionString: String = ""
      override lazy val sparkConf: SparkConf =
        new SparkConf().setAppName("Test").setMaster(s"local[$parallelism]")
    }
    module.run(parallelism * 3) must be(parallelism * 3) // prints 4 thread ids and 4 different hash codes, 3 times each
  }
}

class Runner(val sparkConf: SparkConf, val database: Database) extends Serializable {
  def run(count: Int): Long = {
    val database = this.database
    val sparkConf = this.sparkConf
    SparkSession
      .builder()
      .config(sparkConf)
      .getOrCreate()
      .sparkContext
      .parallelize(0 until count)
      .map { n => database.insert(n) }
      .count()
  }
}

trait Module extends Serializable {
  def run(count: Int): Long = runner.run(count)

  import com.softwaremill.macwire._

  protected lazy val connectionString: String = ""
  protected lazy val sparkConf: SparkConf = new SparkConf().setAppName("").setMaster("")
  protected lazy val database: Database = wire[Database] // this will be serialized and duplicated 4 times
  protected lazy val runner: Runner = wire[Runner]
}

class Database(connectionString: String) extends Serializable with AutoCloseable {
  def insert(n: Int): Unit = {
    println(s"Insert $n on thread id = ${Thread.currentThread().getId}, instance hash code = ${hashCode()}")
  }

  override def close(): Unit = {}
}
So the idea would be to have something instead of wire, or alongside it, that makes it use a single instance. I was thinking of implementing a shared singleton Scope that picks the instance from a concurrent collection. Would this be the best way to do it?
protected lazy val database: Database = sharedSingleton(wire[Database])
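For illustration, a minimal sketch of what such a sharedSingleton could look like, backed by a concurrent map keyed by the runtime class of the wired instance (the SharedSingletons object, the ClassTag-based keying, and the registry itself are assumptions, not existing MacWire API):

import scala.collection.concurrent.TrieMap
import scala.reflect.ClassTag

// Hypothetical JVM-wide registry of shared instances, keyed by runtime class.
object SharedSingletons {
  private val instances = TrieMap.empty[Class[_], AnyRef]

  // `create` is by-name: it only runs when no instance is cached for T yet
  // (it may be evaluated more than once under contention, but a single winner
  // is kept and returned to all callers).
  def sharedSingleton[T <: AnyRef](create: => T)(implicit ct: ClassTag[T]): T =
    instances.getOrElseUpdate(ct.runtimeClass, create).asInstanceOf[T]
}

On its own this only deduplicates instances within one JVM; for the copies that Spark deserializes in each task to resolve back to the shared instance, the value has to be looked up again after deserialization (for example via a serializable proxy whose readResolve goes through the registry), which is where the proxy change mentioned at the end of the thread comes in.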
I think if you want to share some state between deserialised objects, you'll need some kind of global state, or deserialise a Database => Module function.
If you'd go with the global state, I think it's exactly as you write - you need some kind of cache, and Cache.get("db", wire[Database]), where the second argument is lazily evaluated and provides the default value, is the way to go. But that would be outside the scope of macwire.
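As an illustration only, that hypothetical Cache could look much like the registry sketched above, just keyed by a string (the Cache object is not a macwire API; the TrieMap backing and by-name default are assumptions):

import scala.collection.concurrent.TrieMap

// Hypothetical global cache, outside macwire: `default` is by-name, so
// wire[Database] only runs when nothing is stored under the key yet.
object Cache {
  private val entries = TrieMap.empty[String, AnyRef]

  def get[T <: AnyRef](key: String, default: => T): T =
    entries.getOrElseUpdate(key, default).asInstanceOf[T]
}

With that in place the module could declare protected lazy val database: Database = Cache.get("db", wire[Database]).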
But maybe serialising the function that creates a Module, given a Database, would work.
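One possible reading of that suggestion, as a hedged sketch against the example above (the ModuleFactory name and the hard-coded config values are placeholders; the Database would be supplied by whoever deserializes the function, e.g. one shared instance per executor JVM):

import org.apache.spark.SparkConf

object ModuleFactory {
  // Hypothetical sketch: keep the Database-dependent wiring as a serializable
  // function (function literals are serializable in Scala 2.12+); the caller
  // on the deserializing side plugs in its own shared Database instead of
  // having one wired and serialized into the Module.
  val moduleFor: Database => Module = { db =>
    new Module {
      override lazy val connectionString: String = ""
      override lazy val sparkConf: SparkConf =
        new SparkConf().setAppName("Test").setMaster("local[4]")
      override lazy val database: Database = db // supplied, not wire[Database]
    }
  }
}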
Thanks for the response. I played around a little and implemented a sharedSingleton that keeps the instances in a TrieMap; however, it needs some changes to the way the proxy is created in order to work with Spark: it needs a proxyFactory.setUseWriteReplace(false) in ProxyCreator.createProxy.