Add s3 backup #8
Conversation
Force-pushed from 2b3cdb5 to f1fd7dd
Added all the feedback I could think of.
Overall it seems to be good. But obviously without tests it's complicated.
```scala
sealed abstract class BackupStreamPosition

object BackupStreamPosition {

  /** The backup stream has just started right now
    */
  case object Start extends BackupStreamPosition

  /** The backup stream is in the middle of a time period
    */
  case object Middle extends BackupStreamPosition

  /** The backup stream position has just hit a boundary for when a new period starts
    */
  case object Boundary extends BackupStreamPosition
}
```
If those are meant to be public, I would put them in their own file.
I personally prefer having one class per file; it's easier to navigate and locate things.
I may make this private because I don't think it needs to be public, but I note your point (I will put it in a different file if it ends up being public)
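For what it's worth, keeping the hierarchy sealed preserves exhaustiveness checking whether it ends up public or private. A minimal sketch (the ADT mirrors the diff above; describe is a hypothetical helper, not code from this PR):

```scala
// ADT mirroring the BackupStreamPosition from the diff above
sealed abstract class BackupStreamPosition
object BackupStreamPosition {
  case object Start    extends BackupStreamPosition
  case object Middle   extends BackupStreamPosition
  case object Boundary extends BackupStreamPosition
}

object PositionDemo {
  // Because the class is sealed, the compiler warns if a case is missing here
  def describe(pos: BackupStreamPosition): String = pos match {
    case BackupStreamPosition.Start    => "stream just started"
    case BackupStreamPosition.Middle   => "inside a time period"
    case BackupStreamPosition.Boundary => "new period starts here"
  }

  def main(args: Array[String]): Unit =
    println(describe(BackupStreamPosition.Boundary))
}
```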
Force-pushed from f1fd7dd to 329124d
Looks good.
Would it be possible to use new commits instead of force-pushing? Reviewing is easier with new commits.
Force-pushed from 329124d to 3d4bdde
@jlprat Okay, sure, will do; this will be the last time I force push. Do note that GitHub still keeps conversations on comments even if you have force-pushed; they only get hidden if the actual code gets changed.
Force-pushed from 3d4bdde to ad88508
LGTM. Let me know when it's not a draft anymore.
Force-pushed from ad88508 to e3b59b6
Force-pushed from e3b59b6 to 5071d25
```diff
@@ -117,7 +117,7 @@ trait BackupClientInterface {
           (reducedConsumerRecord, period)
         }

-      case None => throw new IllegalAccessException("")
+      case None => throw Errors.ExpectedStartOfSource
```
Via testing I have verified that this case isn't possible; you have to have at least a single element in a Source in order for it to even start.
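For context, replacing the empty IllegalAccessException with a named error object could look roughly like this (a sketch; the actual Errors definition is not shown in this thread and the message string is assumed):

```scala
object Errors {
  // Hypothetical shape of the error thrown in the diff above;
  // extending Exception with a fixed message keeps the failure self-describing
  case object ExpectedStartOfSource extends Exception("Expected start of Source")
}

object ErrorsDemo {
  def main(args: Array[String]): Unit =
    println(Errors.ExpectedStartOfSource.getMessage)
}
```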
```diff
-      // TODO Is it possible to hit this branch? I assume if the Stream is started its impossible for
-      // head to be empty
-      ???
+      case None => throw Errors.ExpectedStartOfSource
```
Via testing I have verified that this case isn't possible; you have to have at least a single element in a Source in order for it to even start.
```diff
@@ -204,5 +228,5 @@ object BackupClientInterface {
       reducedConsumerRecord: ReducedConsumerRecord
     ): Long =
       // TODO handle overflow?
-    ChronoUnit.MICROS.between(reducedConsumerRecord.toOffsetDateTime, initialTime) / period.toMicros
+    ChronoUnit.MICROS.between(initialTime, reducedConsumerRecord.toOffsetDateTime) / period.toMicros
```
One bug that I found in the initial implementation: it turns out you can get a negative time between two time periods!
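The root cause is that ChronoUnit.between is directional: it returns a negative amount when the second argument is earlier than the first. A small self-contained illustration (the timestamps are made up):

```scala
import java.time.OffsetDateTime
import java.time.temporal.ChronoUnit

object BetweenOrderDemo {
  def main(args: Array[String]): Unit = {
    val initialTime = OffsetDateTime.parse("2021-01-01T00:00:00Z")
    val recordTime  = initialTime.plusSeconds(90) // a record 90s after the stream started

    // Reversed arguments (the original bug): a negative microsecond count
    println(ChronoUnit.MICROS.between(recordTime, initialTime)) // -90000000
    // Corrected order: elapsed time since initialTime, non-negative for later records
    println(ChronoUnit.MICROS.between(initialTime, recordTime)) // 90000000
  }
}
```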
Don't really get this change
Force-pushed from 9e44344 to 65842fe
Force-pushed from 65842fe to 606c048
I have some comments mostly on the testing area.
```scala
import scala.annotation.nowarn
import scala.concurrent.Await
import scala.concurrent.duration.{FiniteDuration, _}
import scala.language.postfixOps
```
This import is not needed
```scala
"BackupClientInterface" can {
  "splitAtBoundaryCondition" should {
    "BackupStreamPosition.Boundary happy case" in {
```
I don't know if it has improved or not, but nested matchers used to be bad performance-wise. A big flat spec was several times faster than a nested one.
```scala
"calculateBackupStreamPositions" should {

  "must always have at least one BackupStreamPosition.Boundary" in {
```
Nit: either use "must" instead of "should", or remove it from the string.
```scala
    }
  }

  "Every ReducedConsumerRecord after a BackupStreamPosition.Boundary must be in the next consecutive time period" in {
```
That sentence makes no sense when composed:
"calculateBackupStreamPositions" should "Every ReducedConsumerRecord after a BackupStreamPosition.Boundary must be in the next consecutive time period"
```scala
    }
  }

  "The time difference between two consecutive BackupStreamPosition.Middle has to be less then the time period" in {
```
Same here. Probably means the test belongs somewhere else, or you can structure the nesting differently
```scala
import scala.language.postfixOps

trait ScalaTestConstants {
  val AwaitTimeout = 10 minutes
```
That's huge. Seconds would probably be enough. What about 5 seconds?
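As a side note, the scala.language.postfixOps import above is only needed for the space-separated "10 minutes" syntax; the dotted form avoids it entirely. A sketch using the reviewer's suggested value:

```scala
import scala.concurrent.duration._

object TimeoutDemo {
  // Dotted duration syntax: no postfixOps import required
  val AwaitTimeout: FiniteDuration = 5.seconds

  def main(args: Array[String]): Unit =
    println(AwaitTimeout.toMillis) // 5000
}
```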
This is an in-progress PR for backing up into Amazon S3. Some important notes:

- Handling ReducedConsumerRecord cursors is quite complicated. This is because you have a cursor for each ReducedConsumerRecord, however when we are feeding the Flow into the S3.multipartUploadWithHeaders Sink we lose the concept of single elements of ReducedConsumerRecord with a Context (since at the point of S3.multipartUploadWithHeaders it is only dealing with a stream of bytes with no beginning or end).
- Overflow with ChronoUnit (currently we use MICROS). A lazy way of fixing this issue is to use BigInt rather than Long (which can basically increase until you run out of memory). There may be a smarter way of handling this problem by detecting that we are reaching Long.MaxValue and just resetting the counter, but doing so will make the logic less simple.
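The BigInt idea can be sketched as follows: java.time.Duration.between cannot overflow for any representable pair of dates, so the microsecond count can be assembled from its seconds and nanos as a BigInt (microsBetween is a hypothetical helper, not code from this PR):

```scala
import java.time.{Duration, OffsetDateTime}

object OverflowSketch {
  // Assemble the microsecond difference via BigInt instead of
  // ChronoUnit.MICROS.between, which throws ArithmeticException on Long overflow
  def microsBetween(start: OffsetDateTime, end: OffsetDateTime): BigInt = {
    val d = Duration.between(start, end)
    BigInt(d.getSeconds) * 1000000L + d.getNano / 1000L
  }

  def main(args: Array[String]): Unit = {
    val t0 = OffsetDateTime.parse("2021-01-01T00:00:00Z")
    println(microsBetween(t0, t0.plusSeconds(90))) // 90000000
  }
}
```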