
Streams and BLOB escapes for large databases #13

Closed
Wants to merge 28 commits

Conversation

@BrandonRoehl commented Jan 25, 2019

  • io.Writer for streams
    • Works with Go's "compress" packages (see the sketch after this list)
  • Partial writes are not stored in memory, and the stream can be monitored
  • Single Data structure
  • Support for unquoted numbers
  • Support for BLOB dumps
  • Ability to ignore tables
  • SQL string quoting / sanitization
    • Required for BLOBs and JSON
  • Concurrency in the dump
    • Speeds up writing to os.File and stream chains
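
Since the dump target is now any io.Writer, the output can be chained straight through the standard library's compress packages. Here is a rough sketch of that idea; the dump function below is a hypothetical stand-in for the entry point this PR exposes, not the final API:

```go
package main

import (
	"compress/gzip"
	"database/sql"
	"io"
	"log"
	"os"

	_ "github.com/go-sql-driver/mysql"
)

// dump is a hypothetical stand-in for the streaming dump entry point in
// this PR; the real name and signature may differ.
func dump(db *sql.DB, w io.Writer) error { return nil }

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/dbname")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	f, err := os.Create("dump.sql.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Chain the dump through gzip on its way to the file; nothing is
	// buffered in memory beyond what gzip itself holds.
	gz := gzip.NewWriter(f)
	defer gz.Close()

	if err := dump(db, gz); err != nil {
		log.Fatal(err)
	}
}
```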

mysqldump.go (review comment)

"github.com/JamesStewy/go-mysqldump"
_ "github.com/go-sql-driver/mysql"
"github.com/jamf/go-mysqldump"
@JamesStewy (Owner)

There are a few instances of this around that need to be updated.

@BrandonRoehl (Author)

I might have to close this PR and open a new one from my fork, then. But I can do that after the rest of this PR and the changes are done. I'd like to see table concurrency as an example, but that might be beyond the scope of this PR.

@JamesStewy (Owner)

I am okay with you leaving this change until the end, just to keep it as one PR.

@BrandonRoehl (Author)

Oh no, you misunderstand me: I'd have to close this one and open a new one. This is a change I can't make on the jamf:master upstream; I'd have to do it on my fork.

@JamesStewy (Owner)

Okay. If you could make that fork and create a new PR referencing this one, that would be great.

@BrandonRoehl (Author)

Closing this; see #14.

@JamesStewy (Owner)

Hi @BrandonRoehl, thank you very much for your extensive contribution.

With regard to the addition of concurrency, I feel that the per-table SQL queries (in createTable) could be added to the created goroutines to maximize the speed benefit. So the rough steps of Dump would become (see the sketch after this list):

  1. Write header to output writer
  2. Get tables (getTables)
  3. Create a goroutine for each table with the following rough steps:
    1. Get table SQL (createTableSQL)
    2. Get table rows (createTableValues)
    3. Write table to output writer (writeTable)
  4. Wait for goroutines to finish
  5. Write footer to output writer
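
A minimal sketch of those steps, assuming hypothetical helpers named after the functions mentioned above (the real signatures in this PR may differ):

```go
package mysqldump

import (
	"io"
	"log"
	"sync"
)

// Data is a stand-in for this package's dump state; the real struct has
// more fields. The helper methods below are stubs with assumed signatures.
type Data struct {
	Out io.Writer
}

func (d *Data) writeHeader() error                            { return nil }
func (d *Data) writeFooter() error                            { return nil }
func (d *Data) getTables() ([]string, error)                  { return nil, nil }
func (d *Data) createTableSQL(name string) (string, error)    { return "", nil }
func (d *Data) createTableValues(name string) (string, error) { return "", nil }
func (d *Data) writeTable(schema, values string) error        { return nil }

// Dump follows the five rough steps above: one goroutine per table, with
// the actual writes to d.Out serialized by a mutex.
func (d *Data) Dump() error {
	if err := d.writeHeader(); err != nil { // 1. header
		return err
	}
	tables, err := d.getTables() // 2. table list
	if err != nil {
		return err
	}
	var (
		wg sync.WaitGroup
		mu sync.Mutex
	)
	for _, name := range tables {
		wg.Add(1)
		go func(name string) { // 3. one goroutine per table
			defer wg.Done()
			schema, err := d.createTableSQL(name) // 3.1 table DDL
			if err != nil {
				log.Println("dump table", name, err)
				return
			}
			values, err := d.createTableValues(name) // 3.2 row data
			if err != nil {
				log.Println("dump table", name, err)
				return
			}
			mu.Lock() // 3.3 serialize the write to the output writer
			defer mu.Unlock()
			if err := d.writeTable(schema, values); err != nil {
				log.Println("write table", name, err)
			}
		}(name)
	}
	wg.Wait()              // 4. wait for all table goroutines
	return d.writeFooter() // 5. footer
}
```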

Finally, just to confirm, this PR would be a fix for #7 and would replace #9?

Thanks, James.

@BrandonRoehl (Author)

@JamesStewy This would fix #8 and #7, and would replace #9.

I thought there was a reason I did the concurrency like this, for stability: func (data *Data) dumpTable(name string) error was designed to run as go data.dumpTable(name), but the reason why that fails is escaping me. You can replace line 130 with go data.dumpTable(name) and it will fail on large databases. This might be due to the nested concurrency and may be alleviated by forcing the write to be serial. I'll have to do more testing on Monday, though, before I can give a better answer about whether that will fix the issue.

@JamesStewy (Owner)

This would fix #8

How so?

This might be due to the nested concurrency

If you were to make that change on line 130, then there would be no need for the second go call on line 151. That would remove the nested concurrency.

forcing the write to be serial

The writing is already serial in a sense because of the mutex.

Also, in regard to concurrency, the current implementation leaves potential errors on line 157 unchecked. Adding SQL calls into the goroutines will also add more potential errors that need to be checked and communicated back to the main goroutine.
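
One common way to get those errors back to the main goroutine is a buffered error channel; a sketch, with dumpConcurrently and its callback being illustrative names rather than anything in this PR:

```go
package mysqldump

import "sync"

// dumpConcurrently runs dump once per table in its own goroutine and
// reports the first error back to the caller instead of dropping it.
// dump would be something like (*Data).dumpTable from this PR.
func dumpConcurrently(tables []string, dump func(name string) error) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(tables)) // buffered: goroutines never block on send

	for _, name := range tables {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			if err := dump(name); err != nil {
				errs <- err
			}
		}(name)
	}
	wg.Wait()
	close(errs)

	// Surface the first error, if any; the channel only holds non-nil errors.
	for err := range errs {
		return err
	}
	return nil
}
```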

@BrandonRoehl (Author) commented Jan 28, 2019

#8 is about batch inserts; this version writes out to the stream as soon as it can, so you can use https://github.com/machinebox/progress to monitor progress, just like they were intending.
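
For example, something along these lines should work (a sketch assuming the machinebox/progress NewWriter and N() API; the dump function is a hypothetical stand-in for this package's entry point):

```go
package main

import (
	"io"
	"log"
	"os"
	"time"

	"github.com/machinebox/progress"
)

// dump is a hypothetical stand-in for the streaming dump entry point.
func dump(w io.Writer) error { return nil }

func main() {
	f, err := os.Create("dump.sql")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Wrap the destination so bytes written can be observed while dumping.
	w := progress.NewWriter(f)

	done := make(chan struct{})
	go func() {
		defer close(done)
		if err := dump(w); err != nil {
			log.Println(err)
		}
	}()

	// Report bytes written once a second until the dump finishes.
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			log.Printf("done, %d bytes written", w.N())
			return
		case <-ticker.C:
			log.Printf("%d bytes written so far", w.N())
		}
	}
}
```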

The error on line 157 can only happen if the stream is closed or the template is incorrect. A commit is coming to check stream errors, like in io.Writer.

Here is the error from making the changes on line 151 and line 157, where the select statements run in parallel. Only the write to the stream can be done in a goroutine.

Error with parallel streams.

It's because there are parallel reads from sql.DB, and mutex-locking them defeats the purpose, so you might as well run them serially; it saves some overhead.

[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50072->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50073->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50077->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50076->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50075->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50088->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50074->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50084->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50083->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50082->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50087->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50086->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50081->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50085->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50079->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50080->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50078->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50096->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50095->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50094->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50093->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50091->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50092->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50090->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50089->[::1]:3306: read: connection reset by peer
[0.36 MB] - Estimated time remaining: 13m34s^C

EDIT: I'd like to note that the only way you can run multiple queries at the same time is by opening parallel connections. This, however, introduces tons of instability and, depending on the environment and how many open connections the MySQL instance allows, total failure as the database scales. Parallel connections for a dump in clustered environments can hit separate nodes and retrieve inconsistent data.

@JamesStewy (Owner)

#8 states

get some notion of progress when executing the dump with mysql-workbench or other UIs

To me this reads as when the dump is being restored to the database, not when the dump is being read from the database. As you stated, this PR, with the addition of a separate library, provides progress indication when reading from the database (which I like too, btw).

Parallel connections for dump in clustered environments can get separate nodes and retrieve bad data

I see what you mean. But the docs for sql.DB say

DB is a database handle representing a pool of zero or more underlying connections. It's safe for concurrent use by multiple goroutines [1]

and reading further on suggests that it, by default, can create and manage multiple underlying connections. So for one, I don't understand why you are having issues with concurrent SQL calls (I am pretty sure I have done that before), and second, wouldn't the default behaviour of sql.DB cause the same issue you described with a database cluster?
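
For what it's worth, the pool is configurable, so one way to address both the driver collisions and the cluster concern might be to pin sql.DB to a single underlying connection; a sketch using the standard database/sql API (not something this PR does today):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/dbname")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Allow at most one underlying connection: goroutines can still issue
	// queries concurrently, but sql.DB serializes them on this connection,
	// and a clustered setup always talks to the same node.
	db.SetMaxOpenConns(1)
	db.SetMaxIdleConns(1)
}
```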

@BrandonRoehl (Author)

Oh, I thought that meant "executing the dump" as in executing the dump command.

sql.DB only wraps implementations, and that is probably true for all the internal ones. Here is a branch with the concurrent implementation that errors when using go-sql-driver/mysql, where they collide. Inspecting the server shows only one open connection.

Because Go's sql package only holds the generic methods and not their implementation, the fact that it can handle concurrency doesn't guarantee all implementations will support it.

Now, I might still be doing something wrong here. If you can get it to work consistently I'd have no problem with it, I'd be thrilled, but right now it doesn't, and it can throw tons of errors in something that can be very mission critical when people need to rely on these backups to restore from. The only time you realize a scheduled backup has failed is when you need to restore it. The same reason we decided to let it keep trying to write the other tables if one fails is that if it can get even one more, I consider that a win.

@JamesStewy (Owner)

Okay, it looks like the extra concurrency is going to be a lot of work to get right, so I am happy not to do it. We can just go back to having the writes be concurrent, as you had it when making this PR.

@BrandonRoehl (Author)

Closing to reference #13
