
Streams and BLOB escapes for large databases #13

Closed
Wants to merge 28 commits

Conversation

@BrandonRoehl commented Jan 25, 2019

  • io.Writer for streams
    • Works with Go's "compress" packages (see the sketch after this list)
  • Partial writes are not stored in memory, and the stream can be monitored
  • Single Data structure
  • Support for unquoted numbers
  • Support for BLOB dumps
  • Ability to ignore tables
  • SQL string quoting / sanitization
    • Required for BLOBs and JSON
  • Concurrency in the dump
    • Speeds up writing to os.File and stream chains
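
Since the dump target is now any io.Writer, the output can be chained straight through the standard library's compress packages. Here is a rough sketch of that idea; the dump function below is a hypothetical stand-in for the entry point this PR exposes, not the final API:

```go
package main

import (
	"compress/gzip"
	"database/sql"
	"io"
	"log"
	"os"

	_ "github.com/go-sql-driver/mysql"
)

// dump is a hypothetical stand-in for the streaming dump entry point in
// this PR; the real name and signature may differ.
func dump(db *sql.DB, w io.Writer) error { return nil }

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/dbname")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	f, err := os.Create("dump.sql.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Chain the dump through gzip on its way to the file; nothing is
	// buffered in memory beyond what gzip itself holds.
	gz := gzip.NewWriter(f)
	defer gz.Close()

	if err := dump(db, gz); err != nil {
		log.Fatal(err)
	}
}
```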

mysqldump.go (review comment)

"github.com/JamesStewy/go-mysqldump"
_ "github.com/go-sql-driver/mysql"
"github.com/jamf/go-mysqldump"
@JamesStewy (Owner)

There are a few instances of this around that need to be updated.

@BrandonRoehl (Author)

I might have to close this PR and open a new one from my fork, then. But I can do that after the rest of this PR and the changes are done. I'd like to see table concurrency as an example, but that might be beyond the scope of this PR.

@JamesStewy (Owner)

I am okay with you leaving this change until the end, just to keep it as one PR.

@BrandonRoehl (Author)

Oh no, you misunderstand me: I'd have to close this one and open a new one. This is a change I can't make on the jamf:master upstream; I'd have to do it on my fork.

@JamesStewy (Owner)

Okay. If you could make that fork and create a new PR referencing this one, that would be great.

@BrandonRoehl (Author)

Closing this; see #14.

@JamesStewy (Owner)

Hi @BrandonRoehl, thank you very much for your extensive contribution.

With regard to the addition of concurrency, I feel that the per-table SQL queries (in createTable) could be added to the created goroutines to maximize the speed benefit. So the rough steps of Dump would become (see the sketch after this list):

  1. Write header to output writer
  2. Get tables (getTables)
  3. Create a goroutine for each table with the following rough steps:
    1. Get table SQL (createTableSQL)
    2. Get table rows (createTableValues)
    3. Write table to output writer (writeTable)
  4. Wait for goroutines to finish
  5. Write footer to output writer
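
A minimal sketch of those steps, assuming hypothetical helpers named after the functions mentioned above (the real signatures in this PR may differ):

```go
package mysqldump

import (
	"io"
	"log"
	"sync"
)

// Data is a stand-in for this package's dump state; the real struct has
// more fields. The helper methods below are stubs with assumed signatures.
type Data struct {
	Out io.Writer
}

func (d *Data) writeHeader() error                            { return nil }
func (d *Data) writeFooter() error                            { return nil }
func (d *Data) getTables() ([]string, error)                  { return nil, nil }
func (d *Data) createTableSQL(name string) (string, error)    { return "", nil }
func (d *Data) createTableValues(name string) (string, error) { return "", nil }
func (d *Data) writeTable(schema, values string) error        { return nil }

// Dump follows the five rough steps above: one goroutine per table, with
// the actual writes to d.Out serialized by a mutex.
func (d *Data) Dump() error {
	if err := d.writeHeader(); err != nil { // 1. header
		return err
	}
	tables, err := d.getTables() // 2. table list
	if err != nil {
		return err
	}
	var (
		wg sync.WaitGroup
		mu sync.Mutex
	)
	for _, name := range tables {
		wg.Add(1)
		go func(name string) { // 3. one goroutine per table
			defer wg.Done()
			schema, err := d.createTableSQL(name) // 3.1 table DDL
			if err != nil {
				log.Println("dump table", name, err)
				return
			}
			values, err := d.createTableValues(name) // 3.2 row data
			if err != nil {
				log.Println("dump table", name, err)
				return
			}
			mu.Lock() // 3.3 serialize the write to the output writer
			defer mu.Unlock()
			if err := d.writeTable(schema, values); err != nil {
				log.Println("write table", name, err)
			}
		}(name)
	}
	wg.Wait()              // 4. wait for all table goroutines
	return d.writeFooter() // 5. footer
}
```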

Finally, just to confirm, this PR would be a fix for #7 and would replace #9?

Thanks, James.

@BrandonRoehl (Author)

@JamesStewy This would fix #8 and #7, and would replace #9.

I thought there was a reason I did the concurrency like this, for stability: func (data *Data) dumpTable(name string) error was designed to run as go data.dumpTable(name), but the reason why that fails is escaping me. You can replace line 130 with go data.dumpTable(name) and it will fail on large databases. This might be due to the nested concurrency and may be alleviated by forcing the write to be serial. I'll have to do more testing on Monday, though, before I can give a better answer about whether that will fix the issue.

@JamesStewy (Owner)

This would fix #8

How so?

This might be due to the nested concurrency

If you were to make that change on line 130, then there would be no need for the second go call on line 151. That would remove the nested concurrency.

forcing the write to be serial

The writing is already serial in a sense because of the mutex.

Also, in regard to concurrency, the current implementation leaves potential errors on line 157 unchecked. Adding SQL calls into the goroutines will also add more potential errors that need to be checked and communicated back to the main goroutine.
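
One common way to get those errors back to the main goroutine is a buffered error channel; a sketch, with dumpConcurrently and its callback being illustrative names rather than anything in this PR:

```go
package mysqldump

import "sync"

// dumpConcurrently runs dump once per table in its own goroutine and
// reports the first error back to the caller instead of dropping it.
// dump would be something like (*Data).dumpTable from this PR.
func dumpConcurrently(tables []string, dump func(name string) error) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(tables)) // buffered: goroutines never block on send

	for _, name := range tables {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			if err := dump(name); err != nil {
				errs <- err
			}
		}(name)
	}
	wg.Wait()
	close(errs)

	// Surface the first error, if any; the channel only holds non-nil errors.
	for err := range errs {
		return err
	}
	return nil
}
```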

@BrandonRoehl (Author) commented Jan 28, 2019

#8 is about batch inserts; this version writes out to the stream as soon as it can, so you can use https://github.com/machinebox/progress to monitor progress, just like they were intending.
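
For example, something along these lines should work (a sketch assuming the machinebox/progress NewWriter and N() API; the dump function is a hypothetical stand-in for this package's entry point):

```go
package main

import (
	"io"
	"log"
	"os"
	"time"

	"github.com/machinebox/progress"
)

// dump is a hypothetical stand-in for the streaming dump entry point.
func dump(w io.Writer) error { return nil }

func main() {
	f, err := os.Create("dump.sql")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Wrap the destination so bytes written can be observed while dumping.
	w := progress.NewWriter(f)

	done := make(chan struct{})
	go func() {
		defer close(done)
		if err := dump(w); err != nil {
			log.Println(err)
		}
	}()

	// Report bytes written once a second until the dump finishes.
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			log.Printf("done, %d bytes written", w.N())
			return
		case <-ticker.C:
			log.Printf("%d bytes written so far", w.N())
		}
	}
}
```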

The error on line 157 can only happen if the stream is closed or the template is incorrect. A commit is coming to check stream errors, like in io.Writer.

Here is the error from making the changes on line 151 and line 157, where the select statements run in parallel. Only the write to the stream can be done in a goroutine.

Error with parallel streams.

It's because there are parallel reads from sql.DB, and mutex-locking them defeats the purpose, so you might as well run them serially; it saves some overhead.

[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50072->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50073->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50077->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50076->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50075->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50088->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50074->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50084->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50083->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50082->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50087->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50086->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50081->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50085->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50079->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50080->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50078->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50096->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50095->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50094->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50093->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50091->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50092->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50090->[::1]:3306: read: connection reset by peer
[mysql] 2019/01/28 08:04:59 packets.go:36: read tcp [::1]:50089->[::1]:3306: read: connection reset by peer
[0.36 MB] - Estimated time remaining: 13m34s^C

EDIT: I'd like to note that the only way you can run multiple queries at the same time is by opening parallel connections. This, however, introduces tons of instability and, depending on the environment and how many open connections the MySQL instance allows, total failure as the database scales. Parallel connections for a dump in clustered environments can hit separate nodes and retrieve inconsistent data.

@JamesStewy (Owner)

#8 states

get some notion of progress when executing the dump with mysql-workbench or other UIs

To me this reads as when the dump is being restored to the database, not when the dump is being read from the database. As you stated, this PR, with the addition of a separate library, provides progress indication when reading from the database (which I like too, btw).

Parallel connections for dump in clustered environments can get separate nodes and retrieve bad data

I see what you mean. But the docs for sql.DB say

DB is a database handle representing a pool of zero or more underlying connections. It's safe for concurrent use by multiple goroutines [1]

and reading further on suggests that it, by default, can create and manage multiple underlying connections. So for one, I don't understand why you are having issues with concurrent SQL calls (I am pretty sure I have done that before), and second, wouldn't the default behaviour of sql.DB cause the same issue you described with a database cluster?
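
For what it's worth, the pool is configurable, so one way to address both the driver collisions and the cluster concern might be to pin sql.DB to a single underlying connection; a sketch using the standard database/sql API (not something this PR does today):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/dbname")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Allow at most one underlying connection: goroutines can still issue
	// queries concurrently, but sql.DB serializes them on this connection,
	// and a clustered setup always talks to the same node.
	db.SetMaxOpenConns(1)
	db.SetMaxIdleConns(1)
}
```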

@BrandonRoehl (Author)

Oh, I thought that meant "executing the dump" as in executing the dump command.

sql.DB only wraps implementations, and that is probably true for all the internal ones. Here is a branch with the concurrent implementation that errors when using go-sql-driver/mysql, where they collide. Inspecting the server shows only one open connection.

Because Go's sql package only holds the generic methods and not their implementation, the fact that it can handle concurrency doesn't guarantee all implementations will support it.

Now, I might still be doing something wrong here. If you can get it to work consistently I'd have no problem with it, I'd be thrilled, but right now it doesn't, and it can throw tons of errors in something that can be very mission critical when people need to rely on these backups to restore from. The only time you realize a scheduled backup has failed is when you need to restore it. The same reason we decided to let it keep trying to write the other tables if one fails is that if it can get even one more, I consider that a win.

@JamesStewy (Owner)

Okay, it looks like the extra concurrency is going to be a lot of work to get right, so I am happy not to do it. We can just go back to having the writes be concurrent, as you had it when making this PR.

@BrandonRoehl (Author)

Closing to reference #13
