Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vinai/load data - Part 1 very basic functionality review #316

Merged
merged 31 commits into from
Mar 8, 2021

Conversation

VinaiRachakonda
Copy link
Contributor

@VinaiRachakonda VinaiRachakonda commented Feb 26, 2021

This PR introduces basic functionality for LOAD DATA.

@VinaiRachakonda VinaiRachakonda changed the title [WIP] Vinai/load data Vinai/load data - Part 1 very basic functionality review Mar 1, 2021
Copy link
Member

@zachmu zachmu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start, but needs a different approach. See comments

}

// Creates a directory of files
func CreateDummyFiles(dir string) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a confusing way to set up test data. The test data should be declared alongside the test scenario.

My suggestion is to just check files into source control in a testdata directory, and then issue actual LOAD DATA commands for the various files in the setup portion of the scripts above.

return err
}
}
default:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should die loudly here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need to do this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need to do this. Just return an error if it's not local

return nil, err
}

ignoreNumVal, ok = ign.(int8)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be larger than 255 right?

}
}

return plan.NewLoadData(bool(d.Local), d.Infile, unresolvedTable, columnsToStrings(d.Columns), d.Fields, d.Lines, ignoreNumVal), nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good first draft, but what you really need to do is create a new Insert node with the LoadData node as the data source. You need the Insert to be the top-level node so that other logic (like type conversions, trigger application, constraint checking) etc. just works. Otherwise you'll end up duplicating the insert logic anywhere, as a bunch of your TODOs mention)

Copy link
Member

@zachmu zachmu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting closer, but still a couple major issues with this approach. See commentsd

@@ -237,9 +237,15 @@ func (t numberTypeImpl) Convert(v interface{}) (interface{}, error) {
}
return uint32(num), nil
case sqltypes.Int32:
// If empty return the nil value.
if v == "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably this is true for all the integer types?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't the MySQL behavior:

insert into mytable (i) values ("");
ERROR 1366 (HY000): Incorrect integer value: '' for column 'i' at row 1

)

var (
fieldsTerminatedByDelim = "\t"
Copy link
Member

@zachmu zachmu Mar 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't these settings in the query?

Vinai: There are defaults that mysql uses when nothing is specified.

func (l *LoadData) String() string {
pr := sql.NewTreePrinter()

_ = pr.WriteNode("LOAD DATA")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably just do LOAD DATA (filename)

type LoadData struct {
Local bool
File string
Destination sql.Node
Copy link
Member

@zachmu zachmu Mar 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer want this here

Vinai: I use this to infer some schema to determine the right default values.

return []sql.Node{l.Destination}
}

func (l *LoadData) RowIter(ctx *sql.Context, row sql.Row) (sql.RowIter, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You want to scan file lines one at a time as you load rows, not all at once. Otherwise we read the entire file before we begin making progress inserting rows, and we need the entire file in memory (which may be massive). So open the file and scanner in RowIter, then return a sql.RowIter implementation that reads lines off of it one at a time.

if l.Lines != nil {
ll := l.Lines
if ll.StartingBy != nil {
linesStartingByDelim = string(ll.StartingBy.Val)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the wrong approach. These vary by statement, so make them fields on the struct, not global vars

Copy link
Member

@zachmu zachmu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a little more work, the tests are a bit lacking

{
Name: "Basic load data with enclosed values.",
SetUpScript: []string{
fmt.Sprintf("create table %s(pk int primary key)", tableNameConst),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really no reason to have this const. Just hardcode it everywhere it's needed. Makes the tests easier to read

},
},
{
Name: "Load data with csv with prefix.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about a test of loading data into a table that doesn't exist?

return err
}
}
default:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need to do this

@@ -155,7 +156,7 @@ func (t numberTypeImpl) Compare(a interface{}, b interface{}) (int, error) {

// Convert implements Type interface.
func (t numberTypeImpl) Convert(v interface{}) (interface{}, error) {
if v == nil {
if v == nil || (reflect.ValueOf(v).Kind() == reflect.Ptr && reflect.ValueOf(v).IsNil()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof, this actually necessary?

}

// updateParsingConsts parses the LoadData object to update the 5 constants that are used for file parsing.
func (l *LoadData) updateParsingConsts() error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method name and comment are no longer accurate

}

// parseLines finds the delim that terminates each line and returns the overall line.
func (l LoadData) parseLines(scanner *bufio.Scanner) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bad name for this method.

Just make the splitFunc function top level, rename it splitLines, and call scanner.Split(splitLines) in RowIter

}

func (l loadDataIter) Close(ctx *sql.Context) error {
if !l.scanner.Scan() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who cares if the scanner is done or not, this method has to close everything up


func addNullsToValues(exprs []sql.Expression, diff int) []sql.Expression {
for i := diff; i > 0; i-- {
exprs = append(exprs, expression.NewLiteral(nil, sql.Null))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this, just make the expr array the same size as the dest schema and the remaining entries will already be nil

// create the values that are returned as a row iter.
var values [][]sql.Expression
values = append(values, exprs)
newValue := NewValues(values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This creates a lot of garbage just to return a sql.Row. Just return the row directly

colDiff := len(l.destination.Schema()) - len(exprs)

// append NULLS for the rest of the fields
exprs = addNullsToValues(exprs, colDiff)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a test of this -- all the test data files have the same number of columns as their target schema

Copy link
Member

@zachmu zachmu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG, just a couple comments

@@ -33,12 +33,16 @@ type ScriptTest struct {
Expected []sql.Row
// For tests that make a single assertion, ExpectedErr can be set for the expected error
ExpectedErr *errors.Kind
// For tests that need to be skipped
Skip bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't do this, put tests that need to be skipped into their own blocks and call them from a different method (TestBrokenLoadFile)

return err
}
}
default:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need to do this. Just return an error if it's not local

@@ -460,6 +460,22 @@ func TestInsertIntoErrors(t *testing.T, harness Harness) {
}
}

func TestLoadData(t *testing.T, harness Harness) {
for _, script := range LoadDataScripts {
if !script.Skip {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment below about the right way to do skipped tests.

Also, whenever you skip a test in go, you need to explicitly call t.skip(), not just fail to run it. Otherwise we have no visibility into how many tests are being skipped like this.


var LoadDataScripts = []ScriptTest{
{
Name: "Basic load data with enclosed values.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need tests for when the file and the target schema have different number of columns

@VinaiRachakonda VinaiRachakonda merged commit 320ebfb into master Mar 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants