Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi docs same tns. Hash based naming. #72

Closed
wants to merge 2 commits into from
Closed

Multi docs same tns. Hash based naming. #72

wants to merge 2 commits into from

Conversation

aaronmmanzano
Copy link

@aaronmmanzano aaronmmanzano commented Oct 25, 2018

Hello David,

I messaged you on Reddit about 2 weeks ago about making a pull request, and here it is. I'm a few months into learning GO and it was a real pleasure tooling around your code. It really gave me a good feel for well structured GO code. With that in mind I don't expect this first pull request to meet those standards and view it more as a code review than pull request. I know this project is no longer a major priority but if you have time to take a look at the code I'd appreciate it.

I'll make line comments to the PR after I create it, but I'll briefly explain my intentions here as well. I initially ran into an issue in with xsd/parse.go:Parse(). When Schema objects are being stored in the parsed map, they are stored with target namespace as the key. This causes to schemas with the same tns to collide in the map and only the last schema using the tns survives.

After correcting this I came into issue with XMLName collisions caused by joining the types of multiple schemas (that target the same namespace) into a map, which happens immediately after schema objects are created in Parse() . In short this lead me down the path of rewriting the copyEltNamesToAnonTypes() and nameAnonymousTypes() functions called within Normalize() to both be a 2-pass process. Gathering information needed to assess possible collisions in the 2nd, renaming pass. If collisions are detected name suffixes are used to prevent it while preserving context provided by using parent element names. The fewer types that pass through to nameAnonymousTypes() the better.

And for those types that do fall through to being renamed by nameAnonymousTypes() the name suffix
process is changed from an incremented number to the leading characters of the type's hash. This allows for more maintainable code as it prevents the situation I encountered where the addition of an anon type early in the xsd causes the renaming of all subsequent anon types with a suffix one higher than its previous. This way a GO type declaration only changes if the xsd type it models changes. It also allows for condensing of exact anon xsd type declarations into a single GO type.

Anyhow, please let me know what you think, and thanks so much for making this project public.
~Aaron

PS: I noticed there is an auto-build process that failed, I'll look into the tests.

Aaron M. Manzano added 2 commits October 25, 2018 03:50
Naming of anonymous types based on hash instead of counter. Decoupling
anon type names from ordering of xsd document arguments.
Instead of leaving anon types to be named _anon (if they would collide
with other types after inheriting a name from a parent element) append
a deterministically derrived suffix to the name to maintain context
while avoiding a name collision.
if err := s.parse(root); err != nil {
return nil, err
}
parsed[tns] = s
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing generated schemas by tns causes collisions, retaining only the types from the last schema to be generated targeting the specific namespace. Using a hash of the schema ensures all unique schemas are retained.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't the approach I would have used. Being able to assume that your schema document is exhaustive for its target namespace makes later steps simpler. For example, if there was a 1:1 mapping from target namespace to xsd then you could remove one layer of maps for your prepCopyEltNamesToAnonTypes, and you would be able to push it into the CopyEltNamesToAnonTypes function.

To workaround the pesky XSD specification allowing schema to be split across multiple files you could add a pass that merges all *xmltree.Element trees for schema with the same target namespace. Another benefit of doing the transform on the tree structure is that it's easier to verify, by printing out the XML in intermediate steps. What do you think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will think on this

nTypes = namedTypesByNS[tns]
nTypesToCopy = namesToCopiedByNS[tns]
}

namedTypes := and(isType, hasAttr("", "name"))
for _, el := range root.SearchFunc(namedTypes) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we collect all the explicitly named types in a tns so we can use them to detect type name collisions in pass 2. We fail if there is a XMLName collision here because XMLNames should be unique per namespace.

@@ -210,22 +295,113 @@ func copyEltNamesToAnonTypes(root *xmltree.Element) {
hasAnonymousType)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next we identify candidates for inheriting parent element name but don't rename immediately so we can determine if there are other candidates for the same name. We store hashes of each type that is a candidate for a particular XMLName so we can assess if a suffix is required in the renaming pass.

or(isElem(schemaNS, "element"), isElem(schemaNS, "attribute")),
hasAttr("", "name"),
hasAnonymousType)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we begin the actual renaming process

isNamedType, suffix := getSuffix(xmlname, tHash)

// If the name/hash of the anon type matches that of an explicity
// declared type, refernce that type and throw away the anon type
Copy link
Author

@aaronmmanzano aaronmmanzano Oct 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It the type, after inheriting the name, exactly matches the hash of an explicitly named type, then we throw away the anon type and point the parent element at the explicitly named type. Otherwise we determine if it requires a suffix to avoid collision and rename it.

@@ -256,13 +432,32 @@ to
</xs:sequence>
</xs:complexType>
*/
func nameAnonymousTypes(root *xmltree.Element) error {

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same process is used to give anon types names

@droyo
Copy link
Owner

droyo commented Oct 27, 2018

Wow... first of all, thank you for this PR and your detailed explanation. I will make comments in-line.

Copy link
Owner

@droyo droyo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a first pass. Again, thank you for your contribution. I have not fully grokked the code yet, so my current feedback is more around formatting. I will need a little more time to give specific feedback on the more significant changes.

Overall this looks pretty good. I have two qualms that will solidify over time:

  • You need tests for what you've fixed here. Specifically, you need to test the following:
    ** schema that are split over multiple files don't have missing types after Parse is done with them.
    ** changing the order of arguments to Normalize doesn't change the naming of anonymous or colliding types.

Other feedback: you're really fixing two issues here:

  • handling schema split over multiple files
  • improving naming of anonymous types with named parent elements.

It would be nice, (but not required!) to address each issue in a separate PR.

xsd/parse.go Show resolved Hide resolved
xsd/parse.go Show resolved Hide resolved
xsd/parse.go Show resolved Hide resolved
nTypes[xmlName] = hash(el)
} else {
return fmt.Errorf(
"Type collision - Name: [ %s ] Namespaces: [ %s ]\n",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure there are exceptions in the code base, but try to keep all of the error messages in plain english, with minimal notation.

Suggested change
"Type collision - Name: [ %s ] Namespaces: [ %s ]\n",
"collision for type %v in target ns %q",

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify a bit more on this one for me?
I wanted to place the XMLName and namespace in the error response to aid in trouble shooting the troublesome elements. Perhaps, "Two elements with same XMLName in target namespace "

Or would you prefer "Naming collision in namespace on XMLName " ?

If neither of those, any suggestions?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a suggestion with my initial comment -- can you see it? Something like that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, I can see that, sorry I overlooked it.

xsd/parse.go Show resolved Hide resolved
xsd/parse.go Show resolved Hide resolved

// If we've encountered a type name for the first time
// store it with its hash
if _, prevUsed := nTypes[xmlName]; !prevUsed {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use ok for the map presence boolean. Go programmers should recognize the idiom well enough. exists is fine too.

Copy link
Author

@aaronmmanzano aaronmmanzano Nov 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I changed this and I agree. I do wonder if you feel the same way about the special case where the boolean is to be used multiple times outside the immediate proximity of the initial map access. I'm interested in your opinion on this one with regard to readability vs accepted idiom usage.

*/
func copyEltNamesToAnonTypes(root *xmltree.Element) {
used := make(map[xml.Name]struct{})
func prepCopyEltNamesToAnonTypes(
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of this function name. As time has passed i've tried to structure parsing as a series of discrete passes over the data, where each "pass" is a function that makes a well-defined transformation over the data. And these transformations should be somewhat self-contained.

From the name prepCopyEltNamesToAnonTypes I guess this is necessary to run before some CopyEltNamesToAnonTypes stage. That doesn't tell me anything about what this step is for.

I'm still digesting this function so I don't have a good name in mind.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand, and that makes perfect sense. With that in mind, perhaps I could move the logic of prepCopyEltNamesToAnonTypes() into the body of CopyEltNamesToAnonTypes() as an anonymously defined function?
PROS:
1 - Don't have to define maps outside both functions to pass results of prepCopyEltNamesToAnonTypes() to CopyEltNamesToAnonTypes() cleaning up Normalize() a bit.
2 - CopyEltNamesToAnonTypes() continues to meet its original purpose as well as the naming convention.
CONS:
1 - CopyEltNamesToAnonTypes() get approximately 74% larger. 87 -> 151 lines.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no issue with big fat functions :) . I vote for moving it into CopyEltNamesToAnonTypes

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roger that :-)

if err := s.parse(root); err != nil {
return nil, err
}
parsed[tns] = s
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't the approach I would have used. Being able to assume that your schema document is exhaustive for its target namespace makes later steps simpler. For example, if there was a 1:1 mapping from target namespace to xsd then you could remove one layer of maps for your prepCopyEltNamesToAnonTypes, and you would be able to push it into the CopyEltNamesToAnonTypes function.

To workaround the pesky XSD specification allowing schema to be split across multiple files you could add a pass that merges all *xmltree.Element trees for schema with the same target namespace. Another benefit of doing the transform on the tree structure is that it's easier to verify, by printing out the XML in intermediate steps. What do you think?

@@ -140,20 +176,33 @@ func Parse(docs ...[]byte) ([]Schema, error) {
for _, root := range schema {
tns := root.Attr("", "targetNamespace")
s := Schema{TargetNS: tns, Types: make(map[xml.Name]Type)}
sHash := hash(root)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little concerned about this part. From my understanding of the hash() function, making changes to the tree structure in root could produce a different hash value. It looks like you don't make any transformations between here and when you lookup the schema on line 205, so there's no bug... yet :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will think on this

@aaronmmanzano
Copy link
Author

aaronmmanzano commented Oct 29, 2018

Thank you for taking the time to read and comment on the code. I've got a bit of a busy week at work, but I'll take a deeper look into your comments ASAP. Also I'll make sure to break the code into PR's with a better separation of concerns.

Thanks again,
Aaron

@droyo
Copy link
Owner

droyo commented Jan 3, 2019

I finished reviewing this and I'm pretty happy with it. I would accept it as-is if you add some tests for the new functionality (see my first comment).

@aaronmmanzano
Copy link
Author

I finished reviewing this and I'm pretty happy with it. I would accept it as-is if you add some tests for the new functionality (see my first comment).

Thanks David that's great news. I got on a bit of a side-track as far as getting these changes committed upstream but it is still very important to me. I've done some minor refactoring and fixed a few bugs in the mean time and I'll realistically be able to be back to this in about 3 weeks. At that time I'll break this large commit down (in terms of separation of concerns) and submit PR's that include tests. I'd really prefer to do it that way both for the experience and because I'd like the code to be as good as possible upon acceptance. In the mean time could you please leave this PR here for my reference??
Thanks David

~Aaron

@droyo
Copy link
Owner

droyo commented Jan 5, 2019

Sounds good to me! Thanks again for your contributions. As a heads up I fixed a bug with commit 48d1a5a that may cause some merge conflicts with you; you're refactoring some of the same bits anyway, so you might want to just copy the tests over and make sure they work.

@mpwalkerdine
Copy link

mpwalkerdine commented Jan 15, 2020

I was about to submit a much smaller PR to resolve the multiple files problem, but not the types problem because I can workaround this for the schemas I'm working on with judicious argument ordering...

FWIW this fixed the problem for me:

diff --git a/xsd/parse.go b/xsd/parse.go
index ab1b059..779b0ec 100644
--- a/xsd/parse.go
+++ b/xsd/parse.go
@@ -128,7 +128,7 @@ func Normalize(docs ...[]byte) ([]*xmltree.Element, error) {
 func Parse(docs ...[]byte) ([]Schema, error) {
 	var (
 		result = make([]Schema, 0, len(docs))
-		parsed = make(map[string]Schema, len(docs))
+		parsed = make(map[*xmltree.Element]Schema, len(docs))
 		types  = make(map[xml.Name]Type)
 	)
 
@@ -143,7 +143,7 @@ func Parse(docs ...[]byte) ([]Schema, error) {
 		if err := s.parse(root); err != nil {
 			return nil, err
 		}
-		parsed[tns] = s
+		parsed[root] = s
 	}
 
 	for _, s := range parsed {
@@ -153,7 +153,7 @@ func Parse(docs ...[]byte) ([]Schema, error) {
 	}
 
 	for _, root := range schema {
-		s := parsed[root.Attr("", "targetNamespace")]
+		s := parsed[root]
 		if err := s.resolvePartialTypes(types); err != nil {
 			return nil, err
 		}

Edit: Spoke too soon - looks like the _self references are overlapping still, but also the flattening seems to be discarding needed types 🤷‍♂️

@davidalpert
Copy link
Contributor

davidalpert commented Mar 11, 2021

this is a lot of good work.

I started poking at this issue yesterday without finding this PR and the approach I settled on (working locally for me so far based on the code that's currently in master) added a MergeTypes method to the xsd.Schema type that threw an error if you call it for two schemas with different target namespaces. using this and the existing Imports method to discover and recursively follow/load imported schema files seemed to work for me, but I didn't test it extensively. If interested I could submit that as another PR, but it seems that this one is solving the same problem through a different approach and has already been approved in theory.

Perhaps I can help to resolve the pending conflicts with master and then we merge it and go from there?

@davidalpert
Copy link
Contributor

not a trivial merge; would likely go faster with @droyo or @aaronmmanzano at the helm.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants