Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Psych does not complain about duplicate keys #79

Open
rehevkor5 opened this issue Aug 24, 2012 · 23 comments · May be fixed by #426
Open

Psych does not complain about duplicate keys #79

rehevkor5 opened this issue Aug 24, 2012 · 23 comments · May be fixed by #426

Comments

@rehevkor5
Copy link

Psych does not complain or raise an error when it encounters a duplicate key. Instead, it happily overwrites the previous value of the key. This is in contrast to the YAML spec, which maintains that the keys must be unique.

Agree/disagree that this is a problem?

@tenderlove
Copy link
Member

Well, Psych maintains the behavior that Syck had for this. Until now, nobody has had a problem with the way it works. I don't mind changing it, but it will not be backwards compatible.

@rehevkor5
Copy link
Author

Got it. I don't suppose Psych already has a place for configuration options, does it? I don't see one. Would that be a desirable approach? Then, for example, a Rails app in development mode could be more strict/fragile about the YAML than a production deployment.

@Nitrodist
Copy link

It's been about 2 years since this was brought up. Let's do it.

@Nitrodist
Copy link

So according to this tweet[0], we should add an option to enforce this behaviour (maybe the option is called 'strict').

In my opinion, with this option enabled we should raise a Psych::SyntaxError if we encounter the duplicate keys. This error is already a known error that can be raised on Psych.parse[1].

Psych has a number of interfaces that we would want this behaviour added to (cribbed from the docs[2]):

  • (Object) Pysch.load(yaml, filename = nil)
  • (Object) Psych.load_documents(yaml, &block)
  • (Object) Psych.load_file(filename)
  • (Object) Psych.load_stream(yaml, filename = nil)
  • (Object) Psych.parse(yaml, filename = nil)
  • (Object) Psych.parse_file(filename)
  • (Object) Psych.parse_stream(yaml, filename = nil, &block)

Right now none of these methods take an options hash. We can expand the signature with an optional options hash. I think this the right course of action, does anybody have any other suggestion?

Also, consider that the generation of YAML already has an options hash (in Psych.dump, but not Psych.dump_stream), so this isn't an unexpected behaviour, in my opinion.

[0] - https://twitter.com/tenderlove/status/468781805940641793
[1] - http://rubydoc.info/stdlib/psych/1.9.3/Psych#parse-class_method
[2] - http://rubydoc.info/stdlib/psych/1.9.3/Psych

@Nitrodist
Copy link

Basically I'm proposing changing from this:

Psych.parse(yaml, filename = nil)

to this:

Psych.parse(yaml, filename = nil, options = {})

(and on all the other methods listed above)

@tenderlove
Copy link
Member

@Nitrodist 👍 I think it's good. For a major release, I think we should change the filename to be an option like Psych.load(yaml, filename: name, strict: true) (but that's for a breaking major release). For now, lets add the options hash.

@Nitrodist
Copy link

We're having a hacknight tomorrow in Toronto. I'm hoping to work on a solution with @nwjsmith and have it ready by Thursday. 👐

@perryventas
Copy link

@Nitrodist What was the outcome of your hacking night? 😄

@Nitrodist
Copy link

@coin3d I was able to get it to 'work' but its efficiency left much to be desired. This is what I had:

diff --git lib/psych/handlers/document_stream.rb lib/psych/handlers/document_stream.rb
index e429993..bf04896 100644
--- lib/psych/handlers/document_stream.rb
+++ lib/psych/handlers/document_stream.rb
@@ -17,6 +17,18 @@ module Psych
         @last.implicit_end = implicit_end
         @block.call pop
       end
+
+      def end_mapping
+        mapping = pop
+        keys = []
+        mapping.children.each_slice(2) do |(key_scalar, _)|
+          next if key_scalar.is_a?(Psych::Nodes::Sequence) or key_scalar.is_a?(Psych::Nodes::Alias) or key_scalar.is_a?(Psych::Nodes::Mapping)
+          key = key_scalar.value
+          raise Psych::Exception, "Same key exists on this level" if keys.include? key
+          keys << key
+        end
+        mapping
+      end
     end
   end
 end
diff --git test/psych/test_hash.rb test/psych/test_hash.rb
index dac7f8d..b3ecbc0 100644
--- test/psych/test_hash.rb
+++ test/psych/test_hash.rb
@@ -21,6 +21,16 @@ module Psych
       assert_equal X, x.class
     end

+    def test_error_on_same_key
+      assert_raises(Psych::Exception) do
+        Psych.load <<-EOF
+        -
+          same_key: 'value'
+          same_key: 'value'
+        EOF
+      end
+    end
+
     def test_self_referential
       @hash['self'] = @hash
       assert_cycle(@hash)

@wteuber
Copy link

wteuber commented Jun 24, 2016

I would have liked a warning or error message like this... I now used https://github.com/adrienverge/yamllint for checking for duplicate keys.

@schmurfy
Copy link

schmurfy commented Sep 5, 2017

What is the state of this ? I don't really mind or care about the default but there should be a way to raise an error on duplicate keys, the consequences of a duplicate key are rather harsh since the first one will simply be entirely overwritten.

@imoverclocked
Copy link

If you use a hash instead of a list then the performance shouldn't suffer nearly as much:

+      def end_mapping
+        mapping = pop
+        keys = {}
+        mapping.children.each_slice(2) do |(key_scalar, _)|
+          next if key_scalar.is_a?(Psych::Nodes::Sequence) or key_scalar.is_a?(Psych::Nodes::Alias) or key_scalar.is_a?(Psych::Nodes::Mapping)
+          key = key_scalar.value
+          raise Psych::Exception, "Same key exists on this level" if keys.key? key
+          keys[key] = nil
+        end
+        mapping
+      end

any takers for testing and a PR?

@alex-harvey-z3q
Copy link

Is this likely to be implemented any time soon?

@wteuber
Copy link

wteuber commented Mar 25, 2018

@tenderlove you mentioned backwards compatibility...
Instead raising an error, just (optionally) printing out a warning wouldn't brake backwards compatibility but help many people manage their yaml files. What do you think?

@imaginejosephishak
Copy link

Just wasted 2 days investigating this very issue after a developer accidentally forgot to rename a job that was copy/pasted. Any thoughts on just printing a message as @wteuber suggests? Would save a lot of time - especially when multiple developers are working on automation.

@wteuber
Copy link

wteuber commented Apr 23, 2019

I helped programmatically avoiding this issue by writing and introducing a normalized YAML format in the projects I am involved in. Maybe you find this helpful, too: https://github.com/Sage/yaml_normalizer
@imaginejosephishak

@native-api
Copy link

Travis CI user has had a problem because of the current behavior: https://travis-ci.community/t/build-no-longer-installing-apt-packages-ignoring-config/4306

@native-api
Copy link

native-api commented Jul 23, 2019

http://www.yamllint.com/ also has the same problem: it doesn't report duplicate keys and instead silently deletes all but the last one and reports YAML as "valid".

@mahemoff
Copy link

Since multiple keys will likely be supported by default, permanently, I think there should be a guarantee about the order they will be processed. It seems to use a "first come, first served" model rather than an override model, i.e. the first key encountered is the one that's used, with subsequent duplicates being discarded rather than acting as overrides.

Would it make sense to document this and add a test to ensure it remains stable?

While I realise it's not strictly YAML-compliant, it's actually useful for merging YAMLs from multilpe sources, so if it's in the library anyway, may as well make it work deterministically.

@getaaron getaaron linked a pull request Dec 18, 2019 that will close this issue
@AvdN
Copy link

AvdN commented Dec 2, 2021

That this is not an error is problematic, especially in the context of merge keys. Gitlab (working with Ruby for YAML) seems to allow this because of this bug, and the effect when mulitple mapped in merges have the same key (but different values) was allowed but impossible for me to trace what it means. I.e. is

   <<: *a
   <<: *b

the same as <<: [*a, *b] or as <<: [*b, *a] or as <<: *b or as <<: *a ? These all have different results depending on the keys in the mappings &a and &b and the mapping in which things are merged.

FWIW, PyYAML overwrites a key's value with any value with the same that occur later in the document text in a mapping.
https://github.com/yaml/pyyaml/blob/8cdff2c80573b8be8e8ad28929264a913a63aa33/lib/yaml/constructor.py#L132

@matsubo
Copy link

matsubo commented Jun 2, 2022

I strongly recommend to implement this issue because I encountered issue related to this implementation.

Related issue

I encountered following issue.

  • I use openapi3_parser gem to validate Open API 3 Specification syntax. The gem depends on psych.
  • There was a multiple key definition in the Open API 3 Specification file written by YAML but it's not detected.
  • As as point of view from Open API 3 Specification, multiple key definition leads to confuse users' understanding.
  • example

Reasons to implement

  • The multiple same key is not allowed according to the RFC
  • If the multiple same key is allowed, OpenAPI3 syntax checker would need to validate YAML to assure the multiple key is not definied in the YAML file.

@feliperaul
Copy link

As pretty much anyone that has arrived here, I also have been bitten by this default behavior.

Working with a large Rails yml translation file, I didn't realize I declared a key that was already present in a lower part in the same file.

Couldn't understand why it wasn't loading properly. Took me a long time debugging the ActiveModel and i18n source code directly until I finally realized my mistake 😕

I'd love to see this feature implemented, either with a warning or an error, and IMHO Rails should even default to raising an error for i18n translation files in the future.

@wteuber
Copy link

wteuber commented Oct 30, 2023

I helped programmatically avoiding this issue by introducing normalized YAML files in projects I am involved in. Maybe you find this helpful, too: https://github.com/Sage/yaml_normalizer
@feliperaul

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.