Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ HSL ] Dict\chunk() behaves differently from doc block when given a KeyedTraversable with duplicate keys #9271

Open
lexidor opened this issue Oct 21, 2022 · 0 comments

Comments

@lexidor
Copy link
Collaborator

lexidor commented Oct 21, 2022

Describe the bug
When given a KeyedTraversable with duplicate keys, Dict\chunk() will return chunks smaller than the chunk size. Some (but not all) duplicate keys will be lost.

 * Returns a vec containing the original dict split into chunks of the given
 * size.
 *
 * If the original dict doesn't divide evenly, the final chunk will be
 * smaller.

The vec will not contain all the elements of the KeyedTraversable ("original dict") and the chunks will not be of the "given size".

Standalone code, or other way to reproduce the problem

function keyed_traversable_with_duplicate_keys(
): KeyedTraversable<string, int> {
  // a => 0, b => 1, c => 2, a => 3, b => 4, c => 5, ...
  for ($i = 0; $i < 15; $i += 3) {
    yield 'a' => $i;
    yield 'b' => $i + 1;
    yield 'c' => $i + 2;
  }
}

<<__EntryPoint>>
function main(): void {
  echo \json_encode(\HH\Lib\Dict\chunk(keyed_traversable_with_duplicate_keys(), 5));
  // [{"a":3,"b":4,"c":2},{"c":8,"a":9,"b":7},{"b":13,"c":14,"a":12}]
}

Steps to reproduce the behavior:

  1. Run the repro listed above

Expected behavior

There is no "correct" behavior for a KeyedTraversable with duplicate keys that makes much sense.

  • Consuming items until the accumulating dict is 5 elements long would result in [{"a":12,"b":13,"c":14}].

  • The current behavior silently drops elements from the provided KeyedTraversable. The elements that are dropped depend on the requested size. The order of the sub dicts differs from the order of the supplied KeyedTraversable.


My preferred solution would be to restrict the input type to KeyedContainer<Tk, Tv>. KeyedContainers can not have duplicate keys. This forces callers with a KeyedTraversable to call the function like this:

Dict\chunk(dict(keyed_traversable_with_duplicate_keys()), 5);

The loss of data will be more reasonable this way. Later values of duplicate keys overwrite previous occurrences. The order of the elements is maintained. The sub dicts will all be of the correct size, except for the trailing dict.

In order to support users who depended on the current behavior, I'd move the current implementation to the Legacy_FIXME namespace.

Actual behavior

[{"a":3,"b":4,"c":2},{"c":8,"a":9,"b":7},{"b":13,"c":14,"a":12}]

Environment

  • Operating system

'Ubuntu 18.04'

  • Installation method

'apt-get with dl.hhvm.com repository'

  • HHVM Version
HipHop VM 4.169.0 (rel) (non-lowptr)
Compiler: 1663640616_852546864
Repo schema: ce4145c518e16986e860178941c05dc9f11eab61

Additional context

This behavior has existed since the first public release of the hsl.

https://github.com/hhvm/hsl/blob/e3bb5e944f0639819f4d6416205449418cbc96c5/src/dict/transform.php#L20-L35

/**
 * Returns a vec containing the original dict split into chunks of the given
 * size. If the original dict doesn't divide evenly, the final chunk will be
 * smaller.
 */
function chunk<Tk, Tv>(
  KeyedTraversable<Tk, Tv> $traversable,
  int $size,
): vec<dict<Tk, Tv>> {
  invariant($size > 0, 'Chunk size must be positive.');
  $result = vec[];
  $ii = 0;
  foreach ($traversable as $key => $value) {
    if ($ii % $size === 0) {
      $result[] = dict[];
    }
    $result[(int)($ii / $size)][$key] = $value;
    $ii++;
  }
  return $result;
}

https://github.com/hhvm/hsl/blob/2615e1a30ced5518339a46f971c0434249ab5617/src/dict/transform.php#L15-L42

/**
 * Returns a vec containing the original dict split into chunks of the given
 * size.
 *
 * If the original dict doesn't divide evenly, the final chunk will be
 * smaller.
 *
 * Time complexity: O(n)
 * Space complexity: O(n)
 */
function chunk<Tk as arraykey, Tv>(
  KeyedTraversable<Tk, Tv> $traversable,
  int $size,
)[]: vec<dict<Tk, Tv>> {
  invariant($size > 0, 'Expected positive chunk size, got %d.', $size);
  $result = vec[];
  $ii = 0;
  $chunk_number = -1;
  foreach ($traversable as $key => $value) {
    if ($ii % $size === 0) {
      $result[] = dict[];
      $chunk_number++;
    }
    $result[$chunk_number][$key] = $value;
    $ii++;
  }
  return $result;
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant