@nsivabalan nsivabalan commented Jun 22, 2023

Change Logs

The HFile format is known to perform well for on-demand and prefix-based lookups. During a refactoring, however, we made minor tweaks to how we call the HFile scanner APIs, and these can cause a latency hit for on-demand lookups compared to a full scan. The issue is that we switched reseekTo(Key) to seekTo(Key). Both calls move the cursor to the key of interest, but seekTo() rewinds the cursor to the beginning of the data block in the HFile and searches from there on every call, while reseekTo() continues forward from the current cursor position.
Ref from the HFileScanner docs:

  /**
   * SeekTo or just before the passed <code>cell</code>.  Examine the return
   * code to figure whether we found the cell or not.
   * Consider the cell stream of all the cells in the file,
   * <code>c[0] .. c[n]</code>, where there are n cells in the file.
   * @param cell
   * @return -1, if cell &lt; c[0], no position;
   * 0, such that c[i] = cell and scanner is left in position i; and
   * 1, such that c[i] &lt; cell, and scanner is left in position i.
   * The scanner will position itself between c[i] and c[i+1] where
   * c[i] &lt; cell &lt;= c[i+1].
   * If there is no cell c[i+1] greater than or equal to the input cell, then the
   * scanner will position itself at the end of the file and next() will return
   * false when it is called.
   * @throws IOException
   */
  int seekTo(Cell cell) throws IOException;
  /**
   * Reseek to or just before the passed <code>cell</code>. Similar to seekTo
   * except that this can be called even if the scanner is not at the beginning
   * of a file.
   * This can be used to seek only to cells which come after the current position
   * of the scanner.
   * Consider the cell stream of all the cells in the file,
   * <code>c[0] .. c[n]</code>, where there are n cellc in the file after
   * current position of HFileScanner.
   * The scanner will position itself between c[i] and c[i+1] where
   * c[i] &lt; cell &lt;= c[i+1].
   * If there is no cell c[i+1] greater than or equal to the input cell, then the
   * scanner will position itself at the end of the file and next() will return
   * false when it is called.
   * @param cell Cell to find (should be non-null)
   * @return -1, if cell &lt; c[0], no position;
   * 0, such that c[i] = cell and scanner is left in position i; and
   * 1, such that c[i] &lt; cell, and scanner is left in position i.
   * @throws IOException
   */
  int reseekTo(Cell cell) throws IOException;
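
To make the cost difference concrete, here is a minimal sketch of the two behaviours on a toy model. Everything in it is an assumption for illustration: `ToyBlockScanner` is a hypothetical class that models a single sorted data block as an in-memory list, and `steps` is only a rough work counter. It is not the HBase implementation (which also consults the block index), but the return codes follow the contract quoted above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one sorted data block. Illustrative only; not HBase's HFileScanner. */
final class ToyBlockScanner {
    private final List<String> keys; // sorted and non-empty (assumption of this sketch)
    private int pos = 0;             // current cursor position
    private long steps = 0;          // rough count of cursor moves plus final comparisons

    ToyBlockScanner(List<String> sortedKeys) {
        this.keys = sortedKeys;
    }

    /** Rewinds to the start of the block, then scans forward to the key. */
    int seekTo(String key) {
        pos = 0;
        return scanForward(key);
    }

    /** Scans forward from the current position; callers must pass non-decreasing keys. */
    int reseekTo(String key) {
        return scanForward(key);
    }

    long steps() {
        return steps;
    }

    private int scanForward(String key) {
        // Advance while the next entry is still <= the target key.
        while (pos + 1 < keys.size() && keys.get(pos + 1).compareTo(key) <= 0) {
            pos++;
            steps++;
        }
        steps++;
        int cmp = keys.get(pos).compareTo(key);
        if (cmp == 0) {
            return 0;            // exact match, cursor left on the key
        }
        return cmp < 0 ? 1 : -1; // 1: cursor on the closest smaller key; -1: key sorts before the cursor
    }

    public static void main(String[] args) {
        List<String> block = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            block.add(String.format("key%04d", i));
        }
        ToyBlockScanner seek = new ToyBlockScanner(block);
        ToyBlockScanner reseek = new ToyBlockScanner(block);
        // Look up every 10th key, in sorted order, with each strategy.
        for (int i = 0; i < 1000; i += 10) {
            String key = String.format("key%04d", i);
            seek.seekTo(key);
            reseek.reseekTo(key);
        }
        System.out.println("seekTo steps:   " + seek.steps());
        System.out.println("reseekTo steps: " + reseek.steps());
    }
}
```

In this toy model each seekTo() pays for the distance from the start of the block, while a run of reseekTo() calls over sorted keys walks the block at most once in total.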

We sort all the keys before any on-demand lookup so that we never need to move backward once the search for a given key completes: the next key is always lexicographically greater than the key just searched.
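
A small sketch of why the sort matters when the cursor only moves forward. `ForwardOnlyLookup` and `findAll` are hypothetical names, and the sorted list stands in for the keys of an HFile; this is a toy under those assumptions, not the Hudi or HBase code path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy forward-only cursor over sorted keys; a stand-in for reseekTo-style access. */
final class ForwardOnlyLookup {

    /** Returns the lookup keys that were found, using a cursor that never moves backward. */
    static List<String> findAll(List<String> sortedFileKeys, List<String> lookupKeys) {
        List<String> found = new ArrayList<>();
        int pos = 0; // cursor shared across all lookups
        for (String key : lookupKeys) {
            // Skip entries smaller than the target; the cursor only advances.
            while (pos < sortedFileKeys.size() && sortedFileKeys.get(pos).compareTo(key) < 0) {
                pos++;
            }
            if (pos < sortedFileKeys.size() && sortedFileKeys.get(pos).equals(key)) {
                found.add(key);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> fileKeys = Arrays.asList("a", "b", "c", "d", "e");
        List<String> unsorted = Arrays.asList("d", "b", "e");

        // The cursor is already past "b" after finding "d", so "b" is missed.
        System.out.println(findAll(fileKeys, unsorted)); // prints [d, e]

        List<String> sorted = new ArrayList<>(unsorted);
        Collections.sort(sorted);
        System.out.println(findAll(fileKeys, sorted));   // prints [b, d, e]
    }
}
```

With unsorted input a forward-only cursor silently skips keys that sort before its current position, which is why the keys have to be sorted before the lookups start.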

This patch changes the calls back to reseekTo(Key).
It also fixes another bug: a recent patch for RecordLevelIndex missed sorting the keys to look up in base files in HoodieBackedTableMetadata. That sorting is now in place. Sorting is also removed from the HFile data blocks, since the keys would otherwise be sorted repeatedly. The fix is to sort once per file slice within lookupKeysFromFileSlice, so that the individual lookups (the base file and each log file) do not need to sort.
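
The "sort once per file slice" idea can be sketched roughly as below. Only the name lookupKeysFromFileSlice comes from this patch; the signature, the `SortedKeyReader` interface, its `readSorted` method, and the rule that later readers override earlier ones are all assumptions made up for this illustration, not the actual Hudi code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of sorting keys once per file slice; all types here are hypothetical. */
final class FileSliceLookupSketch {

    /** Hypothetical reader contract: callers must pass keys that are already sorted. */
    interface SortedKeyReader {
        Map<String, String> readSorted(List<String> sortedKeys);
    }

    /**
     * Sorts the keys a single time, then hands the same sorted list to every reader
     * (for example the base file first, then each log file).
     */
    static Map<String, String> lookupKeysFromFileSlice(List<String> keys,
                                                       List<SortedKeyReader> readers) {
        List<String> sortedKeys = new ArrayList<>(keys);
        Collections.sort(sortedKeys);
        // Read-only view, so individual readers cannot reorder the shared list.
        List<String> shared = Collections.unmodifiableList(sortedKeys);

        Map<String, String> merged = new LinkedHashMap<>();
        for (SortedKeyReader reader : readers) {
            // Assumption of this sketch: later readers override earlier ones.
            merged.putAll(reader.readSorted(shared));
        }
        return merged;
    }
}
```

Naming the parameter `sortedKeys` at the reader boundary makes the precondition visible to each consumer, so none of them needs to sort again defensively.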

Fixed argument and variable names in downstream consumers to avoid ambiguity about whether the keys are sorted.

Impact

This should improve on-demand lookup with HFile by a large factor. In our micro-benchmarking, latency improved from tens of seconds to 100-200 ms (when thousands of keys are looked up in an HFile containing 100k entries).

Risk level (write none, low medium or high below)

medium

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan nsivabalan changed the title [HUDI-6420][WIP] Fixing seekTo() calls with hfile readers to reseek whereever applicable [HUDI-6420][WIP] Fixing seekTo() calls with hfile readers to reseek wherever applicable Jun 22, 2023
@nsivabalan nsivabalan added release-0.14.0 priority:blocker Production down; release blocker labels Jun 22, 2023
@nsivabalan nsivabalan changed the title [HUDI-6420][WIP] Fixing seekTo() calls with hfile readers to reseek wherever applicable [HUDI-6420] Fixing Hfile on-demand and prefix based reads to use optimized apis Jun 22, 2023

@danny0405 danny0405 left a comment

+1

@nsivabalan nsivabalan merged commit 45a5397 into apache:master Jun 27, 2023
