Fix double-writing #2

Open
shashi opened this issue Dec 27, 2017 · 0 comments

shashi commented Dec 27, 2017

Example situation:

  • A 1 GB table with 10 columns is created.
  • An operation creates a different object with the same 10 columns (e.g. rows(t)). MemPool thinks that this operation needs to free up 1 GB of space and starts evicting objects to disk.
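
To make the double counting concrete, here is a minimal Python sketch. It is not MemPool code: `naive_size`, `deduplicated_size`, and the dict-of-buffers "table" are hypothetical stand-ins, and the sizes are scaled down from 1 GB to keep it quick to run.

```python
# Hypothetical illustration, not MemPool code: two objects that share the
# same column buffers, sized naively versus by unique buffer identity.

def naive_size(obj):
    """Sum the size of every column, ignoring sharing with other objects."""
    return sum(len(col) for col in obj.values())

def deduplicated_size(objects):
    """Count each underlying column buffer once, keyed by object identity."""
    seen = {}
    for obj in objects:
        for col in obj.values():
            seen[id(col)] = len(col)
    return sum(seen.values())

# Ten columns of 1000 bytes each stand in for the 1 GB table.
columns = {f"c{i}": bytearray(1000) for i in range(10)}
table = dict(columns)
row_view = dict(columns)  # a different object holding the same buffers

naive_total = naive_size(table) + naive_size(row_view)
unique_total = deduplicated_size([table, row_view])
```

Under these assumptions the naive total is twice the unique total, which is the gap that makes the pool think it has to evict.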

The problem is that MemPool does not keep track of the individual vectors it writes to disk, so columns shared between objects are counted, and written, more than once.

Based on @tanmaykm's suggested fix:

  • Assign every vector an ID when it gets written to the wire or to disk using MemPool.
  • Keep a shared dictionary which maintains a ref-count of each vector using its ID.
  • When writing a vector to disk to evict it from working memory, store its file name and offset in the shared dictionary. If the vector has already been written, point to the previous file name and offset instead of writing the vector again into the spilled object.
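
The three steps above can be sketched in Python as follows. This is a single-process stand-in for the shared dictionary, not MemPool's API: `VectorRegistry`, `new_id`, `retain`, and `spill` are hypothetical names, and in-memory streams play the role of spill files.

```python
import io

class VectorRegistry:
    """Single-process stand-in for the shared dictionary (hypothetical)."""

    def __init__(self):
        self.refcount = {}   # vector ID -> number of objects referencing it
        self.location = {}   # vector ID -> (file name, offset, length)
        self._next_id = 0

    def new_id(self):
        """Hand out a fresh vector ID."""
        self._next_id += 1
        return self._next_id

    def retain(self, vid):
        """Record one more object that references this vector."""
        self.refcount[vid] = self.refcount.get(vid, 0) + 1

    def spill(self, vid, data, fname, stream):
        """Write data to stream unless vid is already on disk.

        Returns the (file name, offset, length) the spilled object should
        point to, which may be a location in an earlier file.
        """
        if vid not in self.location:
            offset = stream.tell()
            stream.write(data)
            self.location[vid] = (fname, offset, len(data))
        return self.location[vid]

registry = VectorRegistry()
vid = registry.new_id()
registry.retain(vid)  # referenced by the table
registry.retain(vid)  # referenced by the rows(t)-like object

table_file, rows_file = io.BytesIO(), io.BytesIO()
column = b"x" * 1000
first = registry.spill(vid, column, "table.bin", table_file)
second = registry.spill(vid, column, "rows.bin", rows_file)
```

The second spill returns the location recorded by the first and leaves `rows_file` empty, so the column is written once even though two objects reference it.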

This has a few problems:

  • A shared dictionary across a cluster is still not a thing.
  • When only one vector from a table is required, you still have to keep the whole file containing the table around, which involves pretty thorough bookkeeping. One solution is to write each vector into a separate file, but the resulting number of files can overwhelm a file system. Another solution is to do manual page management.
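
As a rough picture of the bookkeeping in the second point, the sketch below tracks which vectors in a spill file are still referenced, so the file becomes deletable only once the last one is released. `SpillFileTracker` and its methods are hypothetical and not part of MemPool.

```python
from collections import defaultdict

class SpillFileTracker:
    """Tracks which vectors in each spill file are still live (hypothetical)."""

    def __init__(self):
        self.live = defaultdict(set)  # file name -> IDs of live vectors

    def add(self, fname, vid):
        """Record that the vector with this ID is stored in fname."""
        self.live[fname].add(vid)

    def release(self, fname, vid):
        """Drop one vector; return True once the file holds no live vectors."""
        self.live[fname].discard(vid)
        if not self.live[fname]:
            del self.live[fname]
            return True
        return False

tracker = SpillFileTracker()
for vid in range(10):               # ten columns packed into one file
    tracker.add("table.bin", vid)

released = [tracker.release("table.bin", vid) for vid in range(9)]
still_needed = "table.bin" in tracker.live   # one column is still referenced
last = tracker.release("table.bin", 9)
```

A single live column keeps the whole file on disk here, which is the cost that per-vector files or manual page management would avoid.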