Skip to content

propensive/ulysses

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

52 Commits
 
 
 
 
 
 
 
 

Repository files navigation

GitHub Workflow

Ulysses

An implementation of Bloom filters for Scala

Bloom filters are useful when a probabilistic answer for whether a set contains a value is sufficient, particularly when the cost of storing every set element is prohibitive. Ulysses provides an immutable representation of a Bloom filter, with flexibility to tune its properties according to need.

Features

  • Provides a generic, immutable implementation of a Bloom filter
  • Can be configured to use a variety of different hash functions
  • Automatically determines Bloom filter parameters for target number of elements and error rate
  • Uses Gastronomy to derive hashes for primitive, product and sum types

Availability

Getting Started

About Bloom Filters

A Bloom filter provides a way to test if a value is probably in a set, or if it is certainly not in the set. That is, false positives for set membership are permitted, while false negatives are not. The memory required by a Bloom filter is dependent on the number of elements it should hold, and the probability of false positives when testing for set membership, and these parameters must be specified when the Bloom filter is created.

As well as determining the amount of memory the Bloom filter should use, the approximate number of elements and the acceptable false-positive rate determine the number of different hashes that will be used for each operation, with optimal values chosen for each, transparently.

The Bloom filter only has two core operations: adding a value to the set, and testing if a value is contained within the set. Since the Bloom filter does not actually store any values, there are no methods for retrieval.

Constructing a Bloom Filter

A new Bloom filter can be constructed with, for example:

import gastronomy.hashFunctions.crc32
val bloom = BloomFilter[Element](1000, 0.01)

This creates a new BloomFilter instance for storing elements of type Element, optimized for 1000 elements, with a target error rate of 1% (0.01). Additionally, the presence of the contextual value, hashFunctions.crc32 menas that the CRC32 hash function, defined in Gastronomy, will be used to calculate the hashes for addition and membership checks.

For creation, the Bloom filter additionally requires that its element types are digestible, that is, a contextual Digestible instance exists. Gastronomy provides Digestible instances for primitive types and will derive instances for product and coproduct types.

A newly-created BloomFilter is empty, but an element can be added with the + operator, or multiple elements with the ++ operator. Since BloomFilters are immutable, these will construct new BloomFilter instances.

Using a Bloom filter

BloomFilter also provides the method, mayContain, which takes an instance of the Bloom filter's type, and returns a Boolean. The interpretation of the result should not be mistaken: false means that the value is guaranteed not to be a member of the set represented by this Bloom filter, while true means that the value is probably a member of the set, but may not be.

For the Bloom filter above, constructed for approximately 1000 elements with an error rate of 1%, if it has, indeed, had 1000 elements added to it, then there is an estimated 1% chance that the mayContain method will return true for an element which has not been added. That false-positive probably will increase if significantly more elements are added to the Bloom filter, and would be smaller had significantly fewer elements been added.

Status

Ulysses is classified as fledgling. For reference, Soundness projects are categorized into one of the following five stability levels:

  • embryonic: for experimental or demonstrative purposes only, without any guarantees of longevity
  • fledgling: of proven utility, seeking contributions, but liable to significant redesigns
  • maturescent: major design decisions broady settled, seeking probatory adoption and refinement
  • dependable: production-ready, subject to controlled ongoing maintenance and enhancement; tagged as version 1.0.0 or later
  • adamantine: proven, reliable and production-ready, with no further breaking changes ever anticipated

Projects at any stability level, even embryonic projects, can still be used, as long as caution is taken to avoid a mismatch between the project's stability level and the required stability and maintainability of your own project.

Ulysses is designed to be small. Its entire source code currently consists of 76 lines of code.

Building

Ulysses will ultimately be built by Fury, when it is published. In the meantime, two possibilities are offered, however they are acknowledged to be fragile, inadequately tested, and unsuitable for anything more than experimentation. They are provided only for the necessity of providing some answer to the question, "how can I try Ulysses?".

  1. Copy the sources into your own project

    Read the fury file in the repository root to understand Ulysses's build structure, dependencies and source location; the file format should be short and quite intuitive. Copy the sources into a source directory in your own project, then repeat (recursively) for each of the dependencies.

    The sources are compiled against the latest nightly release of Scala 3. There should be no problem to compile the project together with all of its dependencies in a single compilation.

  2. Build with Wrath

    Wrath is a bootstrapping script for building Ulysses and other projects in the absence of a fully-featured build tool. It is designed to read the fury file in the project directory, and produce a collection of JAR files which can be added to a classpath, by compiling the project and all of its dependencies, including the Scala compiler itself.

    Download the latest version of wrath, make it executable, and add it to your path, for example by copying it to /usr/local/bin/.

    Clone this repository inside an empty directory, so that the build can safely make clones of repositories it depends on as peers of ulysses. Run wrath -F in the repository root. This will download and compile the latest version of Scala, as well as all of Ulysses's dependencies.

    If the build was successful, the compiled JAR files can be found in the .wrath/dist directory.

Contributing

Contributors to Ulysses are welcome and encouraged. New contributors may like to look for issues marked beginner.

We suggest that all contributors read the Contributing Guide to make the process of contributing to Ulysses easier.

Please do not contact project maintainers privately with questions unless there is a good reason to keep them private. While it can be tempting to repsond to such questions, private answers cannot be shared with a wider audience, and it can result in duplication of effort.

Author

Ulysses was designed and developed by Jon Pretty, and commercial support and training on all aspects of Scala 3 is available from Propensive OÜ.

Name

Ulysses is named after the novel by James Joyce, whose principal character is Leopold Bloom, a namesake of the creator of Bloom Filters, Burton H. Bloom.

In general, Soundness project names are always chosen with some rationale, however it is usually frivolous. Each name is chosen for more for its uniqueness and intrigue than its concision or catchiness, and there is no bias towards names with positive or "nice" meanings—since many of the libraries perform some quite unpleasant tasks.

Names should be English words, though many are obscure or archaic, and it should be noted how willingly English adopts foreign words. Names are generally of Greek or Latin origin, and have often arrived in English via a romance language.

Logo

The logo shows a blooming lotus flower, alluding to the Bloom filters that Ulysses provides.

License

Ulysses is copyright © 2024 Jon Pretty & Propensive OÜ, and is made available under the Apache 2.0 License.

About

An implementation of Bloom filters for Scala

Topics

Resources

Stars

Watchers

Forks

Languages