Use-cases #267

Closed
segeljakt opened this issue Jul 12, 2021 · 0 comments

segeljakt commented Jul 12, 2021

This issue is about finding new use-cases for arc-script.

Problem

Applications are needed to motivate the design of arc-script. For arc-script we have multiple levels of abstractions:

  • Applications (e.g., anomaly detection, data cleaning). These are very specialized abstractions built using libraries.
  • Libraries (e.g., custom operators, composite data types). These are more generalized abstractions built using language constructs.
  • Language (e.g., tasks, native data types). These are abstractions built using Rust and runtime primitives. Runtime primitives are not exposed at the arc-script level (only inside Rust).
  • Runtime (e.g., channels, streams, components). These are abstractions built using Kompact and Rust.

In the ideal case, an application is just about implementing an algorithm (such as PageRank). Generally, however, data must also be pre- and post-processed. Things get hairy when we also need to interact with other systems. For example, Alibaba's use case requires interaction with MySQL/HBase data sources and Druid data sinks. Zalando requires interaction with Kafka/S3 data sources.

Application requirements

TODO: List of requirements which our runtime/language must offer

Application non-requirements

TODO: List of requirements which our runtime/language cannot offer at the moment (and likely never will)

List of Applications

This is a continuously updated list of applications in data analytics which can help motivate the design of Arc-Script. I will fill in some details to explain the idea and requirements of each use case. Some use cases share similar implementation patterns, and some use cases are building blocks for others. Data analytics pipelines may be composed of many different algorithms, which makes it important to distinguish between what our language should provide and what should be possible to implement using our language.

Streaming Algorithms

Streaming algorithms have a well-established theory behind them (known bounds on time and space). However, their applications are very specific... too specific to cover the area of BigData analytics on their own. BigData analytics can, however, sometimes involve the use of streaming algorithms.

Requirements

Requirements of such algorithms are that they:

  • Must process data in a single-pass
  • Must process data in the order it arrives
  • Must process data with limited memory
  • Must process each event in a limited amount of time (keeping up with the inbound throughput)

Examples

Some examples from Wikipedia are Frequency Moments, Frequent Elements, Event Detection, Counting Distinct Elements, Feature Hashing, Stochastic Gradient Descent, Bloom filter, Count-min sketch, Locality-sensitive hashing, MinHash, SimHash, w-shingling. We should not delve into individual algorithms since these algorithms are rarely useful on their own in the broader picture.
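
Even so, a single concrete example helps pin down what "single pass" and "limited memory" mean operationally. Here is a minimal Rust sketch (plain Rust, not arc-script) of a Count-Min Sketch for approximate frequency counting; the depth/width parameters and the per-row hash seeding are illustrative choices:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Count-Min Sketch: estimates element frequencies in a single pass
/// using a fixed-size table (memory is bounded regardless of stream length).
struct CountMinSketch {
    depth: usize,          // number of hash functions (rows)
    width: usize,          // counters per row
    table: Vec<Vec<u64>>,  // depth x width counter matrix
}

impl CountMinSketch {
    fn new(depth: usize, width: usize) -> Self {
        Self { depth, width, table: vec![vec![0; width]; depth] }
    }

    fn index<T: Hash>(&self, item: &T, row: usize) -> usize {
        let mut hasher = DefaultHasher::new();
        row.hash(&mut hasher); // seed each row differently
        item.hash(&mut hasher);
        (hasher.finish() as usize) % self.width
    }

    /// Process one event (single pass, constant time per event).
    fn insert<T: Hash>(&mut self, item: &T) {
        for row in 0..self.depth {
            let col = self.index(item, row);
            self.table[row][col] += 1;
        }
    }

    /// Estimated count (never underestimates, may overestimate).
    fn estimate<T: Hash>(&self, item: &T) -> u64 {
        (0..self.depth)
            .map(|row| self.table[row][self.index(item, row)])
            .min()
            .unwrap_or(0)
    }
}

fn main() {
    let mut sketch = CountMinSketch::new(4, 1024);
    for word in ["a", "b", "a", "c", "a", "b"] {
        sketch.insert(&word);
    }
    println!("a ~ {}", sketch.estimate(&"a")); // ~3
}
```

The same single-pass, bounded-memory shape applies to the other sketch-style algorithms above (Bloom filter, MinHash, etc.).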

BigData Streaming

BigData streaming is about processing massive amounts of information in real-time. Analytics are both continuous and deep, and for this reason impose stricter requirements than general streaming algorithms.

Requirements

Requirements of continuous deep data analytics are as follows:

  • Applications are generally represented as pipelines which involve multiple stages of processing and analysis. A stage may, for example, implement a streaming algorithm
  • Data is timestamped (algorithms often revolve around time)
  • Data items arrive out-of-order with respect to their timestamps
  • Data can come from multiple sources, each with a different throughput
  • Data often needs to be pre-processed and enriched
  • Data can be partitioned by-key
  • Data is continuously generated (algorithms never terminate)
  • Applications can contain iterative loops which feed back data for further refinement
  • Results should ideally be reproducible
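
To make the timestamping and out-of-order requirements concrete, here is a minimal Rust sketch (plain Rust, not arc-script, and not how our runtime handles it) that buffers timestamped events and releases them in event-time order once a watermark has passed them. The fixed allowed-lateness parameter and the drop-late-events policy are illustrative assumptions; real systems typically route late events to a side output instead:

```rust
use std::collections::BTreeMap;

/// A timestamped event (event time in milliseconds).
#[derive(Debug)]
struct Event { ts: u64, key: String, value: f64 }

/// Buffers out-of-order events and releases them in timestamp order once a
/// watermark (highest timestamp seen so far minus a fixed allowed lateness)
/// has passed them. Events older than the watermark on arrival are dropped.
struct Reorderer {
    allowed_lateness: u64,
    max_ts: u64,
    buffer: BTreeMap<u64, Vec<Event>>,
}

impl Reorderer {
    fn new(allowed_lateness: u64) -> Self {
        Self { allowed_lateness, max_ts: 0, buffer: BTreeMap::new() }
    }

    fn watermark(&self) -> u64 {
        self.max_ts.saturating_sub(self.allowed_lateness)
    }

    /// Push one event; returns any events that are now safe to emit in order.
    fn push(&mut self, event: Event) -> Vec<Event> {
        if event.ts < self.watermark() {
            return Vec::new(); // too late, drop (a side output would go here)
        }
        self.max_ts = self.max_ts.max(event.ts);
        self.buffer.entry(event.ts).or_default().push(event);

        // Emit everything strictly below the (possibly advanced) watermark.
        let wm = self.watermark();
        let keep = self.buffer.split_off(&wm);
        let mut ready = Vec::new();
        for (_, mut evs) in std::mem::replace(&mut self.buffer, keep) {
            ready.append(&mut evs);
        }
        ready
    }
}

fn main() {
    let mut reorderer = Reorderer::new(10);
    for (ts, v) in [(5, 1.0), (17, 2.0), (9, 3.0), (30, 4.0)] {
        let out = reorderer.push(Event { ts, key: "sensor-1".into(), value: v });
        for e in out {
            println!("emit ts={} key={} value={}", e.ts, e.key, e.value);
        }
    }
}
```

The point here is only the kind of per-operator state and triggering that out-of-order, event-time data forces on us; per-source watermarks and reproducibility add further constraints on top.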

Examples

The following is a list of examples of BigData streaming; illustrative Rust sketches of several of these use cases are given after the list:

  • Flink Blog Demo: PageRank on Flink
    • Specific requirements: 1) Ability to parse strings into adjacency lists (vector of integers). 2) Ability to join two datasets (ranks with adjacency lists).
  • Flink Blog Demo: Fraud Detection on Flink
    • “Whenever the sum of the accumulated payment amount from the same payer to the same beneficiary within the duration of a week is greater than 1 000 000 $ - fire an alert.”

    • “Whenever the sum of payments from the same payer to the same beneficiary within a 24 hour period is greater than 200 000 $ - trigger an alert.”

    • Specific requirements: 1) Ability to broadcast events to all operator instances (which is used to dynamically update the rules for deciding whether a transaction is fraudulent without recompilation). The rules are in other words an input stream of control events to the system relayed from a Kafka topic. 2) Ability to partition events by key. The key is the payer and beneficiary. 3) Ability to forward events (reusing the same key). 4) Sliding windows. 5) Managed state.
  • Course: Machine Learning with Graphs
  • StreamQL Paper Demo: Arterial Blood Pressure Pulse Detection
    • Arterial Blood Pressure (ABP) pulse detection [O’Rourke 1971; Zong et al. 2003] is a complex streaming computation, and is difficult to express with existing languages for stream processing. The use of a streaming query language for medical monitoring applications has been considered in [Abbas et al. 2018, 2019].

    • The ABP signal is collected from the MIT-BIH Polysomnographic database [Ichimaru and Moody 1999]. The signal measurements are of type VT = {val: V, ts: T}, where val is the value of the signal and ts is the timestamp. The signal is uniformly sampled at a frequency of 250Hz. (...) The ABP waveform contains rich information about the cardiovascular system (e.g., heart rate, systolic, mean, and diastolic arterial pressures). Reliable ABP pulse detection is crucial for extracting this information.

    • First, the algorithm preprocesses the signal stream using a low-pass IIR filter and a slope sum function (SSF), and then it performs the detection of the pulse onset.

    • The low-pass filter IIR suppresses high frequency noise, and is defined by y(n) = 2y(n−1) − y(n−2) + x(n) − 2x(n−5) + x(n−10). The SSF is defined by z(n) = Σ_{0 ≤ i ≤ 31} max(0, d(n−i)), where d(n) = y(n) − y(n−1). It enhances the up-slope of the ABP pulse and restrains the remainder of the pressure waveform. The query getVTP : Q(VT, VTP) annotates each item {val, ts} of the input stream with an additional component pval, which is the result of the preprocessing. The type VTP = {val: V, ts: T, pval: V} extends VT with this additional component. These preprocessed values have a phase shift of 20ms (5 samples), which is introduced by low-pass filtering.

    • The detection of ABP onset is described by the following rules:

      • R1. In intervals where the SSF value exceeds a threshold Thred (i.e. a tentative pulse), the algorithm selects the first and the maximum SSF values.
      • R2. The pulse detection is accepted only if the difference between the first and the maximum SSF values exceeds 100.
      • R3. When the pulse is accepted, the algorithm chooses the first sample that crosses the threshold as the onset point. The detected onset is adjusted by 20ms (5 samples) to compensate for the phase shift of low-pass filtering.
      • R4. After an onset is detected, to avoid double detection of the same pulse, the detection falls silent for 300ms.
  • StreamQL Paper Demo: Signal Smoothing
    • "Assume that the input stream consists of signal measurements of type V (integer type) which are collected at a fixed frequency. We will consider a computation that is the composition of a smoothing filter and calculating the derivative. We use a low-pass filter to smooth the input into results f : F (floating point type), where f = (v1 + 2*v2 + 4*v3 + 2*v4 + v5)/10 for each five consecutive input items v1, v2, ..., v5. Then, we compute the derivative d : D (floating point type) where d = f2 − f1 for every two consecutive smoothed values."

  • StreamQL Paper Demo: Peak Detection
    • "Now, let us consider an algorithm for detect peaks in a stream of numerical values (suppose they are of type V). The algorithm searches for the first value that exceeds the threshold THRESH. Then, it search for the maximum over the next #PEAK_CNT elements, which is considered a peak. After that, the algorithm silences detection for #SILENCE_CNT elements to avoid a duplicate detection. This process is repeated indefinitely in order to detect all peaks."

Complex Event Processing Algorithms

In the words of Esper, "Complex Event Processing" is about analysing events to find situations of interest. CEP detects and derives information, which can be reacted to by deciding and doing an action. This is known as the 4D model (Detect-Derive-Decide-Do).

An example situation to be detected is: A suspicious account is derived whenever there are at least three large cash deposits in the last 15 days.

  • The "Detect" is about the raw event, for example a cash deposit event.
  • The "Derive" is about the situation, i.e. "did something happen?", for example there is a suspicious account.
  • The "Decide" is about the decision of what to do, for example the decision to determine a risk score or determine another course of action
  • The "Do" is the action, for example an action that opens an investigation

Resources
