# Pravega Python DataStream connector

This Pravega Python DataStream connector provides a data source and data sink for Flink streaming jobs.

Your Flink streaming jobs can use Pravega as their storage layer through these [Python API Wrappers](https://github.com/pravega/flink-connectors/tree/master/src/main/python). This page only describes the API usage; for parameter concepts, please refer to [Configurations](configurations.md) and [Streaming](streaming.md).

**DISCLAIMER: This Python wrapper is an IMPLEMENTATION REFERENCE and is not officially published.**

* [How to use](#how-to-use)
* [PravegaConfig](#pravegaconfig)
* [StreamCut](#streamcut)
* [FlinkPravegaReader](#flinkpravegareader)
* [FlinkPravegaWriter](#flinkpravegawriter)
* [Metrics](#metrics)
* [Serialization](#serialization)

## How to use

Together with the connector jar and the Python wrapper files, you can submit your job with main compute code like this:

```bash
flink run --python ./application.py --pyFiles <connector-repo>/src/main/python/ --jarfile /path/to/pravega-connectors-flink.jar
```
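
For reference, a minimal `application.py` submitted this way could look like the sketch below: it reads string events from a Pravega stream and prints them. The controller URI, scope, and stream name are placeholders, not values from this repository.

```python
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment

from pravega_config import PravegaConfig
from pravega_reader import FlinkPravegaReader

# Placeholder connection settings -- adjust them to your deployment.
uri = "tcp://localhost:9090"
scope = "my-scope"
stream = "my-stream"

env = StreamExecutionEnvironment.get_execution_environment()

pravega_config = PravegaConfig(uri=uri, scope=scope)
pravega_reader = FlinkPravegaReader(
    stream=stream,
    pravega_config=pravega_config,
    deserialization_schema=SimpleStringSchema())

# Read from Pravega, print each event, then run the job.
env.add_source(pravega_reader).print()
env.execute("pravega-python-example")
```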

## PravegaConfig

A top-level config object, `PravegaConfig`, is provided to establish a Pravega context for the Flink connector.

```python
from pravega_config import PravegaConfig

pravega_config = PravegaConfig(uri=uri, scope=scope)
```

|parameter|type|required|default value|description|
|-|-|-|-|-|
|uri|str|Yes|N/A|The Pravega controller RPC URI.|
|scope|str|Yes|N/A|The self-defined Pravega scope.|
|trust_store|str|No|None|The truststore for TLS connections to the Pravega controller.|
|default_scope|str|No|None|The default Pravega scope, used to resolve unqualified stream names and to support reader groups.|
|credentials|DefaultCredentials|No|None|The Pravega credentials to use.|
|validate_hostname|bool|No|True|Whether to enable TLS hostname validation.|
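
The optional parameters are passed as keyword arguments. Below is a sketch of a configuration for a TLS-enabled deployment; the URI, scope, and truststore path are placeholder values, not settings from this repository.

```python
from pravega_config import PravegaConfig

pravega_config = PravegaConfig(
    uri="tls://localhost:9090",      # placeholder controller URI
    scope="my-scope",                # placeholder scope
    default_scope="my-scope",        # resolve unqualified stream names against this scope
    trust_store="/path/to/ca.pem",   # placeholder truststore for the TLS connection
    validate_hostname=False)         # e.g. when testing with self-signed certificates
```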

## StreamCut

A `StreamCut` object can be constructed with the `from_base64` class method, which takes a base64 str as its only parameter.

By default, the `FlinkPravegaReader` passes the `UNBOUNDED` `StreamCut`, which lets the reader read from the HEAD of the stream to its TAIL.
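
For example, a position captured elsewhere (such as from a Pravega checkpoint) can be restored from its base64 representation and handed to the reader as a start position. This is only a sketch: the `StreamCut` import path is an assumption, and `base64_str`, `stream`, and `pravega_config` are placeholders.

```python
from pyflink.common.serialization import SimpleStringSchema

from pravega_reader import FlinkPravegaReader, StreamCut  # StreamCut import path assumed

# base64_str holds the base64 representation of a previously captured position.
start = StreamCut.from_base64(base64_str)

pravega_reader = FlinkPravegaReader(
    stream=stream,
    pravega_config=pravega_config,
    deserialization_schema=SimpleStringSchema(),
    start_stream_cut=start)
```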

## FlinkPravegaReader

Use `FlinkPravegaReader` as a DataStream source. It can be added with `env.add_source`.

```python
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment

from pravega_config import PravegaConfig
from pravega_reader import FlinkPravegaReader

env = StreamExecutionEnvironment.get_execution_environment()

pravega_config = PravegaConfig(uri=uri, scope=scope)
pravega_reader = FlinkPravegaReader(
    stream=stream,
    pravega_config=pravega_config,
    deserialization_schema=SimpleStringSchema())

ds = env.add_source(pravega_reader)
```

|parameter|type|required|default value|description|
|-|-|-|-|-|
|stream|Union[str, Stream]|Yes|N/A|The stream to be read from.|
|pravega_config|PravegaConfig|Yes|N/A|The Pravega client configuration, which includes connection info, security info, and a default scope.|
|deserialization_schema|DeserializationSchema|Yes|N/A|The deserialization schema which describes how to turn byte messages into events.|
|start_stream_cut|StreamCut|No|StreamCut.UNBOUNDED|Read from the given start position in the stream.|
|end_stream_cut|StreamCut|No|StreamCut.UNBOUNDED|Read to the given end position in the stream.|
|enable_metrics|bool|No|True|Whether to enable Pravega reader metrics.|
|uid|str|No|None (randomly generated uid on the Java side)|The uid to identify the checkpoint state of this source.|
|reader_group_scope|str|No|pravega_config.default_scope|The scope to store the Reader Group synchronization stream into.|
|reader_group_name|str|No|None (auto-generated name on the Java side)|The Reader Group name for display purposes.|
|reader_group_refresh_time|timedelta|No|None (3 seconds on the Java side)|The interval for synchronizing the Reader Group state across parallel source instances.|
|checkpoint_initiate_timeout|timedelta|No|None (5 seconds on the Java side)|The timeout for executing a checkpoint of the Reader Group state.|
|event_read_timeout|timedelta|No|None (1 second on the Java side)|The timeout for the call to read events from Pravega. If the timeout expires without an event being returned, another call is made.|
|max_outstanding_checkpoint_request|int|No|None (3 on the Java side)|The maximum number of outstanding checkpoint requests to Pravega.|
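
The `timedelta`-typed options take plain `datetime.timedelta` values. The sketch below combines several of the optional parameters; the uid, Reader Group name, and timeout values are illustrative assumptions, and `stream` and `pravega_config` are placeholders.

```python
from datetime import timedelta

from pyflink.common.serialization import SimpleStringSchema

from pravega_reader import FlinkPravegaReader

pravega_reader = FlinkPravegaReader(
    stream=stream,
    pravega_config=pravega_config,
    deserialization_schema=SimpleStringSchema(),
    uid="pravega-source",                       # stable uid for checkpoint state (illustrative)
    reader_group_name="flink-reader-group",     # display name (illustrative)
    reader_group_refresh_time=timedelta(seconds=1),
    event_read_timeout=timedelta(seconds=2),
    max_outstanding_checkpoint_request=5)
```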

## FlinkPravegaWriter

Use `FlinkPravegaWriter` as a DataStream sink. It can be added to a `DataStream` with `add_sink`.

```python
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment

from pravega_config import PravegaConfig
from pravega_writer import FlinkPravegaWriter

env = StreamExecutionEnvironment.get_execution_environment()

pravega_config = PravegaConfig(uri=uri, scope=scope)
pravega_writer = FlinkPravegaWriter(stream=stream,
                                    pravega_config=pravega_config,
                                    serialization_schema=SimpleStringSchema())

# ds is an existing DataStream of str events, e.g. the result of env.add_source(...)
ds.add_sink(pravega_writer)
```

|parameter|type|required|default value|description|
|-|-|-|-|-|
|stream|Union[str, Stream]|Yes|N/A|The stream to be written to.|
|pravega_config|PravegaConfig|Yes|N/A|The Pravega client configuration, which includes connection info, security info, and a default scope.|
|serialization_schema|SerializationSchema|Yes|N/A|The serialization schema which describes how to turn events into byte messages.|
|enable_metrics|bool|No|True|Whether to enable Pravega writer metrics.|
|writer_mode|PravegaWriterMode|No|PravegaWriterMode.ATLEAST_ONCE|The writer mode to provide *Best-effort*, *At-least-once*, or *Exactly-once* guarantees.|
|enable_watermark|bool|No|False|Emit Flink watermarks in event-time semantics to Pravega streams.|
|txn_lease_renewal_period|timedelta|No|None (30 seconds on the Java side)|The transaction lease renewal period for the *Exactly-once* writer mode.|
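
For *Exactly-once* guarantees the writer uses Pravega transactions, whose lease can be tuned with `txn_lease_renewal_period`. The sketch below assumes the `PravegaWriterMode` enum (with an `EXACTLY_ONCE` member) is importable from the `pravega_writer` wrapper module; `stream` and `pravega_config` are placeholders.

```python
from datetime import timedelta

from pyflink.common.serialization import SimpleStringSchema

from pravega_writer import FlinkPravegaWriter, PravegaWriterMode  # import path assumed

pravega_writer = FlinkPravegaWriter(
    stream=stream,
    pravega_config=pravega_config,
    serialization_schema=SimpleStringSchema(),
    writer_mode=PravegaWriterMode.EXACTLY_ONCE,       # member name assumed
    txn_lease_renewal_period=timedelta(seconds=60),   # longer lease for slow checkpoints
    enable_watermark=True)
```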

## Metrics

Metrics are reported by default unless they are explicitly disabled with the `enable_metrics=False` option. See the [Metrics](metrics.md) page for more details on the types of metrics that are reported.

## Serialization

See the [Data Types](https://ci.apache.org/projects/flink/flink-docs-stable/docs/dev/python/datastream/data_types/) page of PyFlink for more information.
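
For example, JSON events can be read with PyFlink's built-in `JsonRowDeserializationSchema` instead of `SimpleStringSchema`. The row layout below is an illustrative assumption, and `stream` and `pravega_config` are placeholders.

```python
from pyflink.common.serialization import JsonRowDeserializationSchema
from pyflink.common.typeinfo import Types

from pravega_reader import FlinkPravegaReader

# Illustrative event layout: {"id": <int>, "name": <str>}
deserialization_schema = JsonRowDeserializationSchema.builder() \
    .type_info(Types.ROW_NAMED(["id", "name"], [Types.INT(), Types.STRING()])) \
    .build()

pravega_reader = FlinkPravegaReader(
    stream=stream,
    pravega_config=pravega_config,
    deserialization_schema=deserialization_schema)
```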