
Commit 7ee5d40

oeyhan and natebower authored
Add rds source doc for Data Prepper (#10781)
* Add rds source doc

Signed-off-by: Hai Yan <[email protected]>

* Apply suggestions from code review

Signed-off-by: Nathan Bower <[email protected]>

---------

Signed-off-by: Hai Yan <[email protected]>
Signed-off-by: Nathan Bower <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
1 parent 69c7026 commit 7ee5d40

File tree

1 file changed: 265 additions, 0 deletions

  • _data-prepper/pipelines/configuration/sources
@@ -0,0 +1,265 @@
---
layout: default
title: rds
parent: Sources
grand_parent: Pipelines
nav_order: 95
---

# rds

The `rds` source enables change data capture (CDC) on [Amazon Relational Database Service (Amazon RDS)](https://aws.amazon.com/rds/) and [Amazon Aurora](https://aws.amazon.com/aurora/) databases. It can receive database events, such as `INSERT`, `UPDATE`, or `DELETE`, using database replication logs and supports an initial load using RDS exports to Amazon Simple Storage Service (Amazon S3).

The source supports the following database engines:

- Aurora MySQL and Aurora PostgreSQL
- RDS MySQL and RDS PostgreSQL

The source provides the following two options for ingesting data from Aurora or Amazon RDS databases:

1. Export: A full export from the database to Amazon S3 provides an initial load of the current state of the database.
2. Stream: Events are streamed from the database replication logs (the MySQL binlog or the PostgreSQL write-ahead log (WAL)).
## Usage

The following example pipeline specifies an `rds` source. It ingests data from an Aurora MySQL cluster:

```yaml
version: "2"
rds-pipeline:
  source:
    rds:
      db_identifier: "my-rds-instance"
      engine: "aurora-mysql"
      database: "mydb"
      authentication:
        username: "myuser"
        password: "mypassword"
      s3_bucket: "my-export-bucket"
      s3_region: "us-west-2"
      s3_prefix: "rds-exports"
      export:
        kms_key_id: "arn:aws:kms:us-west-2:123456789012:key/12345678-1234-1234-1234-123456789012"
        export_role_arn: "arn:aws:iam::123456789012:role/rds-export-role"
      stream: true
      aws:
        region: "us-west-2"
        sts_role_arn: "arn:aws:iam::123456789012:role/my-pipeline-role"
```
## Configuration options

The following tables describe the configuration options for the `rds` source.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`db_identifier` | Yes | String | The identifier for the RDS instance or Aurora cluster.
`cluster` | No | Boolean | Whether the `db_identifier` refers to a cluster (`true`) or an instance (`false`). Default is `false`. For Aurora engines, this option is always `true`.
`engine` | Yes | String | The database engine type. Must be one of `mysql`, `postgresql`, `aurora-mysql`, or `aurora-postgresql`.
`database` | Yes | String | The name of the database to connect to.
`tables` | No | Object | The configuration for specifying which tables to include or exclude. See [tables](#tables) for more information.
`authentication` | Yes | Object | Database authentication credentials. See [authentication](#authentication) for more information.
`aws` | Yes | Object | The AWS configuration. See [aws](#aws) for more information.
`acknowledgments` | No | Boolean | When `true`, enables the source to receive [end-to-end acknowledgments]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/pipelines#end-to-end-acknowledgments) when events are received by OpenSearch sinks. Default is `true`.
`s3_data_file_acknowledgment_timeout` | No | Duration | The amount of time that elapses before the data read from an RDS export expires when used with acknowledgments. Default is 30 minutes.
`stream_acknowledgment_timeout` | No | Duration | The amount of time that elapses before the data read from database streams expires when used with acknowledgments. Default is 10 minutes.
`s3_bucket` | Yes | String | The name of the S3 bucket in which RDS export data will be stored.
`s3_prefix` | No | String | The prefix for S3 objects in the export bucket.
`s3_region` | No | String | The AWS Region for the S3 bucket. If not specified, uses the same Region as specified in the [aws](#aws) configuration.
`partition_count` | No | Integer | The number of folder partitions in the S3 buffer. Must be between 1 and 1,000. Default is 100.
`export` | No | Object | The configuration for RDS export operations. See [export](#export-options) for more information.
`stream` | No | Boolean | Whether to enable streaming of database change events. Default is `false`.
`tls` | No | Object | The TLS configuration for database connections. See [tls](#tls-options) for more information.
`disable_s3_read_for_leader` | No | Boolean | Whether to disable S3 read operations for the leader node. Default is `false`.
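As a sketch of how the top-level options fit together, the following fragment enables end-to-end acknowledgments, adjusts the acknowledgment timeouts, and increases the number of S3 buffer partitions. The identifier, bucket, role ARN, and timeout values are placeholder assumptions, and the durations are written in ISO 8601 notation:

```yaml
source:
  rds:
    db_identifier: "my-rds-instance"
    engine: "aurora-mysql"
    database: "mydb"
    authentication:
      username: "myuser"
      password: "mypassword"
    s3_bucket: "my-export-bucket"
    # Receive end-to-end acknowledgments from the OpenSearch sink.
    acknowledgments: true
    # Allow more time before unacknowledged export and stream data expires.
    s3_data_file_acknowledgment_timeout: "PT45M"
    stream_acknowledgment_timeout: "PT15M"
    # Spread buffered export data across more S3 folder partitions (1-1,000).
    partition_count: 200
    stream: true
    aws:
      region: "us-west-2"
      sts_role_arn: "arn:aws:iam::123456789012:role/my-pipeline-role"
```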
### aws

Use the following options in the AWS configuration.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`region` | No | String | The AWS Region to use for credentials. Defaults to the [standard SDK behavior for determining the Region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html).
`sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon RDS and Amazon S3. Defaults to `null`, which will use the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html).
`sts_external_id` | No | String | The external ID to use when assuming the STS role. Must be between 2 and 1,224 characters.
`sts_header_overrides` | No | Map | A map of header overrides that the AWS Identity and Access Management (IAM) role assumes for the source plugin. Maximum of 5 headers.
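For example, if the pipeline role must be assumed with an external ID, the `aws` block might look like the following sketch. The ARN, external ID, and header name and value are placeholders:

```yaml
aws:
  region: "us-west-2"
  sts_role_arn: "arn:aws:iam::123456789012:role/my-pipeline-role"
  # External ID required by the role's trust policy (2-1,224 characters).
  sts_external_id: "my-external-id"
  # Up to 5 header overrides applied when assuming the role.
  sts_header_overrides:
    CustomHeader: "custom-value"
```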
### authentication

Use the following options for database authentication.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`username` | Yes | String | The database username for authentication.
`password` | Yes | String | The database password for authentication.
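Storing literal credentials in pipeline YAML is generally discouraged. If your Data Prepper installation has the AWS secrets extension configured, you can reference a secret instead, as in the following sketch. The secret configuration name (`rds-secret`) and its keys are assumptions for illustration:

```yaml
authentication:
  # Resolved at runtime from the secret registered as "rds-secret"
  # in the Data Prepper AWS secrets extension configuration.
  username: ${{aws_secrets:rds-secret:username}}
  password: ${{aws_secrets:rds-secret:password}}
```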
### tables

Use the following options to specify which tables to include in the data capture.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`include` | No | List | A list of table names to include in data capture. Maximum of 1,000 tables. If specified, only these tables will be processed.
`exclude` | No | List | A list of table names to exclude from data capture. Maximum of 1,000 tables. These tables will be ignored even if they match include patterns.
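For example, a `tables` block might look like the following sketch. The table names are placeholders, and whether names must be fully qualified (for example, prefixed with the database or schema name) depends on your engine and setup:

```yaml
tables:
  include:
    - "mydb.orders"
    - "mydb.customers"
  exclude:
    - "mydb.audit_log"
```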
### export options

The following options let you customize the RDS export functionality.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`kms_key_id` | Yes | String | The AWS Key Management Service (AWS KMS) key ID or Amazon Resource Name (ARN) to use for encrypting the export data.
`export_role_arn` | Yes | String | The ARN of the IAM role that RDS will assume to perform the export operation.
### tls options

The following options let you configure TLS for database connections.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`insecure` | No | Boolean | Whether to disable TLS encryption for database connections. Default is `false` (TLS enabled).
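For example, to connect to a test database that does not support TLS, you can disable it as shown in the following sketch. Leaving TLS enabled is recommended for production workloads:

```yaml
tls:
  # Disables TLS for the database connection; use only for testing.
  insecure: true
```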
## Exposed metadata attributes

The following metadata will be added to each event that is processed by the `rds` source. These metadata attributes can be accessed using the [expression syntax `getMetadata` function]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/get-metadata/).

* `primary_key`: The primary key of the database record. For tables with composite primary keys, the values are concatenated with a `|` separator.
* `event_timestamp`: The timestamp, in epoch milliseconds, of when the database change occurred. For export events, this represents the export time. For stream events, this represents the transaction commit time.
* `document_version`: A long integer generated from the event timestamp to use as the document version.
* `opensearch_action`: The bulk action that will be used to send the event to OpenSearch, such as `index` or `delete`.
* `change_event_type`: The stream event type. Can be `insert`, `update`, or `delete`.
* `table_name`: The name of the database table from which the event originated.
* `schema_name`: The name of the schema from which the event originated. For MySQL, `schema_name` is the same as `database_name`.
* `database_name`: The name of the database from which the event originated.
* `ingestion_type`: Indicates whether the event originated from an export or a stream. Valid values are `EXPORT` and `STREAM`.
* `s3_partition_key`: The S3 partition key for the event. Events are staged in an S3 bucket before processing; this attribute indicates the location of the staged event in that bucket.
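These attributes are typically referenced in the sink configuration to route, version, and delete documents. The following is a minimal sketch of an `opensearch` sink that uses them; the host and index name are placeholders, and the sink options shown are assumptions based on the general `opensearch` sink configuration rather than part of the `rds` source itself:

```yaml
sink:
  - opensearch:
      hosts: ["https://search-mydomain.us-west-2.es.amazonaws.com"]
      index: "rds-index"
      # Use the record's primary key as the document ID so that updates and
      # deletes apply to the same document.
      document_id: "${getMetadata(\"primary_key\")}"
      # Apply the bulk action (index or delete) chosen by the rds source.
      action: "${getMetadata(\"opensearch_action\")}"
      # Use external versioning so that out-of-order events do not overwrite
      # newer versions of a document.
      document_version: "${getMetadata(\"document_version\")}"
      document_version_type: "external"
```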
## Permissions

The following are the required permissions for running `rds` as a source:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "allowReadingFromS3Buckets",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::s3_bucket",
        "arn:aws:s3:::s3_bucket/*"
      ]
    },
    {
      "Sid": "AllowDescribeInstances",
      "Effect": "Allow",
      "Action": [
        "rds:DescribeDBInstances"
      ],
      "Resource": [
        "arn:aws:rds:region:account-id:db:*"
      ]
    },
    {
      "Sid": "AllowDescribeClusters",
      "Effect": "Allow",
      "Action": [
        "rds:DescribeDBClusters"
      ],
      "Resource": [
        "arn:aws:rds:region:account-id:cluster:*"
      ]
    },
    {
      "Sid": "AllowSnapshots",
      "Effect": "Allow",
      "Action": [
        "rds:DescribeDBClusterSnapshots",
        "rds:CreateDBClusterSnapshot",
        "rds:DescribeDBSnapshots",
        "rds:CreateDBSnapshot",
        "rds:AddTagsToResource"
      ],
      "Resource": [
        "arn:aws:rds:region:account-id:cluster:*",
        "arn:aws:rds:region:account-id:cluster-snapshot:*",
        "arn:aws:rds:region:account-id:db:*",
        "arn:aws:rds:region:account-id:snapshot:*"
      ]
    },
    {
      "Sid": "AllowExport",
      "Effect": "Allow",
      "Action": [
        "rds:StartExportTask"
      ],
      "Resource": [
        "arn:aws:rds:region:account-id:cluster:*",
        "arn:aws:rds:region:account-id:cluster-snapshot:*",
        "arn:aws:rds:region:account-id:snapshot:*"
      ]
    },
    {
      "Sid": "AllowDescribeExports",
      "Effect": "Allow",
      "Action": [
        "rds:DescribeExportTasks"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowAccessToKmsForExport",
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt",
        "kms:Encrypt",
        "kms:DescribeKey",
        "kms:RetireGrant",
        "kms:CreateGrant",
        "kms:ReEncrypt*",
        "kms:GenerateDataKey*"
      ],
      "Resource": [
        "arn:aws:kms:region:account-id:key/export-key-id"
      ]
    },
    {
      "Sid": "AllowPassingExportRole",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": [
        "arn:aws:iam::account-id:role/export-role"
      ]
    }
  ]
}
```
## Metrics

The `rds` source includes the following metrics:

* `exportJobSuccess`: The number of RDS export tasks that have succeeded.
* `exportJobFailure`: The number of RDS export tasks that have failed.
* `exportS3ObjectsTotal`: The total number of export data files found in S3.
* `exportS3ObjectsProcessed`: The total number of export data files that have been processed successfully from S3.
* `exportS3ObjectsErrors`: The total number of export data files that have failed to be processed from S3.
* `exportRecordsTotal`: The total number of records found in the export.
* `exportRecordsProcessed`: The total number of export records that have been processed successfully.
* `exportRecordsProcessingErrors`: The number of export record processing errors.
* `changeEventsProcessed`: The number of change events processed from database streams.
* `changeEventsProcessingErrors`: The number of processing errors for change events from database streams.
* `bytesReceived`: The total number of bytes received by the source.
* `bytesProcessed`: The total number of bytes processed by the source.
* `positiveAcknowledgementSets`: The number of acknowledgment sets that are positively acknowledged during stream processing.
* `negativeAcknowledgementSets`: The number of acknowledgment sets that are negatively acknowledged during stream processing.
* `checkpointCount`: The total number of checkpoints in stream processing.
* `noDataExtendLeaseCount`: The number of times that the lease is extended on a partition with no new data processed since the last checkpoint.
* `giveupPartitionCount`: The number of times that a partition is given up.
* `replicationLogEntryProcessingTime`: The time taken to process a replication log event.
* `replicationLogEntryProcessingErrors`: The number of replication log events that have failed to be processed.
