---
post_title: Integration with HDFS and S3
nav_title: HDFS
menu_order: 20
enterprise: 'no'
---
# HDFS
If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that should be included on Spark's classpath: `hdfs-site.xml`, which provides default behaviors for the HDFS client, and `core-site.xml`, which sets the default filesystem name. You can specify the location of these files at install time or for each job.
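For reference, once these files are on Spark's classpath the job itself does not need to spell out the namenode address. A minimal Scala sketch, in which the application name and the `/data/...` paths are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

object HdfsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hdfs-example").getOrCreate()

    // core-site.xml supplies fs.defaultFS, so bare paths resolve to the HDFS cluster.
    val lines = spark.read.textFile("/data/input.txt")

    // Write the non-empty lines back to HDFS.
    lines.filter(_.nonEmpty).write.text("/data/output")

    spark.stop()
  }
}
```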
## Spark Installation
Within the Spark service configuration, set `hdfs.config-url` to a URL that serves your `hdfs-site.xml` and `core-site.xml`. Use the following example, where `http://mydomain.com/hdfs-config/hdfs-site.xml` and `http://mydomain.com/hdfs-config/core-site.xml` are valid URLs:
```json
{
  "hdfs": {
    "config-url": "http://mydomain.com/hdfs-config"
  }
}
```
This can also be done through the UI. If you are using the default installation of HDFS from Mesosphere, this is probably `http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints`.
## Adding HDFS configuration files per-job
To add the configuration files manually for a job, use `--conf spark.mesos.uris=<location_of_hdfs-site.xml>,<location_of_core-site.xml>`. This downloads the files to the sandbox of the Spark driver application, and DC/OS Spark automatically loads them into the correct location. **Note:** It is important that these files are named `hdfs-site.xml` and `core-site.xml`.
### Spark Checkpointing
To use Spark with checkpointing, make sure you follow the instructions [here](https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing) and use an HDFS directory as the checkpointing directory. For example:
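A minimal Scala sketch, assuming an HDFS checkpoint path such as `hdfs://hdfs/checkpoint` and an illustrative socket source:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedApp {
  // Illustrative checkpoint directory on HDFS.
  val checkpointDirectory = "hdfs://hdfs/checkpoint"

  // Builds a fresh StreamingContext and registers the checkpoint directory.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpointed-app")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDirectory)

    // Placeholder computation so the context has an output operation.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from existing checkpoint data if present; otherwise build a new context.
    val ssc = StreamingContext.getOrCreate(checkpointDirectory, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```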
That HDFS directory will be created automatically, and the Spark Streaming application will continue working from checkpointed data even in the presence of application restarts or failures.
# S3
You can read/write files to S3 using environment-based secrets to pass your AWS credentials. Your credentials must first be uploaded to the DC/OS secret store: