[SPARK-8124] [SPARKR] [WIP] Created more examples on SparkR DataFrames #6668
0-getting-started.R
@@ -0,0 +1,23 @@
#
# Author: Daniel Emaasit (@emaasit)
# Purpose: This script shows how to install SparkR onto your workstation/PC
#          and initialize a SparkContext and a SparkSQL context
# Date: 06/05/2015
#
# Install SparkR from CRAN
install.packages("SparkR")
## OR install the dev version from GitHub
install.packages("devtools")
devtools::install_github("amplab-extras/SparkR-pkg", subdir = "pkg")
# Load the SparkR package
library(SparkR)
## Initialize SparkContext on your local PC
sc <- sparkR.init(master = "local", appName = "MyApp")
## Initialize SQLContext
sqlCtx <- sparkRSQL.init(sc)
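## Illustrative smoke test (not part of the original script): convert a
## built-in R data frame and peek at it, to confirm both contexts work.
df <- createDataFrame(sqlCtx, faithful)
head(df)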
1-data.R
@@ -0,0 +1,30 @@
#
# Author: Daniel Emaasit (@emaasit)
# Purpose: This script shows how to create Spark DataFrames
# Date: 06/05/2015
#

# For this example, we shall use the "flights" dataset
# The data can be downloaded from: https://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv
# The dataset consists of every flight departing Houston in 2011:
# 227,496 rows x 14 columns.
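## Optional fetch (illustrative addition, not in the original script):
## grab flights.csv into the working directory if it is not already there,
## using the S3 URL quoted above.
if (!file.exists("flights.csv")) {
  download.file("https://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv",
                destfile = "flights.csv", mode = "wb")
}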
source("0-getting-started.R")
# Create an R data frame and then convert it to a SparkR DataFrame -------

## Create an R data frame
install.packages("data.table")  # we want to use the fread() function to read the dataset
library(data.table)

flights_df <- fread("flights.csv")
flights_df$date <- as.Date(flights_df$date)
## Convert the local data frame into a SparkR DataFrame
flightsDF <- createDataFrame(sqlCtx, flights_df)
## Print the schema of this Spark DataFrame
printSchema(flightsDF)
## Cache the DataFrame
cache(flightsDF)
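## Note (added for clarity): cache() is lazy -- the data is actually
## materialized in memory the first time an action runs over it, e.g.:
count(flightsDF)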
@@ -0,0 +1,51 @@
#
# Author: Daniel Emaasit (@emaasit)
# Purpose: This script shows how to explore and manipulate Spark DataFrames
# Date: 06/05/2015
#
source("1-data.R")
# Install the magrittr pipeline operator
install.packages("magrittr")
library(magrittr)
# Print the first 6 rows of the DataFrame
showDF(flightsDF, numRows = 6)  ## or
head(flightsDF)
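## Note (added for clarity): showDF() prints the rows to the console,
## whereas head() returns them as a local R data.frame.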
# Show the column names in the DataFrame
columns(flightsDF)
# Show the number of rows in the DataFrame
count(flightsDF)
# Show summary statistics for numeric columns
describe(flightsDF)
# Select specific columns
destDF <- select(flightsDF, "dest", "cancelled")
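## Illustrative alternative (not in the original script): columns can also
## be selected with the $ accessor instead of name strings.
destDF <- select(flightsDF, flightsDF$dest, flightsDF$cancelled)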
# Using SQL to select columns of data
# First, register the flights DataFrame as a table
registerTempTable(flightsDF, "flightsTable")
destDF <- sql(sqlCtx, "SELECT dest, cancelled FROM flightsTable")
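## Illustrative extra (not in the original script): once registered, the
## table can also be filtered directly in SQL with a WHERE clause.
jfkSqlDF <- sql(sqlCtx, "SELECT dest, cancelled FROM flightsTable WHERE dest = 'JFK'")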
# Use collect to create a local R data frame
dest_df <- collect(destDF)
# Print the newly created local data frame
print(dest_df)
# Filter flights whose destination is JFK
jfkDF <- filter(flightsDF, "dest == 'JFK'")  ## or
jfkDF <- filter(flightsDF, flightsDF$dest == "JFK")
# Group the flights by date and then find the average daily delay
# Write the result into a DataFrame
groupBy(flightsDF, "date") %>%
  agg(dep_delay = "avg", arr_delay = "avg") -> dailyDelayDF
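## Illustrative extra (not in the original script): order the daily
## averages by date and pull the first rows back into R.
head(arrange(dailyDelayDF, dailyDelayDF$date))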
# Stop the SparkContext now
sparkR.stop()
We need to have the Apache License at the top of every file. You can see https://github.com/apache/spark/blob/master/examples/src/main/r/dataframe.R#L1 for an example.
Also, per our style guide, we don't put author names / dates in the file itself, as this is tracked in the commit log.
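For reference, the standard ASF license header that Spark source files carry (as in the dataframe.R example linked above) looks like this:
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.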