This tutorial shows how to run the code explained in the solution paper Recommendation Engine on Google Cloud Platform. In order to run this example fully you will need to use various components.
Disclaimer: This is not an official Google product.
This tutorial assumes that you have a Cloud Platform project. To set up a project:
- In the Cloud Platform Console, go to the Projects page.
- Select a project, or click Create Project to create a new Cloud Platform Console project.
- In the dialog, name your project. Make a note of your generated project ID.
- Click Create to create a new project.
The main steps involved in this example are:
- Set up a Spark cluster.
- Set up a simple Google App Engine website.
- Create a Cloud SQL database with an accommodation table, a rating table and a recommendation table.
- Run a Python script on the Spark cluster to find the best model.
- Run a Python script making a prediction using the best model.
- Save the predictions into Cloud SQL so the user can see them when displaying the welcome page.
Cloud Platform offers various ways to deploy a Hadoop cluster to use Spark. This solution describes two ways:
- Using bdutil, a command-line tool that simplifies cluster creation, deployment, and connectivity to Cloud Platform.
- Using Google Cloud Dataproc, a managed service that makes running your custom code seamless thanks to its easy way to create a cluster, deploy Hadoop, connect to Cloud Platform components, submit jobs, scale the cluster, and monitor the nodes. Cloud Dataproc does all that through a web user interface.
The quickest and easiest way is to use Cloud Dataproc.
Follow these steps to set up Apache Spark:
- Download bdutil from https://cloud.google.com/hadoop/downloads.
- Change your environment values as described in the documentation:
a. CONFIGBUCKET="your_root_bucket_name"
b. PROJECT="your-project"
- Deploy your cluster and log into the Hadoop master instance.
./bdutil deploy -e extensions/spark/spark_env.sh
./bdutil shell
Notes:
- Using the bdutil shell is equivalent to using the SSH command-line interface to connect to the instance.
- bdutil uses Google Cloud Storage as a file system, which means that all references to files are relative to the CONFIGBUCKET folder.
To call the Python application file with the connector, download the JDBC connector to your working folder on the master instance. After you install it, you can use the connector when calling the Python file through the spark-submit command line.
Because each worker needs to access the data, download the JDBC connector onto each instance:
- Download the connector to /usr/share/java.
- Add the following lines to /home/hadoop/spark-install/conf/spark-defaults.conf. Don't forget to replace the names of the JAR files with the correct version.
spark.driver.extraClassPath /usr/share/java/mysql-connector-java-x.x.xx-bin.jar
spark.executor.extraClassPath /usr/share/java/mysql-connector-java-x.x.xx-bin.jar
Set up a cluster with the default parameters as explained in the Cloud Dataproc documentation on how to create a cluster. Cloud Dataproc does not require you to set up the JDBC connector.
Follow these instructions to create a Cloud SQL instance. This example uses a first-generation Cloud SQL instance. To make sure your Spark cluster can access your Cloud SQL database, you must:
- Whitelist the IPs of the nodes as explained in the Cloud SQL documentation. You can find the instances' IPs by going to Compute Engine -> VM Instances in the Cloud Console. There you should see a number of instances (based on your cluster size) with names like cluster-m and cluster-w-i, where cluster is the name of your cluster and i is a worker number.
- Create an IPv4 address so the Cloud SQL instance can be accessed through the network.
- Create a non-root user account. Make sure that this user account can connect from the IPs corresponding to the Dataproc cluster (not just localhost).
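Before running any Spark job, it can help to confirm that a cluster node actually reaches the Cloud SQL instance over the network. The following is a minimal sketch, assuming the instance listens on the default MySQL port 3306; the helper name can_connect is illustrative and is not part of the solution's code. It only checks the network path, not the user account or password.

```python
import socket


def can_connect(host, port=3306, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it verifies that the port is reachable,
    not that the database credentials are valid.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Run it on the master or a worker node, for example can_connect("<YOUR_CLOUDSQL_INSTANCE_IP>"). A False result usually points to a missing whitelist entry or a missing IPv4 address on the instance.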
After you create and connect to an instance, you need to create some tables and load data into some of them by following these steps:
- Connect to your project Cloud SQL instance through the Cloud Console.
- Create the database and tables as explained here, using the provided SQL script. In the Cloud Storage file input, enter solutions-public-assets/recommendation-spark/sql/table_creation.sql. If the UI says that you have no access, you can copy the file to your own bucket instead. In a Cloud Shell window or in a terminal, type:
gsutil cp gs://solutions-public-assets/recommendation-spark/sql/table_creation.sql gs://<your-bucket>/recommendation-spark/sql/table_creation.sql
Then, in the Cloud SQL import window, provide <your-bucket>/recommendation-spark/sql/table_creation.sql (that is, the path to your copy of the file on Cloud Storage, without the gs:// prefix).
- In the same way, populate the Accommodation and Rating tables using the provided accommodations.csv and ratings.csv.
The appengine folder contains a simple HTML website built with Python on App Engine using Angular Material. While it is not required to deploy this website, it can give you an idea of what a recommendation display could look like in a production environment.
Make sure to update your database values in the main.py file to match your setup. If you kept the values of the .sql script, _DB_NAME = 'recommendation_spark'. The rest will be specific to your setup.
You can find some accommodation images here. Upload the individual files to your own bucket and change their ACL to public so that they can be served. Remember to replace <YOUR_IMAGE_BUCKET> in the appengine/app/templates/welcome.html page with your bucket.
The main part of this solution paper is explained on the Cloud Platform solution page. In the pyspark folder, you will find the scripts mentioned in the solution paper: find_model_collaborative.py and app_collaborative.py.
Both scripts should be run in a Spark cluster. This can be done on Cloud Platform either by using Cloud Dataproc or bdutil.
There are two ways to run code in Spark: through the command line or by loading a Python file. In this case, it's easier to use the Python file to avoid writing each line of code into the CLI. Remember to pass the path to the JDBC JAR file as a parameter so it can be used by the sqlContext.load function.
$ spark-submit \
--driver-class-path mysql-connector-java-x.x.xx-bin.jar \
--jars mysql-connector-java-x.x.xx-bin.jar \
find_model_collaborative.py \
<YOUR_CLOUDSQL_INSTANCE_IP> \
<YOUR_CLOUDSQL_INSTANCE_NAME> \
<YOUR_CLOUDSQL_USER> \
<YOUR_CLOUDSQL_PASSWORD>
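The four positional arguments above are presumably combined into a JDBC connection string before the data is loaded. Here is a minimal sketch of that step, assuming a standard MySQL JDBC URL on the default port 3306 and treating the second argument as the database name. The function name build_jdbc_url and the exact URL layout are illustrative assumptions, not taken from the script.

```python
from urllib.parse import quote_plus


def build_jdbc_url(host, database, user, password, port=3306):
    """Return a MySQL JDBC URL (hypothetical helper; the layout is an assumption)."""
    # Percent-encode the credentials so characters such as '&' or '@'
    # do not break the query string.
    return "jdbc:mysql://{}:{}/{}?user={}&password={}".format(
        host, port, database, quote_plus(user), quote_plus(password)
    )
```

For example, build_jdbc_url("203.0.113.10", "recommendation_spark", "app", "p&w") yields a URL whose password is encoded as p%26w.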
Dataproc already has the connector enabled so there is no need to set it up.
The easiest way is to use the Cloud Console and run the script directly from a remote location (Cloud Storage for example). See the documentation.
It is also possible to run this command line from a local computer:
$ gcloud beta dataproc jobs submit pyspark \
--cluster <YOUR_DATAPROC_CLUSTER_NAME> \
find_model_collaborative.py \
<YOUR_CLOUDSQL_INSTANCE_IP> \
<YOUR_CLOUDSQL_INSTANCE_NAME> \
<YOUR_CLOUDSQL_USER> \
<YOUR_CLOUDSQL_PASSWORD>
The script above returns the best combination of parameters for the ALS training, as explained in the Training the model part of the solution article. The result is displayed in the console, where Dist represents how far the predictions are from the known values. The result might not feel satisfying, but remember that the training dataset is quite small.
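The parameter search can be pictured as a small grid search. The sketch below is plain Python with a caller-supplied scoring function, so it runs without a cluster. It assumes that Dist is a root-mean-square error between predicted and known ratings, which is one common choice and is not confirmed by this tutorial. The names rmse and grid_search are illustrative.

```python
import itertools
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length rating lists."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def grid_search(evaluate, ranks, iterations, regularizations):
    """Return (best_dist, rank, iteration, regularization) with the lowest dist.

    `evaluate` is a caller-supplied function that trains a model for one
    parameter combination and returns its distance on validation data.
    """
    best = None
    for rank, num_iter, reg in itertools.product(ranks, iterations, regularizations):
        dist = evaluate(rank, num_iter, reg)
        if best is None or dist < best[0]:
            best = (dist, rank, num_iter, reg)
    return best
```

On a cluster, evaluate would wrap a call to ALS.train followed by a comparison of the predictions against a validation set; here it can be any function of the three parameters.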
After you have those values, you can reuse them when calling the recommendation script.
# Build our model with the best found values
model = ALS.train(rddTraining, BEST_RANK, BEST_ITERATION, BEST_REGULATION)
Where, in our current case, BEST_RANK=15, BEST_ITERATION=20, BEST_REGULATION=0.1.
Run the app_collaborative.py file with the updated values as you did before. The code makes a prediction and saves the top 5 predicted ratings in Cloud SQL. You can look at the results later.
$ spark-submit \
--driver-class-path mysql-connector-java-5.1.36-bin.jar \
--jars mysql-connector-java-5.1.36-bin.jar \
app_collaborative.py \
<YOUR_CLOUDSQL_INSTANCE_IP> \
<YOUR_CLOUDSQL_INSTANCE_NAME> \
<YOUR_CLOUDSQL_USER> \
<YOUR_CLOUDSQL_PASSWORD> \
<YOUR_BEST_RANK> \
<YOUR_BEST_ITERATION> \
<YOUR_BEST_REGULATION>
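The three extra arguments arrive as strings and need converting before they reach ALS.train. The following sketch shows that conversion, assuming the argument order of the command above. parse_args and ModelArgs are hypothetical names, and the real script may read sys.argv differently.

```python
from collections import namedtuple

ModelArgs = namedtuple(
    "ModelArgs",
    ["host", "instance", "user", "password", "rank", "iterations", "regularization"],
)


def parse_args(argv):
    """Convert the seven positional arguments (argv[1:]) into typed values."""
    if len(argv) != 8:
        raise ValueError("expected 7 arguments, got {}".format(len(argv) - 1))
    host, instance, user, password, rank, iterations, regularization = argv[1:]
    # rank and iterations must be integers; the regularization term is a float.
    return ModelArgs(host, instance, user, password,
                     int(rank), int(iterations), float(regularization))
```

Inside a script this would be called as parse_args(sys.argv), after importing sys.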
You can use the Cloud Console, as explained before, which is equivalent to running the following command from your local computer.
$ gcloud beta dataproc jobs submit pyspark \
--cluster <YOUR_DATAPROC_CLUSTER_NAME> \
app_collaborative.py \
<YOUR_CLOUDSQL_INSTANCE_IP> \
<YOUR_CLOUDSQL_INSTANCE_NAME> \
<YOUR_CLOUDSQL_USER> \
<YOUR_CLOUDSQL_PASSWORD> \
<YOUR_BEST_RANK> \
<YOUR_BEST_ITERATION> \
<YOUR_BEST_REGULATION>
The code posted in GitHub prints the top 5 predictions. You should see something similar to a list of tuples, including userId, accoId, and prediction:
[('0', '75', 4.6428704512729375), ('0', '76', 4.54325166163637), ('0', '86', 4.529177571208829), ('0', '66', 4.52387350189572), ('0', '99', 4.44705391172443)]
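Selecting the five highest predictions from a larger list can be sketched in plain Python. The example uses heapq.nlargest on tuples shaped like the output above. One tuple is made up for illustration, and the actual script may select its results differently, for example with Spark's own operations.

```python
import heapq


def top_n(predictions, n=5):
    """Return the n tuples with the highest prediction, best first."""
    return heapq.nlargest(n, predictions, key=lambda row: row[2])


predictions = [
    ("0", "66", 4.52387350189572),
    ("0", "75", 4.6428704512729375),
    ("0", "12", 3.1),  # made-up low score, not from the tutorial output
    ("0", "99", 4.44705391172443),
    ("0", "76", 4.54325166163637),
    ("0", "86", 4.529177571208829),
]
top5 = top_n(predictions)
```

Here top5 holds the five tuples from the tutorial output in descending order of prediction, and the made-up entry is dropped.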
Running the following SQL query on the database will return the predictions saved in the Recommendation table by app_collaborative.py:
SELECT
id, title, type, r.prediction
FROM
Accommodation a
INNER JOIN
Recommendation r
ON
r.accoId = a.id
WHERE
r.userId = <USER_ID>
ORDER BY
r.prediction desc
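To see the shape of the result without a Cloud SQL instance, the same join can be tried against an in-memory SQLite database. This is a sketch only: sqlite3 stands in for MySQL, the table definitions are reduced to the columns the query uses with assumed column types, and all rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical schema: only the columns used by the query.
conn.executescript("""
    CREATE TABLE Accommodation (id TEXT PRIMARY KEY, title TEXT, type TEXT);
    CREATE TABLE Recommendation (userId TEXT, accoId TEXT, prediction REAL);
    INSERT INTO Accommodation VALUES
        ('75', 'Sample cottage', 'cottage'),
        ('76', 'Sample flat', 'apartment');
    INSERT INTO Recommendation VALUES
        ('0', '76', 4.54),
        ('0', '75', 4.64),
        ('1', '75', 3.90);
""")

QUERY = """
    SELECT a.id, a.title, a.type, r.prediction
    FROM Accommodation a
    INNER JOIN Recommendation r ON r.accoId = a.id
    WHERE r.userId = ?
    ORDER BY r.prediction DESC
"""


def recommendations_for(user_id):
    """Return (id, title, type, prediction) rows for one user, best first."""
    # The user ID is passed as a bound parameter rather than pasted into the SQL.
    return conn.execute(QUERY, (user_id,)).fetchall()


rows = recommendations_for("0")
```

Binding the user ID as a parameter, rather than substituting <USER_ID> into the string, is a design choice that also applies when the query is issued from the App Engine application.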