|
| 1 | +(cluvio-tutorial)= |
| 2 | +# Data Analysis with Cluvio and CrateDB |
| 3 | + |
| 4 | +## Introduction |
| 5 | + |
| 6 | +In this tutorial, we'll explore how to leverage the power of [Cluvio](https://www.cluvio.com), a modern data analysis platform with [CrateDB Cloud](https://console.cratedb.cloud/) as the underlying database. |
| 7 | + |
| 8 | +## Prerequisites |
| 9 | + |
| 10 | +* [CrateDB Cloud cluster up and running](https://crate.io/docs/cloud/en/latest/tutorials/cluster-deployment/index.html) |
| 11 | +* [Cluvio account](https://www.cluvio.com/) |
| 12 | + |
| 13 | + |
| 14 | +## Set up CrateDB |
| 15 | + |
| 16 | +Deploying a CrateDB cloud cluster has never been easier, simply follow our tutorial [here](https://crate.io/docs/cloud/en/latest/tutorials/cluster-deployment/stripe.html#cluster-deployment-stripe) and you can have a cluster up and running within minutes. We offer a CRFREE plan which offers up to 2 vCPUs, 2 GiB of memory, and 8 GiB of storage completely for free. Ideal for small-scale testing and evaluation purposes. |
| 17 | + |
| 18 | +### Load data into CrateDB cluster |
| 19 | + |
| 20 | +In this tutorial we'll use 2 tables as our datasource. [flights](http://stat-computing.org/dataexpo/2009) and [airports](https://openflights.org/data.php) from January of 2008. |
| 21 | + |
| 22 | +#### Create tables |
| 23 | + |
| 24 | +Once you have access to the Admin UI of your cluster, execute these statements in the console: |
| 25 | + |
| 26 | +```sql |
| 27 | +CREATE TABLE airports ( |
| 28 | + code character varying(3) NOT NULL, |
| 29 | + name character varying(100) NOT NULL, |
| 30 | + city character varying(50) NOT NULL, |
| 31 | + country character varying(50) NOT NULL, |
| 32 | + latitude double precision NOT NULL, |
| 33 | + longitude double precision NOT NULL, |
| 34 | + elevation integer NOT NULL |
| 35 | +); |
| 36 | +``` |
| 37 | + |
| 38 | +```sql |
| 39 | +CREATE TABLE flights ( |
| 40 | + year integer, |
| 41 | + month integer, |
| 42 | + day_of_month integer, |
| 43 | + day_of_week integer, |
| 44 | + dep_time integer, |
| 45 | + crs_dep_time integer, |
| 46 | + arr_time integer, |
| 47 | + crs_arr_time integer, |
| 48 | + unique_carrier varchar(6), |
| 49 | + flight_num integer, |
| 50 | + tail_num varchar(8), |
| 51 | + actual_elapsed_time integer, |
| 52 | + crs_elapsed_time integer, |
| 53 | + air_time integer, |
| 54 | + arr_delay integer, |
| 55 | + dep_delay integer, |
| 56 | + origin varchar(3), |
| 57 | + dest varchar(3), |
| 58 | + distance integer, |
| 59 | + taxi_in integer, |
| 60 | + taxi_out integer, |
| 61 | + cancelled integer, |
| 62 | + cancellation_code varchar(1), |
| 63 | + diverted varchar(1), |
| 64 | + carrier_delay integer, |
| 65 | + weather_delay integer, |
| 66 | + nas_delay integer, |
| 67 | + security_delay integer, |
| 68 | + late_aircraft_delay integer, |
| 69 | + dep_timestamp timestamp, |
| 70 | + arr_timestamp timestamp |
| 71 | +); |
| 72 | +``` |
| 73 | + |
| 74 | +This creates 2 empty tables in your database. `flights` and `airports`, with the correct data types of the columns. |
| 75 | + |
| 76 | +#### Import data |
| 77 | + |
| 78 | +Now you should import the data into the tables. We will use Console "Import" feature in this example. Use the following links: |
| 79 | + |
| 80 | +* airports - https://s3.amazonaws.com/crate.sampledata/flights/dataset-airports.csv.gz |
| 81 | +* flights - https://s3.amazonaws.com/crate.sampledata/flights/dataset-flights.csv.gz |
| 82 | + |
| 83 | + |
| 84 | + |
| 85 | +Make sure to use your pre-created tables in the "Table name" field, otherwise the column types may be created incorrectly. Do this for both .csv files: |
| 86 | + |
| 87 | + |
| 88 | + |
| 89 | +After this your tables should no longer be empty. `airports` contains 5876 records, and `flights` 150 000 records. |
| 90 | + |
| 91 | +## Connect CrateDB to Cluvio |
| 92 | + |
| 93 | +Now that you have the dataset to visualize, you can connect your cluster to Cluvio. |
| 94 | + |
| 95 | +In Cluvio, navigate to `Settings` -> `Datasources` -> `Add datasource`: |
| 96 | + |
| 97 | +{width=640} |
| 98 | + |
| 99 | +`Host` and `Password` will understandably differ for you. |
| 100 | + |
| 101 | +After filling out the details for your cluster, press `Next: Test Connection`. If the credentials are correct, you should see this success message: |
| 102 | + |
| 103 | +{width=640} |
| 104 | + |
| 105 | +## Dashboards |
| 106 | + |
| 107 | +A dashboard is the main point of Cluvio. It is a collection of interactive reports, giving great insight into any area of your data. These are the types of charts Cluvio offers: |
| 108 | + |
| 109 | +* [Table Chart](https://docs.cluvio.com/chart-types/table-chart) |
| 110 | +* [Pivot / Cohort Table Chart](https://docs.cluvio.com/chart-types/pivot-cohort-table-chart) |
| 111 | +* [Number Chart](https://docs.cluvio.com/chart-types/number-chart) |
| 112 | +* [Pie Chart](https://docs.cluvio.com/chart-types/pie-chart) |
| 113 | +* [Line Chart](https://docs.cluvio.com/chart-types/line-chart) |
| 114 | +* [Bar Chart](https://docs.cluvio.com/chart-types/bar-chart) |
| 115 | +* [Map Chart](https://docs.cluvio.com/chart-types/map-chart) |
| 116 | +* [Gauge Chart](https://docs.cluvio.com/chart-types/gauge-chart) |
| 117 | +* [XY / Bubble Chart](https://docs.cluvio.com/chart-types/xy-bubble-chart) |
| 118 | +* [Word Cloud Chart](https://docs.cluvio.com/chart-types/word-cloud-chart) |
| 119 | +* [Histogram Chart](https://docs.cluvio.com/chart-types/histogram-chart) |
| 120 | + |
| 121 | + |
| 122 | + |
| 123 | +Now, let's create some and see how Cluvio works. Head to **[Dashboards](https://app.cluvio.com/dashboards)** -> `New Dashboard`. After naming your Dashboard, you can create your first report. Click the `New report` in the upper right. |
| 124 | + |
| 125 | +### Number of flights and delays |
| 126 | + |
| 127 | +The first piece information you might be interested in, for a given period, is the number of flights and average delays of departures and arrivals. This is the code for this report: |
| 128 | + |
| 129 | +``` |
| 130 | +SELECT |
| 131 | + COUNT(*) AS "Number of flights", |
| 132 | + AVG(dep_delay) AS "Average Departure Delay", |
| 133 | + AVG(arr_delay) AS "Average Arrival Delay" |
| 134 | +FROM doc.flights |
| 135 | +ORDER BY 1 |
| 136 | +``` |
| 137 | +This is a pretty simple query that counts the number of rows in the `flights` as the number of flights, and averages values in the `dep_delay` and `arr_delay` for the departure delays and arrival delays respectively. |
| 138 | + |
| 139 | +{width=800} |
| 140 | + |
| 141 | +To see the information displayed this way, you need to switch to "Number" chart after running query. |
| 142 | + |
| 143 | +### Country distribution |
| 144 | + |
| 145 | +This query looks at the country distribution in the `airports` table: |
| 146 | + |
| 147 | +``` |
| 148 | +SELECT country, |
| 149 | + COUNT(1) |
| 150 | +FROM doc.airports |
| 151 | +GROUP BY country |
| 152 | +ORDER BY 2 DESC |
| 153 | +``` |
| 154 | + |
| 155 | +In this one, it's suitable to use pie chart to better see the distribution. We also used the `Value(%)` option for the legend, and edited the legend to show up to 25 values (countries). |
| 156 | + |
| 157 | +{width=800} |
| 158 | + |
| 159 | +## Filters |
| 160 | + |
| 161 | +[Filters](https://app.cluvio.com/settings/filters) offer a great way to quickly specify the condition under which you want to display your data. |
| 162 | + |
| 163 | +In the `flights` table in `day_of_week` column 1 represents Monday, 2 means Tuesday, etc. Using that, we can create a filter to display data for a specific day of the week without changing the SQL in our reports. |
| 164 | + |
| 165 | +``` |
| 166 | +VALUES |
| 167 | +(1, 'Monday'), |
| 168 | +(2, 'Tuesday'), |
| 169 | +(3, 'Wednesday'), |
| 170 | +(4, 'Thursday'), |
| 171 | +(5, 'Friday'), |
| 172 | +(6, 'Saturday'), |
| 173 | +(7, 'Sunday') |
| 174 | +ORDER BY 1 |
| 175 | +``` |
| 176 | + |
| 177 | +Now we can filter the data by day of the week: |
| 178 | + |
| 179 | +{width=800} |
| 180 | + |
| 181 | +Find out more about filters [here](https://docs.cluvio.com/filters/overview). |
| 182 | + |
| 183 | +## SQL snippets |
| 184 | + |
| 185 | +SQL snippets are small reusable pieces of code that can make your work easier within larger dataset. They are managed [here](https://app.cluvio.com/settings/sql-snippets). |
| 186 | + |
| 187 | +We used them to create JOIN statements: |
| 188 | + |
| 189 | +``` |
| 190 | +JOIN doc.airports AS origin_airport ON flights.origin = origin_airport.code |
| 191 | +JOIN doc.airports AS dest_airport ON flights.dest = dest_airport.code |
| 192 | +``` |
| 193 | + |
| 194 | +This snippet creates two joins between the `flights` and `airports` tables, aliasing the `airports` table as `origin_airport` and `dest_airport` for the origin and destination airports, respectively. |
| 195 | + |
| 196 | +Then create a report using the snippet: |
| 197 | + |
| 198 | +``` |
| 199 | +SELECT flights.year, |
| 200 | + flights.month, |
| 201 | + origin_airport.city AS origin_city, |
| 202 | + dest_airport.city AS destination_city, |
| 203 | + COUNT(*) AS number_of_flights |
| 204 | +FROM doc.flights |
| 205 | +[join_airports] -- Reference to the SQL Snippet |
| 206 | +WHERE [flight_filters] -- Reference our filters |
| 207 | +GROUP BY flights.year, flights.month, origin_airport.city, dest_airport.city |
| 208 | +ORDER BY number_of_flights DESC |
| 209 | +LIMIT 100; |
| 210 | +``` |
| 211 | + |
| 212 | +Using the SQL snippets and filters, we can quickly find out what is the most popular destination departing from Los Angeles (LAX) on a Tuesday. Pretty cool. |
| 213 | + |
| 214 | + |
| 215 | + |
| 216 | +## Conclusion |
| 217 | + |
| 218 | +That's it for this tutorial. If using Cluvio could help you make sense of your data, feel free to head to [Cloud Console](https://console.cratedb.cloud/), connect your cluster to [Cluvio](https://app.cluvio.com/) and get started! Make sure to visit their [documentation](https://docs.cluvio.com/) to explore all the features. |
0 commit comments