diff --git a/SECURITY.md b/SECURITY.md
new file mode 100644
index 00000000..b1b0d7a9
--- /dev/null
+++ b/SECURITY.md
@@ -0,0 +1,26 @@
+# Security
+
+If you believe you have found a security vulnerability in this repository, please report it to us as described below.
+
+## Reporting Security Issues
+
+* Please do **not** report security vulnerabilities through public GitHub issues.
+
+* Please create a draft security advisory on the GitHub page: the reporting form is under `> Security > Advisories`. The URL is https://github.com/exasol/ai-lab/security/advisories/new.
+
+* If you prefer email, please send your report to `infosec@exasol.com`.
+
+## Guidelines
+
+* When reporting a vulnerability, please include as much information as possible, including the complete steps to reproduce the issue.
+
+* Avoid sending us executables.
+
+* Feel free to include any script you wrote and used, but avoid sending us scripts that download and run binaries.
+
+* We will prioritise reports that show how the exploit works in realistic environments.
+
+* We prefer all communications to be in English.
+
+* We do not offer financial rewards. We are happy to acknowledge your research publicly when possible.
+
diff --git a/doc/changes/changes_3.4.0.md b/doc/changes/changes_3.4.0.md
index 11a5823a..21d1e80f 100644
--- a/doc/changes/changes_3.4.0.md
+++ b/doc/changes/changes_3.4.0.md
@@ -22,7 +22,8 @@ The release also merges the developer CLI commands `setup-ec2-and-install-depend
 * #386: Simplified developer commands
 * #374: Added GPU option to Jupyter UI
 * #387: Added AI-Lab example for customizing CUDA-enabled template SLC
-* #417: Added TE model mgmt notebook
+* #417: Added Transformers Extension model management notebook
+* #420: Added notebook "First Steps" demonstrating basic Exasol features
 
 ## Refactorings
 
diff --git a/exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook/first_steps.ipynb b/exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook/first_steps.ipynb
new file mode 100644
index 00000000..adacfbac
--- /dev/null
+++ b/exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook/first_steps.ipynb
@@ -0,0 +1,809 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "91051dad-8d45-435f-b2c9-7572e4050f58",
+   "metadata": {},
+   "source": [
+    "# First Steps\n",
+    "\n",
+    "This notebook guides you through your first steps using Jupyter notebooks with Exasol.\n",
+    "\n",
+    "It demonstrates connecting to an Exasol database instance and using some of its features.\n",
+    "\n",
+    "## 1. Open Secure Configuration Storage\n",
+    "\n",
+    "First, we need to open the Secure Configuration Storage (SCS) containing the connection information, such as the database host, user, and password."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5970a003-896d-4f06-83ee-be98129c29d8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%run utils/access_store_ui.ipynb\n",
+    "display(get_access_store_ui())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d15075fa-8c9d-4e60-82d9-28d92d38ee0e",
+   "metadata": {},
+   "source": [
+    "## 2. Using JupySQL with Exasol AI Lab\n",
+    "\n",
+    "Next, we will activate the [JupySQL](https://jupysql.ploomber.io) magics:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "73122275-3f4a-4981-b8b3-ee6c49734996",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%run utils/jupysql_init.ipynb"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "342a7800-253d-43a0-9bd7-89932876317d",
+   "metadata": {},
+   "source": [
+    "JupySQL allows you to run SQL and plot large datasets in Jupyter via the `%sql`, `%%sql`, and `%sqlplot` magics. Using these magics, you can enter SQL directly into notebook cells.\n",
+    "In the background, JupySQL uses SQLAlchemy; see also the [demo section on SQLAlchemy](#4.-Using-SQLAlchemy-with-Exasol-AI-Lab) below.\n",
+    "\n",
+    "### 2.1 Create Database Tables\n",
+    "\n",
+    "We will use JupySQL to create two database tables; later sections will use these tables as well:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8a7f5ec4-10f3-49e8-a6e4-48552a760542",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sql\n",
+    "CREATE OR REPLACE TABLE US_AIRLINES (\n",
+    "    OP_CARRIER_AIRLINE_ID DECIMAL(10, 0) IDENTITY PRIMARY KEY,\n",
+    "    CARRIER_NAME VARCHAR(1000)\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "dd466ee7-7b61-4b3c-8f13-a056be60c933",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sql\n",
+    "CREATE OR REPLACE TABLE US_FLIGHTS (\n",
+    "    FL_DATE TIMESTAMP,\n",
+    "    OP_CARRIER_AIRLINE_ID DECIMAL(10, 0),\n",
+    "    ORIGIN_AIRPORT_SEQ_ID DECIMAL(10, 0),\n",
+    "    ORIGIN_STATE_ABR CHAR(2),\n",
+    "    DEST_AIRPORT_SEQ_ID DECIMAL(10, 0),\n",
+    "    DEST_STATE_ABR CHAR(2),\n",
+    "    CRS_DEP_TIME CHAR(4),\n",
+    "    DEP_DELAY DOUBLE,\n",
+    "    CRS_ARR_TIME CHAR(4),\n",
+    "    ARR_DELAY DOUBLE,\n",
+    "    CANCELLED BOOLEAN,\n",
+    "    CANCELLATION_CODE CHAR(1),\n",
+    "    DIVERTED BOOLEAN,\n",
+    "    CRS_ELAPSED_TIME DOUBLE,\n",
+    "    ACTUAL_ELAPSED_TIME DOUBLE,\n",
+    "    DISTANCE DOUBLE,\n",
+    "    CARRIER_DELAY DOUBLE,\n",
+    "    WEATHER_DELAY DOUBLE,\n",
+    "    NAS_DELAY DOUBLE,\n",
+    "    SECURITY_DELAY DOUBLE,\n",
+    "    LATE_AIRCRAFT_DELAY DOUBLE\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1f83db93-e196-47ae-8e41-48eb21f50f41",
+   "metadata": {},
+   "source": [
+    "### 2.2 Importing CSV Files from Remote\n",
+    "\n",
+    "This section demonstrates how to import CSV files from a remote source into the database.\n",
+    "\n",
+    "First, we will import a list of **US airlines**. The data is publicly accessible at the [Bureau of Transportation Statistics](https://www.transtats.bts.gov/Homepage.asp) of the US Department of Transportation.\n",
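+    "\n",
+    "The imports below use JupySQL's `%%sql` cell magic. For short one-line statements, JupySQL also provides the `%sql` line magic, whose result can be assigned to a Python variable. A minimal sketch (assuming the tables created above):\n",
+    "\n",
+    "```python\n",
+    "airline_count = %sql SELECT COUNT(*) FROM US_AIRLINES\n",
+    "```"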
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "821b0486-0871-44ab-a96e-66e875a28842",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sql\n",
+    "IMPORT INTO US_AIRLINES FROM\n",
+    "    CSV AT 'https://dut5tonqye28.cloudfront.net/ai_lab/flight-info/'\n",
+    "    FILE 'US_AIRLINES.csv'\n",
+    "    COLUMN SEPARATOR = ','\n",
+    "    ROW SEPARATOR = 'CRLF'\n",
+    "    COLUMN DELIMITER = '\"'\n",
+    "    SKIP = 1"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c2428c14-dd23-4891-8c2f-533f5c2b8f05",
+   "metadata": {},
+   "source": [
+    "Next, we will import data about **flights** in February 2024:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "12b76b4f-167b-4769-b454-49fa2871a3b3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sql\n",
+    "IMPORT INTO US_FLIGHTS\n",
+    "    FROM CSV AT 'https://dut5tonqye28.cloudfront.net/ai_lab/first_steps/'\n",
+    "    FILE 'US_FLIGHTS_FEB_2024-fixed-booleans.csv'\n",
+    "    (1 FORMAT = 'MM/DD/YYYY HH12:MI:SS AM', 2..21)\n",
+    "    SKIP = 1"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6b764fd6-c3d9-49b8-bd0a-b36c6c382146",
+   "metadata": {},
+   "source": [
+    "Let's find out which airline has the highest delay per flight:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b31b5053-72c8-4846-941b-57d7dc267f30",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sql\n",
+    "SELECT\n",
+    "    CARRIER_NAME \"Airline\",\n",
+    "    SUM(CARRIER_DELAY) \"Combined Delay\",\n",
+    "    COUNT(CARRIER_DELAY) \"Delayed Flights\",\n",
+    "    COUNT(F.OP_CARRIER_AIRLINE_ID) \"Total flights\",\n",
+    "    ROUND( SUM(CARRIER_DELAY) / COUNT(F.OP_CARRIER_AIRLINE_ID), 1 ) \"Delay per flight\"\n",
+    "FROM US_FLIGHTS F\n",
+    "    JOIN US_AIRLINES A ON A.OP_CARRIER_AIRLINE_ID = F.OP_CARRIER_AIRLINE_ID\n",
+    "WHERE NOT (CANCELLED OR DIVERTED)\n",
+    "GROUP BY CARRIER_NAME\n",
+    "ORDER BY \"Delay per flight\" DESC"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c29ff5f1-1eb8-452c-8827-6c675c71dba5",
+   "metadata": {},
+   "source": [
+    "### 2.3 Importing a Parquet File from an AWS S3 Bucket\n",
+    "\n",
+    "This demo uses a file already uploaded to the S3 bucket `ai-lab-example-data-s3`.\n",
+    "\n",
+    "First, we will define a connection pointing to the S3 bucket:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e5de58f6-b561-45c3-bc96-38523fcff58b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sql\n",
+    "CREATE OR REPLACE CONNECTION AI_LAB_FIRST_STEPS_S3\n",
+    "    TO 'https://ai-lab-example-data-s3.s3.eu-central-1.amazonaws.com'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "23507a42-9459-4446-af7f-7d4f65294242",
+   "metadata": {},
+   "source": [
+    "Alternatively, the connection can use the following URL syntax; see also \"_Load data from Parquet files_\" on [docs.exasol.com](https://docs.exasol.com/db/latest/loading_data/load_data_parquet.htm#Overview):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "945dc2dd-8722-4455-b77e-cf6ad88eea23",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sql\n",
+    "CREATE OR REPLACE CONNECTION AI_LAB_FIRST_STEPS_S3\n",
+    "    TO 's3://ai-lab-example-data-s3'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e118dd1a-40e6-412a-8624-314e7de1f893",
+   "metadata": {},
+   "source": [
+    "Next, we will remove the previously imported data:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a9ba8cad-8451-4f7a-8dec-3d04950d88b8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sql\n",
+    "TRUNCATE TABLE US_FLIGHTS"
+   ]
+  },
+  {
+   "cell_type": "markdown",
"markdown", + "id": "0dc80445-e5c0-43c1-ae34-f16326d6fb94", + "metadata": {}, + "source": [ + "**Please note**: \n", + "* Parquet import requires using **Exasol version 2025.\\* or higher**, see [docs.exasol.com](https://docs.exasol.com/db/latest/loading_data/load_data_parquet.htm).\n", + "* Hence, the import is currently commented out.\n", + "\n", + "Now we can import the Parquet file from S3 into the database:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8aa97292-b0e5-441b-a961-a3d1fba6d23e", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql \n", + "SELECT 1 -- avoid error from empty cell \n", + "\n", + "-- IMPORT INTO US_FLIGHTS FROM PARQUET AT AI_LAB_FIRST_STEPS_S3 \n", + "-- FILE 'first_steps/US_FLIGHTS_FEB_2024.parquet'" + ] + }, + { + "cell_type": "markdown", + "id": "b074a883-ae21-479c-b4fa-17962afe8149", + "metadata": {}, + "source": [ + "We will query table `US_FLIGHTS` again to display the imported data:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "064158a4-8d37-42ab-905f-e13d6c2af3d2", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql \n", + "SELECT * FROM US_FLIGHTS" + ] + }, + { + "cell_type": "markdown", + "id": "5582ac55-9c56-47d9-b286-948835d976c4", + "metadata": {}, + "source": [ + "## 3. Using PyExasol with Exasol AI Lab\n", + "\n", + "[PyExasol](https://exasol.github.io/pyexasol/master/index.html) is the basic Python connector for interacting with Exasol databases.\n", + "\n", + "Please note\n", + "* Using PyExasol's import or export functions with Exasol database versions `2025.*` and higher requires Pyexasol version ≥ `1.2`.\n", + "* AI Lab currently is shipped with pyexasol version `0.27.0`.\n", + "* Hence, the Pyexasol examples can only be executed with Exasol database versions < `2025`.\n", + "\n", + "### 3.1 Importing a CSV File from the Local Filesystem\n", + "\n", + "This section demonstrates how to import a CSV file from the local file system into the database using Pyexasol.\n", + "\n", + "The function `open_pyexasol_connection()` opens a connection, using the configuration from the SCS." 
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "40546944-dd3e-4330-a9c0-8ef57c4b5deb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "from exasol.nb_connector.connections import open_pyexasol_connection\n",
+    "\n",
+    "with open_pyexasol_connection(ai_lab_config, compression=True) as conn:\n",
+    "    path = Path(\"first_steps/pyexasol.csv\")\n",
+    "    import_params = {\n",
+    "        \"column_delimiter\": '\"',\n",
+    "        \"column_separator\": \",\",\n",
+    "        \"row_separator\": \"CRLF\",\n",
+    "        \"skip\": 1,\n",
+    "    }\n",
+    "    conn.import_from_file(path, (ai_lab_config.db_schema, \"US_AIRLINES\"), import_params)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e023d10d-ccdb-4c2d-8fa5-655e227e8c76",
+   "metadata": {},
+   "source": [
+    "We will verify that the import succeeded, using a query with separate parameters and [PyExasol SQL formatting](https://exasol.github.io/pyexasol/master/user_guide/exploring_features/formatting_sql.html):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e333f0ae-ac41-4664-a214-65bdbc4541e2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from exasol.nb_connector.connections import open_pyexasol_connection\n",
+    "\n",
+    "query = \"SELECT * FROM {table!q} WHERE CARRIER_NAME LIKE {pattern!s}\"\n",
+    "params = {\n",
+    "    \"table\": (ai_lab_config.db_schema, \"US_AIRLINES\"),\n",
+    "    \"pattern\": \"% local CSV file via pyexasol\",\n",
+    "}\n",
+    "\n",
+    "with open_pyexasol_connection(ai_lab_config, compression=True) as conn:\n",
+    "    result = conn.execute(query, params).fetchone()\n",
+    "print(result)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "38838085-0ea2-4570-9581-7de2f9e1ba20",
+   "metadata": {},
+   "source": [
+    "### 3.2 Importing a CSV File from Remote\n",
+    "\n",
+    "This section demonstrates how to import a CSV file from a remote source into the database using PyExasol.\n",
+    "\n",
+    "First, we will truncate table `US_FLIGHTS` to be able to import the flight data again:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8e2a0de0-7b72-4466-8f61-815d40ee0f06",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sql\n",
+    "TRUNCATE TABLE US_FLIGHTS"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0759b904-22b2-4725-a2ef-cb20cdba4e6a",
+   "metadata": {},
+   "source": [
+    "Now let's run the import, again using a query with separate parameters and PyExasol SQL formatting:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "317449c8-4347-40ae-8c7c-bcbe1eb06656",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from exasol.nb_connector.connections import open_pyexasol_connection\n",
+    "\n",
+    "query = \"\"\"\n",
+    "    IMPORT INTO {flights_table!q} FROM CSV AT {url!s} FILE {file!s}\n",
+    "    (1 FORMAT = {date_format!s}, 2..21)\n",
+    "    SKIP = 1\n",
+    "\"\"\"\n",
+    "\n",
+    "params = {\n",
+    "    \"flights_table\": (ai_lab_config.db_schema, \"US_FLIGHTS\"),\n",
+    "    \"url\": \"https://dut5tonqye28.cloudfront.net/ai_lab/first_steps/\",\n",
+    "    \"file\": \"US_FLIGHTS_FEB_2024-fixed-booleans.csv\",\n",
+    "    \"date_format\": \"MM/DD/YYYY HH12:MI:SS AM\",\n",
+    "}\n",
+    "\n",
+    "with open_pyexasol_connection(ai_lab_config, compression=True) as conn:\n",
+    "    result = conn.execute(query, params)\n",
+    "\n",
+    "print(f\"Imported {result.rowcount()} rows.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "23e99c7a-e6a2-4497-8a74-054dab3ca131",
+   "metadata": {},
+   "source": [
+    "### 3.3 Importing a Parquet File from an AWS S3 Bucket\n",
+    "\n",
+    "This demo uses\n",
+    "* a file already uploaded to the S3 bucket `ai-lab-example-data-s3`, and\n",
+    "* the Exasol connection object `AI_LAB_FIRST_STEPS_S3` created in section 2.3.\n",
+    "\n",
+    "First, we will truncate table `US_FLIGHTS`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ce8ed658-07af-433b-8d77-0aedc3e51310",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sql\n",
+    "TRUNCATE TABLE US_FLIGHTS"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5dfdd7ec-5cd5-4082-a565-ee17166fd691",
+   "metadata": {},
+   "source": [
+    "**Please note**:\n",
+    "* Parquet import requires **Exasol version 2025.\\* or higher**; see [docs.exasol.com](https://docs.exasol.com/db/latest/loading_data/load_data_parquet.htm).\n",
+    "* Hence, the import statement below is commented out.\n",
+    "\n",
+    "With a matching database version, the Parquet file can be imported from S3 into the database as follows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "70b4826c-12df-4b0f-a5f9-868779bc1905",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from exasol.nb_connector.connections import open_pyexasol_connection\n",
+    "\n",
+    "query = \"IMPORT INTO {table!q} FROM PARQUET AT {connection} FILE {file!s}\"\n",
+    "\n",
+    "params = {\n",
+    "    \"table\": (ai_lab_config.db_schema, \"US_FLIGHTS\"),\n",
+    "    \"connection\": \"AI_LAB_FIRST_STEPS_S3\",\n",
+    "    \"file\": \"first_steps/US_FLIGHTS_FEB_2024.parquet\",\n",
+    "}\n",
+    "\n",
+    "# with open_pyexasol_connection(ai_lab_config, compression=True) as conn:\n",
+    "#     result = conn.execute(query, params)\n",
+    "# print(f\"Imported {result.rowcount()} rows.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "61b8024f-69c1-4247-ab5a-946d5e98fe2b",
+   "metadata": {},
+   "source": [
+    "## 4. Using SQLAlchemy with Exasol AI Lab\n",
+    "\n",
+    "[SQLAlchemy](https://www.sqlalchemy.org/) is a Python SQL toolkit and Object-Relational Mapper for application developers.\n",
+    "\n",
+    "It provides a full suite of well-known enterprise-level persistence patterns, designed for efficient and high-performing database access, adapted into a simple and Pythonic domain language.\n",
+    "\n",
+    "In Exasol AI Lab, the function `open_sqlalchemy_connection()` returns a SQLAlchemy engine, again using the configuration from the SCS.\n",
+    "\n",
+    "This engine will be used by all examples based on SQLAlchemy.\n",
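+    "\n",
+    "Once the engine is created, it can be checked with a trivial query. A minimal sketch:\n",
+    "\n",
+    "```python\n",
+    "from sqlalchemy import text\n",
+    "\n",
+    "with engine.connect() as conn:\n",
+    "    print(conn.execute(text(\"SELECT 1\")).scalar())\n",
+    "```"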
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "632a9139-f8e1-45bd-8ea8-43571d2ee3f9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from exasol.nb_connector.connections import open_sqlalchemy_connection\n",
+    "engine = open_sqlalchemy_connection(ai_lab_config)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e9c7e497-cf0a-4158-b656-221a80086a90",
+   "metadata": {},
+   "source": [
+    "### 4.1 Importing a CSV File from Remote\n",
+    "\n",
+    "This section demonstrates how to import a CSV file from a remote source into the database using SQLAlchemy.\n",
+    "\n",
+    "We will once again truncate table `US_FLIGHTS` and import the flight data, now using SQLAlchemy:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8f00e6c4-4f44-47d7-a85f-5c1bb70708a8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sqlalchemy import text\n",
+    "\n",
+    "t = text(\n",
+    "    f\"\"\"\n",
+    "    IMPORT INTO \"{ai_lab_config.db_schema}\".\"US_FLIGHTS\"\n",
+    "    FROM CSV AT 'https://dut5tonqye28.cloudfront.net/ai_lab/first_steps/'\n",
+    "    FILE 'US_FLIGHTS_FEB_2024-fixed-booleans.csv'\n",
+    "    (1 FORMAT = 'MM/DD/YYYY HH12:MI:SS AM', 2..21)\n",
+    "    SKIP = 1\n",
+    "    \"\"\"\n",
+    ")\n",
+    "with engine.connect() as conn:\n",
+    "    conn.execute(text(\"TRUNCATE TABLE US_FLIGHTS\"))\n",
+    "    result = conn.execute(t)\n",
+    "\n",
+    "print(f\"Imported {result.rowcount} rows.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f2228c1a-b6ce-432f-b869-ad18abf04b93",
+   "metadata": {},
+   "source": [
+    "### 4.2 Importing a Parquet File from an AWS S3 Bucket\n",
+    "\n",
+    "This section demonstrates how to import a Parquet file from an AWS S3 bucket into the database using SQLAlchemy.\n",
+    "\n",
+    "First, we will truncate table `US_FLIGHTS` again:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "75a19429-89b7-4ed2-b401-e1c2563c2940",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sql\n",
+    "TRUNCATE TABLE US_FLIGHTS"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d11dcb6d-d3e5-4e1e-8d61-ba59897c56e9",
+   "metadata": {},
+   "source": [
+    "**Please note**:\n",
+    "* Parquet import requires **Exasol version 2025.\\* or higher**; see [docs.exasol.com](https://docs.exasol.com/db/latest/loading_data/load_data_parquet.htm).\n",
+    "* Hence, the import statement below is commented out.\n",
+    "\n",
+    "With a matching database version, the Parquet file can be imported from S3 into the database as follows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "651b41c6-3cd1-4311-a5f3-ababdc9a4273",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sqlalchemy import text\n",
+    "\n",
+    "t = text(\n",
+    "    f\"\"\"\n",
+    "    IMPORT INTO \"{ai_lab_config.db_schema}\".\"US_FLIGHTS\"\n",
+    "    FROM PARQUET AT AI_LAB_FIRST_STEPS_S3\n",
+    "    FILE 'first_steps/US_FLIGHTS_FEB_2024.parquet'\n",
+    "    \"\"\"\n",
+    ")\n",
+    "\n",
+    "# with engine.connect() as conn:\n",
+    "#     result = conn.execute(t)\n",
+    "# print(f\"Imported {result.rowcount} rows.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "35335a66-dace-4082-ba35-a3bcc1ed3d54",
+   "metadata": {},
+   "source": [
+    "### 4.3 Querying Using SQLAlchemy’s Metadata Concept\n",
+    "\n",
+    "This section demonstrates several SQLAlchemy features:\n",
+    "* Instantiating class [sqlalchemy.MetaData](https://docs.sqlalchemy.org/en/20/core/metadata.html#sqlalchemy.schema.MetaData)\n",
+    "* Reflecting existing SQL tables with [.reflect()](https://docs.sqlalchemy.org/en/20/core/metadata.html#sqlalchemy.schema.MetaData.reflect)\n",
+    "* Retrieving a specific table\n",
+    "* Using its [select](https://docs.sqlalchemy.org/en/20/core/metadata.html#sqlalchemy.schema.Table.select) method to build a statement\n",
+    "* Executing the statement in a SQLAlchemy [Session](https://docs.sqlalchemy.org/en/20/orm/session_api.html#sqlalchemy.orm.Session) object\n",
+    "* Iterating over the result set"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "676b4ae9-19de-4099-92df-d990d242d1c4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sqlalchemy import MetaData\n",
+    "from sqlalchemy.orm import Session\n",
+    "\n",
+    "meta = MetaData()\n",
+    "db_schema = ai_lab_config.db_schema\n",
+    "meta.reflect(engine, schema=db_schema)\n",
+    "\n",
+    "flights = meta.tables[f\"{db_schema}.us_flights\"]\n",
+    "stmt = (\n",
+    "    flights.select()\n",
+    "    .where(flights.c.op_carrier_airline_id == 20452)\n",
+    "    .where(flights.c.origin_state_abr == \"TX\")\n",
+    "    .where(flights.c.dest_state_abr == \"NJ\")\n",
+    ")\n",
+    "with Session(engine) as session:\n",
+    "    for r in session.execute(stmt):\n",
+    "        print(f'Departure: {r[0].date()} {r[6]}')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "768ca230-817c-4226-a1b6-ac5c32d4b1c7",
+   "metadata": {},
+   "source": [
+    "## 5. Using the Exasol Bucket File System\n",
+    "\n",
+    "The [Exasol Bucket File System](https://docs.exasol.com/db/latest/database_concepts/bucketfs/bucketfs.htm) (BucketFS) is a powerful feature for exchanging non-relational data with the database nodes in an Exasol cluster.\n",
+    "\n",
+    "Such data can be arbitrary files, including\n",
+    "* Data to be processed by [User Defined Scripts](https://docs.exasol.com/db/latest/database_concepts/udf_scripts.htm) (UDFs)\n",
+    "* [Script-Languages Containers](https://github.com/exasol/script-languages-release) (SLCs), see also the dedicated [Tutorial on Script-Languages Containers](script_languages_container/using_the_script_languages_container_tool.ipynb)\n",
+    "* AI models\n",
+    "\n",
+    "### 5.1 Uploading a File to the BucketFS\n",
+    "\n",
+    "First, we will create a sample file:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "30da84a6-c1bd-4159-830e-88a2fa94ec46",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%writefile first_steps/text_file.txt\n",
+    "Hello World!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "644c5311-27e4-463c-b228-c44ce810b50e",
+   "metadata": {},
+   "source": [
+    "And now, let's upload the file into the BucketFS.\n",
+    "\n",
+    "The function `open_bucketfs_location()` returns a path into Exasol's BucketFS, also using the configuration from the SCS.\n",
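+    "\n",
+    "The returned object offers a pathlib-like interface, so paths inside the bucket can be composed with the `/` operator. A minimal sketch (with a hypothetical file name):\n",
+    "\n",
+    "```python\n",
+    "bfs = open_bucketfs_location(ai_lab_config)\n",
+    "target = bfs / \"some_file.txt\"  # a path inside the bucket\n",
+    "```"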
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7f6ab083-d7db-4e76-8e1d-7cf7ff369c87",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from exasol.nb_connector.connections import open_bucketfs_location\n",
+    "from pathlib import Path\n",
+    "\n",
+    "file = Path(\"first_steps/text_file.txt\")\n",
+    "bfs = open_bucketfs_location(ai_lab_config)\n",
+    "file_in_bfs = bfs / file.name\n",
+    "file_in_bfs.write(file.read_bytes())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "82dc65f7-4d27-41b2-835a-100aa8b79d64",
+   "metadata": {},
+   "source": [
+    "### 5.2 Listing the Files in the BucketFS\n",
+    "\n",
+    "We can also list all the files currently available in the BucketFS:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7597aa88-5c37-4d2d-ae96-be90bf1a5baf",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from exasol.nb_connector.connections import open_bucketfs_location\n",
+    "\n",
+    "bfs = open_bucketfs_location(ai_lab_config)\n",
+    "for p in bfs.iterdir():\n",
+    "    print(f'- {p.name}')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "50b22225-34de-435b-b29d-6712e960b926",
+   "metadata": {},
+   "source": [
+    "### 5.3 Reading the File in the BucketFS Using a User Defined Function\n",
+    "\n",
+    "This section demonstrates reading the contents of a file in the BucketFS using a [User Defined Function](https://docs.exasol.com/db/latest/database_concepts/udf_scripts.htm) (UDF).\n",
+    "\n",
+    "UDFs can be used in Exasol SQL statements and can also access objects in the BucketFS.\n",
+    "\n",
+    "First, we will define the UDF:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8dfc7b96-2be8-4bf9-9140-00a1097cec91",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sql\n",
+    "--/\n",
+    "CREATE OR REPLACE PYTHON3 SCALAR SCRIPT\n",
+    "    bfs_file_content(\"path\" VARCHAR(256))\n",
+    "    EMITS (\"FILE\" VARCHAR(256), CONTENT VARCHAR(2000000)) AS\n",
+    "from pathlib import Path\n",
+    "def run(ctx):\n",
+    "    ctx.emit(ctx.path, Path(ctx.path).read_text())\n",
+    "/"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "18cb2e3b-c019-4266-ae0e-4e9cbd5f3778",
+   "metadata": {},
+   "source": [
+    "And now, let's read the contents of the file we uploaded to the BucketFS earlier.\n",
\n", + "\n", + "The file's BucketFS location is represented by an instance of class [exasol.bucketfs.PathLike](https://exasol.github.io/bucketfs-python/main/api.html#exasol.bucketfs._path.PathLike) and the method `as_udf_path()` returns the file's path as seen by a UDF:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ba88c64b-e3e2-4366-af0c-86c77db35132", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "SELECT bfs_file_content('{{file_in_bfs.as_udf_path()}}')" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook/first_steps/pyexasol.csv b/exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook/first_steps/pyexasol.csv new file mode 100644 index 00000000..02e6c11d --- /dev/null +++ b/exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook/first_steps/pyexasol.csv @@ -0,0 +1,2 @@ +Code,Description +"99991","From local CSV file via pyexasol" diff --git a/exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook/first_steps/sqlalchemy.csv b/exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook/first_steps/sqlalchemy.csv new file mode 100644 index 00000000..674060c0 --- /dev/null +++ b/exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook/first_steps/sqlalchemy.csv @@ -0,0 +1,2 @@ +Code,Description +"99993","From local CSV file via SQL Alchemy" diff --git a/exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook_requirements.txt b/exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook_requirements.txt index 45004d29..0fee2963 100644 --- a/exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook_requirements.txt +++ b/exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook_requirements.txt @@ -9,5 +9,5 @@ wordcloud==1.9.4 # required for Text AI notebooks pickleshare==0.7.5 # See https://github.com/exasol/ai-lab/issues/291 for details. ipyfilechooser==0.6.0 # required for SLC notebooks ipywidgets==8.1.5 # enable interactive Javascript widgets in the notebooks -exasol-notebook-connector==2.0.0 +exasol-notebook-connector [sqlalchemy, pyexasol, bucketfs, docker-db, slc, ibis, transformers, text-ai, sagemaker] ==2.2.0 pydantic==2.11.4 # See https://github.com/exasol/ai-lab/issues/428 diff --git a/nb_tests.yaml b/nb_tests.yaml index 218ad2cb..a52c8507 100644 --- a/nb_tests.yaml +++ b/nb_tests.yaml @@ -3,6 +3,10 @@ normal: additional_pytest_parameters: stable: tests: + - name: First Steps notebook + test_file: nbtest_first_steps.py + test_backend: onprem + wip: no-wip - name: CSE notebook test_file: nbtest_cloud.py test_backend: onprem diff --git a/test/notebooks/nbtest_first_steps.py b/test/notebooks/nbtest_first_steps.py new file mode 100644 index 00000000..60cc21a6 --- /dev/null +++ b/test/notebooks/nbtest_first_steps.py @@ -0,0 +1,14 @@ +# We need to manually import all fixtures that we use, directly or indirectly, +# since the pytest won't do this for us. 
+from notebook_test_utils import (
+    backend_setup,
+    notebook_runner,
+    set_log_level_for_libraries,
+)
+
+set_log_level_for_libraries()
+
+
+def test_first_steps_notebook(notebook_runner) -> None:
+    notebook_runner('main_config.ipynb')
+    notebook_runner('first_steps.ipynb')