Skip to content
al4576 edited this page Dec 9, 2024 · 1 revision
  • al4576 Alice Lin
    • all

Name of the Technology/Tool

PRQL (Pipelined Relational Query Language)

The Problem and Solution

  • Explain the problem that it solves.

SQL, despite its widespread adoption since the 1970s, faces significant limitations that limit its effectiveness in modern database environments. When dealing with intricate queries and huge datasets, the language's scalability issues frequently result in resource-intensive procedures and performance loss. When working with numerous tables or sizable datasets, the intricacy of join procedures can significantly affect query speed. Developers can find it difficult to fully utilize SQL's capabilities due to its steep learning curve for advanced features and optimization strategies. As applications grow in size, performance problems become more noticeable because poorly constructed queries can result in hidden bottlenecks, decreased throughput, and higher resource usage.

  • How does the technology solve the problem?

To fix these problems, PRQL (Pipelined Relational Query Language) introduces a pipeline-based approach where transformations flow logically from top to bottom, making queries easier to read and maintain. The language incorporates modern programming features such as variables, functions, and improved null handling, while eliminating common SQL pain points like trailing comma issues and repeated column references. PRQL's design philosophy is to be pipelined (obviously), simple, open, extensible, and analytical. That is to say, each line of the query is based off the previous line’s result, queries are simple to compose, the language is open-source, is meant to be bulit upon, and prioritizes analytic queries.

PRQL's architecture is based on a curated set of orthogonal transformations that combine to form a coherent pipeline, making queries both composable and extensible. The language compiles to standard SQL, ensuring compatibility with existing database systems while providing additional benefits like column lineage tracking and type inspection capabilities. In addition, if PRQL doesn't offer an implementation for SQL features, SQL can be embedded directly in the code.

  • What are the alternatives, and what are the pros and cons of the technology compared with alternatives? (what makes it unique?)

There are alternative query languages to PRQL, however each come with specific limitations. For instance, dplyr and dbplyr offer pipeline-based data processing, but they are only compatible with R and need a R runtime in order to compile SQL. Kusto Query Language (KQL) also has a pipelined syntax, but only works with Kusto-compatible data. N1QL offers a SQL-like syntax and mild learning curve but is only intended for JSON data in Couchbase services.

PRQL distinguishes itself through several unique advantages. Because it's a new language, it allows compilation to multiple SQL dialects while maintaining compatibility with any SQL database. The developer experience is significantly enhanced through its clear pipeline structure, simple to use and read syntax, and straightforward query modification capabilities. This combination of factors sets PRQL apart from alternatives that typically restrict users to specific ecosystems or databases.

However, the biggest con of PRQL by far is that PRQL doesn't support INSERT, UPDATE, or DELETE operations. By design, PRQL restricts development to only SELECT operations in order to excel at data analysis and transformations. Thus, even simple operations like creating and populating temporary tables require traditional SQL. This poses data pipeline issues, where analysts must translate their PRQL code to SQL before they can be used in a production data pipeline.

PRQL's pipeline structure naturally mirrors internal query execution plans in database systems. Each transformation in PRQL explicitly represents a step in data processing, making the execution flow clear and predictable. Unlike SQL, where the execution order may differ from the written order, PRQL's linear pipeline directly translates to how databases process queries internally.

  • How it relates to concepts from 4111.

PRQL's pipeline structure naturally mirrors internal query execution plans in database systems, with each transformation explicitly representing a step in data processing that makes the execution flow clear and predictable. The linear pipeline of PRQL corresponds directly to the internal query processing of databases, in contrast to SQL, where the execution order may deviate from the written order. With transformations like 'select' implementing projection, 'filter' implementing selection, 'join' implementing various joins, 'aggregate' implementing grouping operations, and 'derive' enabling column transformations, the language directly builds upon the concepts of relational algebra while enhancing the intuitiveness of queries. PRQL makes it easier to read and write queries based off relational algebra, making the relational algebra learned in class more useful and the code to be written, easier to write.

Tutorial

Example

For the example, take a SQL database for a scheduler/class-tracking application. The following SQL query is rewritten using PRQL.

SELECT Students.sid AS student_id, Students.name AS student_name, Classes.name AS class_name
FROM Replies JOIN Posts ON Replies.pid = Posts.pid
JOIN Students ON Replies.sid = Students.sid
JOIN Classes ON Posts.cid = Classes.cid
WHERE Replies.date_created <= Posts.date_created + INTERVAL '1 week' AND Replies.date_created > Posts.date_created;

Tutorial

PRQL is a language that often needs to be compiled into SQL to be used to work with databases. There are several ways to do this, such as the built in bindings PRQL offers for several languages (eg Python: pip install prql-py), or the PRQL CLI where users install the CLI, write a file with the .prql suffix, and then compile the file to SQL. Documentation can be found here.

For actually writing in PRQL, the workflow typically follows this workflow: identify the table, do the necessary joins, filter by the criteria, then select the necessary information. For this example, the replies table is identified, then joined with posts, students, and classes. Then the criteria are used as the filters, in this case ensuring replies are created after the post and the reply was within a week of the post date. Finally, the desired columns are selected, student_id, student_name, and class_name. As mentioned prior, the workflow is much simpler than SQL, with each line building on top of the last and each line happening in sequential order.

from replies
join posts (
    pid == replies.pid
)
join students (
    sid == replies.sid
)
join classes (
    cid == posts.cid
)
filter replies.date_created > posts.date_created
filter replies.date_created <= (posts.date_created + days 7)
select {
    student_id = students.sid,
    student_name = students.name,
    class_name = classes.name
}

Further examples can be found here in the PRQL Playground.

Sources

Clone this wiki locally