Drop #189 (2023-01-31): But I Didn't Even Watch The First One
SQLGlot; Common Table Expressions; DuckDB SQL Tools For VS Code
Programming note: If you tapped the “Subscribe” button within the past 24 hours, you were likely greeted with Substack's menu of subscription support options. A kind and generous soul tapped the “support” button (I don't actually know what that said/looked like) and Substack put me through a “wizard” to be able to receive it and that made the subscribe interstitial appear. I need to noodle a bit on what extras I can guarantee I'll provide to paid subs, but rest assured the core daily Drops are and shall remain available to all.
Yes, I am one of the monsters that pronounces S-Q-L as “sequel” for two reasons: it's fewer syllables and annoys overly pedantic folks.
Today's Drop is all about the nigh ubiquitous Structured Query Language.
As is likely the case for most, if not all, readers, one of our core “business processes” at work relies heavily on SQL queries written in the PostgreSQL SQL dialect. This process works as intended, but it also introduces some pain, since we rely heavily on Amazon Athena (a.k.a. Trino/Presto) for many other business processes, as well as for exploratory work over a superset of the same core production data. PostgreSQL and Trino are just different enough at the Structured Query Language level to force us to maintain a slightly different copy of each production query. This is not ideal.
Since we're going to have multiple data environments for the foreseeable future (yay, deliberate technical debt!), I've been poking at various ways to reduce the pain in the above situation. Said poking brought me to one super-interesting possible solution to this pain: SQLGlot, a “no dependency Python SQL parser, transpiler, optimizer, and engine.”
SQLGlot can be used to format SQL or translate between nearly twenty different SQL dialects. The parser itself can be customized, and the framework can also help you analyze queries, traverse parsed expression trees, and incrementally (and programmatically) build SQL queries. And it doesn't just transpile active SQL code: I've found it does a fine job carrying comments (you do comment your SQL code, right?) from one dialect to another.
Here’s a small example:
```python
import sqlglot

sqlglot.transpile(
    "SELECT EPOCH_MS(1618088028295)",
    read = "duckdb",
    write = "hive"
)
# ['SELECT FROM_UNIXTIME(1618088028295 / 1000)']
```
Included with all of this “plumbing” code is support for optimizing SQL queries, and performing semantic diffs. Plus, SQLGlot can be used to unit test queries through mocks based on Python dictionaries.
The README and PyPI docs are chock-full of examples, and since Substack's <pre> blocks leave much to be desired, I'll strongly suggest reading through the official text (rather than supplying any further examples here) to see how all of these features work.
Common Table Expressions
Programming languages evolve, and data folks of a certain age can tell you stories of what SQL writing was like before 1999. That's when the WITH clause was introduced in the SQL standard.
WITH helps organize and simplify protracted and labyrinthine hierarchical queries by breaking them down into smaller, more readable chunks.
Use of the WITH keyword generally identifies a common table expression (CTE), which is used to generate a temporary named result set that you can perform other operations on. CTEs:
result in more readable code, which means easier debugging and reduced confusion between teams
can reference the results multiple times throughout the query
make it easier to perform multi-level aggregations which can then be summarized in the main query
This Metabase example does a good job showing some benefits of CTEs:
```sql
-- Original
SELECT id, total
FROM orders
WHERE total > (
  -- calculate average order total
  SELECT AVG(total)
  FROM orders
  WHERE
    -- exclude test orders
    product_id > 10 AND
    -- exclude orders before launch
    created_at > '2016-04-21' AND
    -- exclude test accounts
    user_id > 10
)
ORDER BY total DESC
```
```sql
-- CTE-ified
-- CTE to calculate average order total
-- with the name for the CTE (avg_order) and column (total)
WITH avg_order(total) AS (
  -- CTE query
  SELECT AVG(total)
  FROM orders
  WHERE
    -- exclude test orders
    product_id > 10 AND
    -- exclude orders before launch
    created_at > '2016-04-21' AND
    -- exclude test accounts
    user_id > 10
)
-- our main query:
-- orders with above-average totals
SELECT o.id, o.total
FROM orders AS o
-- join the CTE: avg_order
LEFT JOIN avg_order AS a ON true
WHERE
  -- total is above average
  o.total > a.total
ORDER BY o.total DESC
```
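If you want to kick the tires on the CTE pattern, here's a minimal, self-contained sketch using Python's built-in sqlite3 module (SQLite has supported WITH since 3.8.3). The tiny orders table is invented for the demo, and the query is a simplified version of the one above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, total REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 5.0), (2, 15.0), (3, 10.0)])

# CTE computes the average order total; the main query
# keeps only orders above that average
rows = con.execute("""
    WITH avg_order(total) AS (SELECT AVG(total) FROM orders)
    SELECT o.id, o.total
    FROM orders AS o
    JOIN avg_order AS a ON o.total > a.total
    ORDER BY o.total DESC
""").fetchall()

print(rows)  # [(2, 15.0)] -- the only order above the 10.0 average
```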
WITH syntax can differ across SQL engines, which makes SQLGlot even more appealing as a general-purpose SQL tool to have around.
The Metabase CTE documentation has extended expository and examples if you’re just getting into CTEs or have been tasked with managing CTEs in data workflows.
I included CTEs in this edition for a few reasons.
One is the “no assumptions” thing I mentioned a couple of issues ago (i.e., I didn't want to assume everyone who uses SQL knows about CTEs).
Another is that an increasing number of data folks (I'm not super-enamored with the whole 'data scientist' nom de plume anymore) are working with in-language, custom query domain-specific languages (DSLs), such as siuba and dplyr. While these are fine and helpful wrangling idioms (that, hopefully, become normalized via Substrait, covered back in July), they can lull new folks into never learning SQL-proper. And, while dplyr is getting much better at creating manageable SQL queries from dplyr chains, the generated queries are still not always as efficient or readable as ones written in pure SQL.
The final reason is that I care about ensuring the efficacy and integrity of production queries. CTEs can and do make it easier for folks who speak this “galactic basic standard” to grok and extend what's been written, as well as take advantage of all the SQL analysis tools that have been invented since 1986.
So, if you are a data person who doesn't know SQL, perhaps make that a 2023 resolution.
If you have a fav CTE guide, drop a note in the comments.
DuckDB SQL Tools For VS Code
This is a quick section since I'm just linking you to a very new VS Code extension that turns VS Code into a DuckDB querying powerhouse. With it, you can:
Connect to a local DuckDB instance
Create new in-memory DuckDB instance
View DuckDB tables, columns, and views
Run SQL queries on open DuckDB connections
Attach SQLite database files to in-memory DuckDB instances
Query remote CSV and Parquet data files with DuckDB HTTPFS extension
Create in-memory DuckDB tables from remote data sources and query results
Manage DuckDB connections in SQLTools Database Explorer
Autocomplete SQL keywords, table names, column names, and view names on open database connections in VSCode SQL editor
Save named SQL query Bookmarks
Use SQL Query History
Export SQL query results in CSV and JSON data formats
It works super well, and integrates with the equally spiffy SQLTools extension.
I'll cover the following in more depth in an upcoming edition, but if you have some cycles, keep an eye on this potential sequel to SQL, as it looks very promising. ☮