Drop #189 (2023-01-31): But I Didn't Even Watch The First One

SQLGlot; Common Table Expressions; DuckDB SQL Tools For VS Code

Programming note: If you tapped the “Subscribe” button within the past 24 hours, you were likely greeted with Substack’s menu of subscription support options. A kind and generous soul tapped the “support” button (I don’t actually know what that said/looked like), and Substack put me through a “wizard” so I could receive it, which made the subscribe interstitial appear. I need to noodle a bit on what extras I can guarantee I’ll provide to paid subs, but rest assured the core daily Drops are and shall remain available to all.

Yes, I am one of the monsters who pronounce S-Q-L as “sequel”, for two reasons: it’s fewer syllables, and it annoys overly pedantic folks.

Today’s Drop is all about the nigh-ubiquitous Structured Query Language.

SQLGlot


As is likely the case for most, if not all, readers, one of our core “business processes” at work relies heavily on SQL queries written in the PostgreSQL SQL dialect. This process works as intended, but it also introduces some pain, since we rely heavily on Amazon Athena (a.k.a. Trino/Presto) for many other business processes, as well as for exploratory work over a superset of that same core process’s production data. PostgreSQL and Trino are just different enough at the Structured Query Language level to force us to maintain a slightly different copy of each of the production queries. This is not ideal.

Since we’re going to have multiple data environments for the foreseeable future (yay, deliberate technical debt!), I’ve been poking at various ways to reduce the pain in the above situation. Said poking brought me to one super-interesting possible solution to this pain: SQLGlot, a “no dependency Python SQL parser, transpiler, optimizer, and engine.”

SQLGlot can be used to format SQL or translate between nearly twenty different SQL dialects. The parser itself can be customized, and the framework can also help you analyze queries, traverse parsed expression trees, and incrementally (and programmatically) build SQL queries. It doesn’t just transpile active SQL code, either: I’ve found it does a fine job moving comments (you do comment your SQL code, right?) from one dialect to another.

Here’s a small example:

import sqlglot

sqlglot.transpile(
  "SELECT EPOCH_MS(1618088028295)",
  read = "duckdb",
  write = "hive"
)[0]

'SELECT FROM_UNIXTIME(1618088028295 / 1000)'

Included with all of this “plumbing” code is support for optimizing SQL queries and performing semantic diffs. Plus, SQLGlot can be used to unit test queries through mocks based on Python dictionaries.

The README and PyPI docs are chock-full of examples, and since Substack’s <pre> blocks leave much to be desired, I’ll strongly suggest reading through the official text (vs. supplying any further examples) to see how all of these features work.

Common Table Expressions


Programming languages evolve, and data folks of a certain age can tell you stories of what SQL writing was like before 1999. That’s when the WITH clause was introduced in the SQL standard. WITH helps organize and simplify protracted and labyrinthine hierarchical queries by breaking them down into smaller, more readable chunks.

Use of the WITH keyword generally identifies a common table expression (CTE) which is used to generate a temporary named result set that you can perform other operations on. CTEs:

  • result in more readable code, which means easier debugging and reduced confusion between teams

  • can be referenced multiple times throughout the query

  • make it easier to perform multi-level aggregations which can then be summarized in the main query

This Metabase example does a good job showing some benefits of CTEs:

-- Original
SELECT
  id,
  total
FROM
  orders
WHERE
  total > (
    -- calculate average order total
    SELECT
      AVG(total)
    FROM
      orders
    WHERE
      -- exclude test orders
      product_id > 10
      AND -- exclude orders before launch
      created_at > '2016-04-21'
      AND -- exclude test accounts
      user_id > 10
  )
ORDER BY
  total DESC
-- CTE-ified
-- CTE to calculate average order total
-- with the name for the CTE (avg_order) and column (total)
WITH avg_order(total) AS (

-- CTE query
  SELECT
    AVG(total)
  FROM
    orders
  WHERE
    -- exclude test orders
    product_id > 10
    AND -- exclude orders before launch
    created_at > '2016-04-21'
    AND -- exclude test accounts
    user_id > 10
)

-- our main query:
-- orders with above-average totals
SELECT
  o.id,
  o.total
FROM
  orders AS o
  -- join the CTE: avg_order
  CROSS JOIN avg_order AS a
WHERE
  -- total is above average
  o.total > a.total
ORDER BY
  o.total DESC

The WITH syntax can differ across SQL engines, which makes SQLGlot even more appealing as a general-purpose SQL tool to have around.

The Metabase CTE documentation has extended exposition and examples if you’re just getting into CTEs or have been tasked with managing CTEs in data workflows.

I included CTEs in this edition for a few reasons.

One is the “no assumptions” thing I mentioned a couple of issues ago (i.e., I didn’t want to assume everyone who uses SQL knows about WITH/CTEs).

Another is that an increasing number of data folks (I’m not super-enamored with the whole ‘data scientist’ nom de plume anymore) are working with in-language, custom query domain-specific languages (DSLs), such as siuba and dplyr. While these are fine and helpful wrangling idioms (that, hopefully, become normalized via Substrait, covered back in July), they can lull new folks into never learning SQL-proper. And, while dplyr is getting much better at creating manageable SQL queries from dplyr chains, they are still not always as efficient or readable as ones written in pure SQL.

The final reason is that I care about ensuring the efficacy and integrity of production queries. CTEs can and do make it easier for folks who speak this “galactic basic standard” to grok and extend what’s been written, as well as take advantage of all the SQL analysis tools that have been invented since 1986.

So, if you are a data person who doesn’t know SQL, perhaps make that a 2023 resolution.

If you have a fav CTE guide, drop a note in the comments.

DuckDB SQL Tools For VS Code


This is a quick section since I’m just linking you to a very new VS Code extension that turns VS Code into a DuckDB querying powerhouse. With it, you can:

  • Connect to a local DuckDB instance

  • Create new in-memory DuckDB instance

  • View DuckDB tables, columns, and views

  • Run SQL queries on open DuckDB connections

  • Attach SQLite database files to in-memory DuckDB instances

  • Query remote CSV and Parquet data files with DuckDB HTTPFS extension

  • Create in-memory DuckDB tables from remote data sources and query results

  • Manage DuckDB connections in SQLTools Database Explorer

  • Autocomplete SQL keywords, table names, column names, and view names on open database connections in the VS Code SQL editor

  • Save named SQL query Bookmarks

  • Use SQL Query History

  • Export SQL query results in CSV and JSON data formats

It works super well, and can integrate with the equally spiffy SQL Tools extension.

FIN

I’ll cover the following in more depth in an upcoming edition, but if you have some cycles, keep an eye on this potential sequel to SQL, as it looks very promising. ☮

