Bonus Drop #19 (2023-07-23)

Rusty Parquet; It’s All About That [Fast] Base[64]; Write!

Some major things came up at $WORK this afternoon, so I’m afraid I need to lean on releasing a Bonus Drop vs. the regularly scheduled programming.


Once more, a diverse array of topics is on tap today. Whether you’re interested in wrangling Parquet files at the command line, are WASM- or speed-curious, or want to strengthen your writing skills, there should be something for you across the three sections.


Topics covered:

  • pqrs

  • fast-base64 (via WASM)

  • [Blog] Writing for Developers


Rusty Parquet


We are awash in Parquet1 files at work, and I have a few of my own lying around across a half dozen SSDs as well. I remember, back in the day, when one had to use (ugh) Java-based tooling to do almost anything with these files, at least at the command-line. Thankfully, there are scads of non-Java, modern tools that let us do all sorts of natural and unnatural things to these files, and one super-handy one is pqrs (“Parquet Tools In Rust”).

This utility helps you perform various operations on one or more Parquet files. The functionality is broken down into sub-commands:

  • cat: prints the contents of parquet file(s)

  • head: prints the first n records of the parquet file

  • merge: merge file(s) into another parquet file

  • row-count: prints the count of rows in parquet file(s)

  • sample: prints a random sample of records from the parquet file

  • schema: prints the schema of parquet file(s)

  • size: prints the size of parquet file(s)

While the README for the linked repo is very comprehensive, I’ll drop a few sub-command examples here. FWIW, I’m not “redacting” any of the filenames you’ll see in them. The original ones are just far too long and are filled with cruft from AWS’ Glue service.

Curious about the structure of one or more Parquet files? schema’s got your back:

$ pqrs schema file01.parquet
Metadata for file: file01.parquet

version: 1
num of rows: 250
created by: parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
message hive_schema {
  OPTIONAL BYTE_ARRAY apps (UTF8);
  OPTIONAL group emails (LIST) {
    REPEATED group bag {
      OPTIONAL BYTE_ARRAY array_element (UTF8);
    }
  }
  OPTIONAL BYTE_ARRAY favicon_mmh3_128 (UTF8);
  OPTIONAL INT32 favicon_mmh3_32;
  OPTIONAL BYTE_ARRAY headers (UTF8);
  OPTIONAL group ips (LIST) {
    REPEATED group bag {
      OPTIONAL BYTE_ARRAY array_element (UTF8);
    }
  }
  OPTIONAL INT32 knock_port;
  OPTIONAL group links (LIST) {
    REPEATED group bag {
      OPTIONAL BYTE_ARRAY array_element (UTF8);
    }
  }
  OPTIONAL INT96 max_sensor_timestamp;
  OPTIONAL INT96 result_timestamp;
  OPTIONAL INT64 result_timestamp_ns;
  OPTIONAL BYTE_ARRAY s3_path (UTF8);
  OPTIONAL BYTE_ARRAY sensor_ip (UTF8);
  OPTIONAL BYTE_ARRAY sha256 (UTF8);
  OPTIONAL BYTE_ARRAY source_host (UTF8);
  OPTIONAL BYTE_ARRAY task_id (UTF8);
  OPTIONAL BYTE_ARRAY title (UTF8);
  OPTIONAL BYTE_ARRAY tlsh256 (UTF8);
  OPTIONAL BOOLEAN tor_exit;
  OPTIONAL BYTE_ARRAY words (UTF8);
  OPTIONAL BYTE_ARRAY xxh64 (UTF8);
  OPTIONAL BYTE_ARRAY jarm (UTF8);
}

As you can see, it does a fine job with both “plain” and complex columns.

Checking the row counts is also very straightforward:

$ pqrs row-count 2023-06-20*.parquet
File Name: 2023-06-20-10-category.parquet: 38549 rows
File Name: 2023-06-20-11-category.parquet: 37958 rows
File Name: 2023-06-20-12-category.parquet: 37656 rows
File Name: 2023-06-20-13-category.parquet: 36737 rows
File Name: 2023-06-20-14-category.parquet: 36461 rows
File Name: 2023-06-20-15-category.parquet: 36745 rows
File Name: 2023-06-20-16-category.parquet: 36636 rows
File Name: 2023-06-20-17-category.parquet: 37173 rows
File Name: 2023-06-20-18-category.parquet: 36455 rows
File Name: 2023-06-20-1-category.parquet: 31785 rows
File Name: 2023-06-20-20-category.parquet: 74221 rows
File Name: 2023-06-20-21-category.parquet: 36666 rows
File Name: 2023-06-20-22-category.parquet: 37320 rows
File Name: 2023-06-20-23-category.parquet: 36835 rows
File Name: 2023-06-20-2-category.parquet: 32360 rows
File Name: 2023-06-20-3-category.parquet: 75155 rows
File Name: 2023-06-20-4-category.parquet: 39310 rows
File Name: 2023-06-20-5-category.parquet: 40363 rows
File Name: 2023-06-20-6-category.parquet: 40897 rows
File Name: 2023-06-20-7-category.parquet: 40978 rows
File Name: 2023-06-20-8-category.parquet: 40202 rows
File Name: 2023-06-20-9-category.parquet: 39179 rows

Now, I am fully aware of how unoptimized all those small hourly files are. So, let’s fix that!

$ pqrs merge --output ../2023-06-bulk-data.parquet --input 2023-06*.parquet

$ pqrs row-count ../2023-06-bulk-data.parquet
File Name: ../2023-06-bulk-data.parquet: 16393758 rows

There are a few ways to take a look at the contents of Parquet files. One is by sampling:

$ pqrs sample -n 5 2023-06-bulk-data.parquet
{tsday: 2023-06-02, ip: "186.97.222.203", category: "isp", organization: "Colombia Móvil"}
{tsday: 2023-06-07, ip: "219.167.203.208", category: "isp", organization: "NTT Communications Corporation"}
{tsday: 2023-06-14, ip: "49.228.97.254", category: "mobile", organization: "AIS Fibre"}
{tsday: 2023-06-18, ip: "212.220.71.42", category: "isp", organization: "PJSC Rostelecom"}
{tsday: 2023-06-19, ip: "8.219.246.42", category: "hosting", organization: "Alibaba (US) Technology Co., Ltd."}

Adding --json to that will change that semi-useless format to lovely jsonlines/ndjson.
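That matters because ndjson flows straight into jq. A quick sketch of the pattern (the record below is echoed rather than actually sampled, since the data files aren’t shipped with this post):

```shell
# `pqrs sample --json …` emits one JSON object per line, which pipes
# cleanly into jq. Simulated with echo here; a real pipeline would be:
#   pqrs sample --json -n 5 2023-06-bulk-data.parquet | jq -r '.ip'
echo '{"tsday":"2023-06-02","ip":"186.97.222.203","category":"isp"}' |
  jq -r '.ip'
```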

But, we can also just look at the first n records with the head sub-command. It has both --json and --csv (with or without a header) options, so we’ll show the CSV output:

$ pqrs head --csv -n 5 2023-06-bulk-data.parquet
tsday,ip,category,organization
2023-06-01,51.15.226.136,hosting,SCALEWAY S.A.S.
2023-06-01,165.154.121.59,hosting,UCLOUD INFORMATION TECHNOLOGY (HK) LIMITED
2023-06-01,118.98.121.241,isp,PT Telekomunikasi Indonesia
2023-06-01,43.152.67.226,hosting,"Tencent Building, Kejizhongyi Avenue"
2023-06-01,14.47.0.120,isp,Korea Telecom

The cat sub-command has the same options as the head one and is useful for converting Parquet files to CSV.

Note that the CSV output only works for Parquet files with “simple” column types.

And, finally, we can get the size of the file(s) either in raw bytes or something we humans can read:

$ pqrs size --pretty 2023-06-bulk-data.parquet
Size in Bytes:

File Name: 2023-06-bulk-data.parquet
Uncompressed Size: 297 MiB

It’s super handy, and one cargo install pqrs away.

It’s All About That [Fast] Base[64]


We 💙 WebAssembly (WASM) here at the Drop. One reason for the cool feels is that it enables developers to create high-performance applications that run directly in the browser. One such example is the fast-base64 library, a WASM-based implementation of the popular base64 encoding and decoding algorithm.

What makes this one special (to me) is that it is lovingly hand-crafted in bespoke WASM (well, WAT — the text version of WASM). That and it’s fast.

Key features include:

  • converting base64 to and from Uint8Array

  • utilizing SIMD (Single Instruction, Multiple Data) instructions if the browser supports it

  • compatibility with Node.js and Deno

  • performance improvements over other JavaScript implementations, with up to 20x faster encoding and decoding on the first 100kB of input and 2-3x faster on the first 100MB
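If the underlying transformation is fuzzy, here’s a tiny round trip sketched with coreutils’ base64 as a stand-in: this bytes-to-text-and-back conversion is exactly what fast-base64 performs (over Uint8Arrays, and much faster).

```shell
# Base64 round trip with coreutils, just to show the transformation
# fast-base64 does over typed arrays:
printf 'Hi!' | base64        # three raw bytes in -> "SGkh" out
printf 'SGkh' | base64 -d    # and back to the original bytes: "Hi!"
```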

As the author notes, you likely do not need to use this library, but it’s a great example both of raw coding in WASM/WAT and of packaging up WASM and exposing its core functionality to JS.

The WAT code is very readable (there are even comments!), so — if nothing else — folks interested in how all this fancy stuff works will likely find that interesting.

I threw together a small demo (mostly based on their examples) for how to use this in a basic web page context (no fancy “projects” required).

[Blog] Writing for Developers


I’ve been challenging folks to create quite a bit in these Drops. That doesn’t mean I’m suggesting everyone starts writing mountains of code. Writing — and, in the case of this section, longer-form writing — holds immense power as a form of communication. Whether it be documentation, blogs, essays, or even giving threaded content away to social media sites, what we craft can foster learning and give our individual voices a chance to shine.

In a recent blog post, Robin Moffatt — a Principal DevEx Engineer at lakeFS — offers some clear and useful guidance on “Blog Writing for Developers”. As you read it, you’ll see that the advice works across many writing contexts.

Robin’s approach to writing relies on both structure and technique. To help understand why, he recommends taking a look at Larry McEnerney’s lecture on writing effectively, which offers invaluable insights into crafting compelling prose. That lecture can serve as a roadmap for navigating the challenges of writing efficiently and effectively.

When writing specifically for developers, Robin offers three key dimensions that should be considered: clarity, personality, and uniformity of content.

Clarity is a fundamental aspect of successful writing, often achieved through clear sentence construction, thoughtful paragraph breaks, and an overall structured approach. It’s worth noting that in the realm of writing, more is not always better. The efficiency of words often triumphs over sheer quantity.

The degree of personality and voice embedded in writing hinges largely on the intended audience and the purpose of the piece. For instance, blogs typically allow for more personal expression, while technical documentation usually requires a neutral tone.

Uniformity and standardization of content are largely context-dependent. Company blogs, for instance, often prioritize consistency to maintain brand voice and identity, whereas personal blogs offer more freedom for creative exploration and expression.

To maintain structure and clarity, the author recommends a particular methodology for blog writing:

  • first, inform the reader what the piece will cover,

  • then present the core content, and finally,

  • recap what has been discussed.

This format helps reinforce the key points and ensures the readers fully grasp the content.

I am in full agreement with Robin that no one should fear the writing process.

Def give the whole thing a 👀!


FIN

Thank y’all once more for your support! ☮

1: I love this word since it has a “GIF” vibe. I always pronounce it like the margarine (“par-kay”) to see who does/doesn’t cringe.

