WebAssembly: The Definitive Guide; Drift; Trafilatura
WebAssembly: The Definitive Guide
I discuss WebAssembly quite a bit as it's going to impact our lives, dear reader, whether we create anything with it directly, or not. It is starting to take over the "edge" computing space, has been sneaking its way into our browsers for a while, and is snuggling up to each of our favorite programming languages.
Brian Sletten (@bsletten) thinks WebAssembly will be one of the most transformative technology components of our time. He's so excited about it, he wrote WebAssembly: The Definitive Guide. He explains it better in his own words in the book's preface:
I believe WebAssembly is an ascendant technology that has the potential to transform the entire software development industry in one form or another. I do not believe WebAssembly is going to be transformative because I am writing a book on the topic. I’m writing a book on it because I believe it will be transformative.
Presumably you are interested in the technology as well. The problem is, I think I have less of an idea of who you are as a reader than many authors do. If this were a book about a particular programming language or a specific topic, there would be a self-selecting aspect to the audience and I could proceed apace. But WebAssembly is a much larger topic than most people realize, and I am trying to paint a very large picture with this book. Most of the other books that have been published have focused on a single aspect of it, and I can understand why.
In the lead-up to the publication of this book, I have mostly gotten positive support and excitement from people I have spoken to about the project. One limited form of pushback I have gotten is with respect to the title. Some folks felt it was premature to have “The Definitive Guide” for this new of a technology. That is a fair position to take, but because I am trying to describe an extremely big and encompassing technical landscape, I thought it was reasonable. I hope by the end of the book you agree.
All I ask is that you have an open mind and a bit of patience. WebAssembly touches a lot of languages, runtimes, and operational environments. In addition to teaching you about the low-level details, we will look at integrations with the dominant programming languages in this space and several different use cases. I have tried not to make too many assumptions about your background, so I have heavily annotated the text with breadcrumbs for further exploration and discovery via footnotes. If you are a more advanced developer just seeking details about WebAssembly, feel free to ignore these and don’t take offense. I expect a rather wide audience will be at least perusing this book, and I want them to feel welcome, too.
If you are on the junior side development-wise, this will be a challenging book. But I have tried to make it possible for you to at least see what is going on. Consider the various links and references as a personal guide into a more sophisticated development reality. Don’t get overwhelmed, just tackle things one at a time in whatever order interests you or makes sense. There is no single way into this industry, and however you get there is legitimate.
At the end of the day, WebAssembly is going to allow us to basically choose our programming languages and run them securely in just about any computational surface area. We have been promised this before, but I think this time it is more likely to come to fruition. Thank you for giving me the opportunity to explain why.
Brian covers every aspect of WebAssembly (check out the table of contents on the above O'Reilly link), from writing Wasm code manually, to how it works with other programming languages, where it fits in the browser and server/edge compute ecosystems, and more.
As an author, I encourage you to drop some coin on the tome, if you can. If times are tough, you're in luck! The folks at Redpanda have a free copy for you! Just drop some legit contact info and the ~300-page book is yours. Brian likely gets some revenue from that setup, but it has to be a fraction of the already tiny fraction of royalties most technical book writers receive. Why is Redpanda being so generous? They've got a Kafka API-compatible streaming data platform that, you guessed it, makes use of WebAssembly.
Drift
I vacillate between self-hosting and using third-party services all the time. This primarily happens when I've got enough IRL things going on that I don't want to or just cannot take the time to DIY something.
Lately, I've been using Carbon as an on-again/off-again "pastebin" service for code, as I feel horrible sending folks to Microsoft (read: GitHub) for anything. (Protip: don't trust GitHub or Microsoft for literally anything.) Carbon is great, but its primary mission is making source code pretty, so you can annoy your non-technical social media pals with "source code editor" screenshots of your code gibberish.
There are many self-hostable GitHub Gist-like code pastebin services available, and one relatively new kid on the block is Drift [GH].
Drift is written in TypeScript, has an ambitious plan that's nearly complete, and is pretty lightweight. It didn't seem to recognize R code (for syntax highlighting), but it did a pretty good job on some Wasm/WAT source code.
Trafilatura
Trafilatura (paper [direct PDF]) [GH] is "a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to various commonly used formats".
It has scads of features:
Web crawling and text discovery:
Focused crawling and politeness rules
Support for sitemaps (TXT, XML) and feeds (ATOM, JSON, RSS)
URL management (blacklists, filtering and de-duplication)
Seamless and parallel processing, online and offline:
URLs, HTML files or parsed HTML trees usable as input
Efficient and polite processing of download queues
Conversion of previously downloaded files
Robust and efficient extraction:
Main text (with LXML, common patterns and generic algorithms: jusText, fork of readability-lxml)
Metadata (title, author, date, site name, categories and tags)
Formatting and structural elements: paragraphs, titles, lists, quotes, code, line breaks, in-line text formatting
Comments (if applicable)
Output formats:
Text (minimal formatting or Markdown)
CSV (with metadata)
JSON (with metadata)
XML (with metadata, text formatting and page structure) and
Optional add-ons:
Language detection on extracted content
Graphical user interface (GUI)
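To make the "URL management (blacklists, filtering and de-duplication)" bullet a bit more concrete, here is a tiny standard-library sketch of the general idea. It is not Trafilatura's implementation. The `normalize` and `filter_urls` names, the normalization rules, and the example hosts are all assumptions made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Illustrative normalization: lowercase scheme and host, drop the
    fragment, and strip a trailing slash from the path."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def filter_urls(urls, blacklisted_hosts):
    """Return normalized URLs in first-seen order, skipping blacklisted
    hosts and anything already seen after normalization."""
    seen = set()
    kept = []
    for url in urls:
        norm = normalize(url)
        host = urlsplit(norm).hostname or ""
        if host in blacklisted_hosts or norm in seen:
            continue
        seen.add(norm)
        kept.append(norm)
    return kept
```

A real crawler would apply far more careful rules (tracking parameters, redirects, language variants); the point here is only the shape of the filter-then-dedupe step.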
On top of being a decent crawler, it's also one of the better "just the text" extraction libraries/tools out there. You can give it a go with something like:
$ trafilatura \
    --json \
    -u "https://www.npr.org/2022/07/31/1114792935/nichelle-nichols-dies-star-trek"
right from the command line.
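If you'd rather stay in Python, a minimal sketch of the same extraction might look like the following. It assumes the package's `fetch_url` and `extract` functions, with `output_format="json"` mirroring the `--json` flag. `extract_as_dict` and `preview` are hypothetical helper names invented for this example, and the `title` and `text` keys read via `.get()` are assumptions about the JSON output, which is why missing keys are tolerated.

```python
import json
from typing import Optional


def extract_as_dict(url: str) -> Optional[dict]:
    """Download `url` and return Trafilatura's JSON output as a dict,
    or None if the download or the extraction yields nothing."""
    # Imported lazily so `preview` below works without the package installed.
    import trafilatura  # pip install trafilatura

    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:
        return None
    raw = trafilatura.extract(downloaded, output_format="json")
    return json.loads(raw) if raw else None


def preview(record: dict, width: int = 80) -> str:
    """Hypothetical helper: one-line summary of an extracted record."""
    title = record.get("title") or "(untitled)"
    # Collapse all runs of whitespace so the preview stays on one line.
    text = " ".join((record.get("text") or "").split())
    return f"{title}: {text[:width]}"


# Usage (needs network access and the trafilatura package):
#   url = "https://www.npr.org/2022/07/31/1114792935/nichelle-nichols-dies-star-trek"
#   record = extract_as_dict(url)
#   print(preview(record) if record else "nothing extracted")
```

Keeping the download and the extraction as separate calls also means you can feed `extract` HTML you already have on disk, which lines up with the "conversion of previously downloaded files" feature.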
R.I.P. Nichelle Nichols. ☮