My Embeddings Are My Passport; The ALAN Parsers Project; Million Mile Megapixels
My Embeddings Are My Passport
(YouTube ancient reference hint for the readers who are youngsters.)
Despite promises of the forthcoming demise of "the password" we're likely going to be in password Hades for quite some time.
You can (and should!) use a password manager to create unique, "strong" passwords for each context where you need credentials (and, hopefully, all the services you use also require multifactor authentication; ditch them if they don't). Notice the word "create". We humans are lazy and our ability to pick a decent "strong" password based on rules is just plain terrible. Different password managers have different options for password generation, and it'd be cool to know just how "guessable" a given generated password is. How one approaches said guesses matters. Random brute-forcing can take a long time. Even random brute-forcing from passwords in known breaches can take a bit.
It'd be neat if the same tool that can generate a password can also more practically determine the "strength" than arbitrary complexity rules. (I keep quoting the word "strength" because the cybersecurity intelligentsia got us into our current credential Hades by non-data-driven gut-calls on efficacious password policies.) We may have said tool!
Rajashekar Chintalapati (@rajashekar) and Gaurav Sood recently published "Pass-Fail: Using a Password Generator to Improve Password Strength". Here's a bit of the intro:
Despite most modern browsers providing password generator and management tools, many users do not pick long, randomly generated passwords. To build a tool that offers advice on a) strength of passwords (that goes beyond using password length) and b) how to make stronger passwords, we built an ML model. Using ~881M leaked passwords from various databases, we built a character-level password generator using an encoder-decoder model. (We also built a separate model using a database of 18M passwords.) We then validate the model against the HaveIBeenPwned (HIBP) database. Of the roughly 10,000 model-generated 5-character-long passwords, 58% match a password in the HIBP database compared to .1% of the 10,000 randomly generated passwords. As a more stringent test, we estimated the HIBP match rate for model-generated passwords that are not in the training corpus. For passwords starting with 10 of the most common characters, the match rate is about 10%.
We use the model to understand one of the correlates of a strong password---the starting character. In our data, the starting characters of passwords have a sharp skew, with about 30 characters covering about 83% of the passwords. And, understandably, passwords starting with more common characters can be guessed by the model more quickly (the correlation is 84% for five char. passwords on our data). This suggests that there is an opportunity to create passwords using less commonly used starting characters (see the distribution of first characters in our notebooks linked to above).
Our model can also be used to estimate the strength of a password. Admittedly, the job is computationally heavy. And approximate inference based on, e.g., maximum percentage matched in the first 100 tries, may be useful. For illustration, our generator can recover the password 'Password1' in ~ 1000 tries when the search space for nine-character passwords with 95 tokens is in the quadrillions.
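To put that "quadrillions" figure in perspective, here's a quick back-of-the-envelope check. It is only a sketch: it takes the 95 tokens and the ~1,000 tries straight from the quote above, and it assumes (my assumption, not the paper's) that a naive attacker guessing uniformly at random would need to cover half the space on average.

```python
# Back-of-the-envelope check of the search-space claim quoted above.
ALPHABET_SIZE = 95       # tokens, per the paper's intro
PASSWORD_LENGTH = 9      # len("Password1")
MODEL_TRIES = 1_000      # approximate figure quoted above

# Every position can hold any of the 95 tokens.
search_space = ALPHABET_SIZE ** PASSWORD_LENGTH

# Assumption: a uniform random search covers half the space on average.
expected_brute_force_tries = search_space // 2

print(f"search space: {search_space:,}")
print(f"brute force vs. model: {expected_brute_force_tries / MODEL_TRIES:.1e}x more tries")
```

The search space works out to 630,249,409,724,609,375 candidates, or roughly 630 quadrillion, which is why a model that lands on the answer in about a thousand guesses is worth paying attention to.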
Their notebooks and embeddings are available at the repo link, along with a good deal more exposition. While I suspect none of us will be using that model in any product (or even at our own command lines), it's pretty cool seeing defenders use data science tools to improve practical things in the cybersecurity space.
FWIW, you may just want to focus on more basic generators that produce three random words vs. ones that rely on TensorFlow, as complexity is the enemy of pretty much everything.
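For a sense of how little machinery a "three random words" generator needs, here's a minimal sketch using only the Python standard library. The function names and the tiny demo word list are mine and purely illustrative; a real generator would load a list with thousands of entries. The entropy estimate assumes the attacker knows the word list and that each word is drawn uniformly and independently.

```python
import math
import secrets

def three_word_passphrase(words, count=3, sep="-"):
    """Join `count` words picked uniformly at random with the OS CSPRNG."""
    unique = sorted(set(words))
    if len(unique) < 2:
        raise ValueError("need at least two distinct words")
    return sep.join(secrets.choice(unique) for _ in range(count))

def passphrase_entropy_bits(wordlist_size, count=3):
    """Entropy in bits, assuming uniform, independent picks from a known list."""
    return count * math.log2(wordlist_size)

# Toy list for illustration only; far too small for real use.
DEMO_WORDS = ["anchor", "biscuit", "comet", "dahlia", "ember", "fjord", "gravel", "harbor"]

if __name__ == "__main__":
    print(three_word_passphrase(DEMO_WORDS))
    # 7,776 is the size of a diceware-style list (6**5 entries).
    print(f"{passphrase_entropy_bits(7776):.1f} bits with a 7,776-word list")
```

Using `secrets` rather than `random` matters here, since `random` is not designed for security-sensitive choices.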
The ALAN Parsers Project
You have to respect a backronym when you see a solid one, and Trail of Bits' coining of "Automated Lexical Annotation and Navigation of Parsers" solely to get this section's title backronym is most certainly praise-worthy. If you don't know who/what Trail of Bits (ToB) is, drop this newsletter for the day, head over to their site and read everything you can. They're great, smart humans.
The ALAN Parsers Project consists of:
polyfile: A pure Python clean-room implementation of libmagic, with instrumented parsing from Kaitai Struct and an interactive hex viewer
polytracker: An LLVM-based instrumentation tool for universal taint tracking, dataflow analysis, and tracing.
Files are treacherous things. I'll let ToB explain why:
Parsing is hard, even when a file format is well specified. But when the specification is ambiguous, it leads to unintended and strange parser and interpreter behaviors that make file formats susceptible to security vulnerabilities. What if we could automatically generate a “safe” subset of any file format, along with an associated, verified parser? That’s our collective goal in Dr. Sergey Bratus’s DARPA SafeDocs program.
But wait—why is parsing hard in the first place? Design decisions like embedded scripting languages, complex context-sensitive grammars, and object models that allow arbitrary dependencies between objects may have looked like good ways to enrich a format, but they increase the attack surface of a parser, leading to forgotten or bypassed security checks, denial of service, privacy leakage, information hiding, and even hidden malicious payloads.
We've talked about polyglot files in a previous edition, but if you missed that and are still under the notion that files are generally benign things, clone this PDF file since it, in standalone form, is a git repo that contains its own LaTeX source and a copy of itself.
I introduce the ALAN Parsers Project today primarily to talk about a recent update to polyfile. We'll hit up Kaitai Struct and polytracker in future editions.
In "libmagic: The Blathering", Evan Sultanik (@esultanik), Principal Security Engineer @ ToB, runs through "a compendium of the oddities that we uncovered while developing our pure Python cleanroom implementation of libmagic". Evan set out on this magical quest because the previously mentioned polyfile used the TrID File Identifier definition database to "guess" file contents, and it was becoming too slow and was misidentifying files more frequently.
Even if you don't code on-the-regular, Evan's post is a great read, and I think you'll like playing with polyfile even if you do not have regular cause to poke around at the inner workings of specific files.
If you do want to check out the PDF mentioned above, I suggest curling https://www.alchemistowl.org/pocorgtfo/pocorgtfo15.pdf, renaming it to pdf.zip, unzipping it (at a command line), and then running polyfile --html pdf.html PDFGitPolyglot.pdf. You can also just view said output here.
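If you're wondering how one file can be both a PDF and something you can unzip, here's a small sketch of the idea using only the Python standard library. To be clear, this is not how polyfile or libmagic work internally; `detect_formats` and `make_demo_polyglot` are hypothetical helpers of mine, and the demo file only carries a PDF header (it won't render). The sketch leans on two assumptions: many PDF readers tolerate the header anywhere in the first 1,024 bytes, and ZIP readers locate the archive from a record at the end of the file.

```python
import io
import zipfile

def detect_formats(data: bytes) -> set:
    """Return the formats whose signatures show up in `data` (toy check)."""
    found = set()
    # Assumption: lenient readers accept "%PDF-" anywhere in the first 1024 bytes.
    if b"%PDF-" in data[:1024]:
        found.add("pdf")
    # ZIP readers look for the end-of-central-directory record near the end
    # of the file, so leading bytes in another format do not bother them.
    if zipfile.is_zipfile(io.BytesIO(data)):
        found.add("zip")
    return found

def make_demo_polyglot() -> bytes:
    """Build a toy file: a PDF-looking header followed by a real ZIP archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("hello.txt", "hi")
    return b"%PDF-1.5\n% toy header only, not a renderable PDF\n" + buf.getvalue()

if __name__ == "__main__":
    print(sorted(detect_formats(make_demo_polyglot())))
```

Two parsers, two different ideas of where the "real" file starts, and both are satisfied. That ambiguity is exactly what the ToB folks are poking at.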
Million Mile Megapixels
I'm fairly certain even indigenous folks on undiscovered islands managed to see the James Webb Space Telescope images this week.
While most folks seem to be satisfied rolling with memes:
I'm still marveling that we humans are able to pull 57 gigabytes of downlink capacity per day from a second Lagrange point distance away. Perhaps there may yet be hope for us.