

Drop #292 (2023-07-10): It Was The Myth Of Fingerprints
Fingerprinting TLS Clients & Servers; Fingerprinting Web Content; Fingerprinting HTTP Server Headers
Unlike Paul Simon's lyrical assertion, fingerprints across many contexts are far from the same.
Today, we look at digital fingerprints across some diverse contexts, but with similar purposes to help cybersecurity researchers identify potentially bad things.
Even if you're not in cybersecurity, I think the concepts in each section could spark some reader creativity to apply similar techniques in your own domains.
Fingerprinting TLS Clients & Servers
Most readers use their [networking] superpowers for “good”. Y'all browse, curl, wget, {requests}/{reqwest}/{httr} — directly or via some fancy client/app — with the best of intentions; more often than not to pore over pictures of cats/doggos. Similarly, you stand up nginx/Caddy instances, Shiny servers, Jupyter Labs, Flask apps, and more to share useful content with the world (and, sometimes, inane memes).
Anyone in cybersecurity knows said assertions are not universally true. Trickbot and Emotet are two classic examples of this. Each has a client and server component that use the same encryption we rely on for a faux sense of safety in internet communications to accomplish their respective malicious tasks. However, it turns out we can use their use of this encryption tech against them when trying to detect their presence in our networks or on the internet.
We can do said detection with JA3/JA3S hashes.
Client-side JA3 hashes are a method for creating SSL/TLS client fingerprints in an easy-to-produce and shareable way. The algorithm was created by John Althouse, Jeff Atkinson, and Josh Atkins, and was inspired by the research and works of Lee Brotherston and his TLS Fingerprinting tool, FingerprinTLS.
JA3 hashes are generated by gathering the decimal values of the bytes for the following fields in the Client Hello packet:
SSL Version
Accepted Ciphers
List of Extensions
Elliptic Curves
Elliptic Curve Formats
These values are then concatenated together in order, using a “,” to delimit each field and a “-” to delimit each value in each field. The resulting string is then hashed using the MD5 algorithm to create the final JA3 hash.
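To make that construction concrete, here's a minimal Python sketch of the string-building and hashing step. The field values below are made-up placeholders, not a real Client Hello capture:

```python
import hashlib

# Hypothetical Client Hello field values (illustrative only, not a real capture)
tls_version = 771                  # TLS 1.2 (0x0303) as a decimal value
ciphers = [49195, 49199, 52393]    # offered cipher suite IDs
extensions = [0, 10, 11]           # extension IDs, in the order they appear
curves = [29, 23, 24]              # supported elliptic curves
curve_formats = [0]                # elliptic curve point formats

def ja3(version, ciphers, exts, curves, fmts):
    """Join fields with ',' and values within a field with '-', then MD5 the string."""
    fields = [
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, exts)),
        "-".join(map(str, curves)),
        "-".join(map(str, fmts)),
    ]
    ja3_string = ",".join(fields)
    return hashlib.md5(ja3_string.encode()).hexdigest()

print(ja3(tls_version, ciphers, extensions, curves, curve_formats))
```

A JA3S hash is built the same way, just from the (fewer) fields the server echoes back in its Server Hello.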
Similar to JA3 hashes, JA3S hashes are a method for creating SSL/TLS server fingerprints. While JA3 hashes focus on the client side of the SSL/TLS communication, JA3S hashes target the server side. The fingerprint is generated using attributes from the Server Hello packet.
For example, these are samples of Trickbot's and Emotet's JA3/JA3S client/server hashes:
JA3 = 6734f37431670b3ab4292b8f60f29984 ( Fingerprint of Trickbot )
JA3S = 623de93db17d313345d7ea481e7443cf ( Fingerprint of Command and Control Server Response )
JA3 = 4d7a28d6f2263ed61de88ca66eb011e3 ( Fingerprint of Emotet )
JA3S = 80b3a14bccc8598a1f3bbe83e71f735f ( Fingerprint of Command and Control Server Response )
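As a toy illustration of how these fingerprints get used in detection, here's a hypothetical lookup of an observed JA3/JA3S pair against a tiny known-bad list (seeded with the Trickbot/Emotet samples above); real tooling would match against much larger, curated repositories:

```python
# Known-malicious (JA3, JA3S) pairs from the samples above
KNOWN_BAD = {
    ("6734f37431670b3ab4292b8f60f29984",
     "623de93db17d313345d7ea481e7443cf"): "Trickbot",
    ("4d7a28d6f2263ed61de88ca66eb011e3",
     "80b3a14bccc8598a1f3bbe83e71f735f"): "Emotet",
}

def classify(ja3, ja3s):
    """Return the malware family for an observed JA3/JA3S pair, or None if unknown."""
    return KNOWN_BAD.get((ja3, ja3s))

print(classify("4d7a28d6f2263ed61de88ca66eb011e3",
               "80b3a14bccc8598a1f3bbe83e71f735f"))  # the Emotet pair
```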
Despite their usefulness, JA3 hashes have some limitations.
It is not uncommon for different client applications to have the same JA3 fingerprint, leading to false positives in detection. This can be due to clients behaving similarly enough to have the same hash or through intentional deception by attackers and bot developers.
JA3 hashes are also a static approach to detecting suspicious traffic in encrypted communications, similar to the traditional deep-packet-inspection signature-based approach. This means that they still rely on maintaining a database of malicious JA3 hashes and can be prone to overlap and (again) false positives.
Attackers can (and, often, do) change the values used to generate JA3 hashes to make their malicious applications appear like legitimate ones, such as a Chrome browser.
The use of the MD5 message-digest algorithm is somewhat problematic, since there are more circumstances in which hash collisions might occur. In retrospect, I'm not sure the benefit of being able to support JA3/JA3S on older/legacy systems was worth this trade-off.
On mobile devices, apps tend to use the same client HTTP stack, which means they all tend to have common JA3 hashes. However, by introducing additional TLS features like the JA3S server hash and Server Name Indication (SNI) extension, more accurate identification becomes possible.
JA3S hashes, when used in conjunction with JA3 hashes, can significantly reduce the level of false positives in detecting suspicious traffic in encrypted communications. This combination provides a more comprehensive view of the encrypted communication, allowing for better identification of potentially malicious events and applications.
JA3S hashes share some of the limitations of JA3 hashes, such as the “false positive” problem and the ability to be fooled by clever attackers. They are also highly dependent on the Client Hello (since that is an integral part of the overall Hello exchange).
JARM hashes are a bit more useful on the server side. The hash-generation process involves sending 10 specially crafted TLS Client Hello packets to a target TLS server, then capturing specific attributes of the TLS Server Hello responses. The responses are aggregated and hashed using a combination of a reversible and a non-reversible hash algorithm to produce a 62-character JARM fingerprint.
Unfortunately, they are heavily dependent on the operating system, packages, libraries, and other custom configurations of the server. This means that the fingerprint may not be unique to a specific application or service, but rather to the underlying server configuration.
Despite the potential limitations of JA3/JA3S/JARM, they do provide defenders with additional tools to help identify malicious infrastructure and behavior.
There are tons of implementations of these algos in pretty much every programming language, and you can find many fingerprint repositories like this one which house already computed and tagged entities.
Fingerprinting Web Content
The data science wonks that make up a fair portion of Drop readers likely know about Locality-Sensitive Hashing (LSH), but I'm going to talk about using it in a cybersecurity context with a use folks may not be aware of. So, let's quickly explain it, define some terms, then discuss this use case.
LSH is an algorithmic technique that hashes similar input items into the same “buckets” with high probability. Unlike conventional hashing techniques that aim to minimize hash collisions, LSH maximizes them, allowing for data clustering and nearest neighbor search. LSH is particularly useful for reducing the dimensionality of high-dimensional data while preserving relative distances between items.
It consists of a variety of methods, including shingling, MinHashing, and the final banded LSH function. The general idea of LSH is to find an algorithm that, when given the signatures of two documents, can determine their similarity. As noted, LSH functions are designed to maximize the probability of two similar objects falling into the same bucket. This allows for pre-processing of data, reducing dimensionality, and speeding up similarity searches.
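The shingling-plus-MinHash pipeline is easy to sketch from scratch. This is a toy stand-in (no banding step, and MD5-seeded hash functions instead of proper universal hashing), with made-up page content, but it shows why near-duplicate documents end up with near-identical signatures:

```python
import hashlib

def shingles(text, k=5):
    """Break text into the set of overlapping k-character shingles."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shings, num_hashes=64):
    """For each seeded hash function, record the minimum hash over all shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shings
        ))
    return sig

def similarity(sig_a, sig_b):
    """Fraction of matching slots estimates the Jaccard similarity of the shingle sets."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Two near-identical "scanned pages" and one unrelated one (made-up content)
a = minhash_signature(shingles("<html><body>router admin login page</body></html>"))
b = minhash_signature(shingles("<html><body>router admin login portal</body></html>"))
c = minhash_signature(shingles("totally different text with nothing in common at all"))

print(round(similarity(a, b), 2), round(similarity(a, c), 2))
```

The two near-duplicates agree on most signature slots; the unrelated page agrees on almost none, which is exactly the property the banded LSH step exploits for fast bucketing.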
This technique is used across many disciplines and processes: to de-duplicate large document corpora, identify similar gene expressions, power image search, and catch outright (lame) plagiarism.
They can also be used to help identify similar-technology compromised devices in botnets!
As I've noted in prior Drops, I help maintain a planetary scale internet sensor network that advertises no services through traditional means (such as DNS). These nodes listen for connections that should not exist! We process and classify all these connections, then make them searchable by you.
But, that is not all we do with the data. Every few hours, we take all the unique IP addresses we've seen in that time frame and use completely different infrastructure (that is not in any way associated with us) and poke back at those nodes. One way we poke back is to issue a few HTTP requests on various ports. We compute the LSH hashes for the HTML content and save it (along with a bunch of other metadata we collect) all in Parquet files for later research projects.
Sadly, a large number of IPs hitting us are compromised devices, such as home routers, cloud servers, cameras, etc. We can use the LSH hashes (and other hashes/metadata) to identify devices with common technologies. In fact, just using the LSH hash can provide a quick bit of insight into common compromised tech clusters.
Take this (random! honest!) slice of this data set:
source_ip lsh
1 68.201.35.244 c341953690c64513e0833475a6f2e6117c6cc22795018d3cf…
2 103.149.192.157 a0d000c00c0f3000f000303f0000000000000000000000000…
3 159.89.83.156 b181fc067d95b50da38043b26ad3f519ba2ae2278710dc44b…
4 24.182.54.169 c341953690c64513e0833475a6f2e6117c6cc22795018d3cf…
5 76.185.49.180 c341953690c64513e0833475a6f2e6117c6cc22795018d3cf…
6 24.217.56.193 c341953690c64513e0833475a6f2e6117c6cc22795018d3cf…
7 122.2.16.227 2ed022b14c294049f9e229ff99b3f20e3c2494840104c9080…
8 184.152.74.211 c341953690c64513e0833475a6f2e6117c6cc22795018d3cf…
9 65.24.147.201 c341953690c64513e0833475a6f2e6117c6cc22795018d3cf…
10 110.78.178.67 7181f026dd41c487b2318798eba2f508ef209013c345acbc7…
You can likely eyeball a few common nodes just in that small snippet (the c341953… ones).
In many cases, the values are identical, but there can be slight differences (say, if the IP address of the node is in the HTML somewhere).
Using something similar to:
textreuse::lsh_query(scan_backs, "68.201.35.244")
in your fav programming language (Python, JavaScript, Rust, and Go all have similar packages to R's {textreuse}), we can find all the similar nodes to that one.
Now, LSH hashes can and do produce false positives, so we'd use more than just the hash data for proper research, but that lsh_query
identified scads of this tech:
GoAhead is a commonly compromised garbage embedded web server since the developers play vulnerability whack-a-mole (vs. take a good, long, hard look for vulnerabilities and perform defensive coding). Apart from the regular cadence of new vulnerabilities in it, folks never patch the embedded devices they run on. So, attackers regularly swoop them up into botnets.
Now, if only there were a way to clean up all these vulnerable messes we find. But, at least LSH helps us keep track of them, and, perhaps, this section has given some readers an idea for how to use MinHash/LSH in their own contexts.
Fingerprinting HTTP Server Headers
(Apologies if some readers saw this over on my plain ol' blog, but I'm excited enough about this new hashing idiom that I felt it appropriate to cut/paste my own content over here as well.)
HTTP Headers Hashing (HHHash) is a technique developed by Alexandre Dulaunoy of CIRCL to generate a fingerprint of an HTTP server based on the headers it returns. It employs one-way hashing to generate a hash value from the list of header keys returned by the server. The HHHash value is calculated by concatenating the list of header names returned, ordered by sequence, with each name separated by a colon. The SHA256 of this concatenated list is then taken to generate the HHHash value. HHHash incorporates a version identifier to enable updates to new hashing functions.
While effective, HHHash's performance relies heavily on the characteristics of the HTTP requests, so correlations are typically only established using the same crawler parameters. Locality-sensitive hashing (LSH) could be used to calculate distances between sets of headers for more efficient comparisons. There are some limitations with some LSH algorithms (such as the need to pad content to a minimum byte length) that make the initial use of SHA256 hashes a bit more straightforward.
Alexandre made a Python library for it, and I cranked out an R package, Golang CLI, and C++ CLI for it as well.
They all turn the headers from a site into something usable for mass comparison.
These are the headers from https://www.circl.lu/:
Date
Server
Strict-Transport-Security
Last-Modified
ETag
Accept-Ranges
Content-Length
Content-Security-Policy
X-Content-Type-Options
X-Frame-Options
X-XSS-Protection
Content-Type
And, this is the v1 HHHash:
hhh:1:78f7ef0651bac1a5ea42ed9d22242ed8725f07815091032a34ab4e30d3c3cefc
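Assuming the v1 scheme is simply the SHA-256 of the colon-joined header names (my reading of the description above), the whole thing fits in a few lines of Python; if that reading of the spec is exact, running it on the header list above should reproduce the hhh:1: value shown:

```python
import hashlib

# Header names in the order the server returned them (the circl.lu example above)
headers = [
    "Date", "Server", "Strict-Transport-Security", "Last-Modified",
    "ETag", "Accept-Ranges", "Content-Length", "Content-Security-Policy",
    "X-Content-Type-Options", "X-Frame-Options", "X-XSS-Protection",
    "Content-Type",
]

def hhhash_v1(header_names):
    """Concatenate header names with ':', SHA-256 the result, prefix the version."""
    digest = hashlib.sha256(":".join(header_names).encode()).hexdigest()
    return f"hhh:1:{digest}"

print(hhhash_v1(headers))
```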
Since, at work, we collect headers as well as full content in the system I mentioned in the middle section, I'm going to see how LSH stands up in this and, perhaps, suggest a “v2” HHHash that uses it instead.
FIN
🇺🇸 folk: if you're in one of the listed 17 states, remember to poke your noggins outside Thursday night to catch a glimpse of the Northern Lights. ☮