Ngrok 3.0; Data Factories; Data Science-enabled Accessibility
Ngrok 3.0

As both a developer and a cybersecurity professional, I find there is a category of tools and services that regularly pits those two personas against each other. Ngrok is one of those tools.
Daniel Miessler is also a cybersecurity professional who put together a great 'splainer/tutorial on Ngrok back in January of this year. For the Ngrok uninitiated, I’ll invoke my DRY rule (extending it a bit when I’d just be saying what others have) and quote from Daniel:
Ngrok is an application that gives you external (internet) access to your private systems that are hidden behind NAT or a firewall. It’s basically a super slick, encrypted TCP tunnel that provides an internet-accessible address that anyone can get to, and then links the other side of that tunnel to functionality running local.
Here’s what it does:
You run ngrok from a local system with a service you want to make available to people on the internet
Just run the command and give it the protocol you want to use, along with the local port it’s listening on
Ngrok then creates an address in the cloud that people can get to
It then connects those two things over an encrypted tunnel, so when you hit the Internet address, you land on your local service automagically!
While Ngrok is absolutely used by malicious folks for command and control (C2) plus exfiltration, the primary use is for developers to make locally developed services accessible over the internet.
The new release is very cloudy. The Cloud Edge feature lets you add OAuth, OIDC, SAML, mTLS, Webhook Signature Verification (link goes to a Stripe definition of WSVs), and automated certificates without hassle, which can free up tons of feature exploration and testing time. Want GitHub OAuth? → `ngrok http 80 --oauth=github` (yes, it really is that simple).
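For reference, the basic invocations look like this (a sketch; the flags assume ngrok v3 and an already-authenticated ngrok account, and won't do anything useful without one):

```shell
# expose a local web server running on port 80 to the internet
ngrok http 80

# expose a raw TCP service instead (e.g., SSH on port 22)
ngrok tcp 22

# the same HTTP tunnel, but gated behind GitHub OAuth at the Cloud Edge
ngrok http 80 --oauth=github
```

Each command prints a public forwarding address; traffic hitting that address is tunneled back to the local port you named.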
The tunnel capabilities mean you can bridge your network and a colleague’s/customer’s network through their expensive, “iron clad” firewalls and ~~exfiltrate~~ interoperate seamlessly.
Along the way, the Ngrok team also kicked up observability a few notches.
Ngrok itself is SOC 2 compliant and the service has a fairly generous free tier.
Data Factories

Full disclosure (before summarizing): I am #NotAFan of data lakes/warehouses, both as a data person and as a cybersecurity professional, so read this with that in mind.
Jeremy Stanley begins his article with a fairly hot take:
The data warehouse is a broken metaphor in the modern data stack.
We aren’t loading indistinguishable pallets of data into virtual warehouses, where we stack them in neat rows and columns and then forklift them out onto delivery trucks.
Instead, we feed raw data into factories filled with complex assembly lines connected by conveyor belts. Our factories manufacture customized and evolving data products for various internal and external customers.
The crux of the post is on data quality, something that I do not see mentioned frequently enough in the various data communities I engage in or at least monitor regularly.
As a data user, one recent-ish data feed I regularly rely on is CISA’s catalog of Known Exploited Vulnerabilities (KEV). Until recently, this “very much not big data” dataset was riddled with quality issues in multiple critical fields that went largely unnoticed by the folks using the data. Without a focus on quality, shunting data like this into a giant Redshift or Snowflake environment could cause cascading corruption in downstream analyses and models. Jeremy adds a few more possibilities (a non-exhaustive list):
We don’t know if the data we produce is high quality until we have tested the finished product. For example:
Did a join introduce duplicate rows?
Did a malformed column cause missing values?
Are timestamps inconsistently recorded?
Has a change in query logic affected business metrics?
After validating the quality of our final product, we should ensure we are consuming high-quality raw materials. Identifying defects in raw data arriving into the factory will save us time and effort in root-causing issues later.
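The checks Jeremy lists are easy to prototype before reaching for a full testing framework. Here's a minimal stdlib-Python sketch (the rows, field names, and thresholds are all hypothetical, stand-ins for whatever your finished data product actually contains):

```python
from datetime import datetime

# Hypothetical "finished product" rows, e.g., the output of a join/transform step.
rows = [
    {"id": "CVE-2021-44228", "vendor": "Apache", "date_added": "2021-12-10"},
    {"id": "CVE-2021-44228", "vendor": "Apache", "date_added": "2021-12-10"},  # duplicate (join fan-out)
    {"id": "CVE-2022-1388",  "vendor": None,     "date_added": "05/11/2022"},  # missing value, odd date format
]

def check_duplicates(rows, key="id"):
    """Flag key values that appear more than once (e.g., a join fanned out rows)."""
    seen, dupes = set(), set()
    for r in rows:
        if r[key] in seen:
            dupes.add(r[key])
        seen.add(r[key])
    return dupes

def check_missing(rows, column):
    """Count rows where a (possibly malformed) column produced no value."""
    return sum(1 for r in rows if r.get(column) in (None, ""))

def check_timestamps(rows, column, fmt="%Y-%m-%d"):
    """Return rows whose timestamp does not parse with the expected format."""
    bad = []
    for r in rows:
        try:
            datetime.strptime(r[column], fmt)
        except (TypeError, ValueError):
            bad.append(r)
    return bad

print(check_duplicates(rows))                       # {'CVE-2021-44228'}
print(check_missing(rows, "vendor"))                # 1
print(len(check_timestamps(rows, "date_added")))    # 1
```

The same three functions can then be pointed at the *incoming* raw feed, which is exactly the "inspect raw materials at the factory door" step the article argues for.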
And, I have to say, I feel seen when he points out that we’re monitoring the wrong things in data pipelines:
We monitor data infrastructure for uptime and responsiveness.
We monitor Airflow tasks for exceptions and run times.
We apply rule-based tests with dbt to check the logic of transformations.
We analyze data lineage to build complex maps of data factory floors.
This article sparked a fun discussion at work, and it provides plenty of food for thought about how we all gather, store, process, and test data in our environments.
Data Science-enabled Accessibility
I am an avid follower of the Web Science & Digital Libraries Research Group in Old Dominion University’s Department of Computer Science, as they do wonderful, groundbreaking work in the areas of web scraping and web archiving. Over the coming weeks I’ll be featuring some of their cooler projects; today I’m focusing on a very recent one dubbed InSupport.
According to the paper, InSupport is a “Proxy Interface for Enabling Efficient Non-Visual Interaction with Web Data Records”. More practically, InSupport is a browser extension with a proxy interface that applies machine-learned algorithms to extract features from items on web pages, assisting blind users as they navigate shopping and travel-booking websites.
We sighted folk take for granted our ability to grok and use the (still, often terrible and complex) user interfaces of travel and shopping sites. Sadly, most of those sites offer no options to make themselves more accessible, despite the many regulations and resources available.
Technology has always had the ability to improve someone’s life, and the results (below) from the InSupport study trials seem to indicate it can do just that:
The GitHub repository and paper have a sibling presentation that walks through the entire problem statement and solution process.
That’s a wrap! If you do interact in the comments, the only rule is to be kind. ☮