Discover more from hrbrmstr's Daily Drop
Drop #282 (2023-06-22): Happy ThursdAI
AI's Unsung Heroes; Tortured PhrAIses; A Breakneck TimelAIne
Today's Drop delves into the unseen human labor in AI data annotation, introduces peculiar practices to fool AI systems, and takes us through a rapidly evolving timeline of generative AI. It kind of ended up being a more raw and sobering look into AI's development journey and societal implications than expected. I'm pretty certain that’s a result of a personal challenge to deliberately consider the human aspects intertwined with AI technology, likely because I can no longer avoid using it.
(Programming note: I'll try not to lean too heavily or too often on the ThursdAI theme, but, I’m afraid “AI” is just a nigh unavoAIdable topic.)
AI's Unsung Heroes
In a recent blog post on The Verge, Josh Dzieza highlights the importance of “annotators” in the AI industry. Annotators are responsible for labeling and clarifying data, which is essential for training AI models like OpenAI's ChatGPT. Despite their significant contributions, annotators frequently remain behind the scenes, with their labor fueling the current AI boom.
The article discusses Remotasks, a worker-facing subsidiary of Scale AI, a multibillion-dollar data vendor that serves clients like OpenAI and the U.S. military. Many annotators working for Remotasks are unaware of their connection to Scale AI, as neither company's website mentions the other. This highlights the lack of transparency in the AI industry, especially when it comes to the role of annotators.
While the public response to AI advancements often focuses on job automation, it is essential to recognize the human labor behind these systems. Annotators play a critical role in developing AI technologies, such as chatbots and generative artwork, by providing the necessary data for training and refining models. However, only companies that can afford to purchase this data can compete in the AI market, and those that have access to it are highly motivated to keep it secret.
In a way, it's all ImageNet's fault:
In 2007, the AI researcher Fei-Fei Li, then a professor at Princeton, suspected the key to improving image-recognition neural networks, a method of machine learning that had been languishing for years, was training on more data — millions of labeled images rather than tens of thousands. The problem was that it would take decades and millions of dollars for her team of undergrads to label that many photos.
Li found thousands of workers on Mechanical Turk, Amazon’s crowdsourcing platform where people around the world complete small tasks for cheap. The resulting annotated dataset, called ImageNet, enabled breakthroughs in machine learning that revitalized the field and ushered in a decade of progress.
It's somewhat ironic that many of the folks Silicon Valley is taking advantage of are turning AI back on MTurk. It is less ironic and more disconcerting that scads of other organizations use MTurk and other platforms to do even worse things with AI training.
👉🏼 (Feel free to skip to the last ❡ in this section to avoid some pontification.) 👈🏼
As I've stated in previous Drops, historical global society-changing advancements have always come on the backs/at the expense of the most vulnerable amongst us. The glowing rectangle you are reading this on itself was built, in large part, by the modern equivalent of slave labor in horrible factory conditions in places where human rights only barely exist for the most well-connected.
Now, these very systems are running or accessing AI models trained by other indentured humans that venture capitalist-backed well-connected CEOs will gladly leave by the wayside as they purchase their quarter-million dollar tickets on janky, doomed submersibles for an expensive joyride.
Make no mistake, we (me included) are just as tainted as those CEOs, since we really care not about either the faces behind our glowing rectangles or the hands that did the hard work our OpenAI key-fueled API calls rely on. We'd use neither if we did.
In a six-part series, IEEE Spectrum explored the human history of AI. They took an in-depth look at how innovators, thinkers, workers, and sometimes hucksters have created algorithms that can feign human thought and behavior. Rather than focus on the outcomes, they help show that our AI overlords are, ultimately, only as good as we are. Somehow, that's a more frightening thought than I expected it to be.
Tortured PhrAIses
We're tapping into another resource-filled short piece over at Language Log for this section. It introduced two terms that I was not aware of:
One is “rogeting”, which is “an informal neologism created to describe the act of modifying a published source by substituting synonyms for sufficient words to fool plagiarism detection software, often resulting in the creation of new meaningless phrases through extensive synonym swapping”.
The other is “spamdexing”, which is “the deliberate manipulation of search engine indexes. It involves a number of methods, such as link building and repeating unrelated phrases, to manipulate the relevance or prominence of resources indexed in a manner inconsistent with the purpose of the indexing system”.
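To make “rogeting” a bit more concrete, here's a minimal Python sketch of blind synonym swapping. The tiny thesaurus and the `roget` function name are hypothetical and made up purely for illustration; nothing here comes from the Language Log post or from any real tool. It only shows how word-by-word substitution turns a perfectly normal sentence into a tortured one.

```python
import re

# Hypothetical toy thesaurus: these word pairs are illustrative assumptions,
# not taken from any real tool or from the Language Log post.
TOY_THESAURUS = {
    "artificial": "counterfeit",
    "intelligence": "consciousness",
    "deep": "profound",
    "neural": "nervous",
    "network": "organization",
}


def roget(text: str, thesaurus: dict[str, str] = TOY_THESAURUS) -> str:
    """Swap each known word for its listed synonym, keeping capitalization."""

    def swap(match: re.Match) -> str:
        word = match.group(0)
        replacement = thesaurus.get(word.lower())
        if replacement is None:
            # Word is not in the thesaurus, so leave it untouched.
            return word
        return replacement.capitalize() if word[0].isupper() else replacement

    # Visit every run of ASCII letters; punctuation and spacing pass through.
    return re.sub(r"[A-Za-z]+", swap, text)


print(roget("Artificial intelligence relies on a deep neural network."))
# prints: Counterfeit consciousness relies on a profound nervous organization.
```

Each swap looks defensible in isolation, but the sentence as a whole no longer says anything a practitioner would recognize, which is exactly the kind of residue that gives rogeted text away.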
For ages, humans have been outsmarting the systems and software that try to suss out such nonsense, and the author of the post makes an interesting posit about how LLMs are [about to] change up the calculus yet again. This is amusing in that a fair number of modern plagiarism detection systems use “AI” in their processes. These systems are adapting, and will keep adapting, to determine if something like ChatGPT was used. While these nascent developments do work, they are far from foolproof. So, the cat-and-mouse game will likely never truly cease.
It's a quick read, but the article itself and the links in it provide fodder for many extended posits and food for thought. I was pleasantly surprised to see them reference this now-“old” RNN gem.
A Breakneck TimelAIne
GPT/LLM history is unfolding at an astounding rate, and Jonathan Jeon is doing the yeoman's work of cataloging all the changes/developments/advances over at the ChatGPT, GenerativeAI and LLMs Timeline GH repo.
They've organized/curated a scary complete chronology of key events (products, services, papers, source repos, blog posts and news) that occurred before and after the ChatGPT announcement.
The section header is a chart I threw together to show what a bonkers ride it's been.
Jonathan’s timeline is 100% worth your time perusing, keeping a watch on, and contributing to.
If you're keen to think a bit more about AI in a cultural sense, check out the Global AI Narratives project ☮