

In a recent Bonus Drop, I dedicated a section to Kagi AI. I'm a big fan of Kagi, and am quite pleased with the services and features I receive for my monthly fee.
Their freemium model lets them use the proceeds both to run the service and to fuel new development. Rather than go in the direction Bing and Google have (y'know, letting us all train ChatGPT to become The Joker), Kagi took a different approach.
First, they are outwardly calling it an experiment with the “labs” nomenclature, which I give them major kudos for. Second, they aren't having their AI back end make up 💩 from whole cloth. They're using the latest and greatest advancements in LLMs to help you quickly grok what's at the end of a URL, either as a short summary or as a bullet list of major points.
Readers of these Drops skew towards the DIY side of things, and we really do not need some third party in the mix to summarise stuff for us (please blame {ggplot2} for my use of non-American spelling).
Today's Drop brings you three different text analysis resources. Each should help you quickly grok the contents of any corpus you encounter. Please note that I did not check for how broad the language support was for all of these — a persistent failing of the 🇺🇸 part of my 🧠 — but the first and last one should be almost language-agnostic.
So sumy
Poor attempts at legal puns aside, Sumy [GH] — or, for code-last folks, Sumy Space — is the result of work performed by Mišo Belica for their Master's thesis. Sumy can create an “extractive summary” from a corpus, meaning it tries to find the most significant sentences in one or more documents and collates them into a shortened text. While originally designed for a Czech/Slovak corpus, it was built with extensibility in mind, and works fine across many languages.
You can choose from several algorithms:
Random: DO NOT USE
Luhn: Heuristic method (simplest)
Edmundson: Heuristic method with previous statistic research (Luhn on steroids)
Latent Semantic Analysis, LSA: Algebraic method (computation heavy)
LexRank / TextRank: Unsupervised approaches inspired by the PageRank and HITS algorithms
SumBasic: Method often used as a baseline in the literature
KL-Sum: (honestly this one is kind of wonky)
Reduction: Graph-based summarization (I think this is a bit wonky too)
For the Bonus Drop, I had Kagi AI summarise one of the first stories about the silly balloons. It did a decent job:
A suspected Chinese surveillance balloon has been spotted flying over the continental United States, sparking national security concerns and prompting US Secretary of State Antony Blinken to postpone his trip to China. The balloon, which is roughly the length of three city buses, is expected to exit the east coast of the United States as early as Saturday morning. The US has not ruled out shooting it down, though the military had advised against it due to the risk of falling debris. China claims the balloon is a civilian research vessel that has been blown off course. Additionally, the Pentagon has reported that another Chinese spy balloon has been spotted above Latin America, though it is not currently heading to the US. This incident has added to already tense diplomatic relations between the two countries.
I won't show every Sumy algorithm below, but I will show the output of LSA. I am confident readers of the Drops can pip install it or use the web interface. Though, if you do use the pip install method, make sure you have the Punkt sentence tokenizer available:
import nltk
nltk.download('punkt')  # one-time download of the Punkt sentence tokenizer models
Here's what LSA outputs:
sumy lsa --length=5 --url=https://www.cnn.com/politics/live-news/suspected-chinese-balloon-over-us-02-04-23/index.html
The air force said in a statement Friday it is coordinating with other “countries and institutions” to establish the origin of the balloon, which has already left Colombia's airspace after being tracked by its National Defense System.
The massive white orb, carrying aloft a payload the size of three coach buses, had already been floating in and out of American airspace for three days by the time Biden was briefed by his top general, according to two US officials.
The balloon’s week-long American journey, from the remote Aleutian Islands to the Carolina coast, left a wake of shattered diplomacy, furious reprisals from Biden’s political rivals and a preview of a new era of escalating military strain between the world’s two largest economies.
A White House official said Saturday that President Joe Biden and his military advisers took “responsible action” by waiting to shoot down the Chinese balloon until it was over water, minimizing the risk it could have posed to people on the ground had it been shot over the continental US.
“Such actions by the Chinese Communist Party government contravene international law, breach the airspace of other countries, and violate their sovereignty,” Taiwan's Foreign Ministry said in a statement.
Each algorithm is going to provide different results, especially when you limit the number of sentences, but you should get a decent summary, and you won't need to leave the comfort of your command line to do so.
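If you'd rather stay in Python than shell out, the programmatic API is just as compact. Here's a minimal sketch of the same LSA run, based on sumy's documented usage (the URL, language, and sentence count are the knobs to tweak; double-check the details against the version you install):

from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.utils import get_stop_words

LANGUAGE = "english"

# Fetch and parse the article straight from the URL
url = "https://www.cnn.com/politics/live-news/suspected-chinese-balloon-over-us-02-04-23/index.html"
parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))

# LSA summarizer, with stemming and stop words for the chosen language
summarizer = LsaSummarizer(Stemmer(LANGUAGE))
summarizer.stop_words = get_stop_words(LANGUAGE)

# Print the five most significant sentences
for sentence in summarizer(parser.document, 5):
    print(sentence)

Swapping LsaSummarizer for, say, LexRankSummarizer (from sumy.summarizers.lex_rank) is how you hop between the algorithms listed above.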
While the summaries may not be perfect, sumy is something folks might want to consider experimenting with to see if it has a place in their information-gathering workflows.
I found VideoMash to be helpful if I have a ton of YouTube (et al.) content to turn into something usable (I detest videos), especially if raw Whisper extracts are super long.
sumgram
I have noted my soft spot for the Web Science and Digital Libraries (WS-DL) program and team before. They do great work and turn out great students (hire them!), though I have zero affiliation with the school or program.
One of those “great works” is sumgram [GH], a “tool that summarizes a collection of text documents by generating the most frequent sumgrams (conjoined ngrams)”.
OK, fancy-word explanation time.
An ngram (you might see it as “n-gram”) is a sequence of “n” words that appear together in a sentence or a text. Let's take the sentence “The quick brown fox jumps over the lazy dog” and look at the 2-grams (a.k.a. bigrams):
“The quick”
“quick brown”
“brown fox”
“fox jumps”
“jumps over”
“over the”
“the lazy”
“lazy dog”
Now, a conjoined ngram is a specific type of ngram where two consecutive words in the sequence are joined together by an underscore (_). So, for our example sentence, the 2-gram “quick brown” can also be represented as a conjoined ngram, “quick_brown”.
Conjoined ngrams are often used in natural language processing and text mining to simplify the analysis of text data because they allow us to treat a sequence of two words as a single entity.
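If it helps to see that in code, here's a tiny, purely illustrative Python snippet (standard library only, nothing from sumgram itself) that builds the bigrams and their conjoined forms from the example sentence:

sentence = "The quick brown fox jumps over the lazy dog"
tokens = sentence.split()

# All consecutive word pairs (the 2-grams, a.k.a. bigrams)
bigrams = list(zip(tokens, tokens[1:]))

# The same bigrams as conjoined ngrams, joined with an underscore
conjoined = ["_".join(pair) for pair in bigrams]

print(conjoined)
# ['The_quick', 'quick_brown', 'brown_fox', 'fox_jumps',
#  'jumps_over', 'over_the', 'the_lazy', 'lazy_dog']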
ODU WS-DL focus much of their work on web archives, which are usually collections of web content around a particular subject/topic. One of the common tasks when poking at a web archive is trying to figure out what said archive is about (they call this “aboutness” — an oddly accessible term for academia). Think of this “aboutness” as a high-level summary or set of summaries.
The usual “aboutness” method is to take the top k (e.g., k = 20) ngrams. This is problematic when you have entities like “World Health Organization” to deal with (see the sketch below). By bringing sumgrams into the mix, we can get an “aboutness” summary that spans multiple ngram classes (bigrams, trigrams, four-grams, five-grams, etc.). This improves the utility of the exercise without adding tons more to the computational effort.
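Here's a small, made-up illustration of why plain top-k ngrams fall down on multi-word entities (the toy “documents” are mine, not anything from the sumgram paper): the top bigrams slice “World Health Organization” into two overlapping fragments, while a conjoined sumgram would keep it whole as world_health_organization.

from collections import Counter

docs = [
    "the world health organization declared a public health emergency",
    "the world health organization issued new guidance on the outbreak",
]

bigram_counts = Counter()
for doc in docs:
    tokens = doc.split()
    bigram_counts.update(zip(tokens, tokens[1:]))

# The top bigrams split the entity into "world health" and "health organization"
for pair, count in bigram_counts.most_common(5):
    print(" ".join(pair), count)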
Their blog post goes into more detail, and — like all the WS-DL tools — drops in your system like a hot knife through butter.
rustbert
(I mean, how could I not use that when it came up in the Unsplash search for “bert”?)
I have not fawned over Rust for a while, mostly due to the aforementioned (in a previous Drop) shift toward performing executive functions rather than executing functions. But, I know my roots, and I could not let a Drop about summarisation tools end without a reference to rust-bert, which provides “Rust-native state-of-the-art Natural Language Processing models and pipelines.” It's a port of Hugging Face's Transformers library, using the tch-rs crate, and gets some pre-processing help from rust-tokenizers.
Rust-bert supports multithreaded tokenization and can take advantage of those energy-bill-killing GPUs for inference.
It comes with batteries included:
model base architecture
ready-to-use pipelines
task-specific heads
Sigh. More fancy/odd words to explain. (This may take a bit.)
Transformer-based models are a type of deep learning model commonly used in natural language processing (NLP) tasks, such as language translation or text classification. They consist of a bonkers number of layers of interconnected nodes (“neurons” — and I am not a fan of that nomenclature), which are trained on further bonkers amounts of data to learn how to perform a specific task.
Within these transformer-based models, a “head” is a specific layer of interconnected nodes that is responsible for performing a specific sub-task within the overarching task. For example, in a text classification task, one head might be responsible for determining whether the text belongs to a certain category, while another head might be responsible for determining the sentiment of the text.
“Task-specific heads” (yes, I am almost done blathering) are simply heads that are designed to perform a specific sub-task within the larger task that the model is being trained to perform. These heads are added to the end of the model and are trained on the output of the preceding layers, which encode the input text in a way that is suitable for the specific task.
The advantage of using task-specific heads in transformer-based models is that they allow the model to learn to perform different sub-tasks within a larger task without requiring the entire model to be retrained. This can save a significant amount of time and computational resources, as well as allowing for greater flexibility in the types of tasks that the model can be applied to.
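Since rust-bert is a port of Hugging Face's Transformers, the Python original makes the “same backbone, different heads” idea easy to see. This is just a quick illustrative sketch (the checkpoint name is a small, common one I picked for the example, not something rust-bert ships with):

from transformers import AutoModel, AutoModelForSequenceClassification

# The bare encoder: the base architecture with no task-specific head
backbone = AutoModel.from_pretrained("distilbert-base-uncased")

# The same encoder with a classification head bolted on top,
# ready to be fine-tuned for a two-class text classification task
classifier = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

print(type(backbone).__name__)    # DistilBertModel
print(type(classifier).__name__)  # DistilBertForSequenceClassification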
Rust-bert comes with the following task-specific heads:
Translation
Summarization
Multi-turn dialogue
Zero-shot classification
Sentiment Analysis
Named Entity Recognition
Part of Speech tagging
Question-Answering
Language Generation
Masked Language Model
Sentence Embeddings
OK, I guess I could have just said that to begin with. o_O
So, I have to admit that I did not plan ahead for this Drop, and re-writing the above 6x has beaten me down a bit, as I'm still on the mend. Thankfully, the docs explain the rest very well, and there are unusually good examples (the Rust documentation ecosystem has a weird relationship with example code) to pore over.
I highly encourage you to check out the ready-to-use pipelines, since that's where the tie-in with today's “summarisation” theme comes into play.
FIN
Depending on where you live, get ready to put on your sandals, sneakers, or hiking boots for tomorrow's “Weekend Project Edition” Drop. ☮
As usual, I get nothing for mentioning paid services.
Apologies for the lack of creativity, but I am, unfortunately, a product of draconian and boring U.S. junior-high typewriter typing classes.
It turns out my prolific mixed use of British and American spelling in this Drop broke LanguageTool, so further apologies for any egregious typos/grammar issues.