

Corpus Tools
If the section on Trafilatura in yesterday's edition piqued your interest, you'll really like this section's featurette on Corpus Tools.
Corpus Tools is a joint portal of Masaryk University's NLP Centre and Lexical Computing, dedicated to a number of software tools for corpus processing, including the well-known corpus manager Sketch Engine. They have many cool (Python) projects that you may already be using if text processing is part of your daily work or hobbies.
JusText is an HTML boilerplate removal tool. It can strip navigation links, headers, footers, etc. from HTML pages and leave just regular text containing full sentences. You may remember seeing this as one of the components of Trafilatura (there's a minimal usage sketch just after this list).
Chared is a tool for detecting the character encoding of a text in a known language. It contains models for a wide range of languages.
SpiderLing is a web spider for linguistics. It can crawl text-rich parts of the web and collect tons of data suitable for text corpora.
Onion (ONe Instance ONly) is a de-duplicator for large collections of texts. It can measure the similarity of paragraphs or whole documents and drop duplicate ones based on the threshold you set.
Unitok is a universal text tokenizer with specific settings for many languages. It can turn plain text into a sequence of newline-separated tokens ("vertical" format), while preserving XML-like tags containing metadata.
NoSketch Engine is the open-sourced little brother of the corpus querying system Sketch Engine.
wiki2corpus is a script which downloads Wikipedia articles (for a given language) and outputs them in the prevertical format, which can be further processed by other corpus tools.
Language Filter is a language-discrimination tool. It works with the vertical format. The language of paragraphs and documents is determined according to pre-defined word lists with corpus frequencies.
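Since jusText also shows up in the storysniffer example below, here's a minimal standalone sketch of using it from R via reticulate. The URL is a made-up placeholder, the snippet assumes the Python package is installed (pip3 install justext), and filtering on is_boilerplate keeps only the paragraphs jusText classifies as real content:
library(reticulate)

justext <- import("justext")

# fetch the page (placeholder URL) and hand the raw HTML to jusText
httr::GET("https://example.com/some-article") |>
  httr::content(as = "text") |>
  justext$justext(
    stoplist = justext$get_stoplist("English")
  ) -> paragraphs

# drop the paragraphs jusText flags as boilerplate, keep the real text
Filter(\(p) !p$is_boilerplate, paragraphs) |>
  sapply(\(p) p$text) |>
  paste0(collapse = "\n") |>
  cat()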
There's a plethora of information in each of those wiki sections, including links to papers that describe many of the tools in-depth. Go forth and TOKENIZE ALL THE THINGS!
storysniffer
Since today's edition started off with some text processing crunchy goodness, let's continue the theme and introduce storysniffer [GH], a tool written by acclaimed data journalist Ben Welsh (@palewire) for inspecting the contents of URLs to estimate if they contain a news story.
The true/false guess (er…estimate) is based on a machine-learning model that was trained on a supervised sample of links collected by the News Homepages project. If you head over to the notebook with a bunch of model tests, it appears they've achieved ~96% accuracy. That remaining ~4% is non-trivial, though: it failed to detect that an NPR story was a news story when I gave it the URL of the text-only story vs the normal one (apologies, again, for the terrible Substack <pre> blocks):
# system("pip3 install storysniffer justext")
library(reticulate)
storysniffer <- import("storysniffer")$StorySniffer()
storysniffer$guess(
"https://text.npr.org/1114969355"
)
## [1] FALSE
storysniffer$guess(
"https://www.npr.org/2022/08/02/1114969355/the-ukrainian-women-who-make-art-in-the-face-of-war"
)
## [1] FALSE
It's safer not to rely solely on the URL-based model, and to also give it some actual article text:
justext <- import("justext")

# grab the text-only version of the story, pull the paragraph text out
# with jusText, and mash it back together into one blob of article text
httr::GET("https://text.npr.org/1114969355") |>
  httr::content(as = "text") |>
  justext$justext(
    stoplist = justext$get_stoplist("English")
  ) |>
  sapply(
    \(paragraph) paragraph$text
  ) |>
  paste0(collapse = "\n") -> article_text

# guess again, this time with both the URL and the article text
storysniffer$guess(
  "https://text.npr.org/1114969355",
  text = article_text
)
## [1] TRUE
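As a small follow-on (just a usage sketch building on the same guess() method shown above, not anything from the storysniffer docs), you can score a whole vector of URLs in one pass; the URLs below are made-up placeholders:
urls <- c(
  "https://example.com/2022/08/02/some-news-story",
  "https://example.com/about",
  "https://example.com/contact"
)

# one TRUE/FALSE guess per URL, returned as a named logical vector
sapply(urls, storysniffer$guess)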
The training corpus is limited to links published on News Homepages, most of which are in English, so there's definitely some bias. PRs for improving the models are welcome and encouraged.
otherweb
"News" is fundamentally broken, though folks have been complaining about news reporting and distribution since there's been "news". Much of today's news bustedness is due to the present business model designed to support news institutions.
There have been three epochs in the history of news economics:
The first epoch began in 1605 with the German periodical Relation aller Fürnemmen und gedenckwürdigen Historien. This periodical and the newspapers about faraway places that flourished in Europe over the ensuing century were financed entirely by subscriptions. And so it was for a century. Then there was the lengthy middle epoch, beginning with the placement of the first newspaper ad in a 1704 issue of the Boston News-Letter. That one lasted 290 years — a really good 290 years. The current epoch began in 1994, with the first online banner ad. Needless to say, it's been a rough [28] years — and lately, dire has turned to devastating.
The main problem is that if you measure success in clicks and views, there is no financial incentive to optimize for anything else.
The internet’s business model is simple:
Get users to click on a headline so they can see your article (= get views).
Get users to click on ads that accompany the article (= get clicks).
Repeat step 1.
At no point in this cycle does quality matter. Moreover, specific attributes of informative writing, like external references, are actively discouraged in this model: if the user clicks on an external reference, he or she may not come back to click on the next headline. This leads many online outlets to exclusively reference their own articles and to avoid any and all external links.
Otherweb is a news aggregator site that collects stories and creates "nutrition labels" (the section banner image has an example) based on a set of AI models dubbed ValuRank, which analyze text along multiple dimensions:
informativity
subjectivity
formality
offensiveness
hatefulness
external references
source diversity
clickbait headlines
use of known propaganda techniques
and more.
A key feature of this process and site is that there is no attempt to define what quality is. You get to decide that based on the label. It's like going to the grocery store and picking a healthy option (based on the food nutrition labels) vs an unhealthy one. We need a balanced food diet for our bodies to be sound, and I think it's safe to say that a balanced news diet does the same for the mind.
If you hit the ValuRank link (above), you can also download the Chrome Extension (always be wary of installing browser extensions, though this one — in the version I tried last week — seems to be safe) to have it assess the content on any web page you visit.
FIN
I’m going to try to make using ValuRank part of my daily news consumption routine. If I manage to keep that going, I’ll drop a blog post on the shape of my personal news media diet. ☮