Labels Matter
Labels matter in data science and public discourse. Quantitative professionals assign items to their proper bins, which inform how each item should be handled.
Data science, like public debate, often reduces to classification. Classification is little more than the assignment of a label. Suddenly, punditry begins to distinguish and define, folks assigned to specific bins begin to resist their labels or the limitations of labels in general, and semantics rears its head.
Semantic debates become tools of classification. As George Carlin reminds us posthumously, words matter. The labels we assign to people and ideas matter. To prove this point, let’s consider some examples from popular, controversial issues.
Two Common Labels
What do we call a person who’d never have an abortion themselves, but thinks the government shouldn’t prevent anyone else from having an abortion? We call them “pro-choice” and certainly not “anti-abortion.”
What do we call a person who chooses not to receive a COVID vaccine, but does not believe the government should prevent anyone else from receiving one? We call them “anti-vax,” and certainly not “pro-choice.”
One Re-applied Label
Of course, data scientists struggle not only with misclassification, but with when and how to split one group into two. When should two different types of items be divided into two separate groups and when can they remain as one?
What do we call a person who has received all the recommended boosters, but does not believe the government or any other institution (employers, restaurants, etc.) has a right to mandate that others follow suit? Often, that person is also labeled “anti-vax.” Notably, that individual feels differently about vaccines than the individual assigned the same label above.
There are times when coarse groupings are appropriate. An analysis of citizens over and under the age of 50 might be illustrative without further partitioning of those classifications.[1]
There are other times when such reductive groupings preclude nuanced analysis. Data scientists and developers struggle mightily when two disparate use cases and data types must follow one rigid workflow.
Civically engaged citizens have similar struggles.
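The coarse-versus-fine grouping problem above can be sketched in a few lines of Python. This is only an illustration: the `Person` fields, the function names, and the "pro-vax" label are hypothetical and chosen for this example. The point is that a single coarse label places the two people described above in the same bin, while keeping the two underlying questions separate tells them apart.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    # Both fields are hypothetical attributes invented for this sketch.
    vaccinated: bool         # the person's own choice
    supports_mandate: bool   # whether they think institutions may require it


def coarse_label(p: Person) -> str:
    """Collapse two separate questions into a single bin."""
    if p.vaccinated and p.supports_mandate:
        return "pro-vax"
    return "anti-vax"


def fine_label(p: Person) -> str:
    """Keep the personal choice and the policy view as separate dimensions."""
    personal = "vaccinated" if p.vaccinated else "unvaccinated"
    policy = "pro-mandate" if p.supports_mandate else "anti-mandate"
    return f"{personal}, {policy}"


people = [
    Person(vaccinated=False, supports_mandate=False),  # declines the vaccine
    Person(vaccinated=True, supports_mandate=False),   # boosted, opposes mandates
    Person(vaccinated=True, supports_mandate=True),
]

coarse_counts = Counter(coarse_label(p) for p in people)
fine_counts = Counter(fine_label(p) for p in people)

print(coarse_counts)
print(fine_counts)
```

Whether the finer labels are worth the extra bins depends on the question being asked, which is the same judgment call the essay describes.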
Where This Leads
This is why politicians repeat their talking points, then replicate the use of a term with every available human being sympathetic to their position. Eventually, through brute force, we find ourselves with a broadly accepted lexicon. In the abortion case, we choose a label focused upon the choice. In the vaccine case, we choose a label focused upon opposition.
Steelmanning and growth mindsets demand care with words and the perceptions those words create.
Model-building requires the idea, often attributed[2] to Einstein, that concepts should be made as simple as possible, but not simpler. Software development, data science, and informed discussion all require precision of language.[3]
It is not profane[4] or malicious words we ought to fear. It is those that are imprecise, coercive, or open to multiple interpretations that chill discussion and break codebases.
Choose your labels wisely.
[1] This is where recursive algorithms, like those used to build classification and regression trees, offer stopping criteria beyond which no further splitting occurs and/or methods for pruning overly specified, degenerate nodes. Otherwise, we’ll start analyzing the properties of bald Estonians, aged 71, with a propensity towards pontification and a penchant for a specific variety of borscht.
[2] Apocryphally, and possibly a paraphrase of Occam’s Razor or the like.
[3] What is syntax, beyond precise definitions of terms and their applications?
[4] At least seven of ‘em.
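The stopping rule mentioned in the first footnote can be sketched as follows. This is a toy, not a real classification and regression tree: it splits a list of ages at the median instead of using an impurity measure, and the function name `split_ages`, the parameter `min_leaf`, and the sample ages are all assumptions made for illustration. It only shows how a minimum group size halts the recursion before every group shrinks to a single person.

```python
def split_ages(ages, min_leaf=3):
    """Recursively halve a list of ages at its midpoint.

    Splitting stops when either half would hold fewer than
    min_leaf items; the unsplit group is then returned as a leaf.
    """
    ages = sorted(ages)
    mid = len(ages) // 2
    left, right = ages[:mid], ages[mid:]
    if len(left) < min_leaf or len(right) < min_leaf:
        return [ages]  # stopping rule reached: keep this group whole
    return split_ages(left, min_leaf) + split_ages(right, min_leaf)


# Made-up ages, for illustration only.
ages = [22, 25, 31, 38, 44, 47, 52, 58, 63, 67, 71, 75]

print(split_ages(ages, min_leaf=6))  # two coarse groups
print(split_ages(ages, min_leaf=3))  # four finer groups
print(split_ages(ages, min_leaf=1))  # every age ends up in its own group
```

With `min_leaf=1` nothing stops the recursion until each group holds one person, which is the over-specified outcome the footnote jokes about.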