The best data scientists will never win a Kaggle competition
Anyone in the machine learning community is familiar with Kaggle. To the uninitiated, it is a repository of tens of thousands of public datasets, hundreds of thousands of public notebooks1, and numerous public competitions. In these competitions, participants vie to develop the best-performing model, as measured by a metric the competition’s administrator has predetermined. Maybe today the challenge is to build the best tool for estimating the price of avocados next week. Maybe we’re trying to detect faces in images. Maybe we’re analyzing images of galaxies or the text in a Hollywood script.
As a result of Kaggle’s popularity, resumes are littered with references to personal-project successes, with victories and elite rankings offered as evidence of a skill set the market demands (and compensates well). The question is: what exactly does it mean to win a Kaggle competition, and should employers covet these rare birds over others in the flock?
Kaggle Pros
Kaggle competitions present semi-realistic problems from the business and scientific communities, providing data for the machine learning community to explore. The ability to produce meaningful results from these data demonstrates a number of skills that are prerequisites for professional success. Even those who do not win a competition but produce a strong model that outperforms a baseline demonstrate some practical competence.
Moreover, building comfort with establishing a mean-value baseline, handling a dominant class, and iteratively improving performance demonstrates real experience and aptitude in standing up an ML solution on a new dataset.
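For the curious, that first hour on a fresh dataset might look something like the following minimal sketch, built on scikit-learn with toy data (every number here is invented; only the workflow is the point):

```python
# A minimal sketch of the "first hour" on a new dataset: establish a naive
# baseline, then account for a dominant class. Toy data; illustrative only.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=1000) > 1.5).astype(int)  # imbalanced labels

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: always predict the dominant class. Any real model must beat this.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# A simple model, with class weights so the dominant class doesn't swamp it.
model = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

for name, clf in [("baseline", baseline), ("weighted model", model)]:
    print(name, balanced_accuracy_score(y_test, clf.predict(X_test)))
```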
Solving Kaggle problems often requires some feature engineering, and possibly leveraging outside data or pre-trained models. Both of these skills are crucial in low-data environments. For instance, in predicting home prices, the distance to water might be enormously relevant, even if it is not included in a Kaggle dataset. Thinking about such factors builds intuition.
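To make that example concrete, here is a sketch of engineering such a feature via a haversine distance to nearby water (the coordinates and column names are hypothetical, chosen purely for illustration):

```python
# Sketch of a "distance to water" feature for a home-price model.
# Shoreline coordinates and column names are hypothetical.
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

homes = pd.DataFrame({"lat": [47.61, 47.52], "lon": [-122.33, -122.30]})
shoreline = [(47.60, -122.34), (47.58, -122.38)]  # hypothetical water points

# Each home's distance to its nearest shoreline point becomes a new feature.
homes["dist_to_water_km"] = [
    min(haversine_km(h.lat, h.lon, s_lat, s_lon) for s_lat, s_lon in shoreline)
    for h in homes.itertuples()
]
print(homes)
```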
Finally, the dedicated Kaggler has probably amassed some experience across a diversity of machine learning problems and data formats. They’ve probably played with structured and unstructured data, text, images, and so on. These are all useful experiences.
But before you tell your recruiters to descend upon the ranks of Kagglers like insects upon a bovine’s posterior…
The Data Scientist’s “Real” Job
Deciding whether to back up a dump truck filled with cash to the front door of a Kaggle competition winner depends on what it is you believe a data scientist does to earn their comfortable living.
Equally importantly, the decision depends on what you believe Kaggle rewards when it announces its victors.
The tech blogosphere is littered with memes and posts poking fun at the disparity between what we imagine data scientists are doing all day and the actual content of their working hours. The perception is an office (or spare bedroom nowadays) filled with PhDs assembling the most complex neural networks modern technology has devised. The reality is a similar space filled with well-caffeinated ex-academics cleaning, wrangling, and otherwise navigating the general messiness of real-world databases and their contents.
Surely2, no one would begrudge an aspiring data scientist for choosing an avocational pursuit that avoids the hassle of cleaning, processing, and trying to determine what each of thousands of corporate data tables really contains.3
As one brilliant data scientist once described Kaggle, its competitions are like Olympic races. Conditions are controlled, distances are well-defined, and the criteria for success are easily assessed. Industrial data science is more like an expedition wherein Lewis, Clark, and Sacajawea are traversing a path to the Pacific. Winning an Olympic race requires becoming a remarkable specimen of human fitness. While that might be helpful for exploring inhospitable terrain, it is only one of numerous relevant attributes.4
The Challenges of the Job
In addition to sparing competitors the need to find data, wrangle data, sit in meetings ascertaining the true nature of data, and so on, Kaggle also offers competitors one additional accelerant. This is perhaps the single most important distinction between the contrivance of an online competition and “the job.”
As countless platitudes note, it is often far easier to solve a problem than to scope and define a problem, much less scope and define the correct problem.5
AE is a development, data science, and design studio. We employ extraordinary data scientists who are well-versed in the finest algorithms known to man or machine. Even so, AE’s value to its clients lies in the ability to help those clients define a data science problem in a way that addresses the business’s needs. Writing Python code and deploying a model effectively is a necessary but insufficient skill set.
And this is where Kaggle, for all of its value as a training ground and a fantastic forum for exploration of mathematical techniques, falls short. In any Kaggle competition, the problem is narrowly defined. Given these data, estimate, classify, or predict that value or attribute. In most real-world scenarios, choosing which data are relevant and which value ought to be classified are the true challenges.
The difference between a mediocre data scientist and a great one is the ability to make pragmatic, insightful decisions on exactly those questions.
The Competition
To win a Kaggle competition, one is not required to choose the correct problem or scope that problem insightfully. One is simply required to produce the highest possible value for an accuracy metric. Typically, this requires some combination of ensembling (combining multiple models), hyperparameter optimization (tuning the settings for those models), and piling on training time with more GPUs (more computational power).
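For readers who have never lived on a leaderboard, here is a compressed sketch of those two levers in scikit-learn (synthetic data; the particular models and parameter grids are arbitrary choices for illustration):

```python
# The two classic leaderboard levers, compressed: hyperparameter search,
# then ensembling the tuned models. Synthetic data; grids are arbitrary.
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Hyperparameter optimization: grid-search each base model's settings.
rf = GridSearchCV(RandomForestRegressor(random_state=0),
                  {"n_estimators": [100, 300], "max_depth": [None, 10]}).fit(X, y)
gb = GridSearchCV(GradientBoostingRegressor(random_state=0),
                  {"learning_rate": [0.03, 0.1], "n_estimators": [100, 300]}).fit(X, y)

# Ensembling: average the tuned models' predictions for a small extra boost.
ensemble = VotingRegressor([("rf", rf.best_estimator_), ("gb", gb.best_estimator_)])
print(cross_val_score(ensemble, X, y, cv=5).mean())  # mean R^2 across folds
```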
These are valid approaches and legitimately relevant skills to deploy in an industrial setting. However, they also skirt yet another important question in any real-world scenario.
When is the model “good enough?”
A great data scientist knows what the business truly needs, what level of accuracy supports that need, and when to deploy his or her scarcest resource (time) elsewhere. For instance, a 97% accuracy rate for a model used by air traffic controllers to prevent mid-air collisions would be a disaster. For a model classifying animal images, it might be incredibly useful.
In that case, should the data scientist invest another 200 hours in reaching 97.4% accuracy? Or perhaps, are those 200 hours (and the salary they imply) better spent building another MVP that can solve another high-impact problem? Kaggle makes that answer clear - keep improving a model until it can top a leaderboard!
Note: We’re plenty capable of winning a machine learning competition if that is the operational objective - as proven by our recent victory in the Neural Latents Benchmark challenge.
Moreover, a Kaggle competition is not always the clear assessment of skill it purports to be. The winning model is probably not the best model. Given the random variation of model performance out-of-sample and the narrow margins separating the top positions, a data scientist with the 100th-best model is likely comparable to the data scientist who finished first!
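A back-of-the-envelope simulation makes the point: hand 1,000 competitors models of nearly identical true skill, add a dash of test-set noise, and see whom the leaderboard crowns (the skill and noise scales below are invented for illustration):

```python
# Illustrative simulation: 1,000 competitors with nearly identical true skill,
# plus random out-of-sample variation in the final leaderboard score.
import numpy as np

rng = np.random.default_rng(42)
true_skill = rng.normal(loc=0.900, scale=0.001, size=1000)     # nearly identical models
leaderboard = true_skill + rng.normal(scale=0.005, size=1000)  # noise swamps skill

order = np.argsort(leaderboard)[::-1]  # leaderboard order, best score first
winner = order[0]
skill_rank = 1 + (true_skill > true_skill[winner]).sum()
print("leaderboard winner's rank by true skill:", skill_rank)
print("score gap between 1st and 100th place:",
      round(leaderboard[order[0]] - leaderboard[order[99]], 4))
```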
Hiring
To be clear, there is no reason to assume that someone who has performed well on Kaggle competitions is not an excellent data scientist. There are probably plenty of fantastic data scientists who, during their early career or even while still in school, experimented with techniques in Kaggle competitions.
However, we ought to view the prospective senior data scientist touting recent Kaggle wins with a warier eye. Imagine, for instance, someone choosing between being paid as a data scientist to solve a business’s thorniest quantitative problems and investing the hours required to win a Kaggle competition. Which priorities would an employer prefer?
Kaggle is a demonstration of technical competency and skill, just perhaps a far narrower set of skills than those demanded in the vast majority of data science roles. It is a demonstration of familiarity with the most recent libraries, algorithms, and techniques.
It is not, however, evidence of the ability to choose the proper problem, scope that problem intelligently, determine what data are relevant, wrangle and clean that data, develop the simplest tool capable of solving the problem quickly, and communicate those findings to a non-technical audience.
A data scientist capable of implementing the most complex architectures known to modern computer science, if unable to communicate the results and their meaning to a stakeholder, often generates no impact. The ability to convert an accuracy metric into salient, tangible next steps requires clear communication of what was done, the choices made en route, and what was learned from the results.
The ability to scope, choose data, and communicate results is what distinguishes truly extraordinary data scientists.
Those are the skills that justify the salaries.6
For all that glitters on Kaggle is not gold.
1. Often, data scientists compile their analyses in online “notebooks,” which allow scripts to be written and executed in real time, results to be visualized, and output to be shared. A common notebook type, Jupyter, is a tool-of-art for many practicing data scientists.
2. Yes, I called you “Shirley.”
3. For the non-technical folks, a commonly used term is “metadata.” This refers to, broadly, “data describing data.” In other words, just because a column in some corporate data lake is labeled “price” doesn’t necessarily mean that price is charged to some customer. There’s probably some discount applied from some other table, a potential rebate from another, some shipping rules applied from yet another table, and so on. Of course, the only way one might know how to address these nuances is either a lengthy term of employment (increasingly uncommon, circa 2022) or, by some miracle, sufficient metadata describing these nuances to data science passers-by. For anyone who has been employed in corporate analytics, asking data engineering professionals to provide additional documentation is like requesting clemency from Torquemada.
4. I mean, I’d rather add Usain Bolt to my expedition than William Howard Taft, but does that mean Usain Bolt is the ideal explorer?
5. I particularly enjoy “It’s better to solve the right problem approximately than the wrong problem exactly.” (John Tukey). There’s another from organizational theorist/consultant Russell Ackoff: “Successful problem solving requires finding the right solution to the right problem. We fail more often because we solve the wrong problem than because we get the wrong solution to the right problem.” The more cynical, corporate take comes from Peter Drucker: “The manager who comes up with the right solution to the wrong problem is more dangerous than the manager who comes up with the wrong solution to the right problem.”
6. And the machine learning model that assesses those skills quickly and accurately is being researched with all the fervor and futility of ancient alchemists.