Countless times, a team of data scientists and I have run into a technical discussion, and one frequently asked question, perhaps the only one relevant to this Medium article, is the size of the dataset relative to the question or problem the team aims to solve. It is troubling to see that many aspiring data scientists today never learn when and why to ask that question. A massive open online course might help them learn the mathematics underlying support vector machines (SVM), vanilla neural networks, and so on. However, how many of them teach even an introductory treatment of Vapnik-Chervonenkis theory (the VC dimension), a fundamental concept of statistical learning? The omission leaves many data scientists genuinely misunderstanding the limitations of their models: they think an ML model can give them anything, even a kidney. To keep this post from turning into a tutorial, I will not write about the details, history, and explanation of the VC dimension, or prove the VC inequality. Personally, Google is knowledge. Today I would like to focus on the unforeseen story of implementing a machine learning (ML) model with zero idea of the VC dimension, which can create ramifications understandable only by the few who do understand it.

First ingenious misunderstanding…

Please check the kernels on any problem set from a Kaggle competition. As one example, teams participating in Home Credit Default Risk suffered from two significant problems, which is exactly why advanced feature-engineering techniques arise: either the dataset is too small, or it is too noisy. Interestingly, no team has ever complained that the hardware is incapable of processing a dataset of that size. Another hidden problem is financial: many teams cannot afford to rent much GPU computing power. The VC dimension implies that the input distribution (the data points) decides the fate of the model's out-of-sample error. Think of it as how well your model performs once it is deployed to a production server. The VC dimension helps one estimate the risk of a failing prediction by providing a probability bound on the out-of-sample error, the largest error the model can make on a real event. Another use of the VC dimension is to help data scientists understand the importance of the training dataset. The training dataset is the mother of your ML model. Even if your ML model is only loosely complex, with a good training set the difference in model performance should not be big. For those who have little background here, I posted more information in this link. Many times data scientists run into an unsatisfactory conversation with the team, and it is great to know that we at least have a dataset to play around with. Keep the essential concept of the VC dimension with you at all times. It will help once you fall into the trap of hating a logistic regression line because it is not as fancy as a stack of them.
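For reference, and without proof, the probability bound mentioned above can be stated in one line. This is the form used in Caltech's Learning From Data course; the notation here is my addition, not something from this post. With probability at least 1 − δ over a training set of N examples:

```latex
E_{\text{out}}(g) \;\le\; E_{\text{in}}(g)
  + \sqrt{\frac{8}{N}\,\ln\!\frac{4\,m_{\mathcal{H}}(2N)}{\delta}},
\qquad\text{where } m_{\mathcal{H}}(N) \le N^{d_{\mathrm{VC}}} + 1 .
```

The bound tightens as N grows and loosens as the VC dimension d_VC grows, which is exactly why dataset size relative to model complexity is the question worth asking.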

Common scenario among aspiring data scientists…

Have you ever run into a math problem that has only one example in the homework exercises but appears many times on the midterm exam? That is what the VC dimension tries to tell you: relying on just one example will sabotage the model's performance. No data, no generalization. If one plans to improve a model's overall F1 score, then rethink the strategy: collect new data under a relatively better plan, and reframe the problem within the boundaries of machine learning, knowing the limitations through mathematical and algorithmic understanding. However, asking whether a model will always work as long as it sits on top of huge data is like asking a first-grade student about the pitfalls of the bitcoin price. It is irrelevant, because not every question is answerable! Merely collecting more data is a great strategy, but achieving a remarkable target requires further steps (some fancy, hard math plus scientific assumptions). Consider the seemingly impossible quest of building a statistical model to predict the next earthquake. Over decades of hard work, research teams around the world (including at MIT) have learned that, so far, reliably predicting an earthquake event is impossible.
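A minimal sketch of "no data, no generalization" (my own illustration, not from this post): train the same high-capacity model, an unrestricted decision tree, on samples of growing size and watch the gap between training and test accuracy shrink. The synthetic dataset and the sample sizes here are assumptions chosen purely for the demo.

```python
# Train one high-capacity model on growing samples of a synthetic problem and
# measure the generalization gap (train accuracy minus test accuracy).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

gaps = {}
for n in (50, 500, 2500):
    clf = DecisionTreeClassifier(random_state=0)  # unrestricted tree: high capacity
    clf.fit(X_train[:n], y_train[:n])
    gaps[n] = clf.score(X_train[:n], y_train[:n]) - clf.score(X_test, y_test)

print(gaps)  # the train/test gap narrows as n grows
```

The tree memorizes its training rows at every sample size, so the training accuracy stays at 1.0; what improves with more data is the test accuracy, and therefore the gap between the two shrinks.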

Splitting data into training and test sets is a terrific idea. However, make sure it does not fool you about the model's performance. Think of a teacher preparing students for the final exam by giving a hundred practice exams, and then handing out those same exams as the real one. A student sits stunned in the exam room, recalling the practice questions, and starts to ask, "Are these questions just the same ones?" An imperfect test set is easily created: just choose a training set that shares examples with the test set. One might argue that a high F1 score is concrete evidence of a successful ML model. That is similar to saying that everyone who went to Harvard Business School will earn an annual income of over $300,000. Well, how about those who work in the public sector, where even senior officers earning over $200,000 is very unlikely? When it comes to testing ML models, Professor Yaser Abu-Mostafa, who has taught CS 156 at Caltech, makes a good point: request the model weights and test them on actual data in the production environment; that is preferable to lab results.
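The reused-exam failure mode is easy to reproduce. Here is a hypothetical sketch (the dataset, the label noise, and the choice of a 1-nearest-neighbor model are my assumptions, not anything from this post) where scoring a memorizing model on rows it has already seen reports a perfect F1, while a clean split tells the honest story.

```python
# Compare an honest F1 (unseen rows) against a leaky F1 (rows that were in
# the training set), using a model that memorizes its training data.
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 20% label noise keeps the honest score clearly below a memorized one.
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)  # memorizes training rows

honest = f1_score(y_te, model.predict(X_te))  # rows the model has never seen
leaky = f1_score(y_tr, model.predict(X_tr))   # "test" rows leaked from training

print(f"honest F1 = {honest:.3f}, leaky F1 = {leaky:.3f}")
```

The leaky score comes out at a perfect 1.0 because the nearest neighbor of every training row is the row itself, which is exactly the stunned-student scenario: the model is reciting practice questions, not generalizing.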

In conclusion, the VC dimension has proved useful for reasoning about generalization from a finite sample to an unknown target. Since every problem needs a unique set of algorithms, I am not saying you should stop learning deep learning; rather, none of it matters when you have no data and no clean test set to play with. The world of data science has not yet come to any final conclusions, but it is better to quickly realize the importance of finding the right amount and quality of data to solve real-world problems. Stay tuned with the VC dimension.
