Big Data Trends
Data scientists spend 80% of their time with data preparation. But that lengthy process doesn’t reflect on their inefficiency. That is their JOB!
If you are not good at data preparation, you are NOT a good data scientist.Period.
The validity of any analysis is resting almost completely on the preparation. The algorithm you end up using is close to irrelevant.Complaining about data preparation is the same as being a farmer and complaining about having to do anything but harvesting and please have somebody else deal with the pesky watering, fertilizing, weeding, etc.
Raw Data Collection
This being said – data preparation can be made difficult by the process of raw data collection. Designing a system that collects data in a form that is useful and easily digestible by data science is a high art. Providing full transparency to DS how exactly the data flows to the system is another. It involves processes that consider sampling, data annotation, matching, etc. It does not include things like replacing missing value and excessive normalization. Creating an effective data environment for DS needs to involve DS and cannot be entirely owned by engineering. DS is often NOT able to spec such system requirements in sufficient detail to allow for a clean handover.
But in the bigger picture, there are more important things to consider. The by far biggest issue I see is data science solving irrelevant problems. This is a huge waste of time and energy. The reason is typically that whoever has the problem is lacking data science understanding to even express the issue and data scientists end up solving whatever they understood might be be the problem, ultimately creating a solution that is not really helpful (and often far too complicated). A typical category are ‘underdefined’ tasks: “Find actionable insights in this dataset!”. Well – most data scientists do not know which actions can be taken. They also do not know what insights are trivial vs. interesting. So there is really no point sending them on a wild goose chase.
Wild Goose Chase
The “solving the wrong problem” is pervasive in part because the data science is not sufficiently involved in the decision process (thanks to Meta for asking me to clarify). Now – not EVERY data scientist can and should be expected to be able to shape the problem as well as the solution (back to the unicorn problem), but at least one data scientist on the team should. The bigger issue is however not the lack of ability/willingness from the data science side (although indeed there are plenty who just like to solve a cute problem, not matter how relevant) – but often a corporate culture where analytics, IT, etc is considered an ‘execution’ function. Management decides what is needed and everybody else goes and does it.
Lack of Skepticism
On an individual level and a given (worthwhile) problem I would blame lack of data understanding, data intuition, and finally skepticism as most limiting factors to efficiency. What makes these factors contribute to inefficiency is NOT that it takes longer get to an answer (in fact lack of the three typically leads to results much more quickly) but rather how long it takes to a (almost) right answer.
Inspired from Quora post of Claudia Perlich, Chief Scientist Dstillery, Adjunct Professor at NYU