Data Science Teams Need Generalists, Not Specialists

(hbr.org)

176 points | by bryanrasmussen 1855 days ago

12 comments

wokwokwok 1854 days ago
I work in this field and I flatly reject that the purpose of a ‘business intelligence’ team is to “develop profound new business capabilities” or “organise the data scientists such that they are optimized to learn”.
What planet are you on?
Such a team is absolutely not a pure research team, devoted to investigating the data and driving new capability and insights.
The purpose of such a team is to give the business operational insight into its own state, and to serve the business as the business requires in order to deliver business value.
What the article proposes is like having a development team that has 100% time to pursue their interests and build whatever they like to show to the business to develop new business capability.
There is a place for such a team, and a data science team as described, but like google X, a very specialised thing for companies that can afford to dream.
...not for everyday engineering teams.
Generalists vs specialists is a straw man, obviously you have both.
- pea 1854 days ago
  I agree. Anecdotally, I've seen a few companies hire large data science teams and operate them outside the context of any product management; i.e. as basically R&D arms. The problem is that most companies do not have R&D-level problems, and they do not have a good way to surface priorities to the data science team, manage delivery, etc., and the data science team ends up building a lot of stuff from scratch, or working on priorities that don't add a lot of value, or not having stakeholders really understand what they do. This isn't good for the data science team either, as it's hard for them to communicate value.
  I think the pin analogy in the article is a slight straw-man and a strange choice of comparison: sure, running a data science team isn't like running a factory production line. It requires experimentation and isn't deterministic. But IMO that doesn't mean it should sit outside of product management, or is completely different from building or managing a regular software engineering team: in some instances, hiring full-stack developers absolutely makes sense. In some, you will need folks who are specialised to certain problem spaces; isn't it the same in data science?
  I also think that, if data science code is going from R&D into actual prod, it is helpful if it is reviewed and tested from a more software eng POV, and may end up in parts being rewritten by teams with more specialised skill sets around e.g. performance. It would put a huge onus on every data scientist if they had to learn this stuff (and I'm not talking like petabyte data engineering)
  I've been working on a way[1] to help data scientists more easily share and run their models/code without having to spend a lot of time on the engineering side, but a big group of our users are also folks in data teams who are not experts in lots of ML or engineering (are not "full stack" in this sense), but they work with data every day and need a way to use parts of the data science ecosystem more easily. IMO upskilling these folks may be more the norm going forward than being able to hire data scientists who are also experts on the entire stack.
  [1] https://nstack.com
- usgroup 1854 days ago
  I think this is accurate. Few businesses have the slack required to hire a speculative research division to delight and surprise the business with new capabilities :)
  More typically DS will optimise. I think engagement bandits, recommendation systems and such are all examples of optimising rather than driving new capability, and to agree with you, generalist vs specialist is a straw man in this instance.
  After all someone at some point is going to make a business case for a DS team member and the vast majority of businesses don’t have the luxury of doing anything other than focusing on growth.
- spongepoc 1854 days ago
  >I work in this field and I flatly reject that the purpose of a ‘business intelligence’ team
  Good job the article is about data science and not business intelligence then. Cool the feigned indignant outrage please.
  - nerdponx 1854 days ago
    It doesn't say "business intelligence" anywhere in the article.
  - shadowmint 1854 days ago
    > “one person sources the data, another models it, a third implements it, a fourth measures it”
    Call it whatever you like, it is what it is.
  - stronglikedan 1854 days ago
    Is succinctly stating your educated opinion considered feigned indignant outrage now? If so, society is screwed.
- zeckalpha 1854 days ago
  That illustrates the difference between a company that sees data as a function vs as a core strategy. Stitch Fix is definitely the latter.
- licyeus 1853 days ago
  > What the article proposes is like having a development team that has 100% time to pursue their interests and build whatever they like to show to the business to develop new business capability.
  I didn't get this from the article. My interpretation was that rather than having a DS team with specialized roles sitting outside of product/engineering, companies should locate DS generalists within product teams. I.e., push data competency closest to the teams working with the problems + data, and as the DS learns where ML can be applied, the entire team can deliver. In a DS-as-specialist structure, you'd have to locate several roles on a product team to achieve results (or co-locate them and deal with overhead).
  The article confuses by claiming "the goal... is not to execute", but in context, it's clear that it's a criticism of process efficiency myopia, not of product delivery.
  And the word "research" only appears in the article in a negative connotation. “Develop profound new business capabilities” doesn't mean the research of new models, but instead the application of existing techniques to business processes (e.g. a model that can reasonably predict user churn based on sentiment analysis of messages sent on your platform would be profound).
  - pmart123 1853 days ago
    The only issue with that is you sometimes develop duplication across siloes. Instead of one team building the framework/analytics engine, you have a bunch of different experimental environments, datasets, etc.
- navigatesol 1854 days ago
  >What planet are you on?
  The planet that says the purpose of the business intelligence team can be whatever the company wants it to be.
kodz4 1854 days ago
Sounds like the author wants to work at a university.
And that is probably where the best data science results are going to come from. Where inter-disciplinary teams and cross talk are the norm.
Science takes time to make sense of data.
We have figured out how to produce gigantic quantities of data, but that doesn't mean science gets faster.
Whether it is CERN or Wall Street or the NSA or Facebook, processing the data takes its own sweet time.
And when they don't find anything or use it in misguided ways it takes time to work that out too. Because everyone is conditioned to hide that.
It took 20-30 years before anyone took the experiments (data) of Michael Faraday seriously enough to get an accurate math model of electromagnetism. There were a whole lot of famous mathematicians around, and all of them had access to the data. So why did it take time?
Orgs with data don't have that kind of time. And the truth is these mythical generalists don't exist. They really can't be quickly mass produced like vegetables. And on top of it all orgs and execs are conditioned to not share their data.
This combo of factors is why we see so many bad consequences and erosion of trust in every single org dealing with big data.
We are all living under the delusion that Data Science is like working on crude oil at a refinery. It's more like working at a landfill with arbitrary deadlines to find diamonds skewing incentives for the data to be misused.
uptownfunk 1854 days ago
I work in this field.
I am expected to serve clients in a lot of different domains.
When we were doing data science work for an insurance client, I stayed up late for weeks reading about actuarial models and how things currently work as well as learning all the jargon there is in the industry. I could probably pass the highest level actuarial exams at this point (or at least not fail too embarrassingly) [edit: OK I probably couldn't do this, but I could probably pass any/all exams related to the quant side] and also innovate superior premium pricing models using "Machine Learning". Not because I want to become an actuary but that's what's required to do good work.
When I worked in pharma, I learnt from other PhD's/PostDoc's on my team about oncology, a very specific type of oncology in fact, and everything that goes into Pharma companies and their marketing efforts, how doctors operate and behave. Not only that, but I learnt from an industry expert on all the nuances and subtleties that involve analyzing various types of medical claims data. (Hint: It's a total CF)
I could go on and on about the different domains I've had to work in. But the whole point is, being a generalist, in the sense of a dabbler, is utter nonsense. If you want to be a data scientist, you have to be flexible enough to work in any domain, and also have the gumption to become a specialist in the field, do something new in that domain with your shiny "machine learning" knowledge, while making sure your models are not GIGO from spurious statistical assumptions, and making sure you know how to code decently enough that your algorithms/code doesn't shit itself. This probably aligns closer to what the article was actually talking about..
(Edit: Sorry for the confusion, in my world the word generalist has a totally different meaning..)
- tfehring 1854 days ago
  > [edit: OK I probably couldn't do this, but I could probably pass any/all exams related to the quant side]
  After "staying up late for weeks" at some indeterminate point in the past? I'd be amazed. The pass rate for the quantitative finance exams is ~35%, and that's among people who stay up late for months, not weeks, specifically to study for those exams, after having spent years of their careers working on quantitative financial models at insurance companies and pension consultancies.
  I'm calling this out not to nitpick about the difficulty of actuarial exams, but to make a broader point: You are a generalist. This is what being a good generalist looks like. Developing knowledge that's on parity with specialist practitioners in a field is an upper bound, and it's not really practical to attain for highly specialized fields. But the amount of domain-specific knowledge that's needed varies widely from project to project, and needing to read up on the subject after work for a few weeks is well within that range.
- amirathi 1854 days ago
  > this generalist thing is utter nonsense
  By your own description, you sound like a generalist with ability to dive deep into the areas needed. You seem to refute the core premise of the article but your description says otherwise.
  - uptownfunk 1854 days ago
    Honestly, I'm just put off by the word generalist. It makes it sound like just because someone can call a few APIs and create a deep learning model, that, as long as they have a clean data set, all of a sudden the world now revolves around them. Frankly, a lot of it is because of the hype, all of a sudden data scientists are these magical leprechauns that by virtue of their fancy algorithms can make money appear out of thin air.
    It really isn't like that, you really have to be able to go deep, as you said. Yes, there is a part of it where you have to be flexible (I wouldn't call it general, because I think it sweeps too much under the rug), so as to go so deep into a topic, you pass this kind of "expert-level turing test" where, were you and a domain expert put in the same room, a reasonable person wouldn't be able to tell you apart, where the weaker version is "another expert wouldn't be able to tell you apart" or something like that..
    - wefarrell 1854 days ago
      It sounds like you equate 'generalist' with BSer, but the article's definition of generalist matches what you're advocating.
      > Specialists’ work is coordinated by a product manager, with hand-offs between the functions in a manner resembling the pin factory: “one person sources the data, another models it, a third implements it, a fourth measures it” and on and on.
      If you're taking the time to learn the domain, source the data and clean it then you fit their definition of a generalist.
  - scottlocklin 1854 days ago
    The author of the HBR article works in a subject where literally nobody can be a specialist of the type he needs unless they work at his company. He's basically saying "this is the kind of data scientist I need; someone flexible enough to learn about my weird problem." He's also requesting a DS which can do everything from soup to nuts. That doesn't mean it's good for a DS to be a generalist.
    FWIW I am a generalist DS type (arguably just generalist: database engines, blockchain, ML, DSLs, etc), but it's almost entirely because I'm easily bored. From an optimization of my paycheck and hours worked point of view, going deep in one area nets much larger paychecks and a nicer (aka fewer hours) lifestyle.
- mr_overalls 1854 days ago
  > I could probably pass the highest level actuarial exams at this point
  You must be an exceptional intellect, then. Actuaries famously devote thousands of hours and years of their lives to passing the exams.
  - steev 1854 days ago
    It is complete nonsense and indicates to me the OP doesn't really understand the field as well as they think. As someone who has gone through all the exams, the upper-level exams test aspects of the industry in incredible depth and are more about regulatory requirements than any specific model. OP may have a decent understanding of reserve models, cat models, or financial forecasting, but those are not a core part of upper level exams (although you would certainly be expected to know them).
    I spent hundreds of hours memorizing flashcards on US healthcare regulations as well as some things about the Canadian healthcare system. By far the worst years of my life.
    - uptownfunk 1854 days ago
      Sorry, I didn't mean to make it seem like that, you guys do go through a ton of hard work and effort to earn those credentials. It was perhaps too liberal a hyperbole..
- blunte 1854 days ago
  Perhaps it depends on one's definition of "generalist".
  Having the flexibility, willingness, and capacity to learn almost anything as needed (easily or with intense extra effort) is the hallmark of a generalist by my definition.
  - mlthoughts2018 1854 days ago
    I had a different reaction. All the things mentioned in the parent comment are subfields of applied mathematics and scientific computing. Hopping between those different subfields quickly is actually more an indication of deep vertical specialization, and is nearly the opposite of being a generalist.
    In software, I’d consider a generalist to be someone whose career goals are only focused on the business problems, rather than career goals focused on developing a specific skill, mastering a specific technology, or becoming known as a famous expert within a specific field or job function.
    For example, I once worked with a colleague who was an amazing polyglot programmer. Our boss tasked him with fixing an asynchronous web service, including a ton of work in the frontend.
    It had high business value, but my teammate wanted to focus on machine learning in order to create job experience on his resume that would let him get machine learning jobs.
    While the bosses were busy fussing over essentially the desire for every engineer to be an arbitrarily fungible generalist, this guy quit and moved on to a place that would let him specialize in machine learning.
- AndrewKemendo 1854 days ago
  > But the whole point is, this generalist thing is utter nonsense.
  Your entire write-up just described the role of a generalist as described in the article. I think you're assuming a generalist means something that it doesn't because it's pretty clear in the article that the author is describing what you describe.
  Why do this?
dagw 1855 days ago
They need both. Having a domain expert in the domain you are modelling will save you a lot of time and lets you avoid a lot of dumb mistakes.
- 0815test 1854 days ago
  Also, you need domain expertise to do good feature engineering and to develop good tailored architectures. These are quite critical to obtaining SOTA results, so far as I can tell. People like to pretend that deep learning and "AI" have made feature/model engineering unnecessary, and it's just not true.
  - hikkigaya 1854 days ago
    Not really, you don't need to be an expert in go to beat the best go player.
    - bloomer 1854 days ago
      Except go (and all other well defined games) are the most trivial of domains. For real business problems, the difficulty is most often trying to determine what the problem is and how it relates to the rest of the world.
    - currymj 1854 days ago
      there was a strong model of the world in AlphaGo, which allowed it to accurately simulate games that followed the rules.
      that is quite a lot of domain expertise, much more than in most ML systems.
      - natalyarostova 1854 days ago
        Another way of saying this, is you don't really need the same level of domain expertise if the problem space is entirely defined by a small set of clearly articulated rules.
    - nerdponx 1854 days ago
      I'd be very surprised if the AlphaGo team didn't at least consult with a few Go players...
- 4thaccount 1855 days ago
  Yea. In my industry you could hire a team of generalists and I wouldn't expect any of their conclusions to be valid. Now if a domain expert can get them a valid dataset...then they can do some statistics and analysis and maybe uncover something I wouldn't notice.
kyllo 1853 days ago
In my experience you basically have to give data scientists direct responsibility for a business process in order to ensure that their models are relevant to and actually get utilized in the process.
Forecasting is a great example of something that every retail company needs, and that data science is supposed to help with. But if you make planners responsible for planning and ordering (minimizing surpluses and stockouts), and make the data scientists responsible for developing forecasting models, the planners won't trust or use the forecasting models--they'll just continue making their own personal models in Excel.
If you want to actually solve a business problem with machine learning, then you have to actually give the data scientist decisionmaking authority in the business process and responsibility for the business result.
thanatropism 1853 days ago
In this thread: people who have never heard of industrial labs.
https://en.wikipedia.org/wiki/Bell_Labs
https://en.wikipedia.org/wiki/PARC_(company)
https://en.wikipedia.org/wiki/Research_and_development#Busin...
michaelcampbell 1855 days ago
Generalists are also often in danger of RIFs and layoffs, because for any task X a company values at time T, there's always an employee Q who can do X better than generalist G.
But on the flip side, they seem to be able to bounce back better than specialists.
- felixgallo 1854 days ago
  It's the other way around; if you have specialist X and Y, and generalist Z, companies tend to get rid of X and Y and consolidate functions with Z.
  - michaelcampbell 1841 days ago
    Not in my experience; but it may depend on how "special" the specialist is, and what the specialization is. Sometimes Z just can't do it.
raverbashing 1854 days ago
I wonder how many "data scientists" around don't know what a normal distribution is or what's the chance of throwing a 6 on a die right after throwing a 6 the first time.
- isolli 1854 days ago
  I would always expect a higher chance of throwing a 6 the second time, because I updated my Bayesian prior (the die is fair) and I now believe there is a slight chance that the die is biased in favor of landing 6s.
  - Tenoke 1854 days ago
    Is your prior for having an unfair die high enough for your update after one roll to be bigger than basically a rounding error though?
  - Scarblac 1854 days ago
    That's amusing. Your prior is that the die is fair. And even if it is, regardless of what the result of the first throw is, the chance that it is fair is necessarily lowered after the first throw.
    That's an odd effect of Bayes' rule that I never considered.
    - Tenoke 1854 days ago
      >regardless of what the result of the first throw is, the chance that it is fair is necessarily lowered after the first throw.
      This isn't true. You probably have a higher prior for it being unfair towards 6 or 1 (and how much depends on the scenario).
      However, if for example, you have priors for it being potentially unfair towards every number equally, 1 throw only changes the posterior probability of whether it is biased towards that number but not whether it is unfair. In case of unequal priors for which numbers it might be biased (e.g. 0.01 chance of it being biased towards 6/1 and 0.001 for 2, 3, 4, 5 and 0.976 for unbiased) getting a number changes all those probabilities yes but for example if you get a 3 the chance of bias there gets higher but mostly at the expense of the probabilities for bias for the other numbers - which get lower (I am too lazy to calculate the posteriors).
      All in all, you should know beforehand what your posteriors are going to be for any case based on any outcome and if you know that any one probability (like that of it being fair) is definitely going to lower no matter the outcome then that should be your current prior for that case.
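The posteriors left uncalculated above can be worked out in a few lines. This is only a minimal sketch: the priors are the ones quoted in the comment (0.976 fair, 0.01 each for a bias toward 1 or 6, 0.001 each for 2 to 5), but the thread never says how strong a bias would be, so `BIAS_STRENGTH` (a biased die shows its favoured face half the time, the other five faces equally often) is purely an assumption, and the function names are illustrative.

```python
from fractions import Fraction

FACES = range(1, 7)

# ASSUMPTION (not from the thread): a die biased toward face k shows k
# with probability 1/2 and each of the other five faces with probability 1/10.
BIAS_STRENGTH = Fraction(1, 2)

# Priors quoted in the comment: 0.976 fair, 0.01 each for a bias toward
# 1 or 6, and 0.001 each for a bias toward 2, 3, 4 or 5.
PRIOR = {"fair": Fraction(976, 1000)}
for face in FACES:
    PRIOR[face] = Fraction(10, 1000) if face in (1, 6) else Fraction(1, 1000)


def likelihood(hypothesis, roll):
    """P(roll | hypothesis); hypothesis is "fair" or the favoured face."""
    if hypothesis == "fair":
        return Fraction(1, 6)
    if roll == hypothesis:
        return BIAS_STRENGTH
    return (1 - BIAS_STRENGTH) / 5


def predictive(belief, roll):
    """P(roll) averaged over the current belief about the die."""
    return sum(p * likelihood(h, roll) for h, p in belief.items())


def update(belief, roll):
    """Bayes' rule: posterior over hypotheses after observing one roll."""
    evidence = predictive(belief, roll)
    return {h: p * likelihood(h, roll) / evidence for h, p in belief.items()}


after_six = update(PRIOR, 6)
after_three = update(PRIOR, 3)

print("P(fair) prior:        ", float(PRIOR["fair"]))
print("P(fair) after a 6:    ", float(after_six["fair"]))
print("P(fair) after a 3:    ", float(after_three["fair"]))
print("P(6 again | first 6): ", float(predictive(after_six, 6)))

# Conservation of expected evidence: averaged over every possible first
# roll, the posterior probability of "fair" equals the prior.
expected_fair = sum(
    predictive(PRIOR, roll) * update(PRIOR, roll)["fair"] for roll in FACES
)
print("E[P(fair) after one roll]:", float(expected_fair))
```

Under these assumptions a first roll of 6 lowers the probability that the die is fair while a first roll of 3 raises it, so it is not the case that any first roll must lower it. Averaged over all six outcomes the posterior probability of fairness equals the prior, which is the point of the last paragraph above, and the chance of a second 6 after a first 6 comes out only slightly above 1/6.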
- navigatesol 1854 days ago
  >I wonder how many "data scientists" around don't know what a normal distribution is
  What is with the poisonous gate-keeping in the data science field/community? Is it a bunch of PhDs stuck doing marketing, bitter that people in the business domain are competing with them?
  What do you care if a "data scientist" doesn't know what a normal distribution is? Are you worried about them as competition? If that person is getting hired in a role over you, maybe you should look in the mirror or be happy you dodged a bullet at said company. Easy.
  I realize you're being facetious, but I always think back to that post here where someone asked how to become a data scientist and one of the replies said you needed to be able to code all the algorithms from scratch in C, C++, CUDA etc etc. Hilarious.
  The purpose of data science is solving problems with the help of data. Everything else is semantics. In the real world most of the value comes from pretty simple solutions. Most people aren't working for Google X. Most experiments aren't critical. Uncertainty is acceptable. The inane trivia --which I liken to stupid whiteboard coding challenges-- becomes less important as the tooling improves. Better get used to it, or the ambitious analysts and BI people will eat your lunch.
  - zimablue 1854 days ago
    I'm not sure the normal distribution or avoiding reversion to mean fallacy are trivia though.
  - mumpsman 1854 days ago
    Know how I know you've never worked on a data science project with someone who doesn't know what they are doing in the slightest?
    Reword your entire sentiment towards programming and you get something like this:
    "What do you care if a 'programmer' doesn't know what a variable is?"
    The reality is that, like programming, there are some simple things (throw features into a random forest and get an adequate model) but if you actually want to drive value in any data science role you must be knowledgeable about the details of what you are doing and you must know exactly how these algorithms work to avoid ruinous pitfalls.
- selimthegrim 1854 days ago
  Here's a fun twist on that: https://gilkalai.wordpress.com/2017/09/07/tyi-30-expected-nu...
rq1 1854 days ago
Isn't it a generalist's ability to specialize Just In Time?
thekhatribharat 1854 days ago
Earlier HN discussion: https://news.ycombinator.com/item?id=19361208
dalbasal 1854 days ago
I think Smith's early take on "division of labour" unfortunately conflated two different things, either of which could be called division of labour.
One is what he described in the pin factory, some basic precursor to a factory/assembly line. One person draws wire, another cuts it, another sharpens..
This is not about specialized skills, or even labour. It's industrial engineering. Break a process down to components and optimize individually. In simple cases like this, just breaking down the process is 90% of the way. Give 3 people the job of sharpening an endless pile of pins and the improved tools/methods will follow.
This bleeds into industrial labour (Smith's example for this is pin packaging) in a few ways. Once you have small, efficient, tooled component processes... you don't need skilled labour. Pinmaking might be a specialized craft, but anyone can be taught to sharpen.
This is the opposite of the other kind of division of labour.
When historians (especially British ones from the same period) theorized about early civilisation, specialized division of labour featured often. Early cities had enough people that not everyone had to farm. You could have specialist priests, soldiers, artisans, stoneworkers, smiths, boatbuilders..
This is where specialized skills and depth of knowledge comes in.
So... data science & pins... The first kind of division of labour is the "organization as a machine" kind. People do a consistent, repetitive process. This works very well, but only if you need a consistent, repetitive result.
I think what this article is mostly arguing is that data science (like programming, engineering and a lot of "information economy" jobs) is but shouldn't be organized this way. You don't need a consistent, repetitive result. If you do, that's what computer programs are for.
I agree, I think. Like with a lot of software domains, there's a stark difference between small projects where requirements, design, architecture & implementation can be done in one head and big projects that can get bogged down in bureaucracy, misunderstandings and the inability to move back-and-forward between elements. I think data science has the extra problem of data security and other things that require controls and rigidity.
I expect a lot of these problems will lessen with time. The field is still in growth phase, and both tools and skill levels will improve.
Circa 2005, a problem I came across all the time was impractical designs for web apps. You'd have a designer who had been designing posters and liveries. They'd make a picture. Then you had a html guy, who would try very hard to make it happen in html. Then it'd go to a JS or server-side specialist, who discovers that an arbitrary amount of text needs to fit neatly into a box that fits exactly 416 characters of lorem ipsum.
- throw22032019 1854 days ago
  To "break a process down to components" is what the division of labour is, and the division of labour is nothing else but that. Smith didn't conflate shit. Specialization is a posteriori an effect of the division of labour. There's no opposite division of labour, labour is divided in the same sense in the two cases: one worker straightens the pin, another sharpens it; a machine straightens it, another sharpens it.
  "Labour" does not just mean what a labourer does, it can mean what is to be done, "a labour", such as the labour of making a pin. If you had attempted instead to infer what Smith meant through the example, you'd not have written any of that.
- DanielBMarkham 1854 days ago
  Having written a book on this, I kinda agree. But it's not the people, it's the job. It's whatever you're currently doing. My belief is that a key skill in the upcoming century is to be able to sort whatever you're doing into those two categories and automate those things that fall into your first category, organization as machine.
  In fact, if you look at most of the waste in tech development, it's people continuing to do this first category of work when they should be working in the second category. (Insert long rant here about the importance of everybody being able to automate things in order for this to happen.)
potatofarmer45 1854 days ago
"Mastery in that they know the business capability from end-to-end". That's the point. These generalists are not really data scientists. Often it's some account manager/salesperson/ops worker rebranding themselves to get ahead. A real data scientist who understands the business side is rare and really good, but most "generalists" are just bad. So bad in fact they exacerbate every single problem the author thinks they solve: Having bs people called data scientists devalues and annoys the real data scientists; and they often create confusing buzzword-driven plans and white elephants for show rather than insight that will have a real impact.
I worked in a media agency once where an account manager who could barely do a simple calculation in Excel rebranded himself as a "lover of data". All he did was play office politics. We nicknamed him unicorn. It was a nice way to describe a creature with a thin rod sticking out of its head.
He was such a leech in that he was full of incorrectly used buzzwords, overpromising, and overselling (he had a habit of presenting non-significant analysis as facts). The very idea of statistical confidence was a mystery to him.
Because he played the politics game well, he did reasonably well and ran a few projects. Every single one of those turned out to be a giant white elephant to the clients.
- itronitron 1854 days ago
  What you're describing is not a generalist. Generalists can actually do things and it sounds like your colleague was just pretending. I have worked with one or two people similar to what you describe, they can pose a real risk to their employer if they are not managed appropriately.
- yannis7 1854 days ago
  we've all dealt with such people and I hear you on all the concerns raised.
  But as pointed out, what you describe has nothing to do with the "generalist vs. specialist" debate