December 07, 2020

Data Science, Terms of Service

Misunderstandings often come from presumption (a person assumes a word or concept means the same thing to another person), and its predecessor, fear (because asking for clarity can imply you don’t know or don’t understand, which is usually untrue). Definitions find and formalize truth by using words you and others already trust to describe words you don’t yet share an understanding of. To that end, I’ve put together definitions to help us have clearer conversations about Data Science.



In early 2016 I started receiving an increasing number of inquiries about Data Science: what to study, who to hire, how to define problems. Talking to hiring managers, recruiters, aspiring Data Scientists, and even practicing Data Scientists, it became clear everyone had different Data Sciences in mind. As a result, hiring managers weren’t clear on what they were looking for, job seekers weren’t clear on what to learn, recruiters weren’t clear on what to recruit for, and, worst of all, companies were preparing to throw headcount at problems that weren’t ready for it. The aim of this post is to provide definitions for the most common interpretations of Data Science I’ve come across, starting with “Data Science.”



1. Data Science – combining computer science and applied math to ingest and analyze data in more than 6 dimensions. I use 6 dimensions as a cutoff because when people say “Data Science” they usually mean “techniques drawn from Differential and Integral Calculus, and Linear Algebra that can capture relationships beyond what can be visualized.” Primary tools: Statistical and Machine Learning models, Ad-Hoc Visualization Tools, Data in a variety of formats (structured, semi-structured, unstructured). Output: Models for Decision Science OR Data Products.

1.a. Machine Learning Engineer – an engineer who aims to build tight systems of data flows and mathematical models that produce highly accurate results. Sometimes called a Data Scientist. Usually comes from a Computer Science background. Aim is to create scalable production machine learning systems that require relatively little human hand-holding. Put another way, these people help machines make decisions. Examples include Facebook’s Newsfeed Algorithm, Amazon’s “People Also Looked At,” and LinkedIn’s “People You May Know.”

1.b. Data Scientist – a non-computer science applied mathematician* who aims to understand the relationships within data. The focus *can be* on creating production machine learning systems, but these people help humans make decisions. Aim is to understand the factors leading to outcomes (inference) as well as to predict them. Examples of work product include Churn Scoring for Customer Success, Lead Scoring for Sales and Marketing, and Feature Development Recommendations for Product teams.

2. Analytics – representing data, visually or otherwise, to make it human-interpretable. 6 dimensions or fewer: x, y, z, color (points of different colors), size (points of different sizes), time (moving plot). Primary tools: Dashboarding and BI Tools for rapidly prototyping solutions to problems, Structured Data. Work products include Ad-hoc Analyses for fast decision making, and Dashboards for regular decision making and company/division/product health monitoring.
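
As a rough illustration of the 6-dimension cutoff, here is a minimal Python sketch that pairs a dataset's columns with the six visual channels listed above and checks whether any columns are left over. The function names and the column names are hypothetical, invented for this example, and the rule is only the rule of thumb from these definitions, not a formal test.

```python
# The six visual channels named in the Analytics definition.
VISUAL_CHANNELS = ["x", "y", "z", "color", "size", "time"]


def assign_channels(columns):
    """Pair columns with visual channels in order; return (mapping, leftover columns)."""
    mapping = dict(zip(VISUAL_CHANNELS, columns))
    leftover = list(columns[len(VISUAL_CHANNELS):])
    return mapping, leftover


def classify(columns):
    """Apply the cutoff: data that fits the six channels is Analytics, otherwise Data Science."""
    _, leftover = assign_channels(columns)
    return "Analytics" if not leftover else "Data Science"


# Hypothetical column names, for illustration only.
churn_columns = ["tenure", "seats", "logins", "plan", "revenue", "month"]
print(classify(churn_columns))                        # six columns fit the six channels
print(classify(churn_columns + ["support_tickets"]))  # a seventh column has no channel left
```

Once a column has no channel left to sit on, the relationships can no longer be shown in a single plot, which is where the modeling techniques in definition 1 come in.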

3. Data Engineering – moving and summarizing data between and within data systems to make it usable in Analytics, Data Science, Machine Learning, and other applications. Work products include Data Lakes, Data Warehouses, and Data Monitoring and Logging Systems.

4. Full Stack _____ – mostly 1 or 2 with a sizable portion of 3, or vice versa.



In general**, data moves through the Data Stack in the following manner:

In this continuum, the aim of each is to:

The smaller the company and newer the team, the more of the stack one can expect to occupy. And whichever part of the stack one occupies, jumping into the others isn’t just accretive, it’s inevitable.

To job seekers interested in data I recommend using these definitions to 1) identify the layer of the stack you’re best positioned to take on now and 2) build towards the layer you most want to be in next.



* – Physics, Econometrics, Operations Research, Biostatistics, etc.
** – Excuse my broad strokes.

Charles Pensig
