Data Scientist vs. Software Engineer: How Do They Differ?
By Kat Campise, Data Scientist, Ph.D.
In the current world of tech staffing and recruitment, there is a noticeable misunderstanding about what separates a data scientist from a software engineer. From a data scientist’s perspective, this can be mystifying because we are not, in general, software engineers: we use whatever programming knowledge we possess strictly for data extraction, cleaning, statistical analysis, and building statistical models.
Software engineers, arguably, have a broader scope along with a honed expertise in creating functional and (hopefully) scalable software systems for use by both internal and external users. While data scientists have a certain level of skill with Python, R, and perhaps other programming languages, we’re not spending our time developing software. This is not to say that the more computer-science-oriented among us won’t or can’t merge software engineering with our data science skills; it’s simply not a part of our daily job function (as a general rule).
Software as a Product vs. Data Products
Every time we use our smartphones or interact with one another via a digital platform, we are using a software product. Even the Software as a Service (SaaS) model is, ultimately, selling a product: licensing the use of software created by a software engineer or team of engineers. The software is the end product. Software engineers are responsible for planning, building, testing, deploying, and maintaining the software system.
Data can be a product as well; it all depends on what value can be gleaned from the scientific analysis via the precise use of statistical models. As such, data scientists use already existing software to extract value from the data flow. We’re neither designing the data architecture for storage (data engineers are the Big Data equivalent of software engineers) nor constructing newfangled data science software, unless we’re doing it on the side as a hobby or personal passion.
Both Use Scientific Principles, But for a Different Purpose
Engineering is a scientific discipline that has a specific iterative cycle and a set of measurement methods to ensure a robust system that meets the needs of the end user. In a sense, software engineers are the human-to-machine and machine-to-human interpreters who navigate the two worlds and generate a product that can be easily used by just about any human being on the planet. Google, Amazon, Microsoft, and Apple are examples of tech companies that create software that is not just for a specific target demographic. (As a side note, Salesforce, other CRM software, and most enterprise software systems are specific-use products and do not reach as many users as, say, Google search.) However, the objective is still the same: humans require a level of software accessibility with as little cognitive demand as possible.
For example, anyone who buys an iPhone or other Apple product needs to be able to interact with the device and its firmware/software in a streamlined and intuitive fashion. Therefore, software engineers use engineering best practices to ensure that the software has continuity of use (its likely failure rate is below a certain threshold) and that users aren’t utterly confused when they try to use the software program.
Data scientists are, by definition, scientists, but not because the word is in the job title. We directly engage in the scientific method through the data science life cycle:
- Identify the business problem or the question to be answered (hypothesis generation).
- Maneuver through exploratory data analysis (EDA), which includes extracting a target dataset, cleaning and processing it, and running an initial analysis to determine if the data and problem/question are aligned. If not, then we may reframe the question or problem and repeat the EDA (initial hypothesis testing).
- Perform a deeper analysis by expanding beyond descriptive statistics: linear or logistic regression, clustering, decision trees, Principal Component Analysis (PCA), etc. This step may also incorporate building machine learning models, as the statistical algorithms overlap for both functions (further hypothesis testing and analysis).
- Draw one or more conclusions and present the results to the stakeholders.
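To make the life cycle above a little more concrete, here is a minimal Python sketch that uses only the standard library. Everything in it is an assumption made for illustration: the dataset is synthetic (a made-up relationship between ad spend and sales), and the function names `make_synthetic_data`, `clean`, `describe`, and `fit_line` are hypothetical. A real project would more likely lean on libraries such as pandas or scikit-learn; this is only meant to show the shape of the steps.

```python
import random
import statistics


def make_synthetic_data(n=200, seed=0):
    """Stand-in for 'extracting a target dataset' (entirely made-up data)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        spend = rng.uniform(0, 100)
        # Assumed true relationship: sales = 3 * spend + 20, plus noise.
        sales = 3.0 * spend + 20.0 + rng.gauss(0, 10)
        # Simulate messy data: roughly 5% of sales values go missing.
        if rng.random() < 0.05:
            sales = None
        rows.append((spend, sales))
    return rows


def clean(rows):
    """Cleaning step: drop rows with a missing target value."""
    return [(x, y) for x, y in rows if y is not None]


def describe(values):
    """EDA step: basic descriptive statistics."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }


def fit_line(rows):
    """Deeper analysis: ordinary least squares for y = slope * x + intercept."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R^2: share of the variance in y explained by the fitted line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in rows)
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


raw = make_synthetic_data()
data = clean(raw)
print("Sales summary:", describe([y for _, y in data]))
slope, intercept, r_squared = fit_line(data)
# Conclusion step: a plain-language statement for stakeholders.
print(f"Each extra unit of spend is associated with about {slope:.2f} more sales "
      f"(R^2 = {r_squared:.2f}).")
```

Because the data is generated from a known line, the fitted slope should land near the 3.0 used to create it. With real data there is no such answer key, which is why the EDA and reframing steps matter.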
So, we have an engineering component when we venture into machine learning, deep learning, and artificial intelligence. But data scientists are communicating conclusions that may or may not be useful to a highly specific group of stakeholders and/or decision makers. The general public isn’t directly interacting with the data science process as they would when they use Google Docs or Keynote. However, the level of analysis conducted by a data scientist can enact a shift in software design. Conversely, we can engineer a machine learning algorithm for use by consumers, but the software engineers are devising the machine-to-human system that bridges the gap between algorithms and the everyday person who only wants to click a button.
Educational Differences
Software engineers can certainly be self-taught. However, many job descriptions require at least a Bachelor’s degree in computer science or computer engineering along with professional experience in a particular set of programming languages: C++, C#, Python, Java, JavaScript, etc. Software engineers may also be responsible for creating technical documentation (even if there is a technical writer on staff), and will also need to have familiarity with whatever software development methodology is being used by the employer (Agile is presently the most popular method).
Meanwhile, data scientists tend to have a higher educational hurdle to clear. Job postings frequently require a Master’s degree or Ph.D. in statistics, computer science, or another quantitatively intense discipline (many Ph.D.s in physics and other sciences have moved out of academia and into data science). Reddit and Quora threads abound with lamentations about the educational barrier to entry into data science. But in-depth knowledge of higher math, e.g., calculus, linear algebra, and advanced statistics (graduate-level coursework), is crucial to understanding the what, where, when, why, and how of the statistical algorithms, so it’s a better use of energy to redirect the math angst towards conquering those courses. For the time being, the hardcore math requirement isn’t going to dissipate.
Data scientists also need to know how to do their work using Python, R, and SQL (yes, SQL continues to pop up in data science job postings). Plus, if a company prefers to use analytical software such as SAS, SPSS, or any number of other software products, then the data scientist must either already know how to use it or be a fast learner.
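As a small illustration of why SQL and Python tend to show up together, here is a sketch that runs a SQL aggregation from Python using the standard library’s `sqlite3` module. The `orders` table, its columns, and the function name `average_order_by_region` are hypothetical, invented for this example; a real job would query whatever database the employer actually runs.

```python
import sqlite3


def average_order_by_region(rows):
    """Load (region, amount) rows into an in-memory table and aggregate with SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema, made up for this example.
        conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, AVG(amount) FROM orders "
            "GROUP BY region ORDER BY region"
        )
        return cursor.fetchall()
    finally:
        conn.close()


sample = [("east", 10.0), ("east", 30.0), ("west", 5.0)]
print(average_order_by_region(sample))  # [('east', 20.0), ('west', 5.0)]
```

The SQL does the extraction and aggregation, and Python picks up the result for whatever statistical work follows.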
Finally, data scientists must be excellent communicators, as they are the gateway between complex algorithms, gnarly datasets, and an audience that is asking them, “What does this mean? What do I do with this information?” Therefore, they need to be able to adjust their language to their stakeholders’ and decision makers’ level of understanding. This isn’t solely verbal communication: human beings love visuals because we are primarily visual creatures. So, a data scientist must also create accurate and, to some extent, visually pleasing graphical results of their work.
Both data scientists and software engineers have analytical components within their work responsibilities. Both use scientific processes to achieve a particular result. But they have very different roles that produce divergent outputs. In summary, if someone is interested in creating software for widespread use, then software engineering would be a good choice. Alternatively, if they’re interested in parsing through data to find compelling patterns, testing for possible relationships between the input values, and building predictive models, then the world of data science may be more appealing.