What’s the Difference Between a Data Scientist and a Data Engineer?
By Kat Campise, Data Scientist, Ph.D.
The term Big Data has been floating through various writings since at least the 1990s but did not fully enter the spotlight until roughly 2005. As social media gained in popularity, and businesses began to understand the potential for leveraging an array of different data aggregation channels, marketing departments grabbed hold of “Big Data” as a buzzword. The same trajectory has occurred with data science; only now human resources departments and recruiters are caught in a confusing matrix as they try to discern the separate job functions of data scientists, data engineers, and data analysts. We have already discussed the dissimilarities between data scientists and data analysts. In this article, we’ll address the distinctions between data scientists and data engineers.
Included in This Article:
- Data Engineer vs. Data Scientist
- Software Engineering vs. Using Statistical Software
- The Data Engineer’s Toolkit
- Data Engineer vs Data Scientist Salary and Job Outlook
- Which is Better – Data Engineer or Data Scientist?
Data Engineer vs. Data Scientist
Data engineers are responsible for building data pipelines so that incoming data is readily available to data scientists and other internal data users. Data pipelines are critical for ingesting data from disparate sources, and the raw data arrives in structured, semi-structured, and unstructured formats, so data engineers are also responsible for cleaning the data. This is not the same type of cleaning that data scientists perform.
Notably, the goal of a data engineer when “cleaning” the data is to transform it into a usable format. Data engineers are also responsible for the architectural maintenance of the databases and for building software that extracts, transforms, and loads the data into either cloud-based or local database systems. This sequence of tasks is commonly referred to as extract, transform, load (ETL).
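To make the ETL idea concrete, here is a minimal sketch in Python using only the standard library. It is an illustration rather than a production pipeline: the sample records, the field names, the `users` table, and the in-memory SQLite database are all hypothetical stand-ins for whatever sources and warehouse an enterprise actually uses.

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw inputs: the same kind of record arriving in two formats.
RAW_CSV = "user_id,signup_date,country\n1,2018-01-05,us\n2,2018-02-11,DE\n"
RAW_JSON_LINES = '{"id": 3, "signup": "2018-03-02", "country": "fr"}\n'


def extract():
    """Read raw records from each source without changing them."""
    csv_rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
    json_rows = [json.loads(line) for line in RAW_JSON_LINES.splitlines() if line]
    return csv_rows, json_rows


def transform(csv_rows, json_rows):
    """Map both formats onto one schema: (user_id, signup_date, country)."""
    unified = []
    for row in csv_rows:
        unified.append((int(row["user_id"]), row["signup_date"], row["country"].upper()))
    for row in json_rows:
        unified.append((int(row["id"]), row["signup"], row["country"].upper()))
    return unified


def load(rows, conn):
    """Write the unified rows into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(user_id INTEGER PRIMARY KEY, signup_date TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl():
    """Run extract, transform, and load in order and return the connection."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
    load(transform(*extract()), conn)
    return conn
```

Calling `run_etl()` leaves a single `users` table in which records from both sources share one format, which is the "usable format" a data engineer is aiming for when cleaning data.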
A data scientist’s job is to move the data into the next phase: determining whether there are actionable patterns, based on the business problem or question for which they are seeking a solution or an answer. A data scientist cleans a dataset with the intent of feeding it into a statistical model for predictive and inferential purposes.
Software Engineering vs. Using Statistical Software
Data engineers often have a software engineering background, as they are tasked with building software solutions for all things data related. Depending on how an enterprise defines its job functions, data engineers can also assume the role of a database administrator, which isn’t too surprising since data warehousing is a fundamental component of data engineering. Indeed, there is a great deal of crossover between the two job functions, such as maintaining the database system, ensuring that data is stored correctly and funneled to the appropriate data user, scripting complex queries, and implementing a robust data recovery plan.
While it may be beneficial for a data scientist to have a computer science degree or experience as a software engineer, the primary knowledge they should have is in-depth expertise in statistics and statistical software. Certainly, data scientists do need to know how to query and retrieve data via the data engineer’s pipeline. However, they are neither constructing nor maintaining those pipelines.
In short, data scientists are responsible for using software/programming languages to help them extract a specific dataset, which they transform into a clean dataset for loading into a statistical model. Generally, they are not engineering comprehensive software programs or deploying extensive programming techniques for all of the data flowing into the enterprise.
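The data scientist's side of that workflow can be sketched in a few lines of Python. This is a toy example under stated assumptions: the `ad_results` table, its columns, and its values are invented for illustration, the in-memory SQLite database stands in for whatever the data engineer's pipeline delivers, and a hand-written least-squares line stands in for a real statistical model.

```python
import sqlite3
from statistics import mean

# Hypothetical table that a data engineer's pipeline has already populated.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ad_results (spend REAL, revenue REAL)")
conn.executemany(
    "INSERT INTO ad_results VALUES (?, ?)",
    [(1.0, 3.0), (2.0, 5.0), (None, 4.0), (3.0, 7.0), (4.0, None), (4.0, 9.0)],
)


def pull_clean_dataset(conn):
    """Query the data and drop rows with missing values."""
    rows = conn.execute("SELECT spend, revenue FROM ad_results").fetchall()
    return [(x, y) for x, y in rows if x is not None and y is not None]


def fit_line(pairs):
    """Ordinary least squares for y = intercept + slope * x."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in pairs) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return slope, intercept


dataset = pull_clean_dataset(conn)
slope, intercept = fit_line(dataset)
```

Note what the code does not do: it does not build the table, manage the database, or handle all incoming data. It pulls one specific dataset, cleans it for one model, and fits that model.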
The Data Engineer’s Toolkit
In terms of data toolkits, this is where the separation between data engineers and data scientists is less clear-cut. Both will likely use programming languages such as Python, Java, or C++, or a query language, e.g., SQL. Furthermore, data scientists and data engineers must know how to utilize distributed storage and computation software including Hadoop, along with related software packages such as Spark, Hive, and Pig, or NoSQL systems such as MongoDB. For cloud-based storage and computation, many enterprises use Amazon Web Services or Google Cloud Platform, and data engineers need to understand how each architecture functions, i.e., how the data is ingested, stored, retrieved, and computed.
The specifics depend on what the enterprise chooses to use as its database management system and related software packages; thus this is not an exhaustive list. The main point of departure is the level of knowledge and the primary purpose of a data scientist vs. a data engineer using each of the aforementioned tools. Data scientists are pulling data whereas data engineers are building, preserving, and improving upon the entire data architecture and flow.
By comparison, data scientists must also know how to develop and deploy statistical models using R or Python. Some enterprises prefer to use SAS, SPSS, MATLAB, TensorFlow, or KNIME as their analytics or machine learning platforms. Moreover, it would be remiss not to mention that Excel is still used, to a certain extent, as an analytics tool for datasets. As such, data scientists will spend most of their time using one or more of these software systems to iterate through the data science cycle.
Data scientists must also know how to create data visualizations and effectively communicate their findings to all of the enterprise stakeholders. Pitch decks, PowerPoint presentations, ggplot, Tableau, and well-written reports are just a few examples of additional tools within a data scientist’s arsenal.
Data Engineer vs Data Scientist Salary and Job Outlook
Both data scientists and data engineers play an essential role within any enterprise.
Data engineers do not garner the same amount of media attention as data scientists, yet their average salary tends to be higher than the data scientist average:
- Data Engineer: $137,000
- Data Scientist: $121,000
It is important to keep in mind that the job descriptions for data engineers frequently state that there may be times when they will need to be on call. Such is not the case with data science positions — at least, it is not advertised or explicitly posted as a possible requirement.
However, average salary reports tend to vary. For example, the above figures were Glassdoor’s average salary computation, but some reports use the median base salary, which changes those figures to $100,000 (data engineers) and $110,000 (data scientists).
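The gap between an average and a median is easy to see with a small calculation. The salaries below are invented purely for illustration and are not taken from Glassdoor or any other report; the point is only that a single high value pulls the average up while leaving the median nearly unchanged.

```python
from statistics import mean, median

# Invented salaries for illustration only; not taken from any report.
salaries = [90_000, 95_000, 100_000, 105_000, 110_000, 250_000]

average = mean(salaries)   # pulled upward by the single 250,000 salary
middle = median(salaries)  # midpoint of the two central values

print(f"average: ${average:,.0f}")
print(f"median:  ${middle:,.0f}")
```

With these numbers the average is $125,000 and the median is $102,500, which is why two reports describing the same job market can quote noticeably different figures.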
With regard to job outlook, Glassdoor released its 2018 “50 Best Jobs in America” report and, based on the number of advertised job openings, data science positions ranked number one with approximately 4,500 job advertisements, whereas data engineer jobs ranked 33rd with roughly 2,800 openings. Suffice it to say that demand for both roles is expected to continue through 2021, with IBM and several other enterprises reporting a 28% increase in demand for both job functions.
Which is Better – Data Engineer or Data Scientist?
When trying to decide between becoming a data scientist vs. a data engineer, the main question to ask is, “Which set of skills aligns with what I would enjoy doing on a daily basis?” There is a caveat: both require a substantial amount of knowledge in different yet interconnected areas.
Experienced software engineers are likely to have an easier transition into the data engineer position, but this does not preclude them from also considering a data science role. That said, if the data science candidate does not have advanced knowledge of statistical modeling, predictive analytics, and how to conduct a thorough research and reporting cycle, then this gap needs to be closed through additional education and/or hands-on experience. Whichever path one chooses, both jobs will continue to be in demand for the foreseeable future.