Data Science Training – Learn the Essential Skills
By Kat Campise, Data Scientist, Ph.D.
The most common question posed by newcomers to data science training is, “What skills are required to become a data scientist?” All too often, the data scientist role is conflated with that of the data analyst. Indeed, both jobs have an analytical component. However, data scientists go above and beyond merely producing descriptive statistics for a clean dataset that fits neatly into an Excel spreadsheet. The grunt work of data wrangling (or munging) and cleaning constitutes roughly 80% of the data science process. But that is only the beginning of an iterative cycle which demands a higher-order combination of knowledge and applied skill.
Scientific Problem Solving
In short, data scientists, like their counterparts in other scientific disciplines, are problem solvers. While creative solutions are welcome, there is a specific scientific process for framing a question, producing a hypothesis, and then recognizing when the question needs to be refined or wholly revised based on exploratory data analysis (EDA).
Although there is a general step-by-step data science cycle, it is not entirely algorithmic, and there are mini-cycles, or iterations, within the broader cycle. This is why many data science job descriptions prefer candidates with at least a Master’s degree in a STEM field: such an applicant will have had at least some exposure to experimental design, quantitative research methods, and communicating results to others.
Quantitative Methods
Between the programming languages, machine learning/deep learning algorithms, and inferential or predictive statistics, data scientists need a solid math and stats foundation. Ideally, a data scientist should have at least a basic knowledge of multivariate calculus, linear algebra, numerical analysis, and probability theory, along with the discernment to choose among the various statistical models. This combination of abstract and applied mathematical knowledge gives the data scientist a greater awareness of what is going on under the algorithmic “hood” and the ability to adjust the various numerical parameters accordingly.
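As a rough illustration of what sits under that hood, the sketch below fits a straight line by gradient descent using only the Python standard library. The function name `fit_line`, the toy data, and the default `learning_rate` and `steps` values are all assumptions chosen for this example; they do not come from any particular library or course.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current line at every data point.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient; learning_rate controls the step size.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

On this toy data the fitted slope and intercept should land very close to 2 and 1. The learning rate is exactly the kind of numerical parameter the calculus helps with: the gradient says which direction reduces the error, and a step size that is too large for the data can keep the loop from converging at all.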
Programming
While data scientists are not software engineers, nor are they any other type of hardcore programmer, they must have a working knowledge of R, Python, and/or SQL. Most data science job descriptions will list either R or Python as a required qualification, and a majority of data science training programs will offer either programming language as part of the curriculum.
But, depending on the employer, C++ and Java may also be prerequisites for the job. Because data scientists work with different types of data, e.g., structured, semi-structured, and unstructured, these programming languages are used to extract, transform, and load the targeted dataset prior to analysis. Employers may demand familiarity with Apache Hadoop, Hive, Spark, or other data storage and processing systems. Additional analytical software knowledge may include SAS, MATLAB, or SPSS. Learn more about these key languages and tools here:
Hadoop for Data Science
Hive for Data Science
Java for Data Science
Python for Data Science
R for Data Science
SAS for Data Science
SQL for Data Science
Tableau for Data Science
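The extract, transform, and load steps described in this section can be sketched in a few lines of Python with SQL. This is a minimal example under stated assumptions: the `RAW_CSV` sample, the `sales` table, and the `extract`, `transform`, and `load` helper names are hypothetical and exist only for illustration. It uses the standard library `csv` and `sqlite3` modules with an in-memory database.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """customer_id,region,amount
1, West ,19.99
2,east,
3,EAST,5.00
3,EAST,5.00
"""


def extract(text):
    """Extract: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize text, drop rows with no amount, remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip records with a missing amount
        record = (
            int(row["customer_id"]),
            row["region"].strip().lower(),
            float(amount),
        )
        if record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned


def load(records, connection):
    """Load: write the cleaned records into a SQL table."""
    connection.execute(
        "CREATE TABLE sales (customer_id INTEGER, region TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)

# Once loaded, the data can be queried with ordinary SQL.
totals = connection.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

The transform step is where most of the cleaning effort tends to go: here it trims whitespace, normalizes case, skips the record with no amount, and drops the duplicated row before anything reaches the database.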
Machine Learning
It’s true that machine learning is an expertise unto itself, as there are specific jobs for machine learning engineers. However, machine learning (ML) and deep learning (DL) are the foundations of artificial intelligence; some experts will say that they are subsets of AI. Without supervised and unsupervised learning protocols established first, higher-level AI functions aren’t yet able to establish learning parameters on their own, at least not entirely. As more enterprises seek to leverage ML, DL, and AI capabilities, they either fold this skill set into their data science qualifications or hire a machine learning engineer specifically. Either way, there is much overlap between the two job functions, and having at least a modicum of training in machine learning is recommended.
Fortunately, machine learning operates on a cycle similar to that of data science, in which 80% of the work comprises data extraction, cleaning, transformation, and normalization. Data scientists will need to understand the difference between feature selection and feature extraction, how to determine the best-fit model for the data (there are also algorithms that can assist in model selection), parameter tuning, and assessing the model’s precision. The primary distinction between data science and machine learning is the expected outcome of the process:
● Data scientists are tasked with providing knowledge and actionable insight, meaning a decision that can be made or an action to be taken, which is based on the alignment between the business objectives, the problems or questions posed, and the results of advanced statistical methods.
● Machine learning carries out various levels of automated analytical tasks and can be programmed for further actions such as image recognition and natural language processing.
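Assessing a model, as mentioned above, usually comes down to comparing its predictions against known outcomes. The sketch below computes precision in the strict classification sense, alongside recall, for a binary classifier; the article may be using “precision” more loosely, so treat this as one possible reading. The `precision_recall` function and the label lists are hypothetical and chosen only for illustration.

```python
def precision_recall(actual, predicted, positive=1):
    """Return (precision, recall) for one class of a classifier."""
    pairs = list(zip(actual, predicted))
    # True positives: predicted positive and actually positive.
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    # False positives: predicted positive but actually negative.
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    # False negatives: actually positive but predicted negative.
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical ground-truth labels and model predictions.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall = precision_recall(actual, predicted)
print(precision, recall)
```

Precision answers “of the cases the model flagged, how many were right?” while recall answers “of the cases that should have been flagged, how many were found?” Which one matters more depends on the business objective, which is where the data scientist’s judgment comes in.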
Data Visualization
Data visualization isn’t solely relegated to the glitzy graphics presented once a data scientist arrives at a conclusion. Whether they are exploring the data during the initial research phase or assessing the chosen statistical method, understanding the different types of charts, graphs, diagrams, and plots along with when and how they are used is an essential skill.
There is an added layer of complexity if R or Python is being used to produce the graphical displays as each detail of the graph is clarified through precise programmatic specifications. Generating data visualization via Python or R is quite different from merely inserting a pie chart into an Excel spreadsheet. Many employers have a strong preference for Tableau and knowing how to develop Tableau dashboards would be a plus (though not consistently listed under required qualifications). As such, data visualization is an important component of data science training.
Domain Knowledge
Aside from the scientific, quantitative, and programming knowledge that a data scientist should possess, they must also have a certain amount of domain knowledge. This means understanding the ins and outs of a specific industry, e.g., banking, finance, real estate, pharmaceuticals, etc. Domain knowledge includes familiarity with successful business models within the particular industry. Business-oriented domain knowledge stands in contrast to the academic experience that many data scientists have, as there are distinct differences in valuation strategy and stakeholder objectives between academia and the business realm.
Each industry has its own operating procedures, rules and regulations, and reporting requirements which often dictate how data is handled; this is particularly true for industries that collect and store personal data such as credit card companies and any enterprise dealing with patient medical data. For example, a data scientist working for a logistics company is likely to work with real-time sensor data for cold storage cargo. Meanwhile, a data scientist working within a marketing department should understand the various points of sale and advertising networks utilized by the company. Data scientists in the U.S. who work in healthcare must adhere to HIPAA regulations.
Additionally, every industry utilizes software that is specifically designed for its business processes. A data scientist may build predictive models for customer service interactions with consumers or gather and analyze data that is shared between the sales team and customer service. This data is generated from the software used internally. Certainly, internal systems can be learned while on the job. However, in general, familiarity or expertise with the in-house software and other applications is necessary.
Communication Acumen
There are several data stakeholders within any enterprise: C-level executives, departmental managers, customers, vendors, and so forth. Data scientists will need to explain their findings to an array of individuals who may not have the same level of technical or mathematical training. Within all disciplines there are specific terms, usually called “jargon,” that experts use to communicate with one another; this is equally true of data science.
As such, data scientists must possess excellent communication skills that accurately convey the who, what, when, how, where, and why of what they’re doing as it applies to different departments or stakeholders. They must also explain the concepts in a way that either avoids data science jargon or simplifies it into comprehensible, yet still precise, information. When there are teams of data scientists, they will also need to communicate stakeholder issues and objectives to the team.
Communication isn’t merely in verbal form. Many, if not all, data scientists write reports or other technical communications. Being able to convey clear, concise, comprehensive, consistent, and correct information is a vital asset for all data scientists whether they are presenting their findings to C-level executives or working with data engineers to build a pipeline for targeted data.
Indeed, data science salaries are alluring. But there’s a robust reason for the data scientist shortage that goes beyond the current marketing hype. Data science isn’t a siloed profession: data scientists work with different departments and/or stakeholders who have divergent objectives. The larger the enterprise, the more likely such divergence will occur. Add to this that data scientist is still a new job title, and many organizations have no idea what a data scientist does or how that skill set can be leveraged. All too many job descriptions align with a data analyst or data engineer role rather than a data scientist role. This is where communication expertise plays an additionally important role.
Notably, some may view data scientists as simply “glorified statisticians.” While statisticians do enter the field, and expertise in statistics is of great importance, more is required of data scientists than of the traditional statistician. Those new to data science who have an intrinsic drive to attain the required skills will have a more positive experience while improving their mathematical, programming, scientific, and communication prowess.