What is the Difference Between Data Mining and Machine Learning?
Data mining is the probing of available datasets to identify patterns and anomalies. Machine learning is the process by which machines (a.k.a. computers) learn from heterogeneous data in a way that mimics human learning. Together, the two enable both the characterization of past data and the prediction of future data.
There are Many Lenses of Data Mining
The purpose of data mining is to identify patterns in data, and patterns can be identified in many different ways depending on what information is needed.
1) Data mining is used to classify data.
Classifying data is something we do on a daily basis, like when we sort laundry into shirts, pants, socks, etc. In terms of big data, sorting becomes far more complicated. For example, a credit check accesses a person’s financial history. After integrating data on existing debt, income, and late-payment history, loan applicants are classified as either “eligible” or “ineligible”.
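The eligibility decision above can be sketched as a toy rule-based classifier in Python. The thresholds below are invented purely for illustration, not real lending criteria:

```python
def classify_applicant(income, existing_debt, late_payments):
    """Toy loan classifier; thresholds are illustrative, not real lending rules."""
    # Debt-to-income ratio; an income of zero makes the applicant ineligible.
    debt_ratio = existing_debt / income if income > 0 else float("inf")
    if debt_ratio < 0.4 and late_payments <= 2:
        return "eligible"
    return "ineligible"

print(classify_applicant(income=60000, existing_debt=12000, late_payments=0))
# -> eligible
```

A real credit model would learn these thresholds from historical data rather than hard-coding them, which is exactly where machine learning (discussed below) comes in.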
2) Data mining is used to identify associations in data.
For example, consider a grocery store that sets up an online shopping system with a virtual shopping cart. Once data has been collected from thousands of customers, it would likely reveal that people who buy hot dogs often buy buns and ketchup as well, or that people who add pasta noodles to their carts often buy pasta sauce. Sometimes associations are completely beyond what anyone would anticipate, such as the well-known Pop-Tarts story.
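Co-occurrence counting, the simplest form of association mining, can be sketched in a few lines of Python. The carts below are made up:

```python
from itertools import combinations
from collections import Counter

# Each cart is the set of items one customer bought (invented data).
carts = [
    {"hot dogs", "buns", "ketchup"},
    {"hot dogs", "buns"},
    {"pasta", "pasta sauce"},
    {"hot dogs", "ketchup"},
]

# Count how often every pair of items appears together in a cart.
pair_counts = Counter()
for cart in carts:
    for pair in combinations(sorted(cart), 2):
        pair_counts[pair] += 1

# The most common pairs suggest associations worth investigating.
print(pair_counts.most_common(2))
```

Real association-rule mining (e.g., the Apriori algorithm) adds support and confidence thresholds on top of counts like these, but the underlying idea is the same.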
As another example, consider an application that collects cell phone GPS location data from its users.
Using data mining, analysts can deduce that a few people, call them Rachel, Ross, Joey, Chandler, and Monica, gather every day at about the same time at a coffee shop called Central Perk (those of you who watched “Friends” know what this is about). From this, they can infer that Rachel, Ross, Joey, Chandler, and Monica are friends.
3) Data mining is used to identify outliers and anomalies.
Identifying unusual data can be very useful. An example is a fraud detection system run by a credit card company. If high-ticket purchases suddenly appear on an individual’s account, made outside his or her home state, security programs will flag the incident as something unusual that warrants further action, such as a freeze on the account and a phone call to the customer. Another example, returning to the Central Perk scenario above, would be observing that Chandler and Monica stopped coming to Central Perk altogether after being regulars for many years. A broken trend suggests that something has changed, which is actually true: Chandler and Monica got married and moved to the suburbs.
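One simple way to flag such outliers is a z-score test: a purchase that falls many standard deviations from the account’s typical spending gets flagged. A minimal Python sketch, with invented purchase amounts and a threshold chosen for illustration (real fraud systems are far more sophisticated):

```python
import statistics

def flag_anomalies(amounts, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(amounts)
    stdev = statistics.pstdev(amounts)
    # If stdev is zero, every value is identical and nothing is anomalous.
    return [a for a in amounts if stdev and abs(a - mean) / stdev > threshold]

# Invented purchase history: one suspicious high-ticket charge at the end.
history = [25, 40, 18, 32, 27, 45, 22, 30, 2500]
print(flag_anomalies(history))
# -> [2500]
```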
4) Data mining is used to group data.
Cluster analysis groups items together based on shared properties. For example, if biologists are given the DNA sequences of 1,000 different species, algorithms that compare the sequences might cluster the species into five general groups that, upon investigation, are identified as mammals, reptiles, amphibians, birds, and fish.
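The DNA example needs domain-specific sequence comparison, but the clustering idea itself can be sketched with a minimal k-means on made-up 2-D feature vectors:

```python
def squared_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, iters=10):
    """Minimal k-means sketch; initializes centroids to the first k points."""
    centroids = list(points[:k])
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centroids[c]
            for c, cl in enumerate(clusters)
        ]
    return clusters

# Two visually obvious groups of invented feature vectors.
points = [(1, 1), (1.5, 2), (2, 1), (8, 8), (9, 9), (8.5, 7.5)]
clusters = kmeans(points, k=2)
```

Production code would use a library implementation with smarter initialization, but this captures the assign-then-update loop at the heart of the method.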
5) Data mining is used to perform regression analysis and generate prediction models.
Regression analysis examines the relationships between quantitative variables. Estimating residential real estate values is a classic example. Residential real estate prices are influenced by many factors, including square footage, number of beds/baths, city population, distance to schools, etc. If data from hundreds of recently sold properties is collected and analyzed, data mining can determine how much each factor contributes to the sale price. Using that information, real estate investors can then predict values and trends. Both real estate investors and insurance companies rely heavily on such predictive models.
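A single-predictor version of this can be sketched with ordinary least squares, using invented sale data where price depends only on square footage:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (one predictor)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    a = my - b * mx
    return a, b

# Invented sale data: square footage vs. price.
sqft = [1000, 1500, 2000, 2500]
price = [200000, 300000, 400000, 500000]
a, b = fit_line(sqft, price)
print(b)               # dollars contributed per extra square foot
print(a + b * 1800)    # predicted price for an 1800 sq ft home
```

A real model would include the other factors (beds/baths, location, and so on) as additional predictors in a multiple regression, but the fitting principle is the same.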
No matter the type of data mining, all data mining strategies have the ultimate goal of extracting patterns from data.
Data scientists are not merely interested in characterizing existing data, although that is a huge part of their job. They are equally interested in predicting future data and accurately characterizing unknown data. Machine learning takes the output of data mining and uses it to build tools that can be applied to novel data.
The Machine Learning Toolbox: Advanced Algorithms
The main purpose of machine learning is to generate algorithms that can “learn” from data. An algorithm is a sequential process that solves a problem in a finite number of steps. In a machine learning algorithm, each piece of data run through the pipeline influences the outcome. For example, if one spam message is run through the algorithm, the machine learns what one spam message looks like. If thousands of spam messages are run through the algorithm, the machine has been exposed to thousands of examples, so it can identify commonalities and better define exactly what spam looks like. The goal of machine learning is to develop an algorithm that can operate independently and be applied to novel data: in this example, an algorithm that can accurately classify an email as “spam” or “legitimate”.
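A crude stand-in for such a spam learner can be sketched by counting which words appear with each label; the tiny training set below is invented, and real spam filters are far more sophisticated:

```python
from collections import Counter

def train(messages):
    """Count word frequencies per label; more examples sharpen the counts."""
    counts = {"spam": Counter(), "legitimate": Counter()}
    for text, label in messages:
        counts[label].update(text.lower().split())
    return counts

def classify(counts, text):
    """Label an email by which class its words were seen with more often."""
    words = text.lower().split()
    spam_score = sum(counts["spam"][w] for w in words)
    legit_score = sum(counts["legitimate"][w] for w in words)
    return "spam" if spam_score > legit_score else "legitimate"

# Invented, human-labeled training messages.
training = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting moved to noon", "legitimate"),
    ("lunch at noon tomorrow", "legitimate"),
]
model = train(training)
print(classify(model, "claim your free prize"))
# -> spam
```

Notice how each additional training message updates the counts, which is exactly the sense in which every piece of data run through the pipeline influences the outcome.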
In supervised learning, accurately characterized data is divided into “training” and “test” sets. The training set is typically about 80% of the data, and the test set is the remainder. In our example, we have emails that have been classified as “spam” or “legitimate” by human experts. The machine learning algorithm is developed using the training set, the portion of emails whose labels are already known. Once the entire training set has been run through the pipeline and the algorithm has been optimized, the algorithm is evaluated on the test set to determine its accuracy. Accuracy is the proportion of test-set items the algorithm characterizes correctly. Ideally, an algorithm would classify data correctly 100% of the time, but because there are always outliers, that is not realistic. A classification accuracy above 90% is usually considered acceptable.
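The 80/20 split and the accuracy measurement can be sketched like this; the labeled data is synthetic, and `split_data`/`accuracy` are hypothetical helper names:

```python
import random

def split_data(data, train_frac=0.8, seed=0):
    """Shuffle labeled data, then split it into training and test sets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def accuracy(predictions, labels):
    """Proportion of test items the model labeled correctly."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Synthetic labeled emails: every third one is "spam".
data = [(f"email {i}", "spam" if i % 3 == 0 else "legitimate") for i in range(100)]
train_set, test_set = split_data(data)
print(len(train_set), len(test_set))
# -> 80 20
```

The key discipline is that the test set is never shown to the algorithm during training; otherwise the accuracy figure would overstate how well it handles genuinely novel data.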
In unsupervised learning, the classes are not known. The machine learning algorithm would infer patterns and properties based on input comparisons and cluster data into different groups. For the email example, after running thousands of unclassified emails through the algorithm, the algorithm might group them into three different categories. Human experts would then examine random samples from the three clusters of emails, and upon examination, may label them as “spam”, “personal”, and “retail”. Or perhaps four clusters of emails would be generated by the algorithm. In that case, human experts would analyze examples in each cluster and assign cluster labels such as “spam”, “personal”, “work”, and “retail”. Note that unsupervised learning output requires expert analysis in order to assign meaning.
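As a crude stand-in for a real clustering algorithm, the sketch below greedily groups invented emails by shared vocabulary; as noted above, a human expert would still need to inspect each group and assign it a label such as “spam”, “work”, or “retail”:

```python
def group_by_overlap(emails, min_shared=2):
    """Greedy grouping: an email joins the first group sharing >= min_shared words."""
    groups = []
    for text in emails:
        words = set(text.lower().split())
        for g in groups:
            if len(words & g["words"]) >= min_shared:
                g["emails"].append(text)
                g["words"] |= words  # grow the group's vocabulary
                break
        else:
            # No existing group is similar enough; start a new one.
            groups.append({"words": set(words), "emails": [text]})
    return groups

# Invented, unlabeled emails.
emails = [
    "win a free prize now",
    "claim your free prize now",
    "meeting at noon tomorrow",
    "reschedule meeting to noon",
    "sale today big discounts",
    "big sale discounts today only",
]
groups = group_by_overlap(emails)
# An expert would inspect each of the resulting groups and name them.
```

The point is that the algorithm discovers the groups without being told the categories; only the labeling step requires human judgment.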
Data Scientists are Master Programmers
The job of data scientists is to examine data to make predictions, and they cannot do that without both data mining and machine learning. They must perform data mining to characterize data, and they must integrate machine learning algorithms to make predictions. These two processes require a substantial amount of programming, so data scientists should be fluent in languages such as R, Python, or MATLAB. They must also be able to write and modify these complicated algorithms.