Java for Data Science?
By Kat Campise, Data Scientist, Ph.D.
If you’ve arrived at this guide having already researched the primary skills and knowledge required to enter a data science career, you are probably aware that knowledge of programming languages is a persistent theme. Python and R are the two most widely cited languages in Kaggle competitions, in data science job postings, and in just about every blog post, article, and Quora answer to “What programming languages do I need for data science?” But Java? Isn’t that what web and software developers use? Yes and no: it depends on programmer preference versus employer requirements.
A quick search for “data scientist” on Indeed.com yields tens of thousands of data science job postings (as of July 2018), and Java appears as a preferred qualification in roughly 10% of them. While Python, SQL, and R should be the first programming languages added to your data science toolkit, adding Java to the mix can expand your employability in the data science job market.
A Little Java History
Oak, DNA, Silk, and Java were all possible names for the newly minted, object-oriented programming language back in the early 1990s. James Gosling, a Canadian computer scientist employed by Sun Microsystems (since acquired by Oracle), created Java in 1991 and released it for public use four years later. Over 20 years later, Java is pervasive: Android apps, Hadoop, web server applications, enterprise desktop applications, retail, and banking all rely on it. Thus, it shouldn’t be surprising that it’s consistently ranked among the most preferred (and often most lucrative) programming languages. Returning to Indeed.com and running a cursory search for Java-only jobs returns well over 60,000 listings throughout the U.S. Amazon.com, Microsoft, Oracle, and Google all appear on the list of companies seeking Java developers or software engineers with Java experience. The estimated salary range for these roles is between $90,000 and $135,000. Notably, there are 50% fewer data science job postings than Java-focused ones.
Why Java for Data Science?
First and foremost, choosing to use Java for data science is mainly a matter of preference, either on the part of the individual data scientist or of an employer. The data science job postings that list preferred programming languages are revealing, but they don’t tell the entire story. Employers will provide a litany of “Preferred” or “Desirable” qualifications and nestle Java in between Python, R, SQL, C++, etc. So it wouldn’t be prudent to conclude that the 10% of data science postings that mention Java list it as the only desired language. However, in terms of specific data science functions, Java can be used for many of the same processes:
- Data import and export.
- Cleaning data.
- Statistical analysis.
- Machine learning.
- Deep learning.
- Text analytics (also known as Natural Language Processing or NLP).
- Data visualization.
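As a rough illustration of the first three items (data import, cleaning, and statistical analysis), here is a minimal sketch that uses only the standard JDK. The class name `DataSketch`, its methods, and the sample rows are all hypothetical and exist only for this example; a real project would more likely read from a file or database and lean on a dedicated library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class; all names and sample data are illustrative assumptions.
class DataSketch {

    // Import + clean: parse "label,value" rows, skipping the header row and
    // any row whose value is missing or is not a number.
    static List<Double> parseValues(List<String> rows) {
        List<Double> values = new ArrayList<>();
        if (rows.isEmpty()) {
            return values;
        }
        for (String row : rows.subList(1, rows.size())) { // skip the header
            String[] fields = row.split(",", -1);         // -1 keeps trailing empty fields
            if (fields.length < 2 || fields[1].trim().isEmpty()) {
                continue;                                 // missing value: drop the row
            }
            try {
                values.add(Double.parseDouble(fields[1].trim()));
            } catch (NumberFormatException e) {
                // malformed value such as "n/a": drop the row
            }
        }
        return values;
    }

    // Arithmetic mean; NaN when there is no data.
    static double mean(List<Double> values) {
        return values.stream()
                     .mapToDouble(Double::doubleValue)
                     .average()
                     .orElse(Double.NaN);
    }

    // Sample standard deviation (n - 1 denominator); NaN for fewer than two values.
    static double sampleStdDev(List<Double> values) {
        if (values.size() < 2) {
            return Double.NaN;
        }
        double m = mean(values);
        double sumOfSquares = 0.0;
        for (double v : values) {
            sumOfSquares += (v - m) * (v - m);
        }
        return Math.sqrt(sumOfSquares / (values.size() - 1));
    }

    public static void main(String[] args) {
        // In-memory stand-in for a CSV file (made-up sample data).
        List<String> rows = List.of(
            "city,temperature",
            "Austin,30.0",
            "Boston,",
            "Chicago,n/a",
            "Denver,20.0",
            "El Paso,25.0");

        List<Double> clean = parseValues(rows);
        System.out.println("kept " + clean.size() + " of " + (rows.size() - 1) + " rows");
        System.out.println("mean = " + mean(clean));
        System.out.println("std dev = " + sampleStdDev(clean));
    }
}
```

With the sample rows above, two of the five data rows are dropped during cleaning, and the remaining values (30, 20, and 25) have a mean of 25 and a sample standard deviation of 5. The equivalent in Python or R is usually a couple of library calls, which is worth keeping in mind when weighing the languages.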
There is a caveat: Python and R have highly specific libraries that are far more robust for data science. As such, if you’re not yet proficient in either of those two languages (and, of course, SQL!), start by learning Python and R for data science. Then follow up with Java as an ancillary skill. Keep in mind that, as a data scientist, you draw on a confluence of knowledge, which increases the complexity of the job. You’re not only applying advanced statistical methods; you also need to map those methods and techniques to a programming language. Additionally, there are other constraints and expectations, such as the enterprise’s business logic, the rules and regulations surrounding the collection and use of data (the General Data Protection Regulation, GDPR, is a perfect example), and any systemic dependencies such as the enterprise’s data storage and data management software. While this isn’t a complete list of every consideration throughout the data science cycle, it gives an approximate picture of the interconnected complexity that is data science. The final point here is that choosing a “traditional” or widely used data science programming language is your best bet. Once you have a strong command of that language, it’s far easier to transfer that knowledge to Java.
Java Educational Resources
A majority of the learning resources available for Java are focused on web development, software engineering, and Android app development. There are eBooks dedicated to Java for data science, which are included in the list below, but they far outnumber the courses geared explicitly towards learning Java as a data science tool.
- The Software Guild is a Java coding bootcamp that can take you From Apprentice to Master, teaching what you need to know to enter junior developer roles in the workforce. The program first covers the basics of object-oriented programming, including basic Java syntax, the NetBeans IDE, debugging, and concepts such as methods, boolean expressions, and arrays. It then moves on to Consuming and Creating REST Web Services. By studying JSON, AJAX, jQuery, and more, you learn to host a RESTful web service using the Spring MVC web framework and to consume the service from the browser using the AJAX functionality in the jQuery library.
- Coursera: One of the largest and most popular MOOC platforms, Coursera offers Java Programming and Software Engineering Fundamentals (Duke University) and Object-Oriented Programming in Java: Data Structures and Beyond (UC San Diego). Learners can take individual courses in either of those specializations or complete a series of courses to earn a certificate. The individual courses may be audited without cost, but the specializations require a monthly fee ($49 per month as of this writing).
- edX: While there aren’t currently any “Java for Data Science” courses included in the edX offerings, there is a plethora of Java programming modules for beginning, intermediate, and advanced programmers. Most of the courses are available for free, but if you want to earn a certificate, the average cost is $99.
- Codecademy: The basic “Learn Java” course at Codecademy is another way to begin your Java for data science journey. Granted, it’s not geared directly towards using Java for data science, but learners can pick up some of the essential Java fundamentals. The basic course is free. To access advanced courses, a Pro membership ($19.99 per month) is required.
- Amazon.com: For specific “how to” guides that target “Java for Data Science,” learners will need to navigate to the online retail giant. There isn’t a wide variety of choices, but the five main texts that are available (“Java Data Science Cookbook,” “Java: Data Science Made Easy,” “Mastering Java for Data Science,” “Data Science with Java: Practical Methods for Scientists and Engineers,” and “Java for Data Science”) provide ample information for getting started as a Java-oriented data scientist.