How Librarians are Important to the Data Science Movement
The library has always been a repository for knowledge and research tools. In recent years, with the advent of big data and data science, research has become more powerful and data-driven. Yet despite this growing trove of data, research indicates that too few people can harness it. The consulting group McKinsey estimates that the U.S. currently faces a shortage of nearly 200,000 people with the technical know-how to use big data to make effective organizational decisions. This shortage persists even as data collection increases worldwide, supported by advances in the Internet of Things that enable better real-time data collection.
Librarians have long been shepherds of vast amounts of knowledge, which is why libraries stand to benefit from adding data science to their offerings. Big data and data science applications can make libraries an even more powerful source of knowledge, helping to bridge the skills gap and increase data analytics literacy in society. Libraries are beginning to offer resources for patrons to learn more about big data and its benefits, and it makes practical sense for librarians to adopt and support big data practices and resources. Data science and big data lend themselves readily to research applications in a variety of fields, and can be used with machine learning techniques to cluster data, make recommendations, predict outcomes, and so on. There are several ways libraries can help individuals and organizations adopt data science and big data practices: by providing access to information, including training and instructional materials; by organizing educational workshops and courses; and by offering services that support research data management. Read on for a discussion of the role of data science in the library and of how librarians can both support data science and big data and benefit from them.
In the data science world, data is the raw material of new knowledge
Libraries are known for their role as keepers of information, and the word “data” in data science can seem like a step down from information. The raw data involved in big data has no inherent use in its unanalyzed form, whereas the information in a library book offers readily accessible insights. Library science is often referred to interchangeably as library and information science, which seems to leave data out of the equation. However, raw data is the raw ingredient of knowledge and actionable information, and this is why librarians can play such a large role in harnessing the power of this emerging field. The world is currently facing a shortage of people who can take raw data and turn it into knowledge. Librarians’ expertise in data management and organization can help address that shortage by serving as the foundation for training the next cadre of data scientists.
Librarians can offer big data services as an additional tool in the research toolbox to mitigate the data scientist shortage
Librarians can play a large part in one stage of the research data lifecycle in particular: the discovery, understanding, and cleaning of data for use in big data analysis. Before analysis can begin, data must be selected, cleaned, and formatted so that it can be analyzed. Librarians can offer resources on preparing big data for analysis just as they offer resources on many other research topics. They already provide similar help by finding novel data sources, supplying background information on subjects, and advising on metadata. These resources can help make data science skills more widely accessible. As Hal Varian explains, “the ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that’s going to be a hugely important skill in the next two decades.” Such skills will need to be developed not only in data scientists but also in the general workforce in order to address the shortage of people who possess them. Library patrons will benefit from such knowledge, and librarians may benefit in their own work from providing these resources. Knowledge of big data is becoming increasingly essential in today’s world.
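To make the cleaning step concrete, here is a minimal sketch in Python of the kind of preparation described above. The sample records, the column names, and the `clean_rows` helper are all hypothetical and chosen only for illustration; real datasets usually need more checks than this.

```python
import csv
import io

# Hypothetical raw export: stray spaces, a blank value, a duplicate row,
# and a placeholder ("n/a") where a year should be.
RAW_CSV = """title,author,year
 Data Science Basics ,Ann Lee,2015
Data Science Basics,Ann Lee,2015
Text Mining 101,  ,2018
R for Everyone,Sam Roy,n/a
"""

# Strings treated as "no value" (an assumption; adjust per dataset).
MISSING_MARKERS = {"", "n/a", "na", "null"}


def clean_rows(raw_text):
    """Return de-duplicated rows with trimmed text and None for missing values."""
    cleaned = []
    seen = set()
    for row in csv.DictReader(io.StringIO(raw_text)):
        record = {}
        for field, value in row.items():
            value = (value or "").strip()
            record[field] = None if value.lower() in MISSING_MARKERS else value
        # Convert the year to an integer when one is present.
        if record["year"] is not None:
            record["year"] = int(record["year"])
        # Skip rows that are identical to one already kept.
        key = tuple(record.items())
        if key not in seen:
            seen.add(key)
            cleaned.append(record)
    return cleaned


CLEANED = clean_rows(RAW_CSV)
for record in CLEANED:
    print(record)
```

The sketch trims whitespace before comparing rows, so the two differently spaced copies of the first record collapse into one, and missing values become `None` rather than empty strings so that later analysis can treat them consistently.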
Librarians can play a role in improving access to big data information
As shepherds of vast amounts of knowledge for centuries, librarians play an important role in ensuring that people have access to information about big data and data analytics approaches. In a way, librarians have been dealing with big data all along. Focusing collection development on the methods underlying big data analytics would be very helpful for researchers just delving into the topic. The National Institutes of Health’s National Library of Medicine recommends several topic areas in which libraries could offer books and resources: “Topics for books and ebooks might include machine learning, research data management, data visualization, text mining, algorithms, R programming language, Python, data wrangling, etc. Curating those resources on a LibGuide or website, along with links to websites that help people learn about data science and obtain support (eg stackoverflow.com) might be especially useful.”
Libraries can organize workshops and courses regarding how to use big data approaches
Indeed, libraries have taken note of the increasing potential of data science. The Harvard Library and the Harvard-Smithsonian Center for Astrophysics John G. Wolbach Library have teamed up to offer the “Data Scientist Training for Librarians” course, which trains librarians to respond to the growing data needs of their communities. In this course, librarians learn tools for “extracting, wrangling, storing, analyzing, and visualizing data” and experience the data lifecycle firsthand. This type of hands-on experience leaves librarians better equipped to improve the services they offer. The Coalition for Networked Information also has a related course, “Data Science and Visualization Institute for Librarians,” which covers data exploration and analysis, data visualization, data cleaning and preparation, web scraping, bibliometric network analysis, and other topics. The National Library of Medicine recently offered a course called Big Data in Healthcare: Exploring Emerging Roles, which introduced healthcare librarians to big data. The class discussed examples of big data, its applications in business and commerce, and its applications in medicine. It covered the use of big data in electronic health records as well as the unstructured format of hospital data, which creates problems for data scientists. Building a data schema is important in such contexts, and NLM librarians helped participants understand best practices for building structured datasets for machine learning applications. While big data may still seem nebulous, librarians’ expertise in data management can make them excellent teachers of data science methods through workshops and courses.
Libraries can offer services supporting research data management
Librarians have specialized knowledge of how to find, store, and preserve information. This skillset is of particular use to data scientists, who can consult with librarians when drawing up data management plans, for example. Their research- and data-minded skillset makes librarians well placed to advise library patrons on best practices for the collection, management, and organization of data. Librarians do not have to be big data experts to support data scientists and data science enthusiasts. They can help patrons conceptualize how data is collected, organized, and stored, and consider how it is documented, protected, and shared. Librarians’ database design and development skills can also prove useful for organization and data mining processes in big data. As the National Library of Medicine notes, “Librarian support is key to helping researchers thrive, regardless of whether their data is big or small, and regardless of the methodologies they use.”
The “New Librarian”: A librarian equipped with data science knowledge
As mentioned above, librarians excel at managing and organizing information. This skill is also critical to data science, especially for curating data to be used in big data approaches. Librarians are also adept at communicating strategies and resources that help drive investigation and learning. This is why the “new librarian,” a librarian who can offer helpful resources for data scientists, could be so transformational as a link between data science and library science. Data scientists must organize and wrangle large amounts of raw, messy data into insights that can drive decision-making for organizations. Library scientists (i.e., librarians) must offer resources that help drive the creation of new knowledge. The link between the two fields lies in the librarian’s ability to work with knowledge and to help library patrons obtain the resources they need to turn new knowledge into actionable insights. Jeffrey Stanton of the Syracuse University iSchool describes his concept of the “new librarian” trained in data science approaches: “A librarian does not need to become a programmer, but every librarian interested in knowledge creation should have some essential familiarity with how various software tools can transform data. A librarian need not be a database engineer, but every librarian must understand the underpinnings of information retrieval tools. A librarian does not need to be a statistician, but every librarian should have a clear understanding of how descriptive summaries and basic tests of numeric data can be used and misused. Finally, a librarian does not need to be a graphic designer, but every librarian needs to recognize the features of effective data displays.
In short, to fulfill their missions, librarians can exercise a range of sophisticated skills that squarely occupy the central ground between understanding information user needs on one end and data curation on the other.” Librarians have long been known as advocates of free access to information who serve the community, increase knowledge, and preserve historical information for future generations. The new librarian’s role adds to these conventional ones: the skillset expands beyond simply storing and shelving books. As technology advances and big data becomes more commonplace, citizens will increasingly look to big data to answer important questions that cannot be answered by more conventional approaches such as surveys and literature reviews. That is why the new librarian is so essential in helping to usher in a new generation of data-driven insights that data scientists and others interested in big data can use in a variety of important settings.
How can libraries keep up with the demand for data science?
The Data Science in Libraries Project at the University of Pittsburgh recommends several actions to bring library and information science up to date with data science approaches. These include collaborating with leadership institutes; using physical learning spaces in the library, such as computer laboratories, where people can learn data science together; leveraging existing educational resources; and even repositioning the Master’s in Library and Information Science (MLIS) to offer a more data-science-friendly curriculum.
Libraries can benefit from their data science knowledge
Data science can be used not only by library patrons but also in the library’s internal processes. For example, librarians can apply big data analytics to checkout and usage records to determine which books should be added to the library’s collections. In this way, big data could make in-house decisions, such as book purchases and other collections decisions, more streamlined and efficient.
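As a rough illustration of how checkout records might inform collection decisions, the Python sketch below ranks subjects by the number of loans per title held. The checkout log, the holdings counts, and the `demand_per_title` helper are hypothetical; a real library system would draw these figures from its own circulation data and would likely weigh other factors as well.

```python
from collections import Counter

# Hypothetical checkout log: one (title, subject) pair per loan.
CHECKOUTS = [
    ("Python Crash Course", "programming"),
    ("Python Crash Course", "programming"),
    ("Learning R", "programming"),
    ("Storytelling with Data", "visualization"),
    ("Storytelling with Data", "visualization"),
    ("Storytelling with Data", "visualization"),
    ("A History of Maps", "geography"),
]

# Hypothetical number of titles the library holds in each subject.
HOLDINGS = {"programming": 2, "visualization": 1, "geography": 4}


def demand_per_title(checkouts, holdings):
    """Rank subjects by loans per held title, highest demand first."""
    loans = Counter(subject for _, subject in checkouts)
    ratios = {
        subject: loans[subject] / held
        for subject, held in holdings.items()
        if held > 0  # skip subjects with no holdings to avoid dividing by zero
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


RANKED = demand_per_title(CHECKOUTS, HOLDINGS)
for subject, ratio in RANKED:
    print(f"{subject}: {ratio:.2f} loans per title held")
```

A subject with many loans spread over few titles rises to the top of the list, which is one simple signal that additional purchases in that area may be worthwhile.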
Conclusion
While the transition to the new data-science-friendly library seems complex, it is a natural outgrowth of recent advances in computing. Librarians, who are trained in knowledge organization and management and are adept at explaining how to organize and take advantage of information sources, are perhaps the best candidates to help address the shortage of data scientists. As data science becomes more commonplace across industries, the world will need citizens who can not only conduct data science inquiries but also collect, organize, and process the raw data. Librarians themselves do not need to become proficient in the intricacies of data science techniques, but they can serve as stewards of this information and promote learning in this space. Data science can also improve day-to-day library operations, such as selection, purchasing, preservation, and disposal decisions for library resources. As Chris Erdmann, Head Librarian of the Harvard-Smithsonian Center for Astrophysics and creator of the DST4L class, notes, this approach is more sustainable for libraries than current approaches: “more technical knowledge and expertise needs to be transferred to librarians, and the outcome, is a better trained, more tech savvy librarian that can take an active role in the technical aspects of library projects.”