From Data Lakes to Knowledge Lakes: The Age of Big Data Analytics

Amin Beheshti - Boualem Benatallah - Michael Sheng

Abstract: Continuous improvements in connectivity, storage and data-processing capabilities allow access to a data deluge from open, private, social and IoT data. Data Lakes were introduced as storage repositories that organize this raw data in its native format until it is needed. The rationale behind a Data Lake is to store raw data and let the data analyst decide how to curate it later. This tutorial gives an overview of state-of-the-art methods for automatically curating the raw data in a Data Lake and preparing it for deriving insights. We introduce the novel notion of a Knowledge Lake, i.e., a contextualized Data Lake, and discuss big-data methods for building Knowledge Lakes as assets for big-data applications. The tutorial also points out challenges and research opportunities.


With data science continuing to emerge as a powerful differentiator across industries, almost every organization is now focused on understanding its business and transforming data into actionable insights. For example, governments derive insights from vastly growing private, open and social data to improve government services, personalize advertisements in elections, predict intelligence activities, and strengthen national security and public health. In this context, organizing the vast amounts of data gathered from various private/open data islands, i.e., into a Data Lake, will facilitate dealing with a collection of independently managed datasets (from relational to NoSQL), a diversity of formats and non-standard data models.

The notion of a Data Lake was coined to address this challenge and to convey the concept of a centralized repository containing limitless amounts of raw (or minimally curated) data stored in various data islands. The rationale behind a Data Lake is to store raw data and let the data analyst decide how to cook/curate it later. While Data Lakes do a great job of organizing big data and providing answers to known questions, the main challenges are to understand the potentially interconnected data stored in the various data islands and to prepare it for analytics.

In this tutorial, we present the notion of a Knowledge Lake, i.e., a contextualized Data Lake. The term knowledge here refers to a set of facts, information and insights extracted from the raw data using data curation techniques such as extraction, linking, summarization, annotation, enrichment, classification and more. In particular, a Knowledge Lake is a centralized repository containing virtually inexhaustible amounts of both raw data and contextualized data that is readily made available anytime to anyone authorized to perform analytical activities. The Knowledge Lake provides the foundation for big data analytics by automatically curating the raw data in the Data Lake and preparing it for deriving insights.


From Data to Big Data

In this part we provide an introduction to data and its cross-cutting aspects (e.g., provenance and versioning) and then introduce the notion of Big Data and how it is different from large datasets.

Organizing Big Data: Data Lake

In this part we present the challenges in storing the vast amounts of noisy data (varying from structured entities to unstructured documents) being generated on a continuous basis. We introduce the notion of a Data Lake, i.e., a storage repository that organizes this raw data in its native format (supporting everything from relational to NoSQL databases).
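The core Data Lake idea, storing items as-is while recording only format and provenance metadata, can be illustrated with a minimal in-memory sketch. This is not the implementation of any system discussed in the tutorial; the `DataLake` class and its `ingest` method are purely illustrative.

```python
import hashlib
from datetime import datetime, timezone

class DataLake:
    """Minimal in-memory sketch of a data lake: raw items are stored
    in their native format, tagged only with format and provenance
    metadata. Curation is deliberately deferred."""

    def __init__(self):
        self.items = {}

    def ingest(self, raw_bytes, fmt, source):
        # Content-addressed id so repeated ingests of the same bytes
        # map to the same item.
        item_id = hashlib.sha1(raw_bytes).hexdigest()
        self.items[item_id] = {
            "raw": raw_bytes,          # native format, uncurated
            "format": fmt,             # e.g. "json", "csv", "jpeg"
            "source": source,          # provenance (cross-cutting aspect)
            "ingested_at": datetime.now(timezone.utc).isoformat(),
        }
        return item_id

lake = DataLake()
rid = lake.ingest(b'{"name": "Alice"}', "json", "open-data-portal")
```

Note that the lake imposes no schema at ingestion time: a JSON record, a CSV file and an image would all be stored the same way, which is what enables the "curate later" rationale.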

Data Curation: from Ingestion and Cleaning to Extraction and Enrichment

In this part we introduce data curation, i.e., tasks that transform raw data (unstructured, semi-structured and structured data sources, e.g., text, video, image data sets) into curated data (contextualized data and knowledge that is maintained and made available for use by end-users and applications).
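Two of the curation tasks named above, extraction and enrichment, can be sketched on a text source. The regular expression stands in for a real named-entity extractor, and the dictionary stands in for an external knowledge base such as Wikidata; both are assumptions for illustration, not the tutorial's actual pipeline.

```python
import re

STOPWORDS = {"The", "A", "In"}  # crude filter for the toy extractor

def extract_entities(text):
    """Naive extraction: capitalized tokens, a stand-in for a real
    named-entity recognition service."""
    return [t for t in re.findall(r"\b[A-Z][a-z]+\b", text)
            if t not in STOPWORDS]

def enrich(entity, knowledge_base):
    """Enrichment: attach facts about the entity from an external
    source (here a plain dict standing in for a knowledge base)."""
    return {"entity": entity, "facts": knowledge_base.get(entity, [])}

def curate(text, knowledge_base):
    """Raw text in, contextualized records out."""
    return [enrich(e, knowledge_base) for e in extract_entities(text)]

kb = {"Sydney": ["city in Australia"]}
curated = curate("The suspect was last seen in Sydney.", kb)
```

A production pipeline would chain many such steps (cleaning, linking, summarization, annotation, classification), but each step has the same shape: raw or partially curated data in, more contextualized data out.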

From Data Lake to Knowledge Lake

In this part we introduce the novel notion of a Knowledge Lake, i.e., a contextualized Data Lake, and discuss big-data methods for building Knowledge Lakes as assets for big-data applications.
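What distinguishes a Knowledge Lake from a Data Lake is that curated items are linked, so the repository can be traversed as an entity graph rather than scanned as isolated files. The triple-store sketch below is a hypothetical illustration of that idea, not the architecture of any specific system.

```python
class KnowledgeLake:
    """Sketch of a contextualized data lake: facts extracted from raw
    items are stored as (subject, predicate, object) triples, forming
    a queryable entity graph."""

    def __init__(self):
        self.nodes = {}   # entity -> attributes
        self.edges = []   # list of (subject, predicate, object)

    def add_fact(self, subj, pred, obj):
        self.nodes.setdefault(subj, {})
        self.nodes.setdefault(obj, {})
        self.edges.append((subj, pred, obj))

    def related(self, entity):
        """Everything linked to an entity, in either direction."""
        out = [(p, o) for s, p, o in self.edges if s == entity]
        inc = [(p, s) for s, p, o in self.edges if o == entity]
        return out + inc

kl = KnowledgeLake()
kl.add_fact("cctv-42", "captured", "person-7")
kl.add_fact("person-7", "seen_at", "Sydney")
```

On a raw Data Lake, answering "what do we know about person-7?" would require re-parsing every source; on the graph it is a single lookup over the extracted links.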

Knowledge Lake as a Service

In this part we present the notion of the Knowledge Lake as a service and discuss its main components and services, from organizing and curating the raw data to querying and analysing the contextualized data in the Knowledge Lake.
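The "as a service" view can be sketched as a single facade that wires ingestion, curation and query together behind one interface. The class and method names below are hypothetical and do not reflect the API of CoreKG or any other system; the point is only the shape of the interface.

```python
class KnowledgeLakeService:
    """Hypothetical service facade: callers hand over raw documents
    and query contextualized data, without managing curation
    themselves."""

    def __init__(self, curator):
        # curator: any callable mapping a raw document to curated data
        self.curator = curator
        self.store = []

    def ingest(self, doc):
        """Curation happens automatically at ingestion time."""
        curated = self.curator(doc)
        self.store.append({"raw": doc, "curated": curated})
        return len(self.store) - 1   # document id

    def query(self, keyword):
        """Queries run over the curated view, not the raw bytes."""
        return [item for item in self.store if keyword in item["curated"]]

# Toy curator: lower-cased token list stands in for a full pipeline.
svc = KnowledgeLakeService(lambda d: d.lower().split())
doc_id = svc.ingest("Investigation report: Sydney CBD")
hits = svc.query("sydney")
```

The design choice worth noting is that curation is a pluggable dependency: the same service skeleton works whether the curator is a toy tokenizer or a full extraction/linking/enrichment pipeline.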

Motivating Scenario: Police Investigation in the Age of Internet of Things

Modern police investigation processes are often extremely complex, data-driven and knowledge-intensive. In such processes, it is not sufficient to focus on data storage and data analysis; knowledge workers (e.g., investigators) also need to collect, understand and relate the big data (scattered across various systems and generated by many Internet-enabled devices such as CCTVs, police cars, cameras on officers on duty and more) to the analysis process. In this part we use this motivating scenario to explain how Knowledge Lakes can assist police investigators in communicating analysis findings, supporting evidence and making decisions.


Dr. Amin Beheshti, Macquarie University, Sydney, Australia. Amin is a Lecturer in Data Science and the head of the Data Analytics Research Group, Department of Computing, Macquarie University. Amin completed his PhD and Postdoc in Computer Science and Engineering at UNSW Sydney and holds Master's and Bachelor's degrees in Computer Science, both with First Class Honours. In addition to his contribution to teaching activities, Amin has contributed extensively to research projects, where he was the R&D Team Lead and Key Researcher on the 'Case Walls & Data Curation Foundry' and 'Big Data for Intelligence' projects. Amin has been recognized as a high-quality researcher in Big-Data/Data/Process Analytics and has served as General Chair, PC Chair, Organisation Chair and program committee member of top international conferences.

Scientia Prof. Boualem Benatallah, University of New South Wales, Sydney, Australia. Boualem is Scientia Professor at the University of New South Wales Sydney, Australia. His research interests lie in the areas of Web service protocol analysis and management, enterprise services integration, large-scale and autonomous data sharing, process modelling and service-oriented architectures for pervasive computing. He has several ARC (Australian Research Council) funded projects in these areas. He was a visiting scholar at Purdue University (USA), a Visiting Professor at INRIA-LORIA (France) and a Visiting Professor at Blaise Pascal University (Clermont-Ferrand, France). He has been a General Chair, PC Chair and Program Committee member of several top conferences. He has been a keynote and tutorial speaker on Web Services at several workshops and conferences. He has published widely in top international journals and conferences.

Prof. Michael Sheng, Macquarie University, Sydney, Australia. Michael is a full Professor and Head of Department of Computing at Macquarie University. Michael holds a PhD degree in computer science from the University of New South Wales (UNSW) and did his post-doc as a research scientist at CSIRO ICT Centre. Michael has more than 320 publications as edited books and proceedings, refereed book chapters, and refereed technical papers in top journals and conferences. Michael is the recipient of the ARC Future Fellowship (2014), Chris Wallace Award for Outstanding Research Contribution (2012), and Microsoft Research Fellowship (2003). He is a member of the IEEE and the ACM.


  • Beheshti, A., Benatallah, B., Nouri, R., Tabebordbar, A.: CoreKG: a knowledge lake service. PVLDB, 11(12), 2018.
  • Beheshti, A., Benatallah, B., Nezhad, H.: Processatlas: A scalable and extensible platform for business process analytics. Softw., Pract. Exper. 48(4), 2018.
  • Beheshti, A., Schiliro, F., Ghodratnama, S., Amouzgar, F., Benatallah, B., Yang, J., Sheng, Q., Casati, F., Motahari-Nezhad, H., iProcess: Enabling IoT Platforms in Data-Driven Knowledge-Intensive Processes, BPM Forum, 2018.
  • Beheshti, A., Benatallah, B., Nouri, R., Chhieng, V.M., Xiong, H., Zhao, X.: CoreDB: a data lake service. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM 2017, Singapore, November 06 - 10, 2017.
  • Beheshti, S., Tabebordbar, A., Benatallah, B., Nouri, R.: On automating basic data curation tasks. In:WWW, 2017.
  • Beheshti, S., Benatallah, B., Motahari-Nezhad, H.R.: Scalable graph-based OLAP analytics over process execution data. Distributed and Parallel Databases 34(3), 379-423, 2016.
  • Beheshti, S., Benatallah, B., Sakr, S., Grigori, D., Motahari-Nezhad, H.R., Barukh, M.C., Gater, A., Ryu, S.H.: Process Analytics - Concepts and Techniques for Querying and Analyzing Process Data. Springer Book, 2016.
  • Beheshti, S., Benatallah, B., Nezhad, H.R.M.: Enabling the analysis of cross-cutting aspects in ad-hoc processes. In: CAiSE. 2013.
  • Tene, O., Polonetsky, J.: Big data for all: Privacy and user control in the age of analytics. Nw. J. Tech. & Intell. Prop. 11, 2012.
  • Terrizzano, I.G., et al.: Data Wrangling: The Challenging Journey from the Wild to the Lake. CIDR, 2015.

Early Bird registrations

Until September 30th, 2018