Speaker: Ganesh Ramakrishnan
State-of-the-art AI and deep learning models are very data hungry. This comes at significant cost, including resource costs (multiple expensive GPUs and cloud costs), long training times (often multiple days), and human labeling cost and time. In this talk we present an overview of our research efforts toward Data Efficient maChIne LEarning (DECILE) and the associated open-source platform (http://www.decile.org), in which we attempt to address the following questions. Can we train state-of-the-art deep models with only a sample (say 5 to 10%) of massive datasets, while having negligible impact on accuracy? Can we do this while reducing training time/cost by an order of magnitude, and/or significantly reducing the amount of labeled data required? In this talk, we will cover the following components of DECILE while also outlining our research along those threads, viz., a) SUBMODLIB, b) CORDS, c) TRUST, d) DISTIL and e) SPEAR. Below, we introduce each component briefly.
a) SUBMODLIB (https://github.com/decile-team/submodlib) is a library for submodular optimization. This library implements a number of submodular optimization algorithms and functions (including the submodular mutual information and conditional gain functions) in C++ with Python wrappers. It finds application in summarization, data subset selection, hyperparameter tuning, etc.
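To make the core idea concrete, the workhorse of such libraries is greedy maximization of a submodular function. The sketch below is a plain-NumPy illustration (not SUBMODLIB's actual API) of naive greedy selection under the facility-location objective, which rewards subsets whose members are similar to every point in the dataset:

```python
import numpy as np

def greedy_facility_location(sim, budget):
    """Naive greedy maximization of the facility-location function.
    sim: (n, n) pairwise similarity matrix; budget: desired subset size.
    Returns indices of the selected subset."""
    n = sim.shape[0]
    selected = []
    current_max = np.zeros(n)  # best similarity each point has to the subset
    for _ in range(budget):
        best_gain, best_idx, best_max = -np.inf, None, None
        for c in range(n):
            if c in selected:
                continue
            # marginal gain of adding candidate c to the subset
            new_max = np.maximum(current_max, sim[c])
            gain = new_max.sum() - current_max.sum()
            if gain > best_gain:
                best_gain, best_idx, best_max = gain, c, new_max
        selected.append(best_idx)
        current_max = best_max
    return selected
```

Because facility location exhibits diminishing returns, this greedy procedure carries the classic (1 - 1/e) approximation guarantee; libraries like SUBMODLIB additionally provide faster optimizers (e.g. lazy greedy) over the same objectives.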
b) CORDS (https://github.com/decile-team/cords) is a library for COResets and Data Subset selection for compute-efficient training of deep models. We will also briefly present our algorithmic innovations along this thread.
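As a flavor of the underlying idea (a simplified, hypothetical sketch, not CORDS's API), gradient-matching coreset methods such as GradMatch pick a subset whose average gradient approximates the full dataset's average gradient, so that training on the subset mimics training on all the data:

```python
import numpy as np

def greedy_gradient_matching(grads, budget):
    """Greedily pick `budget` points so that the mean gradient of the
    selected subset stays close to the full-data mean gradient.
    grads: (n_points, n_params) array of per-example gradients."""
    full_mean = grads.mean(axis=0)
    selected = []
    for _ in range(budget):
        best_err, best_i = np.inf, None
        for i in range(len(grads)):
            if i in selected:
                continue
            # error if we added point i to the subset
            trial_mean = grads[selected + [i]].mean(axis=0)
            err = np.linalg.norm(trial_mean - full_mean)
            if err < best_err:
                best_err, best_i = err, i
        selected.append(best_i)
    return selected
```

In practice such a selection is re-run every few epochs on fresh gradients (and with per-point weights), which is where the order-of-magnitude training speedups come from.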
c) TRUST (https://github.com/decile-team) is a library for targeted subset selection toward personalization and model remediation that includes several information-theoretic measures on sets that we have innovated.
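As a toy illustration of targeted selection (a simplified stand-in for TRUST's submodular-mutual-information measures, not its actual API), one can greedily pick unlabeled candidates that best "cover" a small target/query set of examples:

```python
import numpy as np

def targeted_greedy(sim_to_targets, budget):
    """Greedy targeted selection sketch. sim_to_targets is an
    (n_candidates, n_targets) similarity matrix; each step adds the
    candidate that most improves the best-similarity coverage of the
    target examples."""
    n, t = sim_to_targets.shape
    selected = []
    coverage = np.zeros(t)  # best similarity achieved so far per target
    for _ in range(budget):
        # marginal gain of each candidate toward covering the targets
        gains = np.maximum(coverage, sim_to_targets).sum(axis=1) - coverage.sum()
        gains[selected] = -np.inf  # never re-pick a selected candidate
        best = int(gains.argmax())
        selected.append(best)
        coverage = np.maximum(coverage, sim_to_targets[best])
    return selected
```

This captures the spirit of targeting (mine data relevant to a query set, e.g. a model's failure cases) that the library's richer set measures formalize.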
d) DISTIL (https://github.com/decile-team) is a Deep dIverSified inTeractIve Learning toolkit for deep models that, among other things, factors in the effect of data augmentation on active learning.
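A minimal example of the kind of query strategy such active-learning toolkits implement is entropy-based uncertainty sampling, sketched here in plain NumPy (the function name is illustrative, not DISTIL's API):

```python
import numpy as np

def entropy_query(probs, batch_size):
    """Select the unlabeled points whose predicted class distribution has
    the highest entropy, i.e., where the model is least certain.
    probs: (n_unlabeled, n_classes) softmax outputs."""
    eps = 1e-12  # avoid log(0)
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)
    # indices sorted by decreasing entropy; take the top batch_size
    return np.argsort(-entropy)[:batch_size].tolist()
```

The selected points are sent for human labeling, the model is retrained, and the loop repeats; diversified strategies additionally spread the batch across dissimilar points.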
e) SPEAR (https://github.com/decile-team/spear) is a library for Semi-suPervisEd dAta pRogramming. SPEAR also includes innovative models that we have built for selecting (under a budget constraint) the unlabeled subset to be labeled that best complements a given set of rules meant for labeling data (an approach referred to as data programming). We will also briefly introduce the different state-of-the-art algorithms implemented, especially those we have innovated.
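The data-programming setting can be illustrated with a toy example (hypothetical labeling functions and a simple majority-vote aggregator; SPEAR's own label models are more sophisticated): weak labeling rules vote on unlabeled examples, and the votes are aggregated into labels.

```python
import numpy as np

ABSTAIN = -1  # a labeling function may decline to vote

def lf_mentions_free(text):
    """Hypothetical rule: 'free' suggests spam (class 1)."""
    return 1 if "free" in text.lower() else ABSTAIN

def lf_long_message(text):
    """Hypothetical rule: longer messages tend to be ham (class 0)."""
    return 0 if len(text.split()) > 4 else ABSTAIN

def majority_vote(lfs, texts, n_classes=2):
    """Aggregate labeling-function votes per example by simple majority."""
    labels = []
    for t in texts:
        votes = [v for v in (lf(t) for lf in lfs) if v != ABSTAIN]
        if not votes:
            labels.append(ABSTAIN)  # no rule fired
        else:
            counts = np.bincount(votes, minlength=n_classes)
            labels.append(int(counts.argmax()))
    return labels
```

The subset-selection models mentioned above then ask: given these rules, which unlabeled examples are worth spending the labeling budget on so that the human labels best complement what the rules already cover?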
Ganesh Ramakrishnan (https://www.cse.iitb.ac.in/~ganesh/) is currently serving as an Institute Chair Professor at the Department of Computer Science and Engineering, IIT Bombay. His areas of research include human-assisted AI/ML, AI/ML in resource-constrained environments, and learning with symbolic encoding of domain knowledge in ML and NLP. More recently, he has been focusing his energy on organizing relevant machine learning modules for resource-constrained environments into https://decile.org/. In the past, he has demonstrated the impact of such data-efficient machine learning in applications such as video analytics (https://www.cse.iitb.ac.in/~vidsurv) and OCR (https://www.cse.iitb.ac.in/~ocr), and he is seeking to make similar impacts in creating a machine translation ecosystem (https://www.udaanproject.org/) and in multi-modal analytics (https://www.cse.iitb.ac.in/~malta/). He has received the IBM Faculty Award, awards from Qualcomm and Microsoft, the IIT Bombay Impactful Research Award, and most recently the Dr. P.K. Patwardhan Award for Technology Development. He also held the J.R. Isaac Chair at IIT Bombay. Ganesh is passionate about boosting the AI research ecosystem in India; toward that end, research by him, his students, and his collaborators has resulted in startups that he has jointly founded, transferred technology to, or is mentoring. Ganesh is also currently serving as the Professor-in-charge of the Koita Centre for Digital Health at IIT Bombay (https://www.kcdh.iitb.ac.in/).