Our research combines theory and applications in data science and trustworthy machine learning. We develop theoretical results (algorithms, theorems and proofs) that define new ideas and concepts. We implement and test these on real datasets and applications.
Our theoretical work draws on deep mathematical and computational concepts from geometry, topology, theoretical machine learning, algorithms and many other areas. Below is a list of our current research interests and our favourite application areas. Get in touch if you would like to work in these exciting areas.
Machine Learning: Analysis of SGD, Neural Networks and Optimisation
Neural networks are mysterious things. They perform surprisingly well, and they also fail when no one expects them to. Big neural networks are behind all the recent successes in ML and the current hype in AI, but we don’t really understand what goes on in them. Where do they work? Why do they fail?
Recent research has found many interesting behaviours in neural networks. For example, network pruning shows that large networks contain small subnetworks that are really the core of the network’s behaviour. Study of the stochastic gradient descent (SGD) algorithm has found that the trajectory the algorithm takes through model space determines its properties through its mathematical invariants. Finally, neural networks behave strangely with respect to data: they frequently forget certain data points and memorise others.
We are looking at neural networks at finer scales than ever before — simultaneously analysing the network, the optimisation algorithm and the data to understand the subtle relations between them.
Generative Models and Impact of Artificial Data
The exciting current developments in AI are all based on generative models created with data found on the Internet. As AI and generative models become better and more widely used, artificial data will become more common everywhere. What happens when the next generation of generative models is trained in part on artificial data?
We tried feeding artificial data to train generative models, and found that through generations, the models degrade until they start producing nonsense.
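The degradation described above can be illustrated with a deliberately tiny toy model. The sketch below is not the experiment from our work: it assumes the "generative model" is just a one-dimensional Gaussian, and each generation is fitted only to samples drawn from the previous generation's fit. The function name and the parameters `n_samples`, `n_generations` and `seed` are hypothetical choices for illustration.

```python
import random
import statistics

def simulate_generations(n_samples=20, n_generations=500, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 stands in for the real data distribution
    history = [sigma]
    for _ in range(n_generations):
        # Draw artificial data from the current model.
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        # Fit the next model on artificial data only (maximum likelihood estimates).
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        history.append(sigma)
    return history

history = simulate_generations()
print(f"std at generation 0: {history[0]:.3f}")
print(f"std at generation {len(history) - 1}: {history[-1]:.3g}")
```

In this toy setting the fitted standard deviation drifts towards zero over the generations, so the model gradually loses the diversity of the original data. Real generative models are far more complex, and this sketch only conveys the flavour of the effect.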
There is, of course, a lot of even more exciting research to do in this area. What is really different in artificial data? Can we avoid this pitfall? What are the subtle differences in models trained with artificial data, and what are their consequences?
See my interview at Scientific American for a longer discussion: https://www.scientificamerican.com/article/ai-generated-data-can-poison-future-ai-models/
Value of Data
Some of the largest companies today run their businesses on data, and make enormous profits. So clearly, data is valuable. But how much exactly? Our personal data is used for building their core models and products, yet we have no way of knowing how much an individual contributes to these profit-making products. Consider a company that uses our movie preferences, search behaviour or interactions with friends to improve their ML models. How much did your data contribute to the product? How much did the ML model improve by using your data?
This is the problem of data valuation. By assigning a value to each data point, we can get a better idea of how much our data is worth. We can then make better decisions about whether to share our data, and perhaps ask for compensation from those who use it. This fundamental problem involves the study of a range of topics: economics, mathematics, game theory, interpretable machine learning, privacy and many others. It will have a huge impact on the future of data science, technology and even economic policy.
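One common game-theoretic way to assign a value to each data point is the Shapley value: the average improvement a point brings when it is added to the training set, taken over all possible orderings of the points. The sketch below is a minimal illustration under stated assumptions, not our group's method: the toy data in `TRAIN` and `VALID`, the 1-nearest-neighbour `utility` function and the convention that an empty training set scores zero are all hypothetical. Exact enumeration over permutations is only feasible for a handful of points.

```python
from itertools import permutations

# Hypothetical toy data: (feature, label). The last training point is deliberately mislabelled.
TRAIN = [(0.0, 0), (1.0, 1), (0.9, 1), (0.2, 1)]
VALID = [(0.05, 0), (0.3, 0), (0.8, 1), (1.1, 1)]

def utility(coalition):
    """Validation accuracy of a 1-nearest-neighbour classifier trained on the coalition."""
    if not coalition:
        return 0.0  # assumption: a model with no training data scores zero
    correct = 0
    for x, y in VALID:
        # Break distance ties by index so the result does not depend on insertion order.
        nearest = min(coalition, key=lambda i: (abs(TRAIN[i][0] - x), i))
        correct += TRAIN[nearest][1] == y
    return correct / len(VALID)

def shapley_values(n_points, utility):
    """Exact Shapley values: average marginal contribution over all orderings."""
    values = [0.0] * n_points
    orders = list(permutations(range(n_points)))
    for order in orders:
        coalition, previous = [], utility([])
        for i in order:
            coalition.append(i)
            current = utility(coalition)
            values[i] += current - previous
            previous = current
    return [v / len(orders) for v in values]

values = shapley_values(len(TRAIN), utility)
for (x, y), v in zip(TRAIN, values):
    print(f"point x={x}, label={y}: value {v:+.3f}")
```

In this toy example the mislabelled point receives the lowest value, and the values sum to the accuracy of the model trained on all points. For realistic dataset sizes the number of orderings is far too large to enumerate, so approximations are needed.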
Differential Privacy and Private Machine Learning.
Privacy is a fundamental and difficult challenge in today’s world. As our data gets used in many products and services, the question is: can we hope to save some of our privacy while retaining the same applications of data science? Thus, instead of simply extracting as much knowledge as possible, privacy is a question of balancing what we can learn with what we can hide.
Differential privacy is a way of measuring the loss of privacy due to a particular computation. Private ML models are those with guarantees of a certain level of differential privacy. In our research group we develop algorithms that provide differential privacy guarantees for querying databases, training ML models and various other computations. We are currently investigating improved differential privacy for training neural networks and data valuation.
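As a concrete illustration of a differentially private database query, the sketch below applies the standard Laplace mechanism to a counting query. It is a textbook example rather than one of our algorithms. The function names, the `ages` data and the choice of `epsilon` are hypothetical. A counting query has sensitivity 1, because adding or removing one record changes the count by at most 1, so Laplace noise with scale `1 / epsilon` gives an epsilon-differentially private answer.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Answer a counting query with epsilon-differential privacy (Laplace mechanism)."""
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0  # one record changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
ages = [23, 35, 41, 52, 29, 64]  # hypothetical database
noisy = private_count(ages, lambda age: age >= 40, epsilon=1.0, rng=rng)
print(f"noisy count of people aged 40 or over: {noisy:.2f}")
```

A smaller `epsilon` means more noise and stronger privacy, which is the balance between what we can learn and what we can hide. This sketch ignores practical issues such as floating-point side channels and tracking the privacy budget across repeated queries.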
Topological Data Analysis
Topological data analysis deals with the shape of data. In data science, where the objective is to gain knowledge from data, this shape can reveal insights that are otherwise hard to see. In many real scenarios, the data is directly associated with shapes. In biological data, for example, the shape of the organs and networks is the primary concern. In geospatial data, the 2-D shape of spatial functions and trajectories determines the properties of the system concerned. In such applications, topological data analysis provides valuable features that act as input to machine learning models and as the basis of analytics.
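One of the simplest topological features is the number of connected components of a point cloud at a given scale. The sketch below is an illustrative assumption rather than code from our projects: it links any two points that are within `radius` of each other and counts the resulting components with a union-find structure. The `points` data and the chosen radii are hypothetical.

```python
from itertools import combinations
from math import dist

def count_components(points, radius):
    """Number of connected components when points within `radius` are linked."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    components = len(points)
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= radius:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_i] = root_j  # merge the two components
                components -= 1
    return components

# Hypothetical point cloud with two well-separated clusters.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
for radius in (0.05, 0.5, 10.0):
    print(radius, count_components(points, radius))
```

Tracking how this count changes as the radius grows shows which groupings persist across scales. That is the basic idea behind persistence in topological data analysis, here restricted to connected components only.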
Computational topology can be used to gain insights on machine learning itself. To understand the mysterious behaviour of neural networks, topology can be applied to the optimisation steps. This approach is currently being applied to gain deeper understanding of machine learning.
Computational Fairness and Interpretability.
As complex computations, data science and machine learning become common across many domains, we have to consider possible consequences and complexities.
Complex computations and data-driven decisions raise the question of fairness. Biases in the data may cause the final decisions to be biased. This is a subtle issue, and biases are hard to catch, or even to define. What is fair from one perspective may be unfair from another. We study the topic of fairness in training ML models in conjunction with other issues like privacy and explainability/interpretability.
Complex models and data science create another type of issue: explaining what is going on. Why did the AI decide to do what it did? Why did the self-driving car turn at a certain point? Why did the classifier suggest the specific diagnosis? Interpretable machine learning is the topic of answering such questions: finding how features, model and data affect specific decisions and the general behaviour of models. This requires close examination of models and their behaviours, combining techniques from mathematics, economics and other areas.
Application Areas
Network Analysis and Graph machine learning.
Networks are ubiquitous, from social networks to computer, transport and biological ones. We have developed algorithms specifically for networked data, including algorithms for embedding/representation, classification and community detection.
Our early work on hyperbolic embedding has led to a large body of work on hyperbolic approaches for networks, knowledge graphs and many other types of data. We have published many popular benchmark datasets and award-winning graph-ML libraries.
Biomedical Engineering.
We have applied network analysis and topological data analysis to the detection of diabetic retinopathy and Alzheimer’s disease. We have developed a realistic model of the human tongue surface, with analytical methods to measure its sensing and friction properties. Further work is in progress in these areas.
The brain: young, old, and under Alzheimer’s disease.
Our work on 3D meshes of the human tongue has gained recognition in media all over the world.
The beautiful 3D geometry of the human tongue surface
See our most recent work on the features and uniqueness of tongue papillae using ML and topological data analysis.
IoT, Sensors and decentralised computation.
We develop decentralised algorithms, including decentralised model training algorithms. We have a line of well-known works on information processing in IoT and sensor networks using deep mathematical concepts such as differential geometry and topology, Ricci flow and conformal geometry, hyperbolic geometry and many others.
Chemistry and Drug design.
Machine learning and graph machine learning are being increasingly applied to predict chemical properties of molecules. We are combining ML with geometric ideas to predict drug interaction, reaction and other properties of molecules.
Mobility, trajectories and spatial data.
We have developed several algorithms to process data from a 2-D space such as a map. These include locality-sensitive hashes for trajectories, topological sketches and differentially private query algorithms on maps. Typically these works combine geometric and topological ideas with data science techniques.