Less Labels, More Learning: Weak Supervision
Imagine training your deep learning models without relying on perfectly labeled datasets, using mostly raw data from the Internet and cutting costs along the way. This is the promise of weakly supervised learning.
In this post from RidgeRun.ai, we provide a concise overview of weak supervision, a training approach gaining popularity for its potential to reduce labeling requirements. The post aims to offer a high-level understanding of weakly supervised learning and to encourage you to delve deeper into current research.
Anyone involved in machine learning, especially deep learning, has encountered the challenge of data acquisition for model training. Whether it's the difficulty of obtaining sufficient data samples, the high cost of labeling, or the frustration of working with an unclean and improperly labeled dataset, these are common nightmares for machine learning engineers.
Picture scenarios like dealing with a poorly labeled dataset, a situation often encountered when using third-party labeling services, where the accuracy of labels can't be trusted. Or consider a dataset that's only partially labeled due to associated costs, or one labeled at a coarse level, lacking specifics on the details you need. In these situations, while you might have enough data samples for training, a reliable label set for traditional supervised learning mechanisms is missing. This is where the concept of weakly supervised learning emerges, offering a Holy Grail to the machine learning engineer.
The core idea behind weakly supervised learning, in any of its variations (which we'll delve into later), is to use existing labels as a general guide for clustering data into sets with a high probability of containing the required labels. The algorithm then discerns which samples actually hold the information needed to learn the classes.
A weakly supervised learning model learns the required features from noisy datasets.
Types of weakly supervised learning
Although the overall idea is the same, weak supervision algorithms come in different types, classified mostly by the nature of the data to be processed. Here we take a quick look at the three main scenarios depicted in Figure 1: incomplete supervision, inaccurate supervision, and inexact supervision. We will go over the main characteristics of each scenario.
Incomplete supervision arises when labeling costs are high, resulting in only a few correctly labeled samples and a substantial amount of unlabeled data. In this scenario, where the labeled data is treated as perfectly annotated, two main approaches are commonly employed: active learning and semi-supervised learning. Both inherit some advantages from traditional supervised learning methods.
Active learning relies on external input, typically from a machine learning engineer, to provide the ground truth labels for selected unlabeled samples. The process involves training a model on the available labeled data, as in a supervised approach, and then prompting the user to label the unlabeled samples on which the model exhibits the lowest confidence. Confidence can be estimated from a single model's predictions or through query by committee, which relies on the votes of an ensemble of models. The success of active learning hinges on having enough labeled samples to establish a baseline model performance. With this baseline, only a few strategically chosen unlabeled samples need to be queried, minimizing the overall labeling effort. Figure 2 shows the overall process in active learning.
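The query step described above can be sketched in a few lines of Python. This is a minimal illustration, not a full active learning loop: the function name least_confident and the probability values are hypothetical, and we assume the baseline model already produced class probabilities for each unlabeled sample.

```python
def least_confident(probabilities, k):
    """Return the indices of the k samples whose top-class probability is lowest."""
    # Confidence of a prediction = probability of its most likely class.
    confidence = [max(p) for p in probabilities]
    ranked = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return ranked[:k]


# Hypothetical class probabilities from a baseline model for four unlabeled samples.
pool_probs = [
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],
]

# Samples to send to the human annotator.
to_label = least_confident(pool_probs, k=2)
print(to_label)
```

After the annotator labels the selected samples, they would be added to the labeled set and the model retrained, repeating until the labeling budget is spent.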
Semi-supervised learning eliminates the need for direct human interaction by leveraging the probability distribution of both labeled and unlabeled samples to infer missing labels. Various methods can be applied, but the underlying principle remains consistent: organizing unlabeled data in the feature space into clusters. A few known labels then determine the probability that a group of unlabeled samples belongs to a specific class.
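As a rough sketch of this principle, the snippet below assumes the samples have already been clustered by some algorithm and that only a few of them carry labels. The cluster assignments, labels, and the helper name cluster_label_probabilities are all made up for illustration.

```python
from collections import Counter, defaultdict


def cluster_label_probabilities(cluster_ids, known_labels):
    """Estimate P(label | cluster) from the few labeled samples in each cluster."""
    counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, known_labels):
        if label is not None:  # None marks an unlabeled sample
            counts[cluster][label] += 1
    probabilities = {}
    for cluster, counter in counts.items():
        total = sum(counter.values())
        probabilities[cluster] = {label: n / total for label, n in counter.items()}
    return probabilities


# Hypothetical cluster assignment and sparse labels for nine samples.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1, 1]
known_labels = ["triangle", None, None, "triangle",
                "diamond", None, "diamond", "triangle", None]

probs = cluster_label_probabilities(cluster_ids, known_labels)

# Unlabeled samples inherit the most probable label of their cluster.
# (A cluster with no labeled sample at all would need separate handling.)
pseudo_labels = [
    label if label is not None else max(probs[c], key=probs[c].get)
    for c, label in zip(cluster_ids, known_labels)
]
print(probs)
print(pseudo_labels)
```

Real semi-supervised methods are more careful about how confident a cluster-level estimate must be before a label is assigned, but the flow of information from a few labels to many samples is the same.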
Consider Figure 3 as a graphical representation of the reasoning behind semi-supervised learning. In Figure 3(a) we have a set of unlabeled samples (circles), and we want to know the label for the smiling face. Since we have no label information, it is impossible to make any guess about its label. Figure 3(b) shows a similar scenario but with only the labeled samples (triangle, diamond, and pentagon). In this case, we have just a few samples scattered around the feature space, which is not enough information to guess the nature of our target sample. Finally, if we combine the labeled and unlabeled samples as in Figure 3(c), it becomes clear at first sight that our smiling face has a low probability of belonging to the pentagon class, while it seems likely to belong to the triangle class (a rough assumption based on simple observation). Note that we did not need more labeled samples to form a judgment about the label of our target sample: the combination of labeled and unlabeled samples gives an idea of the underlying probability distribution.
You can conceptualize this as a mixture of Gaussians, as shown in Figure 4. The dotted circles show the probability contours of each Gaussian centered on a data group. In this scenario, the assignment of the smiling face to the triangle class becomes more evident, since it lies closest to the Gaussian centered on that cluster.
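To make the Gaussian intuition concrete, here is a small sketch that computes the posterior probability of each cluster for a target point. It assumes equally weighted, isotropic Gaussians with a shared standard deviation; the cluster centers, the target coordinates, and the function name gaussian_posteriors are invented for the example and are not taken from the figures.

```python
import math


def gaussian_posteriors(point, centers, sigma=1.0):
    """Posterior over clusters for equally weighted isotropic Gaussians."""
    scores = {}
    for label, center in centers.items():
        sq_dist = sum((p - c) ** 2 for p, c in zip(point, center))
        # The normalization constant is identical for every cluster, so it cancels.
        scores[label] = math.exp(-sq_dist / (2 * sigma ** 2))
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Hypothetical cluster centers in a 2D feature space.
centers = {
    "triangle": (1.0, 1.0),
    "diamond": (5.0, 1.0),
    "pentagon": (3.0, 5.0),
}
target = (1.5, 1.5)  # the "smiling face" sample

posteriors = gaussian_posteriors(target, centers)
best = max(posteriors, key=posteriors.get)
print(best, posteriors)
```

In practice the means, covariances, and mixture weights would be estimated from the data, for instance with expectation maximization, rather than fixed by hand.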
Have you ever engaged a crowdsourcing service to label an extensive dataset? If so, you may have encountered instances where the labels were not as accurate as desired. This common scenario arises because labeling is closely tied to expert knowledge, and external contributors may interpret instructions differently, resulting in inaccuracies. This phenomenon is known as inaccurate supervision, and it produces what is commonly referred to as a noisy dataset, where labels deviate from the ground truth and exhibit variability.
Consider Figure 5, showcasing a sample from our object detection example, labeled through a crowdsourcing service. In our training process, we observed confusion in the classification results, particularly between the pawn and bishop pieces. Upon inspection, it became apparent that some labels mistakenly tagged the bishop as a pawn (highlighted in red in the image), introducing inaccuracies into the final results of our supervised learner.
To mitigate the impact of inaccurate labeling, the field of inaccurate supervision aims to detect inconsistencies in the provided labels and address them by either removing or re-labeling suspicious samples. Various algorithms tackle this challenge, including majority voting (aggregating the predictions of multiple models into a single label assignment), statistical outlier detection, and data-editing approaches (e.g., label propagation), where a label that is inconsistent with those of its neighbors triggers the detection and correction of the error. Together, these approaches clean up the labeled dataset and make supervised learning models more robust to label noise.
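The neighbor-based idea can be sketched as follows. This is a simplified illustration rather than a specific published algorithm: each sample is compared against the majority label of its k nearest neighbors, and disagreements are flagged for review. The feature vectors, labels, and the function name suspicious_labels are hypothetical.

```python
from collections import Counter


def suspicious_labels(points, labels, k=3):
    """Flag samples whose label disagrees with the majority of their k nearest neighbors."""
    flagged = []
    for i, (point, label) in enumerate(zip(points, labels)):
        others = [j for j in range(len(points)) if j != i]
        # Sort the other samples by squared Euclidean distance to the current one.
        others.sort(key=lambda j: sum((a - b) ** 2 for a, b in zip(point, points[j])))
        votes = Counter(labels[j] for j in others[:k])
        majority, _ = votes.most_common(1)[0]
        if majority != label:
            flagged.append((i, label, majority))
    return flagged


# Hypothetical 2D features: one group of pawns, one group of bishops,
# with sample 6 mislabeled as a pawn.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
labels = ["pawn", "pawn", "pawn", "pawn", "bishop", "bishop", "pawn", "bishop"]

flagged = suspicious_labels(points, labels)
print(flagged)
```

Each flagged entry holds the sample index, its current label, and the label suggested by its neighbors, so the sample can be either dropped or re-labeled.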
Inexact supervision, also known as multi-instance learning (MIL), addresses scenarios where labels lack the fine-grained information required for training models under traditional supervised learning. Consider a case where we use inexact supervision for an anomaly detection model. We might possess data samples containing anomalies, but the precise location of the malformation isn't labeled in the images—only the presence of an anomaly is known. In such instances, a conventional supervised learning model might struggle to identify the anomaly due to the vague labeling scheme.
In MIL, instances are grouped into sets known as bags. The foundational assumption is that a bag is positive if at least one of its instances is positive (referred to as a witness), while a negative bag must consist only of negative instances, as illustrated in Figure 6.
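Under this assumption, a bag-level prediction can be derived from instance-level scores by taking the maximum. The sketch below assumes some model already assigned each instance a score between 0 and 1; the scores, bag names, threshold, and the function name bag_prediction are illustrative only.

```python
def bag_prediction(instance_scores, threshold=0.5):
    """Label a bag as positive if its highest-scoring instance passes the threshold.

    Returns the bag label and the index of the candidate witness.
    """
    witness = max(range(len(instance_scores)), key=lambda i: instance_scores[i])
    return instance_scores[witness] >= threshold, witness


# Hypothetical instance scores, e.g. one score per image patch.
bags = {
    "image_a": [0.05, 0.10, 0.92, 0.08],  # one patch looks anomalous
    "image_b": [0.03, 0.12, 0.07],        # no patch looks anomalous
}

results = {name: bag_prediction(scores) for name, scores in bags.items()}
print(results)
```

For a negative bag the returned index is simply the highest-scoring instance, which by the assumption above is not a true witness.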
At its core, multi-instance learning seeks to classify or process each set of instances within a bag as a single entity and assigns a label to the entire bag. This perspective positions MIL as an extension of supervised learning, with classical techniques like clustering, expectation maximization, SVM, or kNN adapted to align with the MIL approach.
Regardless of the method used, the fundamental idea is that the presence of at least one positive sample in a bag shifts its location in the feature space, transforming the problem into a conventional machine learning task.
While MIL offers a powerful framework, it comes with challenges. One notable issue is associated with the witness rate (WR), the proportion of instances in a positive bag that are actually positive. If the WR is high, treating the instances of a positive bag as representative of the positive label is straightforward. However, a low WR means that positive bags are dominated by negative instances, so there isn't sufficient information to characterize the positive group accurately. This imbalance makes it hard to exploit the inexact supervision that MIL relies on.
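As a quick numeric illustration, the snippet below computes the witness rate of two made-up positive bags. It takes WR to be the fraction of instances in a positive bag that are positive, which is the definition commonly used in the MIL literature; the instance-level labels are hypothetical and would normally be unknown during training.

```python
def witness_rate(instance_labels):
    """Fraction of instances in a bag that are positive (1 = positive, 0 = negative)."""
    return sum(instance_labels) / len(instance_labels)


# Hypothetical positive bags with known instance labels, for illustration only.
dense_bag = [1, 1, 1, 0]        # most instances are witnesses
sparse_bag = [0] * 99 + [1]     # a single witness among 100 instances

print(witness_rate(dense_bag))   # 0.75
print(witness_rate(sparse_bag))  # 0.01
```

In the second bag, a learner that treats every instance as positive would be wrong for 99 out of 100 instances, which is why low witness rates are hard to handle.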
Is Weak Supervision the Holy Grail of Machine Learning?
After this introduction to weakly supervised learning and its encouraging promises, you may be wondering why anyone would keep using traditional supervised learning methods. Weak supervision sounds almost too good to be true!
While weakly supervised learning is a promising and emerging technique, it is still the subject of ongoing research to comprehend its full capabilities. It comes to the rescue when dealing with limited sets of labeled data. However, it's essential to note that in scenarios with a meticulously clean dataset, traditional supervised learning tends to outperform weak supervision.
Recent research on weakly supervised models, particularly in NLP applications, has raised some caveats. It turns out that weakly supervised methods require a perfectly labeled validation set to yield reasonable results. If the validation data isn't clean, performance suffers, much as it would with a random selection of hyperparameters. Notably, research by Zhu et al. suggests that training with supervised learning on perfectly labeled data, even with small samples, produces better-performing models than their weakly supervised counterparts. While these findings mark a starting point, further investigation is needed to thoroughly characterize weakly supervised models and their capabilities. Nonetheless, these results underscore the importance of keeping traditional supervised methods on the radar.
The question that lingers: Are you intrigued enough to give weak supervision a try? The evolving landscape of machine learning continues to challenge and inspire, and the quest for the most effective learning paradigm persists.
See you in our next blog!
The RidgeRun.ai Team