
A Coding Guide to Master Self-Supervised Learning with Lightly AI for Efficient Data Curation and Active Learning

Estimated reading time: 9 minutes

  • Self-supervised learning (SSL) significantly reduces the reliance on costly, human-labeled data by enabling models to learn robust features from unlabeled data.
  • The Lightly AI framework provides an efficient way to build and train self-supervised models like SimCLR, simplifying complex architectural setups.
  • Intelligent data curation through coreset selection (e.g., diversity-driven) is superior to random sampling, ensuring that selected data points provide maximum informational value for downstream tasks.
  • An effective active learning workflow, combined with SSL, drastically reduces labeling efforts while boosting overall model performance and data efficiency.
  • Linear probe evaluation is a crucial method for objectively assessing the quality of learned representations, demonstrating how well pre-trained features generalize to new classification tasks.

The quest for powerful AI models often hits a roadblock: the scarcity and cost of labeled data. Traditional supervised learning, while effective, demands vast human annotations, which can be prohibitively expensive and time-consuming. This is where Self-Supervised Learning (SSL) emerges as a game-changer, enabling models to learn robust features from unlabeled data itself. Coupled with intelligent data curation and active learning strategies, SSL can drastically reduce labeling efforts while boosting model performance.

In this tutorial, we explore the power of self-supervised learning using the Lightly AI framework. We begin by building a SimCLR model to learn meaningful image representations without labels, then generate and visualize embeddings using UMAP and t-SNE. We then dive into coreset selection techniques to curate data intelligently, simulate an active learning workflow, and finally assess the benefits of transfer learning through a linear probe evaluation. Throughout this hands-on guide, we work step by step in Google Colab, training, visualizing, and comparing coreset-based and random sampling to understand how self-supervised learning can significantly improve data efficiency and model performance. Check out the FULL CODES here.

This guide will walk you through building a powerful self-supervised learning pipeline using Lightly AI, demonstrating how to leverage learned representations for efficient data curation and an impactful active learning workflow. We’ll implement a SimCLR model, visualize its learned embeddings, and apply smart sampling techniques to select the most informative data points for downstream tasks.

Setting Up Your Self-Supervised Learning Environment

Before diving into the model architecture and training, we need to prepare our development environment. This involves installing necessary libraries and ensuring our system is ready for GPU acceleration, a crucial component for efficient deep learning. We’ll be working in Google Colab for a smooth, cloud-based experience.

We begin by setting up the environment, ensuring compatibility by fixing the NumPy version and installing essential libraries like Lightly, PyTorch, and UMAP. We then import all necessary modules for building, training, and visualizing our self-supervised learning model, confirming that PyTorch and CUDA are ready for GPU acceleration. Check out the FULL CODES here.


!pip uninstall -y numpy
!pip install numpy==1.26.4
!pip install -q lightly torch torchvision matplotlib scikit-learn umap-learn

import torch
import torch.nn as nn
import torchvision
from torch.utils.data import DataLoader, Subset
from torchvision import transforms
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors
import umap

from lightly.loss import NTXentLoss
from lightly.models.modules import SimCLRProjectionHead
from lightly.transforms import SimCLRTransform
from lightly.data import LightlyDataset

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

This setup ensures that all dependencies are met, from the core deep learning framework PyTorch to specialized libraries like Lightly AI for SSL and UMAP for dimensionality reduction. Confirming CUDA availability is key to harnessing GPU power for faster training.

Building and Training Your SimCLR Model with Lightly AI

The core of our self-supervised learning journey is the SimCLR model. SimCLR (A Simple Framework for Contrastive Learning of Visual Representations) learns by maximizing agreement between different augmented views of the same image via a contrastive loss function. Lightly AI provides convenient modules to build such models efficiently.

Defining the SimCLR Architecture

Our SimCLR model is built upon a standard ResNet backbone, a common choice for visual feature extraction. The key modification for SSL is the removal of the ResNet’s classification head, replacing it with a projection head that maps the backbone features into a lower-dimensional embedding space. This space is where the contrastive learning magic happens.

We define our SimCLRModel, which uses a ResNet backbone to learn visual representations without labels. We remove the classification head and add a projection head to map features into a contrastive embedding space. The model’s extract_features method allows us to obtain raw feature embeddings directly from the backbone for downstream analysis. Check out the FULL CODES here.


class SimCLRModel(nn.Module):
    """SimCLR model with ResNet backbone"""
    def __init__(self, backbone, hidden_dim=512, out_dim=128):
        super().__init__()
        self.backbone = backbone
        # Drop the supervised classification head; we only want features
        self.backbone.fc = nn.Identity()
        self.projection_head = SimCLRProjectionHead(
            input_dim=512, hidden_dim=hidden_dim, output_dim=out_dim
        )

    def forward(self, x):
        features = self.backbone(x).flatten(start_dim=1)
        z = self.projection_head(features)
        return z

    def extract_features(self, x):
        """Extract backbone features without projection"""
        with torch.no_grad():
            return self.backbone(x).flatten(start_dim=1)

The extract_features method is particularly useful. It allows us to retrieve the learned representations directly from the backbone, which will later be used for visualization and downstream tasks.
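As a quick illustration, here is a hypothetical usage sketch. It assumes `model` is the SimCLRModel defined above, moved to `device`, and that `eval_loader` is a DataLoader over the normalized evaluation dataset we build in the next section:

# Illustrative sketch: pull 512-dim backbone features for one batch.
# Assumes `model` (SimCLRModel, on `device`) and `eval_loader` exist.
model.eval()
images, _ = next(iter(eval_loader))
features = model.extract_features(images.to(device))
print(features.shape)  # e.g., torch.Size([256, 512]) for a batch of 256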

Preparing the CIFAR-10 Dataset

For our experiments, we’ll use the well-known CIFAR-10 dataset. A critical aspect of contrastive learning is data augmentation, where multiple distorted versions (views) of the same image are generated. Lightly AI’s SimCLRTransform automates this process, ensuring rich, varied views for the model to learn from.

In this step, we load the CIFAR-10 dataset and apply separate transformations for self-supervised and evaluation phases. We create a custom SSLDataset class that generates multiple augmented views of each image for contrastive learning, while the evaluation dataset uses normalized images for downstream tasks. This setup helps the model learn robust representations invariant to visual changes. Check out the FULL CODES here.


def load_dataset(train=True):
    """Load CIFAR-10 dataset"""
    ssl_transform = SimCLRTransform(input_size=32, cj_prob=0.8)
    eval_transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465),
                             (0.2023, 0.1994, 0.2010))
    ])

    base_dataset = torchvision.datasets.CIFAR10(
        root='./data', train=train, download=True
    )

    class SSLDataset(torch.utils.data.Dataset):
        def __init__(self, dataset, transform):
            self.dataset = dataset
            self.transform = transform

        def __len__(self):
            return len(self.dataset)

        def __getitem__(self, idx):
            img, label = self.dataset[idx]
            return self.transform(img), label

    ssl_dataset = SSLDataset(base_dataset, ssl_transform)
    eval_dataset = torchvision.datasets.CIFAR10(
        root='./data', train=train, download=True, transform=eval_transform
    )

    return ssl_dataset, eval_dataset

This dual transformation strategy is fundamental: the SSL transform creates the positive pairs for contrastive learning, while the evaluation transform prepares data in a standard format for performance assessment.

Training the SimCLR Model

Training involves feeding the augmented image pairs to the model and calculating the NT-Xent (Normalized Temperature-scaled Cross-Entropy) loss. This loss function pushes the representations of augmented views of the same image closer together in the embedding space, while pushing representations of different images further apart.
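To make the loss concrete, here is a minimal, self-contained sketch of NT-Xent for a single batch. The training code below uses Lightly's NTXentLoss rather than this hand-rolled version; the function name here is purely illustrative:

import torch
import torch.nn.functional as F

def nt_xent_sketch(z1, z2, temperature=0.5):
    """Minimal NT-Xent sketch: each embedding's positive is its
    counterpart in the other view; the remaining 2N-2 act as negatives."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D), unit norm
    sim = z @ z.T / temperature                         # scaled cosine similarities
    n = z1.size(0)
    sim.fill_diagonal_(float('-inf'))                   # never match a sample to itself
    # The positive for row i is row i+n, and vice versa
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)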

Here, we train our SimCLR model in a self-supervised manner using the NT-Xent contrastive loss, which encourages similar representations for augmented views of the same image. We optimize the model with stochastic gradient descent (SGD) and track the loss across epochs to monitor learning progress. This stage teaches the model to extract meaningful visual features without relying on labeled data. Check out the FULL CODES here.


def train_ssl_model(model, dataloader, epochs=5, device='cuda'):
    """Train SimCLR model"""
    model.to(device)
    criterion = NTXentLoss(temperature=0.5)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.06,
                                momentum=0.9, weight_decay=5e-4)

    print("\n=== Self-Supervised Training ===")
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for batch_idx, batch in enumerate(dataloader):
            views = batch[0]
            view1, view2 = views[0].to(device), views[1].to(device)

            z1 = model(view1)
            z2 = model(view2)
            loss = criterion(z1, z2)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            if batch_idx % 50 == 0:
                print(f"Epoch {epoch+1}/{epochs} | Batch {batch_idx} | Loss: {loss.item():.4f}")

        avg_loss = total_loss / len(dataloader)
        print(f"Epoch {epoch+1} Complete | Avg Loss: {avg_loss:.4f}")

    return model

This training phase is entirely label-agnostic, meaning we don’t need any human annotations for the CIFAR-10 dataset during this crucial representation learning step. The model learns visual semantics purely from the images themselves.

Unlocking Data Efficiency with Embeddings and Coreset Selection

Once our SimCLR model is trained, it’s capable of generating powerful embeddings. These embeddings are numerical representations that capture the semantic essence of an image. Visualizing and intelligently selecting from these embeddings are key to efficient data curation.

Generating and Visualizing Embeddings

After training, we use our model’s backbone to extract features for the entire dataset. These high-dimensional embeddings can then be reduced to two dimensions using techniques like UMAP or t-SNE, allowing us to visualize how the model groups similar images.

We extract high-quality feature embeddings from our trained backbone, cache them with labels, and project them to 2D using UMAP or t-SNE to watch the cluster structure emerge. Next, we curate data using a coreset selector, either class-balanced or diversity-driven (k-center greedy), to prioritize the most informative, non-redundant samples for downstream training. This pipeline helps us both see what the model learns and select what matters most. Check out the FULL CODES here.


def generate_embeddings(model, dataset, device='cuda', batch_size=256):
    """Generate embeddings for the entire dataset"""
    model.eval()
    model.to(device)

    dataloader = DataLoader(dataset, batch_size=batch_size,
                            shuffle=False, num_workers=2)

    embeddings = []
    labels = []

    print("\n=== Generating Embeddings ===")
    with torch.no_grad():
        for images, targets in dataloader:
            images = images.to(device)
            features = model.extract_features(images)
            embeddings.append(features.cpu().numpy())
            labels.append(targets.numpy())

    embeddings = np.vstack(embeddings)
    labels = np.concatenate(labels)
    print(f"Generated {embeddings.shape[0]} embeddings with dimension {embeddings.shape[1]}")

    return embeddings, labels


def visualize_embeddings(embeddings, labels, method='umap', n_samples=5000):
    """Visualize embeddings using UMAP or t-SNE"""
    print(f"\n=== Visualizing Embeddings with {method.upper()} ===")

    # Subsample for tractable dimensionality reduction
    if len(embeddings) > n_samples:
        indices = np.random.choice(len(embeddings), n_samples, replace=False)
        embeddings = embeddings[indices]
        labels = labels[indices]

    if method == 'umap':
        reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric='cosine')
    else:
        reducer = TSNE(n_components=2, perplexity=30, metric='cosine')

    embeddings_2d = reducer.fit_transform(embeddings)

    plt.figure(figsize=(12, 10))
    scatter = plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1],
                          c=labels, cmap='tab10', s=5, alpha=0.6)
    plt.colorbar(scatter)
    plt.title(f'CIFAR-10 Embeddings ({method.upper()})')
    plt.xlabel('Component 1')
    plt.ylabel('Component 2')
    plt.tight_layout()
    plt.savefig(f'embeddings_{method}.png', dpi=150)
    print(f"Saved visualization to embeddings_{method}.png")
    plt.show()

The resulting visualizations provide immediate insight into the model’s understanding of the data, revealing clear clusters for different classes even though it was trained without labels.

Actionable Step 1: Implement Diverse Coreset Selection

Instead of randomly picking data points for labeling, coreset selection aims to choose the most representative or diverse subset. This is crucial for active learning, where we want to gain maximum knowledge from minimal human annotation. We explore two strategies: class-balanced selection and diversity-driven (k-center greedy) selection.


def select_coreset(embeddings, labels, budget=1000, method='diversity'):
    """
    Select a coreset using different strategies:
    - diversity: Maximum diversity using k-center greedy
    - balanced: Class-balanced selection
    """
    print(f"\n=== Coreset Selection ({method}) ===")

    if method == 'balanced':
        selected_indices = []
        n_classes = len(np.unique(labels))
        per_class = budget // n_classes

        for cls in range(n_classes):
            cls_indices = np.where(labels == cls)[0]
            selected = np.random.choice(cls_indices,
                                        min(per_class, len(cls_indices)),
                                        replace=False)
            selected_indices.extend(selected)

        return np.array(selected_indices)

    elif method == 'diversity':
        selected_indices = []
        remaining_indices = set(range(len(embeddings)))

        # Seed the coreset with a random sample
        first_idx = np.random.randint(len(embeddings))
        selected_indices.append(first_idx)
        remaining_indices.remove(first_idx)

        for _ in range(budget - 1):
            if not remaining_indices:
                break

            remaining = list(remaining_indices)
            selected_emb = embeddings[selected_indices]
            remaining_emb = embeddings[remaining]

            # Distance from each remaining point to its nearest selected point
            distances = np.min(
                np.linalg.norm(remaining_emb[:, None] - selected_emb, axis=2),
                axis=1
            )

            # Greedily pick the point farthest from the current coreset
            max_dist_idx = np.argmax(distances)
            selected_idx = remaining[max_dist_idx]
            selected_indices.append(selected_idx)
            remaining_indices.remove(selected_idx)

        print(f"Selected {len(selected_indices)} samples")
        return np.array(selected_indices)

By prioritizing diversity, we ensure that the selected subset covers a wide range of visual concepts, leading to more robust models with less data.

Active Learning Workflow and Performance Evaluation

To quantify the quality of our learned representations and the effectiveness of coreset selection, we simulate an active learning workflow. This involves training a simple classifier on a small, labeled subset of data, then evaluating its performance.

Evaluating with a Linear Probe

A standard way to evaluate the quality of self-supervised features is through a “linear probe.” Here, the pre-trained backbone’s weights are frozen, and only a simple linear classifier is trained on top of its extracted features. If the features are good, even a simple linear model can achieve high accuracy.

We freeze the backbone and train a lightweight linear probe to quantify how good our learned features are, then evaluate accuracy on the test set. In the main pipeline, we pretrain with SimCLR, generate embeddings, visualize them, pick a diverse coreset, and compare linear-probe performance against a random subset, thereby directly measuring the value of smart data curation. Check out the FULL CODES here.


def evaluate_linear_probe(model, train_subset, test_dataset, device='cuda'):
    """Train linear classifier on frozen features"""
    model.eval()

    train_loader = DataLoader(train_subset, batch_size=128,
                              shuffle=True, num_workers=2)
    test_loader = DataLoader(test_dataset, batch_size=256,
                             shuffle=False, num_workers=2)

    classifier = nn.Linear(512, 10).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(classifier.parameters(), lr=0.001)

    for epoch in range(10):
        classifier.train()
        for images, targets in train_loader:
            images, targets = images.to(device), targets.to(device)

            # Backbone stays frozen; only the linear head is trained
            with torch.no_grad():
                features = model.extract_features(images)

            outputs = classifier(features)
            loss = criterion(outputs, targets)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    classifier.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for images, targets in test_loader:
            images, targets = images.to(device), targets.to(device)
            features = model.extract_features(images)
            outputs = classifier(features)
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()

    accuracy = 100. * correct / total
    return accuracy

This method provides an objective measure of how well the self-supervised pre-training has prepared the model for downstream classification tasks.

Actionable Step 2: Simulate Active Learning for Data Efficiency

The full active learning simulation brings everything together. We first pretrain our SimCLR model, then generate embeddings. Using these embeddings, we select a small budget of samples via coreset selection and compare the linear probe accuracy with a randomly selected subset of the same size. This direct comparison highlights the efficiency gains.


def main():
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f"Using device: {device}")

    ssl_dataset, eval_dataset = load_dataset(train=True)
    _, test_dataset = load_dataset(train=False)

    ssl_subset = Subset(ssl_dataset, range(10000))
    ssl_loader = DataLoader(ssl_subset, batch_size=128, shuffle=True,
                            num_workers=2, drop_last=True)

    backbone = torchvision.models.resnet18(pretrained=False)
    model = SimCLRModel(backbone)
    model = train_ssl_model(model, ssl_loader, epochs=5, device=device)

    eval_subset = Subset(eval_dataset, range(10000))
    embeddings, labels = generate_embeddings(model, eval_subset, device=device)
    visualize_embeddings(embeddings, labels, method='umap')

    coreset_indices = select_coreset(embeddings, labels, budget=1000,
                                     method='diversity')
    coreset_subset = Subset(eval_dataset, coreset_indices)

    print("\n=== Active Learning Evaluation ===")
    coreset_acc = evaluate_linear_probe(model, coreset_subset,
                                        test_dataset, device=device)
    print(f"Coreset Accuracy (1000 samples): {coreset_acc:.2f}%")

    random_indices = np.random.choice(len(eval_subset), 1000, replace=False)
    random_subset = Subset(eval_dataset, random_indices)
    random_acc = evaluate_linear_probe(model, random_subset,
                                       test_dataset, device=device)
    print(f"Random Accuracy (1000 samples): {random_acc:.2f}%")

    print(f"\nCoreset improvement: +{coreset_acc - random_acc:.2f}%")

    print("\n=== Tutorial Complete! ===")
    print("Key takeaways:")
    print("1. Self-supervised learning creates meaningful representations without labels")
    print("2. Embeddings capture semantic similarity between images")
    print("3. Smart data selection (coreset) outperforms random sampling")
    print("4. Active learning reduces labeling costs while maintaining accuracy")


if __name__ == "__main__":
    main()

The expected outcome is a notable improvement in accuracy for the model trained on the coreset-selected samples compared to those chosen randomly, demonstrating the power of intelligent data curation.

Real-World Impact: Medical Imaging

Consider a medical imaging startup developing an AI system to detect subtle anomalies in X-ray scans. Acquiring and meticulously labeling thousands of X-rays by expert radiologists is incredibly costly and slow. By leveraging self-supervised learning, they can pre-train a robust model on a vast pool of existing, unlabeled X-rays. Then, using active learning with coreset selection, the system intelligently identifies only the most ambiguous or diverse X-rays that would benefit most from expert labeling. This drastically reduces the number of annotations required, accelerating development, cutting costs, and bringing life-saving diagnostic tools to market faster.

Actionable Step 3: Continuously Iterate and Monitor

The active learning cycle isn’t a one-off process. To truly master data efficiency, continuously iterate on your active learning loops. After labeling the initial coreset, retrain your model, generate new embeddings, and select a new coreset of informative samples for the next labeling round. Monitor the model’s performance and the diversity of chosen samples to refine your coreset selection strategy over time.
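As a rough sketch (not part of the tutorial code), such a loop might look like the following, reusing the functions and variable names from the main pipeline above; the annotation step is a hypothetical placeholder:

# Hedged sketch of an iterative active learning loop. Assumes `model`,
# `eval_subset`, `eval_dataset`, `test_dataset`, and `device` from main().
labeled_indices = set()
for round_idx in range(5):  # e.g., five labeling rounds
    embeddings, labels = generate_embeddings(model, eval_subset, device=device)
    # In practice, you would also exclude already-labeled indices here
    new_indices = select_coreset(embeddings, labels, budget=200,
                                 method='diversity')
    labeled_indices.update(int(i) for i in new_indices)
    # label_samples(new_indices)  # hypothetical human-annotation step
    probe_subset = Subset(eval_dataset, sorted(labeled_indices))
    acc = evaluate_linear_probe(model, probe_subset, test_dataset, device=device)
    print(f"Round {round_idx+1}: {len(labeled_indices)} labeled, acc={acc:.2f}%")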

Conclusion

The intersection of self-supervised learning, efficient data curation, and active learning represents a powerful paradigm shift in how we approach machine learning challenges. By understanding the underlying principles and implementing practical strategies using frameworks like Lightly AI, developers can build more robust, efficient, and scalable models, especially in data-scarce domains.

In conclusion, we have seen how self-supervised learning enables representation learning without manual annotations and how coreset-based data selection improves model generalization with fewer samples. By training a SimCLR model, generating embeddings, curating data, and evaluating through active learning, we walk through the end-to-end process of a modern self-supervised workflow. Combining intelligent data curation with learned representations yields models that are both resource-efficient and performance-optimized, laying a strong foundation for scalable machine learning applications.

The journey from raw, unlabeled data to a high-performing AI model can now be significantly optimized, democratizing access to advanced machine learning for a wider array of applications.

Ready to dive deeper?

Check out the FULL CODES here.

Frequently Asked Questions

What is Self-Supervised Learning (SSL) and why is it important?

Self-Supervised Learning (SSL) is a machine learning paradigm where a model learns representations from unlabeled data by generating supervisory signals from the data itself. It’s crucial because it reduces the dependence on expensive and time-consuming manual data labeling, enabling AI development in data-scarce domains or with vast amounts of readily available unlabeled data.

How does Lightly AI simplify SSL implementation?

Lightly AI provides a framework that simplifies the implementation of Self-Supervised Learning techniques like SimCLR. It offers pre-built modules for projection heads, loss functions (e.g., NTXentLoss), and data transformations (e.g., SimCLRTransform), allowing developers to quickly set up and train SSL models without needing to implement these components from scratch.

What is coreset selection and how does it improve data efficiency?

Coreset selection is a strategy to intelligently pick a representative subset of data points from a larger dataset. Instead of random sampling, it aims to select the most diverse or informative samples, often based on their embeddings. This improves data efficiency by maximizing the knowledge gained from a small, labeled budget, leading to better model performance with significantly less annotation effort.

Can SSL and active learning be applied to real-world scenarios beyond image classification?

Absolutely. While this guide focuses on image classification with CIFAR-10, SSL and active learning principles are highly transferable. They can be applied to various data types, including natural language processing (e.g., pre-training large language models on unlabeled text), audio processing, time-series analysis, and even tabular data, particularly in domains where labeled data is scarce or expensive to obtain.

What is the role of a “linear probe” in evaluating SSL models?

A linear probe is a common method for evaluating the quality of representations learned by an SSL model. It involves “freezing” the weights of the pre-trained backbone and then training a simple, lightweight linear classifier on top of its extracted features using a small amount of labeled data. High accuracy with a linear probe indicates that the self-supervised model has learned robust and semantically meaningful features, which can be effectively transferred to downstream tasks.
