A Coding Implementation to Build a Transformer-Based Regression Language Model to Predict Continuous Values from Text

Estimated reading time: 8 minutes
- Regression Language Models (RLMs) utilize transformer architectures to predict continuous numerical values directly from text sequences, extending NLP capabilities beyond traditional classification or generation.
- The implementation process involves generating synthetic text-to-number data, building an efficient tokenizer to convert text into numerical tokens, and then training a lightweight Transformer encoder.
- The core architecture of the RLM includes token and positional embeddings, a multi-layer Transformer encoder, and a masked mean-pooling operation to aggregate contextual information, followed by a simple Multi-Layer Perceptron (MLP) head for regression.
- Training involves optimizing model parameters using the Adam optimizer and Mean Squared Error (MSE) loss, with critical monitoring of both training and validation losses to ensure effective generalization and prevent overfitting.
- To enhance RLMs, consider using real-world datasets, fine-tuning pre-trained language models (PLMs) like BERT, and exploring more advanced pooling strategies or complex regression heads for improved performance.
- Setting the Stage: Data Generation and Tokenization
- Architecting the Transformer-Based Regression Language Model
- Training, Evaluation, and Real-World Impact
- Taking the Next Steps: Actionable Enhancements
- Conclusion
The landscape of Natural Language Processing (NLP) has been revolutionized by transformer architectures, primarily known for tasks like text classification, translation, and generation. However, their utility extends far beyond these traditional boundaries. Imagine a model that doesn’t just understand the sentiment of a sentence, but quantifies its intensity; or one that extracts a specific numerical value from a complex textual description. This is the domain of Regression Language Models (RLMs).
In this coding implementation, we build a Regression Language Model (RLM), a model that predicts continuous numerical values directly from text sequences. Instead of classifying or generating text, we train a transformer-based architecture that learns quantitative relationships hidden within natural language descriptions. We start by generating synthetic text-to-number data, tokenize it efficiently, and then train a lightweight Transformer encoder to map linguistic cues to real-valued targets. By the end, we not only understand how RLMs can be implemented from scratch but also visualize their learning behavior and test their generalization on unseen examples.
This article dives into the practical steps of constructing an RLM using PyTorch, guiding you through data preparation, model architecture design, training, and evaluation. We’ll demonstrate how to leverage the power of transformers to bridge the gap between human language and numerical reasoning, opening up new possibilities for quantitative analysis from textual data.
Setting the Stage: Data Generation and Tokenization
Every robust machine learning model begins with well-prepared data. For an RLM, this means pairing text with its corresponding continuous numerical target. To simplify this tutorial and ensure a clear demonstration of the RLM’s capabilities, we begin by generating a synthetic dataset. This controlled environment allows us to observe the model’s learning process without the complexities of real-world data noise and biases.
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
from collections import Counter
import re

torch.manual_seed(42)
np.random.seed(42)

print("Regression Language Model (RLM) Tutorial")
print("=" * 60)

We begin by importing the libraries we need, including PyTorch, NumPy, and Matplotlib, to build and visualize our Regression Language Model. We also set random seeds so that results are reproducible from run to run.

def generate_synthetic_data(n_samples=2000):
    """Generate synthetic text-to-number regression data"""
    # Each entry pairs a sentence template with a function that maps the
    # sampled value to the regression target.
    templates = [
        ("The temperature is {} degrees", lambda x: x),
        ("I rate this {} out of ten", lambda x: x),
        ("The price is {} dollars", lambda x: x),
        ("Confidence level: {}", lambda x: x / 100),
        ("Speed of {} kilometers per hour", lambda x: x / 10),
        ("{} percent complete", lambda x: x / 100),
        ("Scored {} points in the game", lambda x: x / 10),
        ("The distance is {} meters", lambda x: x),
    ]
    data = []
    for _ in range(n_samples):
        template, transform = templates[np.random.randint(len(templates))]
        value = np.random.uniform(0, 100)
        # The text shows the value rounded to one decimal place;
        # the target is computed from the unrounded value.
        text = template.format(round(value, 1))
        target = transform(value)
        data.append((text, target))
    return data

We create a synthetic dataset that pairs natural language sentences with corresponding numerical values. By using varied templates such as temperatures, ratings, and percentages, we ensure the model learns diverse text–number relationships. This controlled setup helps us simulate realistic regression tasks without relying on external data.
The generate_synthetic_data function employs various sentence templates (e.g., “The temperature is {} degrees”, “I rate this {} out of ten”) and randomly generated numerical values. Each template includes a transformation function to create diverse target values. This method allows us to build a dataset that effectively simulates real-world scenarios where numbers are embedded within natural language, from direct mentions to implied ratios or scales.
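To make the template mechanism concrete, the sketch below applies two of the templates by hand. The helper name make_example and the fixed input values are illustrative assumptions and are not part of the tutorial's code.

```python
# Illustrative sketch: how one (template, transform) pair becomes a training example.
# `make_example` is a hypothetical helper, not part of the tutorial's code.

def make_example(template, transform, value):
    """Format the rounded value into the sentence and compute the regression target."""
    text = template.format(round(value, 1))
    target = transform(value)
    return text, target

# Same structure as the entries in `templates` inside generate_synthetic_data.
percent_template = ("{} percent complete", lambda x: x / 100)
temperature_template = ("The temperature is {} degrees", lambda x: x)

text_a, target_a = make_example(*percent_template, 75.0)
text_b, target_b = make_example(*temperature_template, 25.5)

print(text_a, "->", target_a)  # 75.0 percent complete -> 0.75
print(text_b, "->", target_b)  # The temperature is 25.5 degrees -> 25.5
```

Because the transforms divide by 1, 10, or 100, targets fall on different scales depending on the template, so the model has to read the surrounding words and not only the number.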
Once the data is generated, the next crucial step is tokenization. Machines don’t understand words directly; they process numerical representations. Our SimpleTokenizer class serves this purpose. It scans through the training texts to build a comprehensive vocabulary, assigning a unique index to each word. It also incorporates special tokens like <PAD> for sequence alignment and <UNK> for out-of-vocabulary words. This ensures that every input sentence is converted into a consistent sequence of numerical tokens, ready for the neural network.
class SimpleTokenizer:
    def __init__(self):
        # Index 0 is reserved for padding, index 1 for unknown words.
        self.word2idx = {"<PAD>": 0, "<UNK>": 1}
        self.idx2word = {0: "<PAD>", 1: "<UNK>"}
        self.vocab_size = 2

    def fit(self, texts):
        """Build vocabulary from texts"""
        words = []
        for text in texts:
            words.extend(re.findall(r'\w+|[^\w\s]', text.lower()))
        word_counts = Counter(words)
        # Assign indices in order of decreasing frequency.
        for word, _ in word_counts.most_common():
            if word not in self.word2idx:
                self.word2idx[word] = self.vocab_size
                self.idx2word[self.vocab_size] = word
                self.vocab_size += 1

    def encode(self, text, max_len=20):
        """Convert text to token indices"""
        words = re.findall(r'\w+|[^\w\s]', text.lower())
        indices = [self.word2idx.get(w, 1) for w in words]
        # Pad with <PAD> (0) or truncate so every sequence has length max_len.
        if len(indices) < max_len:
            indices += [0] * (max_len - len(indices))
        else:
            indices = indices[:max_len]
        return indices

We design a simple tokenizer to convert raw text into numerical tokens that the model can process. It builds a vocabulary from all unique words and maps each to an index, handling unknown words and padding automatically. This step ensures our textual inputs are transformed into consistent, machine-readable sequences for training.
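It can help to see what the tokenizer's regular expression does to a sentence that contains a decimal number. The sketch below reuses the same pattern on one example; the helper split_tokens and the tiny toy_vocab are assumptions made up for illustration, so the ids do not match what fit would produce on the real training data.

```python
import re

# Same pattern SimpleTokenizer uses: runs of word characters, or any single
# character that is neither a word character nor whitespace.
TOKEN_PATTERN = r'\w+|[^\w\s]'

def split_tokens(text):
    """Lowercase the text and split it the way SimpleTokenizer does."""
    return re.findall(TOKEN_PATTERN, text.lower())

tokens = split_tokens("The temperature is 25.5 degrees")
print(tokens)
# ['the', 'temperature', 'is', '25', '.', '5', 'degrees']

# Padding to a fixed length, mirroring encode(); this vocabulary is made up.
toy_vocab = {"<PAD>": 0, "<UNK>": 1, "the": 2, "temperature": 3, "is": 4}
ids = [toy_vocab.get(t, 1) for t in tokens]
max_len = 10
padded = (ids + [0] * (max_len - len(ids)))[:max_len]
print(padded)
# [2, 3, 4, 1, 1, 1, 1, 0, 0, 0]
```

Note that "25.5" is split into three tokens ("25", ".", "5"), so the model has to combine the integer and fractional parts itself.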
Architecting the Transformer-Based Regression Language Model
With our data tokenized, we move to the core of our implementation: the Transformer architecture adapted for regression. This involves defining a custom PyTorch Dataset for efficient data loading and then building the RegressionLanguageModel itself. The model leverages token and positional embeddings, a multi-layer Transformer encoder, and a simple feed-forward network to output continuous values.
class RLMDataset(Dataset):
    def __init__(self, data, tokenizer, max_len=20):
        self.data = data
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text, target = self.data[idx]
        tokens = self.tokenizer.encode(text, self.max_len)
        # Token ids must be integers for nn.Embedding; targets are floats.
        return (torch.tensor(tokens, dtype=torch.long),
                torch.tensor([target], dtype=torch.float32))


class RegressionLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_heads=4, num_layers=2,
                 dropout=0.1, max_len=20):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.position_embedding = nn.Embedding(max_len, embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=embed_dim * 4,
            dropout=dropout,
            batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Regression head: embed_dim -> 64 -> 1
        self.fc1 = nn.Linear(embed_dim, 64)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(64, 1)
        self.max_len = max_len

    def forward(self, x):
        batch_size, seq_len = x.shape
        positions = torch.arange(0, seq_len, device=x.device).unsqueeze(0).expand(batch_size, -1)
        token_embed = self.token_embedding(x)
        pos_embed = self.position_embedding(positions)
        embeddings = token_embed + pos_embed
        # True at padding positions so attention ignores them.
        padding_mask = (x == 0)
        encoded = self.transformer(embeddings, src_key_padding_mask=padding_mask)
        # Masked mean-pooling over non-padding tokens.
        mask_expanded = (~padding_mask).unsqueeze(-1).float()
        summed = (encoded * mask_expanded).sum(dim=1)
        # clamp guards against division by zero for an all-padding sequence.
        pooled = summed / mask_expanded.sum(dim=1).clamp(min=1.0)
        x = self.fc1(pooled)
        x = self.relu(x)
        x = self.dropout(x)
        output = self.fc2(x)
        return output

We package our text–number pairs into a PyTorch Dataset, where we tokenize each sentence and return tensors ready for batching. We then build a Transformer-based RLM: token and positional embeddings flow through a multi-layer encoder, we mean-pool non-padded tokens, and feed the result to a small MLP head for regression. In effect, we allow the encoder to learn numerical cues from language, while the head maps them to a single continuous value.
The RegressionLanguageModel integrates several key components. First, Token and Positional Embeddings convert token indices into dense vector representations and add information about their order in the sequence, crucial for transformer understanding. These embeddings are summed to create a rich input representation. Second, a Transformer Encoder processes these embeddings. This stack of attention layers excels at capturing long-range dependencies and contextual relationships within the text. Third, after the encoder, we perform a masked mean-pooling operation on the output. This aggregates the context-rich token representations into a single vector, effectively summarizing the entire input sequence while ignoring padding tokens. Finally, this pooled representation is fed into a small Multi-Layer Perceptron (MLP) head (fc1, relu, dropout, fc2), which is tasked with regressing to the final continuous numerical value.
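The masked mean-pooling step is easiest to follow on a tiny hand-made tensor. The sketch below isolates that step as a standalone function; the name masked_mean_pool, the toy values, and the guard against empty sequences are assumptions for illustration, while the masking logic follows the description above.

```python
import torch

def masked_mean_pool(encoded, token_ids, pad_id=0):
    """Average encoder outputs over non-padding positions only.

    encoded:   (batch, seq_len, dim) float tensor
    token_ids: (batch, seq_len) integer tensor, pad_id marks padding
    """
    mask = (token_ids != pad_id).unsqueeze(-1).float()  # (batch, seq_len, 1)
    summed = (encoded * mask).sum(dim=1)                 # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1.0)              # avoid division by zero
    return summed / counts

# Toy batch: one sequence with two real tokens followed by two pads.
token_ids = torch.tensor([[5, 7, 0, 0]])
encoded = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0], [100.0, 100.0]]])

pooled = masked_mean_pool(encoded, token_ids)
naive = encoded.mean(dim=1)  # plain mean, distorted by the padding rows

print(pooled)  # tensor([[2., 3.]])
print(naive)
```

Only the first two rows contribute to the pooled vector, whereas the plain mean is pulled toward whatever values sit at the padded positions.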
Training, Evaluation, and Real-World Impact
Training an RLM involves optimizing its parameters to minimize the difference between its predictions and the true target values. We employ standard practices, including using the Adam optimizer and Mean Squared Error (MSE) loss, which is well-suited for regression tasks.
# Use a GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def train_rlm(model, train_loader, val_loader, epochs=15, lr=0.001):
    model.to(device)  # parameters must live on the same device as the batches
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    train_losses, val_losses = [], []
    print(f"\nTraining on {device}")
    print("-" * 60)
    for epoch in range(epochs):
        # Training pass
        model.train()
        train_loss = 0
        for tokens, targets in train_loader:
            tokens, targets = tokens.to(device), targets.to(device)
            optimizer.zero_grad()
            outputs = model(tokens)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        train_loss /= len(train_loader)
        train_losses.append(train_loss)

        # Validation pass (no gradient tracking, dropout disabled)
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for tokens, targets in val_loader:
                tokens, targets = tokens.to(device), targets.to(device)
                outputs = model(tokens)
                loss = criterion(outputs, targets)
                val_loss += loss.item()
        val_loss /= len(val_loader)
        val_losses.append(val_loss)
        print(f"Epoch {epoch+1:2d}/{epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")
    return train_losses, val_losses

We train the model with Adam and MSE loss, using a GPU when one is available, and iterate over mini-batches to backpropagate and update weights. We switch to evaluation mode for validation at the end of each epoch, track training and validation losses, and print progress so we can follow the learning dynamics as training runs.
During training, the model processes mini-batches of tokenized sentences, calculates the loss, and updates its weights via backpropagation. Crucially, we monitor both training and validation losses. The validation loss indicates how well the model generalizes to unseen data, helping us detect overfitting. Visualizing these loss curves provides immediate feedback on the learning dynamics, showing if the model is converging effectively.
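One simple way to act on the validation curve is to find the epoch with the lowest validation loss and check how long it has been since that point. The sketch below is one possible heuristic, not something the tutorial's training loop does; the function name find_best_epoch, the patience value, and the loss numbers are all invented for illustration.

```python
def find_best_epoch(val_losses, patience=3):
    """Return (best_epoch, stalled) from a per-epoch validation loss history.

    best_epoch is the 0-based epoch with the lowest validation loss.
    stalled is True when at least `patience` epochs have passed since
    that best epoch without a new minimum, a common sign of overfitting.
    """
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    best_epoch = min(range(len(val_losses)), key=lambda i: val_losses[i])
    epochs_since_best = len(val_losses) - 1 - best_epoch
    return best_epoch, epochs_since_best >= patience

# Invented loss history: validation loss bottoms out, then climbs again.
example_val_losses = [900.0, 400.0, 120.0, 60.0, 75.0, 90.0, 110.0]
best, stalled = find_best_epoch(example_val_losses)
print(best, stalled)  # 3 True
```

In the tutorial's setting, the val_losses list returned by train_rlm could be passed to a helper like this after training, or the same check could run inside the epoch loop to stop early.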
print("\nGenerating synthetic data...")
data = generate_synthetic_data(2000)
split_idx = int(0.8 * len(data))
train_data, val_data = data[:split_idx], data[split_idx:]
print(f"Train samples: {len(train_data)}, Val samples: {len(val_data)}")

print("\nBuilding tokenizer...")
tokenizer = SimpleTokenizer()
tokenizer.fit([text for text, _ in train_data])
print(f"Vocabulary size: {tokenizer.vocab_size}")

train_dataset = RLMDataset(train_data, tokenizer)
val_dataset = RLMDataset(val_data, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

print("\nBuilding Regression Language Model...")
model = RegressionLanguageModel(vocab_size=tokenizer.vocab_size)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

train_losses, val_losses = train_rlm(model, train_loader, val_loader)

plt.figure(figsize=(10, 4))
plt.plot(train_losses, label='Train Loss', linewidth=2)
plt.plot(val_losses, label='Val Loss', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.title('RLM Training Progress')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("\nTesting Predictions:")
print("-" * 60)
test_examples = [
    "The temperature is 25.5 degrees",
    "I rate this 8.0 out of ten",
    "The price is 45.0 dollars",
    "75.0 percent complete"
]
with torch.no_grad():
    for text in test_examples:
        tokens = torch.tensor([tokenizer.encode(text)]).to(device)
        prediction = model(tokens).item()
        print(f"Input: {text}")
        print(f"Predicted value: {prediction:.4f}\n")

print("RLM Tutorial Complete!")

We generate and split synthetic data, fit our tokenizer, wrap everything in PyTorch datasets/loaders, and build the Transformer-based RLM. We train the model, visualize loss curves to verify learning, and then run a few natural-language test prompts to see the predicted continuous values.
After training, we assess the model’s generalization by testing it on a few natural language prompts. This step confirms whether the RLM has successfully learned the numerical cues within the text and can accurately predict continuous values for unseen inputs. With that, we complete the end-to-end RLM pipeline.
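Spot-checking a handful of prompts is informative, but summary metrics make the comparison quantitative. The sketch below computes mean absolute error and root mean squared error from paired lists; the function name regression_metrics and the numbers are made up for illustration and are not results from the tutorial.

```python
import math

def regression_metrics(predictions, targets):
    """Compute MAE and RMSE for paired lists of floats."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("predictions and targets must be non-empty and equal length")
    errors = [p - t for p, t in zip(predictions, targets)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse

# Invented predictions and targets, purely for illustration.
example_preds = [24.0, 8.5, 44.0, 1.0]
example_targets = [25.0, 8.0, 45.0, 0.5]

mae, rmse = regression_metrics(example_preds, example_targets)
print(f"MAE: {mae:.3f}, RMSE: {rmse:.3f}")  # MAE: 0.750, RMSE: 0.791
```

With the tutorial's objects, the predictions could be collected by running the model over val_loader under torch.no_grad() and converting the outputs to Python floats. MAE is in the same units as the target, which makes it easier to interpret than MSE.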
Real-World Example: Predicting Review Scores
Beyond our synthetic data, Regression Language Models have compelling real-world applications. Consider an e-commerce platform where customers leave detailed text reviews but sometimes omit a numerical star rating. An RLM could analyze the textual content of a review like “The product arrived quickly, but the quality was slightly below expectations, though still usable for the price point.” and predict a precise numerical rating (e.g., 3.7 out of 5) based on the nuanced sentiment and feedback expressed. This can help quantify user satisfaction even when explicit scores are missing, providing richer insights into product performance.
Taking the Next Steps: Actionable Enhancements
Implementing a basic RLM is a great starting point. To further enhance its capabilities or adapt it to specific problems, consider these actionable steps:
1. Experiment with Real-World Datasets and Data Augmentation: While synthetic data is excellent for proof-of-concept, real-world data introduces richer linguistic patterns. Explore public datasets with text-to-number relationships (e.g., product reviews with ratings, news articles with stock price impacts, medical notes with severity scores). Additionally, apply data augmentation techniques like paraphrasing or back-translation to increase the diversity of your training text without collecting more data.
2. Fine-tune Pre-trained Language Models (PLMs): Instead of building a transformer from scratch, leverage powerful pre-trained models like BERT, RoBERTa, or even smaller, more efficient options. Initialize your RLM’s encoder with a pre-trained PLM and then fine-tune it on your regression task. This often leads to significantly better performance due to the PLM’s extensive general language understanding.
3. Explore Advanced Pooling and Regression Heads: Our current implementation uses simple mean-pooling. Investigate other pooling strategies such as using a special [CLS] token’s output (common in BERT-like models), attention pooling, or even training a small recurrent neural network (RNN) on top of the transformer’s sequence output to condense information. You could also experiment with more complex MLP architectures for the regression head, adding more layers or different activation functions.
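As one example of the pooling alternatives mentioned in step 3, the sketch below shows a minimal attention-pooling module that could stand in for masked mean-pooling. It is an assumption about how such a layer might look, not code from the tutorial; the class name AttentionPooling and the toy shapes are hypothetical.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Learned weighted average over token positions, ignoring padding."""

    def __init__(self, embed_dim):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)  # one scalar score per token

    def forward(self, encoded, padding_mask):
        # encoded: (batch, seq_len, dim)
        # padding_mask: (batch, seq_len) bool, True at padding positions
        scores = self.score(encoded).squeeze(-1)               # (batch, seq_len)
        # Padding gets -inf so its softmax weight is exactly zero.
        # Caveat: a row that is entirely padding would yield NaN weights.
        scores = scores.masked_fill(padding_mask, float("-inf"))
        weights = torch.softmax(scores, dim=1).unsqueeze(-1)   # (batch, seq_len, 1)
        return (encoded * weights).sum(dim=1)                  # (batch, dim)

torch.manual_seed(0)
pool = AttentionPooling(embed_dim=8)
encoded = torch.randn(2, 5, 8)
padding_mask = torch.tensor([[False, False, False, True, True],
                             [False, False, False, False, False]])
pooled = pool(encoded, padding_mask)
print(pooled.shape)  # torch.Size([2, 8])
```

Unlike mean-pooling, the weights here are learned, so the model can emphasize the tokens that carry the number and the unit words that determine the target scale.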
Conclusion
In conclusion, we successfully designed, trained, and evaluated a Regression Language Model capable of predicting continuous values from textual inputs. We observe how combining positional embeddings, transformer encoders, and a simple regression head enables the model to capture the numerical semantics embedded in language. By generating synthetic data, visualizing training progress, and testing predictions, we demonstrate how RLMs bridge the gap between language understanding and numerical reasoning.
Frequently Asked Questions (FAQ)
1. What is a Regression Language Model (RLM)?
A Regression Language Model (RLM) is a type of transformer-based model designed to predict continuous numerical values directly from text inputs. Unlike traditional NLP models that classify or generate text, RLMs focus on quantifying relationships and extracting numerical insights embedded within natural language.
2. Why use synthetic data for training an RLM?
Synthetic data is used in this tutorial to provide a controlled environment for demonstrating RLM capabilities. It simplifies the learning process by removing real-world data noise and biases, allowing for a clearer understanding of how the model learns text-to-number relationships. It’s ideal for proof-of-concept and initial model development.
3. How does the Transformer encoder contribute to predicting continuous values?
The Transformer encoder is crucial because its multi-head self-attention mechanism excels at capturing long-range dependencies and complex contextual relationships within text. By processing token and positional embeddings, it learns a rich representation of the input sentence, which is then fed to a regression head to predict the continuous numerical value.
4. What are the key components of the RegressionLanguageModel architecture?
The RegressionLanguageModel architecture comprises several key components: Token and Positional Embeddings for input representation, a multi-layer Transformer Encoder for contextual understanding, a Masked Mean-Pooling operation to aggregate sequence information, and a final Multi-Layer Perceptron (MLP) Head to regress to the continuous numerical target.
5. What are some real-world applications of RLMs?
RLMs have various real-world applications beyond synthetic data. Examples include predicting numerical star ratings from product reviews, estimating stock price movements based on news articles, quantifying disease severity from medical notes, or even assessing risk scores from financial reports, essentially bridging the gap between qualitative text and quantitative outcomes.