
Can a Small Language Model Predict Kernel Latency, Memory, and Model Accuracy from Code? A New Regression Language Model (RLM) Says Yes

Estimated reading time: 9-10 minutes

  • The Regression Language Model (RLM) is a unified approach that predicts numeric outcomes like GPU kernel latency, program memory usage, and neural network accuracy directly from raw code strings.
  • It utilizes a 300M-parameter T5-Gemma-initialized encoder-decoder, decoding numbers as text with constrained decoding, which allows for expressing uncertainty.
  • RLM eliminates the need for hand-engineered features, sophisticated graph encoders, or time-consuming profiling, offering a more robust and adaptable solution.
  • It achieves strong correlations across diverse tasks, including APPS LeetCode Memory (ρ ≈ 0.93), CodeNet Memory (ρ > 0.5), Triton Kernel Latency (ρ ≈ 0.52), and NAS Ranking (τ ≈ 0.46).
  • An open-source `regress-lm` library and the Code-Regression dataset are available, making this technology accessible for research, fine-tuning, and integration into optimization workflows.


In the complex world of software optimization, predicting crucial performance metrics like GPU kernel latency, program memory usage, or even neural network accuracy has traditionally been a formidable challenge. Developers and researchers often rely on laborious hand-engineered features, sophisticated graph encoders, or time-consuming profiling to gain insights. These bespoke solutions are often brittle, demanding significant maintenance and struggling to adapt to new programming languages, hardware architectures, or computational tasks. Imagine a future where a single, unified model could read raw code and instantly tell you its performance characteristics, eliminating much of this guesswork.

This future is now within reach, thanks to a groundbreaking innovation. Researchers from Cornell and Google introduce a unified Regression Language Model (RLM) that predicts numeric outcomes directly from code strings—covering GPU kernel latency, program memory usage, and even neural network accuracy and latency—without hand-engineered features. A 300M-parameter encoder–decoder initialized from T5-Gemma achieves strong rank correlations across heterogeneous tasks and languages, using a single text-to-number decoder that emits digits with constrained decoding. This new Regression Language Model (RLM) represents a paradigm shift, treating performance prediction not as a bespoke engineering problem, but as a language modeling task.

The RLM Breakthrough: Unifying Code-to-Metric Regression

The core innovation of the RLM lies in its unified approach to code-to-metric regression. Instead of building separate, specialized models for each prediction task, RLM consolidates these diverse challenges into a single framework. This means one model can perform an impressive array of predictions:

  • Peak memory usage from high-level code written in languages like Python, C, C++, and more.
  • Latency for Triton GPU kernels, critical for optimizing deep learning workloads.
  • Accuracy and hardware-specific latency directly from ONNX graphs, essential for neural architecture search (NAS).

What makes this truly revolutionary is that it achieves these predictions by simply reading raw text representations of code or ONNX graphs and decoding numeric outputs. There’s no need for intricate feature engineering, complex graph encoders that struggle with novel operations or languages, or unreliable zero-cost proxies. This standardization dramatically reduces maintenance costs and significantly improves transferability, allowing the model to be fine-tuned for new tasks or hardware with greater ease.

The model’s design also enables a powerful feature: multi-objective decoding. Because the decoder operates autoregressively, it can condition later metric predictions on earlier ones. For example, it can predict model accuracy and then use that information to refine its prediction of per-device latencies. This capability is vital for understanding realistic trade-offs and navigating complex optimization landscapes, such as exploring Pareto fronts in multi-objective NAS scenarios where you might balance accuracy against various hardware constraints.
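To make that conditioning concrete, here is a minimal Python sketch of the idea. Everything in it is illustrative: `decode_metric` is a hypothetical stand-in for the model's constrained digit decoder, and the token strings are invented, not the paper's actual vocabulary.

```python
# Illustrative sketch of multi-objective autoregressive decoding.
# decode_metric is a hypothetical stand-in for the RLM's constrained
# digit decoder; the token strings below are invented for illustration.

def decode_metric(context: str, metric: str) -> str:
    # A real model would emit sign/mantissa/exponent tokens conditioned
    # on the full context, which already contains earlier metrics.
    fake = {
        "accuracy": "<+><9><4><1><0><E-1>",   # stands in for ~0.941
        "latency_ms": "<+><2><7><5><0><E1>",  # stands in for ~27.5
    }
    return fake[metric]

context = "<ONNX graph text for a candidate architecture>"
predictions = {}
for metric in ["accuracy", "latency_ms"]:
    tokens = decode_metric(context, metric)
    predictions[metric] = tokens
    # Append the decoded metric to the context, so the next metric
    # (latency) is conditioned on the one decoded before it (accuracy).
    context += f" <{metric}> {tokens}"

print(predictions)
```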

How Does RLM Work Its Magic?

At its heart, RLM is an encoder-decoder architecture initialized from T5-Gemma, with approximately 300 million parameters. This robust backbone is designed to process raw string inputs, whether they are lines of source code, Triton intermediate representations (IR), or ONNX graph descriptions. The magic truly happens in the output phase: instead of a traditional regression head that outputs a single number and might struggle with diverse scales or uncertainties, RLM decodes numbers as text. It emits digit tokens (representing sign, exponent, and mantissa) in an autoregressive fashion.

Crucially, this decoding process employs “constrained decoding.” This ensures that the model only generates valid numerical sequences, preventing nonsensical outputs. Furthermore, this textual emission of numbers supports expressing uncertainty through sampling. This approach offers a more flexible and robust method for numeric prediction than conventional mean squared error (MSE) regression heads, even when those heads use output normalization techniques.
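As a rough illustration of both ideas, the sketch below renders a float as sign/mantissa/exponent tokens and shows the kind of grammar mask that constrained decoding applies at each step. The token format and fixed four-digit mantissa are assumptions for illustration, not the paper's exact scheme.

```python
# Minimal sketch of digit tokenization plus a constrained-decoding grammar
# mask. The token vocabulary and fixed mantissa length are assumptions.
import math

SIGNS = ["<+>", "<->"]
DIGITS = [f"<{d}>" for d in range(10)]
EXPONENTS = [f"<E{e}>" for e in range(-10, 11)]
MANTISSA_DIGITS = 4

def to_tokens(y: float) -> list[str]:
    """Render a float as <sign><mantissa digits...><exponent> tokens."""
    sign = "<+>" if y >= 0 else "<->"
    y = abs(y)
    exp = math.floor(math.log10(y)) if y > 0 else 0
    mantissa = round(y / 10 ** (exp - MANTISSA_DIGITS + 1))
    return [sign] + [f"<{d}>" for d in str(mantissa)] + [f"<E{exp}>"]

def allowed_next(prefix: list[str]) -> list[str]:
    """Grammar mask: tokens that are valid after `prefix`. At decode time,
    logits for every other token are masked out, so the model can only
    produce a well-formed number."""
    if not prefix:
        return SIGNS
    if len(prefix) <= MANTISSA_DIGITS:
        return DIGITS
    return EXPONENTS

print(to_tokens(0.00123))            # ['<+>', '<1>', '<2>', '<3>', '<0>', '<E-3>']
print(allowed_next(["<+>", "<1>"]))  # only digit tokens are legal here
```

Because the number is emitted token by token, sampling multiple decodes yields a distribution over values, which is where the uncertainty estimates come from.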

Ablation studies provided deep insights into RLM’s effectiveness:

  • Language pretraining proved essential, accelerating convergence and significantly boosting Triton latency prediction.
  • Emitting numbers as digit tokens from the decoder consistently outperformed traditional MSE regression heads.
  • Learned tokenizers, specialized for ONNX operators, were instrumental in increasing the effective context window for the model.
  • Naturally, longer input contexts generally led to improved predictions.
  • Scaling to a larger Gemma encoder, when paired with adequate tuning, further enhanced correlation scores, demonstrating the model’s scalability.

The open-source `regress-lm` library provides all the necessary text-to-text regression utilities, constrained decoding logic, and recipes for multi-task pretraining and fine-tuning, making this advanced methodology accessible to the wider research and development community.
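For a feel of the workflow, here is a short sketch in the spirit of the library's documented usage; the class and method names are paraphrased from memory, so treat them as assumptions and check the repository for the authoritative API.

```python
# Sketch of the regress-lm workflow (names paraphrased from the project
# README; verify against the repository, as the API may have changed).
from regress_lm import core, rlm

# Load a default RegressLM with a maximum input token length.
reg_lm = rlm.RegressLM.from_default(max_input_len=2048)

# Fine-tune on (text, number) pairs, e.g. kernel source -> measured latency.
examples = [
    core.Example(x="<triton kernel source A>", y=0.42),
    core.Example(x="<triton kernel source B>", y=1.37),
]
reg_lm.fine_tune(examples)

# Sample several numeric predictions per query; the spread of the samples
# doubles as an uncertainty estimate.
query = core.ExampleInput(x="<triton kernel source C>")
(samples,) = reg_lm.sample([query], num_samples=128)
```

Summarizing the 128 samples, for example by their median, gives a point estimate, while their spread signals how much to trust it.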

Concrete Results and Real-World Impact

The RLM’s performance is not merely theoretical; it delivers compelling concrete results across a diverse range of tasks. These strong correlations demonstrate its practical utility in various optimization pipelines:

  • APPS LeetCode Memory: Achieved an impressive Spearman ρ ≈ 0.93 for predicting peak memory usage from Python code.
  • CodeNet Memory: Showed an average Spearman ρ > 0.5 across 17 different programming languages, with particularly strong results for C/C++ (~0.74–0.75).
  • Triton Kernel Latency: Demonstrated a Spearman ρ ≈ 0.52 for predicting latency on NVIDIA A6000 GPUs, a significant step forward for GPU kernel optimization.
  • NAS Ranking: Achieved a Kendall τ ≈ 0.46 on average across five classic Neural Architecture Search spaces (NASNet, Amoeba, PNAS, ENAS, DARTS), making it competitive with, and in some cases surpassing, sophisticated graph-based predictors.

These correlations are not just academic achievements; they are strong enough to have a tangible impact on real-world engineering challenges. Consider a compiler trying to optimize code for a specific GPU. Instead of relying on pre-computed tables or brittle heuristics, an RLM could directly analyze candidate kernel code and predict its latency, guiding the compiler to select the most efficient option without extensive, time-consuming profiling runs. Similarly, in Neural Architecture Search, an RLM could quickly evaluate the performance of thousands of candidate architectures based on their ONNX graph representations, drastically pruning the search space and accelerating the discovery of optimal models, capturing realistic trade-offs along Pareto fronts.

The introduction of the Code-Regression dataset, available on Hugging Face, further solidifies this research by providing a unified resource for code-to-metric tasks, spanning APPS/LeetCode runs, Triton kernel latencies (derived from KernelBook), and CodeNet memory footprints. This, coupled with the NAS/ONNX suite encompassing various well-known architectures, ensures a robust foundation for future research and application development.

The significance of this work cannot be overstated. By reframing performance prediction as a text-to-number generation problem, a compact, ~300M-parameter T5-Gemma-initialized RLM can read diverse code representations and emit calibrated numerical predictions. This capability provides a powerful, generalized tool for compiler heuristics, GPU kernel selection, and multi-objective NAS triage, moving beyond the limitations of bespoke features or Graph Neural Networks. The open dataset and library further lower the barrier to entry, inviting replication and fine-tuning for new hardware or languages.

Actionable Steps for Developers and Researchers

  1. Explore the `regress-lm` Library and Code-Regression Dataset: Dive into the open-source tools provided by the researchers. Experiment with the `regress-lm` library to understand the training and decoding stack, and explore the Code-Regression dataset to familiarize yourself with the unified benchmarks.
  2. Fine-tune RLM for Custom Hardware or Languages: Leverage the provided recipes to fine-tune an RLM instance for your specific hardware platform (e.g., a new GPU architecture) or for programming languages not extensively covered in the initial benchmarks, extending its predictive power to novel domains.
  3. Integrate RLM-driven Prediction into Optimization Workflows: Consider how RLM’s capabilities can enhance your existing compiler pipelines, GPU kernel selection heuristics, or Neural Architecture Search strategies. Using RLM for rapid, initial performance estimates can significantly reduce the need for expensive profiling or exhaustive search, streamlining development and optimization cycles; a toy triage sketch follows this list.
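As a toy illustration of step 3, the snippet below ranks candidate kernels by the median of sampled latency predictions and keeps only the top candidates for real profiling. The sample values are invented stand-ins for RLM output.

```python
import statistics

# Made-up RLM latency samples (ms) for three candidate kernels; in practice
# these would come from sampling the model, e.g. reg_lm.sample(...).
samples = {
    "kernel_a": [2.0, 2.2, 2.1, 2.4],
    "kernel_b": [1.2, 1.4, 1.3, 1.5],
    "kernel_c": [1.3, 3.0, 1.1, 2.8],  # high variance: worth confirming
}

# Rank by median predicted latency and profile only the top-k candidates,
# cutting most of the expensive measurement runs.
ranked = sorted(samples, key=lambda k: statistics.median(samples[k]))
top_k = ranked[:2]
print(top_k)  # ['kernel_b', 'kernel_c'] -> only these get profiled
```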

Conclusion

The Regression Language Model (RLM) marks a pivotal advancement in the field of software performance prediction. By demonstrating that a relatively small language model can accurately predict complex metrics like kernel latency, memory usage, and model accuracy directly from raw code strings, it opens up new avenues for automation and optimization. This unified, text-to-number approach, leveraging the power of modern language models, promises to standardize and simplify performance prediction pipelines, offering a more robust and adaptable solution than its predecessors. The strong correlations achieved across diverse tasks and languages underscore RLM’s potential to revolutionize how we design, optimize, and deploy software in an increasingly complex computational landscape.

Check out the Paper, GitHub Page, and Dataset Card.

Frequently Asked Questions

What is a Regression Language Model (RLM)?

A Regression Language Model (RLM) is a language model designed to predict numeric outcomes directly from raw text inputs, such as source code or graph representations. Unlike traditional language models that generate free-form text, an RLM generates numerical values, emitted as digit tokens, that represent performance metrics or other quantitative characteristics.

What performance metrics can RLM predict?

The RLM can predict a variety of crucial performance metrics, including GPU kernel latency, program memory usage from high-level code (e.g., Python, C, C++), and neural network accuracy and hardware-specific latency directly from ONNX graphs.

How does RLM differ from traditional performance prediction methods?

Traditional methods often rely on laborious hand-engineered features, sophisticated graph encoders, or time-consuming profiling. RLM, in contrast, uses a unified language modeling approach, reading raw code text and directly decoding numeric outputs without the need for bespoke feature engineering or complex, brittle custom solutions. It also supports expressing uncertainty through sampling.

What are the practical applications of RLM?

RLM has significant practical applications in areas like compiler heuristics (guiding compilers to select efficient code options), GPU kernel selection (predicting latency for optimal deep learning workloads), and multi-objective Neural Architecture Search (quickly evaluating and pruning candidate architectures based on performance trade-offs).

Where can I find resources to learn more about RLM?

You can explore the research paper on arXiv, the associated GitHub page for tutorials, code, and notebooks, and the Code-Regression dataset available on Hugging Face. These resources provide the tools and data necessary for understanding, replicating, and extending RLM’s capabilities.
