
How to Run a RAG Powered Language Model on Android With the Help of MediaPipe


This comprehensive guide will walk you through implementing a Retrieval Augmented Generation (RAG)-powered language model on your Android device, transforming your apps with on-device intelligence. Learn to integrate advanced AI capabilities, set up on-device LLMs like Gemma3-1B, create embeddings, and fine-tune model parameters for smarter, more accurate mobile AI experiences.

Estimated Reading Time: 10 minutes

  • RAG on Android: Learn to integrate Retrieval Augmented Generation (RAG) powered language models into Android apps using the versatile MediaPipe framework, enabling on-device, context-aware AI.
  • On-Device LLM Setup: Discover how to set up lightweight on-device LLMs like Google’s Gemma3-1B and the Gecko Embedder, transferring necessary model files to your Android device via adb.
  • Embedding Generation: Understand the process of converting your app’s knowledge base into mathematical embeddings using MediaPipe’s Embedder and storing them in an SQLite Vector Store for efficient retrieval.
  • Model Fine-Tuning: Master the art of optimizing your RAG-powered language model by experimenting with parameters such as setMaxTokens, setTemperature, setTopK, and setTopP to control creativity, accuracy, and response length.
  • Practical Implementation: Follow a step-by-step guide to build a “Simon Says” app example, illustrating the entire RAG pipeline from model setup to prompt processing and response generation.

The convergence of advanced AI and powerful mobile hardware is unlocking unprecedented capabilities for Android applications. Delivering intelligent, context-aware experiences directly on user devices is no longer a distant dream but a tangible reality, largely thanks to techniques like Retrieval Augmented Generation (RAG) and the versatile MediaPipe framework. This guide will walk you through implementing a RAG-powered language model on your Android device, transforming your apps with on-device intelligence.

A couple of months ago, I gave a talk about running on-device SLMs in apps. It was well received, and it was refreshing to give a talk about how mobile apps can benefit from the rise in language model usage.

After the talk, I had a few pieces of feedback from attendees along the lines of “It would be good to see an example of an app using an SLM powered by RAG.”

These are fair comments. The tricky thing here is the lead time it takes to set up a model and show something meaningful to developers in a limited time. Fortunately, it does make for a great blog post!

So, here it is! If you’re looking for steps to add a RAG-powered language model to your Android app, this is the post for you!

What is RAG?

If you’re unfamiliar with the term, RAG stands for Retrieval Augmented Generation. It’s a technique that lets language models access external information that wasn’t part of their training data. This allows models to draw on up-to-date information and provide more accurate answers to prompts.

Let’s look at a quick example. Imagine you have two language models: one uses RAG to retrieve external information about capital cities in Europe, whilst the other relies only on its own knowledge. You decide to give both models the following prompt:

You are an expert on the geography of Europe. Give me the name of 3 capital cities. Also give me an interesting fact for each of them.

The result could look something like this:

As the comparison illustrates, you’re more likely to get a useful answer from the language model using RAG than from the model relying on its own knowledge. You may also notice fewer hallucinations from the RAG-powered model, as it doesn’t need to make up information.

That’s really all you need to know about RAG at a high level. Next, let’s go deeper and write some code to enable your own RAG-powered language model inside your app!

Setting Up Your Android App

Let’s say you want to create an app using a language model to play the game of Simon Says. You want the model to be Simon, and use RAG to access a datasource of tasks to help decide what to ask the user. How would you do that?

The most straightforward way to do that on Android is with MediaPipe, a collection of tools to help your app use AI and machine learning techniques. To begin, add the MediaPipe dependencies to your app’s build.gradle:

dependencies {
    implementation("com.google.mediapipe:tasks-genai:0.10.27")
    implementation("com.google.ai.edge.localagents:localagents-rag:0.3.0")
}

Next, you need to add a language model to your test device via your computer. For this example, we’ll use Google’s Gemma3-1B, a lightweight language model that holds 1 billion parameters worth of information.

Side Note: Before downloading the model, you may have to sign up to Kaggle and agree to Google’s AI Terms and Conditions.

Once the model is downloaded, it’s time to add the model to your device. You can do this via adb:

$ adb shell mkdir -p /data/local/tmp/slm/ # Create a folder to store the model
$ adb push output_path /data/local/tmp/slm/gemma3-1B-it-int4.task # Copy the model over to the device

Alternatively, you can use the Device File Explorer in Android Studio to create the folder yourself and drag the model onto your device.

With the model added, you can continue building the RAG pipeline to feed Gemma with information. Next, let’s look at adding the information Gemma will rely on to answer prompts.

Creating Embeddings

The information language models rely on to perform RAG isn’t used in the raw form you pass in. Models require the information to be converted into a specific format called embeddings.

These are numerical vectors that represent the text’s semantic meaning. When the model receives a prompt, it searches for the embeddings most relevant to that prompt and uses them alongside its own knowledge to provide an answer. These vectors are created by a tool called an embedder.

Embeddings are a whole subject on their own; you are encouraged to read about them. For this post, you only need to know how to create them. You can do this on the device by using the Gecko Embedder.

First, download the sentencepiece.model tokenizer and the Gecko_256_f32.tflite embedder model files to your computer. Then push them to your device:

$ adb push sentencepiece.model /data/local/tmp/slm/sentencepiece.model # Push the tokenizer to the device
$ adb push Gecko_256_f32.tflite /data/local/tmp/slm/Gecko_256_f32.tflite # Push the embedder model to the device

With the embedder installed on your device, it’s time to provide a sample file to create the embeddings from. In Android Studio, create a file called simon_says_responses.txt in your app module’s assets folder. Then, in the file, add the following text:

<chunk_splitter>
Go for a walk
<chunk_splitter>
Jump up and down 10 times
<chunk_splitter>
Sing your favourite song!
<chunk_splitter>
Text your best friend a funny meme
<chunk_splitter>
Do 10 press ups!
<chunk_splitter>

The file contains a handful of tasks Simon could give in a game of Simon Says, each separated by a <chunk_splitter> tag. This gives the embedder a signal for how to separate each response when splitting the text into embeddings.

This process is called chunking and can have a large effect on how well RAG performs via the language model. Experimentation with different-sized chunks and responses is encouraged to find the right combination for your needs!

One thing to consider is app storage. Remember, you’ve already installed a language model and an embedder onto your device. These take up gigabytes of space, so make sure not to further bloat a device by using a text file that is too large!

You may also want to consider storing the sample file in the cloud and downloading it over the network, to reduce pressure on device storage.
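If you go down that route, a minimal sketch of fetching the file over the network might look like this. The URL is a placeholder, and in a real app you’d add error handling, caching, and run the download off the main thread:

import java.io.File
import java.net.URL

// Downloads the responses file into the app's files directory.
// The URL below is a placeholder - swap in your own hosting location.
fun downloadResponsesFile(targetDir: File): File {
    val target = File(targetDir, "simon_says_responses.txt")
    URL("https://example.com/simon_says_responses.txt").openStream().use { input ->
        target.outputStream().use { output -> input.copyTo(output) }
    }
    return target
}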

With the text file in place, it’s time to initialize the embedder:

private const val GeckoEmbedderPath = "/data/local/tmp/slm/Gecko_256_f32.tflite"
private const val TokenizerModelPath = "/data/local/tmp/slm/sentencepiece.model"
private const val UseGpuForEmbeddings = true

val embedder: Embedder<String> = GeckoEmbeddingModel(
    GeckoEmbedderPath,
    TokenizerModelPath,
    UseGpuForEmbeddings,
)

The embedder takes three parameters: the on-device paths of the embedder model and the tokenizer, and a flag that sets whether the embedder can use the device’s GPU when creating embeddings.

Setting this to true will speed up the creation of the embeddings if a GPU is available. Make sure to check the device capabilities before deciding to enable this value.
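There isn’t a single API that reports whether MediaPipe can use the GPU, so one rough heuristic is to check the device’s reported OpenGL ES version, as in the sketch below. The 3.1 threshold is an assumption for illustration, not an official requirement:

import android.app.ActivityManager
import android.content.Context

// Rough heuristic: treat the GPU as usable if the device reports
// OpenGL ES 3.1 or newer. Adjust the threshold for your own needs.
fun Context.isGpuLikelySupported(): Boolean {
    val activityManager = getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val glEsVersion = activityManager.deviceConfigurationInfo.reqGlEsVersion
    return glEsVersion >= 0x00030001 // 3.1 encoded as major << 16 | minor
}

You could then feed the result into the embedder’s GPU flag instead of hard-coding true.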

Next, create an instance of the language model:

private const val GemmaModelPath = "/data/local/tmp/slm/gemma3-1B-it-int4.task"

val llmInferenceOptions = LlmInferenceOptions.builder()
    .setModelPath(GemmaModelPath)
    .setPreferredBackend(LlmInference.Backend.CPU) // Change to GPU if you have a GPU-powered device.
    .setMaxTokens(1200)
    .build()

val llmInferenceSessionOptions = LlmInferenceSessionOptions.builder()
    .setTemperature(0.6f)
    .setTopK(5000)
    .setTopP(1f)
    .build()

val languageModel = MediaPipeLlmBackend(
    context, // This is the application context
    llmInferenceOptions,
    llmInferenceSessionOptions,
)

languageModel.initialize().get()

There are a lot of parameters involved above; don’t worry about these for now. We’ll come back to them later on.

With the language model created, you can focus back on the embeddings. The model needs a place to retrieve the embeddings each time it receives a prompt.

MediaPipe provides an SQLite Vector Store, which is a common tool to store embeddings. Let’s create one:

private const val PromptTemplate: String = """
    You are Simon in a game of Simon Says. Your task is to ask the player to perform
    a task from the following list: {0}. Your response must only contain the task that
    the player must do. Your response must be based on the player's request: {1}.
    Do not ask the player to do the same thing twice. You must not ask the player to
    do anything that is dangerous, unethical or unlawful.
"""

val chainConfig = ChainConfig.create(
    languageModel,
    PromptBuilder(PromptTemplate),
    DefaultSemanticTextMemory(
        SqliteVectorStore(768),
        embedder,
    ),
)

Here, everything begins to link together, as the language model and the embedder are both passed into the ChainConfig. The 768 is the dimensionality of each vector the store holds, matching the size of the embeddings the Gecko embedder produces. The PromptBuilder provides the prompt template used to drive the RAG process.

Finally, let’s load the text file we created earlier into the embedder. First, read the file from the assets folder and split the text into a list of strings:

// getTextFromFile is a custom extension function that reads the asset file
// and splits its contents on each <chunk_splitter>.
val gameResponses: List<String> = context.getTextFromFile("simon_says_responses.txt")
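If you’re wondering what that extension function might look like, here’s a minimal sketch. It assumes the file lives in the app’s assets folder and uses the same <chunk_splitter> markers shown earlier; it isn’t part of MediaPipe:

import android.content.Context

// Reads the asset file, splits it on the chunk markers, and drops empty chunks.
fun Context.getTextFromFile(fileName: String): List<String> =
    assets.open(fileName).bufferedReader().use { it.readText() }
        .split("<chunk_splitter>")
        .map { it.trim() }
        .filter { it.isNotEmpty() }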

Next, load the responses into the chainConfig created earlier.

chainConfig.semanticMemory.getOrNull()
    ?.recordBatchedMemoryItems(ImmutableList.copyOf(gameResponses))
    ?.get()

Depending on your device and the size of the text file, this can take a couple of seconds to a few minutes to complete. A good rule of thumb is to prevent the language model from being used whilst this occurs. You could also run this operation on an IO thread to avoid blocking the main thread.
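For example, here’s a minimal sketch using Kotlin coroutines, assuming you call it from a coroutine scope such as viewModelScope and that chainConfig is the instance created above:

import com.google.common.collect.ImmutableList
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

// Runs the embedding load on the IO dispatcher so the main thread stays responsive.
// .get() blocks until the batch of memory items has been recorded.
suspend fun loadGameResponses(gameResponses: List<String>) = withContext(Dispatchers.IO) {
    chainConfig.semanticMemory.getOrNull()
        ?.recordBatchedMemoryItems(ImmutableList.copyOf(gameResponses))
        ?.get()
}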

With that done, you’ve just converted your text file into a set of embeddings kept in a vector store. You’ve also linked your language model to the store so it can now retrieve information!

The next section will show how to use those embeddings by passing the language model your prompts.

Passing Prompts to Your RAG Powered Model

With most of the complex setup complete, passing a prompt into the language model is surprisingly easy. First, you need to create a RetrievalAndInferenceChain using the chain config and invoke it.

val retrievalAndInferenceChain = RetrievalAndInferenceChain(chainConfig)

Next, create a request and pass it into the chain.

val prompt = "Tell me something to do Simon involving jumping!"

val retrievalRequest = RetrievalRequest.create(
    prompt,
    RetrievalConfig.create(
        50, // topK
        0.1f, // minSimilarityScore
        RetrievalConfig.TaskType.RETRIEVAL_QUERY,
    ),
)

val response = retrievalAndInferenceChain.invoke(retrievalRequest).get().text

With that, the language model will process the prompt. During processing, it will refer to the embeddings in your vector store to provide what it thinks is the most accurate answer.
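Note that invoke(...).get() blocks the calling thread while the model works, so in a real app you’ll want to run the request from a coroutine. Here’s a minimal sketch, assuming retrievalAndInferenceChain is the chain created above; the askSimon name is just for illustration:

import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

// Sends a prompt through the RAG chain without blocking the main thread.
suspend fun askSimon(prompt: String): String = withContext(Dispatchers.Default) {
    val request = RetrievalRequest.create(
        prompt,
        RetrievalConfig.create(
            50, // topK
            0.1f, // minSimilarityScore
            RetrievalConfig.TaskType.RETRIEVAL_QUERY,
        ),
    )
    retrievalAndInferenceChain.invoke(request).get().text
}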

As you would expect, different prompts will produce different responses. Since the embeddings contain a range of Simon Says tasks, it’s likely you’ll get a good one!

What about the cases where you receive an unexpected result? At this point, you need to go back and fine-tune the parameters of the objects from the previous section.

Let’s do that in the next section.

Fine-Tuning Your RAG Powered Language Model

If there’s one thing we’ve learned over the years from Generative AI, it’s that it isn’t an exact science.

You’ve no doubt seen stories where language models have “misbehaved” and produced a less than desired result, causing embarrassment and reputational damage to companies.

This is often the result of insufficient testing to make sure a model’s responses are what you expect. Testing not only helps avoid embarrassment; it also gives you room to experiment with your language model’s output so it works even better!
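As a starting point, you could write an instrumented test that checks Simon’s reply always maps back to one of the tasks in your knowledge base. This is only a sketch: it assumes a helper like the askSimon function sketched earlier, and language model output varies between runs, so expect to tune the assertion for your needs.

import kotlinx.coroutines.runBlocking
import org.junit.Assert.assertTrue
import org.junit.Test

class SimonSaysResponseTest {

    // The known task list mirrors simon_says_responses.txt.
    private val knownTasks = listOf(
        "Go for a walk",
        "Jump up and down 10 times",
        "Sing your favourite song!",
        "Text your best friend a funny meme",
        "Do 10 press ups!",
    )

    // Sketch only: checks the model's reply contains one of the known tasks.
    @Test
    fun simonOnlySuggestsKnownTasks() = runBlocking {
        val response = askSimon("Simon, give me something to do!")
        assertTrue(knownTasks.any { response.contains(it, ignoreCase = true) })
    }
}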

Let’s take a look at the levers we can adjust to fine-tune our RAG-powered language model, starting with LlmInferenceOptions.builder():

LlmInferenceOptions.builder()
    .setModelPath(GemmaModelPath)
    .setPreferredBackend(LlmInference.Backend.CPU) // Change to GPU if you have a GPU-powered device.
    .setMaxTokens(1200)
    .build()

The first parameter that can be changed is setMaxTokens(). This sets the total number of tokens, covering both input and output, that the language model can handle. The larger the value, the more text the model can take in and generate at once, which generally leads to better answers.

In our example, this means the prompt the model receives and the response it generates can use up to 1200 tokens between them. If we wanted to handle less text and generate smaller responses, we could set maxTokens to a smaller value.

Be careful with this value, as you could find your model unexpectedly handling more tokens than it expects, causing an app crash.
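One defensive option is to sanity-check the size of a prompt before sending it. The sketch below uses a rough four-characters-per-token approximation, which is an assumption rather than a figure from MediaPipe:

// A very rough guard against oversized prompts. The ~4 characters per token
// ratio is an approximation, not a value provided by MediaPipe.
private const val MaxTokens = 1200
private const val ApproxCharsPerToken = 4

fun isPromptWithinBudget(prompt: String, tokensReservedForResponse: Int = 400): Boolean {
    val approxPromptTokens = prompt.length / ApproxCharsPerToken
    return approxPromptTokens + tokensReservedForResponse <= MaxTokens
}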

Let’s move onto LlmInferenceSessionOptions.Builder():

val llmInferenceSessionOptions = LlmInferenceSessionOptions.builder()
    .setTemperature(0.6f)
    .setTopK(5000)
    .setTopP(1f)
    .build()

Here, you can set a few different parameters: .setTemperature(), .setTopK(), and .setTopP(). Let’s dive into each of them.

.setTemperature() can be thought of as how “random” the responses from the language model can be. The lower the value, the more “predictable” the responses will be. The higher the value, the more “creative” and unexpected the responses will be.

For this example, it’s set to 0.6, meaning the model will provide semi-creative, but not unimaginable responses.

The temperature is a good value to experiment with, as you may find different values provide better responses depending on your use case. In a game of Simon Says, some creativity is welcome!
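If you want to compare settings while testing, one option is to keep a couple of session configurations side by side and swap between them. This is just a sketch reusing the builder shown above; the exact values are arbitrary:

// Two configurations to compare while experimenting: a lower temperature for
// predictable responses, a higher one for more creative responses.
val predictableSessionOptions = LlmInferenceSessionOptions.builder()
    .setTemperature(0.2f)
    .setTopK(5000)
    .setTopP(1f)
    .build()

val creativeSessionOptions = LlmInferenceSessionOptions.builder()
    .setTemperature(0.9f)
    .setTopK(5000)
    .setTopP(1f)
    .build()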

.setTopK() is a way of saying “only consider the top K results to return to the user.” Language models, whilst processing a prompt, generate a number of responses, potentially thousands!

Each of these responses is given a probability reflecting how likely it is to be the right answer. To limit the number of responses under consideration, the topK value can be set to help focus the model. If you’re happy for less probable responses to be considered, set this value high.

Similar to the temperature property, this is a good value to experiment with. You may find the model works better with fewer or more responses to consider, depending on your needs.

For a game of Simon Says, we want the model to be thinking about a lot of different responses to keep the game fresh. So 5000 seems like a good value.

.setTopP() builds upon the limit set by topK by saying “only consider the most probable results whose probabilities add up to P”. As mentioned earlier, language models assign a probability to each response they generate. With topP (often called nucleus sampling), the candidates are sorted by probability and kept only until their combined probability reaches P; everything beyond that cut-off is discarded.

To show an example, if the model had topP set to 0.8 and was considering the following responses:

  • Simon Says clap 5 times! = 0.5 // This response has a probability of 0.5 of being chosen
  • Simon Says jump up and down! = 0.3
  • Find a car and drive it. = 0.2

The first two responses would be considered, because together they cover the 0.8 cumulative probability. The response about the car falls outside that cut-off and is discarded.
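To make the interplay between topK and topP concrete, here’s a small standalone sketch of the filtering idea. It’s an illustration of nucleus sampling over a handful of candidates, not MediaPipe’s internal implementation:

// Illustration only: how top-k and top-p (nucleus) filtering narrow candidates.
fun filterCandidates(
    candidates: Map<String, Float>, // response -> probability
    topK: Int,
    topP: Float,
): List<String> {
    val sorted = candidates.entries.sortedByDescending { it.value }.take(topK)
    val kept = mutableListOf<String>()
    var cumulative = 0f
    for ((response, probability) in sorted) {
        kept += response
        cumulative += probability
        if (cumulative >= topP) break // stop once the nucleus covers topP
    }
    return kept
}

// Example: with topP = 0.8f, only the two most likely responses survive.
val surviving = filterCandidates(
    mapOf(
        "Simon Says clap 5 times!" to 0.5f,
        "Simon Says jump up and down!" to 0.3f,
        "Find a car and drive it." to 0.2f,
    ),
    topK = 5000,
    topP = 0.8f,
)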

Similar to temperature and topK, topP lets you define how creative your language model can be. A lower P value keeps the model focused on its most probable responses, while a higher value allows less probable, more surprising responses through.

In our example, it’s set to 1.0, the most permissive setting, so topP doesn’t filter anything out; the temperature and topK values do the work of keeping Simon’s suggestions sensible. In a game of Simon Says, a little variety is welcome after all!

Experimenting with these values will generate very different results from your language models. Try them out and see what happens!

Where To Go Next?

I hope you’ve enjoyed this walkthrough of how to add a RAG-powered language model to your Android app! If you’re looking to learn more about RAG and Android, here are a few links I recommend:

  • Simon Says App: Clone and run the sample code for this blog post to see it in action. It shows how to set up the RAG pipeline with Gemma using Android architecture best practices.
  • MediaPipe RAG: Check out the RAG section on the MediaPipe Google Developer docs – Highly recommended reading.
  • Setting Temperature, TopK, and TopP in LLMs: Learn more about how setting the Temperature, TopK, and TopP values can help control the results from language models. Another highly recommended article.

This comprehensive guide has equipped you with the knowledge to integrate powerful RAG-enabled language models into your Android applications using MediaPipe. From understanding the core principles of Retrieval Augmented Generation to the intricate steps of setting up your development environment, preparing your data with embeddings, and fine-tuning your model’s responses, you now possess the expertise to build truly intelligent on-device experiences.

Actionable Step 1: Prepare Your Android Environment and On-Device Models

Begin by adding the necessary MediaPipe dependencies to your Android project’s build.gradle. Next, download the Google Gemma3-1B language model, along with the sentencepiece.model tokenizer and the Gecko_256_f32.tflite embedder. Utilize adb push commands or Android Studio’s File Explorer to transfer these crucial files to a designated directory (e.g., /data/local/tmp/slm/) on your test Android device, ensuring all terms and conditions for model usage are met.

Actionable Step 2: Generate and Store Embeddings from Your Knowledge Base

Create a structured text file (e.g., simon_says_responses.txt) within your app’s assets folder, carefully chunking your specific data with <chunk_splitter> tags. Initialize the GeckoEmbeddingModel using the correct paths for your tokenizer and embedder, set up a MediaPipeLlmBackend instance for Gemma, and configure an SqliteVectorStore. Finally, load your prepared text file into the chainConfig, allowing the system to convert your raw data into semantically rich embeddings for efficient retrieval.

Actionable Step 3: Implement Prompting and Optimize Model Behavior

Construct a RetrievalAndInferenceChain leveraging your pre-configured language model and vector store. Formulate a RetrievalRequest by combining the user’s prompt with suitable topK, minSimilarityScore, and TaskType settings, then invoke the chain to generate a RAG-powered response. Continuously fine-tune your model’s output by experimenting with parameters like setMaxTokens, setTemperature, setTopK, and setTopP within the LlmInferenceOptions and LlmInferenceSessionOptions to achieve the desired balance of creativity, accuracy, and response length for your application’s unique needs.

Conclusion

On-device RAG-powered language models represent a significant leap forward for mobile application intelligence. By leveraging MediaPipe, you can deliver sophisticated, context-aware AI experiences directly on user devices, enhancing performance, privacy, and user engagement. The ability to augment a language model’s inherent knowledge with external, up-to-date information opens up a world of possibilities for dynamic and responsive applications.

Dive in, experiment with the parameters, and explore the potential. The future of intelligent mobile apps is now in your hands!

Ready to Build? Start integrating RAG into your Android app today!

Explore the official MediaPipe documentation and the provided sample code to kickstart your journey.

Your users will thank you for the smarter, more responsive experiences you’ll create!


Frequently Asked Questions

Q: What is Retrieval Augmented Generation (RAG) and why is it important for on-device LLMs?

A: RAG is a technique that allows language models to access and retrieve external, up-to-date information that was not part of their original training data. For on-device LLMs, RAG is crucial because it enables them to provide more accurate and contextually relevant answers without constantly retraining the model, reducing hallucinations and allowing for dynamic, smart mobile AI experiences.

Q: How do I install the Gemma3-1B model and Gecko Embedder on my Android device?

A: After downloading the Gemma3-1B model, the sentencepiece.model tokenizer, and the Gecko_256_f32.tflite embedder, you can push them to your Android device using adb push commands. For example: adb push gemma3-1B-it-int4.task /data/local/tmp/slm/gemma3-1B-it-int4.task. Ensure you create the target directory first.

Q: What are embeddings and how are they created using MediaPipe?

A: Embeddings are mathematical vector representations of text that capture its semantic meaning. Using MediaPipe, they are created by an Embedder (like Gecko Embedding Model). You provide a text file (e.g., simon_says_responses.txt) which is then chunked, and the Embedder processes these chunks into numerical vectors, stored in a Vector Store like SQLite.

Q: What role does setTemperature play in fine-tuning an LLM?

A: The setTemperature parameter controls the “randomness” or creativity of the LLM’s responses. A lower temperature (e.g., 0.1) results in more predictable and focused outputs, while a higher temperature (e.g., 1.0) encourages more diverse and imaginative responses. Finding the right balance depends on your application’s specific needs.

Q: What is the purpose of setTopK and setTopP in LLM configuration?

A: setTopK limits the LLM to consider only the top `K` most probable tokens for its next output, narrowing down the potential choices. setTopP further refines this by considering only the most probable tokens whose cumulative probability exceeds `P`, effectively filtering out less likely options. Both parameters help control the model’s output diversity and relevance.
