… or how to get a feeling for the magnitude of the challenge ahead.

I am no mobile developer. I haven't even written any substantial front-end code in the last five years. Still, I need a user interface that is at least as good as the rest of the system.

Hello, LLM!

After a short search for a framework that supports multiple platforms and is supposed to be quick to learn, I decided to go with Flutter, which means programming in Dart. The first tutorials gave me promising results.

After a few days I was able to implement a simple but working chat interface: a text input and a message history. To make it exciting to use, I needed some real responses.
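For illustration, that first stage can be as small as a list of strings plus a text field. The sketch below is my reconstruction of roughly what it looked like, not the exact code of the app:

```dart
import 'package:flutter/material.dart';

void main() => runApp(const MaterialApp(home: ChatPage()));

// A deliberately simple chat screen: a scrolling message history
// plus a text input that appends to it.
class ChatPage extends StatefulWidget {
  const ChatPage({super.key});

  @override
  State<ChatPage> createState() => _ChatPageState();
}

class _ChatPageState extends State<ChatPage> {
  final List<String> _messages = [];
  final TextEditingController _controller = TextEditingController();

  void _send() {
    final text = _controller.text.trim();
    if (text.isEmpty) return;
    setState(() => _messages.add(text));
    _controller.clear();
  }

  @override
  void dispose() {
    _controller.dispose();
    super.dispose();
  }

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      appBar: AppBar(title: const Text('Chat')),
      body: Column(
        children: [
          // Message history
          Expanded(
            child: ListView.builder(
              itemCount: _messages.length,
              itemBuilder: (_, i) => ListTile(title: Text(_messages[i])),
            ),
          ),
          // Text input
          Padding(
            padding: const EdgeInsets.all(8),
            child: TextField(
              controller: _controller,
              onSubmitted: (_) => _send(),
              decoration: const InputDecoration(hintText: 'Type a message'),
            ),
          ),
        ],
      ),
    );
  }
}
```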

Quick and easy backend

I already had a used computer with an AI-capable graphics card (an NVIDIA GeForce RTX 3060 with 12 GB of VRAM) that I had used for experiments earlier. What was missing was an interface to get simple chat responses. The legacy OpenAI chat completions API is widely supported across many tools and can be regarded as a common standard. One convenient, stable and easy-to-use local inference server is Lemonade Server. Under the hood it uses, for example, llama.cpp. It can load any GGUF model and provides an API to work with it. Lemonade also comes with a useful web interface, a CLI and interesting development examples. Definitely give it a try if you haven't.
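For reference, talking to such an OpenAI-compatible endpoint is just an HTTP POST with a JSON body. A minimal Dart sketch using the http package follows; the base URL, port and model name are assumptions and depend on how the server is configured on your machine:

```dart
import 'dart:convert';

import 'package:http/http.dart' as http;

// Sends a single chat turn to an OpenAI-compatible chat completions
// endpoint and returns the assistant's reply. URL and model name are
// placeholders; adjust them to what your local server actually exposes.
Future<String> chat(String userMessage) async {
  final response = await http.post(
    Uri.parse('http://localhost:8000/api/v1/chat/completions'),
    headers: {'Content-Type': 'application/json'},
    body: jsonEncode({
      'model': 'gpt-oss-20b', // placeholder: the model loaded in your server
      'messages': [
        {'role': 'user', 'content': userMessage},
      ],
    }),
  );
  if (response.statusCode != 200) {
    throw Exception('Chat request failed: ${response.statusCode}');
  }
  final data = jsonDecode(response.body) as Map<String, dynamic>;
  // The reply text sits in choices[0].message.content.
  return data['choices'][0]['message']['content'] as String;
}
```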

For a model I experimented with several of the popular choices like Llama, DeepSeek and Qwen, but the freshly released gpt-oss from OpenAI repeatedly led to quite good results. I simply took the one already advertised in Lemonade.

Making it useful

This is where I hit my first major challenge. So far I got exactly one response for every question I sent: no continued conversation, no formatting, no error handling. I tried to add these piece by piece myself, and I also started using AI coding assistance. Initially I was unsure what to expect when writing a mobile app in Dart rather than in Python, the dominant AI language, but it seems the coding models handle common tasks quite well.
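Continued conversation mostly comes down to keeping the whole message history on the client and resending it with every request, plus catching failed requests. A rough sketch of that shape, extending the earlier example (endpoint and model name are again placeholders):

```dart
import 'dart:convert';

import 'package:http/http.dart' as http;

// Keeps the running conversation and replays it with every request,
// which is what gives the model "memory" across turns.
class Conversation {
  final List<Map<String, String>> _messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
  ];

  Future<String> send(String userMessage) async {
    _messages.add({'role': 'user', 'content': userMessage});
    try {
      final response = await http.post(
        Uri.parse('http://localhost:8000/api/v1/chat/completions'), // placeholder
        headers: {'Content-Type': 'application/json'},
        body: jsonEncode({'model': 'gpt-oss-20b', 'messages': _messages}),
      );
      if (response.statusCode != 200) {
        throw Exception('HTTP ${response.statusCode}');
      }
      final reply = jsonDecode(response.body)['choices'][0]['message']['content'] as String;
      _messages.add({'role': 'assistant', 'content': reply});
      return reply;
    } catch (e) {
      // Drop the failed turn so a retry does not duplicate it.
      _messages.removeLast();
      rethrow;
    }
  }
}
```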

Speech - a solved problem?

Computers can speak and understand spoken language. Automatic Speech Recognition (ASR) and Text-To-Speech (TTS) have been around for decades, and new and better models are released every couple of months. So I optimistically decided to add speech as a user interface.

In the meantime I established a main rule for this project:

All core data processing happens entirely on self-hosted systems.

This helps to keep ownership of the data. So for speech recognition and playback, everything has to happen on the device or on our own server (which we don't have at this time).

On the lookout for a suitable solution that would let me use good-quality, free and open ASR and TTS from Dart, I came across Next-gen Kaldi and their quite extensive sherpa_onnx package, which covers ASR, TTS, voice activity detection and more. Technically it wraps C++ code that executes ONNX models and exposes it as a native Dart library.

After a few days I had the related Flutter example for streaming ASR isolated, supplied an online ASR model (sherpa-onnx-streaming-zipformer-en-kroko-2025-08-06) and integrated it into my app. Online in this context means that recognition happens in real time while the audio is being captured.
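The integration boils down to creating an online recognizer once and feeding it microphone chunks as they arrive. The sketch below follows my reading of the sherpa_onnx Dart examples; class and method names, and especially the model file paths, should be checked against the current package documentation and the unpacked model archive:

```dart
import 'dart:typed_data';

import 'package:sherpa_onnx/sherpa_onnx.dart' as sherpa_onnx;

// Streaming (online) ASR: create the recognizer once, then push audio
// chunks into its stream and read back the text decoded so far.
// Paths are placeholders; in a Flutter app the model files usually have
// to be copied out of the asset bundle to real files on disk first.
late final sherpa_onnx.OnlineRecognizer recognizer;
late final sherpa_onnx.OnlineStream stream;

void initAsr(String modelDir) {
  sherpa_onnx.initBindings();
  final config = sherpa_onnx.OnlineRecognizerConfig(
    model: sherpa_onnx.OnlineModelConfig(
      transducer: sherpa_onnx.OnlineTransducerModelConfig(
        encoder: '$modelDir/encoder.onnx',
        decoder: '$modelDir/decoder.onnx',
        joiner: '$modelDir/joiner.onnx',
      ),
      tokens: '$modelDir/tokens.txt',
    ),
  );
  recognizer = sherpa_onnx.OnlineRecognizer(config);
  stream = recognizer.createStream();
}

// Called for every chunk of 16 kHz mono float samples from the microphone.
String onAudioChunk(Float32List samples) {
  stream.acceptWaveform(samples: samples, sampleRate: 16000);
  while (recognizer.isReady(stream)) {
    recognizer.decode(stream);
  }
  return recognizer.getResult(stream).text;
}
```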

Debugging loop of doom

At this stage I decided that, for the first time, I wanted to hand off a bigger coding task to AI. The goal was to add playback of incoming messages. This was also my first more detailed encounter with Claude Code, which I picked because the Sonnet models from Anthropic have a really good reputation and are used in many coding tools. I also like the low noise of a terminal session, and the 20 USD Pro plan was in an affordable range for me.

The implementation was surprisingly good and straightforward. With a few remarks on simplicity, and after adding a few coding and guidance rules to the project, the outcome was good. Or it seemed good. I quickly got optimistic and put features on top of features that seemed nice. Since that went much quicker than I could have done it myself, I did not pay close attention to the code or its quality. Only quite late did I notice that everything around audio had become unreliable. Sometimes audio would not play, or text would not be recognized. For a long time the microphone seemed to shut off just seconds after it was opened. I spent days trying to lead the AI through debugging loops. While I focused on asking questions and giving suggestions, the AI always responded so confidently and elaborately that the solution seemed very close. In the end I spent more time chasing not-so-relevant bugs than testing new key features.

Later I also tried more expensive and maybe better-suited models like Gemini 3, which might have an advantage since both Flutter and Gemini are Google products. From personal perception I would say it has a small edge over the mid-tier Anthropic LLM, but in the end it struggled and failed at the same point.

Learning to code (through AI)

The main takeaway is nothing new, but it is an interesting challenge to adjust to:

  1. Expect big variations in quality
  2. Learn what you want and what you don't
  3. Accept letting go of details

The coding itself is mostly good or very good. As far as I can tell on unfamiliar territory, the output at function or module level is concise and professional. When changes are ready, I rarely get build errors or obvious logical mistakes.

The amount of comments is quite high for my taste. I think that can be improved by defining the rules better; those aspects are the user's responsibility to define and shape.

I got the impression of a very human-like coder with long experience in the technology, performing quite efficient changes. The results were always boldly and confidently reported, while the actual result often was not so great, or no improvement at all. In the system's defense, the problem area is quite unfortunate: feedback relies on me manually building and running the app, then testing it on an actual phone, listening to and speaking test sentences hundreds of times. That feedback cycle is slow, imprecise and unreliable. Audio is also traditionally quite hard to get right on computers.

Looking back I think I can improve “our” efficiency and quality by:

  1. Treat changes as experiments so they become disposable: everything starts in a branch named experiment-*. If the result is good, it gets squash-merged into the main branch. If not, I can keep it around or, even better, delete it right away. But for the latter I have to improve my discipline.
  2. Commit frequently. Very often I saw something that was good, maybe just one or two adjustments away from being finished. While working on the presumed last 20%, more and more issues show up. As time passes, it becomes quite hard to go back and specify which parts to keep and which to drop.
  3. Understand the big picture and the core technologies. Like a human team member, the AI needs feedback and someone who understands the direction the software is moving towards. Without that guidance it just implements towards somewhere, which might look okay now, but very likely not for long.
  4. Maintain and verify goals and features. Over time it is hard to keep focus on what is needed, and the actual capabilities can drift away from what was once good. A bit of formalizing can help. I am not convinced unit tests are very useful here; maybe spec-driven development, or a simplification of it, would work.

[Image: Screenshot of the Relagent app prototype]