Whisper for Wayland - A vibe coding journey

2025-08-18 Software-Craftsmanship Augmented-by-AI

Sometimes it is the simple things that are hard.

These days I am doing a lot of vibe-coding or agentic-coding or … whatever you want to call it when you do not describe how to build something, but describe what you want the AI to build.

And sometimes you reach a state of flow, where the requirements just pour out of you. But then you have to start typing, and the typing can break the flow.

So instead of typing you want to start talking to the AI to capture your thoughts.

And that should be no problem. You just use Whisper and you are done. Right?

Wrong! Let me be more specific …

  • I need a voice-to-text solution …
  • … that runs on Ubuntu …
  • … with Wayland …
  • … and Sway …
  • … without root permissions …
  • … that features a push-to-talk experience …
  • … and the ability to insert the recorded/recognized text at the point where the cursor is
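On Sway, the push-to-talk part can be wired up without root by letting the compositor own the hotkey: bind the key press and key release to start/stop commands in the sway config. A hypothetical sketch of such a binding (the binary name and flags are illustrative assumptions, not the actual whisper-wayland interface):

```
# ~/.config/sway/config (hypothetical; binary name and flags are
# illustrative, not the actual whisper-wayland CLI)
# Press $mod+r to start recording, release it to stop and transcribe.
bindsym --no-repeat $mod+r exec whisper-wayland --start-recording
bindsym --release $mod+r exec whisper-wayland --stop-recording
```

Because the compositor dispatches the key events, no process ever needs raw access to /dev/input, which sidesteps the root-permission problem entirely.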

A couple of things to notice here …

  • A lot of the out-of-the-box solutions only support Windows and macOS (not Ubuntu)
  • And even when they support Ubuntu, they might only support X11 (not Wayland, because Wayland is much more restrictive when it comes to accessing devices)
  • And even then there might be a requirement to run with root permissions (because the Python keyboard package requires you to run as root)
  • And with root you will probably run into problems accessing other devices (like the microphone)
  • And … last but not least … in case I have not mentioned it yet: Wayland is much more restrictive when it comes to accessing devices, which means the mechanism an out-of-the-box solution uses to insert the text at the cursor position will probably not work

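To make that last point concrete: on wlroots compositors like Sway, one way to inject text at the cursor without root is the virtual-keyboard protocol, for example via the `wtype` tool. A minimal sketch (assuming `wtype` is installed; this is an illustration of the general approach, not necessarily how whisper-wayland does it):

```python
import os
import shutil
import subprocess

def type_text(text: str) -> bool:
    """Type `text` into the focused Wayland window via wtype.

    wtype speaks the wlroots virtual-keyboard protocol, so this works
    on sway without root permissions - but not on GNOME/KDE Wayland.
    Returns False when it cannot run, so the caller can fall back
    (e.g. to putting the text on the clipboard with wl-copy).
    """
    if shutil.which("wtype") is None or not os.environ.get("WAYLAND_DISPLAY"):
        return False
    subprocess.run(["wtype", text], check=True)
    return True

if __name__ == "__main__":
    ok = type_text("hello from whisper")
    print("typed at cursor" if ok else "no wtype/Wayland session; would fall back")
```

Note that this only works because wtype talks to the compositor instead of to the input devices, which is exactly the kind of detour the Wayland restrictions force on you.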
I guess you start to get the picture.

That means we need a voice-to-text/push-to-talk solution that satisfies the above list of requirements.

I looked (believe me, long and hard), but was not able to find one. If you find one, email me and I will amend this blog post.

Now … meet whisper-wayland. It was created with/by claude-code. The details are in the README.md and the CLAUDE.md files.

The entire vibe-coding session started with a prompt that looked a lot like this blog post (actually the post was developed from the prompt).

I then asked Claude to ask me at least 5 questions to clarify the requirements even more. That proved to be extremely helpful.

After we had agreed on what needed to get done, we moved on to how to implement it. I asked it NOT to do everything in one go, but in 5 distinct phases or steps. After every step we wanted to end up with a runnable artifact and a code coverage of at least 80%. I also asked it to commit the code and the changes as soon as it made sense, which means you can follow the journey by looking at the commit history. And I asked it again to ask me at least 5 clarifying questions about the approach and the steps.

That means that after 30 minutes of conversation we ended up with a good, initial understanding of what needed to be built and how to go about it.

Then we kicked off the implementation of step 1. That step (and actually every step after that) took another 30 minutes to complete. Creating runnable artifacts is a great way to create a feedback loop that lets the AI confirm that it is done. If the artifact is crashing or not working, the AI will loop until it works.

Overall we worked on whisper-wayland for ~5 hours, which was incredibly fast. And it is working. Feel free to clone the repo and try it.

A word of warning: for this to work you need (for instance) a claude-code Max (or at least Pro) account or something comparable (do not use on-demand accounts; vibe coding will use a lot of tokens).

In the next couple of weeks I will spend a couple of days polishing and hardening the code.

Happy voice-/vibe-coding.