Designing an audio-first reading experience
June 14, 2025
With speech-to-text and text-to-speech models now widely available, voice interfaces are becoming increasingly common. I’ve started using tools like Super Whisper to dictate my messages and Voice Notes to capture random ideas while away from my computer. I haven’t stopped using the keyboard entirely, but the shift towards audio-first interfaces is clear.
However, when it comes to my reading list, I’m still a long-time Pocket user. It does have basic text-to-speech, but the quality isn’t good enough for any serious “reading”. Now, imagine my shock when I learned that Mozilla is shutting down Pocket later this year.
This got me thinking: what would an audio-first reading experience look like? I’m quite certain it’s not just converting raw text into speech with AI. So I want to document the journey of building Katalog, the audio-first read-it-later app, and what I’ve learned designing audio-first interfaces, focusing on how UX and UI should change when there are no visuals to support the experience.
Why basic text-to-speech isn’t enough
Many existing read-it-later apps, like Pocket or Instapaper, have basic text-to-speech. While they technically convert text to audio, I found they don’t provide a good listening experience.
It’s hard to follow an article this way because text that was meant for visual consumption is translated directly into sound. Things like section breaks, list indicators, and images are missing. All the structure and implicit affordances of the written word are gone.
Listen to the text below and try to understand what’s going on without looking at it.
Key takeaways from the 24 Series A companies include:
Business automation and operational tooling dominance. Probably the most surprising part of this analysis was how many of the winners were in internal business automation and operational platforms. This suggests two obvious things:
Network advantages. The batches provide a built-in customer base for B2B operations and automation that can drive success (e.g., Deel, Brex).
Technical talent. YC founders tend to be young, technical, and good at executing, which is an archetype that naturally gravitates towards automation infrastructure, developer tools, and general optimizations.
“AI for X” verticals are surprisingly narrow. Despite the hype around “AI for X” (e.g. AI for dentists), the only vertical AI categories that made it into the data are legal and patent-focused (e.g. Legora, Solve).
From “What’s Working for YC Companies Since the AI Boom.”
Now imagine the list above containing 10 items instead of four. Simply converting text to audio wouldn’t work. An audio-first approach means breaking down each element of an article and thinking about how to convey its meaning and structure through sound.
Audio design examples
Translating the visual language of articles into an understandable audio format presented several design challenges. My approach was mostly trial and error: saving a bunch of articles, listening to the narration, noting the confusing parts, and then iterating on different solutions.
Lists (Especially Nested Lists):
Visually, lists use indentation and bullet points or numbers to show hierarchy and structure. It’s easy to scan and understand quickly.
In audio, a simple narration of list items falls flat. It’s just a sequence of spoken sentences. For a simple list, this might be okay, but for nested lists with multiple levels of indentation, it becomes impossible to understand the relationship between items just by listening.
I first tried adding a voice instruction before each item, like “List item one,” “List item two,” or announcing the beginning of the list and its end:
I’ve seen two questions come up most often when people try to use these tools in real life:
Starting with first item - How do you make your prototypes look good enough to show customers or senior stakeholders?
Next - How do you successfully adopt these tools as a team, instead of a lot of individuals working in silos?
Figma currently supports four MCP actions:
First list item - Get Code
Second list item - Get Variable Definitions
Third list item - Get Image
Fourth list item - Get Code Connect
For simple lists, this was just annoying because it quickly became overly repetitive. For nested lists, it still didn’t clearly communicate the structure or the depth of the indentation.
My current solution involves using a subtle sound cue – a chime – played before each list item. The sound indicates that a new list item is starting.
People coding with LLMs today use agents. Agents get to poke around your codebase on their own. They author files directly. They run tools. They compile code, run tests, and iterate on the results. They also:
pull in arbitrary code from the tree, or from other trees online, into their context windows,
run standard Unix tools to navigate the tree and extract information,
run existing tooling, like linters, formatters, and model checkers, and
make essentially arbitrary tool calls (that you set up) through MCP.
For nested lists, I’m also exploring varying the sound slightly or repeating it to signify the indentation level. For example, one chime for the first level, two for a sub-item, and so on.
Essential Diet Items for Marathon Preparation:
    Complex Carbohydrates:
        whole grains,
        fruits, and vegetables
    Lean Protein Sources:
        chicken, fish
        legumes
This way we use sound design to provide an “audio affordance,” a cue that signals the structure of the content, similar to how visual indentation works.
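To make this concrete, here is a minimal sketch of how a nested list could be turned into a narration queue, with the chime repeated once per indentation level. The ListItem shape, the segment format, and the narrateList function are illustrative assumptions, not Katalog’s actual implementation.

```ts
// Sketch: turn a nested list into a queue of narration segments,
// repeating a chime cue once per indentation level.
type ListItem = { text: string; children?: ListItem[] };

type NarrationSegment =
  | { kind: "sound"; cue: "chime"; repeat: number } // play the chime `repeat` times
  | { kind: "speech"; text: string };

function narrateList(items: ListItem[], depth = 1): NarrationSegment[] {
  const segments: NarrationSegment[] = [];
  for (const item of items) {
    // One chime for the first level, two for a sub-item, and so on.
    segments.push({ kind: "sound", cue: "chime", repeat: depth });
    segments.push({ kind: "speech", text: item.text });
    if (item.children?.length) {
      segments.push(...narrateList(item.children, depth + 1));
    }
  }
  return segments;
}

// The marathon list above would become:
// [chime] "Complex Carbohydrates" [chime, chime] "whole grains" ...
const queue = narrateList([
  { text: "Complex Carbohydrates", children: [{ text: "whole grains" }, { text: "fruits and vegetables" }] },
  { text: "Lean Protein Sources", children: [{ text: "chicken, fish" }, { text: "legumes" }] },
]);
```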
Images:
In an audio-only experience, images are completely inaccessible unless described in some way. Relying solely on the alt attribute is a start, but these are often too brief or missing entirely.
The first immediate solution was to check for the alt description in the HTML and narrate it after a short voice instruction like, “Here’s an image.” However, for more complex images like charts, diagrams, or screenshots, a short alt text isn’t enough. For these cases, I use AI to generate a more detailed description of the image’s content, focusing on the key information the image is meant to convey. The prompt needs a lot of tweaking, though. AI often over-emphasizes the style of the image over its content.
Here is an image — a cube-shaped spacecraft with two wing-like solar panels flies above a white orb streaked with browns and oranges

An image caption says — An illustration of NASA’s Europa Clipper probe above Jupiter’s icy moon Europa. The spacecraft launched on Oct. 14, 2024 and will arrive at Europa in April 2030. (Image credit: NASA/JPL-Caltech) End of image caption
If the article includes a figcaption element with a detailed description written by the author, I prioritize narrating that after an instruction like, “Here is the image’s description,” so that the listener doesn’t miss out on important information presented visually.
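Put together, the fallback order looks roughly like the sketch below: prefer the author’s figcaption, fall back to the alt text, and only then ask a vision model for a description. The narrateImage and describeWithAI helpers are hypothetical names for illustration, not Katalog’s real code.

```ts
// Hypothetical helper: sends the image to a vision model and returns
// a description focused on the key information the image conveys.
declare function describeWithAI(src: string): Promise<string>;

// Sketch of the fallback order when narrating an image.
async function narrateImage(img: HTMLImageElement): Promise<string> {
  // 1. A figcaption written by the author is the best source.
  const caption = img.closest("figure")?.querySelector("figcaption")?.textContent?.trim();
  if (caption) {
    return `Here is the image's description: ${caption}. End of image caption.`;
  }

  // 2. Fall back to the alt attribute if it exists.
  const alt = img.getAttribute("alt")?.trim();
  if (alt) {
    return `Here's an image: ${alt}.`;
  }

  // 3. Otherwise, ask a vision model, prompting it to focus on
  //    the image's content rather than its style.
  return `Here's an image: ${await describeWithAI(img.src)}.`;
}
```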
Code Blocks:
Code blocks are a significant challenge. Visually, code has syntax highlighting, indentation, and structure that help with reading it.
Plainly narrating code line by line would be incredibly unpleasant to listen to. It wouldn’t be accessible or useful for understanding the code’s purpose within the article’s context.
I haven’t fully solved this yet, but I’m exploring using AI, similar to how I handle complex images. The idea is to have AI analyze the code block and generate a summary or explanation of what the code does or what key concept it illustrates, rather than narrating the code itself. This would allow the listener to get the main point the author was making with the code example.
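As a rough sketch of what that could look like, assuming OpenAI’s chat completions API with a placeholder model and prompt (not Katalog’s actual setup):

```ts
import OpenAI from "openai";

const openai = new OpenAI();

// Sketch: ask a model to explain a code block in a sentence or two
// instead of narrating the code verbatim. Model choice and prompt
// wording are assumptions for illustration.
async function summarizeCodeBlock(code: string, surroundingText: string): Promise<string> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content:
          "You narrate articles aloud. Summarize the following code block in one or two spoken " +
          "sentences: what it does and why the author included it. Do not read the code verbatim.",
      },
      { role: "user", content: `Article context:\n${surroundingText}\n\nCode block:\n${code}` },
    ],
  });
  return `Here is a code example. ${response.choices[0].message.content ?? ""} End of code example.`;
}
```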
Links:
Links are naturally visual, interactive elements. In audio, narrating a URL is useless; the listener can’t do anything with it in the moment.
Eventually I decided not to narrate links directly within the article flow. Instead, I collect all the links from the article and provide them in a separate reference section that the user can access after listening. While this means the listener doesn’t know when a link is mentioned during narration, it avoids interrupting the listening experience with information they can’t act on.
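The extraction step itself is simple; here is a minimal sketch, assuming the article is already parsed as HTML in the browser (the names are illustrative):

```ts
// Sketch: keep link text in the narration, but move the URLs
// into a separate reference list shown after listening.
type Reference = { text: string; href: string };

function extractReferences(articleHtml: string): { narrationText: string; references: Reference[] } {
  const doc = new DOMParser().parseFromString(articleHtml, "text/html");
  const references: Reference[] = [];

  doc.querySelectorAll("a[href]").forEach((a) => {
    references.push({
      text: a.textContent?.trim() ?? "",
      href: a.getAttribute("href") ?? "",
    });
    // Replace the anchor with its plain text so the narration flows on.
    a.replaceWith(doc.createTextNode(a.textContent ?? ""));
  });

  return { narrationText: doc.body.textContent ?? "", references };
}
```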
Quotes:
Blockquotes are visually set apart to highlight specific text. Translating them to audio was relatively simple. I use a brief voice instruction like, “Begin quote,” before narrating the quoted text, and then “End quote,” afterwards.
The introduction to Model Context Protocol starts out with:
A quote says — MCP is an open protocol that standardizes how applications provide context to LLMs. Think of MCP like a USB-C port for AI applications. Just as USB-C provides a standardized way to connect your devices to various peripherals and accessories, MCP provides a standardized way to connect AI models to different data sources and tools. End of quote
This clearly delineates the quoted section for the listener without being overly disruptive.
Beyond the narration
Katalog is still in its early stages, and there’s a lot to figure out here. But after the initial exploration, I can clearly see where this audio-first reading experience can go next.
Notes and highlights:
One area I’d love to explore is bi-directional audio communication. While I’m listening to an article, I often want to record my own thoughts, notes, or highlights using my voice, without pausing or interrupting the narration flow too much. This would make the listening experience more active and integrated into a workflow.
Expressive voices:
Another exciting possibility is using AI to implement variable tone and voice expressions. OpenAI’s text-to-speech API, for example, allows setting an emotional tone. I envision analyzing the article’s content to determine the most appropriate voice and emotional style for the narration. A scientific paper could have a calm, informative tone, while a dramatic piece might be narrated with more expression. This moves beyond a single monotone voice towards a more engaging and contextually rich listening experience.
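As a sketch of how that could work with OpenAI’s speech endpoint, where an instructions field can steer the delivery (the model, voice, and tone strings below are placeholder choices, not Katalog’s current implementation):

```ts
import OpenAI from "openai";
import { writeFile } from "node:fs/promises";

const openai = new OpenAI();

// Sketch: narrate a passage with a tone chosen from the article's content,
// e.g. "calm and informative" for a scientific paper.
async function narrateWithTone(text: string, tone: string): Promise<void> {
  const speech = await openai.audio.speech.create({
    model: "gpt-4o-mini-tts",
    voice: "alloy",
    input: text,
    instructions: `Narrate in a ${tone} tone.`,
  });
  await writeFile("narration.mp3", Buffer.from(await speech.arrayBuffer()));
}
```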
Content discoverability:
Finally, I’d love to make the audio versions of articles created in Katalog more accessible. Imagine subscribing to your articles from Katalog in your favourite podcast app. You also wouldn’t need to choose which articles to read that day; they’d just arrive automatically, based on the content you’ve saved before.
Summing up
Designing for audio consumption first is a different way of thinking about content and user experience. It’s not about adding text-to-speech as an extra button. It’s about fundamentally questioning how information, structure, and meaning can be conveyed solely through sound.
If you found the examples above useful, you can follow @usekatalog on X or try the current version of the Katalog web app.