Wordcast

How browser-native text-to-speech actually works

Your browser already speaks. Here is what is happening under the hood when you press Listen on a website that uses the Web Speech API — and what is not.

2026-05-08 · 5 min read


Every modern browser ships with a built-in text-to-speech engine. You have probably never noticed because most websites do not use it. When you press Listen on a site that does, a single line of JavaScript hands your text to your browser, and your browser hands it to your operating system, which produces audio. No upload. No server. No model download.
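That single line looks like this. A minimal sketch, guarded so it only runs where the API exists; the string is a placeholder:

```javascript
// The entire round trip: wrap the text in an utterance and hand it to the
// browser's speech queue. The operating system produces the audio.
if (typeof speechSynthesis !== "undefined") {
  speechSynthesis.speak(new SpeechSynthesisUtterance("Hello from your browser."));
}
```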

The interface is called the Web Speech API. It has been a draft specification since 2012 (published through the W3C's Speech API Community Group) and shipped in Chrome 33 in 2014. Today it is supported in every major browser on every major platform. The implementation differs, but the contract is the same: you create a SpeechSynthesisUtterance, set its text and voice, and call speechSynthesis.speak().
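The whole contract fits in a few lines. Here is a sketch wrapped in a helper; `speakWith` is a hypothetical name, and "Samantha" is just an example voice that happens to exist on Apple platforms:

```javascript
// Create an utterance, pick a voice by name (falling back to the
// platform default if there is no match), and queue it for speaking.
function speakWith(text, voiceName) {
  const utterance = new SpeechSynthesisUtterance(text);
  const match = speechSynthesis.getVoices().find((v) => v.name === voiceName);
  if (match) utterance.voice = match;
  utterance.rate = 1.0; // 0.1 to 10 per the spec; 1 is normal speed
  speechSynthesis.speak(utterance);
  return utterance;
}
```

Calling `speak` does not block: it appends the utterance to a queue that the engine drains in order.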

Where do the voices come from? It depends on the platform. On macOS and iOS, your browser uses the same voices the operating system uses for Siri and accessibility features. On Windows, it uses SAPI for offline voices and Microsoft Online for the high-quality Natural voices. On Android, it uses whatever TTS engine you have selected in your system settings — usually Google Text-to-Speech. On Linux it depends on whether speech-dispatcher is installed.
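You can see exactly what your platform exposes by enumerating the voice list. A sketch (`listVoices` is a made-up helper); note that Chrome populates the list asynchronously, so the voiceschanged event matters:

```javascript
// Render each platform voice as "Name (lang)".
function listVoices(synth) {
  return synth.getVoices().map((v) => `${v.name} (${v.lang})`);
}

if (typeof speechSynthesis !== "undefined") {
  console.log(listVoices(speechSynthesis)); // may be empty on first call in Chrome
  speechSynthesis.addEventListener("voiceschanged", () =>
    console.log(listVoices(speechSynthesis)));
}
```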

Chrome on desktop ships with an additional bonus: a set of voices labeled "Google" that are actually neural network voices served from Google's cloud. They are free. They sound like WaveNet. They require an internet connection. If you check the localService property on a voice, you can tell whether it is offline (true) or remote (false).
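The localService check is a one-line filter. A sketch, with `splitVoices` as a hypothetical helper name; on desktop Chrome, the "Google" voices land in the remote bucket:

```javascript
// Partition voices into offline (localService: true) and
// cloud-backed (localService: false).
function splitVoices(voices) {
  return {
    offline: voices.filter((v) => v.localService),
    remote: voices.filter((v) => !v.localService),
  };
}

if (typeof speechSynthesis !== "undefined") {
  console.log(splitVoices(speechSynthesis.getVoices()));
}
```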

There is one bug worth knowing about. Chrome on desktop stops reading after about 15 seconds. The Chromium issue has been open since 2016. The fix is to call pause and resume every 10 seconds. Most TTS libraries do this automatically, but if you are writing your own, you will hit it.
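If you are rolling your own, the workaround looks roughly like this. A sketch under the article's assumptions (the 10-second period comes from the text above; `speakWithKeepAlive` is a made-up name):

```javascript
// Nudge the engine with pause()/resume() on an interval while speech
// is in flight, so desktop Chrome does not cut off long utterances.
function speakWithKeepAlive(synth, utterance, intervalMs = 10000) {
  const timer = setInterval(() => {
    if (synth.speaking) {
      synth.pause();
      synth.resume();
    }
  }, intervalMs);
  const stop = () => clearInterval(timer);
  utterance.addEventListener("end", stop);
  utterance.addEventListener("error", stop);
  synth.speak(utterance);
  return timer;
}
```

Clearing the timer on both end and error matters; otherwise the interval keeps firing after the utterance is gone.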

What the Web Speech API cannot do: save audio as an MP3, clone a voice, or speak with a voice your platform does not ship. For those things you want a paid TTS service like Azure or ElevenLabs. But for the everyday case of "let me listen to this article while I cook dinner," your browser already has everything it needs.