streaming music with agora

2026.05.28

If you’ve used Discord, you know the music-bot routine: someone drops a !play <link> in chat, a bot joins the voice channel, and the whole room hears it. I wanted that exact thing in a different voice app — one built on Agora — so I rebuilt a music bot from scratch around it.

I like Go, so the plan was to keep the whole thing in Go, one language end to end. But little of what follows is actually Go-specific — the SDK gaps and the audio work below would look much the same in any language whose Agora SDK you’d reach for. It also has to run when my own machine is off, so it lives on a Linux server rather than my laptop.

This is how it works, and a few things I learned about Agora’s server SDK on the way.

what it does

It joins a voice channel as a participant and publishes audio, the same way a person’s microphone would — and it takes the commands you’d expect from a Discord music bot, typed into the call’s chat with a ! prefix:

!play <url | search> — queue a track, or start it if nothing’s playing
!skip — skip to the next track in the queue
!stop — stop and clear the queue
!pause / !resume — pause and pick back up
!queue — show what’s lined up
!np — what’s playing now, with elapsed time
!volume <0–100> — set the output level
!clear — empty the queue but keep the current track
!help — list the commands

Behind that is a queue and a small state machine — idle, loading, playing, paused. None of that was the hard part. The rest of this post is about getting the music to actually sound good once it reached the call.
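A minimal sketch of what that four-state machine could look like as a pure transition function. The state names come from the text; the event names and the exact set of allowed transitions are assumptions made for illustration, not the bot's actual code.

```go
package main

// State is the player's lifecycle state, using the four names from the post.
type State int

const (
	Idle State = iota
	Loading
	Playing
	Paused
)

// Event is a hypothetical set of triggers; the real bot may slice these differently.
type Event int

const (
	EvPlay     Event = iota // a !play was accepted and the track is being resolved
	EvLoaded                // the decoder started producing audio
	EvPause                 // !pause
	EvResume                // !resume
	EvTrackEnd              // the track ended with nothing left in the queue
	EvStop                  // !stop
)

// next returns the state that follows s on ev, and false when ev is not
// valid in s (in which case the state is returned unchanged).
func next(s State, ev Event) (State, bool) {
	switch {
	case ev == EvStop:
		return Idle, true
	case s == Idle && ev == EvPlay:
		return Loading, true
	case s == Loading && ev == EvLoaded:
		return Playing, true
	case s == Playing && ev == EvPause:
		return Paused, true
	case s == Paused && ev == EvResume:
		return Playing, true
	case (s == Playing || s == Paused) && ev == EvTrackEnd:
		return Idle, true
	}
	return s, false
}
```

Keeping the transitions in one function makes invalid commands (a !resume while idle, say) easy to reject in one place.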

publishing audio

Agora’s server SDK lets you push raw PCM into a channel. So the job is to turn “a YouTube link” into a stream of raw samples and hand them over at the right pace. Two tools do the heavy lifting: yt-dlp resolves and downloads the audio, and ffmpeg decodes it to signed 16-bit PCM.

yt-dlp -f bestaudio --no-playlist -o - "<url>" \
  | ffmpeg -hide_banner -loglevel error -i pipe:0 \
           -f s16le -acodec pcm_s16le -ar 48000 -ac 2 pipe:1

That gives 48 kHz, stereo, s16le, published with a music-quality profile and the call’s noise/echo processing turned off, since it’s music, not a voice. One second of it is 48000 × 2 channels × 2 bytes = 192,000 bytes (about 192 kB), which matters once you start pacing it out twenty milliseconds at a time.
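The same arithmetic, carried down to a single 20 ms frame, can be sketched as a few constants. The constant and function names are made up for this example; the numbers come from the format above.

```go
package main

const (
	sampleRate     = 48000 // Hz
	channels       = 2     // stereo
	bytesPerSample = 2     // signed 16-bit
	frameMillis    = 20    // one published frame
)

// bytesPerSecond is the raw PCM byte rate for the format above.
func bytesPerSecond() int { return sampleRate * channels * bytesPerSample }

// samplesPerChannel is the number of samples each channel contributes to one frame.
func samplesPerChannel() int { return sampleRate * frameMillis / 1000 }

// frameBytes is the size of one interleaved 20 ms frame.
func frameBytes() int { return samplesPerChannel() * channels * bytesPerSample }
```

So a frame is 960 samples per channel and 3,840 bytes, and fifty of them make up one second.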

One real-world wrinkle on a VPS: YouTube flags datacenter IPs aggressively, and yt-dlp ends up failing with “Sign in to confirm you're not a bot” even on a clean box.

Cookies from a throwaway Google account get you most of the way (yt-dlp --cookies cookies.txt), and they last a long time as long as the account stays active.

For the videos that refuse to play from datacenter ranges entirely — Topic channels are a common case — the only fix I’ve found is to route the YouTube fetch through a residential IP. I run a tiny Cloudflare Tunnel on a Raspberry Pi at home and point yt-dlp through it with --proxy.
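Put together, launching the pipeline from Go might look roughly like this. It is a sketch under assumptions: the function names are hypothetical, the cookies path and proxy URL are whatever your setup uses, and error handling is kept to the minimum. The yt-dlp and ffmpeg flags are the ones shown above plus --cookies and --proxy.

```go
package main

import (
	"io"
	"os/exec"
)

// ytdlpArgs builds the yt-dlp argument list. cookies and proxy are optional;
// pass "" to leave either one out.
func ytdlpArgs(url, cookies, proxy string) []string {
	args := []string{"-f", "bestaudio", "--no-playlist", "-o", "-"}
	if cookies != "" {
		args = append(args, "--cookies", cookies)
	}
	if proxy != "" {
		args = append(args, "--proxy", proxy)
	}
	return append(args, url)
}

// ffmpegArgs decodes whatever arrives on stdin to 48 kHz stereo s16le on stdout.
func ffmpegArgs() []string {
	return []string{"-hide_banner", "-loglevel", "error", "-i", "pipe:0",
		"-f", "s16le", "-acodec", "pcm_s16le", "-ar", "48000", "-ac", "2", "pipe:1"}
}

// startDecode starts both processes, connects yt-dlp's stdout to ffmpeg's
// stdin, and returns ffmpeg's stdout as a stream of raw PCM. A fuller version
// would also keep the two commands around and call Wait on them.
func startDecode(url, cookies, proxy string) (io.ReadCloser, error) {
	dl := exec.Command("yt-dlp", ytdlpArgs(url, cookies, proxy)...)
	dec := exec.Command("ffmpeg", ffmpegArgs()...)

	pipe, err := dl.StdoutPipe()
	if err != nil {
		return nil, err
	}
	dec.Stdin = pipe

	out, err := dec.StdoutPipe()
	if err != nil {
		return nil, err
	}
	if err := dec.Start(); err != nil {
		return nil, err
	}
	if err := dl.Start(); err != nil {
		dec.Process.Kill() // do not leave ffmpeg waiting on a dead input
		return nil, err
	}
	return out, nil
}
```

Keeping the argument builders separate from the process wiring makes the cookie and proxy handling easy to check without touching the network.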

reading the binary, not the changelog

The cheapest always-on box I had was a Raspberry Pi (arm64) sitting at home, so the plan was to run the server there. The SDK’s changelog said arm64 was supported — with the quiet note “not strictly tested.” That caveat was enough to make me check before building on it.

So I fetched the arm64 and amd64 SDKs and compared their exported symbols with nm, instead of trusting the release notes:

symbol	arm64	amd64
bundled `.so` files	3	12
`agora_` C-API exports	183	551
`agora_rtc_conn_create`	0	2
`agora_rtm_*` (messaging)	0	67

The arm64 build is missing the whole C-API wrapper layer (the part the Go and Python SDKs actually call) and ships none of the messaging library. Only the underlying C++ engine is there.

My first thought was to try a different language SDK. But Go and Python both bind to that same missing C layer, so switching between them wouldn’t change anything: the gap is below the language, not in it. The Agora parts have to run on amd64, so the server ended up on a cheap amd64 cloud box instead of the Pi.

If the SDK ever updates, the whole question is one line:

nm -D agora_sdk/*.so | grep -c agora_rtc_conn_create
# amd64 → 2,  arm64 → 0

While that prints 0, arm64 is out.
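If you would rather have the bot run the same check at startup, Go's standard debug/elf package can read the dynamic symbol table directly. This is a rough equivalent of the nm one-liner, not a byte-for-byte match for its output; the function names and the path you pass in are assumptions.

```go
package main

import (
	"debug/elf"
	"strings"
)

// countMatching counts the names that contain substr, like grep -c on a list.
func countMatching(names []string, substr string) int {
	n := 0
	for _, name := range names {
		if strings.Contains(name, substr) {
			n++
		}
	}
	return n
}

// countInELF opens a shared object and counts dynamic symbols whose name
// contains substr. It returns an error if the file is not ELF or has no
// dynamic symbol table.
func countInELF(path, substr string) (int, error) {
	f, err := elf.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	syms, err := f.DynamicSymbols()
	if err != nil {
		return 0, err
	}
	names := make([]string, len(syms))
	for i, s := range syms {
		names[i] = s.Name
	}
	return countMatching(names, substr), nil
}
```

A count of zero for agora_rtc_conn_create across the SDK's .so files would be the signal to refuse to start on that architecture.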

a native bridge

The first version published straight from Go, through the server SDK’s audio-track API, and the music came out genuinely bad — thin, muffled, smeared with artifacts. Recognizable, but nothing you’d want to sit and listen to. This is what ended up shaping the rest of the design.

The cause is that the SDK treats whatever you publish as a voice. By default it runs the audio through the call’s voice processing — echo cancellation, noise suppression, automatic gain (Agora’s “3A”) — and leans on a narrow, low-bitrate profile tuned for speech.

That’s right for a microphone and wrong for music: the noise suppression chews up the quiet passages, the auto-gain pumps the levels, and the low bitrate blurs the rest. The high-quality stereo music profile exists in Agora’s docs, but through the Go wrapper I couldn’t get it to actually take hold.

That’s the thing I came away unhappy about: outside C++, Agora’s server SDKs aren’t mature. The Go one is a thin cgo wrapper over the C++ engine: it lags behind the engine, ships the incomplete arm64 build from above, and doesn’t reliably expose the audio settings that matter. The real engine is the C++ one; the rest are wrappers of uneven quality.

The obvious move would be to write the whole thing in C++ and be done with it. But I wanted to stay in Go, so instead I cut out the smallest possible piece: a small C++ bridge that does only one thing — take PCM and publish it — while a Go program keeps everything else (the queue, the commands, the orchestration) and talks to the bridge over stdin/stdout. Only the audio path crosses into C++.

Talking to the C++ engine directly, I could set what music actually needs. These are the settings that made the difference — and the ones that aren’t easy to piece together from Agora’s docs:

// music, high quality, stereo, with the "game streaming" scenario
rtc::AudioEncoderConfiguration enc;
enc.audioProfile = rtc::AUDIO_PROFILE_MUSIC_HIGH_QUALITY_STEREO;
localUser->setAudioEncoderConfiguration(enc);
localUser->setAudioScenario(rtc::AUDIO_SCENARIO_GAME_STREAMING);

// turn off the voice processing (the "3A") and force Opus — these are only
// reachable through private parameters, not the normal API
auto* p = conn->getAgoraParameter();
p->setParameters("{\"che.audio.aec.enable\":false}");        // echo cancellation
p->setParameters("{\"che.audio.ans.enable\":false}");        // noise suppression
p->setParameters("{\"che.audio.agc.enable\":false}");        // automatic gain
p->setParameters("{\"che.audio.custom_payload_type\":122}"); // Opus

The audio goes through a custom track — one fed by raw PCM, with no microphone or device attached — and then out 20 ms (960 samples per channel) at a time:

auto sender = factory->createAudioPcmDataSender();
auto track = service->createCustomAudioTrack(sender);
track->setEnabled(true);

conn->connect(token, channel, uid); // once "connected" fires:
localUser->publishAudio(track);

sender->sendAudioPcmData(frame, 0, 0,
                         /*samples_per_channel=*/960,
                         rtc::TWO_BYTES_PER_SAMPLE,
                         /*channels=*/2, /*sample_rate=*/48000);

Same source audio as the Go version, night-and-day result. The bridge is only here because, right now, the C++ engine is the one that reliably exposes those che.audio.* knobs. If a higher-level SDK matured enough to set them, it could go away and the whole thing could stay in one language.

The architecture, end to end:

Agora RTM chat
   ↓
!play / !skip / !volume
   ↓
Go bot
   ├─ queue / state machine
   ├─ yt-dlp
   └─ ffmpeg
   ↓  PCM + JSON commands
stdin / stdout pipe
   ↓
C++ bridge
   ├─ ring buffer
   ├─ volume / ducking / mixing
   └─ Agora C++ SDK
   ↓
Agora RTC channel
   ↓
listeners hear it

the pipe between them

So there are two processes now — the Go program and the C++ bridge — that have to talk. They do it over the bridge’s stdin and stdout: Go writes audio and commands down, the bridge writes events (connected, a user joined, an error) back up. To put two kinds of message on one pipe, each is length-prefixed:

[1 byte: type][4 bytes: length, big-endian][length bytes: payload]

The reader switches on the type byte and reads exactly length bytes before moving on, so PCM frames (0x01) and JSON control commands (0x02) share the stream without ever being mistaken for each other:

{"cmd":"volume","value":70}
{"cmd":"flush"}
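On the Go side, those two payloads could be produced with encoding/json and a small struct. The struct and function names here are hypothetical; only the JSON shapes come from the examples above.

```go
package main

import "encoding/json"

// control is a hypothetical Go shape for the JSON commands shown above.
// Value is a pointer so that 0 (a valid volume) is still sent, while
// commands without a value, such as flush, omit the field entirely.
type control struct {
	Cmd   string `json:"cmd"`
	Value *int   `json:"value,omitempty"`
}

// volumeCmd encodes a volume command with the given level.
func volumeCmd(v int) (string, error) {
	b, err := json.Marshal(control{Cmd: "volume", Value: &v})
	return string(b), err
}

// flushCmd encodes a flush command, which carries no value.
func flushCmd() (string, error) {
	b, err := json.Marshal(control{Cmd: "flush"})
	return string(b), err
}
```

The pointer is the one design choice worth noting: with a plain int and omitempty, a volume of 0 would silently drop the value field.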

Writing a frame is just a header then a payload. The whole bridge client on the Go side is standard library, no dependencies:

// 1-byte type, 4-byte big-endian length, then the payload
func (c *Client) writeFrame(t byte, payload []byte) error {
	var header [5]byte
	header[0] = t
	binary.BigEndian.PutUint32(header[1:], uint32(len(payload)))

	c.mu.Lock()
	defer c.mu.Unlock()
	if _, err := c.stdin.Write(header[:]); err != nil {
		return err
	}
	_, err := c.stdin.Write(payload)
	return err
}

backpressure for free

The naive way to pace 48 kHz audio is a timer: send one 20 ms frame, sleep, repeat, forever. But there’s a simpler mechanism already in the pipe. The bridge reads PCM off stdin into a small fixed-size ring buffer, and the thread doing that reading blocks whenever the ring is full:

void push(const uint8_t* data, size_t n) {
	std::unique_lock<std::mutex> lk(mu_);
	// block here while the ring is full — this is the backpressure
	notFull_.wait(lk, [&] { return stopping_ || buf_.size() < kHighWater; });
	if (stopping_) return;
	buf_.insert(buf_.end(), data, data + n);
	notEmpty_.notify_one();
}

When the ring fills, the bridge stops draining stdin; the OS pipe buffer fills behind it; and Go’s next write blocks until there’s room. The bridge keeps the timing — it pulls a 20 ms frame every 20 ms of wall clock — and the full pipe just keeps Go from running ahead of it. So there’s no pacing loop on the Go side at all, and no clock or drift to manage there.
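What is left on the Go side is then just a copy loop. A sketch, assuming the decoder output is chunked into 20 ms pieces before being handed to something like writeFrame (the chunk size is a choice made for this example; the function names are hypothetical):

```go
package main

import (
	"bytes"
	"io"
)

// frameSize is 20 ms of 48 kHz stereo s16le.
const frameSize = 3840

// pump copies PCM from src to send, one frame at a time, and returns the
// number of frames sent. There is no timer and no sleep: when the bridge is
// full, send blocks, and so does this loop. send must not keep the slice,
// because the buffer is reused for the next frame.
func pump(src io.Reader, send func([]byte) error) (int, error) {
	buf := make([]byte, frameSize)
	frames := 0
	for {
		n, err := io.ReadFull(src, buf)
		if n > 0 {
			// Pad a short final frame with silence.
			for i := n; i < frameSize; i++ {
				buf[i] = 0
			}
			if serr := send(buf); serr != nil {
				return frames, serr
			}
			frames++
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return frames, nil // end of track
		}
		if err != nil {
			return frames, err
		}
	}
}

// dryRun pumps an in-memory buffer into a sender that only counts bytes.
func dryRun(pcm []byte) (frames int, sent int, err error) {
	frames, err = pump(bytes.NewReader(pcm), func(b []byte) error {
		sent += len(b)
		return nil
	})
	return frames, sent, err
}
```

In the real program send would wrap the framed write to the bridge's stdin, and the blocking behavior of that write is the entire pacing mechanism.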

One small consequence of the bridge holding the buffer: !volume has to be applied at the send side, not in Go.

If Go multiplied the PCM samples before writing them down the pipe, the gain change would sit behind whatever’s already in the ring (up to a few seconds of pre-multiplied audio), and the listener would hear the new level a few seconds after the command.

So the gain knob travels as a JSON command ({"cmd":"volume","value":N}), and the bridge applies it in-place on each 20 ms frame, just before sendAudioPcmData.
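The in-place scaling itself is short. The bridge does this in C++; the Go version below only illustrates the operation, and it assumes the 0–100 value maps linearly onto the sample amplitude, which the real bridge may or may not do.

```go
package main

import "encoding/binary"

// applyGain scales every little-endian int16 sample in frame in place.
// percent is clamped to 0..100; the linear mapping is an assumption.
func applyGain(frame []byte, percent int) {
	if percent < 0 {
		percent = 0
	}
	if percent > 100 {
		percent = 100
	}
	for i := 0; i+1 < len(frame); i += 2 {
		s := int32(int16(binary.LittleEndian.Uint16(frame[i:])))
		s = s * int32(percent) / 100
		binary.LittleEndian.PutUint16(frame[i:], uint16(int16(s)))
	}
}
```

Since the gain never exceeds 100%, the scaled sample always fits back into an int16 and no clamp is needed here.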

mixing in another stream

Once the bridge owns the publish path, anything you can produce as PCM mixes in — TTS for chat read-aloud is the obvious one. The mix itself is a 20 ms add: int32 sum, clamp back to int16. Everything interesting is in what surrounds it.
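The add-and-clamp step, sketched in Go for illustration (the bridge does the equivalent in C++; the function name is made up):

```go
package main

// mixFrame sums two streams sample by sample in int32 and clamps the result
// back into the int16 range. If speech is shorter than music, the missing
// samples are treated as silence.
func mixFrame(music, speech []int16) []int16 {
	out := make([]int16, len(music))
	for i := range music {
		sum := int32(music[i])
		if i < len(speech) {
			sum += int32(speech[i])
		}
		if sum > 32767 {
			sum = 32767
		} else if sum < -32768 {
			sum = -32768
		}
		out[i] = int16(sum)
	}
	return out
}
```

Doing the sum in int32 is what makes the clamp possible at all: two int16 values added in int16 would wrap around instead of saturating.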

Both streams need headroom. Sum two full-level streams and every loud sample clips, so the music has to duck while speech plays. Where to land takes trial — too shallow and the music drowns the speech, too deep and the music vanishes.

Boosting the speech back up to compensate doesn’t work. The int16 ceiling cares about instantaneous peaks, not averages, and speech’s loud transients (consonants, word onsets) get amplified by the boost until the sum clips even with the music well below full. Ducking is one-sided: the music gets quieter, the speech’s peaks stay where they were, and the worst case stays under the ceiling.

A one-frame gain change is itself a transient, so the duck is a linear ramp over several frames and the TTS gain ramps up symmetrically. Even then, the first sample of speech can land on a loud music sample and clip, so the first TTS frame gets an extra sample-level 0→1 ramp across its samples: the first sample is silence and the rest grows into what the frame-level ramp is asking for.
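Both ramps are small. A sketch in Go, with hypothetical names and with the ramp length left as a parameter, since the text only says "several frames":

```go
package main

// rampStep moves gain one frame closer to target in steps of 1/frames, so a
// full swing between 0 and 1 takes `frames` frames and a shallower duck
// takes proportionally fewer. It never overshoots the target.
func rampStep(gain, target float64, frames int) float64 {
	if frames <= 0 {
		return target
	}
	step := 1.0 / float64(frames)
	if gain < target {
		gain += step
		if gain > target {
			gain = target
		}
	} else if gain > target {
		gain -= step
		if gain < target {
			gain = target
		}
	}
	return gain
}

// fadeIn applies a sample-level ramp across one interleaved frame: the first
// sample of every channel is scaled by 0 and the scale grows linearly toward
// 1 by the end of the frame.
func fadeIn(frame []int16, channels int) {
	if channels <= 0 {
		return
	}
	n := len(frame) / channels // samples per channel
	for i := 0; i < n; i++ {
		g := float64(i) / float64(n)
		for c := 0; c < channels; c++ {
			idx := i*channels + c
			frame[idx] = int16(float64(frame[idx]) * g)
		}
	}
}
```

rampStep would be called once per 20 ms frame for the music gain (toward the duck level) and for the speech gain (toward 1), and fadeIn only on the very first speech frame.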

Speech surfaces music-side jitter, too. An underrun in the music ring — normally a silent gap nobody notices — becomes very audible against voice, so the ring’s high-water mark needs to be more generous than pacing alone would call for. Tighter numbers showed up as continuous crackle exactly when TTS overlapped.

A safety timeout, finally: if a TTS start arrives but no PCM follows (the producer crashed mid-utterance), the bridge ends the utterance itself after a short delay so the music doesn’t stay ducked forever.

None of this is Agora-specific, but it hides behind “just sum two streams,” and you only meet it once you’re past the make-audio-come-out problem.

driving it from chat

The commands are just text, and the player doesn’t care where they come from. In a call built on Agora, the chat side is almost always Agora’s other product, RTM (real-time messaging). RTC handles the audio, RTM handles everything else.

They’re separate SDKs with separate tokens, but the channel name is shared, so the same <channel> string gets you the call’s audio on one connection and the call’s chat on another.

For chat, the right type is RTM v2’s message channel (the newer “stream channel” is shaped for ordered topic streams). Subscribing with withMessage(true) starts the message flow: each event has a sender uid and a plain string body, and what the body contains is up to the app on top. The bot just reads the body, matches anything starting with !, and publishes replies back into the same channel.
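The matching step is independent of RTM: it only sees the string body. A minimal sketch, assuming commands are case-insensitive and that everything after the first space is the argument (both are assumptions; the type and function names are made up):

```go
package main

import "strings"

// Command is a parsed chat command: "!play some song" becomes
// {Name: "play", Arg: "some song"}.
type Command struct {
	Name string
	Arg  string
}

// parseCommand reports whether body is a command and, if so, returns it.
// Ordinary chat messages return false and are ignored.
func parseCommand(body string) (Command, bool) {
	body = strings.TrimSpace(body)
	if !strings.HasPrefix(body, "!") {
		return Command{}, false
	}
	name, arg, _ := strings.Cut(body[1:], " ")
	if name == "" {
		return Command{}, false // a bare "!" or "! play"
	}
	return Command{Name: strings.ToLower(name), Arg: strings.TrimSpace(arg)}, true
}
```

The argument is left as one untouched string so that !play can treat it as either a URL or a search phrase.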

That’s only the transport, though — if a call didn’t expose RTM, the same commands could arrive over a websocket, an HTTP endpoint, or even stdin, and nothing downstream would change.

Wherever they come from, a single owner — one goroutine — holds the queue and the playing state. Instead of guarding it with a mutex, every command is sent to that goroutine over a channel and handled in order. Only one place ever touches that state, so there’s nothing to race.
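The single-owner pattern can be sketched in a few lines. This is a simplified stand-in, not the bot's code: the request type and function names are hypothetical, only three commands are handled, and the head of the slice doubles as the current track.

```go
package main

// request is one command for the owner goroutine. reply, if set, receives a
// copy of the queue as it looks after the command has been handled.
type request struct {
	name  string
	arg   string
	reply chan []string
}

// runPlayer owns the queue. It is the only goroutine that ever reads or
// writes it, so no mutex is needed; commands are handled strictly in the
// order they arrive on reqs. It returns when reqs is closed.
func runPlayer(reqs <-chan request) {
	var queue []string
	for r := range reqs {
		switch r.name {
		case "play":
			queue = append(queue, r.arg)
		case "skip":
			if len(queue) > 0 {
				queue = queue[1:]
			}
		case "stop":
			queue = nil
		}
		if r.reply != nil {
			// Send a copy so the caller never shares the owner's slice.
			r.reply <- append([]string(nil), queue...)
		}
	}
}

// do sends one command and waits for the resulting queue snapshot.
func do(reqs chan<- request, name, arg string) []string {
	reply := make(chan []string, 1)
	reqs <- request{name: name, arg: arg, reply: reply}
	return <-reply
}
```

Callers from any transport (RTM, a websocket, stdin) all go through do, which is what keeps ordering and ownership in one place.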

!play resolves and decodes in a short-lived worker, then hands the decoded stream to the player. !skip closes the current stream so the next one starts. !stop empties the queue and tells the bridge to flush what it has buffered. Small, boring, and easy to reason about — the right shape for the part users actually touch.

a personal tool

This is something I run for myself, in my own calls. Streaming music you don’t own into a room sits in a gray area — platform terms, copyright. I built it because turning a link into sound for a room of people is a fun problem, and because the constraints above were worth working through once and writing down.

It’s plain to use, which is what I wanted: type !play, and the room hears it — the same routine I knew from Discord, now in a call that didn’t have it.