streaming music with agora
2026.05.28
If you’ve used Discord, you know the music-bot routine:
someone drops a !play <link> in chat, a bot joins the voice channel, and the
whole room hears it. I wanted that exact thing in a different voice app — one
built on Agora — so I rebuilt a music bot from scratch
around it. I like Go, so the plan was to keep the whole thing
in Go, one language end to end. But little of what follows is actually
Go-specific — the SDK gaps and the audio work below would look much the same in
any language whose Agora SDK you’d reach for. This is how it works, and a few
things I learned about Agora’s server SDK on the way.
what it does
It joins a voice channel as a participant and publishes audio, the same way a
person’s microphone would — and it takes the commands you’d expect from a
Discord music bot, typed into the call’s chat with a ! prefix:
!play <url | search> - queue a track, or start it if nothing’s playing
!skip - skip to the next track in the queue
!stop - stop and clear the queue
!pause / !resume - pause and pick back up
!queue - show what’s lined up
!np - what’s playing now, with elapsed time
!volume <0–100> - set the output level
!clear - empty the queue but keep the current track
!help - list the commands
Behind that is a queue and a small state machine — idle, loading, playing, paused. None of that was the hard part, though. The hard part — and the rest of this post — was getting the music to actually sound good once it reached the call.
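For reference, the four states named above could be modeled roughly like this. It is a minimal sketch, not the bot's actual code: the state names come from the text, but the transition table is an assumption made for illustration.

```go
package main

import "fmt"

// State is the player's playback state; the four names come from the post.
type State int

const (
	Idle State = iota
	Loading
	Playing
	Paused
)

func (s State) String() string {
	return [...]string{"idle", "loading", "playing", "paused"}[s]
}

// next lists which states each state may move to. This table is an
// assumption for illustration; the post does not spell it out.
var next = map[State][]State{
	Idle:    {Loading},
	Loading: {Playing, Idle},          // decode worked, or it failed / was stopped
	Playing: {Paused, Loading, Idle},  // !pause, next track, or queue ran dry / !stop
	Paused:  {Playing, Loading, Idle}, // !resume, !skip, or !stop
}

// Transition returns the new state, or the old state and an error if the
// move is not in the table.
func Transition(from, to State) (State, error) {
	for _, s := range next[from] {
		if s == to {
			return to, nil
		}
	}
	return from, fmt.Errorf("invalid transition %s -> %s", from, to)
}
```

Keeping the allowed moves in one table means a command handler can ask "is this legal right now?" instead of scattering that check across every command.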
publishing audio
Agora’s server SDK lets you push raw PCM into a channel. So the job is to turn “a YouTube link” into a stream of raw samples and hand them over at the right pace. Two tools do the heavy lifting: yt-dlp resolves and downloads the audio, and ffmpeg decodes it to signed 16-bit PCM.
yt-dlp -f bestaudio --no-playlist -o - "<url>" \
  | ffmpeg -hide_banner -loglevel error -i pipe:0 \
      -f s16le -acodec pcm_s16le -ar 48000 -ac 2 pipe:1

That gives 48 kHz, stereo, s16le, published with a music-quality profile and
the call’s noise/echo processing turned off, since it’s music, not a voice. One
second of it is 48000 × 2 channels × 2 bytes = 192 KB, which matters once you
start pacing it out twenty milliseconds at a time.
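The same arithmetic, written out for an arbitrary frame length. This is a small sketch; the constant and function names are made up for the example, and the numbers are the ones from the ffmpeg flags above.

```go
package main

const (
	sampleRate     = 48000 // Hz, from -ar 48000
	channels       = 2     // from -ac 2
	bytesPerSample = 2     // s16le: signed 16-bit little-endian
)

// bytesForMillis returns how many bytes of PCM cover ms milliseconds.
func bytesForMillis(ms int) int {
	samplesPerChannel := sampleRate * ms / 1000
	return samplesPerChannel * channels * bytesPerSample
}
```

bytesForMillis(1000) is 192000, the 192 KB per second from above, and bytesForMillis(20) is 3840: 960 samples per channel, times two channels, times two bytes.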
reading the binary, not the changelog
The plan was to run this on a Raspberry Pi (arm64) and keep it cheap and always-on. The SDK’s changelog said arm64 was supported — with the quiet note “not strictly tested.” That caveat was enough to make me check before building on it.
So I fetched the arm64 and amd64 SDKs and compared their exported symbols with nm, instead of trusting the release notes:
| what was counted | arm64 | amd64 |
|---|---|---|
| bundled .so files | 3 | 12 |
| agora_ C-API exports | 183 | 551 |
| agora_rtc_conn_create | 0 | 2 |
| agora_rtm_* (messaging) | 0 | 67 |
The arm64 build is missing the C-API wrapper layer that the Go (and Python) SDKs actually call: agora_rtc_conn_create isn’t exported at all. It also ships none of the messaging library. What’s left is essentially the underlying C++ engine.
My first thought was to try a different language SDK. But Go and Python both bind to that same missing C layer, so switching between them wouldn’t change anything — the gap is below the language, not in it. The Agora parts have to run on amd64, and no language choice gets around that.
If the SDK ever updates, the whole question is one line:
nm -D agora_sdk/*.so | grep -c agora_rtc_conn_create
# amd64 → 2, arm64 → 0

While that prints 0, arm64 is out.
a native bridge
The first version published straight from Go, through the server SDK’s audio-track API, and the music came out genuinely bad — thin, muffled, smeared with artifacts. Recognizable, but nothing you’d want to sit and listen to. This is what ended up shaping the rest of the design.
The cause is that the SDK treats whatever you publish as a voice. By default it runs the audio through the call’s voice processing — echo cancellation, noise suppression, automatic gain (Agora’s “3A”) — and leans on a narrow, low-bitrate profile tuned for speech. That’s right for a microphone and wrong for music: the noise suppression chews up the quiet passages, the auto-gain pumps the levels, and the low bitrate blurs the rest. The high-quality stereo music profile exists in Agora’s docs, but through the Go wrapper I couldn’t get it to actually take hold.
That’s the thing I came away unhappy about: outside C++, Agora’s server SDKs aren’t mature. The Go one is a thin cgo wrapper over the C++ engine — it lags behind it, ships the incomplete arm64 build from above, and doesn’t reliably expose the audio settings that matter. The real engine is the C++ one; the rest are wrappers of uneven quality.
The obvious move would be to write the whole thing in C++ and be done with it. But I wanted to stay in Go, so instead I cut out the smallest possible piece: a small C++ bridge that does only one thing — take PCM and publish it — while a Go program keeps everything else (the queue, the commands, the orchestration) and talks to the bridge over stdin/stdout. Only the audio path crosses into C++.
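On the Go side, starting a child process and holding both ends of its pipes takes only the standard library. The sketch below shows one way that could look, under assumptions: the Bridge type, StartBridge, and the binary path are hypothetical names for the example, and passing stderr through for logs is my choice, not something the design requires.

```go
package main

import (
	"io"
	"os"
	"os/exec"
)

// Bridge holds the child process and the two ends of its pipes.
type Bridge struct {
	cmd    *exec.Cmd
	Stdin  io.WriteCloser // Go -> bridge: audio and commands
	Stdout io.ReadCloser  // bridge -> Go: events
}

// StartBridge launches the bridge binary at path (a hypothetical location)
// and wires up its stdin and stdout. Stderr is passed through for logs.
func StartBridge(path string, args ...string) (*Bridge, error) {
	cmd := exec.Command(path, args...)
	stdin, err := cmd.StdinPipe()
	if err != nil {
		return nil, err
	}
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return nil, err
	}
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return &Bridge{cmd: cmd, Stdin: stdin, Stdout: stdout}, nil
}

// Close closes stdin so the bridge sees EOF, then waits for it to exit.
func (b *Bridge) Close() error {
	b.Stdin.Close()
	return b.cmd.Wait()
}
```

Everything the Go program knows about Agora then sits behind those two pipes, which is what would let the bridge be swapped out later.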
Talking to the C++ engine directly, I could set what music actually needs. These are the settings that made the difference — and the ones I had the hardest time piecing together from Agora’s docs:
// music, high quality, stereo, with the "game streaming" scenario
rtc::AudioEncoderConfiguration enc;
enc.audioProfile = rtc::AUDIO_PROFILE_MUSIC_HIGH_QUALITY_STEREO;
localUser->setAudioEncoderConfiguration(enc);
localUser->setAudioScenario(rtc::AUDIO_SCENARIO_GAME_STREAMING);
// turn off the voice processing (the "3A") and force Opus — these are only
// reachable through private parameters, not the normal API
auto* p = conn->getAgoraParameter();
p->setParameters("{\"che.audio.aec.enable\":false}"); // echo cancellation
p->setParameters("{\"che.audio.ans.enable\":false}"); // noise suppression
p->setParameters("{\"che.audio.agc.enable\":false}"); // automatic gain
p->setParameters("{\"che.audio.custom_payload_type\":122}"); // Opus

The audio goes through a custom track (one fed by raw PCM, with no microphone or device attached) and then goes out 20 ms (960 samples per channel) at a time:
auto sender = factory->createAudioPcmDataSender();
auto track = service->createCustomAudioTrack(sender);
track->setEnabled(true);
conn->connect(token, channel, uid); // once "connected" fires:
localUser->publishAudio(track);
sender->sendAudioPcmData(frame, 0, 0,
                         /*samples_per_channel=*/960,
                         rtc::TWO_BYTES_PER_SAMPLE,
                         /*channels=*/2, /*sample_rate=*/48000);

Same source audio as the Go version, night-and-day result. The bridge is only
here because, right now, the C++ engine is the one that reliably exposes those che.audio.* knobs. If a higher-level SDK matured enough to set them, it could
go away and the whole thing could stay in one language.
the pipe between them
So there are two processes now — the Go program and the C++ bridge — that have to talk. They do it over the bridge’s stdin and stdout: Go writes audio and commands down, the bridge writes events (connected, a user joined, an error) back up. To put two kinds of message on one pipe, each is length-prefixed:
[1 byte: type][4 bytes: length, big-endian][length bytes: payload]

The reader switches on the type byte and reads exactly length bytes before
moving on, so PCM frames (0x01) and JSON control commands (0x02) share the
stream without ever being mistaken for each other:
{"cmd":"volume","value":70}
{"cmd":"flush"}

Writing a frame is just a header then a payload. The whole bridge client on the Go side is standard library, no dependencies:
// 1-byte type, 4-byte big-endian length, then the payload
func (c *Client) writeFrame(t byte, payload []byte) error {
	var header [5]byte
	header[0] = t
	binary.BigEndian.PutUint32(header[1:], uint32(len(payload)))
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, err := c.stdin.Write(header[:]); err != nil {
		return err
	}
	_, err := c.stdin.Write(payload)
	return err
}

backpressure for free
The naive way to pace 48 kHz audio is a timer: send one 20 ms frame, sleep, repeat, forever. But there’s a simpler mechanism already in the pipe. The bridge reads PCM off stdin into a small fixed-size ring buffer, and the read side blocks whenever that ring is full:
void push(const uint8_t* data, size_t n) {
  std::unique_lock<std::mutex> lk(mu_);
  // block here while the ring is full; this is the backpressure
  notFull_.wait(lk, [&] { return stopping_ || buf_.size() < kHighWater; });
  if (stopping_) return;
  buf_.insert(buf_.end(), data, data + n);
  notEmpty_.notify_one();
}

When the ring fills, the bridge stops draining stdin; the OS pipe buffer fills behind it; and Go’s next write blocks until there’s room. The bridge keeps the timing: it pulls a 20 ms frame every 20 ms of wall clock, and the full pipe just keeps Go from running ahead of it. So there’s no pacing loop on the Go side at all, and no clock or drift to manage there.
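To make "no pacing loop" concrete, here is roughly what the Go-side feed can shrink to. It is a sketch under assumptions, not the bot's actual code: pump, typePCM, frameBytes, and frameSizes are names invented for the example, the writeFrame argument stands in for a function with the same shape as the one shown earlier, and sending a short final frame as-is is my choice.

```go
package main

import (
	"bytes"
	"io"
)

const (
	typePCM    = 0x01 // frame type for PCM, as given in the post
	frameBytes = 3840 // 20 ms of 48 kHz stereo s16le: 960 * 2 * 2
)

// pump reads decoded PCM from src one frame at a time and hands each frame
// to writeFrame. There is no timer or sleep here: when the pipe behind
// writeFrame is full, the write blocks, and that alone slows the loop down.
func pump(src io.Reader, writeFrame func(t byte, payload []byte) error) error {
	buf := make([]byte, frameBytes)
	for {
		n, err := io.ReadFull(src, buf)
		if n > 0 {
			if werr := writeFrame(typePCM, buf[:n]); werr != nil {
				return werr
			}
		}
		switch err {
		case nil:
			// a full frame was sent; keep going
		case io.EOF, io.ErrUnexpectedEOF:
			return nil // end of track; any short final frame was sent as-is
		default:
			return err
		}
	}
}

// frameSizes runs pump over an in-memory track and records the size of each
// frame it would send. It exists only to make the sketch easy to try.
func frameSizes(pcm []byte) ([]int, error) {
	var sizes []int
	err := pump(bytes.NewReader(pcm), func(t byte, payload []byte) error {
		sizes = append(sizes, len(payload))
		return nil
	})
	return sizes, err
}
```

Fed two full frames plus 100 trailing bytes, frameSizes reports 3840, 3840, 100. In a real run src would be ffmpeg's stdout, and the loop would spend most of its time parked inside the blocking write.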
driving it from chat
The commands are just text, and the player doesn’t care where they come from.
In my case the call already carries its chat over Agora’s RTM (its real-time
messaging side), so the bot subscribes there and treats any message starting
with ! as a command. But that’s only the transport — if a call didn’t expose
RTM, the same commands could arrive over a websocket, an HTTP endpoint, or even
stdin, and nothing downstream would change.
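Because the transport doesn't matter, recognizing a command can be a pure function over a string. A minimal sketch, assuming commands are case-insensitive and that surrounding whitespace should be ignored (neither is stated above); parseCommand is a name made up for the example.

```go
package main

import "strings"

// parseCommand splits a chat message like "!play some song" into a lowercase
// command name and the rest of the line. ok is false for ordinary chat.
func parseCommand(msg string) (name, arg string, ok bool) {
	msg = strings.TrimSpace(msg)
	if !strings.HasPrefix(msg, "!") {
		return "", "", false
	}
	// Everything up to the first space is the command; the rest is its argument.
	name, arg, _ = strings.Cut(msg[1:], " ")
	if name == "" {
		return "", "", false // a bare "!" or "! play" is not a command
	}
	return strings.ToLower(name), strings.TrimSpace(arg), true
}
```

The same function works whether the text arrived over RTM, a websocket, or a terminal, since all it ever sees is the message body.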
Wherever they come from, a single owner — one goroutine — holds the queue and the playing state. Instead of guarding it with a mutex, every command is sent to that goroutine over a channel and handled in order. Only one place ever touches that state, so there’s nothing to race.
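A stripped-down version of that single-owner pattern might look like the following. It is a sketch, not the bot's code: the command struct, the reply channel, and the reply strings are assumptions for illustration, and the queue holds plain strings instead of tracks.

```go
package main

import "fmt"

// command is one chat command plus a channel for the reply text.
type command struct {
	name, arg string
	reply     chan string
}

// run owns the queue. It is the only code that reads or writes it, so no
// mutex is needed; everyone else talks to it through cmds.
func run(cmds <-chan command) {
	var queue []string
	for c := range cmds {
		switch c.name {
		case "play":
			queue = append(queue, c.arg)
			c.reply <- fmt.Sprintf("queued %q (%d in queue)", c.arg, len(queue))
		case "skip":
			if len(queue) > 0 {
				queue = queue[1:]
			}
			c.reply <- fmt.Sprintf("%d left", len(queue))
		case "stop":
			queue = nil
			c.reply <- "stopped"
		case "queue":
			c.reply <- fmt.Sprint(queue)
		default:
			c.reply <- "unknown command"
		}
	}
}

// send submits one command and waits for its reply.
func send(cmds chan<- command, name, arg string) string {
	reply := make(chan string, 1)
	cmds <- command{name: name, arg: arg, reply: reply}
	return <-reply
}
```

Started with `go run(cmds)` on an unbuffered channel, commands are handled strictly in arrival order, so a !skip sent right after a !play can never observe a half-updated queue.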
!play resolves and decodes in a short-lived worker, then hands the decoded
stream to the player. !skip closes the current stream so the next one starts. !stop empties the queue and tells the bridge to flush what it has buffered.
Small, boring, and easy to reason about — which is what you want for the part
users actually touch.
a personal tool
This is something I run for myself, in my own calls. Streaming music you don’t own into a room sits in a gray area — platform terms, copyright. I built it because turning a link into sound for a room of people is a fun problem, and because the constraints above were worth working through once and writing down.
It’s plain to use, which is what I wanted: type !play, and the room hears it —
the same routine I knew from Discord, now in a call that didn’t have it.