stealth startup
timeline and tech
january 2024 - april 2024
- python
- cuda
- pytorch
- huggingface
- various other libraries and frameworks (ml and python)
role
the stealth startup i worked for is a miami-based vr/xr startup working on integrating operating systems into extended and virtual reality. i joined the project replicant team working on local llm inference. unfortunately, i cannot disclose more information about the company/project due to confidentiality reasons, so i'll keep this section generalized and update it as best i can.
work product
working at a startup meant wearing a lot of hats. one of my key project areas was real-time voice conversion, and i spent time understanding and optimizing the stack from model architecture to runtime performance.
voice conversion fundamentals
like i said, one of the main technical problems i worked on was real-time voice conversion (rvc), where the goal is to transform one speaker's voice characteristics into another speaker's identity while preserving the linguistic content. conceptually, this maps to source-filter speech theory: the source signal (vocal vibration) is preserved as content, while vocal tract characteristics are transformed to shift identity. in practice, that means modeling and controlling pitch/f0, timbre/spectral envelope, and prosody/duration without losing intelligibility.
modern rvc systems use neural encoder-decoder pipelines rather than older statistical approaches. the encoder extracts content-oriented latent features, speaker embeddings represent target identity, and the decoder/vocoder reconstructs audio with the desired voice profile. training quality comes from balancing reconstruction objectives with adversarial/perceptual constraints so converted audio sounds natural rather than robotic.
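to make the encoder/speaker-embedding/decoder data flow concrete, here's a toy numpy sketch. the matrices are random placeholders and the dimensions (80 mel bins, 256-d content latent, 64-d speaker embedding) are hypothetical, not the real system: a real pipeline learns these weights and reconstructs audio with a neural vocoder, not a linear projection.

```python
import numpy as np

rng = np.random.default_rng(0)

n_frames, n_mels = 100, 80     # ~1 s of mel-spectrogram frames (hypothetical sizes)
d_content, d_speaker = 256, 64

# "encoder": project mel frames into a content-oriented latent space
W_enc = rng.standard_normal((n_mels, d_content)) * 0.01
mels = rng.standard_normal((n_frames, n_mels))
content = mels @ W_enc                          # (100, 256) linguistic content

# target speaker identity: one fixed embedding broadcast over every frame
spk = rng.standard_normal(d_speaker)
spk_frames = np.tile(spk, (n_frames, 1))        # (100, 64)

# "decoder": reconstruct mel frames from content + target identity
W_dec = rng.standard_normal((d_content + d_speaker, n_mels)) * 0.01
converted = np.concatenate([content, spk_frames], axis=1) @ W_dec

print(converted.shape)  # (100, 80): same frame timing, new speaker identity
```

the key structural point this shows: content and identity are separate inputs to the decoder, so swapping the speaker embedding changes the voice without touching the content representation.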
inference and architecture
a big part of the work was understanding inference as a compute problem, not just a model problem. once training is complete, runtime is effectively a forward pass through large matrix operations and nonlinear transformations. as model size grows, inference cost scales quickly, so architecture choices directly affect latency budgets. i spent time thinking about this from both sides: model-level representation quality and systems-level execution cost.
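the "inference cost scales quickly" claim is easy to check with back-of-envelope math: an (m x k) @ (k x n) matmul costs roughly 2*m*k*n flops, so a forward pass is just the sum over layers. the layer sizes here are illustrative, not the actual model's.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """multiply + accumulate count for an (m x k) @ (k x n) product."""
    return 2 * m * k * n

def mlp_forward_flops(seq_len: int, widths: list[int]) -> int:
    """total flops for one forward pass through a chain of linear layers."""
    total = 0
    for d_in, d_out in zip(widths, widths[1:]):
        total += matmul_flops(seq_len, d_in, d_out)
    return total

# doubling the hidden width doubles the dominant matmul cost here,
# since the hidden dimension appears once in each of the two matmuls
small = mlp_forward_flops(100, [80, 256, 80])  # 8,192,000 flops
big = mlp_forward_flops(100, [80, 512, 80])    # 16,384,000 flops
print(small, big)
```

this is why architecture choices directly set the latency budget: every extra layer or wider hidden dimension is a fixed compute tax paid on every audio frame.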
the original deployment shape relied on cloud api calls. that architecture is convenient, but for real-time audio it introduces unavoidable latency from network round-trip, serialization/deserialization, queue time, and tls/session overhead. even with a fast backend, physics puts a floor under internet latency, which is the wrong profile for interactive voice loops.
latency sources in cloud api:
- total latency (typical): ~500 ms
- network latency: ~100-200 ms (this is kinda slow, ideally you'd want <50 ms regionally)
  - propagation delay: speed-of-light bound in fiber
  - transmission delay: time to place bits on wire
  - queuing delay: router/buffer wait time
  - round-trip time (rtt): request + response path
- serialization/deserialization: ~10-30 ms (depends on payload)
- server queue time: ~50-100 ms (on shared infra)
- inference time (cloud): ~100-200 ms
- connection setup (tls/auth): tls handshake adds ~1 rtt if the connection is not reused
why this stays slow: even perfect infrastructure cannot beat physics. network distance enforces a hard latency floor, so cloud-first inference struggles to hit strict real-time interaction targets.
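summing a rough budget makes the argument concrete. the numbers below are the typicals from the list above (each taken near the middle of its range), not measurements, and the local inference figure is a hypothetical on-device forward pass.

```python
# illustrative latency budget: cloud path vs local path, in milliseconds
cloud_path_ms = {
    "network rtt": 150,      # propagation + transmission + queuing, ~100-200 ms
    "serialization": 20,     # ~10-30 ms, payload dependent
    "server queue": 75,      # ~50-100 ms on shared infra
    "cloud inference": 150,  # ~100-200 ms
    "tls setup": 100,        # ~1 extra rtt when the connection is not reused
}
local_path_ms = {
    "local inference": 30,   # hypothetical on-device forward pass
}

print(sum(cloud_path_ms.values()), sum(local_path_ms.values()))  # 495 vs 30
```

notice that even if cloud inference itself were free, the other four terms alone still land well above an interactive voice budget, which is the whole case for moving inference local.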
local runtime strategy
the most important shift was moving inference local. edge execution removed the network path entirely and collapsed the latency chain to device-local compute and memory movement. this is where onnx and onnx runtime mattered most: onnx gave us a portable graph format, and the runtime gave us an optimized execution engine with graph-level transformations like operator fusion, constant folding, and backend-specific scheduling. these are all optimizations onnx runtime applies before graph execution, so the theory matters less than the effect. operator fusion is a good one to know: it combines multiple operations into a single kernel, which saves memory traffic and means fewer kernel launches. quantization is also worth knowing: it reduces the numerical precision of the model so it can run on cheaper hardware like a cpu instead of a stronger gpu.
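to show what quantization actually does, here's a minimal sketch of symmetric int8 quantization, written from scratch rather than taken from onnx runtime's implementation: store each value as an 8-bit integer plus one shared float scale, trading a small bounded error for 4x less memory than float32 and cheaper integer arithmetic.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """map floats into [-127, 127] using a single symmetric scale factor."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """recover approximate floats from the int8 codes."""
    return [x * scale for x in q]

weights = [0.8, -1.2, 0.05, 0.0, 1.19]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)  # small ints instead of 32-bit floats
print(max_err <= scale / 2)  # rounding error is bounded by half a step
```

real toolchains add per-channel scales, zero points for asymmetric ranges, and calibration data, but the core trade is exactly this: bounded precision loss for memory and compute savings.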
memory hierarchy:
- cpu registers: ~0.5 ns
- l1 cache: ~1 ns
- l2 cache: ~4 ns
- l3 cache: ~10 ns
- ram: ~100 ns
- ssd: ~100 μs
- network (local): ~1 ms
local inference stays within the fastest memory tiers.
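one way to feel those tiers: count how many dependent (serial) accesses fit inside a 10 ms real-time audio frame at each latency from the list above. the frame budget is illustrative, not the product's actual frame size.

```python
# serial accesses that fit in one 10 ms audio frame, per memory tier
frame_budget_ns = 10_000_000  # 10 ms expressed in nanoseconds
tier_latency_ns = {
    "l1 cache": 1,
    "l3 cache": 10,
    "ram": 100,
    "ssd": 100_000,
    "network (local)": 1_000_000,
}
for tier, ns in tier_latency_ns.items():
    print(f"{tier}: {frame_budget_ns // ns:,} serial accesses per frame")
```

ten serial hops over even a local network eat the whole frame, while cache-resident data allows millions of operations in the same budget, which is why keeping model weights hot in local memory is the win.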
performance engineering takeaways
the biggest learning was that latency optimization is always full-stack. meaningful gains came from network removal, model/runtime optimization, and system-level execution discipline together, not from one isolated trick. for voice systems, percentile latencies matter more than average latency, so measurement had to distinguish cold vs warm paths.
this wasn't the only project i worked on, but it was probably the most technically challenging. there are other cool things i did like tts ner (named entity recognition), chat markup language prompt ingestion, and fastapi optimizations.