AI-Powered Music Generation: How We Integrated ACE-Step 1.5 Into Arnaldus

Date: 2026-03-01
Author: Mach5 Engineering

When we set out to build Arnaldus — a cultural asset investment platform with AI-powered creative tools — we wanted real AI music generation. Not MIDI loops. Not 8-bit samples. Full 48kHz stereo audio synthesized by a transformer model. Here’s the engineering story of integrating ACE-Step 1.5.

The Vision: 100% Artist Ownership + AI Composition

Arnaldus’s philosophy is radical: creators and investors share ownership transparently. No middlemen. No opacity. But we wanted to go further — give every creator access to AI-powered composition tools. Not to replace artists, but to accelerate them.

The challenge: most AI music models are research projects. Getting them to run reliably in production, on real hardware, with acceptable latency, is a different problem entirely.

The ACE-Step Architecture

ACE-Step 1.5 uses a two-stage pipeline:

Stage 1: Diffusion Transformer (DiT)

The DiT takes a text prompt ("ambient electronic track with reverb-heavy pads, 120 BPM") and generates latent representations of the audio. Think of latents as a compressed mathematical description of what the music should sound like.

Key specs:

  • Model size: 10.1GB transformer core
  • Input: Text prompt + optional style conditioning
  • Output: Latent tensor (compressed audio representation)
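
The diffusion loop in stage 1 can be sketched roughly as follows. Everything here is illustrative: the function names, the latent shape, and the simplified Euler-style update are assumptions for exposition, not the actual ACE-Step API.

```python
import torch

@torch.no_grad()
def generate_latents(dit: torch.nn.Module, prompt_embedding: torch.Tensor,
                     steps: int = 30) -> torch.Tensor:
    """Denoise random latents into a compressed audio representation."""
    # Illustrative latent shape: (batch, channels, time); the real model's differs.
    latents = torch.randn(1, 64, 1024)
    for _ in range(steps):
        # The DiT predicts the noise to remove, conditioned on the prompt.
        noise_pred = dit(latents, prompt_embedding)
        latents = latents - noise_pred / steps  # simplified Euler-style update
    return latents
```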

Stage 2: VAE Decoder (AutoencoderOobleck)

The VAE (Variational Autoencoder) takes the latent representation and reconstructs full-fidelity audio:

  • Output format: 48kHz stereo WAV
  • Architecture: Oobleck VAE optimized for audio reconstruction
  • Quality: Broadcast-quality audio suitable for distribution
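
Conceptually, the decoder upsamples latent frames back into audio samples. The toy stand-in below shows the shape contract (latent frames in, stereo waveform out); the channel counts, strides, and layer choices are illustrative, not the real Oobleck architecture.

```python
import torch
import torch.nn as nn

class ToyAudioDecoder(nn.Module):
    """Toy stand-in for a VAE audio decoder: latents -> stereo waveform."""

    def __init__(self, latent_channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            # Each transposed conv multiplies the time axis by its stride.
            nn.ConvTranspose1d(latent_channels, 32, kernel_size=16, stride=16),
            nn.GELU(),
            nn.ConvTranspose1d(32, 2, kernel_size=16, stride=16),  # 2 = stereo
            nn.Tanh(),  # bound samples to [-1, 1] for WAV encoding
        )

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # (batch, latent_channels, frames) -> (batch, 2, frames * 256)
        return self.net(latents)
```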

The Engineering Challenges

Running on Apple Silicon

Our development and early production environment runs on Apple Silicon Macs. This meant getting PyTorch's MPS (Metal Performance Shaders) backend to work with a 10GB transformer model.

The problems we solved:

  1. Meta Tensor crashes: The model's lazy-loading mechanism conflicted with MPS. We implemented custom tensor materialization.
  2. dtype mismatches: The model expects BFloat16 but MPS doesn't fully support it. We built automatic dtype normalization (BFloat16 → Float32 → MPS inference → Float32 output).
  3. Memory management: 10.1GB on a GPU with shared system memory requires careful allocation. We implemented explicit garbage collection between inference passes.
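
The dtype round-trip from item 2 and the explicit cleanup from item 3 can be sketched together. This is a simplified version of what the text describes, with a CPU fallback so it runs anywhere; the helper name is ours, not part of any library.

```python
import gc
import torch

def run_on_mps(model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """BFloat16 weights -> Float32 for MPS inference -> Float32 output."""
    device = "mps" if torch.backends.mps.is_available() else "cpu"
    # MPS lacks full BFloat16 support, so normalize everything to Float32.
    model = model.to(dtype=torch.float32, device=device)
    with torch.no_grad():
        out = model(x.to(dtype=torch.float32, device=device))
    result = out.float().cpu()
    # Explicit cleanup between passes: a 10.1GB model on shared system
    # memory leaves little headroom for leaked intermediates.
    del out
    gc.collect()
    if device == "mps":
        torch.mps.empty_cache()
    return result
```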

Latent Permutation

The transformer outputs latents in a specific dimension ordering that doesn't match the VAE's expected input format. We discovered this the hard way when the VAE produced white noise instead of music. The fix: a custom permutation layer that reorders latent dimensions before VAE decoding.
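
The fix amounts to a single `permute` before decoding. The concrete orderings below are assumptions for illustration (DiT emitting `(batch, time, channels)`, VAE expecting `(batch, channels, time)`); the real models' layouts may differ, but the repair is the same kind of reorder.

```python
import torch

def permute_for_vae(latents: torch.Tensor) -> torch.Tensor:
    """Reorder DiT output dimensions to the layout the VAE decoder expects."""
    # Assumed layouts: (batch, time, channels) -> (batch, channels, time).
    # .contiguous() materializes the new memory layout for the decoder.
    return latents.permute(0, 2, 1).contiguous()
```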

Browser Playback

Generated audio files need to play in browsers. We hit the "No supported source" error because our initial pipeline produced WAV files without proper MIME type headers. The fix: automatic S3 MIME type detection and content-type header injection in our MinIO storage layer.
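
The detection side of that fix can be done with the standard library alone. A minimal sketch, with our own helper name; note that some platforms map `.wav` to `audio/x-wav` by default, so we pin the mapping explicitly:

```python
import mimetypes

# Pin .wav to the type browsers expect; some platforms default to
# audio/x-wav, which certain players reject.
mimetypes.add_type("audio/wav", ".wav")

def content_type_for(key: str, default: str = "application/octet-stream") -> str:
    """Content-Type header to attach when storing a generated audio object."""
    guessed, _ = mimetypes.guess_type(key)
    return guessed or default
```

With the MinIO Python client, this value would be passed as the `content_type` argument to `put_object`, so the presigned URL serves the file with the right header.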

The Full Stack

From prompt to playback:

User Prompt → Text Encoder → DiT Transformer → Latent Permutation →
AutoencoderOobleck VAE → WAV Encoding → MinIO Storage →
Presigned URL → Browser Playback

Backend stack:

  • FastAPI for the inference API
  • PyTorch with MPS/CPU backends
  • Docker for reproducible environments
  • MinIO for audio file storage
  • PostgreSQL for track metadata

Frontend stack:

  • Next.js 16 with React 19
  • Web Audio API for playback controls
  • "Midnight Neon" design aesthetic

The Result

Arnaldus now generates broadcast-quality AI music directly in the platform. Creators can:

  1. Describe what they want in natural language
  2. Generate multiple variations in seconds
  3. Download full 48kHz stereo files
  4. Publish directly to the Arnaldus marketplace
  5. Keep 100% ownership — forever

What This Demonstrates

AI integration isn't about calling an API. It's about:

  • Understanding model architectures well enough to debug dtype crashes
  • Building infrastructure that handles 10GB models gracefully
  • Solving the last-mile problems (MIME types, browser playback, storage)
  • Creating a user experience that hides all this complexity

This is what a venture studio delivers end-to-end.


Building AI-powered creative tools? Let's talk architecture.