Get Started with FlashAI Voice Agents!
Production-ready voice AI solutions powered by Chroma | Open-source model for developers & researchers
Chroma 1.0 is an advanced multimodal model developed by FlashLabs. It is designed to understand and generate content across multiple modalities, including text and audio. As a virtual human model, Chroma possesses the ability to process auditory inputs and respond with both text and synthesized speech, enabling natural voice interactions.
- Model Type: Multimodal Causal Language Model
- Developed by: FlashLabs
- Language(s): English
- License: Apache-2.0
- Model Architecture:
  - Reasoner: Based on Qwen2.5-Omni-3B
  - Backbone: Based on Llama3 (16 layers, 2048 hidden size)
  - Decoder: Based on Llama3 (4 layers, 1024 hidden size)
  - Codec: Mimi (24kHz sampling rate)
Chroma 1.0 is capable of:
- Speech Understanding: Processing user audio input directly.
- Multimodal Generation: Generating coherent text and speech responses simultaneously.
- Voice Cloning: Utilizing reference audio prompts to guide speech generation style.
- Python 3.11 or higher
- CUDA 12.6 or compatible version (for GPU support)
# Clone the repository
git clone https://github.com/FlashLabs-AI-Corp/FlashLabs-Chroma.git
cd FlashLabs-Chroma
# Optional: Create a new conda environment with Python 3.11
conda create -n chroma python=3.11 -y
conda activate chroma
# Install dependencies in the correct order
pip install -r requirements.txt
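
After installing, a quick sanity check can confirm that the environment matches the requirements above (a minimal sketch; the exact versions pinned in requirements.txt are authoritative):

import sys
import torch

print(sys.version_info[:2])       # expect (3, 11) or higher
print(torch.__version__)          # PyTorch build installed above
print(torch.version.cuda)         # CUDA version the build targets, e.g. 12.6
print(torch.cuda.is_available())  # True if a compatible GPU is visible

Once this looks right, load the model and processor: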
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
model_id = "FlashLabs/Chroma-4B" # Or local path
# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto"
)
# Load processor
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

Here is how to perform a simple conversation with audio input and audio output:
import torch
from IPython.display import Audio
# Construct conversation history
system_prompt = (
    "You are Chroma, an advanced virtual human created by FlashLabs. "
    "You possess the ability to understand auditory inputs and generate both text and speech."
)
conversation = [[
    {
        "role": "system",
        "content": [
            {"type": "text", "text": system_prompt}
        ],
    },
    {
        "role": "user",
        "content": [
            # Input audio file path
            {"type": "audio", "audio": "example/make_taco.wav"},
        ],
    },
]]
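# Note: a text-only user turn would presumably follow the same schema as the system
# prompt above, e.g. {"type": "text", "text": "How do I make a taco?"} (hypothetical example).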
# Provide reference audio/text for style or context
def load_prompt(speaker_name):
    text_path = f"example/prompt_text/{speaker_name}.txt"
    audio_path = f"example/prompt_audio/{speaker_name}.wav"
    with open(text_path, "r", encoding="utf-8") as f:
        prompt_text = f.read()
    return [prompt_text], [audio_path]
prompt_text, prompt_audio = load_prompt("scarlett_johansson")
# Process inputs
inputs = processor(
    conversation,
    add_generation_prompt=True,
    tokenize=False,
    prompt_audio=prompt_audio,
    prompt_text=prompt_text
)
# Move inputs to device
device = model.device
inputs = {k: v.to(device) for k, v in inputs.items()}
# Generate
output = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    use_cache=True
)
# Decode audio
# The model outputs raw tokens; we decode the audio part using the codec
audio_values = model.codec_model.decode(output.permute(0, 2, 1)).audio_values
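# Optionally write the waveform to disk (a sketch; assumes the soundfile package is
# installed and that audio_values[0] squeezes to a 1-D mono waveform)
import soundfile as sf
sf.write("chroma_response.wav", audio_values[0].squeeze().cpu().detach().numpy(), samplerate=24_000)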
# Play audio inline (e.g., in Jupyter)
Audio(audio_values[0].cpu().detach().numpy(), rate=24_000)

Problem: An error can occur when loading the processor if torchvision is not properly detected during transformers module initialization.
Solution:
- Ensure you install packages in the correct order (PyTorch/torchvision before transformers)
- If you already installed them, reinstall in the correct order:
pip uninstall transformers torchvision torch -y
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu126
pip install transformers==5.0.0rc0
- Restart your Python kernel/session after reinstalling
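
After reinstalling and restarting the kernel, a quick check can confirm that transformers detects torchvision (assuming the is_torchvision_available helper is exposed in your transformers version):

from transformers.utils import is_torchvision_available
print(is_torchvision_available())  # expect True once the reinstall has taken effect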
Solution: Use device_map="auto" when loading the model, or specify GPU device explicitly:
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16  # Use bfloat16 to reduce memory usage
)

If you use Chroma in your research, please cite:
@misc{chen2026flashlabschroma10realtime,
  title={FlashLabs Chroma 1.0: A Real-Time End-to-End Spoken Dialogue Model with Personalized Voice Cloning},
  author={Tanyu Chen and Tairan Chen and Kai Shen and Zhenghua Bao and Zhihui Zhang and Man Yuan and Yi Shi},
  year={2026},
  eprint={2601.11141},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2601.11141},
}

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Copyright 2025 FlashLabs
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
For questions or issues, please contact: chroma@flashlabs.ai
