The elephant-alpha mystery on OpenRouter is solved. For a few days, a model called elephant-alpha was trending on OpenRouter and no one knew what it was. It turned out to be Ling-flash-2.6 from Ant Group, and it's now live on Modular Cloud on day zero. The model: 104B total parameters with 7.4B active, and a 256K context window, designed for speed and execution across code completion, document processing, and lightweight agent workflows. OpenClaw and Hermes Agent both work with it cleanly, and it handles coding-execution subagent work well too, especially on high-frequency, short-chain tasks where inference speed is the constraint. Book a demo to get started: https://lnkd.in/eFnyMp3S
About us
The next-generation AI developer platform unifying the development and deployment of AI for the world.
- Website: https://www.modular.com
- Industry: Software Development
- Company size: 51-200 employees
- Headquarters: Everywhere
- Type: Privately Held
- Founded: 2022
- Specialties: machine learning, AI, software, TensorFlow, PyTorch, and hardware
Locations
- Primary: Everywhere, US
Updates
-
The Modular community has been cooking! 🍳 During next week's community meeting, we'll hear about three community projects:
- Marrow, an Apache Arrow implementation in Mojo
- Mojo support on Tensara, a GPU programming challenge platform
- MAV, ffmpeg bindings for Mojo
Join via Zoom: https://lnkd.in/ee8GWfMV
-
HDF5 (Hierarchical Data Format 5) is the standard file format for large scientific and numerical datasets. Particle physics simulations, climate models, ML training pipelines: if you work with scientific data at any scale, you've probably run into it.

Community member Photon recently shipped native HDF5 Mojo bindings! 🔥 The bindings use a two-layer design: a thin FFI wrapper over the HDF5 C API for full control, and a higher-level interface (HSFile, NDArray) for everyday use. Today, you can read 1D and 2D datasets without knowing their shapes ahead of time, write datasets, safely create groups, and automatically discover HDF5 libraries via $CONDA_PREFIX.

If you work with HDF5 in scientific computing or physics simulations, take a look: https://lnkd.in/eYqH4DgQ
-
We recently shipped TileTensor: Mojo's new tensor type for GPU kernel authors.

The core problem: tile-level instructions (NVIDIA TMA, AMD DME) are now performance-critical, but most tensor abstractions were designed around flat, strided arrays. TileTensor fixes that. Fully static layouts carry an 8-byte runtime footprint, which directly cuts register pressure. When we migrated our MHA kernel for AMD MI300X, we got a 5% throughput gain from the type change alone.

Our Part 1 blog post covers the design and how it compares to CuTe. Part 2 will cover the Mojo internals that made it possible. https://lnkd.in/eD9V3iy8
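The "fully static layout" idea can be sketched outside Mojo. The following is a minimal, hypothetical Python illustration (not TileTensor's actual API or implementation): when the shape and strides live in the type rather than in each instance, per-instance state shrinks to a single data reference, which is the analogue of TileTensor's 8-byte runtime footprint.

```python
# Illustrative sketch only: a "static layout" tile type where the layout
# is part of the type, not per-instance runtime state.

class StaticLayoutTile:
    # Layout fixed at "compile time": class attributes, shared by all
    # instances, so no registers/memory are spent storing them per tile.
    SHAPE = (4, 4)
    STRIDES = (4, 1)  # row-major

    __slots__ = ("data",)  # per-instance storage is just one reference

    def __init__(self, data):
        assert len(data) == self.SHAPE[0] * self.SHAPE[1]
        self.data = data

    def __getitem__(self, idx):
        # Indexing math uses the type-level strides, not instance fields.
        r, c = idx
        return self.data[r * self.STRIDES[0] + c * self.STRIDES[1]]

tile = StaticLayoutTile(list(range(16)))
print(tile[2, 3])  # -> 11
```

In a compiled language like Mojo, the same move lets the compiler fold the layout arithmetic into constants, which is where the register-pressure savings described above come from.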
-
We partnered with Proximal to run five frontier coding agents on a hard task: rebuild the full Wan 2.1 text-to-video pipeline on MAX (no PyTorch, no diffusers) in 20 hours, as part of their new Frontier-SWE benchmark. Two nearly pulled it off.

GPT-5.4 and Claude Opus 4.6 both built working pipelines from scratch: a 30-layer DiT denoiser, a 3D causal VAE, UMT5-XXL text encoding, and a flow matching scheduler, all running on MAX's graph engine.

Every model understood the architecture. What separated the successful runs was debugging discipline: the patience to inspect intermediate activations layer by layer, fix scheduler settings, track down a VAE normalization error, and keep going. Claude started at 12 dB and reached 41.1 dB by finding and fixing issues one at a time. GPT-5.4 hit 41.5 dB. The agents that topped out at 14 dB weren't confused about the task; they just stopped too early, often abandoning the actual problem to sneak in torch imports instead.

This is one of 18 tasks in Frontier-SWE, Proximal's benchmark for hard engineering problems. Full report: https://lnkd.in/epRVAV7Y
-
Most serving stacks run FLUX.2 as four separate stages with Python overhead between each one. We collapsed all four into a single fused execution graph using MLIR-based compilation.

On AMD MI355X, this means a 3.8x speedup over torch.compile, 1024x1024 images in under 3.5 seconds, and a deployment container under 700MB. We ran the same pipeline on Blackwell, too: AMD delivers equivalent generation quality at 5.5x lower cost.

Chris Lattner is presenting the full breakdown at AMD AI DevDay. Register: https://lnkd.in/ga9Yk5wt
-
AI infrastructure isn't just being built in San Francisco. On May 2nd, Mojo developers in Uyo, Nigeria, are coming together to build, learn, and connect. On the agenda: roadmap updates, a talk on where Mojo fits in the AI stack, open Q&A, and networking. Register here: https://lnkd.in/eVhxU5zA
-
Fish Audio just benchmarked SGLang, vLLM, and MAX 👀 TL;DR: MAX delivered 16% faster throughput than vLLM on L40, a p99 TTFT of 13.1ms vs 23.6ms, and containers under 700MB. It's the only stack in the comparison built without CUDA, running across NVIDIA, AMD, Apple Silicon, and CPU from one codebase. https://lnkd.in/ewipy5ZZ
-
What actually happens between submitting a prompt and getting a response? Kyle Caverly is an AI Performance Engineer on the MAX serve team. In this interview, he walks through the full request lifecycle inside MAX serve: from the moment JSON lands on the API server to the moment text streams back to the client.

Topics covered:
* Why MAX splits into two separate processes (API server and model worker)
* How the batch constructor decides what to run next
* How prefix caching and chunked prefill stack on top of each other
* Why multimodal inputs require a different approach than text at almost every stage

If you build on top of LLM APIs and want to understand what's underneath them, this is a complete guided tour. And all the code discussed is open source: https://lnkd.in/g5SQ5YEu
Inside MAX Serve: From Prompt to Response
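To make the "prefix caching and chunked prefill stack on top of each other" idea concrete, here is a hypothetical Python sketch, not MAX's actual scheduler code: the longest cached prefix of a prompt is skipped entirely, and only the uncached tail is split into fixed-size prefill chunks.

```python
# Illustrative sketch only: how prefix caching and chunked prefill compose.

def plan_prefill(prompt_tokens, cache, chunk_size=4):
    """Return the token chunks that still need prefill for this prompt."""
    # 1. Prefix caching: find the longest prefix whose KV state is cached.
    cached = 0
    for n in range(len(prompt_tokens), 0, -1):
        if tuple(prompt_tokens[:n]) in cache:
            cached = n
            break
    # 2. Chunked prefill: split the uncached tail into fixed-size chunks,
    #    so one long prompt doesn't monopolize a whole batch iteration.
    tail = prompt_tokens[cached:]
    return [tail[i:i + chunk_size] for i in range(0, len(tail), chunk_size)]

# A shared system prompt is already cached; only the user turn needs prefill.
cache = {tuple(["sys", "you", "are"]): "kv-block"}
chunks = plan_prefill(["sys", "you", "are", "a", "helpful", "bot"], cache, chunk_size=2)
print(chunks)  # -> [['a', 'helpful'], ['bot']]
```

The two optimizations are independent but multiplicative: caching shrinks the work, chunking schedules whatever work remains fairly across the batch.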
-
The Modular Community Grant Program is open! If you're building on MAX or Mojo 🔥, hosting a meetup, or speaking at a conference, there's funding for that. Grants start at $500 and scale with scope. https://lnkd.in/eSAz--uK