Last week, DeepSeek announced that it would open source five projects next week:

Netizens joked, "Now this is the real OpenAI."
Just now, the first open source project arrived, and it is related to inference acceleration: FlashMLA.

Open source project address:
It has been open source for only two hours, and the GitHub repo already has 2.7k+ stars:

The core function of the project is:
“FlashMLA is an efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences serving.”
Translated, it is:
“FlashMLA is an efficient MLA decoding kernel optimized for NVIDIA Hopper-architecture GPUs, specifically tuned for serving scenarios that process variable-length sequences.”
In a nutshell:
FlashMLA is an efficient decoding kernel designed by DeepSeek for Hopper-architecture GPUs (such as the H800). By optimizing Multi-head Latent Attention (MLA) computation over variable-length sequences, it reaches up to 3000 GB/s of memory bandwidth and 580 TFLOPS of compute in the decoding stage, significantly improving the efficiency of long-context inference for large models.
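To make "serving variable-length sequences" concrete, here is a purely illustrative PyTorch sketch (the tensor names and shapes are my own assumptions, not FlashMLA's actual API): during decoding, each request in a batch carries a different amount of cached KV, so the kernel takes a per-request length tensor instead of padding everything to the longest sequence.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical decoding batch: 4 requests, 16 heads, head_dim 64,
# one new query token per request (the decode step).
batch_size, num_heads, head_dim = 4, 16, 64
q = torch.randn(batch_size, 1, num_heads, head_dim, device=device, dtype=torch.bfloat16)

# Each request has accumulated a different number of cached KV tokens so far.
cache_seqlens = torch.tensor([37, 512, 1024, 3], device=device, dtype=torch.int32)

# A kernel built for variable-length serving reads exactly cache_seqlens[i]
# cache entries for request i, rather than padding every request to the longest
# length, which is where the decode-time bandwidth savings come from.
print(q.shape, cache_seqlens.tolist())
```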
Some netizens said:

Some people are already using it, and they describe it as "pure engineering":

This project is an engineering optimization that squeezes hardware performance to the limit.
The project is ready to use out of the box.

Environment requirements:
- Hopper GPU
- CUDA 12.3 and above
- PyTorch 2.0 and above
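For reference, a quick sanity check of these requirements can be scripted with PyTorch itself; the snippet below is my own sketch, not part of the FlashMLA repo. Hopper GPUs (H100/H800) report CUDA compute capability 9.x.

```python
import torch

# Sketch of an environment check for the requirements above (my own helper,
# not part of the FlashMLA repo).
assert torch.cuda.is_available(), "No CUDA device found"

# Hopper GPUs (H100/H800) report compute capability 9.x.
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {torch.cuda.get_device_name(0)} (sm_{major}{minor})")
assert major == 9, "FlashMLA targets Hopper-architecture GPUs"

# CUDA toolkit version this PyTorch build was compiled against (needs 12.3+)
print("CUDA:", torch.version.cuda)

# PyTorch version (needs 2.0+)
print("PyTorch:", torch.__version__)
```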
At the end of the README, the team also notes that the project was inspired by the FlashAttention 2 & 3 and NVIDIA CUTLASS projects.

FlashAttention provides fast, memory-efficient exact attention and is used in mainstream large models. The latest third-generation version raises H100 utilization to about 75%.
Training speed is increased by 1.5-2x, and FP16 compute throughput reaches up to 740 TFLOPS, about 75% of the theoretical maximum, up from roughly 35% before, making much fuller use of the hardware.
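For context on what a fused attention kernel like FlashAttention actually does: instead of materializing the full attention matrix in GPU memory (HBM), it computes softmax(QKᵀ)V tile by tile in on-chip SRAM. The generic sketch below uses PyTorch's built-in scaled_dot_product_attention, which dispatches to a FlashAttention-style fused kernel on supported GPUs; it is an illustration of the idea, not DeepSeek's code.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# (batch, heads, seq_len, head_dim)
q = torch.randn(2, 8, 1024, 64, device=device, dtype=dtype)
k = torch.randn(2, 8, 1024, 64, device=device, dtype=dtype)
v = torch.randn(2, 8, 1024, 64, device=device, dtype=dtype)

# On supported GPUs this dispatches to a fused FlashAttention-style kernel:
# softmax(Q K^T) V is computed tile by tile in on-chip memory instead of
# writing the full attention matrix out to HBM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```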
FlashMLA not only delivers a leap in performance through hardware-level optimization, but also provides an out-of-the-box solution for AI inference engineering, making it a key breakthrough for accelerating large-model inference.
A reveal this big on day one.
I'm looking forward to what gets open-sourced over the next four days!
As one netizen put it:

The whale is making waves!
DeepSeek is awesome!