Last week, DeepSeek announced that it would open-source five repositories this week:

Netizens joked, “This is the real OpenAI.”

The first open-source project has just arrived, and it is related to inference acceleration: FlashMLA.

Open source project address:

DeepSeek FlashMLA

It has been open source for only two hours, and it already has 2.7k+ stars on GitHub:

The core function of the project is:

“FlashMLA is an efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences serving.”

Translated, it is:

“FlashMLA is an efficient MLA decoding kernel for NVIDIA Hopper-architecture GPUs, specifically optimized for serving scenarios that process variable-length sequences.”

In a nutshell:

FlashMLA is an efficient decoding kernel designed by DeepSeek for Hopper-architecture GPUs (such as the H800). By optimizing multi-head latent attention (MLA) computation over variable-length sequences, it reaches up to 3000 GB/s of memory bandwidth and 580 TFLOPS of compute in the decoding stage, significantly improving the efficiency of long-context inference for large models.

Some netizens said:

Some people are already using it, and they describe it as “pure engineering”:

This project is an engineering optimization that squeezes hardware performance to the limit.

The project is ready to use out of the box.

Environment requirements:

  • Hopper GPU
  • CUDA 12.3 and above
  • PyTorch 2.0 and above
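To make the “out of the box” claim concrete, here is a minimal sketch of calling FlashMLA’s decoding kernel during token-by-token generation. The helper names (`get_mla_metadata`, `flash_mla_with_kvcache`) follow the pattern shown in the repository’s README; the tensor shapes, dtypes, and block size below are illustrative assumptions, not values stated in this article.

```python
# Minimal sketch of decoding with FlashMLA on a Hopper GPU (CUDA 12.3+, PyTorch 2.0+).
# Shapes, dtypes, and block size are illustrative assumptions.
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

batch, s_q = 4, 1          # decoding: one new query token per sequence
h_q, h_kv = 128, 1         # MLA keeps a single latent KV head -- assumption
d, dv = 576, 512           # per-head query/key dim and value dim -- assumption
block_size = 64            # paged KV-cache block size -- assumption
max_seqlen = 4096
blocks_per_seq = max_seqlen // block_size

# Variable-length contexts: each sequence has a different number of cached tokens.
cache_seqlens = torch.randint(1, max_seqlen, (batch,), dtype=torch.int32, device="cuda")
block_table = torch.arange(batch * blocks_per_seq, dtype=torch.int32,
                           device="cuda").view(batch, blocks_per_seq)

q = torch.randn(batch, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")
kv_cache = torch.randn(batch * blocks_per_seq, block_size, h_kv, d,
                       dtype=torch.bfloat16, device="cuda")

# Plan how to split the uneven per-sequence work across SMs once; reuse it per layer.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv)

out, lse = flash_mla_with_kvcache(
    q, kv_cache, block_table, cache_seqlens, dv,
    tile_scheduler_metadata, num_splits, causal=True)

print(out.shape)  # expected (batch, s_q, h_q, dv)
```

The scheduling metadata is computed once per decoding step and reused for every layer, which is part of how the kernel keeps variable-length batches efficient.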

At the end of the README, the team also notes that the project was inspired by FlashAttention 2 & 3 and NVIDIA’s CUTLASS project.

FlashAttention provides fast, memory-efficient exact attention and is used in mainstream large models. The latest third-generation version raises H100 utilization to about 75%.

Training speed increases by 1.5-2x, and FP16 throughput reaches up to 740 TFLOPS, about 75% of the theoretical peak, compared with roughly 35% utilization before, making much fuller use of the compute resources.
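For a sense of where the 75% figure comes from: assuming the commonly cited dense FP16 tensor-core peak of the H100 SXM, roughly 989 TFLOPS (a spec-sheet figure, not stated in this article),

$$\frac{740\ \text{TFLOPS}}{989\ \text{TFLOPS}} \approx 0.75,$$

which matches the reported 75% utilization.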

FlashMLA not only delivers a leap in performance through hardware-level optimization, but also provides an out-of-the-box solution for AI inference engineering, making it a key breakthrough for accelerating large-model inference.

There was such a big reveal on the first day.

I’m looking forward to the open source stuff in the next four days!

As the netizen said:

The whale is making waves!

DeepSeek is awesome!
