NVIDIA Enhances Llama 3.1 405B Performance with TensorRT Model Optimizer

Lawrence Jengar
Aug 29, 2024 16:10

NVIDIA’s TensorRT Model Optimizer significantly boosts performance of Meta’s Llama 3.1 405B large language model on H200 GPUs.

Meta’s Llama 3.1 405B large language model (LLM) is achieving new levels of performance thanks to NVIDIA’s TensorRT Model Optimizer, according to the NVIDIA Technical Blog. The enhancements have resulted in up to a 1.44x increase in throughput when running on NVIDIA H200 GPUs.

Outstanding Llama 3.1 405B Inference Throughput with TensorRT-LLM

TensorRT-LLM has already delivered remarkable inference throughput for Llama 3.1 405B since the model’s release. This was achieved through a range of optimizations, including in-flight batching, KV caching, and optimized attention kernels. These techniques accelerate inference while making use of lower-precision compute.
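For readers who want a concrete picture, below is a minimal sketch of serving a model through TensorRT-LLM’s high-level Python API. This is an illustrative assumption, not code from the article: the `LLM` class and argument names follow recent tensorrt_llm releases but change between versions, and the model identifier and parallelism setting are placeholders.

```python
# Minimal TensorRT-LLM inference sketch. Assumes the high-level LLM API
# in recent tensorrt_llm releases; argument names may differ by version.
from tensorrt_llm import LLM, SamplingParams

# In-flight batching and paged KV caching are handled by the runtime.
# tensor_parallel_size=8 assumes one 8-GPU HGX H200 node (illustrative).
llm = LLM(model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",  # placeholder id
          tensor_parallel_size=8)

params = SamplingParams(max_tokens=128)
for output in llm.generate(["Summarize in-flight batching."], params):
    print(output.outputs[0].text)
```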

TensorRT-LLM added support for the official Llama FP8 quantization recipe, which calculates static and dynamic scaling factors to preserve maximum accuracy. Additionally, user-defined kernels such as matrix multiplications from FBGEMM are optimized via plug-ins inserted into the network graph at compile time.
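To make the scaling-factor idea concrete, here is a small illustrative sketch (not the recipe’s actual implementation) of how a per-tensor static FP8 scale can be derived from calibration statistics: FP8 E4M3 represents magnitudes up to 448, so the scale maps the observed absolute maximum onto that range.

```python
import torch

# FP8 E4M3 can represent magnitudes up to 448; a static per-tensor
# scale maps the calibrated absolute maximum onto that range.
FP8_E4M3_MAX = 448.0

def fp8_static_scale(amax: torch.Tensor) -> torch.Tensor:
    # amax comes from calibration data; x / scale then fits the FP8 range.
    return amax / FP8_E4M3_MAX

def fake_quant_fp8(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Simulated FP8 round trip: scale down, cast to FP8, cast back, rescale.
    q = (x / scale).to(torch.float8_e4m3fn)
    return q.to(torch.float32) * scale

x = torch.randn(4, 4) * 10.0
scale = fp8_static_scale(x.abs().max())
print((x - fake_quant_fp8(x, scale)).abs().max())  # quantization error
```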

Boosting Performance Up to 1.44x with TensorRT Model Optimizer

NVIDIA’s custom FP8 post-training quantization (PTQ) recipe, available through the TensorRT Model Optimizer library, enhances Llama 3.1 405B throughput and reduces latency without sacrificing accuracy. This recipe incorporates FP8 KV cache quantization and self-attention static quantization, reducing inference compute overhead.
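As a rough illustration of what applying such a PTQ recipe looks like, the sketch below uses the TensorRT Model Optimizer (nvidia-modelopt) quantization API as documented. `model` (a loaded checkpoint) and `calib_loader` are assumed placeholders, and the config name may vary across releases.

```python
import modelopt.torch.quantization as mtq

# Sketch of FP8 post-training quantization with TensorRT Model Optimizer.
# mtq.FP8_DEFAULT_CFG follows the modelopt docs but may differ by release.
config = mtq.FP8_DEFAULT_CFG

def forward_loop(model):
    # Run calibration batches so per-tensor amax statistics (including
    # KV-cache scales) can be collected before quantization.
    for batch in calib_loader:  # user-supplied calibration dataloader
        model(batch)

model = mtq.quantize(model, config, forward_loop)
```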

Table 1 demonstrates the maximum throughput performance, showing significant improvements across various input and output sequence lengths on an 8-GPU HGX H200 system. The system features eight NVIDIA H200 Tensor Core GPUs with 141 GB of HBM3e memory each and four NVLink Switches, providing 900 GB/s of GPU-to-GPU bandwidth.

Maximum Throughput Performance – Output Tokens/Second
8 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths      2,048 | 128    32,768 | 2,048    120,000 | 2,048
TensorRT Model Optimizer FP8               463.1             320.1               71.5
Official Llama FP8 Recipe                  399.9             230.8               49.6
Speedup                                    1.16x             1.39x              1.44x

Table 1. Maximum throughput performance of Llama 3.1 405B with NVIDIA internal measurements

Similarly, Table 2 presents the minimum latency performance using the same input and output sequence lengths.

Batch Size = 1 Performance – Output Tokens/Second
8 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths      2,048 | 128    32,768 | 2,048    120,000 | 2,048
TensorRT Model Optimizer FP8                49.6              44.2               27.2
Official Llama FP8 Recipe                   37.4              33.1               22.8
Speedup                                     1.33x             1.33x              1.19x

Table 2. Minimum latency performance of Llama 3.1 405B with NVIDIA internal measurements

These results indicate that H200 GPUs with TensorRT-LLM and TensorRT Model Optimizer deliver superior performance in both latency-optimized and throughput-optimized scenarios. The TensorRT Model Optimizer FP8 recipe also achieved accuracy comparable to the official Llama 3.1 FP8 recipe on the Massive Multitask Language Understanding (MMLU) and MT-Bench benchmarks.

Fitting Llama 3.1 405B on Just Two H200 GPUs with INT4 AWQ

For developers with hardware resource constraints, the INT4 AWQ technique in TensorRT Model Optimizer compresses the model, allowing Llama 3.1 405B to fit on just two H200 GPUs. It significantly reduces the memory footprint by compressing the weights to 4-bit integers while keeping activations in FP16.
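The memory math works out roughly as follows: 405 billion parameters at 4 bits each is about 202.5 GB of weights, which fits within the 282 GB of combined HBM3e on two H200 GPUs, leaving headroom for activations and the KV cache. Below is a hedged sketch of invoking INT4 AWQ through Model Optimizer, with the same placeholder caveats as the FP8 sketch above (`model` and `forward_loop` are assumed, and the config name may differ by release).

```python
import modelopt.torch.quantization as mtq

# Back-of-envelope memory check: 405B params * 0.5 bytes (4-bit) ~ 202.5 GB,
# versus 2 x 141 GB = 282 GB of HBM3e on two H200 GPUs.
weights_gb = 405e9 * 0.5 / 1e9
print(f"INT4 weights: ~{weights_gb:.1f} GB vs 282 GB available")

# Sketch of INT4 AWQ quantization; mtq.INT4_AWQ_CFG follows the modelopt
# docs but may differ by release. `model` and `forward_loop` as before.
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)
```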

Tables 4 and 5 show the maximum throughput and batch-size-1 (minimum latency) performance measurements; NVIDIA also reports that the INT4 AWQ method achieves accuracy comparable to Meta’s official Llama 3.1 FP8 recipe.

Maximum Throughput Performance – Output Tokens/Second
2 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths      2,048 | 128    32,768 | 2,048    60,000 | 2,048
TensorRT Model Optimizer INT4 AWQ           75.6              28.7             16.2

Table 4. Maximum throughput performance of Llama 3.1 405B with NVIDIA internal measurements

Batch Size = 1 Performance – Output Tokens/Second
2 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths      2,048 | 128    32,768 | 2,048    60,000 | 2,048
TensorRT Model Optimizer INT4 AWQ           21.6              18.7             12.8

Table 5. Minimum latency performance of Llama 3.1 405B with NVIDIA internal measurements

NVIDIA’s advancements in TensorRT Model Optimizer and TensorRT-LLM are paving the way for enhanced performance and efficiency when running large language models such as Llama 3.1 405B. These improvements give developers more flexibility and better cost-efficiency, whether they are working with extensive hardware resources or in more constrained environments.

Image source: Shutterstock

