LoRA Inference on Groq

Groq provides inference services for pre-made Low-Rank Adaptation (LoRA) adapters. LoRA is a Parameter-Efficient Fine-Tuning (PEFT) technique that customizes model behavior without altering the base model's weights. Upload your existing LoRA adapters to run specialized inference while maintaining the performance and efficiency of Groq's infrastructure.

Note: Groq offers LoRA inference services only. We do not provide LoRA fine-tuning services - you must create your LoRA adapters externally using other providers or tools.

With LoRA inference on Groq, you can:

  • Run inference with your pre-made LoRA adapters
  • Deploy multiple specialized adapters alongside a single base model
  • Maintain high performance without compromising inference speed
  • Leverage existing fine-tuned models created with external tools

Enterprise Feature

LoRA is available exclusively to enterprise-tier customers. To get started with LoRA on GroqCloud, please reach out to our enterprise team.

Why LoRA vs. Base Model?

Compared to using just the base model, LoRA adapters offer significant advantages:

  • Task-Specific Optimization: Tune model outputs to your particular use case, enabling increased accuracy and quality of responses
  • Domain Expertise: Adapt models to understand industry-specific terminology, context, and requirements
  • Consistent Behavior: Ensure predictable outputs that align with your business needs and brand voice
  • Performance Maintenance: Achieve customization without compromising the high-speed inference that Groq is known for

Why LoRA vs. Traditional Fine-tuning?

LoRA provides several key advantages over traditional fine-tuning approaches:

Lower Total Cost of Ownership

LoRA significantly reduces customization costs because only a small set of adapter weights is trained, rather than the full base model. This efficiency makes it cost-effective to customize models at scale.

Rapid Deployment with High Performance

Smaller, task-specific LoRA adapters can match or exceed the performance of fully fine-tuned models while delivering faster inference. This translates to quicker experimentation, iteration, and real-world impact.

Non-Invasive Model Adaptation

Since LoRA adapters don't require changes to the base model, you avoid the complexity and liability of managing and validating a fully retrained system. Adapters are modular, independently versioned, and easily replaceable as your data evolves—simplifying governance and compliance.

Full Control, Less Risk

Customers keep control of how and when updates happen: no retraining, no surprise behavior changes. Just lightweight, swappable adapters that fit into existing systems with minimal disruption. And with self-service APIs, updating adapters is quick, intuitive, and doesn't require a heavy engineering lift.

LoRA Options on GroqCloud

Two Hosting Modalities

Groq supports LoRAs through two deployment options:

  1. LoRAs in our public cloud
  2. LoRAs on a dedicated instance

LoRAs (Public Cloud)

A pay-per-token usage model with no dedicated hardware requirements. This option is ideal for customers with a small number of LoRA adapters across different tasks such as customer support, document summarization, and translation.

  • No dedicated hardware requirements - pay per token usage
  • Shared instance capabilities across customers with potential rate limiting
  • Less consistent latency/throughput compared to dedicated instances
  • Gradual rollout to enterprise customers only via enterprise access form

LoRAs (Dedicated Instance)

LoRA adapters are deployed on dedicated Groq hardware instances purchased by the customer, providing optimized performance for multiple adapters and consistent inference speeds. This option is best suited for high-traffic scenarios or customers serving personalized adapters to many end users.

  • Dedicated hardware instances optimized for LoRA performance
  • More consistent performance and lower average latency
  • No LoRA-specific rate limiting
  • Ideal for SaaS platforms with dozens of internal use cases or hundreds of customer-specific adapters

Supported Models

LoRA support is currently available for the following models:

Model ID               | Model        | Base Model
llama-3.1-8b-instant   | Llama 3.1 8B | meta-llama/Llama-3.1-8B-Instruct

Please reach out to our enterprise support team for additional model support.

LoRA Pricing

Please reach out to our enterprise support team for pricing.

Getting Started

To begin using LoRA on GroqCloud:

  1. Contact Enterprise Sales: Reach out to become an enterprise-tier customer
  2. Request LoRA Access: Inform the team that you would like access to LoRA support
  3. Create Your LoRA Adapters: Use external providers or tools to fine-tune Groq-supported base models (exact model versions required)
  4. Upload Adapters: Use the self-serve portal to upload your LoRA adapters to GroqCloud
  5. Deploy: Call the unique model ID created for your specific LoRA adapter(s)

Important: You must fine-tune the exact base model versions that Groq supports for your LoRA adapters to work properly.

Using the Fine-Tuning API

Once you have access to LoRA, you can upload and deploy your adapters using Groq's Fine-Tuning API. This process involves two API calls: one to upload your LoRA adapter files and another to register them as a fine-tuned model. When you upload your LoRA adapters, Groq will store and process your files to provide this service. LoRA adapters are your Customer Data and will only be available for your organization's use.

Requirements

  • Supported models: Text generation models only
  • Supported ranks: 8, 16, 32, and 64 only
  • File format: ZIP file containing exactly 2 files

Note: Cold start times are proportional to the LoRA rank. Higher ranks (32, 64) will take longer to load initially but have no impact on inference performance once loaded.

Step 1: Prepare Your LoRA Adapter Files

Create a ZIP file containing exactly these 2 files:

  1. adapter_model.safetensors - A safetensors file containing your LoRA weights in float16 format
  2. adapter_config.json - A JSON configuration file with required fields:
    • "lora_alpha": (integer or float) The LoRA alpha parameter
    • "r": (integer) The rank of your LoRA adapter (must be 8, 16, 32, or 64)

Step 2: Upload the LoRA Adapter Files

Upload your ZIP file to the /files endpoint with purpose="fine_tuning":

bash
curl --location 'https://api.groq.com/openai/v1/files' \
--header "Authorization: Bearer ${TOKEN}" \
--form "file=@<file-name>.zip" \
--form 'purpose="fine_tuning"'

This returns a file ID that you'll use in the next step:

JSON
{
  "id": "file_01jxnqc8hqebx343rnkyxw47e",
  "object": "file",
  "bytes": 155220077,
  "created_at": 1749854594,
  "filename": "<file-name>.zip",
  "purpose": "fine_tuning"
}
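
If you are scripting this workflow, the file ID can be captured directly from the response. This is a minimal sketch that assumes jq is installed and the archive is named my-lora-adapter.zip:

bash
# Upload the adapter archive and store the returned file ID for the registration step.
FILE_ID=$(curl --silent --location 'https://api.groq.com/openai/v1/files' \
  --header "Authorization: Bearer ${TOKEN}" \
  --form "file=@my-lora-adapter.zip" \
  --form 'purpose="fine_tuning"' \
  | jq -r '.id')
echo "Uploaded file: ${FILE_ID}"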

Step 3: Register as Fine-Tuned Model

Use the file ID to register your LoRA adapter as a fine-tuned model:

bash
curl --location 'https://api.groq.com/v1/fine_tunings' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer ${TOKEN}" \
--data '{
    "input_file_id": "<file-id>",
    "name": "my-lora-adapter",
    "type": "lora",
    "base_model": "llama-3.1-8b-instant"
}'

This returns your unique model ID:

JSON
{
  "id": "ft_01jxx7abvdf6pafdthfbfmb9gy",
  "object": "fine_tuning",
  "data": {
    "name": "my-lora-adapter",
    "base_model": "llama-3.1-8b-instant",
    "type": "lora",
    "fine_tuned_model": "ft:llama-3.1-8b-instant:org_01hqed9y3fexcrngzqm9qh6ya9/my-lora-adapter-ef36419a0010"
  }
}
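
As in the upload step, the returned model ID can be pulled straight out of the response when scripting. This sketch assumes jq is installed and that FILE_ID holds the ID captured in Step 2:

bash
# Register the adapter and store the fine-tuned model ID for inference requests.
MODEL_ID=$(curl --silent --location 'https://api.groq.com/v1/fine_tunings' \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer ${TOKEN}" \
  --data "{
      \"input_file_id\": \"${FILE_ID}\",
      \"name\": \"my-lora-adapter\",
      \"type\": \"lora\",
      \"base_model\": \"llama-3.1-8b-instant\"
  }" \
  | jq -r '.data.fine_tuned_model')
echo "Fine-tuned model: ${MODEL_ID}"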

Step 4: Use Your LoRA Model

Use the returned fine_tuned_model ID in your inference requests just like any other model:

bash
curl --location 'https://api.groq.com/openai/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer ${TOKEN}" \
--data '{
    "model": "ft:llama-3.1-8b-instant:org_01hqed9y3fexcrngzqm9qh6ya9/my-lora-adapter-ef36419a0010",
    "messages": [
        {
            "role": "user",
            "content": "Your prompt here"
        }
    ]
}'
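
If you captured the model ID with the scripted registration above, the same request can reference the MODEL_ID variable instead of a hard-coded string:

bash
# Run inference against your LoRA adapter using the captured model ID.
curl --location 'https://api.groq.com/openai/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer ${TOKEN}" \
  --data "{
      \"model\": \"${MODEL_ID}\",
      \"messages\": [{\"role\": \"user\", \"content\": \"Your prompt here\"}]
  }"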

Frequently Asked Questions

Does Groq offer LoRA fine-tuning services?

No. Groq provides LoRA inference services only. Customers must create their LoRA adapters externally using fine-tuning providers or tools (e.g., Hugging Face PEFT, Unsloth, or custom solutions) and then upload their pre-made adapters to Groq for inference. You must fine-tune the exact base model versions that Groq supports. Stay tuned for further updates on recommended fine-tuning providers.

Will LoRA support be available to Developer tier customers?

Not at this time. LoRA support is currently exclusive to enterprise-tier customers. Stay tuned for updates.

How do I get access to LoRA on GroqCloud?

Contact our enterprise team to discuss your LoRA requirements and get started.

How long are LoRA adapter files retained?

Your uploaded LoRA adapter files are stored, and accessible solely to your organization, for as long as you use the LoRA service. This service is not currently available for use with regional or sovereign endpoints.

Best Practices

  • Keep LoRA rank low (8 or 16) to minimize cold start times - higher ranks increase loading latency
  • Use float16 precision when loading the base model during fine-tuning to maintain optimal inference accuracy
  • Avoid 4-bit quantization during LoRA training as it may cause small accuracy drops during inference
  • Save LoRA weights in float16 format in your adapter_model.safetensors file
  • Test different ranks to find the optimal balance between adaptation quality and cold start performance
