Prompting is the methodology through which we communicate instructions, parameters, and expectations to large language models. Consider a prompt as a detailed specification document provided to the model: the more precise and comprehensive the specifications, the higher the quality of the output. This guide establishes the fundamental principles for crafting effective prompts for open-source instruction-tuned models, including Llama, Deepseek, and Gemma.
Large language models require clear direction to produce optimal results; without precise instructions, they may produce inconsistent outputs. Well-structured prompts keep responses accurate, consistent, and easy to parse.
Most high-quality prompts contain five elements: role, instructions, context, input, expected output.
| Element | What it does |
|---|---|
| Role | Sets persona or expertise ("You are a data analyst…") |
| Instructions | Bullet-proof list of required actions |
| Context | Background knowledge or reference material |
| Input | The data or question to transform |
| Expected Output | Schema or miniature example to lock formatting |
Here's a real-world example demonstrating how these prompt building blocks work together to extract structured data from an email. Each element plays a crucial role in ensuring accurate, consistent output:
### System
You are a data-extraction bot. Return **ONLY** valid JSON.
### Instructions
Return only JSON with keys:
- name (string)
- street (string)
- city (string)
- postcode (string)
### Context
"Ship-to" or "Deliver to" often precedes the address.
Postcodes may include letters (e.g., SW1A 1AA).
### Input
Subject: Shipping Request - Order #12345
Hi Shipping Team,
Please process the following delivery for Order #12345:
Deliver to:
Jane Smith
123 Oak Avenue
Manchester
M1 1AA
Items:
- 2x Widget Pro (SKU: WP-001)
- 1x Widget Case (SKU: WC-100)
Thanks,
Sales Team
### Example Output
{"name":"John Doe","street":"456 Pine Street","city":"San Francisco","postcode":"94105"}
Most chat-style APIs expose three channels:
| Channel | Typical Use |
|---|---|
| system | High-level persona & non-negotiable rules ("You are a helpful financial assistant."). |
| user | The actual request or data, such as a user's message in a chat. |
| assistant | The model's response. In multi-turn conversations, the assistant role can be used to track the conversation history. |
The following example demonstrates how to implement a customer service chatbot using role channels. Role channels give the model a structured view of the conversation so it can maintain context and respond appropriately across turns.
```python
from groq import Groq

client = Groq()

system_prompt = """
You are a helpful IT support chatbot for 'Tech Solutions'.
Your role is to assist employees with common IT issues, provide guidance on using company software, and help troubleshoot basic technical problems.
Respond clearly and patiently. If an issue is complex, explain that you will create a support ticket for a human technician.
Keep responses brief and ask a maximum of one question at a time.
"""

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": "My monitor isn't turning on.",
        },
        {
            "role": "assistant",
            "content": "Let's try to troubleshoot. Is the monitor properly plugged into a power source?",
        },
        {
            "role": "user",
            "content": "Yes, it's plugged in."
        }
    ],
    model="llama-3.3-70b-versatile",
)

print(chat_completion.choices[0].message.content)
```
Is the power button on the monitor being pressed, and are any lights or indicators on the monitor turning on when you press it?
Prompt priming is the practice of giving the model an initial block of instructions or context that influences every downstream token the model generates. Think of it as "setting the temperature of the conversation room" before anyone walks in. This usually lives in the system message; in single-shot prompts it's the first paragraph you write. Unlike one- or few-shot demos, priming does not need examples; the power comes from describing roles ("You are a medical billing expert"), constraints ("never reveal PII"), or seed knowledge ("assume the user's database is Postgres 16").
Large language models generate text by conditioning on all previous tokens, so the earliest tokens in the prompt shape everything that follows. By positioning high-leverage tokens (role, style, rules) first, priming biases the probability distribution over next tokens toward answers that respect that frame.
### System (Priming)
You are ComplianceLlama, an expert in U.S. financial-services regulation.
Always cite the relevant CFR section and warn when user requests violate §1010.620.
### User
"Can my fintech app skip KYC if all transfers are under $500?"
### Assistant
"Transfers below $1,000 still trigger the customer-identification program requirements in 31 CFR §1022.220. Skipping KYC would violate FinCEN rules…"
| Situation | Why priming helps |
|---|---|
| Stable persona or voice across many turns | Guarantees the model keeps the same tone (e.g., "seasoned litigator") without repeating instructions. |
| Policy & safety guardrails | Embeds non-negotiable rules such as "do not reveal trade secrets." |
| Injecting domain knowledge (e.g., product catalog, API schema) | Saves tokens vs. repeating specs each turn; the model treats the primed facts as ground truth. |
| Special formatting or citation requirements | Place markdown/JSON/XML templates in the primer so every answer starts correct. |
| Consistent style transfer (pirate talk, Shakespearean English) | Role-play seeds ensure creative outputs stay on-brand. |
| Zero-shot tasks that need extra context | A brief primer often outperforms verbose instructions alone. |
Many models accept context windows of 128K tokens or more, but a long system prompt still costs latency and money. Even when a large amount of information fits in the context window, stuffing it in can slow responses and reduce the model's accuracy. As a best practice, include only what the model needs to generate the desired response.
Try these 10-second tweaks before adding examples or complex logic:
| Quick Fix | Outcome |
|---|---|
| Add a one-line persona ("You are a veteran copy editor.") | Sharper, domain-aware tone |
| Show a mini output sample (one-row table / tiny JSON) | Increased formatting accuracy |
| Use numbered steps in instructions | Reduces rambling in answers |
| Add "no extra prose" at the end | Stops the model from adding greetings or apologies |
Review these recommended practices and solutions to avoid common prompting issues.
| Common Mistake | Result | Solution |
|---|---|---|
| Hidden ask buried mid-paragraph | Model ignores it | Move all instructions into a bulleted list at the top |
| Over-stuffed context | Truncated or slow responses | Summarize, remove old examples |
| Ambiguous verbs ("analyze") | Vague output | Be explicit ("Summarize in one bullet per metric") |
| Partial JSON keys in sample | Model hallucinates extra keys | Show the full schema, even if brief |
Optimize model outputs by configuring key parameters like temperature and top-p. These settings control the balance between deterministic and creative responses, with recommended values based on your specific use case.
| Parameter | What it does | Safe ranges | Typical use |
|---|---|---|---|
| Temperature | Global randomness (higher = more creative) | 0 - 1.0 | 0 - 0.3 facts, 0.7 - 0.9 creative |
| Top-p | Keeps only the top p cumulative probability mass; use this or temperature, not both | 0.5 - 1.0 | 0.9 facts, 1.0 creative |
| Top-k | Limits sampling to the k highest-probability tokens | 20 - 100 | Rarely needed; try k = 40 for deterministic extraction |
The following are recommended temperature or top-p values (set one or the other, not both) for various use cases:
| Scenario | Temp | Top-p | Comments |
|---|---|---|---|
| Factual Q&A | 0.2 | 0.9 | Keeps dates & numbers stable |
| Data extraction (JSON) | 0.0 | 0.9 | Deterministic keys/values |
| Creative copywriting | 0.8 | 1.0 | Vivid language, fresh ideas |
| Brainstorming list | 0.7 | 0.95 | Variety without nonsense |
| Long-form code | 0.3 | 0.85 | Fewer hallucinated APIs |
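As a sketch of how these recommendations map to API calls, the snippet below assumes the Groq SDK and the `llama-3.1-8b-instant` model used elsewhere in this guide, and sets temperature only, leaving top-p at its default (set one or the other, not both). The prompts are hypothetical.

```python
from groq import Groq

client = Groq()

# Factual Q&A: low temperature keeps dates and numbers stable.
factual = client.chat.completions.create(
    messages=[{"role": "user", "content": "In what year was the transistor invented?"}],
    model="llama-3.1-8b-instant",
    temperature=0.2,
)

# Creative copywriting: higher temperature encourages vivid, varied language.
creative = client.chat.completions.create(
    messages=[{"role": "user", "content": "Write a tagline for a solar-powered backpack."}],
    model="llama-3.1-8b-instant",
    temperature=0.8,
)

print(factual.choices[0].message.content)
print(creative.choices[0].message.content)
```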
The following settings help control token usage and cost through length limits and stop sequences, and make outputs reproducible with a seed.
| Setting | Purpose | Tip |
|---|---|---|
| max_completion_tokens | Hard cap on completion size | Set 10-20% above the ideal answer length |
| Stop sequences | Early stop when the model hits the given token(s) | Use "###" or another delimiter |
| System length hints | "less than 75 words" or "return only table rows" | Models respect explicit numbers |
| seed | Makes sampling reproducible | Use the same seed for consistent outputs across runs |
Real-world example: an invoice summarizer returns exactly three bullets by stating "Provide three bullets, each less than 12 words" and using `max_completion_tokens=60`.
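A minimal sketch of that invoice summarizer, assuming the Groq SDK; the invoice text and model name are placeholders.

```python
from groq import Groq

client = Groq()

# Hypothetical invoice text for illustration only.
invoice_text = "Invoice #881: 3x Widget Pro at $49 each, shipping $12, due 2025-07-01."

chat_completion = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "Provide three bullets, each less than 12 words. No extra prose."},
        {"role": "user", "content": invoice_text},
    ],
    model="llama-3.1-8b-instant",
    max_completion_tokens=60,  # hard cap set roughly 10-20% above the ideal answer length
)

print(chat_completion.choices[0].message.content)
```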
The `stop` parameter allows you to define sequences at which the model stops generating tokens. This is particularly useful for trimming a delimiter off structured output and keeping responses concise:
```python
# Using a custom stop sequence for structured, concise output.
# The model is instructed to produce '###' at the end of the desired content.
# The API will stop generation when '###' is encountered and will NOT include '###' in the response.
from groq import Groq

client = Groq()

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Provide a 2-sentence summary of the concept of 'artificial general intelligence'. End your summary with '###'."
        }
        # The model's raw output before stop-sequence removal might be:
        # "Artificial general intelligence (AGI) refers to a type of AI that possesses the ability to understand, learn, and apply knowledge across a wide range of tasks at a level comparable to that of a human being. This contrasts with narrow AI, which is designed for specific tasks. ###"
    ],
    model="llama-3.1-8b-instant",
    stop=["###"],
    max_tokens=100  # Ensure enough tokens for the summary + stop sequence
)

print(chat_completion.choices[0].message.content)
```
Artificial general intelligence (AGI) refers to a type of AI that possesses the ability to understand, learn, and apply knowledge across a wide range of tasks at a level comparable to that of a human being. This contrasts with narrow AI, which is designed for specific tasks.
When defining stop sequences, pick markers that will not appear in the desired output, for example:
- `###END###` or `</response>`
- `}` or `;` (for JSON or code output)
The `seed` parameter enables deterministic generation, making outputs consistent across multiple runs with the same parameters. This is valuable whenever you need reproducible results, such as testing and debugging:
```python
from groq import Groq

client = Groq()

chat_completion = client.chat.completions.create(
    messages=[
        { "role": "system", "content": "You are a creative storyteller." },
        { "role": "user", "content": "Write a brief opening line to a mystery novel." }
    ],
    model="llama-3.1-8b-instant",
    temperature=0.8,  # Some creativity allowed
    seed=700,         # Deterministic seed
    max_tokens=100
)

print(chat_completion.choices[0].message.content)
```
"It was the night the clock tower's chimes fell silent, and Detective Jameson received a mysterious letter with a single, chilling phrase: 'The truth lies in Ravenswood.'"
Important notes about `seed`:
- Check `system_fingerprint` in responses to track backend changes
- Combining `seed` with a lower temperature (0.0 - 0.3) may improve determinism

Good prompts set the rules; dedicated guardrail models enforce them. Meta's Llama Guard 4 is designed to sit in front of, or behind, your main model, classifying prompts or outputs for safety violations (hate, self-harm, private data). Integrating a moderation step can cut violation rates without changing your core prompt structure.
When stakes are high (finance, health, compliance), pair clear instructions ("never reveal PII") with an automated filter that rejects or sanitizes unsafe content before it reaches the user.
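As a sketch of that pattern with the Groq SDK: screen the user's message with a Llama Guard model before it reaches the main model. The model ID, prompts, and system message below are assumptions for illustration; check your provider's model list for the exact guardrail model name.

```python
from groq import Groq

client = Groq()

user_message = "How do I reset my coworker's password without them knowing?"  # hypothetical input

# Step 1: classify the incoming message with a guardrail model.
# Llama Guard typically replies with "safe", or "unsafe" followed by violated category codes.
guard_check = client.chat.completions.create(
    messages=[{"role": "user", "content": user_message}],
    model="meta-llama/llama-guard-4-12b",  # assumed model ID; verify in your console
)
verdict = guard_check.choices[0].message.content.strip()

# Step 2: only forward messages classified as safe to the main assistant.
if verdict.lower().startswith("safe"):
    reply = client.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a helpful IT support chatbot."},
            {"role": "user", "content": user_message},
        ],
        model="llama-3.3-70b-versatile",
    )
    print(reply.choices[0].message.content)
else:
    print("Request blocked by content policy:", verdict)
```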
Ready to level up? Explore dedicated prompt patterns like zero-shot, one-shot, few-shot, chain-of-thought, and more to match the pattern to your task complexity. From there, iterate and refine to improve your prompts.