Prompt Basics

Prompting is the methodology through which we communicate instructions, parameters, and expectations to large language models. Consider a prompt as a detailed specification document provided to the model: the more precise and comprehensive the specifications, the higher the quality of the output. This guide establishes the fundamental principles for crafting effective prompts for open-source instruction-tuned models, including Llama, Deepseek, and Gemma.

Why Prompts Matter

Large language models require clear direction to produce optimal results. Without precise instructions, they may produce inconsistent outputs. Well-structured prompts provide several benefits:

  • Reduce development time by minimizing iterations needed for acceptable results.
  • Enhance output consistency to ensure responses meet validation requirements without modification.
  • Optimize resource usage by maintaining efficient context window utilization.

Prompt Building Blocks

Most high-quality prompts contain five elements: role, instructions, context, input, expected output.

ElementWhat it does
RoleSets persona or expertise ("You are a data analyst…")
InstructionsBullet-proof list of required actions
ContextBackground knowledge or reference material
InputThe data or question to transform
Expected OutputSchema or miniature example to lock formatting

Real-world use case

Here's a real-world example demonstrating how these prompt building blocks work together to extract structured data from an email. Each element plays a crucial role in ensuring accurate, consistent output:

  1. System - fixes the model's role so it doesn't add greetings or extra formatting.
  2. Instructions - lists the exact keys; pairing this with JSON mode or tool use further guarantees parseable output.
  3. Context - gives domain hints ("Deliver to", postcode format) that raise extraction accuracy without extra examples.
  4. Input - the raw e-mail; keep original line breaks so the model can latch onto visual cues.
  5. Example Output - a miniature few-shot sample that locks the reply shape to one JSON object.
### System
You are a data-extraction bot. Return **ONLY** valid JSON.

### Instructions
Return only JSON with keys:
- name (string)
- street (string)
- city (string)
- postcode (string)

### Context
"Ship-to" or "Deliver to" often precedes the address.
Postcodes may include letters (e.g., SW1A 1AA).

### Input
Subject: Shipping Request - Order #12345

Hi Shipping Team,

Please process the following delivery for Order #12345:

Deliver to:
Jane Smith
123 Oak Avenue
Manchester
M1 1AA

Items:
- 2x Widget Pro (SKU: WP-001)
- 1x Widget Case (SKU: WC-100)

Thanks,
Sales Team

### Example Output
{"name":"John Doe","street":"456 Pine Street","city":"San Francisco","postcode":"94105"}

Role Channels

Most chat-style APIs expose three channels:

ChannelTypical Use
systemHigh-level persona & non-negotiable rules ("You are a helpful financial assistant.").
userThe actual request or data, such as a user's message in a chat.
assistantThe model's response. In multi-turn conversations, the assistant role can be used to track the conversation history.

The following example demonstrates how to implement a customer service chatbot using role channels. Role channels provide a structured way for the model to maintain context and generate contextually appropriate responses throughout the conversation.

python
from groq import Groq

client = Groq()

system_prompt = """
You are a helpful IT support chatbot for 'Tech Solutions'.
Your role is to assist employees with common IT issues, provide guidance on using company software, and help troubleshoot basic technical problems.
Respond clearly and patiently. If an issue is complex, explain that you will create a support ticket for a human technician.
Keep responses brief and ask a maximum of one question at a time.
"""

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": "My monitor isn't turning on.",
        },
        {
            "role": "assistant",
            "content": "Let's try to troubleshoot. Is the monitor properly plugged into a power source?",
        },
        {
            "role": "user",
            "content": "Yes, it's plugged in."
        }
    ],
    model="llama-3.3-70b-versatile",
)

print(chat_completion.choices[0].message.content)
Output

Is the power button on the monitor being pressed, and are any lights or indicators on the monitor turning on when you press it?

Prompt Priming

Prompt priming is the practice of giving the model an initial block of instructions or context that influences every downstream token the model generates. Think of it as "setting the temperature of the conversation room" before anyone walks in. This usually lives in the system message; in single-shot prompts it's the first paragraph you write. Unlike one- or few-shot demos, priming does not need examples; the power comes from describing roles ("You are a medical billing expert"), constraints ("never reveal PII"), or seed knowledge ("assume the user's database is Postgres 16").

Why it Works

Large language models generate text by conditioning on all previous tokens, weighting earlier tokens more heavily than later ones. By positioning high-leverage tokens (role, style, rules) first, priming biases the probability distribution over next tokens toward answers that respect that frame.

Example (Primed Chat)

### System (Priming)
You are ComplianceLlama, an expert in U.S. financial-services regulation.
Always cite the relevant CFR section and warn when user requests violate §1010.620.

### User
"Can my fintech app skip KYC if all transfers are under $500?"

### Assistant
Output

"Transfers below $1,000 still trigger the customer-identification program requirements in 31 CFR §1022.220. Skipping KYC would violate FinCEN rules…"

When to Use

SituationWhy priming helps
Stable persona or voice across many turnsGuarantees the model keeps the same tone (e.g., "seasoned litigator") without repeating instructions.
Policy & safety guardrailsEmbeds non-negotiable rules such as "do not reveal trade secrets."
Injecting domain knowledge (e.g., product catalog, API schema)Saves tokens vs. repeating specs each turn; the model treats the primed facts as ground truth.
Special formatting or citation requirementsPlace markdown/JSON/XML templates in the primer so every answer starts correct.
Consistent style transfer (pirate talk, Shakespearean English)Role-play seeds ensure creative outputs stay on-brand.
Zero-shot tasks that need extra contextA brief primer often outperforms verbose instructions alone.

Tips

  • Keep it concise: 300-600 tokens is usually enough; longer primers steal context window from the user.
  • Separate roles: Use dedicated system, user, and assistant roles so the model understands hierarchy.
  • Test for drift: Over many turns, the model can "forget" earlier tokens: re-send the primer or summarize it periodically.
  • Watch for over-constraining: Heavy persona priming can hurt factual accuracy on analytical tasks; disable or slim down when precision matters.
  • Combine with examples: For structured outputs, prime the schema then add one-shot examples to lock formatting.

Core Principles

  1. Lead with the must-do. Put critical instructions first; the model weighs early tokens more heavily.
  2. Show, don't tell. A one-line schema or table example beats a paragraph of prose.
  3. State limits explicitly. Use "Return only JSON" or "less than 75 words" to eliminate chatter.
  4. Use plain verbs. "Summarize in one bullet per metric" is clearer than "analyze."
  5. Chunk long inputs. Delimit data with ``` or <<< … >>> so the model sees clear boundaries.

Context Budgeting

While many models can handle up to 128K tokens (or more), using a longer system prompt still costs latency and money. While you might be able to fit a lot of information in the model's context window, it could increase latency and reduce the model's accuracy. As a best practice, only include what is needed for the model to generate the desired response in the context.

Quick Prompting Wins

Try these 10-second tweaks before adding examples or complex logic:

Quick FixOutcome
Add a one-line persona ("You are a veteran copy editor.")Sharper, domain-aware tone
Show a mini output sample (one-row table / tiny JSON)Increased formatting accuracy
Use numbered steps in instructionsReduces answers with extended rambling
Add "no extra prose" at the endStops model from adding greetings or apologies

Common Mistakes to Avoid

Review these recommended practices and solutions to avoid common prompting issues.

Common MistakeResultSolution
Hidden ask buried mid-paragraphModel ignores itMove all instructions to top bullet list
Over-stuffed contextTruncated or slow responsesSummarize, remove old examples
Ambiguous verbs ("analyze")Vague outputBe explicit ("Summarize in one bullet per metric")
Partial JSON keys in sampleModel Hallucinates extra keysShow the full schema: even if brief

Parameter Tuning

Optimize model outputs by configuring key parameters like temperature and top-p. These settings control the balance between deterministic and creative responses, with recommended values based on your specific use case.

ParameterWhat it doesSafe rangesTypical use
TemperatureGlobal randomness (higher = more creative)0 - 1.00 - 0.3 facts, 0.7 - 0.9 creative
Top-pKeeps only the top p cumulative probability mass - use this or temperature, not both0.5 - 1.00.9 facts, 1.0 creative
Top-kLimits to the k highest-probability tokens20 - 100Rarely needed; try k = 40 for deterministic extraction

Quick presets

The following are recommended values to set temperature or top-p to (but not both) for various use cases:

ScenarioTempTop-pComments
Factual Q&A0.20.9Keeps dates & numbers stable
Data extraction (JSON)0.00.9Deterministic keys/values
Creative copywriting0.81.0Vivid language, fresh ideas
Brainstorming list0.70.95Variety without nonsense
Long-form code0.30.85Fewer hallucinated APIs

Controlling Length & Cost

The following are recommended settings for controlling token usage and costs with length limits, stop sequences, and deterministic outputs.

SettingPurposeTip
max_completion_tokensHard cap on completion sizeSet 10-20 % above ideal answer length
Stop sequencesEarly stop when model hits token(s)Use "###" or another delimiter
System length hints"less than 75 words" or "return only table rows"Model respects explicit numbers
seedControls randomness deterministicallyUse same seed for consistent outputs across runs

Real-world example:

Invoice summarizer returns exactly three bullets by stating "Provide three bullets, each less than 12 words" and using max_completion_tokens=60.

Stop Sequences

The stop parameter allows you to define sequences where the model will stop generating tokens. This is particularly useful for:

  • Creating structured outputs with clear boundaries
  • Preventing the model from continuing beyond a certain point
  • Implementing custom dialogue patterns
python
# Using a custom stop sequence for structured, concise output.
# The model is instructed to produce '###' at the end of the desired content.
# The API will stop generation when '###' is encountered and will NOT include '###' in the response.

from groq import Groq

client = Groq()
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Provide a 2-sentence summary of the concept of 'artificial general intelligence'. End your summary with '###'."
        }
        # Model's goal before stop sequence removal might be:
        # "Artificial general intelligence (AGI) refers to a type of AI that possesses the ability to understand, learn, and apply knowledge across a wide range of tasks at a level comparable to that of a human being. This contrasts with narrow AI, which is designed for specific tasks. ###"
    ],
    model="llama-3.1-8b-instant",
    stop=["###"],
    max_tokens=100 # Ensure enough tokens for the summary + stop sequence
)

print(chat_completion.choices[0].message.content)
Output

Artificial general intelligence (AGI) refers to a type of AI that possesses the ability to understand, learn, and apply knowledge across a wide range of tasks at a level comparable to that of a human being. This contrasts with narrow AI, which is designed for specific tasks.


When defining stop sequences:

  • Include instructions in your prompt to tell the model to produce the stop sequence in the response
  • Use unique patterns unlikely to appear in normal text, such as ###END### or </response>
  • For code generation, use language-specific endings like } or ;

Deterministic Outputs with Seed

The seed parameter enables deterministic generation, making outputs consistent across multiple runs with the same parameters. This is valuable for:

  • Reproducible results in research or testing
  • Consistent user experiences in production
  • A/B testing different prompts with controlled randomness
python
from groq import Groq

client = Groq()
chat_completion = client.chat.completions.create(
    messages=[
      { "role": "system", "content": "You are a creative storyteller." },
      { "role": "user", "content": "Write a brief opening line to a mystery novel." }
    ],
    model="llama-3.1-8b-instant",
    temperature=0.8,  # Some creativity allowed
    seed=700,  # Deterministic seed
    max_tokens=100
)

print(chat_completion.choices[0].message.content)
Output

"It was the night the clock tower's chimes fell silent, and Detective Jameson received a mysterious letter with a single, chilling phrase: 'The truth lies in Ravenswood.'"


Important notes about seed:

  • Determinism is best-effort and is not guaranteed across model versions
  • Check the system_fingerprint in responses to track backend changes
  • Combining seed with a lower temperature (0.0 - 0.3) may improve determinism
  • Useful for debugging and improving prompts iteratively

Guardrails & Safety

Good prompts set the rules; dedicated guardrail models enforce them. Meta's Llama Guard 4 is designed to sit in front of: or behind: your main model, classifying prompts or outputs for safety violations (hate, self-harm, private data). Integrating a moderation step can cut violation rates without changing your core prompt structure.

When stakes are high (finance, health, compliance), pair clear instructions ("never reveal PII") with an automated filter that rejects or sanitizes unsafe content before it reaches the user.

Next Steps

Ready to level up? Explore dedicated prompt patterns like zero-shot, one-shot, few-shot, chain-of-thought, and more to match the pattern to your task complexity. From there, iterate and refine to improve your prompts.

Was this page helpful?