To run this, click "Runtime" and then "Run all" on a free Tesla T4 Google Colab instance!
To install Unsloth on your own computer, follow the installation instructions on our Github page here.
You will learn how to do data prep, how to train, how to run the model, and how to save it.
News
Read our Gemma 3 blog for what's new in Unsloth and our Reasoning blog on how to train reasoning models.
Visit our docs for all our model uploads and notebooks.
Installation
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    # [NOTE] Do the below ONLY in Colab! Use [pip install unsloth vllm] elsewhere.
    !pip install --no-deps unsloth vllm
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm
    # [NOTE] Do the below ONLY in Colab! Use [pip install unsloth vllm] elsewhere.
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt
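Optionally, run a quick sanity check after the install finishes (not part of the original notebook): confirm the package is installed and a GPU is visible.
# Optional sanity check: confirm unsloth is installed and a GPU is visible.
import torch
from importlib.metadata import version
print("unsloth", version("unsloth"), "| CUDA available:", torch.cuda.is_available())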
Unsloth
FastModel supports loading nearly any model now! This includes Vision and Text models!
from unsloth import FastModel
import torch
fourbit_models = [
    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-1b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-4b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-27b-it-unsloth-bnb-4bit",
    # Other popular models!
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/Llama-3.3-70B",
    "unsloth/mistral-7b-instruct-v0.3",
    "unsloth/Phi-4",
] # More models at https://huggingface.co/unsloth
model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-4b-it",
    max_seq_length = 2048, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
)
We now add LoRA adapters so we only need to update a small number of parameters!
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False, # Turn off for just text!
    finetune_language_layers   = True,  # Should leave on!
    finetune_attention_modules = True,  # Attention good for GRPO
    finetune_mlp_modules       = True,  # Should leave on always!
    r = 8, # Larger = higher accuracy, but might overfit
    lora_alpha = 8, # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)
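To see just how small that update is, you can count the trainable parameters on the PEFT-wrapped model returned above (an optional check, not in the original notebook):
# Count how many parameters the LoRA adapters actually train vs. the full model.
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable_params:,} / {total_params:,} ({100 * trainable_params / total_params:.2f}%)")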
Data Prep
We now use the Gemma-3 format for conversation style finetunes. We use Maxime Labonne's FineTome-100k dataset in ShareGPT style. Gemma-3 renders multi turn conversations like below:
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
We use our get_chat_template function to get the correct chat template. We support zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3 and more.
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)
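As a quick optional check (not in the original flow), render a toy two-turn conversation with the template just attached and confirm it matches the <start_of_turn> format shown above; this assumes the gemma-3 template accepts plain-string content, which is the same layout the standardized dataset uses below.
# Render a toy conversation to inspect the Gemma-3 chat format.
toy_convo = [
    {"role": "user",      "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
]
print(tokenizer.apply_chat_template(toy_convo, tokenize = False))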
from datasets import load_dataset
dataset = load_dataset("mlabonne/FineTome-100k", split = "train")
We now use standardize_data_formats to convert the dataset to the correct format for finetuning!
from unsloth.chat_templates import standardize_data_formats
dataset = standardize_data_formats(dataset)
Let's see what row 100 looks like!
dataset[100]
We now have to apply the chat template for Gemma-3 onto the conversations, and save it to text. We remove the <bos> token using removeprefix('<bos>') since we're finetuning: the Processor will add this token before training, and the model expects only one.
for row in dataset[:5]["text"]:
print("=========================")
print(row)
def formatting_prompts_func(examples):
convos = examples["conversations"]
texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False).removeprefix('<bos>') for convo in convos]
return { "text" : texts, }
dataset = dataset.map(formatting_prompts_func, batched = True)
Let's see how the chat template did! Notice there is no <bos> token, as the processor tokenizer will be adding one.
dataset[100]["text"]
from transformers import TextIteratorStreamer
from threading import Thread
text_streamer = TextIteratorStreamer(tokenizer)
import textwrap
max_print_width = 100
from unsloth import FastLanguageModel
# Before running inference, call `FastLanguageModel.for_inference` first
FastLanguageModel.for_inference(model)
inputs = tokenizer(
    [
        # "Once upon a time, in a galaxy, far far away,"
        # Korean prompt: "I want to build a question generator with fine-tuning. Tell me the dataset format."
        "파인튜닝으로 질문생성기를 만들고 싶어. 데이터셋 포맷을 알려줘"
    ] * 1, return_tensors = "pt").to("cuda")
generation_kwargs = dict(
    inputs,
    streamer = text_streamer,
    max_new_tokens = 256,
    use_cache = True,
)
thread = Thread(target = model.generate, kwargs = generation_kwargs)
thread.start()

length = 0
for j, new_text in enumerate(text_streamer):
    if j == 0:
        wrapped_text = textwrap.wrap(new_text, width = max_print_width)
        length = len(wrapped_text[-1])
        wrapped_text = "\n".join(wrapped_text)
        print(wrapped_text, end = "")
    else:
        length += len(new_text)
        if length >= max_print_width:
            length = 0
            print()
        print(new_text, end = "")
    pass
pass
Set up Google Drive
from google.colab import drive
drive.mount('/content/drive')
output_dir = "/content/drive/MyDrive/gemma3-checkpoints"
Train the model
Now let's use Huggingface TRL's SFTTrainer! More docs here: TRL SFT docs. We do 30 steps to speed things up, but you can set num_train_epochs=1 for a full run and turn off max_steps by setting it to None.
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # Can set up evaluation!
    args = SFTConfig(
        output_dir = "/content/drive/MyDrive/gemma3-checkpoints", # Checkpoint save path (on Google Drive)
        eval_strategy = "no", # Evaluation disabled; set to "steps" and pass eval_dataset to enable
        eval_steps = 500,
        save_strategy = "steps", # Save checkpoints by step
        save_steps = 1, # Save a checkpoint every step (increase, e.g. to 500, for longer runs)
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 30,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
        dataset_num_proc = 2,
    ),
)
We also use Unsloth's train_on_responses_only method to only train on the assistant outputs and ignore the loss on the user's inputs. This helps increase the accuracy of finetunes!
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)
Let's verify that masking of the instruction part worked by printing the 100th row again. Notice how the sample only has a single <bos>, as expected!
tokenizer.decode(trainer.train_dataset[100]["input_ids"])
Now let's print the masked out example - you should see only the answer is present:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")
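You can also count how much of the sequence is excluded from the loss; roughly the user turns and special tokens should carry the -100 label, leaving only the model responses (an optional check, not in the original notebook):
# Fraction of tokens masked out of the loss by train_on_responses_only.
labels = trainer.train_dataset[100]["labels"]
masked = sum(1 for x in labels if x == -100)
print(f"{masked} of {len(labels)} tokens are masked out of the loss")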
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
Let's train the model! To resume a training run, set trainer.train(resume_from_checkpoint = True)
trainer_stats = trainer.train()
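Because the checkpoints above go to Google Drive, a later Colab session can pick up where this one stopped. A minimal sketch, wrapped in the notebook's usual if False guard so the fresh run above stays the default:
if False: # Change to True to resume from the latest checkpoint in output_dir
    trainer_stats = trainer.train(resume_from_checkpoint = True)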
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
Inference
Let's run the model via Unsloth native inference! According to the Gemma-3 team, the recommended settings for inference are temperature = 1.0, top_p = 0.95, top_k = 64.
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)
messages = [{
    "role": "user",
    "content": [{
        "type" : "text",
        "text" : "Continue the sequence: 1, 1, 2, 3, 5, 8,",
    }]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)
outputs = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 64, # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
)
tokenizer.batch_decode(outputs)
You can also use a TextStreamer for continuous inference, so you can see the generation token by token instead of waiting the whole time!
messages = [{
    "role": "user",
    "content": [{"type" : "text", "text" : "Why is the sky blue?",}]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)
from transformers import TextStreamer
streamer = TextStreamer(tokenizer, skip_prompt = True) # Initialize TextStreamer
_ = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 64, # Increase for longer outputs! # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = streamer, # Use the initialized streamer
)
Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's push_to_hub for an online save or save_pretrained for a local save.
[NOTE] This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
model.save_pretrained("gemma-3") # Local saving
tokenizer.save_pretrained("gemma-3")
# model.push_to_hub("HF_ACCOUNT/gemma-3", token = "...") # Online saving
# tokenizer.push_to_hub("HF_ACCOUNT/gemma-3", token = "...") # Online saving
Now if you want to load the LoRA adapters we just saved for inference, set False to True:
if False:
    from unsloth import FastModel
    model, tokenizer = FastModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 2048,
        load_in_4bit = True,
    )

messages = [{
    "role": "user",
    "content": [{"type" : "text", "text" : "What is Gemma-3?",}]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 64, # Increase for longer outputs! # Recommended Gemma-3 settings!
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)
Saving to float16 for VLLM
We also support saving to float16 directly for deployment! We save it in the folder gemma-3-finetune. Set if False to if True to let it run!
if False: # Change to True to save finetune!
    model.save_pretrained_merged("gemma-3-finetune", tokenizer)
If you want to upload / push to your Hugging Face account, set if False to if True and add your Hugging Face token and upload location!
if False: # Change to True to upload finetune
    model.push_to_hub_merged(
        "HF_ACCOUNT/gemma-3-finetune", tokenizer,
        token = "hf_...",
    )
GGUF / llama.cpp Conversion
To save to GGUF / llama.cpp, we support it natively now for all models! For now, you can easily convert to Q8_0, F16 or BF16 precision. Q4_K_M for 4bit will come later!
if False: # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma-3-finetune",
        quantization_type = "Q8_0", # For now only Q8_0, BF16, F16 supported
    )
Likewise, if you want to instead push the GGUF to your Hugging Face account, set if False to if True and add your Hugging Face token and upload location!
if False: # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "gemma-3-finetune",
        quantization_type = "Q8_0", # Only Q8_0, BF16, F16 supported
        repo_id = "HF_ACCOUNT/gemma-finetune-gguf",
        token = "hf_...",
    )
2. Upload the model to Hugging Face
# Log in to Hugging Face (enter your HF token when prompted)
from huggingface_hub import notebook_login
notebook_login() # Enter your HF token
model.save_pretrained("dasomaru/gemma-3-4b-it", safe_serialization=False)
tokenizer.save_pretrained("dasomaru/gemma-3-4b-it")
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# 1. Load the base model
base = AutoModelForCausalLM.from_pretrained("unsloth/gemma-3-4b-it")
# 2. Apply the LoRA adapters
model = PeftModel.from_pretrained(base, "dasomaru/gemma-3-4b-it")
# 3. Merge the LoRA weights into the base model
model = model.merge_and_unload()
# 4. Save the merged model
model.save_pretrained("/content/drive/MyDrive/gemma-3-4b-it", safe_serialization=False)
tokenizer = AutoTokenizer.from_pretrained("unsloth/gemma-3-4b-it")
tokenizer.save_pretrained("/content/drive/MyDrive/gemma-3-4b-it")
from huggingface_hub import login, create_repo, upload_folder
from huggingface_hub import HfApi
# Log in with your token (run once)
login(token="hf_...") # <- your HF token
# Name of the repo to upload to
# repo_name = "llama3-finetuned-realestate"
api = HfApi()
# 1. Create the repository (optional)
# api.create_repo(repo_id="dasomaru/gemma-3-4bit-it-demo", repo_type="model")
# Create the repo first (pass private=True to keep it private)
# create_repo(repo_name, private=True)
# Upload the folder
upload_folder(
    # folder_path="llama3-final", # path of the saved model
    folder_path="/content/drive/MyDrive/gemma-3-4b-it", # path where the merged model was saved
    repo_id="dasomaru/gemma-3-4bit-it-demo", # e.g. "dasomaru/llama3-finetuned-realestate"
    repo_type="model"
)
from huggingface_hub import HfApi, upload_folder
upload_folder(
    repo_id="dasomaru/gemma-3-4b-it", # a new repo name is recommended
    folder_path="/content/drive/MyDrive/gemma-3-4b-it",
    repo_type="model",
    commit_message="Upload merged full model",
)
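To confirm the upload landed, you can list the files now present in the repo (this assumes you are still authenticated from the earlier login call):
from huggingface_hub import HfApi
print(HfApi().list_repo_files("dasomaru/gemma-3-4b-it"))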
Now, use the gemma-3-finetune.gguf file or gemma-3-finetune-Q4_K_M.gguf file in llama.cpp or a UI-based system like Jan or Open WebUI. You can install Jan here and Open WebUI here.
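For instance, the exported GGUF can be loaded directly from Python with the llama-cpp-python bindings; this is only a sketch, not from the original notebook, and the exact .gguf filename depends on what save_pretrained_gguf wrote, so adjust the path.
# pip install llama-cpp-python
from llama_cpp import Llama
llm = Llama(model_path = "gemma-3-finetune.gguf", n_ctx = 2048) # adjust to the actual GGUF path produced above
out = llm.create_chat_completion(
    messages = [{"role": "user", "content": "Why is the sky blue?"}],
    max_tokens = 64,
)
print(out["choices"][0]["message"]["content"])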
And we're done! If you have any questions on Unsloth, we have a Discord channel! If you find any bugs, want to keep up to date with the latest LLM news, need help, or want to join projects, feel free to join our Discord!
Some other links:
- Train your own reasoning model - Llama GRPO notebook Free Colab
- Saving finetunes to Ollama. Free notebook
- Llama 3.2 Vision finetuning - Radiography use case. Free Colab
- See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our documentation!