In this quick tutorial we will deploy several AI models and compare how they behave. We will use smaller models, so most users should be able to follow along as long as they have at least 16 GB of memory and no other applications consuming much of it.
In this tutorial we install Ollama directly on the host, but you can also run it in Docker if you prefer. If you are not familiar with Docker, you can check our Docker Tutorial guide here.
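For reference, the containerized setup is typically started roughly like this, using the ollama/ollama image from Docker Hub (the volume and container names here are just examples; adjust them to your environment):

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

You can then run the same ollama commands shown below inside the container with docker exec -it ollama ollama ... instead of on the host.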
I recommend using an Ubuntu or Debian host for this, but Ollama is also supported on Windows and macOS.
Visit the Ollama website to download it.
Install curl if you don't have it already: apt update && apt install curl
curl -fsSL https://ollama.com/install.sh | sh
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
WARNING: Unable to detect NVIDIA/AMD GPU. Install lspci or lshw to automatically detect and install GPU dependencies.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
ollama serve &
Make sure you include the &, so the service runs in the background while still printing its log output to the console. This is very useful for debugging and for troubleshooting performance issues.
2025/04/22 21:00:21 routes.go:1231: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-04-22T21:00:21.697Z level=INFO source=images.go:458 msg="total blobs: 0"
time=2025-04-22T21:00:21.697Z level=INFO source=images.go:465 msg="total unused blobs removed: 0"
time=2025-04-22T21:00:21.697Z level=INFO source=routes.go:1298 msg="Listening on 127.0.0.1:11434 (version 0.6.5)"
time=2025-04-22T21:00:21.697Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-04-22T21:00:21.727Z level=INFO source=gpu.go:377 msg="no compatible GPUs were discovered"
time=2025-04-22T21:00:21.727Z level=INFO source=types.go:130 msg="inference compute" id=0 library=cpu variant="" compute="" driver=0.0 name="" total="267.6 GiB" available="255.5 GiB"
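With the server running, you can confirm the API is responding before pulling anything, for example by listing the (still empty) local model store:

curl http://127.0.0.1:11434/api/tags

This should return a small JSON document with an empty "models" list.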
Find the LLM you want to test in the model library on the Ollama website.
Keep in mind that the larger the model, the more memory, disk space, and compute it consumes when running. For this reason, our example uses a smaller model.
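If you only want to download a model without immediately opening a chat session, you can pull it first and then check what is installed locally:

ollama pull qwen2.5:0.5b
ollama list

In our example we simply let ollama run handle the download and drop us into a chat: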
ollama run qwen2.5:0.5b
Note that this small 0.5B model takes up only about 397 MB.
You should see output similar to the below, ending in a chat prompt.
ollama run qwen2.5:0.5b
[GIN] 2025/04/22 - 21:04:44 | 200 | 79.024µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/04/22 - 21:04:44 | 404 | 455.218µs | 127.0.0.1 | POST "/api/show"
time=2025-04-22T21:04:45.299Z level=INFO source=download.go:177 msg="downloading c5396e06af29 in 4 100 MB part(s)"
pulling manifest
pulling c5396e06af29... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 397 MB
pulling 66b9ea09bd5b... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 68 B
pulling eb4402837c78... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 1.5 KB
pulling 832dd9e00a68... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 11 KB
pulling 005f95c74751... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 490 B
verifying sha256 digest
writing manifest
success
[GIN] 2025/04/22 - 21:04:57 | 200 | 66.099807ms | 127.0.0.1 | POST "/api/show"
⠙ time=2025-04-22T21:04:57.759Z level=INFO source=server.go:105 msg="system memory" total="267.6 GiB" free="255.5 GiB" free_swap="976.0 MiB"
time=2025-04-22T21:04:57.759Z level=WARN source=ggml.go:152 msg="key not found" key=qwen2.vision.block_count default=0
time=2025-04-22T21:04:57.759Z level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.key_length default=64
time=2025-04-22T21:04:57.759Z level=WARN source=ggml.go:152 msg="key not found" key=qwen2.attention.value_length default=64
time=2025-04-22T21:04:57.759Z level=INFO source=server.go:138 msg=offload library=cpu layers.requested=-1 layers.model=25 layers.offload=0 layers.split="" memory.available="[255.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="782.6 MiB" memory.required.partial="0 B" memory.required.kv="96.0 MiB" memory.required.allocations="[782.6 MiB]" memory.weights.total="373.7 MiB" memory.weights.repeating="235.8 MiB" memory.weights.nonrepeating="137.9 MiB" memory.graph.full="298.5 MiB" memory.graph.partial="405.0 MiB"
llama_model_loader: loaded meta data with 34 key-value pairs and 290 tensors from /root/.ollama/models/blobs/sha256-c5396e06af294bd101b30dce59131a76d2b773e76950acc870eda801d3ab0515 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2.5 0.5B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen2.5
llama_model_loader: - kv 5: general.size_label str = 0.5B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.license.link str = https://huggingface.co/Qwen/Qwen2.5-0...
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = Qwen2.5 0.5B
llama_model_loader: - kv 10: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-0.5B
llama_model_loader: - kv 12: general.tags arr[str,2] = ["chat", "text-generation"]
llama_model_loader: - kv 13: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 14: qwen2.block_count u32 = 24
llama_model_loader: - kv 15: qwen2.context_length u32 = 32768
llama_model_loader: - kv 16: qwen2.embedding_length u32 = 896
llama_model_loader: - kv 17: qwen2.feed_forward_length u32 = 4864
llama_model_loader: - kv 18: qwen2.attention.head_count u32 = 14
llama_model_loader: - kv 19: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 20: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 21: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 22: general.file_type u32 = 15
llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 24: tokenizer.ggml.pre str = qwen2
⠹ llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 32: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type q5_0: 132 tensors
llama_model_loader: - type q8_0: 13 tensors
llama_model_loader: - type q4_K: 12 tensors
llama_model_loader: - type q6_K: 12 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 373.71 MiB (6.35 BPW)
⠼ load: special tokens cache size = 22
⠴ load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 1
print_info: model type = ?B
print_info: model params = 494.03 M
print_info: general.name = Qwen2.5 0.5B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-04-22T21:04:58.219Z level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /root/.ollama/models/blobs/sha256-c5396e06af294bd101b30dce59131a76d2b773e76950acc870eda801d3ab0515 --ctx-size 8192 --batch-size 512 --threads 12 --no-mmap --parallel 4 --port 38569"
time=2025-04-22T21:04:58.219Z level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-04-22T21:04:58.219Z level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-04-22T21:04:58.220Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-04-22T21:04:58.241Z level=INFO source=runner.go:853 msg="starting go runner"
time=2025-04-22T21:04:58.243Z level=INFO source=ggml.go:109 msg=system CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-04-22T21:04:58.250Z level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:38569"
⠦ llama_model_loader: loaded meta data with 34 key-value pairs and 290 tensors from /root/.ollama/models/blobs/sha256-c5396e06af294bd101b30dce59131a76d2b773e76950acc870eda801d3ab0515 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2.5 0.5B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen2.5
llama_model_loader: - kv 5: general.size_label str = 0.5B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.license.link str = https://huggingface.co/Qwen/Qwen2.5-0...
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = Qwen2.5 0.5B
llama_model_loader: - kv 10: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-0.5B
llama_model_loader: - kv 12: general.tags arr[str,2] = ["chat", "text-generation"]
llama_model_loader: - kv 13: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 14: qwen2.block_count u32 = 24
llama_model_loader: - kv 15: qwen2.context_length u32 = 32768
llama_model_loader: - kv 16: qwen2.embedding_length u32 = 896
llama_model_loader: - kv 17: qwen2.feed_forward_length u32 = 4864
llama_model_loader: - kv 18: qwen2.attention.head_count u32 = 14
llama_model_loader: - kv 19: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 20: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 21: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 22: general.file_type u32 = 15
llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 24: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
⠧ llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 32: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type q5_0: 132 tensors
llama_model_loader: - type q8_0: 13 tensors
llama_model_loader: - type q4_K: 12 tensors
llama_model_loader: - type q6_K: 12 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 373.71 MiB (6.35 BPW)
⠇ time=2025-04-22T21:04:58.472Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
⠏ load: special tokens cache size = 22
⠋ load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 896
print_info: n_layer = 24
print_info: n_head = 14
print_info: n_head_kv = 2
print_info: n_rot = 64
print_info: n_swa = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 7
print_info: n_embd_k_gqa = 128
print_info: n_embd_v_gqa = 128
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 4864
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 1B
print_info: model params = 494.03 M
print_info: general.name = Qwen2.5 0.5B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: CPU model buffer size = 373.71 MiB
⠹ llama_init_from_model: n_seq_max = 4
llama_init_from_model: n_ctx = 8192
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 24, can_shift = 1
llama_kv_cache_init: CPU KV buffer size = 96.00 MiB
llama_init_from_model: KV self size = 96.00 MiB, K (f16): 48.00 MiB, V (f16): 48.00 MiB
llama_init_from_model: CPU output buffer size = 2.33 MiB
⠸ llama_init_from_model: CPU compute buffer size = 300.25 MiB
llama_init_from_model: graph nodes = 846
llama_init_from_model: graph splits = 1
time=2025-04-22T21:04:58.973Z level=INFO source=server.go:619 msg="llama runner started in 0.75 seconds"
[GIN] 2025/04/22 - 21:04:58 | 200 | 1.319178734s | 127.0.0.1 | POST "/api/generate"
>>> Send a message (/? for help)
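Besides the interactive prompt, you can also talk to the model over the HTTP API. A minimal non-streaming request looks like this (the prompt text is just an example):

curl http://127.0.0.1:11434/api/generate -d '{"model": "qwen2.5:0.5b", "prompt": "Why is the sky blue?", "stream": false}'

The response is a JSON object containing the generated text along with timing statistics.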
Checking the ollama process (for example with top), we can see it uses about 553.7 MB of resident memory:
1589132 root 20 0 2874.7m 553.7m 21.5m S 1140 0.2 3:03.60 ollama
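You can also ask ollama itself which models are currently loaded and how much memory they occupy:

ollama ps

In recent Ollama versions this lists each loaded model with its size and whether it is running on CPU or GPU.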