Fast Inference Test: Qwen3-4B-Thinking-2507¶

Tests fast_inference=True with vLLM backend on Qwen3-4B-Thinking-2507.

Key features tested:

FastLanguageModel loading with fast_inference mode
Thinking model output with <think>...</think> tags
Parsing and displaying thinking vs response separately

Important: This notebook includes a kernel shutdown cell at the end. vLLM does not release GPU memory in single-process mode (Jupyter), so kernel restart is required between different model tests.

In [1]:

  Copied!     
 
# Environment Setup
import os
from dotenv import load_dotenv
load_dotenv()

import unsloth
from unsloth import FastLanguageModel
import vllm
import torch

# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, vLLM {vllm.__version__}, {gpu}")
# Environment Setup import os from dotenv import load_dotenv load_dotenv() import unsloth from unsloth import FastLanguageModel import vllm import torch # Environment summary gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU" print(f"Environment: unsloth {unsloth.__version__}, vLLM {vllm.__version__}, {gpu}")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.

/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues.
  if is_vllm_available():

🦥 Unsloth Zoo will now patch everything to make training faster!Environment: unsloth 2025.12.10, vLLM 0.14.0rc1.dev201+gadcf682fc, NVIDIA GeForce RTX 4080 SUPER

In [2]:

  Copied!     
 
# Test Qwen3-4B-Thinking-2507 with fast_inference=True
MODEL_NAME = "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit"
print(f"\nLoading {MODEL_NAME.split('/')[-1]} with fast_inference=True...")

from vllm import SamplingParams

model, tokenizer = FastLanguageModel.from_pretrained(
    MODEL_NAME,
    max_seq_length=1024,  # Increased for thinking content
    load_in_4bit=True,
    fast_inference=True,
)

print(f"Model loaded: {type(model).__name__}")
# Test Qwen3-4B-Thinking-2507 with fast_inference=True MODEL_NAME = "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit" print(f"\nLoading {MODEL_NAME.split('/')[-1]} with fast_inference=True...") from vllm import SamplingParams model, tokenizer = FastLanguageModel.from_pretrained( MODEL_NAME, max_seq_length=1024, # Increased for thinking content load_in_4bit=True, fast_inference=True, ) print(f"Model loaded: {type(model).__name__}")

Loading Qwen3-4B-Thinking-2507-unsloth-bnb-4bit with fast_inference=True...WARNING 01-03 11:43:50 [vllm.py:1427] Current vLLM config is not set.INFO 01-03 11:43:50 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.INFO 01-03 11:43:50 [vllm.py:609] Disabling NCCL for DP synchronization when using async scheduling.INFO 01-03 11:43:50 [vllm.py:614] Asynchronous scheduling is enabled.INFO 01-03 11:43:51 [vllm_utils.py:702] Unsloth: Patching vLLM v1 graph capture==((====))==  Unsloth 2025.12.10: Fast Qwen3 patching. Transformers: 5.0.0.1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
   \\   /|    NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!Unsloth: vLLM loading unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit with actual GPU utilization = 44.45%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 15.57 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 32.
Unsloth: vLLM's KV Cache can use up to 4.06 GB. Also swap space = 6 GB.
Unsloth: Not an error, but `use_cudagraph` is not supported in vLLM.config.CompilationConfig. Skipping.
Unsloth: Not an error, but `use_inductor` is not supported in vLLM.config.CompilationConfig. Skipping.
WARNING 01-03 11:43:56 [compilation.py:739] Level is deprecated and will be removed in the next release,either 0.12.0 or 0.11.2 whichever is soonest.Use mode instead.If both level and mode are given,only mode will be used.Unsloth: Not an error, but `device` is not supported in vLLM. Skipping.
WARNING 01-03 11:43:56 [attention.py:82] Using VLLM_ATTENTION_BACKEND environment variable is deprecated and will be removed in v0.14.0 or v1.0.0, whichever is soonest. Please use --attention-config.backend command line argument or AttentionConfig(backend=...) config field instead.INFO 01-03 11:43:56 [utils.py:253] non-default args: {'load_format': 'bitsandbytes', 'dtype': torch.bfloat16, 'max_model_len': 1024, 'enable_prefix_caching': True, 'swap_space': 6, 'gpu_memory_utilization': 0.4445226398597971, 'max_num_batched_tokens': 4096, 'max_num_seqs': 32, 'max_logprobs': 0, 'disable_log_stats': True, 'quantization': 'bitsandbytes', 'enable_lora': True, 'max_lora_rank': 64, 'enable_chunked_prefill': True, 'compilation_config': {'level': 3, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': [], 'splitting_ops': None, 'compile_mm_encoder': False, 'compile_sizes': None, 'compile_ranges_split_points': None, 'inductor_compile_config': {'epilogue_fusion': True, 'max_autotune': False, 'shape_padding': True, 'trace.enabled': False, 'triton.cudagraphs': False, 'debug': False, 'dce': True, 'memory_planning': True, 'coordinate_descent_tuning': False, 'trace.graph_diagram': False, 'compile_threads': 16, 'group_fusion': True, 'disable_progress': False, 'verbose_progress': True, 'triton.multi_kernel': 0, 'triton.use_block_ptr': True, 'triton.enable_persistent_tma_matmul': True, 'triton.autotune_at_compile_time': False, 'triton.cooperative_reductions': False, 'cuda.compile_opt_level': '-O2', 'cuda.enable_cuda_lto': True, 'combo_kernels': False, 'benchmark_combo_kernel': True, 'combo_kernel_foreach_dynamic_shapes': True, 'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': None, 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': None, 'pass_config': {}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}, 'model': 'unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit'}WARNING 01-03 11:43:56 [arg_utils.py:1196] The global random seed is set to 0. Since VLLM_ENABLE_V1_MULTIPROCESSING is set to False, this may affect the random state of the Python process that launched vLLM.

/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/pydantic/type_adapter.py:605: UserWarning: Pydantic serializer warnings:
  PydanticSerializationUnexpectedValue(Expected `enum` - serialized value may not be as expected [field_name='mode', input_value=3, input_type=int])
  return self.serializer.to_python(

INFO 01-03 11:43:57 [model.py:517] Resolved architecture: Qwen3ForCausalLMINFO 01-03 11:43:57 [model.py:1688] Using max model len 1024WARNING 01-03 11:43:57 [vllm.py:1427] Current vLLM config is not set.INFO 01-03 11:43:57 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.INFO 01-03 11:43:57 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=4096.Unsloth: vLLM Bitsandbytes config using kwargs = {'load_in_8bit': False, 'load_in_4bit': True, 'bnb_4bit_compute_dtype': 'bfloat16', 'bnb_4bit_quant_storage': 'uint8', 'bnb_4bit_quant_type': 'nf4', 'bnb_4bit_use_double_quant': True, 'llm_int8_enable_fp32_cpu_offload': False, 'llm_int8_has_fp16_weight': False, 'llm_int8_skip_modules': ['embed_tokens', 'embedding', 'lm_head', 'multi_modal_projector', 'merger', 'modality_projection', 'model.layers.27.mlp', 'model.layers.34.mlp', 'model.layers.6.self_attn', 'model.layers.0.mlp', 'model.layers.5.mlp', 'model.layers.4.mlp', 'model.layers.5.self_attn', 'model.layers.1.mlp', 'model.layers.6.mlp'], 'llm_int8_threshold': 6.0}INFO 01-03 11:43:59 [core.py:95] Initializing a V1 LLM engine (v0.14.0rc1.dev201+gadcf682fc) with config: model='unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit', speculative_config=None, tokenizer='unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=bitsandbytes, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False), seed=0, served_model_name=unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': 3, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [4096], 'inductor_compile_config': {'epilogue_fusion': True, 'max_autotune': False, 'shape_padding': True, 'trace.enabled': False, 'triton.cudagraphs': False, 'debug': False, 'dce': True, 'memory_planning': True, 'coordinate_descent_tuning': False, 'trace.graph_diagram': False, 'compile_threads': 16, 'group_fusion': True, 'disable_progress': False, 'verbose_progress': True, 'triton.multi_kernel': 0, 'triton.use_block_ptr': True, 'triton.enable_persistent_tma_matmul': True, 'triton.autotune_at_compile_time': False, 'triton.cooperative_reductions': False, 'cuda.compile_opt_level': '-O2', 'cuda.enable_cuda_lto': True, 'combo_kernels': False, 'benchmark_combo_kernel': True, 'combo_kernel_foreach_dynamic_shapes': True, 'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 64, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}INFO 01-03 11:43:59 [parallel_state.py:1210] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.89.0.21:57215 backend=nccl[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0INFO 01-03 11:43:59 [parallel_state.py:1418] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0

/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/pydantic/type_adapter.py:605: UserWarning: Pydantic serializer warnings:
  PydanticSerializationUnexpectedValue(Expected `enum` - serialized value may not be as expected [field_name='mode', input_value=3, input_type=int])
  return self.serializer.to_python(

INFO 01-03 11:43:59 [topk_topp_sampler.py:47] Using FlashInfer for top-p & top-k sampling.INFO 01-03 11:43:59 [gpu_model_runner.py:3762] Starting to load model unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit...INFO 01-03 11:44:00 [cuda.py:315] Using AttentionBackendEnum.FLASHINFER backend.INFO 01-03 11:44:00 [bitsandbytes_loader.py:790] Loading weights with BitsAndBytes quantization. May take a while ...INFO 01-03 11:44:01 [weight_utils.py:550] No model.safetensors.index.json found in remote.

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]

INFO 01-03 11:44:01 [punica_selector.py:20] Using PunicaWrapperGPU.INFO 01-03 11:44:02 [gpu_model_runner.py:3859] Model loading took 3.5851 GiB memory and 1.476188 secondsINFO 01-03 11:44:08 [backends.py:644] Using cache directory: /workspace/.cache/vllm/torch_compile_cache/fcdf551ae6/rank_0_0/backbone for vLLM's torch.compileINFO 01-03 11:44:08 [backends.py:704] Dynamo bytecode transform time: 5.36 sINFO 01-03 11:44:11 [backends.py:226] Directly load the compiled graph(s) for compile range (1, 4096) from the cache, took 0.489 sINFO 01-03 11:44:11 [monitor.py:34] torch.compile takes 5.85 s in totalINFO 01-03 11:44:12 [gpu_worker.py:363] Available KV cache memory: 2.90 GiBINFO 01-03 11:44:12 [kv_cache_utils.py:1305] GPU KV cache size: 21,072 tokensINFO 01-03 11:44:12 [kv_cache_utils.py:1310] Maximum concurrency for 1,024 tokens per request: 20.58xINFO 01-03 11:44:12 [kernel_warmup.py:64] Warming up FlashInfer attention.INFO 01-03 11:44:12 [vllm_utils.py:707] Unsloth: Running patched vLLM v1 `capture_model`.

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   0%|          | 0/22 [00:00<?, ?it/s]

WARNING 01-03 11:44:12 [utils.py:256] Using default LoRA kernel configs

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   5%|▍         | 1/22 [00:00<00:02,  7.58it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  14%|█▎        | 3/22 [00:00<00:01, 12.30it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  23%|██▎       | 5/22 [00:00<00:01, 14.22it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  32%|███▏      | 7/22 [00:00<00:00, 15.48it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  41%|████      | 9/22 [00:00<00:00, 16.02it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  50%|█████     | 11/22 [00:00<00:00, 16.45it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  59%|█████▉    | 13/22 [00:00<00:00, 16.88it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  68%|██████▊   | 15/22 [00:00<00:00, 17.20it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  77%|███████▋  | 17/22 [00:01<00:00, 16.94it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  86%|████████▋ | 19/22 [00:01<00:00, 17.14it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  95%|█████████▌| 21/22 [00:01<00:00, 17.14it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 22/22 [00:01<00:00, 16.43it/s]
Capturing CUDA graphs (decode, FULL):   0%|          | 0/14 [00:00<?, ?it/s]
Capturing CUDA graphs (decode, FULL):  14%|█▍        | 2/14 [00:00<00:00, 14.91it/s]
Capturing CUDA graphs (decode, FULL):  29%|██▊       | 4/14 [00:00<00:00, 15.59it/s]
Capturing CUDA graphs (decode, FULL):  43%|████▎     | 6/14 [00:00<00:00, 15.94it/s]
Capturing CUDA graphs (decode, FULL):  57%|█████▋    | 8/14 [00:00<00:00, 16.36it/s]
Capturing CUDA graphs (decode, FULL):  71%|███████▏  | 10/14 [00:00<00:00, 16.87it/s]
Capturing CUDA graphs (decode, FULL):  86%|████████▌ | 12/14 [00:00<00:00, 17.06it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 14/14 [00:00<00:00, 17.13it/s]

INFO 01-03 11:44:15 [gpu_model_runner.py:4810] Graph capturing finished in 2 secs, took 0.61 GiBINFO 01-03 11:44:15 [vllm_utils.py:714] Unsloth: Patched vLLM v1 graph capture finished in 2 secs.

INFO 01-03 11:44:16 [core.py:272] init engine (profile, create kv cache, warmup model) took 13.55 secondsINFO 01-03 11:44:16 [core.py:184] Batch queue is enabled with size 2INFO 01-03 11:44:16 [llm.py:344] Supported tasks: ('generate',)Unsloth: Just some info: will skip parsing ['layer_norm2', 'layer_norm1', 'norm', 'attention_norm', 'post_feedforward_layernorm', 'norm1', 'post_attention_layernorm', 'ffn_norm', 'k_norm', 'input_layernorm', 'post_layernorm', 'q_norm', 'pre_feedforward_layernorm', 'norm2']

Loading weights:   0%|          | 0/398 [00:00<?, ?it/s]

Performing substitution for additional_keys=set()
Unsloth: Just some info: will skip parsing ['cross_attn_post_attention_layernorm', 'layer_norm2', 'layer_norm1', 'norm', 'attention_norm', 'post_feedforward_layernorm', 'norm1', 'post_attention_layernorm', 'ffn_norm', 'k_norm', 'input_layernorm', 'post_layernorm', 'cross_attn_input_layernorm', 'q_norm', 'pre_feedforward_layernorm', 'norm2']Model loaded: Qwen3ForCausalLM

In [3]:

  Copied!     
 
# Test generation with thinking model
FastLanguageModel.for_inference(model)

# Use a prompt that encourages reasoning
messages = [{"role": "user", "content": "What is 15 + 27? Show your thinking."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

sampling_params = SamplingParams(
    temperature=0.6,  # Recommended for thinking models
    top_p=0.95,
    top_k=20,
    max_tokens=2048,  # Increased to allow full thinking + response
)

import time
start = time.time()
outputs = model.fast_generate([prompt], sampling_params=sampling_params)
elapsed = time.time() - start

# Get raw output text and token IDs for parsing
raw_output = outputs[0].outputs[0].text
output_token_ids = outputs[0].outputs[0].token_ids

print(f"Generation time: {elapsed:.2f}s")
print(f"Output tokens: {len(output_token_ids)}")
print(f"\n{'='*60}")
print("RAW OUTPUT:")
print(f"{'='*60}")
print(raw_output)
# Test generation with thinking model FastLanguageModel.for_inference(model) # Use a prompt that encourages reasoning messages = [{"role": "user", "content": "What is 15 + 27? Show your thinking."}] prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) sampling_params = SamplingParams( temperature=0.6, # Recommended for thinking models top_p=0.95, top_k=20, max_tokens=2048, # Increased to allow full thinking + response ) import time start = time.time() outputs = model.fast_generate([prompt], sampling_params=sampling_params) elapsed = time.time() - start # Get raw output text and token IDs for parsing raw_output = outputs[0].outputs[0].text output_token_ids = outputs[0].outputs[0].token_ids print(f"Generation time: {elapsed:.2f}s") print(f"Output tokens: {len(output_token_ids)}") print(f"\n{'='*60}") print("RAW OUTPUT:") print(f"{'='*60}") print(raw_output)

Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Generation time: 8.95s
Output tokens: 987

============================================================
RAW OUTPUT:
============================================================
Okay, the user is asking "What is 15 + 27? Show your thinking." Hmm, this seems like a very basic arithmetic question, but they specifically want to see the thinking process. 

First, I should consider who might ask this. Could be a child learning addition, a non-native English speaker, or someone double-checking their math. Since they want "show your thinking," they're probably not just after the answer but want to understand the steps. 

I'll break it down step by step. The most straightforward way is to do it mentally: 15 + 20 is 35, then +7 more is 42. But I should also show the column method since that's how it's taught in schools. 

Wait, should I mention other methods? Like counting on? But for 15+27, that might be overkill. The user didn't specify age, so I'll keep it simple. 

*imagines the user* If this is a kid, I should avoid jargon. If it's an adult, I can be slightly more technical. But since they said "show your thinking," I'll go for clear, step-by-step. 

I'll do: 
- Start with 15 + 20 = 35 (since 27 is 20+7) 
- Then 35 + 7 = 42 
- And the column method for thoroughness 

*checks mental math* 15+27... 10+20=30, 5+7=12, 30+12=42. Yep, same answer. 

Should I add that 42 is the final answer? The user said "show your thinking" so the conclusion is important. 

*thinks about errors* Oh! I should make sure not to say "15+20=35" is wrong - that's correct. Some people might do 15+25=40 then +2=42, but that's more steps. 

I'll go with the two methods: 
1. Breaking into tens and ones 
2. Column addition 

The user didn't specify method, so both are good. I'll write it as: 

"First, I can add 15 and 20 to get 35, then add the remaining 7 to get 42. 
Alternatively, using column addition: 
  15
+ 27
------
  42"

*double-checks* 15+27: 5+7=12, write 2 carry 1; 1+2+1=4. So 42. Perfect. 

I think that's clear. The user gets the answer and the reasoning. No need to overcomplicate it since it's a simple addition. 

*considers tone* I'll keep it friendly and encouraging - "Here's how I think about it" so it feels like a conversation, not a textbook. 

Final answer: 42, with the two methods shown. Done.


Here's my step-by-step thinking for **15 + 27**:

1. **I break it down into parts** to make it easier:  
   - I know 15 is 10 + 5, and 27 is 20 + 7.  
   - So, I can add the "tens" first: **10 + 20 = 30**.  
   - Then add the "ones": **5 + 7 = 12**.  
   - Now, combine the results: **30 + 12 = 42**.  

2. **I also try column addition** (a common method for addition):  
   ```
     15
   + 27
   ----
   ```
   - Add the **ones place first**: 5 + 7 = 12. Write down the **2** and carry over the **1** to the tens place.  
   - Add the **tens place**: 1 (from 15) + 2 (from 27) + 1 (carried over) = 4.  
   - So, the result is **42**.

**Why this works**:  
- Both methods confirm the same answer.  
- Breaking it into tens and ones (10+20 and 5+7) is intuitive and helps avoid mistakes.  
- Column addition is systematic and works for any two numbers.

**Final answer**:  
**15 + 27 = 42** 🌟

In [4]:

  Copied!     
 
# Parse thinking content vs final response using token IDs
THINK_END_TOKEN_ID = 151668  # </think> token ID for Qwen3-Thinking models

def parse_thinking_response_tokens(token_ids, tokenizer):
    """
    Parse thinking model output using token ID 151668 (</think>) as boundary.
    
    With Thinking-2507 models + add_generation_prompt=True:
    - Template adds <think> to prompt
    - Model output starts with thinking content (no <think> prefix)
    - Model outputs </think> (token 151668) when done, followed by response
    """
    token_list = list(token_ids)
    
    if THINK_END_TOKEN_ID in token_list:
        end_idx = token_list.index(THINK_END_TOKEN_ID)
        thinking_tokens = token_list[:end_idx]
        response_tokens = token_list[end_idx + 1:]
        thinking = tokenizer.decode(thinking_tokens, skip_special_tokens=True).strip()
        response = tokenizer.decode(response_tokens, skip_special_tokens=True).strip()
    else:
        # No </think> found - model may still be thinking (need more tokens)
        thinking = tokenizer.decode(token_list, skip_special_tokens=True).strip()
        response = "(Model did not complete thinking - increase max_tokens)"
    
    return thinking, response

# Parse the output using token IDs
thinking_content, response_content = parse_thinking_response_tokens(output_token_ids, tokenizer)

print(f"{'='*60}")
print("THINKING CONTENT:")
print(f"{'='*60}")
print(thinking_content if thinking_content else "(No thinking content found)")

print(f"\n{'='*60}")
print("FINAL RESPONSE:")
print(f"{'='*60}")
print(response_content if response_content else "(No response found)")
# Parse thinking content vs final response using token IDs THINK_END_TOKEN_ID = 151668 # token ID for Qwen3-Thinking models def parse_thinking_response_tokens(token_ids, tokenizer): """ Parse thinking model output using token ID 151668 () as boundary. With Thinking-2507 models + add_generation_prompt=True: - Template adds  to prompt - Model output starts with thinking content (no  prefix) - Model outputs  (token 151668) when done, followed by response """ token_list = list(token_ids) if THINK_END_TOKEN_ID in token_list: end_idx = token_list.index(THINK_END_TOKEN_ID) thinking_tokens = token_list[:end_idx] response_tokens = token_list[end_idx + 1:] thinking = tokenizer.decode(thinking_tokens, skip_special_tokens=True).strip() response = tokenizer.decode(response_tokens, skip_special_tokens=True).strip() else: # No  found - model may still be thinking (need more tokens) thinking = tokenizer.decode(token_list, skip_special_tokens=True).strip() response = "(Model did not complete thinking - increase max_tokens)" return thinking, response # Parse the output using token IDs thinking_content, response_content = parse_thinking_response_tokens(output_token_ids, tokenizer) print(f"{'='*60}") print("THINKING CONTENT:") print(f"{'='*60}") print(thinking_content if thinking_content else "(No thinking content found)") print(f"\n{'='*60}") print("FINAL RESPONSE:") print(f"{'='*60}") print(response_content if response_content else "(No response found)")

============================================================
THINKING CONTENT:
============================================================
Okay, the user is asking "What is 15 + 27? Show your thinking." Hmm, this seems like a very basic arithmetic question, but they specifically want to see the thinking process. 

First, I should consider who might ask this. Could be a child learning addition, a non-native English speaker, or someone double-checking their math. Since they want "show your thinking," they're probably not just after the answer but want to understand the steps. 

I'll break it down step by step. The most straightforward way is to do it mentally: 15 + 20 is 35, then +7 more is 42. But I should also show the column method since that's how it's taught in schools. 

Wait, should I mention other methods? Like counting on? But for 15+27, that might be overkill. The user didn't specify age, so I'll keep it simple. 

*imagines the user* If this is a kid, I should avoid jargon. If it's an adult, I can be slightly more technical. But since they said "show your thinking," I'll go for clear, step-by-step. 

I'll do: 
- Start with 15 + 20 = 35 (since 27 is 20+7) 
- Then 35 + 7 = 42 
- And the column method for thoroughness 

*checks mental math* 15+27... 10+20=30, 5+7=12, 30+12=42. Yep, same answer. 

Should I add that 42 is the final answer? The user said "show your thinking" so the conclusion is important. 

*thinks about errors* Oh! I should make sure not to say "15+20=35" is wrong - that's correct. Some people might do 15+25=40 then +2=42, but that's more steps. 

I'll go with the two methods: 
1. Breaking into tens and ones 
2. Column addition 

The user didn't specify method, so both are good. I'll write it as: 

"First, I can add 15 and 20 to get 35, then add the remaining 7 to get 42. 
Alternatively, using column addition: 
  15
+ 27
------
  42"

*double-checks* 15+27: 5+7=12, write 2 carry 1; 1+2+1=4. So 42. Perfect. 

I think that's clear. The user gets the answer and the reasoning. No need to overcomplicate it since it's a simple addition. 

*considers tone* I'll keep it friendly and encouraging - "Here's how I think about it" so it feels like a conversation, not a textbook. 

Final answer: 42, with the two methods shown. Done.

============================================================
FINAL RESPONSE:
============================================================
Here's my step-by-step thinking for **15 + 27**:

1. **I break it down into parts** to make it easier:  
   - I know 15 is 10 + 5, and 27 is 20 + 7.  
   - So, I can add the "tens" first: **10 + 20 = 30**.  
   - Then add the "ones": **5 + 7 = 12**.  
   - Now, combine the results: **30 + 12 = 42**.  

2. **I also try column addition** (a common method for addition):  
   ```
     15
   + 27
   ----
   ```
   - Add the **ones place first**: 5 + 7 = 12. Write down the **2** and carry over the **1** to the tens place.  
   - Add the **tens place**: 1 (from 15) + 2 (from 27) + 1 (carried over) = 4.  
   - So, the result is **42**.

**Why this works**:  
- Both methods confirm the same answer.  
- Breaking it into tens and ones (10+20 and 5+7) is intuitive and helps avoid mistakes.  
- Column addition is systematic and works for any two numbers.

**Final answer**:  
**15 + 27 = 42** 🌟

In [5]:

  Copied!     
 
# Verification summary
has_thinking = bool(thinking_content) and "(Model did not complete" not in response_content
has_response = bool(response_content) and "(Model did not complete" not in response_content
think_tag_found = THINK_END_TOKEN_ID in list(output_token_ids)

print(f"\n{'='*60}")
print("VERIFICATION SUMMARY")
print(f"{'='*60}")
print(f"Model: {MODEL_NAME}")
print(f"FastInference: ✅ SUPPORTED")
print(f"</think> token found: {'✅ YES' if think_tag_found else '❌ NO (increase max_tokens)'}")
print(f"Thinking content: {'✅ YES' if has_thinking else '❌ NO'}")
print(f"Response generated: {'✅ YES' if has_response else '❌ NO'}")
print(f"Output tokens: {len(output_token_ids)}")
print(f"Generation time: {elapsed:.2f}s")
print(f"{'='*60}")

if has_thinking and has_response and think_tag_found:
    print("\n✅ Qwen3-4B-Thinking-2507 Fast Inference Test PASSED")
elif not think_tag_found:
    print("\n❌ Model did not complete thinking - increase max_tokens")
else:
    print("\n⚠️ Test completed but thinking output may need review")
# Verification summary has_thinking = bool(thinking_content) and "(Model did not complete" not in response_content has_response = bool(response_content) and "(Model did not complete" not in response_content think_tag_found = THINK_END_TOKEN_ID in list(output_token_ids) print(f"\n{'='*60}") print("VERIFICATION SUMMARY") print(f"{'='*60}") print(f"Model: {MODEL_NAME}") print(f"FastInference: ✅ SUPPORTED") print(f" token found: {'✅ YES' if think_tag_found else '❌ NO (increase max_tokens)'}") print(f"Thinking content: {'✅ YES' if has_thinking else '❌ NO'}") print(f"Response generated: {'✅ YES' if has_response else '❌ NO'}") print(f"Output tokens: {len(output_token_ids)}") print(f"Generation time: {elapsed:.2f}s") print(f"{'='*60}") if has_thinking and has_response and think_tag_found: print("\n✅ Qwen3-4B-Thinking-2507 Fast Inference Test PASSED") elif not think_tag_found: print("\n❌ Model did not complete thinking - increase max_tokens") else: print("\n⚠️ Test completed but thinking output may need review")

============================================================
VERIFICATION SUMMARY
============================================================
Model: unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit
FastInference: ✅ SUPPORTED
</think> token found: ✅ YES
Thinking content: ✅ YES
Response generated: ✅ YES
Output tokens: 987
Generation time: 8.95s
============================================================

✅ Qwen3-4B-Thinking-2507 Fast Inference Test PASSED

Test Complete¶

The Qwen3-4B-Thinking-2507 fast_inference test has completed. The kernel will now shut down to release all GPU memory.

What Was Verified¶

FastLanguageModel loading with fast_inference mode (vLLM backend)
Thinking model generates <think>...</think> content
Parsing separates thinking from final response
Self-questioning reasoning style in thinking block

Ready for Production¶

If this test passed, your environment is ready for:

Thinking model inference with vLLM acceleration
Chain-of-thought reasoning workflows
Training notebooks that require thinking output parsing

In [6]:

  Copied!     
 
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
# Shutdown kernel to release all GPU memory import IPython print("Shutting down kernel to release GPU memory...") app = IPython.Application.instance() app.kernel.do_shutdown(restart=False)

Shutting down kernel to release GPU memory...

Out[6]:

{'status': 'ok', 'restart': False}