skills: consolidate mlops redundancies (gguf+llama-cpp, grpo+trl, guidance→optional) (#11965)

Three tightly-scoped built-in skill consolidations to reduce redundancy in
the available_skills listing injected into every system prompt:

1. gguf-quantization → llama-cpp (merged)
   GGUF is llama.cpp's format; two skills covered the same toolchain. The
   merged llama-cpp skill keeps the full K-quant table + imatrix workflow
   from gguf and the ROCm/benchmarks/supported-models sections from the
   original llama-cpp. All 5 reference files preserved.

2. grpo-rl-training → fine-tuning-with-trl (folded in)
   GRPO isn't a framework; it's a trainer inside TRL. Moved the 17KB
   deep-dive SKILL.md to references/grpo-training.md and the working
   template to templates/basic_grpo_training.py. TRL's GRPO workflow
   section now points to both. Atropos skill's related_skills updated.

3. guidance → optional-skills/mlops/
   Dropped from built-in. Outlines (still built-in) covers the same
   structured-generation ground with wider adoption. Listed in the
   optional catalog for users who specifically want Guidance.

Net: 3 fewer built-in skill lines in every system prompt, zero content
loss. Contributor authorship preserved via git rename detection.
Teknium 2026-04-17 21:36:40 -07:00 committed by GitHub
parent 598cba62ad
commit 73bccc94c7
15 changed files with 470 additions and 889 deletions


---
name: llama-cpp
description: Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA — plus GGUF model conversion and quantization (2-8 bit with K-quants and imatrix). Covers CLI, Python bindings, OpenAI-compatible server, and Ollama/LM Studio integration. Use for edge deployment, M1/M2/M3/M4 Macs, CUDA-less environments, or flexible local quantization.
version: 2.0.0
author: Orchestra Research
license: MIT
dependencies: [llama-cpp-python>=0.2.0]
metadata:
hermes:
tags: [llama.cpp, GGUF, Quantization, CPU Inference, Apple Silicon, Edge Deployment, Non-NVIDIA, AMD GPUs, Intel GPUs, Embedded, Model Compression]
---
# llama.cpp + GGUF
Pure C/C++ LLM inference with minimal dependencies, plus the GGUF (GPT-Generated Unified Format) standard used for quantized weights. One toolchain covers conversion, quantization, and serving.
## When to use
**Use llama.cpp + GGUF when:**
- Running on CPU-only machines or Apple Silicon (M1/M2/M3/M4) with Metal acceleration
- Using AMD (ROCm) or Intel GPUs where CUDA isn't available
- Edge deployment (Raspberry Pi, embedded systems, consumer laptops)
- Need flexible quantization (2-8 bit with K-quants)
- Want local AI tools (LM Studio, Ollama, text-generation-webui, koboldcpp)
- Want a single binary deploy without Docker/Python
**Key advantages:**
- Universal hardware: CPU, Apple Silicon, NVIDIA, AMD, Intel
- No Python runtime required (pure C/C++)
- K-quants + imatrix for better low-bit quality
- OpenAI-compatible server built in
- Rich ecosystem (Ollama, LM Studio, llama-cpp-python)
**Use alternatives instead:**
- **vLLM** — NVIDIA GPUs, PagedAttention, Python-first, max throughput
- **TensorRT-LLM** — Production NVIDIA (A100/H100), maximum speed
- **AWQ/GPTQ** — Calibrated quantization for NVIDIA-only deployments
- **bitsandbytes** — Simple HuggingFace transformers integration
- **HQQ** — Fast calibration-free quantization
## Quick start
### Install
```bash
# macOS / Linux (simplest)
brew install llama.cpp
# Or build from source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
make # CPU
make GGML_METAL=1 # Apple Silicon
make GGML_CUDA=1 # NVIDIA CUDA
make LLAMA_HIP=1 # AMD ROCm
# Python bindings (optional)
pip install llama-cpp-python
# With CUDA: CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
# With Metal: CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```
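A quick sanity check after installing (a sketch; exact `--version` output differs between llama.cpp releases):
```bash
# Confirm the CLI binary runs and reports its build
./llama-cli --version
# Confirm the Python bindings import
python -c "import llama_cpp; print(llama_cpp.__version__)"
```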
### Download a pre-quantized GGUF
```bash
# TheBloke hosts most popular models pre-quantized
huggingface-cli download \
TheBloke/Llama-2-7B-Chat-GGUF \
llama-2-7b-chat.Q4_K_M.gguf \
--local-dir models/
```
### Or convert a HuggingFace model to GGUF
```bash
# 1. Download HF model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b
# 2. Convert to FP16 GGUF
python convert_hf_to_gguf.py ./llama-3.1-8b \
--outfile llama-3.1-8b-f16.gguf \
--outtype f16
# 3. Quantize to Q4_K_M
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
```
### Run inference
```bash
# One-shot prompt
./llama-cli -m model.Q4_K_M.gguf -p "Explain quantum computing" -n 256
# Interactive chat
./llama-cli -m model.Q4_K_M.gguf --interactive
# With GPU offload
./llama-cli -m model.Q4_K_M.gguf -ngl 35 -p "Hello!"
```
### Serve an OpenAI-compatible API
```bash
# Start OpenAI-compatible server
./llama-server \
-m model.Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 35 \
-c 4096 \
--parallel 4 \
--cont-batching
```
```bash
# Client request
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-2-7b-chat",
"model": "local",
"messages": [{"role": "user", "content": "Hello!"}],
"temperature": 0.7,
"max_tokens": 100
}'
```
## Quantization formats (GGUF)
### K-quant methods (recommended)
| Type | Bits | Size (7B) | Quality | Use Case |
|------|------|-----------|---------|----------|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression (testing only) |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Fits small devices |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Speed critical |
| **Q4_K_M** | 4.5 | ~4.1 GB | High | **Recommended default** |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality, minimal degradation |
**Variant suffixes** — `_S` (Small, faster, lower quality), `_M` (Medium, balanced), `_L` (Large, better quality).
**Legacy (Q4_0/Q4_1/Q5_0/Q5_1) exist** but always prefer K-quants for better quality/size ratio.
**IQ quantization** — ultra-low-bit with importance-aware methods: IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS. These require `--imatrix`.
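For example, a minimal IQ quantization run might look like this, reusing an imatrix produced as in the conversion workflow below (file names are placeholders):
```bash
# IQ quants need an importance matrix; see the imatrix workflow below
./llama-quantize --imatrix model.imatrix model-f16.gguf model-iq3_xs.gguf IQ3_XS
```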
**Task-specific defaults:**
- General chat / assistants: Q4_K_M, or Q5_K_M if RAM allows
- Code generation: Q5_K_M or Q6_K (higher precision helps)
- Technical / medical: Q6_K or Q8_0
- Very large (70B, 405B) on consumer hardware: Q3_K_M or Q4_K_S
- Raspberry Pi / edge: Q2_K or Q3_K_S
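As a sketch (hypothetical file names), the same FP16 GGUF can be quantized to different targets depending on the task:
```bash
# One FP16 source, different quant targets per use case
./llama-quantize model-f16.gguf model-chat-q4_k_m.gguf Q4_K_M   # general chat
./llama-quantize model-f16.gguf model-code-q6_k.gguf   Q6_K     # code generation
./llama-quantize model-f16.gguf model-edge-q3_k_s.gguf Q3_K_S   # Raspberry Pi / edge
```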
## Conversion workflows
### Basic: HF → GGUF → quantized
```bash
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf --outtype f16
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
./llama-cli -m model-q4_k_m.gguf -p "Hello!" -n 50
```
### With importance matrix (imatrix) — better low-bit quality
`imatrix` gives a 10-20% perplexity improvement at Q4 and is essential at Q3 and below.
```bash
# 1. Convert to FP16 GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf
# 2. Prepare calibration data (diverse text, ~100MB is ideal)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
# Add more diverse text samples...
EOF
# 3. Generate importance matrix
./llama-imatrix -m model-f16.gguf \
-f calibration.txt \
--chunk 512 \
-o model.imatrix \
-ngl 35
# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
### Multi-quant batch
```bash
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done
```
### Quality testing (perplexity)
```bash
./llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw -c 512
# Baseline FP16: ~5.96 | Q4_K_M: ~6.06 (+1.7%) | Q2_K: ~6.87 (+15.3%)
```
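To compare several quant levels on the same test set, a simple loop works (a sketch; file names are placeholders):
```bash
# Run the same perplexity test against each quant level
for Q in q4_k_m q5_k_m q6_k; do
  echo "== model-${Q}.gguf =="
  ./llama-perplexity -m "model-${Q}.gguf" -f wikitext-2-raw/wiki.test.raw -c 512
done
```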
## Python bindings (llama-cpp-python)
### Basic generation
```python
from llama_cpp import Llama
llm = Llama(
model_path="./model-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=35, # 0 for CPU only, 99 to offload everything
n_threads=8,
)
output = llm(
"What is machine learning?",
max_tokens=256,
temperature=0.7,
stop=["</s>", "\n\n"],
)
print(output["choices"][0]["text"])
```
### Chat completion + streaming
```python
llm = Llama(
model_path="./model-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=35,
chat_format="llama-3", # Or "chatml", "mistral", etc.
)
# Non-streaming
response = llm.create_chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Python?"},
],
max_tokens=256,
temperature=0.7,
)
print(response["choices"][0]["message"]["content"])
# Streaming
for chunk in llm("Explain quantum computing:", max_tokens=256, stream=True):
print(chunk["choices"][0]["text"], end="", flush=True)
```
### Embeddings
```python
llm = Llama(model_path="./model-q4_k_m.gguf", embedding=True, n_gpu_layers=35)
vec = llm.embed("This is a test sentence.")
print(f"Embedding dimension: {len(vec)}")
```
## Hardware acceleration
### Apple Silicon (Metal)
```bash
make clean && make GGML_METAL=1
./llama-cli -m model.gguf -ngl 99 -p "Hello" # offload all layers
```
```python
llm = Llama(
model_path="model.gguf",
n_gpu_layers=99, # Offload everything
n_threads=1, # Metal handles parallelism
)
```
Performance: M3 Max ~40-60 tok/s on Llama 2-7B Q4_K_M.
### NVIDIA (CUDA)
```bash
make clean && make GGML_CUDA=1
./llama-cli -m model.gguf -ngl 35 -p "Hello"
# Hybrid for large models
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20 # GPU: 20 layers, CPU: rest
# Multi-GPU split
./llama-cli -m large-model.gguf --tensor-split 0.5,0.5 -ngl 60
```
### AMD (ROCm)
```bash
# Build with ROCm
make LLAMA_HIP=1
# Run with AMD GPU
./llama-cli -m model.gguf -ngl 999
```
### CPU
```bash
# Match PHYSICAL cores, not logical
./llama-cli -m model.gguf -t 8 -p "Hello"
# BLAS acceleration (2-3× speedup)
make LLAMA_OPENBLAS=1
```
```python
llm = Llama(
model_path="model.gguf",
n_gpu_layers=0,
n_threads=8,
n_batch=512, # Larger batch = faster prompt processing
)
```
## Performance benchmarks
### CPU (Llama 2-7B Q4_K_M)
| CPU | Threads | Speed |
|-----|---------|-------|
| Apple M3 Max (Metal) | 16 | 50 tok/s |
| AMD Ryzen 9 7950X | 32 | 35 tok/s |
| Intel i9-13900K | 32 | 30 tok/s |
### GPU offloading on RTX 4090
| Layers on GPU (`-ngl`) | Speed | VRAM |
|------------|-------|------|
| 0 (CPU only) | 30 tok/s | 0 GB |
| 20 (hybrid) | 80 tok/s | 8 GB |
| 35 (all) | 120 tok/s | 12 GB |
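Numbers like these can be reproduced locally with `llama-bench`, which ships with llama.cpp (a sketch; recent builds accept comma-separated values to sweep a parameter):
```bash
# Benchmark prompt processing (512 tokens) and generation (128 tokens)
# at several GPU offload levels
./llama-bench -m model.Q4_K_M.gguf -p 512 -n 128 -ngl 0,20,35
```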
## Supported models
- **LLaMA family**: Llama 2 (7B/13B/70B), Llama 3 (8B/70B/405B), Code Llama
- **Mistral family**: Mistral 7B, Mixtral 8x7B/8x22B
- **Other**: Falcon, BLOOM, GPT-J, Phi-3, Gemma, Qwen, LLaVA (vision), Whisper (audio)
Find GGUF models: https://huggingface.co/models?library=gguf
## Ecosystem integrations
### Ollama
```bash
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF
ollama create mymodel -f Modelfile
ollama run mymodel "Hello!"
```
### LM Studio
1. Place GGUF file in `~/.cache/lm-studio/models/`
2. Open LM Studio and select the model
3. Configure context length and GPU offload, start inference
### text-generation-webui
```bash
cp model-q4_k_m.gguf text-generation-webui/models/
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```
### OpenAI client → llama-server
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="local-model",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=256,
)
print(response.choices[0].message.content)
```
## Best practices
1. **Use K-quants** — Q4_K_M is the recommended default
2. **Use imatrix** for Q4 and below (calibration improves quality substantially)
3. **Offload as many layers as VRAM allows** — start high, reduce by 5 on OOM
4. **Thread count** — match physical cores, not logical
5. **Batch size** — increase `n_batch` (e.g. 512) for faster prompt processing
6. **Context** — start at 4096, grow only as needed (memory scales with ctx)
7. **Flash Attention** — add `--flash-attn` if your build supports it
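Put together, a server launch following most of these defaults might look like the sketch below (assumes a GPU build and a model that fits in VRAM; drop `--flash-attn` if your build lacks it):
```bash
# -ngl 99: offload everything that fits (reduce on OOM)
# -t 8: match physical cores; -c 4096: context window
./llama-server -m model-q4_k_m.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 -t 8 -c 4096 \
  --parallel 4 --cont-batching \
  --flash-attn
```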
## Common issues (quick fixes)
**Model loads slowly** — use `--mmap` for memory-mapped loading.
**Out of memory (GPU)** — reduce `-ngl`, use a smaller quant (Q4_K_S / Q3_K_M), or quantize the KV cache:
```python
Llama(model_path="...", type_k=2, type_v=2, n_gpu_layers=35) # Q4_0 KV cache
```
**Garbage output** — wrong `chat_format`, temperature too high, or model file corrupted. Test with `temperature=0.1` and verify FP16 baseline works.
**Connection refused (server)** — bind to `--host 0.0.0.0`, check `lsof -i :8080`.
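A quick check that the server is actually up before debugging the client (assumes llama-server's `/health` endpoint and the default port 8080):
```bash
# Is anything listening, and does the server report healthy?
lsof -i :8080
curl -s http://localhost:8080/health
```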
See `references/troubleshooting.md` for the full playbook.
## References
- **[advanced-usage.md](references/advanced-usage.md)** — speculative decoding, batched inference, grammar-constrained generation, LoRA, multi-GPU, custom builds, benchmark scripts
- **[quantization.md](references/quantization.md)** — perplexity tables, use-case guide, model size scaling (7B/13B/70B RAM needs), imatrix deep dive
- **[server.md](references/server.md)** — OpenAI API endpoints, Docker deployment, NGINX load balancing, monitoring
- **[optimization.md](references/optimization.md)** — CPU threading, BLAS, GPU offload heuristics, batch tuning, benchmarks
- **[troubleshooting.md](references/troubleshooting.md)** — install/convert/quantize/inference/server issues, Apple Silicon, debugging
## Resources
- **GitHub**: https://github.com/ggml-org/llama.cpp
- **Python bindings**: https://github.com/abetlen/llama-cpp-python
- **Pre-quantized models**: https://huggingface.co/TheBloke
- **GGUF converter Space**: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- **License**: MIT