refactor: reorganize skills into sub-categories

The skills directory was getting disorganized: mlops alone had 40 skills in a flat list, and 12 categories were singletons with just one skill each.

Code change:
- prompt_builder.py: Support sub-categories in skill scanner. skills/mlops/training/axolotl/SKILL.md now shows as category 'mlops/training' instead of just 'mlops'. Backwards-compatible with existing flat structure.

Split mlops (40 skills) into 7 sub-categories:
- mlops/training (12): accelerate, axolotl, flash-attention, grpo-rl-training, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, torchtitan, trl-fine-tuning, unsloth
- mlops/inference (8): gguf, guidance, instructor, llama-cpp, obliteratus, outlines, tensorrt-llm, vllm
- mlops/models (6): audiocraft, clip, llava, segment-anything, stable-diffusion, whisper
- mlops/vector-databases (4): chroma, faiss, pinecone, qdrant
- mlops/evaluation (5): huggingface-tokenizers, lm-evaluation-harness, nemo-curator, saelens, weights-and-biases
- mlops/cloud (2): lambda-labs, modal
- mlops/research (1): dspy

Merged singleton categories:
- gifs → media (gif-search joins youtube-content)
- music-creation → media (heartmula, songsee)
- diagramming → creative (excalidraw joins ascii-art)
- ocr-and-documents → productivity
- domain → research (domain-intel)
- feeds → research (blogwatcher)
- market-data → research (polymarket)

Fixed misplaced skills:
- mlops/code-review → software-development (not ML-specific)
- mlops/ml-paper-writing → research (academic writing)

Added DESCRIPTION.md files for all new/updated categories.
2026-04-26 01:01:40 +00:00 · 2026-03-09 03:35:53 -07:00 · 2026-03-09 03:35:53 -07:00 · 732c66b0f3
commit 732c66b0f3
parent d6c710706f
217 changed files with 39 additions and 4 deletions
--- a/skills/mlops/models/llava/references/training.md
+++ b/skills/mlops/models/llava/references/training.md
@@ -0,0 +1,197 @@
+# LLaVA Training Guide
+
+Guide to training and fine-tuning LLaVA models.
+
+## Training stages
+
+### Stage 1: Feature alignment (Pretraining)
+
+**Purpose**: Align vision encoder with language model
+
+**Data**: 558K image-caption pairs (LAION-CC-SBU subset with BLIP captions)
+
+```bash
+# Download pretrained projector or train from scratch
+bash scripts/v1_5/pretrain.sh
+```
+
+**Configuration:**
+- Base model: Vicuna-7B or LLaMA-2-7B
+- Vision encoder: CLIP ViT-L/14 (336px input, `openai/clip-vit-large-patch14-336`)
+- Training time: ~3.5 hours (7B) to ~5.5 hours (13B) on 8× A100 (80GB), per the LLaVA-1.5 README
+
+### Stage 2: Visual instruction tuning
+
+**Purpose**: Teach model to follow visual instructions
+
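+**Data**: 150K GPT-generated multimodal instruction-following samples, plus about 515K VQA samples from academic-oriented tasks (the LLaVA-1.5 mixture, about 665K in total)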
+
+```bash
+# Fine-tune with instruction data
+bash scripts/v1_5/finetune.sh
+```
+
+**Configuration:**
+- Epochs: 1
+- Batch size: 128 (across 8 GPUs)
+- Learning rate: 2e-5
+- Training time: ~10 hours (7B) to ~20 hours (13B) on 8× A100, per the LLaVA-1.5 README
+
+## Data format
+
+### Instruction data format
+
+```json
+[
+    {
+        "id": "001",
+        "image": "path/to/image.jpg",
+        "conversations": [
+            {
+                "from": "human",
+                "value": "<image>\nWhat is in this image?"
+            },
+            {
+                "from": "gpt",
+                "value": "The image shows a dog playing in a park."
+            },
+            {
+                "from": "human",
+                "value": "What breed is the dog?"
+            },
+            {
+                "from": "gpt",
+                "value": "It appears to be a Golden Retriever."
+            }
+        ]
+    }
+]
+```
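The format above can be checked before training with a small validator. This is a minimal sketch, not part of the LLaVA codebase: the function name `validate_record` and the two rules it enforces (strict human/gpt alternation, and the `<image>` placeholder appearing only in the first human turn) are assumptions inferred from the example record.

```python
from __future__ import annotations

from typing import Any

IMAGE_TOKEN = "<image>"  # placeholder token as it appears in the example record


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one instruction record (empty if none)."""
    problems: list[str] = []
    for key in ("id", "image", "conversations"):
        if key not in record:
            problems.append(f"missing key: {key}")

    turns = record.get("conversations", [])
    if not turns:
        problems.append("no conversation turns")
        return problems

    # Assumed rule: turns alternate human, gpt, human, gpt, ...
    for i, turn in enumerate(turns):
        expected = "human" if i % 2 == 0 else "gpt"
        if turn.get("from") != expected:
            problems.append(f"turn {i}: expected '{expected}', got {turn.get('from')!r}")

    # Assumed rule: the image placeholder appears only in the first human turn.
    if IMAGE_TOKEN not in turns[0].get("value", ""):
        problems.append("first turn lacks the <image> placeholder")
    for i, turn in enumerate(turns[1:], start=1):
        if IMAGE_TOKEN in turn.get("value", ""):
            problems.append(f"turn {i}: unexpected <image> placeholder")

    if len(turns) % 2 != 0:
        problems.append("conversation does not end with a gpt turn")
    return problems


example = {
    "id": "001",
    "image": "path/to/image.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is in this image?"},
        {"from": "gpt", "value": "The image shows a dog playing in a park."},
    ],
}
print(validate_record(example))
```

Running the validator over every record in a dataset file and printing the non-empty results is usually enough to catch swapped roles or a missing placeholder before a long training job starts.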
+
+## Fine-tuning on custom data
+
+### Prepare your data
+
+```python
+import json
+
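+# `your_dataset` is a placeholder: an iterable of (image_path, qa_pairs),
+# where qa_pairs is a list of (question, answer) tuples.
+data = []
+for image_path, qa_pairs in your_dataset:
+    conversations = []
+    for turn, (q, a) in enumerate(qa_pairs):
+        # Only the first human turn carries the <image> placeholder,
+        # matching the instruction data format above.
+        prefix = "<image>\n" if turn == 0 else ""
+        conversations.append({"from": "human", "value": f"{prefix}{q}"})
+        conversations.append({"from": "gpt", "value": a})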
+
+    data.append({
+        "id": str(len(data)),
+        "image": image_path,
+        "conversations": conversations
+    })
+
+# Save
+with open("custom_data.json", "w") as f:
+    json.dump(data, f, indent=2)
+```
+
+### Fine-tune script
+
+```bash
+#!/bin/bash
+
+# Set paths
+DATA_PATH="custom_data.json"
+IMAGE_FOLDER="path/to/images"
+MODEL_PATH="liuhaotian/llava-v1.5-7b"
+OUTPUT_DIR="./checkpoints/llava-custom"
+
+# Fine-tune
+deepspeed llava/train/train_mem.py \
+    --deepspeed ./scripts/zero2.json \
+    --model_name_or_path "$MODEL_PATH" \
+    --version v1 \
+    --data_path "$DATA_PATH" \
+    --image_folder "$IMAGE_FOLDER" \
+    --vision_tower openai/clip-vit-large-patch14-336 \
+    --mm_projector_type mlp2x_gelu \
+    --mm_vision_select_layer -2 \
+    --mm_use_im_start_end False \
+    --mm_use_im_patch_token False \
+    --image_aspect_ratio pad \
+    --group_by_modality_length True \
+    --bf16 True \
+    --output_dir "$OUTPUT_DIR" \
+    --num_train_epochs 1 \
+    --per_device_train_batch_size 16 \
+    --per_device_eval_batch_size 4 \
+    --gradient_accumulation_steps 1 \
+    --evaluation_strategy "no" \
+    --save_strategy "steps" \
+    --save_steps 50000 \
+    --save_total_limit 1 \
+    --learning_rate 2e-5 \
+    --weight_decay 0. \
+    --warmup_ratio 0.03 \
+    --lr_scheduler_type "cosine" \
+    --logging_steps 1 \
+    --tf32 True \
+    --model_max_length 2048 \
+    --gradient_checkpointing True \
+    --dataloader_num_workers 4 \
+    --lazy_preprocess True \
+    --report_to wandb
+```
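The batch-size flags in the script combine into the global batch size listed under Stage 2. The sketch below works through that arithmetic and estimates the number of warmup steps implied by `--warmup_ratio 0.03`. The GPU count of 8 comes from the Stage 2 configuration. The dataset size of 10,000 samples is purely hypothetical, and rounding the warmup steps up is an assumption about how the trainer rounds.

```python
import math


def global_batch_size(per_device: int, num_gpus: int, grad_accum: int) -> int:
    """Samples consumed per optimizer step across all devices."""
    return per_device * num_gpus * grad_accum


def approx_schedule(num_samples: int, batch: int, epochs: int, warmup_ratio: float) -> tuple[int, int]:
    """Approximate (total optimizer steps, warmup steps) for a training run."""
    steps_per_epoch = math.ceil(num_samples / batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumed rounding
    return total_steps, warmup_steps


# Values from the script: per-device batch 16, gradient accumulation 1.
batch = global_batch_size(per_device=16, num_gpus=8, grad_accum=1)

# Hypothetical custom dataset of 10,000 samples, trained for 1 epoch.
total, warmup = approx_schedule(num_samples=10_000, batch=batch, epochs=1, warmup_ratio=0.03)
print(batch, total, warmup)  # 128 79 3
```

When training on fewer GPUs, raising `--gradient_accumulation_steps` by the same factor keeps the global batch size, and therefore the learning-rate setting, comparable.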
+
+## LoRA fine-tuning (memory efficient)
+
+```python
+from peft import LoraConfig, get_peft_model
+
+# LoRA config
+lora_config = LoraConfig(
+    r=8,  # LoRA rank
+    lora_alpha=16,
+    target_modules=["q_proj", "v_proj"],
+    lora_dropout=0.05,
+    bias="none",
+    task_type="CAUSAL_LM"
+)
+
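+# Apply LoRA. `base_model` is assumed to be an already-loaded LLaVA model.
+model = get_peft_model(base_model, lora_config)
+model.print_trainable_parameters()  # confirm only the adapter weights are trainable
+
+# Train as usual; only the LoRA weights receive gradients, which lowers memory use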
+```
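To get a feel for how small the adapter in the configuration above is, the trainable parameter count can be estimated by hand. Each adapted linear layer with input size `d_in` and output size `d_out` gains two low-rank matrices holding `r * (d_in + d_out)` parameters in total. The sketch assumes a LLaMA/Vicuna-7B language model (hidden size 4096, 32 layers) and counts only the language model's `q_proj` and `v_proj` layers. Same-named layers in the vision tower, if they are also matched, would add to this figure.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumptions: LLaMA/Vicuna-7B dimensions, language model layers only.
HIDDEN_SIZE = 4096
NUM_LAYERS = 32

# Values from the LoraConfig above.
RANK = 8
LORA_ALPHA = 16
TARGETS = ("q_proj", "v_proj")

per_module = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
total = per_module * len(TARGETS) * NUM_LAYERS
scaling = LORA_ALPHA / RANK  # factor applied to the low-rank update

print(f"{total:,} trainable LoRA parameters, scaling {scaling}")
```

Under these assumptions the adapter holds about 4.2M parameters, a small fraction of a percent of a 7B model, which is why optimizer state and gradient memory shrink so much.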
+
+## Hardware requirements
+
+### Full fine-tuning
+
+- **7B model**: 8× A100 (40GB)
+- **13B model**: 8× A100 (80GB)
+- **Training time**: 20-48 hours
+
+### LoRA fine-tuning
+
+- **7B model**: 1× A100 (40GB)
+- **13B model**: 2× A100 (40GB)
+- **Training time**: 10-24 hours
+
+## Best practices
+
+1. **Start with pretrained weights** - Don't train from scratch
+2. **Use LoRA for efficiency** - Far less GPU memory (1× vs 8× A100 for the 7B model, per the hardware requirements above)
+3. **Quality over quantity** - 1K high-quality > 10K low-quality
+4. **Multi-turn conversations** - More engaging than single Q&A
+5. **Diverse images** - Cover different scenarios
+6. **Clear instructions** - Specific questions get better answers
+7. **Monitor loss** - Should decrease smoothly
+8. **Save checkpoints** - Training can fail
+9. **Test regularly** - Validate on held-out set
+10. **Use DeepSpeed** - For multi-GPU training
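For the held-out validation mentioned in the list above, one simple approach is to set aside a fixed fraction of the records before training. This is a minimal sketch: the function name `split_records`, the 10% fraction, and the seed are illustrative choices, not something prescribed by LLaVA.

```python
import random
from typing import Any


def split_records(
    records: list[dict[str, Any]],
    held_out_fraction: float = 0.1,
    seed: int = 0,
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Shuffle deterministically and return (train, held_out) lists."""
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_held = max(1, int(len(shuffled) * held_out_fraction))
    return shuffled[n_held:], shuffled[:n_held]


# Stand-in records; real ones would follow the instruction data format.
records = [{"id": str(i)} for i in range(20)]
train, held_out = split_records(records)
print(len(train), len(held_out))
```

Fixing the seed keeps the split stable across runs, so evaluation numbers from different checkpoints stay comparable.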
+
+## Resources
+
+- **Training script**: https://github.com/haotian-liu/LLaVA/tree/main/scripts
+- **Data format**: https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md
+- **Paper**: https://arxiv.org/abs/2304.08485