

LLaVA Training Guide

Guide to training and fine-tuning LLaVA models.

Training stages

Stage 1: Feature alignment (Pretraining)

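Purpose: Align vision encoder features with the language model by training only the projector (the vision encoder and language model stay frozen)

Data: 558K image-caption pairs (LAION-CC-SBU subset with BLIP captions)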

# Download pretrained projector or train from scratch
bash scripts/v1_5/pretrain.sh

Configuration:

Base model: Vicuna-7B or LLaMA-2-7B
Vision encoder: CLIP ViT-L/14 at 336px (openai/clip-vit-large-patch14-336)
Training time: ~3.5 hours (7B) or ~5.5 hours (13B) on 8× A100

Stage 2: Visual instruction tuning

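Purpose: Teach the model to follow visual instructions by fine-tuning the projector and the language model

Data: 665K instruction-following mixture (llava_v1_5_mix665k.json), which includes the LLaVA-Instruct-150K GPT-generated multimodal instruction data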

# Fine-tune with instruction data
bash scripts/v1_5/finetune.sh

Configuration:

Epochs: 1
Batch size: 128 (across 8 GPUs)
Learning rate: 2e-5
Training time: ~10 hours (7B) or ~20 hours (13B) on 8× A100
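The batch size of 128 is a global figure, not a per-GPU one. A minimal sketch of the arithmetic, assuming the usual data-parallel relation (global batch = per-device batch × number of GPUs × gradient accumulation steps). The helper names are hypothetical and not part of LLaVA:

```python
def effective_batch_size(per_device: int, num_gpus: int, grad_accum: int = 1) -> int:
    """Global batch size consumed by one optimizer step under data parallelism."""
    return per_device * num_gpus * grad_accum


def accumulation_steps_for(target: int, per_device: int, num_gpus: int) -> int:
    """Gradient accumulation steps needed to reach `target` exactly."""
    per_step = per_device * num_gpus
    if target % per_step != 0:
        raise ValueError(f"{target} is not a multiple of {per_step}")
    return target // per_step


# 16 samples per device on 8 GPUs, no accumulation.
print(effective_batch_size(16, 8, 1))     # 128
# Same global batch on 2 GPUs with 8 samples per device.
print(accumulation_steps_for(128, 8, 2))  # 8
```

When training on fewer GPUs, lowering the per-device batch size and raising gradient accumulation by the same factor keeps the global batch, and therefore the learning rate setting, comparable.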

Data format

Instruction data format

[
    {
        "id": "001",
        "image": "path/to/image.jpg",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nWhat is in this image?"
            },
            {
                "from": "gpt",
                "value": "The image shows a dog playing in a park."
            },
            {
                "from": "human",
                "value": "What breed is the dog?"
            },
            {
                "from": "gpt",
                "value": "It appears to be a Golden Retriever."
            }
        ]
    }
]
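Malformed records tend to surface only after training has started, so it can help to check them up front. The following is a minimal sketch of such a check. The rules it enforces (alternating human/gpt turns, exactly one <image> token placed in the first turn) are assumptions inferred from the example above, and `validate_sample` is a hypothetical helper, not part of LLaVA:

```python
IMAGE_TOKEN = "<image>"


def validate_sample(sample: dict) -> list:
    """Return a list of problems found in one instruction-data record."""
    problems = []
    for key in ("id", "conversations"):
        if key not in sample:
            problems.append(f"missing key: {key}")

    turns = sample.get("conversations", [])
    if not turns:
        problems.append("no conversation turns")
    if len(turns) % 2 != 0:
        problems.append("odd number of turns")

    # Turns are expected to alternate, starting with the human.
    for i, turn in enumerate(turns):
        expected = "human" if i % 2 == 0 else "gpt"
        if turn.get("from") != expected:
            problems.append(f"turn {i}: expected 'from' == {expected!r}")

    # One image per record, referenced once, in the first turn.
    token_count = sum(t.get("value", "").count(IMAGE_TOKEN) for t in turns)
    if "image" in sample:
        if token_count != 1 or IMAGE_TOKEN not in turns[0].get("value", ""):
            problems.append("expected exactly one <image> token, in the first turn")
    elif token_count:
        problems.append("<image> token present but no 'image' field")
    return problems


sample = {
    "id": "001",
    "image": "path/to/image.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is in this image?"},
        {"from": "gpt", "value": "The image shows a dog playing in a park."},
    ],
}
print(validate_sample(sample))  # []
```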

Fine-tuning on custom data

Prepare your data

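import json

# Placeholder dataset: (image_path, [(question, answer), ...]) tuples.
# Replace with your own data; image paths are relative to the image folder.
your_dataset = [
    ("path/to/image.jpg", [
        ("What is in this image?", "The image shows a dog playing in a park."),
        ("What breed is the dog?", "It appears to be a Golden Retriever."),
    ]),
]

# Create instruction data
data = []
for image_path, qa_pairs in your_dataset:
    conversations = []
    for turn, (q, a) in enumerate(qa_pairs):
        # Only the first human turn carries the <image> placeholder
        question = f"<image>\n{q}" if turn == 0 else q
        conversations.append({"from": "human", "value": question})
        conversations.append({"from": "gpt", "value": a})

    data.append({
        "id": str(len(data)),
        "image": image_path,
        "conversations": conversations
    })

# Save
with open("custom_data.json", "w") as f:
    json.dump(data, f, indent=2)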

Fine-tune script

#!/bin/bash

# Set paths
DATA_PATH="custom_data.json"
IMAGE_FOLDER="path/to/images"
MODEL_PATH="liuhaotian/llava-v1.5-7b"
OUTPUT_DIR="./checkpoints/llava-custom"

# Fine-tune
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path "$MODEL_PATH" \
    --version v1 \
    --data_path "$DATA_PATH" \
    --image_folder "$IMAGE_FOLDER" \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir "$OUTPUT_DIR" \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb
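To see what the step-related flags mean for a given dataset, the step count and learning-rate curve can be estimated ahead of time. This is a rough sketch, assuming 8 GPUs, a hypothetical dataset of 10,000 samples, and a linear-warmup-then-cosine schedule of the kind selected by --lr_scheduler_type "cosine". The function names are illustrative and the numbers are estimates, not values reported by the trainer:

```python
import math


def total_steps(num_samples, per_device, num_gpus, grad_accum=1, epochs=1):
    """Approximate number of optimizer steps for the whole run."""
    batches_per_epoch = math.ceil(num_samples / (per_device * num_gpus))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)


def warmup_steps(total, warmup_ratio=0.03):
    """Warmup length implied by a warmup ratio."""
    return math.ceil(total * warmup_ratio)


def cosine_lr(step, total, warmup, base_lr=2e-5):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


steps = total_steps(num_samples=10_000, per_device=16, num_gpus=8)
warm = warmup_steps(steps)
print(steps, warm)      # 79 3
print(steps < 50_000)   # True: --save_steps 50000 is never reached
```

For a small custom dataset like this one, lowering --save_steps below the estimated step count is what makes intermediate checkpoints appear during the run.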

LoRA fine-tuning (memory efficient)

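from llava.mm_utils import get_model_name_from_path
from llava.model.builder import load_pretrained_model
from peft import LoraConfig, get_peft_model

# Load the base LLaVA model (also returns tokenizer, image processor, context length)
model_path = "liuhaotian/llava-v1.5-7b"
tokenizer, base_model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)

# LoRA config
lora_config = LoraConfig(
    r=8,  # LoRA rank
    lora_alpha=16,  # Scaling factor; updates are scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj"],  # Attention query and value projections
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA: only the adapter weights stay trainable
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()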
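The memory saving comes from how few parameters LoRA actually trains. A back-of-the-envelope sketch, assuming a LLaMA-7B-class language model (hidden size 4096, 32 decoder layers, square q_proj and v_proj weights). It counts the language model only, and `lora_param_count` is a hypothetical helper:

```python
def lora_param_count(rank, shapes):
    """Each adapted weight of shape (d_out, d_in) gains A (rank x d_in) and B (d_out x rank)."""
    return sum(rank * (d_in + d_out) for d_out, d_in in shapes)


HIDDEN = 4096  # assumption: hidden size of a 7B LLaMA-style model
LAYERS = 32    # assumption: number of decoder layers

# q_proj and v_proj in every decoder layer, each HIDDEN x HIDDEN.
shapes = [(HIDDEN, HIDDEN)] * 2 * LAYERS
added = lora_param_count(rank=8, shapes=shapes)
print(added)  # 4194304
```

About 4.2M trainable parameters against roughly 7B frozen ones means gradients and optimizer states are needed only for a tiny fraction of the model.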

Hardware requirements

Full fine-tuning

7B model: 8× A100 (40GB)
13B model: 8× A100 (80GB)
Training time: 20-48 hours

LoRA fine-tuning

7B model: 1× A100 (40GB)
13B model: 2× A100 (40GB)
Training time: 10-24 hours
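These GPU counts can be sanity-checked with a rough memory estimate. The sketch below assumes mixed-precision Adam at about 16 bytes per parameter (bf16 weights and gradients plus fp32 master weights, momentum, and variance) and ZeRO stage 2 sharding of gradients and optimizer states. It ignores activations, the vision tower, and framework overhead, so treat the results as lower bounds:

```python
GB = 1e9


def full_finetune_state_bytes(num_params):
    """Weights, gradients, and optimizer states on a single device."""
    # 2 (bf16 weights) + 2 (bf16 grads) + 12 (fp32 master weights, Adam momentum, Adam variance)
    return num_params * 16


def zero2_per_gpu_bytes(num_params, num_gpus):
    """ZeRO-2: full bf16 weights per GPU, gradients and optimizer states sharded."""
    return num_params * 2 + num_params * (2 + 12) / num_gpus


params = 7e9  # assumption: 7B language model
print(full_finetune_state_bytes(params) / GB)  # 112.0
print(zero2_per_gpu_bytes(params, 8) / GB)     # 26.25
```

Under these assumptions a 7B model does not fit on one 40GB GPU for full fine-tuning, but its sharded state fits on each of eight with room left for activations.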

Best practices

Start from a pretrained checkpoint - Don't train from scratch
Use LoRA for efficiency - Trains only small adapter matrices, so it needs far less GPU memory
Quality over quantity - 1K high-quality samples beat 10K low-quality ones
Multi-turn conversations - Give the model more context per image than single Q&A
Diverse images - Cover different scenarios
Clear instructions - Specific questions get better answers
Monitor loss - Should decrease smoothly
Save checkpoints - Training can fail partway through
Test regularly - Validate on a held-out set
Use DeepSpeed - For multi-GPU training
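Validating on a held-out set requires setting one aside before training. A minimal sketch of a reproducible split over records in the instruction data format, assuming a 5% holdout. `split_holdout` is a hypothetical helper, and the fraction and seed are arbitrary choices:

```python
import random


def split_holdout(records, holdout_fraction=0.05, seed=0):
    """Shuffle deterministically and return (train, holdout) lists."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Stand-in records; real ones would also carry "image" and "conversations".
records = [{"id": str(i)} for i in range(100)]
train, holdout = split_holdout(records)
print(len(train), len(holdout))  # 95 5
```

A fixed seed keeps the same records held out across runs, so evaluation results stay comparable between experiments.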

Resources

Training scripts: https://github.com/haotian-liu/LLaVA/tree/main/scripts
Data format: https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md
Paper: https://arxiv.org/abs/2304.08485
