Retrieval-augmented commit generation using project history #2

Open

opened 2025-07-06 15:35:12 +00:00 by glenux · 0 comments

glenux commented

2025-07-06 15:35:12 +00:00

Owner

Note: Copy of issue from upstream project: https://github.com/GNtousakis/llm-commit/issues/18

Enhance llm-commit by retrieving relevant historical context (commit diffs, messages, and code) to improve the quality and consistency of generated commit messages. This is done via retrieval-augmented generation (RAG) using semantic search.

Benefits:

- improved intent clarity in commit messages
- better alignment with prior work (project-wide and developer-specific)

Problem

Relying solely on the staged diff (and optional hint) limits the model’s ability to understand change intent, project conventions, or past phrasing patterns. This can result in vague, inconsistent, or redundant commit messages.

Proposed solution

Add a lightweight RAG system that retrieves a few relevant items per context type (e.g. up to 3) from the local repository history.

Context sources:

- similar past commits (project-wide)
- same author’s previous commits
- commits affecting the same file(s)
- surrounding code (function, class, or block near the change)
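
The two metadata-based sources (same author, same files) could be collected with plain `git log` queries before any semantic search is involved. The following is a minimal sketch under that assumption; the helper names and the tab-separated output format are hypothetical and not part of `llm-commit`.

```python
from typing import List

# Hypothetical output format: commit hash, author name, subject (tab-separated).
LOG_FORMAT = "--format=%H%x09%an%x09%s"


def author_history_cmd(author: str, limit: int = 3) -> List[str]:
    """Build a git log invocation for the author's most recent commits."""
    return ["git", "log", f"--max-count={limit}", LOG_FORMAT, f"--author={author}"]


def file_history_cmd(paths: List[str], limit: int = 3) -> List[str]:
    """Build a git log invocation for the most recent commits touching `paths`."""
    return ["git", "log", f"--max-count={limit}", LOG_FORMAT, "--", *paths]


def staged_paths_cmd() -> List[str]:
    """Build a git invocation that lists the files in the staged diff."""
    return ["git", "diff", "--cached", "--name-only"]
```

Each list can be passed to `subprocess.run(..., capture_output=True, text=True)` inside a repository. The project-wide "similar past commits" source is the one that actually needs embeddings.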

Retrieval mechanism:

- generate embeddings of past commit messages and diffs using a small embedding model (e.g. `text-embedding-3-small`, `MiniLM`)
- store embeddings and metadata (commit hash, author, file paths) in a lightweight vector index
- at commit time, embed the current staged diff and use it as the query vector against the index
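
To make the retrieval flow concrete, here is a minimal sketch of the index-then-query loop. It assumes a toy hashed bag-of-words function (`toy_embed`) as a stand-in for a real embedding model, and all names in it are hypothetical; a real implementation would call the chosen embedding model instead.

```python
import math
import re
import zlib
from dataclasses import dataclass
from typing import List, Sequence, Tuple

DIM = 256  # toy dimensionality; a real embedding model defines its own size


def toy_embed(text: str) -> List[float]:
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in re.findall(r"[A-Za-z_]+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; both inputs are already unit-length, so a dot product suffices."""
    return sum(x * y for x, y in zip(a, b))


@dataclass
class IndexedCommit:
    commit_hash: str
    author: str
    paths: List[str]
    message: str
    vector: List[float]


# Input rows: (commit hash, author, file paths, message, diff text).
CommitRow = Tuple[str, str, List[str], str, str]


def build_index(commits: Sequence[CommitRow]) -> List[IndexedCommit]:
    """Embed message + diff for every past commit and keep the metadata alongside."""
    return [
        IndexedCommit(h, author, paths, message, toy_embed(message + "\n" + diff))
        for h, author, paths, message, diff in commits
    ]


def retrieve(index: Sequence[IndexedCommit], staged_diff: str, limit: int = 3) -> List[IndexedCommit]:
    """Return the `limit` indexed commits most similar to the staged diff."""
    query = toy_embed(staged_diff)
    ranked = sorted(index, key=lambda c: cosine(query, c.vector), reverse=True)
    return ranked[:limit]
```

A linear scan like this is probably sufficient for small repositories; a dedicated vector index only starts to matter once the history grows large.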

Storage options:

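- use SQLite as a local vector store, either with [vector extensions](https://github.com/asg017/sqlite-vec) or by reusing the [embeddings implementation of Datasette's LLM](https://llm.datasette.io/en/stable/embeddings/python-api.html)
- store the index in a per-project cache (e.g. `.git/llm-commit-vectors.db`), or use an in-memory index when persistence isn’t needed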
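
As a rough illustration of the SQLite option, the sketch below stores vectors as plain BLOBs using only the standard library, with `:memory:` covering the non-persistent case. It deliberately uses neither sqlite-vec nor LLM's embeddings API; the table layout and function names are assumptions made for the example.

```python
import sqlite3
import struct
from typing import List, Sequence, Tuple


def pack_vector(vec: Sequence[float]) -> bytes:
    """Serialise a vector as little-endian float32 values."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_vector(blob: bytes) -> List[float]:
    """Inverse of pack_vector (4 bytes per float32)."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store; pass a file path for a per-project cache."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS commit_vectors (
               commit_hash TEXT PRIMARY KEY,
               author      TEXT NOT NULL,
               paths       TEXT NOT NULL,  -- newline-separated file paths
               vector      BLOB NOT NULL
           )"""
    )
    return conn


def add_commit(conn: sqlite3.Connection, commit_hash: str, author: str,
               paths: Sequence[str], vec: Sequence[float]) -> None:
    """Insert or update one commit's metadata and embedding."""
    conn.execute(
        "INSERT OR REPLACE INTO commit_vectors VALUES (?, ?, ?, ?)",
        (commit_hash, author, "\n".join(paths), pack_vector(vec)),
    )
    conn.commit()


def load_vectors(conn: sqlite3.Connection) -> List[Tuple[str, List[float]]]:
    """Load every stored (commit hash, vector) pair for in-Python similarity search."""
    rows = conn.execute(
        "SELECT commit_hash, vector FROM commit_vectors ORDER BY commit_hash"
    )
    return [(commit_hash, unpack_vector(blob)) for commit_hash, blob in rows]
```

Keying rows by commit hash means re-indexing is idempotent, so the cache can be refreshed incrementally as new commits land.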

Prompt structure:

# context
<commit and syntax rules>

# diff
<current staged diff>

# hint
<optional user-provided hint>

# context: surrounding code
<relevant function or block>

# similar past commits
- commit: <message>
  diff: <snippet>

# file-specific history
- commit: <message>
  diff: <snippet>

# author’s past commits
- commit: <message>
  diff: <snippet>
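
The template above could be rendered by a small helper that skips empty sections. This is a sketch only: `build_prompt` and its parameters are hypothetical names, and the section headers are copied from the proposed structure.

```python
from typing import List, Optional, Sequence, Tuple

PastCommit = Tuple[str, str]  # (commit message, diff snippet)


def _commit_section(title: str, commits: Sequence[PastCommit]) -> str:
    """Render one '# <title>' section listing retrieved commits."""
    lines: List[str] = [f"# {title}"]
    for message, snippet in commits:
        lines.append(f"- commit: {message}")
        lines.append(f"  diff: {snippet}")
    return "\n".join(lines)


def build_prompt(
    rules: str,
    diff: str,
    hint: Optional[str] = None,
    surrounding_code: Optional[str] = None,
    similar: Sequence[PastCommit] = (),
    file_history: Sequence[PastCommit] = (),
    author_history: Sequence[PastCommit] = (),
) -> str:
    """Assemble the prompt in the proposed section order, omitting empty sections."""
    sections = [f"# context\n{rules}", f"# diff\n{diff}"]
    if hint:
        sections.append(f"# hint\n{hint}")
    if surrounding_code:
        sections.append(f"# context: surrounding code\n{surrounding_code}")
    if similar:
        sections.append(_commit_section("similar past commits", similar))
    if file_history:
        sections.append(_commit_section("file-specific history", file_history))
    if author_history:
        sections.append(_commit_section("author’s past commits", author_history))
    return "\n\n".join(sections)
```

Omitting empty sections keeps the prompt short when the index returns nothing, for example on a brand-new repository.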

Retrieval limits and controls

Rather than hardcoding item counts, implement a flexible strategy:

- allow configuration of a **token budget** for context (e.g. `--rag-max-tokens 2400`)
- optionally limit by **item count** per context type (e.g. `--rag-limit 3`) or by **cost** for API-based models (e.g. `--rag-cost $0.1`)
- assemble the prompt in priority order (rules context → diff → hint → surrounding code → file history → author history → project-wide matches) and truncate it to fit within the budget
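
One possible reading of the budget rule is a greedy pass over sections that are already sorted by priority: keep each section while it fits, cut the first one that does not, and drop the rest. The sketch below assumes a crude four-characters-per-token estimate in place of a real tokenizer; `estimate_tokens` and `fit_to_budget` are hypothetical names.

```python
from typing import List, Tuple

CHARS_PER_TOKEN = 4  # crude heuristic; a real tokenizer would be more accurate


def estimate_tokens(text: str) -> int:
    """Approximate the token count as ceil(len(text) / CHARS_PER_TOKEN)."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def fit_to_budget(sections: List[Tuple[str, str]], max_tokens: int) -> List[Tuple[str, str]]:
    """Keep (name, body) sections in priority order until the budget is spent.

    The first section that does not fit is truncated to the remaining budget;
    every lower-priority section after that is dropped.
    """
    kept: List[Tuple[str, str]] = []
    remaining = max_tokens
    for name, body in sections:
        cost = estimate_tokens(body)
        if cost <= remaining:
            kept.append((name, body))
            remaining -= cost
        elif remaining > 0:
            kept.append((name, body[: remaining * CHARS_PER_TOKEN]))
            remaining = 0
        # else: budget exhausted, drop this section entirely
    return kept
```

With a budget of 15 estimated tokens, a 40-character diff (10 tokens) and an 8-character hint (2 tokens) are kept whole, and a following 40-character history section is cut to 12 characters.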


glenux added this to the Default project

2025-07-06 15:52:48 +00:00

glenux added the Kind/Feature label

2025-07-06 16:12:39 +00:00

glenux changed title from ~~[feature] Retrieval-augmented commit generation using project history~~ to Retrieval-augmented commit generation using project history

2025-07-06 16:16:35 +00:00

glenux added this to the vZ.Z.Z - Future (backlog) milestone

2025-07-06 16:34:10 +00:00