Retrieval-augmented commit generation using project history #2

Open
opened 2025-07-06 15:35:12 +00:00 by glenux · 0 comments

Note: Copy of issue from upstream project: https://github.com/GNtousakis/llm-commit/issues/18

Enhance llm-commit by retrieving relevant historical context (commit diffs, messages, and code) to improve the quality and consistency of generated commit messages. This is done via retrieval-augmented generation (RAG) using semantic search.

Benefits:

  • improved intent clarity in commit messages
  • better alignment with prior work (project-wide and developer-specific)

Problem

Relying solely on the staged diff (and optional hint) limits the model’s ability to understand change intent, project conventions, or past phrasing patterns. This can result in vague, inconsistent, or redundant commit messages.

Proposed solution

Add a lightweight RAG system that retrieves up to 3 relevant items per context type from the local repository history.

Context sources:

  • similar past commits (project-wide)
  • same author’s previous commits
  • commits affecting the same file(s)
  • surrounding code (function, class, or block near the change)
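Three of these sources (same author, same files, surrounding code) could be gathered with plain git commands before any embedding is involved. The following is a minimal sketch under that assumption; the function names, the dictionary keys, and the separator-based log format are illustrative and not part of llm-commit.

```python
import subprocess

# Hash, author name and subject, separated by control characters that are
# unlikely to appear in commit subjects (hypothetical format for this sketch).
LOG_FORMAT = "%H%x1f%an%x1f%s%x1e"

def history_commands(author, paths, limit=3):
    """Build one git argv per metadata-based context source."""
    base = ["git", "log", f"--max-count={limit}", f"--format={LOG_FORMAT}"]
    return {
        "author": base + [f"--author={author}"],
        "files": base + ["--", *paths],
        # -W (--function-context) widens each hunk to its enclosing function.
        "surrounding_code": ["git", "diff", "--cached", "-W", "--", *paths],
    }

def parse_log(output):
    """Split `git log` output produced with LOG_FORMAT into dicts."""
    commits = []
    for record in output.split("\x1e"):
        record = record.strip()
        if record:
            commit_hash, author, subject = record.split("\x1f", 2)
            commits.append({"hash": commit_hash, "author": author, "subject": subject})
    return commits

def run(argv, cwd="."):
    """Run a git command inside a repository and return its stdout."""
    return subprocess.run(
        argv, cwd=cwd, capture_output=True, text=True, check=True
    ).stdout
```

The project-wide "similar past commits" source has no metadata filter, so it is the one that actually depends on semantic search.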

Retrieval mechanism:

  • generate embeddings of past commit messages + diffs using a small embedding model (e.g. text-embedding-3-small, MiniLM)
  • store embeddings and metadata (commit hash, author, file paths) in a lightweight vector index
  • query the index at commit time by embedding the current staged diff and searching for its nearest neighbours
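The retrieval steps can be sketched end to end as follows. This is illustrative only: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model such as those named above, and the commit dictionary layout is an assumption of this sketch.

```python
import hashlib
import math
import re

DIM = 256  # toy dimensionality; a real embedding model defines its own

def toy_embed(text):
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in re.findall(r"[A-Za-z_]+", text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    # Both vectors are unit length, so the dot product equals the cosine.
    return sum(x * y for x, y in zip(a, b))

def build_index(commits):
    """commits: iterable of dicts with 'hash', 'author', 'files', 'message', 'diff'."""
    return [
        {**c, "vector": toy_embed(c["message"] + "\n" + c["diff"])}
        for c in commits
    ]

def query(index, staged_diff, k=3):
    """Return the k indexed commits most similar to the staged diff."""
    q = toy_embed(staged_diff)
    ranked = sorted(index, key=lambda e: cosine(q, e["vector"]), reverse=True)
    return ranked[:k]
```

A brute-force scan like this is usually adequate for a few thousand commits; a dedicated vector index only becomes worthwhile for much larger histories.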

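Storage options:

  • use SQLite as a local vector store (either with vector extensions such as sqlite-vec, https://github.com/asg017/sqlite-vec, or by reusing the embeddings implementation from Datasette's LLM, https://llm.datasette.io/en/stable/embeddings/python-api.html)
  • store in a per-project cache (e.g. .git/llm-commit-vectors.db) or use an in-memory index when persistence isn’t needed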
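As a rough illustration of a SQLite-backed store, the sketch below uses only Python's standard `sqlite3` module and keeps each vector as a packed float32 blob. It deliberately does not use sqlite-vec or the LLM embeddings API; the table name, columns, and helper functions are assumptions made for this example.

```python
import sqlite3
import struct

def open_store(path=":memory:"):
    """Open (or create) the vector store; ':memory:' skips persistence."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS commit_vectors (
               hash   TEXT PRIMARY KEY,
               author TEXT NOT NULL,
               files  TEXT NOT NULL,   -- newline-separated paths
               vector BLOB NOT NULL    -- packed little-endian float32
           )"""
    )
    return db

def pack(vector):
    return struct.pack(f"<{len(vector)}f", *vector)

def unpack(blob):
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))

def upsert(db, commit_hash, author, files, vector):
    """Insert a commit's embedding, replacing any earlier row for that hash."""
    db.execute(
        "INSERT OR REPLACE INTO commit_vectors VALUES (?, ?, ?, ?)",
        (commit_hash, author, "\n".join(files), pack(vector)),
    )
    db.commit()

def load(db, author=None):
    """Return (hash, vector) pairs, optionally filtered by author."""
    sql, args = "SELECT hash, vector FROM commit_vectors", ()
    if author is not None:
        sql, args = sql + " WHERE author = ?", (author,)
    return [(h, unpack(blob)) for h, blob in db.execute(sql, args)]
```

Passing a file path such as the per-project cache location gives persistence across runs, while the default `":memory:"` covers the case where persistence isn't needed. Similarity search would then run in Python over the loaded vectors.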

Prompt structure:

# context
<commit and syntax rules>

# diff
<current staged diff>

# hint
<optional user-provided hint>

# context: surrounding code
<relevant function or block>

# similar past commits
- commit: <message>
  diff: <snippet>

# file-specific history
- commit: <message>
  diff: <snippet>

# author’s past commits
- commit: <message>
  diff: <snippet>
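Assembling this template from retrieved items could look roughly like the sketch below. The function names and the `message`/`diff` keys are hypothetical; empty sections are simply omitted, which is one possible choice rather than something the proposal specifies.

```python
def render_commits(commits):
    """Render retrieved commits in the '- commit: / diff:' layout of the template."""
    lines = []
    for c in commits:
        lines.append(f"- commit: {c['message']}")
        lines.append(f"  diff: {c['diff']}")
    return "\n".join(lines)

def build_prompt(rules, diff, hint="", code="",
                 similar=(), file_history=(), author_history=()):
    """Join the non-empty sections in the order given by the template."""
    sections = [
        ("# context", rules),
        ("# diff", diff),
        ("# hint", hint),
        ("# context: surrounding code", code),
        ("# similar past commits", render_commits(similar)),
        ("# file-specific history", render_commits(file_history)),
        ("# author’s past commits", render_commits(author_history)),
    ]
    return "\n\n".join(f"{title}\n{body}" for title, body in sections if body)
```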

Retrieval limits and controls

Rather than hardcoding item counts, implement a flexible strategy:

  • allow configuration of token budget for context (e.g. --rag-max-tokens 2400)
  • optionally limit by item count per context type (e.g. --rag-limit 3) or by cost for API-based models (e.g. --rag-cost $0.1)
  • context is assembled in priority order (context → diff → hint → code → file history → author history → project matches), and truncated to fit within the budget
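A minimal sketch of the budget logic follows, assuming a crude estimate of four characters per token and dropping whole items rather than cutting them mid-text. The flag names come from the examples above; their defaults, the estimator, and `fit_to_budget` are assumptions, and the cost-based limit is not covered.

```python
import argparse

def estimate_tokens(text):
    # Crude assumption: about 4 characters per token; a real tokenizer
    # would give more accurate counts.
    return max(1, len(text) // 4)

def fit_to_budget(sections, max_tokens, limit=None):
    """Keep items in priority order until the token budget is exhausted.

    sections: list of (name, [items]) already sorted by priority.
    limit: optional cap on the number of items taken per section.
    """
    kept, remaining = [], max_tokens
    for name, items in sections:
        chosen = []
        for item in items[:limit]:
            cost = estimate_tokens(item)
            if cost > remaining:
                # Budget exceeded: everything of lower priority is dropped.
                if chosen:
                    kept.append((name, chosen))
                return kept
            chosen.append(item)
            remaining -= cost
        if chosen:
            kept.append((name, chosen))
    return kept

# Hypothetical wiring of the proposed flags; defaults mirror the examples above.
parser = argparse.ArgumentParser(prog="llm-commit")
parser.add_argument("--rag-max-tokens", type=int, default=2400)
parser.add_argument("--rag-limit", type=int, default=3)
```

Stopping at the first item that does not fit keeps the priority order strict; a more lenient variant could skip an oversized item and continue with smaller, lower-priority ones.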
glenux added this to the Default project 2025-07-06 15:52:48 +00:00
glenux changed title from [feature] Retrieval-augmented commit generation using project history to Retrieval-augmented commit generation using project history 2025-07-06 16:16:35 +00:00
Reference: glenux/llm-commit-gen#2