houndr
A fast, trigram-based code search engine for Git repositories - Hound, rewritten in Rust.
Howdy!
I’ve been building Code Search - a self-hosted platform that started as a Hound replacement but quickly grew into something much bigger. Bulk replace across repos, automated merge requests, multi-host support, repository management UI. Feature after feature, it evolved into an alternative to Sourcegraph.
That’s great for teams that need it. But somewhere along the way, I lost what made Hound appealing in the first place: simplicity. Just point it at your repos and search. No databases, no job queues, no Kubernetes.
So I went back to basics and built houndr - a pure Hound alternative, rewritten from scratch in Rust. No additional features, no enterprise bells and whistles. Just fast indexing and fast search.
The name? hound + Rust = houndr.
What It Does
houndr clones your Git repositories, builds trigram indexes, and serves a web UI and JSON API for instant code search across all of them. That’s it.
Trigram-Based Search - Sub-millisecond substring and regex search. Every file is split into overlapping 3-byte windows (trigrams), each mapped to a bitmap of documents containing it. A query like readFile produces trigrams rea, ead, adF, dFi, Fil, ile - intersecting their bitmaps narrows candidates to a handful of files before any content scanning happens. Searches stay fast regardless of corpus size.
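The windowing step above can be sketched in a few lines of Rust. This is an illustrative stand-alone version, not houndr's actual code - the real engine deduplicates trigrams and builds posting lists rather than returning a flat vector:

```rust
// Minimal sketch: every overlapping 3-byte window of the input
// becomes one trigram. "readFile" (8 bytes) yields 8 - 2 = 6 trigrams.
fn trigrams(text: &[u8]) -> Vec<[u8; 3]> {
    text.windows(3)
        .map(|w| [w[0], w[1], w[2]])
        .collect()
}

fn main() {
    for g in trigrams(b"readFile") {
        println!("{}", String::from_utf8_lossy(&g));
    }
}
```

Note that inputs shorter than 3 bytes produce no trigrams at all, which is why trigram engines typically fall back to a scan for 1- and 2-character queries.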
Incremental Indexing - On each poll cycle, houndr fetches the latest refs. If HEAD hasn’t changed, the existing index is reused immediately. If it has, a manifest diff identifies which files changed - unchanged files are copied zero-copy from the previous index via mmap. A commit that touches 2 files in a 10,000-file repo only reads 2 blobs from git.
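The manifest diff can be pictured as comparing two path-to-blob-ID maps. The sketch below is hypothetical (the function name, map types, and string blob IDs are illustrative, not houndr's internal API), but it shows why only changed files cost a git blob read:

```rust
use std::collections::HashMap;

// Compare the previous index's manifest (path -> blob id) against the
// new tree. Files whose blob id is unchanged can be copied from the old
// index without reading anything from git.
fn diff_manifests<'a>(
    old: &'a HashMap<String, String>,
    new: &'a HashMap<String, String>,
) -> (Vec<&'a str>, Vec<&'a str>) {
    let mut changed = Vec::new();   // added or modified: must read blob
    let mut unchanged = Vec::new(); // same blob id: zero-copy reuse
    for (path, blob) in new {
        match old.get(path) {
            Some(prev) if prev == blob => unchanged.push(path.as_str()),
            _ => changed.push(path.as_str()),
        }
    }
    (changed, unchanged)
}

fn main() {
    let old = HashMap::from([
        ("src/main.rs".to_string(), "blob-a".to_string()),
        ("README.md".to_string(), "blob-b".to_string()),
    ]);
    let new = HashMap::from([
        ("src/main.rs".to_string(), "blob-c".to_string()), // modified
        ("README.md".to_string(), "blob-b".to_string()),   // unchanged
        ("src/lib.rs".to_string(), "blob-d".to_string()),  // added
    ]);
    let (changed, unchanged) = diff_manifests(&old, &new);
    println!("read {} blobs, reused {} files", changed.len(), unchanged.len());
}
```

Deleted paths fall out naturally: anything present only in `old` simply never makes it into the new index.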
Memory-Mapped I/O - Indexes are memory-mapped with segment-specific madvise hints: Random for the trigram index (binary search), Sequential for content (streamed reads). File content is accessed as byte slices directly from the OS page cache - no heap allocation, no copies, lazy page loading.
Streaming Results - SSE endpoint streams results per-repo as they complete. No waiting for the slowest repo to finish.
Private Repos - HTTPS tokens, SSH keys, and $ENV_VAR references. Works with your existing credentials.
How Search Works
Query: "readFile"
│
▼
1. Extract trigrams: [rea, ead, adF, dFi, Fil, ile]
│
▼
2. Look up posting lists (RoaringBitmap per trigram)
rea → {0, 3, 7, 12, 45}
ead → {0, 3, 12, 45, 99}
adF → {0, 12, 45}
...
│
▼
3. Intersect bitmaps (smallest-first, early termination)
candidates → {0, 12, 45}
│
▼
4. Parallel content scan (rayon)
First N files: build full match blocks with context lines
Remaining files: count only (accurate totals, bounded response)
│
▼
5. Return grouped results per repo

Posting lists are sorted by cardinality before intersection - starting with the smallest bitmap produces fewer candidates, and each subsequent step can only shrink the result. If the intermediate result becomes empty, the loop terminates early. Trigrams appearing in 3 or fewer documents store their doc IDs directly inline in the index entry, avoiding a pointer chase to the posting section entirely.
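Smallest-first intersection with early termination can be demonstrated with plain sorted `Vec<u32>` doc-ID lists standing in for RoaringBitmaps - a simplification, but the control flow is the same:

```rust
// Intersect posting lists smallest-first. Each input must be sorted
// ascending (so binary_search is valid). The running result can only
// shrink, and an empty intermediate result short-circuits the loop.
fn intersect(mut lists: Vec<Vec<u32>>) -> Vec<u32> {
    lists.sort_by_key(|l| l.len()); // start with the rarest trigram
    let mut iter = lists.into_iter();
    let mut result = match iter.next() {
        Some(first) => first,
        None => return Vec::new(),
    };
    for list in iter {
        result.retain(|d| list.binary_search(d).is_ok());
        if result.is_empty() {
            return result; // early termination: no candidates left
        }
    }
    result
}

fn main() {
    // The posting lists from the diagram above.
    let candidates = intersect(vec![
        vec![0, 3, 7, 12, 45],  // rea
        vec![0, 3, 12, 45, 99], // ead
        vec![0, 12, 45],        // adF
    ]);
    println!("{candidates:?}"); // the surviving candidate doc IDs
}
```

With real RoaringBitmaps the per-element `binary_search` is replaced by a bitmap AND, but the ordering and early-exit logic carry over unchanged.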
The Index Format
houndr uses a custom binary format (.idx) designed for memory-mapped random access:
┌────────────────────────────────────────────┐
│ HEADER (64 bytes) │
│ magic: "HNDR", version, offsets │
├────────────────────────────────────────────┤
│ DOC TABLE (24 bytes × doc_count) │
│ path offset/len, content offset/len │
├────────────────────────────────────────────┤
│ PATH STRINGS (concatenated UTF-8) │
├────────────────────────────────────────────┤
│ TRIGRAM INDEX (16 bytes × count, sorted) │
│ inline: up to 3 doc IDs packed directly │
│ offset: pointer to posting data │
├────────────────────────────────────────────┤
│ POSTING DATA (RoaringBitmaps) │
├────────────────────────────────────────────┤
│ CONTENT DATA (raw file bytes) │
├────────────────────────────────────────────┤
│ FOOTER (xxhash3 checksum) │
└────────────────────────────────────────────┘

File content is embedded directly in the index - one file per repo, atomic updates via rename, zero-copy reads via mmap. The tradeoff is larger index files (~1.2x source size), but for code search that’s acceptable.
Index writes happen in 9 phases to a .tmp file, which is then atomically renamed into place. Readers never see a partial index. Integrity is verified via streaming xxhash3 - checksumming a 100MB index takes ~10ms.
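The write-then-rename pattern itself is small enough to show with the standard library alone. This is a generic sketch of the technique, not houndr's 9-phase writer; the file names are illustrative:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

// Publish an index atomically: write everything to a .tmp sibling,
// flush it to disk, then rename over the final path. rename() is atomic
// on POSIX filesystems (within one filesystem), so readers either see
// the complete old file or the complete new one - never a partial write.
fn publish_index(dir: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let tmp = dir.join("repo.idx.tmp");
    let dst = dir.join("repo.idx");
    let mut f = File::create(&tmp)?;
    f.write_all(bytes)?;
    f.sync_all()?; // ensure data hits disk before the rename publishes it
    fs::rename(&tmp, &dst)?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir();
    publish_index(&dir, b"HNDR fake index payload")?;
    let bytes = fs::read(dir.join("repo.idx"))?;
    println!("published {} bytes", bytes.len());
    Ok(())
}
```

A reader that already has the old file mmapped keeps a valid view until it drops the mapping, which is what lets searches continue during a reindex.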
The Architecture
Three standalone crates, deliberately separated:
┌──────────────────────────────────────────────┐
│ houndr-server │
│ Axum · Web UI · SSE · LRU Cache │
│ │
│ ┌────────────┐ ┌────────────────────────┐ │
│ │ Axum │ │ Background Watcher │ │
│ │ Router │ │ (poll loop) │ │
│ └─────┬──────┘ └──────────┬─────────────┘ │
└────────┼────────────────────┼────────────────┘
│ │
▼ ▼
┌────────────────┐ ┌─────────────────────────┐
│ houndr-index │ │ houndr-repo │
│ │ │ │
│ IndexBuilder │◀──│ pipeline::index_repo() │
│ IndexReader │ │ vcs::GitRepo │
│ QueryPlan │ │ config::Config │
│ Trigram engine │ │ │
└────────────────┘  └─────────────────────────┘

houndr-index is the core engine - trigram extraction, index building (parallel via rayon fold/reduce), disk I/O, and query execution. Zero server or HTTP dependencies. It can power CLI tools or custom integrations standalone.
houndr-repo handles Git operations via libgit2 - bare clones, fetches, tree walks, blob reads, manifest tracking for incremental indexing.
houndr-server is the HTTP layer - Axum with SSE streaming, LRU result cache (pre-serialized JSON, no re-serialization on hits), security headers, rate limiting via Governor, and gzip compression. The web UI is embedded at compile time via include_str! - no build step, no bundler.
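The pre-serialized idea is worth spelling out: the cache stores finished JSON strings, so a hit returns bytes directly instead of re-running serde. Below is a deliberately tiny, hypothetical sketch (names and structure are mine, and it omits the TTL that houndr's cache also applies):

```rust
use std::collections::{HashMap, VecDeque};

// Capacity-bounded LRU mapping a query string to its already-serialized
// JSON response. A hit clones the string; nothing is re-serialized.
struct JsonCache {
    cap: usize,
    map: HashMap<String, String>, // query -> serialized JSON
    order: VecDeque<String>,      // front = least recently used
}

impl JsonCache {
    fn new(cap: usize) -> Self {
        JsonCache { cap, map: HashMap::new(), order: VecDeque::new() }
    }

    fn get(&mut self, key: &str) -> Option<String> {
        let json = self.map.get(key)?.clone();
        // Touch: move this key to the back (most recently used).
        self.order.retain(|k| k != key);
        self.order.push_back(key.to_string());
        Some(json)
    }

    fn put(&mut self, key: String, json: String) {
        if self.map.len() >= self.cap && !self.map.contains_key(&key) {
            if let Some(lru) = self.order.pop_front() {
                self.map.remove(&lru); // evict least recently used
            }
        }
        self.order.retain(|k| *k != key);
        self.order.push_back(key.clone());
        self.map.insert(key, json);
    }
}

fn main() {
    let mut cache = JsonCache::new(2);
    cache.put("readFile".into(), "{\"hits\":3}".into());
    if let Some(json) = cache.get("readFile") {
        println!("cache hit: {json}"); // served without re-serialization
    }
}
```

A production version would use `O(1)` list operations instead of `retain`, but the shape of the optimization - cache the serialized bytes, not the result structs - is the point here.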
Why I Built This
Code Search uses Zoekt under the hood and has grown into a full platform - Next.js frontend, Go backend, PostgreSQL, Redis, worker queues. It’s the right tool when you need bulk operations and multi-host management.
But sometimes you just need to search code. Fast. Without deploying a database.
I wanted something that sits closer to Hound’s philosophy - single binary, single config file, point at repos, search - but with the performance and safety guarantees that come from Rust: zero-copy mmap reads, fearless concurrency, no GC pauses.
I also wanted to understand how trigram search actually works. Not just use Zoekt or Hound as a black box, but implement the index engine myself - the bitmap intersections, the binary format, the mmap strategies. Building it from scratch taught me more about search internals than years of using existing tools.
What I Learned
Trigram Index Design - Elegantly simple in concept (overlapping 3-byte windows → inverted index), surprisingly nuanced in practice. Inline small postings for rare trigrams (~30-50% of all trigrams appear in 3 or fewer files), smallest-first bitmap intersection with early termination, partial deserialization of RoaringBitmap containers that overlap with the current result - each optimization compounded.
Memory-Mapped I/O - Segment-specific madvise hints make a real difference. Random access hints for the trigram index (binary search pattern) vs sequential hints for content data (streamed reads) vs willneed for the header. The OS read-ahead strategy matches each section’s actual access pattern.
Incremental Indexing - Comparing HEAD refs and diffing manifests sounds straightforward, but edge cases abound: force pushes, rebases, deleted branches, repos that temporarily fail auth. Failed repos keep their previous IndexReader - search continues against the last successful index rather than dropping results.
Graceful Shutdown - Three-layer cancellation: CancellationToken → AtomicBool flag checked in git transfer callbacks and tree walks → force-quit on second Ctrl+C. Getting cooperative cancellation right across async Tokio tasks and blocking rayon threads was the trickiest concurrency problem.
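The `AtomicBool` layer of that design can be sketched with the standard library alone. This is a simplified, hypothetical illustration of the pattern (function names are mine), showing the cooperative half without the tokio `CancellationToken` or the force-quit path:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Long-running blocking work (stand-in for a git transfer callback or
// tree walk) polls a shared flag between units of work and exits
// cleanly as soon as it flips. Returns how many units it completed.
fn run_cancellable(stop: &AtomicBool, units: u32) -> u32 {
    let mut processed = 0;
    for _ in 0..units {
        if stop.load(Ordering::Relaxed) {
            break; // cooperative exit point, checked once per unit
        }
        processed += 1; // stand-in for reading one blob
    }
    processed
}

fn main() {
    let stop = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&stop);
    let worker = thread::spawn(move || run_cancellable(&flag, 1_000_000));

    stop.store(true, Ordering::Relaxed); // e.g. triggered by Ctrl+C
    let done = worker.join().unwrap();
    println!("processed {done} units before cancellation");
}
```

The key property is that the blocking thread never needs to be killed: it checks the flag at natural boundaries and unwinds normally, which is what makes the pattern safe to combine with rayon pools and libgit2 callbacks.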
Rust for Systems Work - libgit2 FFI via git2-rs, multi-crate workspaces, unsafe mmap operations, Arc<IndexReader> shared across threads with compile-time safety. The ownership model caught real bugs during development - double-mmap, use-after-unmap, data races on the reader swap - all at compile time.
Tech Stack
- Language: Rust
- Search Engine: Custom trigram index with RoaringBitmaps and memory-mapped I/O
- Parallelism: Rayon (index builds + query execution)
- Git Integration: libgit2 via git2-rs (bare clones, blob reads)
- HTTP Server: Axum with Tower middleware, SSE streaming
- Caching: LRU with TTL (pre-serialized JSON)
- Hashing: FxHash (hot-path trigram maps), xxhash3 (index integrity)
- Configuration: TOML
- Deployment: Single binary or Docker
Getting Started
# Build
cargo build --release
# Configure
cp config.toml my-config.toml
# Add your repositories to my-config.toml
# Run
./target/release/houndr-server --config my-config.toml

Open http://127.0.0.1:6080 and start searching.
Full configuration options and deployment guides are available in the GitHub repository.
Happy searching!