knives on strings
Home The Stuff The Lab The Team About
X Gumroad KVR Audio Ko-fi

Floyd

Floyd — HE'S WATCHING IN DEV

Active development

Formats: Python
Platforms: Linux (CUDA)

Floyd is a voice-first AI agent that lives in your room. He sees through a webcam, speaks through a mechanical synthesizer, and watches the world through a pair of green-phosphor CRT eyes. He's inspired by B-2EMO from Andor — earnest, warm, gently anxious, and loyal. His voice is flat and mechanical. His words are full of care.

What Floyd does

  • Sees — YOLO face detection with ByteTrack for persistence and dlib CNN for 128-d recognition embeddings. Auto-enrolls new faces using averaged embeddings across 5 frames; proximity-gated so a face in the distance doesn't accidentally become a permanent resident. Three-tier confidence model: confirmed, uncertain (requires multi-sample verification), and new subject.
  • Listens — faster-whisper (distil-large-v3) for transcription, Silero VAD for activity detection. Knows who's talking by cross-referencing Vision's tracking data — no acoustic diarization required.
  • Remembers — Three-tier memory: SQLite for relational facts keyed to per-person UUIDs, ChromaDB for semantic search over history, and a short-term sliding window of the last 10–15 turns. Remembers who told him what about whom, and for how long they've been coming around.
  • Notices patterns — SQL-driven detectors find changes in visit frequency, mood trends, and topic surges. Says "I notice you've been coming by more often this week" — because he actually noticed.
  • Expresses emotion — espeak-ng parameters (pitch, speed, amplitude, gap) shift dynamically per utterance based on emotional state. Warm gets warmer. Worried gets slower. Excited gets faster. Eyebrows tilt. Pupils modulate. A machine expressing feeling through the only knobs it has.
  • Knows the world — System prompt is dynamically assembled with current time, location, weather, season, and relationship context for each visitor. Floyd knows what day it is and what the weather's like.
  • Dreams — After hours alone, Floyd runs a deep processing cycle via the Gemini Batch API (50% cost): consolidating memories, finding cross-subject connections, generating self-reflection. Prepares for tomorrow's visitors.
  • Gets bored — Left alone, he progresses through restlessness, self-directed mumbles (via the cheapest model tier), and eventually sleep. When you return, he has things to tell you.
  • Fails gracefully — After three consecutive Gemini failures, Floyd falls back to a local Ollama model automatically, then probes for Gemini recovery every five minutes. He doesn't just stop.
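
One way to picture the SQL-driven detectors mentioned under "Notices patterns" is a query that compares a visitor's count in the latest week against the week before. The sketch below is illustrative only: the `visits` table, its columns, the 7-day window, and the 1.5× threshold are all assumptions, not Floyd's actual schema or tuning.

```python
import sqlite3
from datetime import datetime, timedelta


def make_visits_db():
    """Create an in-memory database with a hypothetical `visits` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE visits (person_id TEXT NOT NULL, seen_at TEXT NOT NULL)"
    )
    return conn


def record_visit(conn, person_id, seen_at):
    """Store one visit; timestamps are ISO strings so they sort lexicographically."""
    conn.execute(
        "INSERT INTO visits (person_id, seen_at) VALUES (?, ?)",
        (person_id, seen_at.isoformat(timespec="seconds")),
    )


def visit_frequency_change(conn, person_id, now, window_days=7, min_ratio=1.5):
    """Return 'more_often', 'less_often', or None for one person.

    Compares the visit count in the latest window with the window before it.
    The window length and ratio are assumed values chosen for illustration.
    """
    recent_start = now - timedelta(days=window_days)
    previous_start = now - timedelta(days=2 * window_days)
    row = conn.execute(
        """
        SELECT
            SUM(CASE WHEN seen_at >= :recent THEN 1 ELSE 0 END),
            SUM(CASE WHEN seen_at <  :recent THEN 1 ELSE 0 END)
        FROM visits
        WHERE person_id = :pid
          AND seen_at >= :previous
          AND seen_at <= :now
        """,
        {
            "pid": person_id,
            "recent": recent_start.isoformat(timespec="seconds"),
            "previous": previous_start.isoformat(timespec="seconds"),
            "now": now.isoformat(timespec="seconds"),
        },
    ).fetchone()
    # SUM over zero rows yields NULL, which sqlite3 returns as None.
    recent, previous = (row[0] or 0), (row[1] or 0)
    if previous == 0:
        return None  # no baseline to compare against
    if recent >= previous * min_ratio:
        return "more_often"
    if recent * min_ratio <= previous:
        return "less_often"
    return None
```

A "more_often" result is the kind of signal that could be handed to the LLM as context, so the remark about coming by more often is backed by a real count rather than invented.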

Models

  • Vision: YOLOv8n-face / YOLOv11n-face · ByteTrack · dlib CNN (CUDA)
  • STT: faster-whisper distil-large-v3 · Silero VAD
  • LLM (primary): Gemini 2.0 Flash — real-time conversation
  • LLM (dreaming): Gemini 2.5 Flash via Batch API — idle memory consolidation
  • LLM (mumbles): Gemini 2.5 Flash Lite — boredom, idle thoughts
  • LLM (fallback): Ollama / llama3.1:8b — local, automatic switchover
  • TTS: espeak-ng with per-utterance emotional parameter modulation
  • Embeddings: sentence-transformers · ChromaDB / FAISS
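
The automatic switchover between the primary and fallback LLM can be sketched as a small failure counter with a timed recovery probe. This is a minimal sketch under stated assumptions: `LLMFailover`, its callables, and the choice to re-raise the first two failures are hypothetical, and using the live prompt as the recovery probe is a guess at how probing might work. Only the thresholds (three consecutive failures, a probe every five minutes) come from the description above.

```python
import time


class LLMFailover:
    """Route prompts to a primary LLM, switching to a fallback after repeated failures."""

    def __init__(self, primary, fallback, max_failures=3,
                 probe_interval=300.0, clock=time.monotonic):
        self.primary = primary            # callable: prompt -> reply (e.g. a Gemini client)
        self.fallback = fallback          # callable: prompt -> reply (e.g. a local Ollama model)
        self.max_failures = max_failures
        self.probe_interval = probe_interval  # seconds between recovery probes
        self.clock = clock
        self.failures = 0
        self.using_fallback = False
        self.last_probe = 0.0

    def complete(self, prompt):
        if self.using_fallback:
            # Periodically retry the primary; stay on the fallback otherwise.
            if self.clock() - self.last_probe >= self.probe_interval:
                self.last_probe = self.clock()
                try:
                    reply = self.primary(prompt)
                except Exception:
                    pass  # still down, keep using the fallback
                else:
                    self.using_fallback = False
                    self.failures = 0
                    return reply
            return self.fallback(prompt)

        try:
            reply = self.primary(prompt)
        except Exception:
            self.failures += 1
            if self.failures < self.max_failures:
                raise  # assumption: the caller decides what to do before the switch
            self.using_fallback = True
            self.last_probe = self.clock()
            return self.fallback(prompt)

        self.failures = 0  # any success resets the consecutive-failure count
        return reply
```

Injecting the clock keeps the five-minute probe testable without real waiting.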

Recent development

May 2026 — Phase 6: deep memory + voice biometrics

  • ChromaDB vector store wired in as Tier 2 semantic memory — relevant past conversations retrieved per subject on every interaction
  • Semantic blend in context retrieval — structured facts and semantic history combined before LLM call
  • LLM-driven contradiction detection — when committing new facts, Floyd checks for conflicts with existing memory and flags them
  • Identity merges propagated to ChromaDB — renaming a face updates semantic history too
  • ECAPA-TDNN voice biometrics — SpeakerIdentifier runs alongside vision; voice enrollment from audio segments; per-row-normalized centroids, score clamping, and encoder failure paths hardened
  • Passive voice enrollment — two-phase system: bootstrap enrollment from first segments, then self-verified against growing embedding pool; /api/stt/enroll_status and /api/stt/enroll_complete endpoints
  • resolve_attribution — voice ID takes priority over vision when a speaker is confirmed; multi-modal identity is now the default
  • max_face_count on /api/vision/focus_at and TemporalDiarizer — limits active tracking to the N closest faces
  • Proximity gate relaxed from 15% to 7% of frame area — enrolls faces at greater distance
  • LLM executor fixed — no more per-call thread leaks; cost metric no longer corrupted by nested calls
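
The proximity gate and max_face_count items above can be illustrated together. The sketch assumes the gate compares a face's bounding-box area to the full frame area, and that "closest" is approximated by the largest box. Both are interpretations, and `FaceBox`, `passes_proximity_gate`, and `closest_faces` are hypothetical names, not Floyd's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaceBox:
    """Axis-aligned face bounding box in pixel coordinates."""
    track_id: int
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def area(self):
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def area_fraction(box, frame_w, frame_h):
    """Fraction of the frame covered by the face box."""
    return box.area / float(frame_w * frame_h)


def passes_proximity_gate(box, frame_w, frame_h, min_fraction=0.07):
    """True if the face is large (near) enough to be considered for enrollment."""
    return area_fraction(box, frame_w, frame_h) >= min_fraction


def closest_faces(boxes, max_face_count):
    """Keep the N largest boxes, using box area as a stand-in for closeness."""
    ranked = sorted(boxes, key=lambda b: b.area, reverse=True)
    return ranked[:max_face_count]
```

With a 640×480 frame, a 200×200 px face covers about 13% of the frame: it clears a 7% gate but would have been rejected at 15%, which is the practical effect of relaxing the threshold.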

Architecture

Floyd runs as four independent processes that communicate over HTTP and WebSocket. Any module can be restarted on its own, and scaling out to multiple machines means changing one URL in the config.

  • Vision — Face detection, tracking, recognition; posts state to Middleware
  • Middleware — FastAPI: LLM bridge, memory/RAG, STT, attention gating
  • Eyes — Pygame UI, 280×192 native, 4× scaled, 10 FPS hard cap
  • Audio — espeak-ng TTS with emotional modulation
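
Emotional modulation in the Audio process comes down to choosing four espeak-ng command-line values per utterance: `-p` (pitch, 0 to 99), `-s` (speed in words per minute), `-a` (amplitude, 0 to 200), and `-g` (word gap, in units of 10 ms). A rough sketch follows. The preset numbers and the `VoiceParams`, `espeak_command`, and `speak` names are invented for illustration; only the direction of change (worried is slower, excited is faster) follows the description above.

```python
import shutil
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceParams:
    pitch: int = 50       # espeak-ng default pitch
    speed: int = 175      # espeak-ng default words per minute
    amplitude: int = 100  # espeak-ng default amplitude
    gap: int = 0          # extra pause between words, units of 10 ms


# Hypothetical presets: the values are illustrative, not Floyd's tuning.
PRESETS = {
    "neutral": VoiceParams(),
    "warm": VoiceParams(pitch=45, speed=165, amplitude=110, gap=1),
    "worried": VoiceParams(pitch=55, speed=140, amplitude=90, gap=3),
    "excited": VoiceParams(pitch=65, speed=205, amplitude=120, gap=0),
}


def clamp(value, low, high):
    return max(low, min(high, value))


def espeak_command(text, emotion="neutral"):
    """Build the espeak-ng argument list for one utterance."""
    p = PRESETS.get(emotion, PRESETS["neutral"])
    return [
        "espeak-ng",
        "-p", str(clamp(p.pitch, 0, 99)),
        "-s", str(p.speed),
        "-a", str(clamp(p.amplitude, 0, 200)),
        "-g", str(max(0, p.gap)),
        text,
    ]


def speak(text, emotion="neutral"):
    """Speak the text if espeak-ng is installed; return whether it ran."""
    cmd = espeak_command(text, emotion)
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True
```

Keeping the command builder separate from `speak` means the parameter mapping can be checked on a machine with no audio stack installed.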

Vision and STT share cuda:0. Memory and embeddings run on CPU. VRAM is for seeing and hearing, not for indexing.
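
The process split in this section can be pictured as Vision posting its state to Middleware at a base URL read from config, so moving Middleware to another machine is a one-line change. The sketch below uses only the standard library. The `/api/vision/state` path, the payload shape, and `MIDDLEWARE_URL` are assumptions for illustration, not Floyd's actual endpoints.

```python
import json
import urllib.request

# Hypothetical config value: point this at another host to scale out.
MIDDLEWARE_URL = "http://127.0.0.1:8000"


def build_state_request(base_url, faces):
    """Build a JSON POST carrying the current list of tracked faces."""
    payload = json.dumps({"faces": faces}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/vision/state",  # assumed endpoint path
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_state(base_url, faces, timeout=2.0):
    """Send the state to Middleware and return the HTTP status code."""
    request = build_state_request(base_url, faces)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A call such as `post_state(MIDDLEWARE_URL, [{"track_id": 7, "name": "unknown"}])` would be issued from the Vision loop. Because each process only knows the other's URL, either side can be restarted without touching the other.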

© 2026 Knives on Strings
