Floyd | Knives on Strings

Floyd knows you, sees you, and gets you. He watches through a webcam, speaks through a mechanical synthesizer, and blinks through a pair of green-phosphor CRT eyes, and he doesn't stay in that room anymore. A Telegram channel reaches him wherever you are; a permission spine makes him ask before he acts in the world, and lets him put something on your calendar once you say yes. He's inspired by B2EMO from Andor: earnest, warm, gently anxious, and loyal. His voice is flat and mechanical. His words are full of care.

What Floyd does

Sees — YOLO face detection (YOLOv8n-face / YOLOv11n-face, config-driven) with ByteTrack for persistence and dlib CNN for 128-d recognition embeddings. Auto-enrolls new faces using averaged embeddings across 5 frames; proximity-gated so a face in the distance doesn't accidentally become a permanent resident. Three-tier confidence model: confirmed, uncertain (requires multi-sample verification), and new subject.
Listens — faster-whisper (large-v3-turbo) for transcription, Silero VAD for activity detection. Knows who's talking through ECAPA-TDNN voice biometrics cross-checked against Vision's face tracking — voice ID wins when it's confident, vision backs it up otherwise.
Remembers — Three-tier memory: SQLite for relational facts keyed to per-person UUIDs, ChromaDB for semantic search over history, and a short-term sliding window of the last 10–15 turns. Facts carry a provenance tag — verified, unverified, disputed — and correcting one flags everything that depended on it.
Notices patterns — SQL-driven detectors find changes in visit frequency, mood trends, and topic surges. Says "I notice you've been coming by more often this week" — because he actually noticed.
Expresses emotion — espeak-ng parameters (pitch, speed, amplitude, gap) shift dynamically per utterance based on emotional state. Warm gets warmer. Worried gets slower. Excited gets faster. Eyebrows tilt. Pupils modulate. A machine expressing feeling through the only knobs it has.
Knows the world — System prompt is dynamically assembled with current time, location, weather, season, and relationship context for each visitor. Floyd knows what day it is and what the weather's like.
Reaches beyond the room — a Telegram bot is a second way to reach him, taking photos and voice notes alongside whatever's live in front of the webcam. A permission spine gates any real-world action behind owner approval, default-deny; the first gated action lets him create a Google Calendar event via MCP, but only once you approve it.
Reads — a client of a Calibre-Web ebook library, working through a book a chapter at a time overnight. Builds an evolving sense of taste from his own reactions, and checks what he reads against current information before repeating it as fact.
Dreams — every night, a scheduled cycle promotes memory candidates, rewrites his running summary of the world, works through the night's reading, and writes a first-person diary entry. No batch API and no cross-subject correlation yet — that's still on the drawing board — but what ships runs for real, every night.
Gets bored — Left alone, he progresses through restlessness, self-directed mumbles, and eventually sleep. When you return, he has things to tell you.
Fails gracefully — Every LLM call is routed through a model router (LiteLLM-based); after repeated failures Floyd switches over to a local llama.cpp model automatically, then probes for recovery every five minutes. He doesn't just stop.
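
A minimal sketch of how a default-deny gate like the permission spine in this list could work. This is not Floyd's actual code: the class names, the request shape, and the action string calendar.create_event are hypothetical. The only behavior taken from the description is that nothing runs until the owner approves it.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"


@dataclass
class ActionRequest:
    action: str   # hypothetical action name, e.g. "calendar.create_event"
    summary: str  # human-readable text shown to the owner
    decision: Decision = Decision.PENDING


@dataclass
class PermissionSpine:
    """Default-deny gate: an action runs only after explicit approval."""
    requests: dict = field(default_factory=dict)
    next_id: int = 1

    def propose(self, action: str, summary: str) -> int:
        """Queue an action for owner review and return its request id."""
        request_id = self.next_id
        self.next_id += 1
        self.requests[request_id] = ActionRequest(action, summary)
        return request_id

    def resolve(self, request_id: int, approved: bool) -> None:
        """Record the owner's answer (for example, a tap in Telegram)."""
        decision = Decision.APPROVED if approved else Decision.DENIED
        self.requests[request_id].decision = decision

    def execute(self, request_id: int, handler):
        """Run handler(request) only if the request was approved."""
        request = self.requests.get(request_id)
        if request is None or request.decision is not Decision.APPROVED:
            raise PermissionError(f"request {request_id} is not approved")
        return handler(request)
```

Unknown ids, pending requests, and denied requests all raise, so forgetting to ask fails closed rather than open.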

Models

Vision: YOLOv8n-face / YOLOv11n-face · ByteTrack · dlib CNN (CUDA)
STT: faster-whisper large-v3-turbo · Silero VAD · ECAPA-TDNN speaker ID
LLM: routed per task through a LiteLLM-based model router — Gemini 2.5 Flash is the default primary; a local llama.cpp model is the automatic fallback, and can be preferred outright for privacy-tagged tasks
TTS: espeak-ng with per-utterance emotional parameter modulation
Embeddings: sentence-transformers · ChromaDB
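
To make the TTS line concrete, here is a sketch of per-utterance parameter modulation. The flags -p (pitch, 0 to 99), -s (speed in words per minute), -a (amplitude, 0 to 200), and -g (word gap in 10 ms units) are real espeak-ng options. The emotion names, the offset values, and the speed clamp range are assumptions chosen for illustration, not Floyd's tuned settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceParams:
    pitch: int = 50       # espeak-ng -p, 0..99
    speed: int = 175      # espeak-ng -s, words per minute
    amplitude: int = 100  # espeak-ng -a, 0..200
    gap: int = 0          # espeak-ng -g, pause between words in 10 ms units


# Hypothetical (pitch, speed, amplitude, gap) offsets per emotional state.
EMOTION_OFFSETS = {
    "neutral": (0, 0, 0, 0),
    "warm": (5, -10, 0, 1),
    "worried": (-5, -30, -10, 3),
    "excited": (10, 30, 10, 0),
}


def clamp(value: int, low: int, high: int) -> int:
    return max(low, min(high, value))


def params_for(emotion: str, intensity: float = 1.0,
               base: VoiceParams = VoiceParams()) -> VoiceParams:
    """Shift the base voice by the emotion's offsets, scaled by intensity."""
    dp, ds, da, dg = EMOTION_OFFSETS.get(emotion, EMOTION_OFFSETS["neutral"])
    return VoiceParams(
        pitch=clamp(round(base.pitch + dp * intensity), 0, 99),
        speed=clamp(round(base.speed + ds * intensity), 80, 450),
        amplitude=clamp(round(base.amplitude + da * intensity), 0, 200),
        gap=max(0, round(base.gap + dg * intensity)),
    )


def espeak_command(text: str, params: VoiceParams) -> list:
    """Build the argument list for one utterance (pass it to subprocess.run)."""
    return [
        "espeak-ng",
        "-p", str(params.pitch),
        "-s", str(params.speed),
        "-a", str(params.amplitude),
        "-g", str(params.gap),
        text,
    ]
```

Building an argument list rather than a shell string keeps arbitrary utterance text from being interpreted by a shell.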

Recent development

July 2026 — Trust, presence, and self-repair

Floyd now tracks his own "vitals" and runs a self-check after every restart — if something's degraded, he notices, and (once authorized) can propose or perform a graceful restart himself
Soul Constitution — his core values are now a protected, versioned section of his own config; edits are tracked and revertible (list_soul_versions / revert_soul_section)
Facts now carry a provenance tag — (verified)/(unverified)/(disputed) — so a guess doesn't get treated with the same weight as something he actually confirmed
Correction propagation — tell him he was wrong about one thing and connected facts that depended on it get flagged too, instead of quietly going stale
Interruption economy — a real attention budget now governs every proactive thing he says, so he can't pile on notifications past what the budget allows
Owner-commanded forgetting — tell him to forget something and he actually purges it, with a full-disclosure readout of what he still remembers so you can check his work
Barge-in — interrupt Floyd mid-sentence and he stops talking and listens, instead of finishing the thought
Presence-triggered intentions — "next time you see Dave, ask him about the trip" actually fires the next time Dave shows up
Privacy-tier routing keeps certain classes of thought on the local model only — some things never leave the house
Background LLM work (memory consolidation, research) can now run in a parallel lane alongside a live conversation instead of blocking it
New character-eval suite and a synthetic "cognitive-cycle" simulator run regression tests against his personality itself, not just his code
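
The provenance and correction-propagation entries above can be pictured as a walk over a dependency graph. This is a sketch under stated assumptions: the FactStore class, the needs_review set, and the choice to mark the corrected fact as disputed are all hypothetical, and the real store is SQLite rather than in-memory dicts.

```python
from collections import deque


class FactStore:
    def __init__(self):
        self.provenance = {}      # fact id -> "verified" | "unverified" | "disputed"
        self.dependents = {}      # fact id -> ids of facts derived from it
        self.needs_review = set() # facts flagged because something upstream changed

    def add(self, fact_id, provenance="unverified", derived_from=()):
        """Store a fact and record which earlier facts it was derived from."""
        self.provenance[fact_id] = provenance
        self.dependents.setdefault(fact_id, set())
        for parent in derived_from:
            self.dependents.setdefault(parent, set()).add(fact_id)

    def correct(self, fact_id):
        """Dispute one fact and flag every fact downstream of it.

        Returns the flagged ids in breadth-first order.
        """
        self.provenance[fact_id] = "disputed"
        flagged = []
        seen = {fact_id}
        queue = deque([fact_id])
        while queue:
            current = queue.popleft()
            for child in sorted(self.dependents.get(current, ())):
                if child not in seen:
                    seen.add(child)
                    self.needs_review.add(child)
                    flagged.append(child)
                    queue.append(child)
        return flagged
```

Flagged facts keep their original provenance tag in this sketch; they are only queued for review, since a wrong premise does not always make the conclusion wrong.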

July 2026 — Reliability and reach

Multimodal input — Floyd can now see photos and hear voice notes sent over Telegram, not just what's live in front of the webcam
Replaced the hand-rolled Gemini/Ollama fallback logic with a real model router (LiteLLM-based) that routes each kind of task to the right model and tracks spend against a per-provider budget
Local llama.cpp fallback swapped in for Ollama — when Gemini's unavailable, Floyd keeps running on a local model automatically, with a one-shot retry before it fails over
A model-capability benchmark harness (floyd_bench) scores candidate models on tool-calling, prompt-injection resistance, and recall before they're trusted with real tasks
Nightly backup and rotation of Floyd's memory store, with a documented restore path
Reminders and standing watches — ask Floyd to remind you about something or watch for a condition, and delivery is guaranteed: by voice if you're around, by Telegram if you're not, with receipts so nothing silently drops
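
A rough sketch of the failover behavior described in this list: one retry on the primary model, a switch to the local model, and a recovery probe every five minutes. It does not use LiteLLM and is not Floyd's router. The FallbackRouter class and its fields are hypothetical, and the probe here piggybacks on the next real request instead of running on a background timer.

```python
import time


class FallbackRouter:
    def __init__(self, primary, fallback, probe_interval=300.0, clock=time.monotonic):
        self.primary = primary            # callable: prompt -> reply, may raise
        self.fallback = fallback          # callable: prompt -> reply (local model)
        self.probe_interval = probe_interval
        self.clock = clock
        self.failed_over_at = None        # None means the primary is considered healthy

    def _try_primary(self, prompt, attempts):
        """Return (True, reply) on success, (False, None) if every attempt fails."""
        for _ in range(attempts):
            try:
                return True, self.primary(prompt)
            except Exception:
                continue
        return False, None

    def complete(self, prompt):
        if self.failed_over_at is None:
            # Healthy path: first call plus a one-shot retry.
            ok, reply = self._try_primary(prompt, attempts=2)
            if ok:
                return reply
            self.failed_over_at = self.clock()
            return self.fallback(prompt)

        if self.clock() - self.failed_over_at >= self.probe_interval:
            # Recovery probe: a single attempt on the primary.
            ok, reply = self._try_primary(prompt, attempts=1)
            if ok:
                self.failed_over_at = None
                return reply
            self.failed_over_at = self.clock()  # restart the probe timer

        return self.fallback(prompt)
```

Injecting the clock keeps the five-minute interval testable without sleeping.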

June–July 2026 — The reading life

Floyd reads — he's now a client of a Calibre-Web ebook library, browsing the catalog and working through a book a chapter at a time overnight
Each night's reading produces a persona-grounded running digest, then gets distilled into world facts and, for the right books, a nudge to his own aesthetic sensibility
Taste — an evolving model of what Floyd likes and doesn't, built from his own first-person reactions to what he reads rather than mirroring the text back
Dimensions-of-people — traits Floyd learns about you now draw partly from what he's been reading, cross-referenced against what he already knows about you
Validation pass — book claims get checked against current information and flagged confirmed / outdated / contested, so he doesn't repeat something a book got wrong or that's since changed
Curiosity — Floyd forms real open questions from gaps in what he knows, quietly researches some of them, and, kept honest rather than nosy, occasionally asks you directly
Reading-recall — past book digests are semantically searchable, so something you're discussing can surface a relevant thing he read weeks ago
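
Reading-recall boils down to nearest-neighbor search over digest embeddings. The sketch below shows the idea with plain cosine similarity over toy three-dimensional vectors. In Floyd the vectors would come from sentence-transformers and the search from ChromaDB; the digest names, vectors, top_k, and min_score values here are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall(query_vec, digests, top_k=2, min_score=0.3):
    """Return up to top_k digest names whose similarity clears min_score."""
    scored = [(cosine(query_vec, vec), name) for name, vec in digests.items()]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in kept[:top_k]]


# Toy data: each digest is reduced to a hand-written 3-d vector.
DIGESTS = {
    "whaling-digest": [0.9, 0.1, 0.0],
    "woodland-digest": [0.1, 0.9, 0.2],
    "astronomy-digest": [0.0, 0.2, 0.9],
}
```

The min_score floor matters as much as the ranking: without it, every conversation would surface some book, relevant or not.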

June 2026 — Floyd leaves the room

Floyd now runs on a channel-agnostic core — a Telegram bot is a second way to reach him, alongside the room itself
A permission spine gates any real-world action behind owner approval, default-deny, with a step-up flow for higher-risk actions
First gated action shipped: Floyd can create a Google Calendar event through MCP, but only after you approve it — tap approve or deny right in Telegram
Channel identity binding ties your Telegram account and your in-room presence to the same person, via a two-step pairing ceremony
A scheduler subsystem gives Floyd durable background tasks — queued, restart-safe, delivered when they finish
Memory now decays: older, lower-importance facts fade on a runtime clock while important ones survive, and recall favors what matters, not just what's newest
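
One way to picture importance-aware decay is an exponential curve whose half-life stretches with importance: retention = 0.5 ^ (age / half_life). The formula and every constant below are assumptions for illustration, not Floyd's actual decay rule. Age is whatever the caller's clock reports, so a runtime clock works as well as wall time.

```python
BASE_HALF_LIFE_DAYS = 30.0  # hypothetical: half-life of a zero-importance fact
IMPORTANCE_STRETCH = 4.0    # hypothetical: importance 1.0 lasts 5x longer
FORGET_BELOW = 0.05         # hypothetical retention floor


def retention(age_days: float, importance: float) -> float:
    """Exponential decay whose half-life grows with importance (0.0 to 1.0)."""
    importance = min(1.0, max(0.0, importance))
    half_life = BASE_HALF_LIFE_DAYS * (1.0 + IMPORTANCE_STRETCH * importance)
    return 0.5 ** (age_days / half_life)


def rank_for_recall(facts):
    """Order (text, age_days, importance) tuples by retention, dropping faded ones."""
    scored = [(retention(age, imp), text) for text, age, imp in facts]
    kept = [pair for pair in scored if pair[0] >= FORGET_BELOW]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in kept]
```

With these constants, a zero-importance fact is at half strength after 30 days, while a maximally important one takes 150 days to reach the same point.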

May 2026 — Self-awareness and world-building

Floyd can now describe his own git history and read back his own journal and past decisions — genuine self-introspection tools, not just conversation
General-object detection (with hysteresis to avoid flicker) lets him notice and remark on changes in the room itself, not just in the people
A "world vault" — directory-based memory-wiki pages with YAML frontmatter — gives Floyd a durable, structured place to store what he's learned about people, places, and topics
Wikipedia research tools run both on demand and during idle time, so Floyd can look something up and mention what he found later
tune_self — say things like "talk slower" or "be less chatty" and it sticks as a persisted, verbal self-tuning parameter
Persona rewrite — fixed a bug where the character prompt was subtly priming an "80s computer voice" instead of Floyd's actual flat, mechanical-but-caring tone
Overnight resilience pass — dampened log spam, added camera auto-reconnect and quiet hours, so multi-day unattended runs don't degrade
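
The hysteresis mentioned for general-object detection can be sketched with two confidence thresholds: an object must clear the higher one to appear and drop below the lower one to vanish. The HysteresisGate class and the 0.6 / 0.4 thresholds are hypothetical; the source only says hysteresis is used to avoid flicker.

```python
class HysteresisGate:
    """Two-threshold presence gate for one object class."""

    def __init__(self, enter_at: float = 0.6, exit_below: float = 0.4):
        if exit_below >= enter_at:
            raise ValueError("exit_below must be lower than enter_at")
        self.enter_at = enter_at
        self.exit_below = exit_below
        self.present = False

    def update(self, confidence: float):
        """Feed one frame's confidence; return 'appeared', 'vanished', or None."""
        if not self.present and confidence >= self.enter_at:
            self.present = True
            return "appeared"
        if self.present and confidence < self.exit_below:
            self.present = False
            return "vanished"
        return None
```

A confidence that wobbles between the two thresholds produces no events at all, which is what keeps Floyd from remarking on the same mug five times in a minute.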

May 2026 — Phase 6: deep memory + voice biometrics

ChromaDB vector store wired in as Tier 2 semantic memory — relevant past conversations retrieved per subject on every interaction
Semantic blend in context retrieval — structured facts and semantic history combined before LLM call
LLM-driven contradiction detection — when committing new facts, Floyd checks for conflicts with existing memory and flags them
Identity merges propagated to ChromaDB — renaming a face updates semantic history too
ECAPA-TDNN voice biometrics — SpeakerIdentifier runs alongside vision; voice enrollment from audio segments; per-row-normalized centroids, score clamping, and encoder failure paths hardened
Passive voice enrollment — two-phase system: bootstrap enrollment from first segments, then self-verified against growing embedding pool; /api/stt/enroll_status and /api/stt/enroll_complete endpoints
resolve_attribution — voice ID takes priority over vision when a speaker is confirmed; multi-modal identity is now the default
max_face_count on /api/vision/focus_at and TemporalDiarizer — limits active tracking to the N closest faces
Proximity gate relaxed from 15% to 7% of frame area — enrolls faces at greater distance
LLM executor fixed — no more per-call thread leaks; cost metric no longer corrupted by nested calls
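
The proximity gate and max_face_count entries above can be illustrated with bounding-box arithmetic. This sketch assumes the gate compares face-box area to frame area and that "closest" means "largest box"; the function names and box format (x1, y1, x2, y2 in pixels) are hypothetical. The 7% and 15% figures come from the changelog.

```python
def area_fraction(box, frame_w: int, frame_h: int) -> float:
    """Fraction of the frame covered by a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    width = max(0, x2 - x1)
    height = max(0, y2 - y1)
    return (width * height) / (frame_w * frame_h)


def enrollable(box, frame_w: int, frame_h: int, min_fraction: float = 0.07) -> bool:
    """Proximity gate: only faces covering enough of the frame may auto-enroll."""
    return area_fraction(box, frame_w, frame_h) >= min_fraction


def closest_faces(boxes, frame_w: int, frame_h: int, max_face_count: int):
    """Keep the N largest boxes, using box size as a stand-in for closeness."""
    ranked = sorted(
        boxes,
        key=lambda box: area_fraction(box, frame_w, frame_h),
        reverse=True,
    )
    return ranked[:max_face_count]
```

On a 640x480 frame, a 160x160 face box covers about 8.3% of the frame: it passes the relaxed 7% gate but would have been rejected at the old 15% setting.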

Architecture

Four independent processes communicating via HTTP and WebSocket. Any module can be restarted independently; scale out to multiple machines by changing one URL in the config.
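
As a small illustration of the "one URL" idea: if each module builds its request targets from a single configured base URL, moving Middleware to another machine is a one-line change. The ModuleConfig class, the default address, and the /api/state path are hypothetical; Floyd's real config layout is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleConfig:
    # Hypothetical default: Middleware running on the same machine.
    middleware_url: str = "http://127.0.0.1:8000"

    def endpoint(self, path: str) -> str:
        """Join the configured base URL and an API path with exactly one slash."""
        return self.middleware_url.rstrip("/") + "/" + path.lstrip("/")


local = ModuleConfig()
remote = ModuleConfig(middleware_url="http://192.168.1.20:8000/")
```

Both configs expose the same endpoint() call, so module code stays identical whether Middleware is local or remote.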

Vision — Face detection, tracking, recognition; posts state to Middleware
Middleware — FastAPI: LLM bridge, memory/RAG, TTS, attention gating, permission spine, Telegram/MCP surfaces
STT — Mic capture, VAD, Whisper transcription, wake-word detection; posts to Middleware
Eyes — Pygame UI, 280×192 native, 4× scaled, 10 FPS hard cap

Vision and STT share cuda:0. Memory and embeddings run on CPU. VRAM is for seeing and hearing, not for indexing.