Show HN for January 27, 2026
45 items
LemonSlice – Give your voice agents a face #
Chatbots are everywhere. Voice AI has recently taken off. But we believe video avatars will be the most common form factor for conversational AI. Most people would rather watch something than read it. The problem is that generating video in real-time is hard, and overcoming the uncanny valley is even harder.
We haven’t broken the uncanny valley yet. Nobody has. But we’re getting close and our photorealistic avatars are currently best-in-class (judge for yourself: https://lemonslice.com/try/taylor). Plus, we're the only avatar model that can do animals and heavily stylized cartoons. Try it: https://lemonslice.com/try/alien. Warning! Talking to this little guy may improve your mood.
Today we're releasing our new model* - Lemon Slice 2, a 20B-parameter diffusion transformer that generates infinite-length video at 20fps on a single GPU - and opening up our API.
How did we get a video diffusion model to run in real-time? There was no single trick, just a lot of them stacked together. The first big change was making our model causal. Standard video diffusion models are bidirectional (they look at frames both before and after the current one), which means you can't stream.
From there it was about fitting everything on one GPU. We switched from full to sliding window attention, which killed our memory bottleneck. We distilled from 40 denoising steps down to just a few - quality degraded less than we feared, especially after using GAN-based distillation (though tuning that adversarial loss to avoid mode collapse was its own adventure).
And the rest was inference work: modifying RoPE from complex to real (this one was cool!), precision tuning, fusing kernels, a special rolling KV cache, lots of other caching, and more. We kept shaving off milliseconds wherever we could and eventually got to real-time.
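To make the "causal + sliding window + rolling KV cache" combination concrete, here is a minimal PyTorch sketch (not LemonSlice's code; the window size and dimensions are made-up illustrative values) of why per-step cost and memory stay constant no matter how long the video runs:

```python
import torch
import torch.nn.functional as F

WINDOW = 16      # attend only to the last 16 cached steps (illustrative)
D_HEAD = 64      # per-head dimension (illustrative)

class RollingKVCache:
    """Keeps only the keys/values that fall inside the sliding attention window."""
    def __init__(self, window: int):
        self.window = window
        self.k, self.v = [], []

    def append(self, k_t: torch.Tensor, v_t: torch.Tensor) -> None:
        self.k.append(k_t)
        self.v.append(v_t)
        # Evict anything outside the window, so memory stays bounded.
        self.k = self.k[-self.window:]
        self.v = self.v[-self.window:]

    def tensors(self):
        return torch.stack(self.k), torch.stack(self.v)   # (t, D_HEAD) each

def causal_step(q_t: torch.Tensor, cache: RollingKVCache) -> torch.Tensor:
    """One decoding step: the new query attends only to cached (past) keys/values."""
    k, v = cache.tensors()
    out = F.scaled_dot_product_attention(
        q_t.view(1, 1, 1, D_HEAD),      # (batch, heads, 1 query token, dim)
        k.view(1, 1, -1, D_HEAD),
        v.view(1, 1, -1, D_HEAD),
    )
    return out.view(D_HEAD)

cache = RollingKVCache(WINDOW)
for step in range(1000):                # an arbitrarily long "video"
    k_t, v_t, q_t = (torch.randn(D_HEAD) for _ in range(3))
    cache.append(k_t, v_t)
    _ = causal_step(q_t, cache)         # constant cost per generated step
```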
We set up a guest playground for HN so you can create and talk to characters without logging in: https://lemonslice.com/hn. For those who want to build with our API (we have a new LiveKit integration that we’re pumped about!), grab a coupon code in the HN playground for your first Pro month free ($100 value). See the docs: https://lemonslice.com/docs. Pricing is usage-based at $0.12-0.20/min for video generation.
Looking forward to your feedback! And we’d love to see any cool characters you make - please share their links in the comments.
*We did a Show HN last year for our V1 model: https://news.ycombinator.com/item?id=43785044. It was technically impressive but so bad compared to what we have today.
I Wrapped the Zorks with an LLM #
So I figured out how to wrap it with Tambo (and run the game engine in the browser). Basically, whatever you type gets "translated" into Zork-speak and passed to the game - and then the LLM takes the game's output and optionally adds flavor. (The little ">_" button at the top exposes the actual game input.)
What was a big surprise to me was multi-turn instructions - you can ask it to "Explore all the rooms in the house until you can't find any more" and it will plug away at the game for 10+ "turns" at a time... like Claude Code for Zork or something.
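For anyone curious how the wrapping works, here is a rough sketch of the loop as I understand it from the description above. The `llm` and `zork_engine` functions are hypothetical stand-ins, not Tambo's actual API:

```python
def llm(prompt: str) -> str:
    raise NotImplementedError("call the model here")

def zork_engine(command: str) -> str:
    raise NotImplementedError("feed the command to the in-browser game engine")

def play_turn(user_text: str) -> str:
    # 1. Translate free-form input into terse Zork-speak.
    command = llm("Turn this into a single Zork command "
                  f"(e.g. 'go north', 'open mailbox'): {user_text}")
    # 2. Run the real game engine on that command.
    game_output = zork_engine(command)
    # 3. Optionally add flavor on top of the game's output.
    return llm(f"Retell this game output with a bit more flavor:\n{game_output}")

def multi_turn(goal: str, max_turns: int = 15) -> list[str]:
    """The 'explore until done' behavior: keep issuing commands toward a goal."""
    history: list[str] = []
    for _ in range(max_turns):
        move = llm(f"Goal: {goal}\nGame so far: {history}\n"
                   "Reply with the next Zork command, or DONE if finished:")
        if move.strip().upper() == "DONE":
            break
        history.append(zork_engine(move))
    return history
```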
ShapedQL – A SQL engine for multi-stage ranking and RAG #
I’m Tullie, founder of Shaped. Previously, I was a researcher at Meta AI, worked on ranking for Instagram Reels, and was a contributor to PyTorch Lightning.
We built ShapedQL because we noticed that while retrieval (finding 1,000 items) has been commoditized by vector DBs, ranking (finding the best 10 items) is still an infrastructure problem.
To build a decent "For You" feed or a RAG system with long-term memory, you usually have to put together a vector DB (Pinecone/Milvus), a feature store (Redis), an inference service, and thousands of lines of Python to handle business logic and reranking.
We built an engine that consolidates this into a single SQL dialect. It compiles declarative queries into high-performance, multi-stage ranking pipelines.
HOW IT WORKS:
Instead of just a SELECT, ShapedQL operates in four stages native to recommendation systems:
RETRIEVE: Fetch candidates via Hybrid Search (Keywords + Vectors) or Collaborative Filtering.
FILTER: Apply hard constraints (e.g., "inventory > 0").
SCORE: Rank results using real-time models (e.g., p(click) or p(relevance)).
REORDER: Apply diversity logic so your Agent/User doesn’t see 10 nearly identical results.
THE SYNTAX: Here is what a RAG query looks like. This replaces about 500 lines of standard Python/LangChain code:
SELECT item_id, description, price
FROM
-- Retrieval: Hybrid search across multiple indexes
search_flights("$param.user_prompt", "$param.context"),
search_hotels("$param.user_prompt", "$param.context")
WHERE -- Filtering: Hard business constraints
price <= "$param.budget" AND is_available("$param.dates")
ORDER BY -- Scoring: Real-time reranking (Personalization + Relevance)
0.5 * preference_score(user, item) +
0.3 * relevance_score(item, "$param.user_prompt")
LIMIT 20
If you don’t like SQL, you can also use our Python and TypeScript SDKs. I’d love to know what you think of the syntax and the abstraction layer!
Fuzzy Studio – Apply live effects to videos/camera #
I've been learning computer graphics on the side for several years now and gain so much joy from smooshing and stretching images/videos. I hope you can get a little joy as well with Fuzzy Studio!
Try applying effects to your camera! My housemates and I have giggled so much making faces with weird effects!
Nothing gets sent to the server; everything is done in the browser! Amazing what we can do. I've only tested on macOS... apologies if your browser/OS is not supported (yet).
Mystral Native – Run JavaScript games natively with WebGPU (no browser) #
Why: I originally started building a new game engine in WebGPU, and I loved the iteration loop of writing TypeScript & instantly seeing the changes in the browser with hot reloading. After getting something working and shipping a demo, I realized that shipping a whole browser doesn't really work if I also want the same codebase to work on mobile. Sure, I could use a webview, but that's not always a good or consistent experience for users - there are nuances with Safari on iOS supporting WebGPU, but not the same features that Chrome does on desktop. What I really wanted was a WebGPU runtime that is consistent & works on any platform. I was inspired by deno's --unsafe-webgpu flag, but I realized that deno probably wouldn't be a good fit long term because it doesn't support iOS or Android & doesn't bundle a window/event system (they have "bring your own window", but that means writing a lot of custom code for events, dealing with windowing, not to mention more specific things like implementing a WebAudio shim, etc.). So that got me down the path of building a native runtime specifically for games, and that's Mystral Native.
So now with Mystral Native, I can have the same developer experience (write JS, use shaders in WGSL, call requestAnimationFrame) but get a real native binary I can ship to players on any platform without requiring a webview or a browser. No 200MB Chromium runtime, no CEF overhead, just the game code and a ~25MB runtime.
What it does:
- Full WebGPU via Dawn (Chrome's implementation) or wgpu-native (Rust)
- Native window & events via SDL3
- Canvas 2D support (Skia), Web Audio (SDL3), fetch (file/http/https)
- V8 for JS (same engine as Chrome/Node), also supports QuickJS and JSC
- ES modules, TypeScript via SWC
- Compile to single binary (think "pkg"): `mystral compile game.js --include assets -o my-game`
- macOS .app bundles with code signing, Linux/Windows standalone executables
- Embedding API for iOS and Android (JSC/QuickJS + wgpu-native)
It's early alpha — the core rendering path works well & I've tested on Mac, Linux (Ubuntu 24.04), and Windows 11, and some custom builds for iOS & Android to validate that they can work, but there's plenty to improve. Would love to get some feedback and see where it can go!
MIT licensed.
Build Web Automations via Demonstration #
We’ve been building browser agents for a while. In production, we kept converging on the same pattern: deterministic scripts for the happy path, agents only for edge cases. So we built Demonstrate Mode.
The idea is simple: You perform your workflow once in a remote browser. Notte records the interactions and generates deterministic automation code.
How it works:
- Record clicks, inputs, navigations in a cloud browser
- Compile them into deterministic code (no LLM at runtime) - see the sketch below
- Run and deploy on managed browser infrastructure
Closest analog is Playwright codegen but:
- Infrastructure is handled (remote browsers, proxies, auth state)
- Code runs in a deployable runtime with logs, retries, and optional agent fallback
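To make "deterministic code, no LLM at runtime" concrete, here is a generic Playwright-style sketch of the kind of script a recorded session compiles down to (illustrative only; Notte's generated code and runtime look different):

```python
from playwright.sync_api import sync_playwright

def run_recorded_workflow(username: str, password: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Each step below corresponds to one recorded interaction.
        page.goto("https://example.com/login")
        page.fill("#email", username)
        page.fill("#password", password)
        page.click("button[type=submit]")
        page.wait_for_selector(".dashboard")
        title = page.inner_text("h1")
        browser.close()
        return title
```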
Agents are great for prototyping and dynamic steps, but for production we usually want versioned code and predictable cost/behavior. Happy to dive into implementation details in the comments.
Demo: https://www.loom.com/share/f83cb83ecd5e48188dd9741724cde49a
-- Andrea & Lucas, Notte Founders
Drum machine VST made with React/C++ #
Externalized Properties, a modern Java configuration library #
Honcho – Open-source memory infrastructure, powered by custom models #
It’s Vineeth from Plastic Labs. We've been building Honcho, an open-source memory library for stateful AI agents.
Most memory systems are just vector search—store facts, retrieve facts, stuff into context. We took a different approach: memory as reasoning. (We talk about this a lot on our blog)
We built Neuromancer, a model trained specifically for AI-native memory. Instead of naive fact extraction, Neuromancer does formal logical reasoning over conversations to build representations that evolve over time. It's cheap ($2/M tokens ingestion, unlimited retrieval), token-efficient, and SOTA: LongMem (90.4%), LoCoMo (89.9%), and BEAM. On BEAM 10M—which exceeds every model's context window—we hit 0.409 vs prior SOTA of 0.266, using 0.5% of context per query.
Github: https://github.com/plastic-labs/honcho
Evals: https://evals.honcho.dev
Neuromancer Model Card: https://plasticlabs.ai/neuromancer
Memory as Reasoning Approach: https://blog.plasticlabs.ai/blog/Memory-as-Reasoning
Read more about our recent updates: https://blog.plasticlabs.ai/blog/Honcho-3
Happy to answer questions about the architecture, benchmarks, or agent memory patterns in general.
I built a CSV parser to try Go 1.26's new SIMD package #
A CSV parser using Go 1.26's experimental simd/archsimd package.
I wanted to see what the new SIMD API looks like in practice. CSV parsing is mostly "find these bytes in a buffer"—load 64 bytes, compare, get a bitmask of positions. The interesting part was handling chunk boundaries correctly (quotes and line endings can split across chunks).
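If it helps, here's the core trick illustrated in numpy rather than Go's simd/archsimd API (so none of this is the actual Go code): compare a chunk against the bytes we care about and collapse the matches into a positional bitmask. The tricky part mentioned above, tracking quote state across chunk boundaries, isn't shown.

```python
import numpy as np

def delimiter_bitmask(chunk: bytes) -> int:
    """Bit i is set if chunk[i] is a comma, quote, or newline."""
    buf = np.frombuffer(chunk, dtype=np.uint8)
    hits = (buf == ord(",")) | (buf == ord('"')) | (buf == ord("\n"))
    mask = 0
    for i in np.flatnonzero(hits):       # positions of interesting bytes
        mask |= 1 << int(i)
    return mask

chunk = (b'a,b,"c,d"\nnext,row\n').ljust(64, b" ")   # one 64-byte chunk
mask = delimiter_bitmask(chunk)
positions = [i for i in range(64) if mask >> i & 1]
print(positions)    # byte offsets of every ',', '"', and '\n' in the chunk
```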
- Drop-in replacement for encoding/csv
- ~20% faster for unquoted data on AVX-512
- Quoted data is slower (still optimizing)
- Scalar fallback for non-AVX-512
Requires GOEXPERIMENT=simd.
https://github.com/nnnkkk7/go-simdcsv
Feedback on edge cases or the SIMD implementation welcome.
A 4.8MB native iOS voice notes app built with SwiftUI #
I wanted to share a project I’ve been working on called Convoxa. It’s a native iOS transcriber/summarizer. I had two main goals: keep it efficient and keep it private.
THE TECH STACK
100% Swift & SwiftUI: No heavy cross-platform wrappers or bloated dependencies.
Binary Size: The final build is only 4.8 MB.
Transcription: Uses Apple's latest speech APIs for maximum privacy and efficiency.
THE CHALLENGE: BYPASSING THE 4K CONTEXT LIMIT
The biggest technical hurdle was working with Apple’s foundation models. The default context window is capped at 4096 tokens, which is practically useless for anything over a 10-minute meeting transcript.
I ended up building a recursive chunking method to "feed" the model long-form data without losing the global context of the conversation. I use a sliding window approach where each chunk's summary informs the next, ensuring the final output doesn't "hallucinate" at the seams where the chunks meet. It’s now stable enough for long-form audio while remaining entirely on-device for supported hardware.
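For anyone curious, here is the shape of that chunking loop sketched in Python (my reading of the description, not Convoxa's Swift code; `summarize` stands in for the on-device model call):

```python
def summarize(text: str) -> str:
    raise NotImplementedError("on-device model call goes here")

def summarize_long_transcript(transcript: str, chunk_chars: int = 6000) -> str:
    # Split the transcript into chunks that fit the 4K-token budget.
    chunks = [transcript[i:i + chunk_chars]
              for i in range(0, len(transcript), chunk_chars)]
    running_summary = ""
    for chunk in chunks:
        # Each chunk is summarized together with the running summary of
        # everything before it, so the seams keep global context.
        running_summary = summarize(
            "Context so far:\n" + running_summary +
            "\n\nNew portion of the meeting:\n" + chunk +
            "\n\nUpdate the summary, preserving earlier details."
        )
    return running_summary
```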
PRIVACY & AI MODES
On-Device: (Apple Intelligence required) - Total local processing.
Cloud: With reasoning for intelligent insights (Zero Data Retention).
I’m currently in the pre-order phase (out on Feb 3rd) and would love to get some feedback from this community on the performance and the chunking logic.
App Store: https://apps.apple.com/us/app/convoxa-ai-meeting-minutes/id6...
Analyzing Semantic Redundancy in LLM Retrieval (Google GIST Protocol) #
The Tool: https://websiteaiscore.com/gist-compliance-check
The Context (The Paper): To understand the tool, you have to understand the problem Google is solving with GIST: redundancy is expensive. When generating an AI answer, the model cannot feed 10k search results into the context window—it costs too much compute. If the top 5 results are semantically identical (consensus content), the model wastes tokens processing duplicates.
The GIST algorithm solves this via Max-Min Diversity:
Utility Score: It selects a high-value source.
The Radius: It draws a mathematical conflict radius around that content based on semantic similarity.
The Lockout: Any content inside that radius is rejected to save compute, regardless of domain authority.
How my implementation works: I wanted to see if we could programmatically detect if a piece of content falls inside this "redundancy radius." The tool uses an LLM to analyze the top ranking URLs for a specific query, calculates the vector embedding, and measures the Semantic Cosine Similarity against your input.
If the overlap is too high (simulating the GIST lockout), the tool flags the content as providing zero marginal utility to the model.
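For reference, here's a small sketch of the selection logic the tool simulates (my reconstruction from the description above, not Google's implementation): pick sources greedily by utility and lock out anything whose cosine similarity to an already-picked source falls inside the radius.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def gist_like_select(embeddings: list[np.ndarray],
                     utilities: list[float],
                     radius: float = 0.85,
                     k: int = 5) -> list[int]:
    # Consider sources in descending order of utility score.
    order = sorted(range(len(utilities)), key=lambda i: -utilities[i])
    picked: list[int] = []
    for i in order:
        # The "lockout": anything inside an existing pick's similarity radius
        # is rejected, regardless of how authoritative the source is.
        if any(cosine(embeddings[i], embeddings[j]) >= radius for j in picked):
            continue
        picked.append(i)
        if len(picked) == k:
            break
    return picked
```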
I’d love feedback on the accuracy of the similarity scoring.
Open-source Robotics – Curated projects with interactive 3D URDF viewer #
Lightbox – Flight recorder for AI agents (record, replay, verify) #
Logs were scattered, the LLM’s “I called the tool” wasn’t trustworthy, and re-running wasn’t deterministic.
This week, tons of Clawdbot incidents have driven the point home. Agents with full system access can expose API keys and chat histories. Prompt injection is now a major security concern.
When agents can touch your filesystem, execute code, and browse the web…you probably need a tamper-proof record of exactly what actions it took, especially when a malicious prompt or compromised webpage could hijack the agent mid-session.
Lightbox is a small Python library that records every tool call an agent makes (inputs, outputs, timing) into an append-only log with cryptographic hashes. You can replay runs with mocked responses, diff executions across versions, and verify the integrity of logs after the fact.
Think airplane black box, but for your hackbox.
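The core primitive, sketched below (illustrative only, not Lightbox's actual record format): an append-only log where each record's hash covers the previous record's hash, so any after-the-fact edit breaks verification.

```python
import hashlib, json, time

class HashChainLog:
    def __init__(self):
        self.records = []
        self.last_hash = "0" * 64          # genesis value

    def append(self, tool: str, inputs: dict, output: str) -> None:
        body = {
            "ts": time.time(), "tool": tool,
            "inputs": inputs, "output": output,
            "prev": self.last_hash,          # chain to the previous record
        }
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.records.append({**body, "hash": digest})
        self.last_hash = digest

    def verify(self) -> bool:
        """Recompute every hash; any tampering breaks the chain."""
        prev = "0" * 64
        for rec in self.records:
            body = {k: v for k, v in rec.items() if k != "hash"}
            recomputed = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if rec["prev"] != prev or recomputed != rec["hash"]:
                return False
            prev = rec["hash"]
        return True
```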
*What it does:*
- Records tool calls locally (no cloud, your infra)
- Tamper-evident logs (hash chain, verifiable)
- Replay failures exactly with recorded responses
- CLI to inspect, replay, diff, and verify sessions
- Framework-agnostic (works with LangChain, Claude, OpenAI, etc.)
*What it doesn’t do:*
- Doesn’t replay the LLM itself (just tool calls)
- Not a dashboard or analytics platform
- Not trying to replace LangSmith/Langfuse (different problem)
*Use cases I care about:*
- Security forensics: agent behaved strangely, was it prompt injection? Check the trace.
- Compliance: “prove what your agent did last Tuesday”
- Debugging: reproduce a failure without re-running expensive API calls
- Regression testing: diff tool call patterns across agent versions
As agents get more capable and more autonomous (Clawdbot/Molt, Claude computer use, Manus, Devin), I think we’ll need black boxes the same way aviation does.
This is my attempt at that primitive.
It’s early (v0.1), intentionally minimal, MIT licensed.
Site: <https://uselightbox.app> install: `pip install lightbox-rec`
GitHub: <https://github.com/mainnebula/Lightbox-Project>
Would love feedback, especially from anyone thinking about agent security or running autonomous agents in production.
Burn Text – Add animated captions to videos, runs locally in browser #
Tech:
1. Whisper WASM (tiny and small models)
2. Canvas Text
3. MediaBunny to stitch everything together.
The privacy angle was important to me. This processes everything locally and exports directly.
Free, no account, no watermark. Feedback welcome.
I Stopped Hoping My LLM Would Cooperate #
Then I fixed the constraints. Eight days, zero failures, zero intervention.
The secret wasn't better prompts... it was treating the LLM as a constrained function: schema-validated tool calls that reject malformed output and force retries, two-pass architecture separating editorial judgment from formatting, and boring DevOps (retry logic, rate limiting, structured logging).
The Claude invocation is ~30 lines in a 2000-line system. Most of the work is everything around it.
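The pattern, roughly (a sketch with my own names, not the blog's actual pipeline): every model response has to parse and pass a JSON Schema check before it's accepted, and anything malformed triggers a retry with the validation error fed back in.

```python
import json
from jsonschema import validate, ValidationError

# Example schema for one digest item (illustrative).
STORY_SCHEMA = {
    "type": "object",
    "required": ["title", "url", "summary"],
    "properties": {
        "title": {"type": "string", "minLength": 1},
        "url": {"type": "string", "pattern": "^https?://"},
        "summary": {"type": "string", "maxLength": 600},
    },
    "additionalProperties": False,
}

def call_llm(prompt: str) -> str:
    raise NotImplementedError("model call goes here")

def constrained_call(prompt: str, max_retries: int = 3) -> dict:
    last_error = ""
    for _ in range(max_retries):
        raw = call_llm(prompt + ("\nPrevious error: " + last_error if last_error else ""))
        try:
            data = json.loads(raw)
            validate(instance=data, schema=STORY_SCHEMA)
            return data                      # well-formed: accept
        except (json.JSONDecodeError, ValidationError) as e:
            last_error = str(e)              # malformed: reject and retry
    raise RuntimeError(f"LLM never produced valid output: {last_error}")
```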
https://seanfloyd.dev/blog/llm-reliability https://github.com/SeanLF/claude-rss-news-digest
13-Virtues – A tracker for Benjamin Franklin's 13-week character system #
I’m Hélène. My husband and I are builders from Belgium, and we’ve spent the last few months building a side project called 13 Virtues.
The idea comes from Benjamin Franklin’s personal character system. Instead of tracking many habits at once, he focused on one virtue per week (Temperance, Silence, Order, etc.), cycling through 13 virtues over 13 weeks, then repeating the cycle. He documented this practice for many years.
We wanted an app that follows this structure strictly, rather than another flexible habit tracker. One virtue at a time. One day at a time. Repeat.
You can try the ledger directly on the homepage without creating an account. You can mark faults for today and see the current virtue with Franklin’s original quotes.
Why we built it:
We were frustrated with productivity apps that optimize for streaks and metrics rather than reflection. Franklin’s system felt refreshingly constrained and intentional, and we wanted something we’d actually use ourselves. My husband handled the engineering; I focused on product and design.
Pricing model:
We deliberately avoided subscriptions.
- Free tier: the full 13-week cycle and daily ledger
- Lifetime upgrade ($79 launch price): long-term history beyond the current cycle, guided reflections, data export, and a downloadable Modern Virtue Guide (PDF) that explains the method and its rationale in more depth.
Franklin’s system predates SaaS by a few centuries, and a monthly fee felt wrong for something meant to be practiced quietly over years.
Tech:
- Backend: Laravel
- Frontend: Vue (Inertia)
- CSS: Tailwind
- Hosting: Hostinger
Built over ~12 weekends.
We’ll be around today (CET) to answer questions — happy to discuss the implementation, the pricing decision, or Franklin’s original writings. Thoughtful UI/UX feedback is especially welcome.
A Local OS for LLMs. MIT License. Zero Hallucinations. (Not Crank) #
The core thesis is simple: Don't rent your cognition.
Most RAG (Retrieval Augmented Generation) implementations are just "grep for embeddings." They are messy, imprecise, and prone to hallucination. I wanted to solve the "Context integrity" problem at the architectural layer.
The Tech Stack (How it works):
QDMA (Quantum Dream Memory Architecture): instead of a flat vector DB, it uses a hierarchical projection engine. It separates "Hot" (Recall) from "Cold" (Storage) memory, allowing for effectively infinite context window management via compression.
CSNP (Context Switching Neural Protocol) - The Hallucination Killer: This is the most important part. Every memory fragment is hashed into a Merkle Chain. When the LLM retrieves context, the system cryptographically verifies the retrieval against the immutable ledger.
If the hash doesn't match the chain: The retrieval is rejected.
Result: The AI literally cannot "make things up" about your past because it is mathematically constrained to the ledger.
Local Inference: Built on top of llama.cpp server. It runs Llama-3 (or any GGUF) locally. No API keys. No data leaving your machine.
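If I'm reading the CSNP description right, the essential check looks something like this sketch (illustrative only, not the repo's code): retrievals are only accepted if their hash already exists in the ledger built at storage time.

```python
import hashlib

class Ledger:
    def __init__(self):
        self.hashes: set[str] = set()

    def store(self, fragment: str) -> None:
        # Hash every memory fragment into the ledger at write time.
        self.hashes.add(hashlib.sha256(fragment.encode()).hexdigest())

    def verified_retrieve(self, candidate: str) -> str | None:
        # Reject any retrieved text whose hash was never stored.
        h = hashlib.sha256(candidate.encode()).hexdigest()
        return candidate if h in self.hashes else None
```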
Features:
Zero-Dependency: Runs on Windows/Linux with just Python and a GPU (or CPU).
Visual Interface: Includes a Streamlit-based "Cognitive Interface" to visualize memory states.
Open Source: MIT License.
This is an attempt to give "Agency" back to the user. I believe that if we want AGI, it needs to be owned by us, not rented via an API.
Repository: https://github.com/merchantmoh-debug/Remember-Me-AI
I’d love to hear your feedback on the Merkle-verification approach. Does constraining the context window effectively solve the "trust" issue for you?
It's fully working - Fully tested. If you tried to Git Clone before without luck - As this is not my first Show HN on this - Feel free to try again.
To everyone who HATES AI slop, greedy corporations, and having their private data stuck on cloud servers:
You're welcome.
Cheers, Mohamad
Author's note: Updated successfully.
Framework 50 is active.
For anyone passing by - yes this is a big deal. Eliminating AI hallucination is a 60 billion dollar market problem and I'm giving THAT + sovereign control of your DATA plus the capability to do high-end research via framework 50 (including advanced scientific research) for FREE - under an MIT license. If you don't take advantage of this - you are an idiot.
If you do - welcome to the future.
P.S: What do I get from lying? I got 36 stars on the repo - many from high-end senior engineers at fortune 500 companies. If you're too stupid to tell the real deal from a lie then keep it moving son.
Paper → Code → Jupyter Notebook (generate and run code while reading) #
It’s meant for quick reproducibility: implement an equation/method, run a sanity check, and keep everything in a shareable notebook.
Demo: https://www.youtube.com/watch?v=FOnyym-jUPg Happy to answer questions. What would make this useful in your workflow?
Walk and drive through OpenStreetMap in 3D #
It’s barebones right now (currently loads one scene), renderings need work, and it’s buggy in places. But you can drive on most roads around the world.
Contributions welcome!
Get recommendations or convert agent skills directly in your workspace #
Cosmic AI Workflows – Chain AI agents to automate multi-step projects #
So we built AI Workflows — chain multiple agents together and let them run autonomously, with each step receiving outputs from previous steps.
Three agent types you can chain:
- Code Agents: Build features in GitHub with commits and pull requests.
- Content Agents: Generate CMS content with context injection from previous steps.
- Computer Use Agents: Automate browser workflows and record demos.
How it works:
1. Define steps with agent type, prompt, and configuration
2. Steps run sequentially or in parallel (configurable)
3. Context passes automatically between steps
4. Trigger manually, on a schedule (cron), or via CMS and API events (object.created, object.edited, etc.)
5. Add approval gates for human review before critical steps
Example: Autopilot feature development:
Step 1: Content Agent writes a feature spec based on user feedback
Step 2: Code Agent builds the feature, creates PR, and deploys to production
Step 3: Content Agent generates documentation and a changelog entry
Step 4: Computer Use Agent posts update to team Slack with the PR link and preview URL
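To make the step/context-passing model concrete, here is a purely hypothetical sketch of how the example above could be declared (this is not Cosmic's actual API; every field name here is made up):

```python
# Hypothetical workflow definition: each step names an agent type, a prompt,
# its dependencies, and whether a human approval gate sits in front of it.
feature_autopilot = {
    "trigger": {"type": "cms_event", "event": "object.created"},
    "steps": [
        {"id": "spec", "agent": "content",
         "prompt": "Write a feature spec from the attached user feedback."},
        {"id": "build", "agent": "code", "depends_on": ["spec"],
         "prompt": "Implement the spec in {{spec.output}} and open a PR.",
         "approval_gate": True},             # human review before deploy
        {"id": "docs", "agent": "content", "depends_on": ["build"],
         "prompt": "Write docs and a changelog entry for {{build.pr_url}}."},
        {"id": "announce", "agent": "computer_use", "depends_on": ["docs"],
         "prompt": "Post {{build.pr_url}} and the preview URL to team Slack."},
    ],
}
```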
Currently in beta. Would love feedback on the workflow model and what use cases you'd want to automate.
P.ai.os – A local, modular AI "operating" system for macOS (M4/MLX) #
I’m sharing a personal project I’ve been working on: a modular, "headless" operating system for my digital life, hosted locally on a Mac Mini M4.
The source code is available here:
https://github.com/vag-mac-mini/PAIOS_Public
I travel frequently and wanted a "sovereign" system that doesn't rely on SaaS subscriptions or cloud data. The stack is:
- Hardware: Headless Mac Mini M4 (16GB).
- AI Core: Qwen 3 VL 8B (Abliterated) running via MLX.
- Network: Tailscale (for remote access via PWA).
- Agentic Layer: Custom Python scripts for tool use (API calls, system control).
The constraint for this project was unique: I did not write the code manually. I used Google Gemini to architect and generate the Python files with the Antigravity IDE, with me acting as the "product manager". The code structure is admittedly messy in places, but it is fully functional.
The OS currently runs 10 active modules, categorized by function:
-- Intelligence & Memory --
* Digital Diary: Uses Vision AI to analyze screen activity/productivity and logs daily summaries to Apple Notes.
* Voice & Memory: Indexes voice transcriptions into a searchable "Second Brain."
* Ghostwriter: Remixes rough voice notes into structured essays or book chapters.
-- Global Logistics --
* Travel Command: Aggregates flight/visa data and summarizes local security risks via the Tavily API. Gives recommendations for packing based on country/city/period/weather.
* Aero Intel: Audits flight paths and generates deep links for travel logistics.
* Chronos Calendar: A master schedule that integrates financial timelines with travel itineraries.
-- Network & Security --
* Network Sentry: Monitors the local ARP table for unknown devices.
* Secure Dead Drop: An encrypted P2P file tunnel between my devices.
* CRM & Network: A relationship database that parses raw notes into structured contacts.
Latest update (not yet on GitHub): I created a Python script, run via cron, that uses an API to check bank holidays across the globe (excluding the major ones), checks my contacts, who could be from all over the world, sees who is from a country with a bank holiday that day, and sends me an iMessage in the morning so I can wish them a happy whatever-the-holiday-is.
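Roughly, that job looks like this sketch (the holiday API, contacts source, and iMessage hook below are hypothetical stand-ins, not the actual script):

```python
import datetime

MAJOR_HOLIDAYS = {"New Year's Day", "Christmas Day"}   # skipped on purpose

def get_holidays_today(date: datetime.date) -> dict[str, str]:
    """Return {country_code: holiday_name} for today's bank holidays."""
    raise NotImplementedError("call a holiday API of your choice")

def load_contacts() -> list[dict]:
    raise NotImplementedError("read contacts from the CRM module")

def send_imessage(text: str) -> None:
    raise NotImplementedError("send via Messages on the Mac Mini")

def run() -> None:
    today = datetime.date.today()
    holidays = {c: name for c, name in get_holidays_today(today).items()
                if name not in MAJOR_HOLIDAYS}
    for contact in load_contacts():
        holiday = holidays.get(contact.get("country", ""))
        if holiday:
            send_imessage(f"Reminder: wish {contact['name']} a happy {holiday}.")
```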
This is not a SaaS or a commercial product. I am open to feedback on the architecture or suggestions for other "sovereign" modules I could add.
Thanks.
First autonomous ML and AI engineering Agent #
Where things still break down is when ML workflows become long-running and feedback-heavy. Training jobs, evaluations, retries, metric comparisons, and partial failures are still treated as ephemeral side effects rather than durable state. Once a workflow spans hours, multiple experiments, or iterative evaluation, you either babysit the agent or restart large parts of the process. Feedback exists, but it is not something the system can reliably resume from.
NEO tries to model ML work the way it actually happens. It is an AI agent that executes end-to-end ML workflows, not just code generation. Work is broken into explicit execution steps with state, checkpoints, and intermediate results. Feedback from metrics, evaluations, or failures feeds directly into the next step instead of forcing a full restart. You can pause a run, inspect what happened, tweak assumptions, and resume from where it left off.
Here's an example as well for your reference: You might ask NEO to explore a dataset, train a few baseline models, compare their performance, and generate plots and a short report. NEO will load the data, run EDA, train models, evaluate them, notice if something underperforms or fails, adjust, and continue. If training takes an hour and one model crashes at 45 minutes, you do not start over. Neo inspects the failure, fixes it, and resumes.
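A minimal sketch of the resume-from-checkpoint pattern described above (illustrative only, not NEO's implementation): every step's result is persisted as it completes, so a failure at minute 45 resumes at the failed step instead of restarting the whole run.

```python
import json, pathlib

CHECKPOINT = pathlib.Path("run_state.json")

def load_state() -> dict:
    return json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {}

def save_state(state: dict) -> None:
    CHECKPOINT.write_text(json.dumps(state))

def run_workflow(steps: list) -> dict:
    """steps: list of (name, fn) pairs; each fn takes the accumulated state."""
    state = load_state()
    for name, fn in steps:
        if name in state:                 # already done in a previous run
            continue
        state[name] = fn(state)           # e.g. EDA, train, evaluate, report
        save_state(state)                 # checkpoint after every step
    return state
```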
Docs for the extension: https://docs.heyneo.so/#/vscode
Happy to answer questions about Neo.
Lumina – Open-source observability for AI systems (OpenTelemetry-native) #
The Problem:
I've been building LLM apps for the past year, and I kept running into the same issues:
- LLM responses would randomly change after prompt tweaks, breaking things.
- Costs would spike unexpectedly (turns out a bug was hitting GPT-4 instead of 3.5).
- No easy way to compare "before vs after" when testing prompt changes.
- Existing tools were either too expensive or missing features in free tiers.
What I Built:
Lumina is OpenTelemetry-native, meaning:
- Works with your existing OTEL stack (Datadog, Grafana, etc.).
- No vendor lock-in, standard trace format.
- Integrates in 3 lines of code.
Key features:
- Cost & quality monitoring – Automatic alerts when costs spike, or responses degrade.
- Replay testing – Capture production traces, replay them after changes, see diffs.
- Semantic comparison – Not just string matching – uses Claude to judge if responses are "better" or "worse."
- Self-hosted tier – 50k traces/day, 7-day retention, ALL features included (alerts, replay, semantic scoring)
How it works:
```bash
# Start Lumina
git clone https://github.com/use-lumina/Lumina
cd Lumina/infra/docker
docker-compose up -d
```
```typescript
// Add to your app (no API key needed for self-hosted!)
import { Lumina } from '@uselumina/sdk';

const lumina = new Lumina({
  endpoint: 'http://localhost:8080/v1/traces',
});

// Wrap your LLM call
const response = await lumina.traceLLM(
  async () => await openai.chat.completions.create({...}),
  { provider: 'openai', model: 'gpt-4', prompt: '...' }
);
```
That's it. Every LLM call is now tracked with cost, latency, tokens, and quality scores.
What makes it different:
1. Free self-hosted with limits that work – 50k traces/day and 7-day retention (resets daily at midnight UTC). All features included: alerts, replay testing, and semantic scoring. Perfect for most development and small production workloads. Need more? Upgrade to managed cloud.
2. OpenTelemetry-native – Not another proprietary format. Use standard OTEL exporters, works with existing infra. Can send traces to both Lumina AND Datadog simultaneously.
3. Replay testing – The killer feature. Capture 100 production traces, change your prompt, replay them all, and get a semantic diff report. Like snapshot testing for LLMs.
4. Fast – Built with Bun, Postgres, Redis, NATS. Sub-500ms from trace to alert. Handles 10k+ traces/min on a single machine.
What I'm looking for:
- Feedback on the approach (is OTEL the right foundation?)
- Bug reports (tested on Mac/Linux/WSL2, but I'm sure there are issues)
- Ideas for what features matter most (alerts? replay? cost tracking?)
- Help with the semantic scorer (currently uses Claude, want to make it pluggable)
Why open source:
I want this to be the standard for LLM observability. That only works if it's:
- Free to use and modify (Apache 2.0)
- Easy to self-host (Docker Compose, no cloud dependencies)
- Open to contributions (good first issues tagged)
The business model is managed hosting for teams that don't want to run infrastructure. But the core product is and always will be free.
Try it:
- GitHub: https://github.com/use-lumina/Lumina
- Docs: https://docs.uselumina.io
- Quick start: 5 minutes from `git clone` to dashboard
I'd love to hear what you think! Especially interested in:
- What observability problems are you hitting with LLMs
- Missing features that would make this useful for you
- Any similar tools you're using (and what they do better)
Thanks for reading!
CUGA – Configurable Generalist Agent (HuggingFace Live Demo) #
Jiss – A community-powered LLM API I built for open models #
An Internationalization GitHub Action to Replace Crowdin with LLMs #
The problem: TMS platforms charge per-word and per-seat, and their machine translation lacks product context. A typical SaaS with 500 strings across 9 languages costs $200-500/month.
How it works:
- Extracts strings from your codebase (XLIFF, JSON, PO, YAML)
- Diffs against previous translations (only translates what changed)
- Sends to any LLM (Claude, GPT-4, Gemini, Ollama) with your product context, glossary, and style guide
- Commits translations back to your branch
Key technical bits:
- Structured generation for ICU message format (CLDR plural rules handled correctly - Russian 4-form, Arabic 6-form, etc.)
- Hash-based caching to avoid re-translating unchanged strings (sketched below)
- Provider-agnostic interface - swap LLMs without config changes
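The hash-based caching bit, sketched (illustrative, not the action's actual code): each source string is keyed by a content hash per target language, so only new or changed strings ever reach the LLM.

```python
import hashlib, json, pathlib

CACHE_FILE = pathlib.Path(".i18n-cache.json")

def translate_batch(strings: dict[str, str], lang: str) -> dict[str, str]:
    raise NotImplementedError("LLM call with product context/glossary goes here")

def translate_with_cache(source: dict[str, str], lang: str) -> dict[str, str]:
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    out, pending = {}, {}
    for key, text in source.items():
        h = hashlib.sha256(f"{lang}:{text}".encode()).hexdigest()
        if h in cache:
            out[key] = cache[h]           # unchanged string: reuse translation
        else:
            pending[key] = text           # new or changed string: translate
    if pending:
        for key, translated in translate_batch(pending, lang).items():
            h = hashlib.sha256(f"{lang}:{pending[key]}".encode()).hexdigest()
            cache[h] = translated
            out[key] = translated
    CACHE_FILE.write_text(json.dumps(cache))
    return out
```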
GitHub: https://github.com/webdecoy/ai-i18n
Happy to answer questions about the ICU handling, prompt design, or anything else.
LinkLens – Document and link tracking in one dashboard #
It tracks:
- Documents: who viewed, which pages, how long
- Links: clicks, location, device, UTM params
Tech stack: Next.js, Supabase, Cloudflare R2
Free tier available. Would love feedback on the product and landing page.
Claude Threads – Collaborate on Claude Code via Slack (Or Mattermost) #
It runs on your machine.
I built most of it using itself. Teammates watching live caught stuff I missed.
https://claude-threads.run https://github.com/anneschuth/claude-threads
Engroles.com – Verified, Active SWE Listings from Recruiters #
And so I thought to myself: how can I help other job-seeking software engineers get access to an aggregated list of real recruiter outreach? I came up with engroles. The idea is that, as a job seeker, you forward one verified recruiter outreach that you've received in your email in the last 30 days to [email protected]. Once engroles verifies that the recruiter outreach represents a real job listing, it gives you access to apply to any of the listings on the entire job board. Every listing represents a job from another user who has received verified recruiter outreach and sent it to engroles. Each listing is also linked to a recruiter. When job seekers apply to a listing, the recruiter is notified and can reach out to the job seeker through their profile.
This is an initial release. You can see a truncated view of the currently existing job listings without creating an account. Please give me any feedback that you have. Thanks for checking this out!
Goldenthread – Compile Go to TypeScript Zod for type-safe validation #
I kept shipping features where the frontend accepted data the backend rejected. Validation rules lived in two places (Go struct tags and TypeScript Zod schemas), and they'd drift.
I built goldenthread to solve this. It's a build-time compiler that generates TypeScript Zod schemas from Go struct tags. Write validation once, use it everywhere.
Example:
// Go backend
type User struct {
Username string `json:"username" gt:"required,len:3..20"`
Email string `json:"email" gt:"email"`
Age int `json:"age" gt:"min:13,max:130"`
}
Run `goldenthread generate ./models` and you get:
// TypeScript (auto-generated)
export const UserSchema = z.object({
username: z.string().min(3).max(20),
email: z.string().email(),
age: z.number().int().min(13).max(130)
})
export type User = z.infer<typeof UserSchema>
The compiler uses go/packages and go/types for accurate type resolution. It handles nested objects, enums, maps, embedded structs, and 37+ validation rules.
The killer feature: drift detection. Run `goldenthread check` in CI - it computes SHA-256 hashes of your schemas and fails the build if the generated TypeScript doesn't match the Go source. No more "frontend works locally but breaks in production because someone changed the backend struct."
Before releasing v0.1.0, I ran continuous fuzzing (10 targets, hourly in GitHub Actions). It found two bugs my test suite missed:
1. UTF-8 corruption: Japanese field names with empty JSON tags triggered byte-slicing bugs in camelCase conversion. Took 444,553 executions to discover.
2. Regex escaping: Patterns with newline characters produced broken JavaScript output. Found in 180 executions.
Both required specific intersections of conditions that manual testing wouldn't cover. I wrote up the full fuzzing setup here: https://blackwell-systems.github.io/blog/posts/continuous-fu...
The tool is production-ready:
- Zero runtime dependencies (generated code only imports Zod)
- Generates readable TypeScript (looks hand-written)
- Complete Go type support (primitives, arrays, maps, nested objects)
- Works with any Go project structure
- MIT OR Apache 2.0 dual licensed
I built this because I was tired of frontend/backend validation bugs in my day job (hospitality platform with booking/payment APIs). We use it internally now.
Real-world example in the repo: 9-schema e-commerce system with Customer records (E.164 phone validation), Product SKUs (regex patterns), and Order workflows (nested objects + array constraints).
GitHub: https://github.com/blackwell-systems/goldenthread Docs: https://github.com/blackwell-systems/goldenthread/blob/main/... Tag syntax: https://github.com/blackwell-systems/goldenthread/blob/main/...
Would love feedback
Thanks for looking!