Show HN for January 27, 2026

45 items

322

One Human + One Agent = One Browser From Scratch in 20K LOC #

emsh.cat

153 comments1:13 PMView on HN

133

LemonSlice – Give your voice agents a face #

131 comments5:55 PMView on HN

Hey HN, we're the co-founders of LemonSlice (https://lemonslice.com). We train interactive avatar video models. Our API lets you upload a photo and immediately jump into a FaceTime-style call with that character. Here's a demo: https://www.loom.com/share/941577113141418e80d2834c83a5a0a9

Chatbots are everywhere. Voice AI has recently taken off. But we believe video avatars will be the most common form factor for conversational AI. Most people would rather watch something than read it. The problem is that generating video in real-time is hard, and overcoming the uncanny valley is even harder.

We haven’t broken the uncanny valley yet. Nobody has. But we’re getting close and our photorealistic avatars are currently best-in-class (judge for yourself: https://lemonslice.com/try/taylor). Plus, we're the only avatar model that can do animals and heavily stylized cartoons. Try it: https://lemonslice.com/try/alien. Warning! Talking to this little guy may improve your mood.

Today we're releasing our new model* - Lemon Slice 2, a 20B-parameter diffusion transformer that generates infinite-length video at 20fps on a single GPU - and opening up our API.

How did we get a video diffusion model to run in real-time? There was no single trick, just a lot of them stacked together. The first big change was making our model causal. Standard video diffusion models are bidirectional (they look at frames both before and after the current one), which means you can't stream.

From there it was about fitting everything on one GPU. We switched from full to sliding window attention, which killed our memory bottleneck. We distilled from 40 denoising steps down to just a few - quality degraded less than we feared, especially after using GAN-based distillation (though tuning that adversarial loss to avoid mode collapse was its own adventure).

And the rest was inference work: modifying RoPE from complex to real (this one was cool!), precision tuning, fusing kernels, a special rolling KV cache, lots of other caching, and more. We kept shaving off milliseconds wherever we could and eventually got to real-time.

We set up a guest playground for HN so you can create and talk to characters without logging in: https://lemonslice.com/hn. For those who want to build with our API (we have a new LiveKit integration that we’re pumped about!), grab a coupon code in the HN playground for your first Pro month free ($100 value). See the docs: https://lemonslice.com/docs. Pricing is usage-based at $0.12-0.20/min for video generation.

Looking forward to your feedback! And we’d love to see any cool characters you make - please share their links in the comments

*We did a Show HN last year for our V1 model: https://news.ycombinator.com/item?id=43785044. It was technically impressive but so bad compared to what we have today.

110

I Wrapped the Zorks with an LLM #

infocom.tambo.co

58 comments8:59 PMView on HN

I grew up on the Infocom games and when microsoft actually open-sourced Zork 1/2/3 I really wanted to figure out how to use LLMs to let you type whatever you want, I always found the amount language that the games "understood" to be so limiting - even if it was pretty state of the art at the time.

So I figured out how to wrap it with Tambo.. (and run the game engine in the browser) basically whatever you type gets "translated" into zork-speak and passed to the game - and then the LLM takes the game's output and optionally adds flavor. (the little ">_" button at the top exposes the actual game input)

What was a big surprise to me is multi-turn instructions - you can ask it to "Explore all the rooms in the house until you can't find any more" and it will plug away at the game for 10+ "turns" at a time... like Claude Code for Zork or something

106

We Built the 1. EU-Sovereignty Audit for Websites #

lightwaves.io

91 comments2:00 PMView on HN

ShapedQL – A SQL engine for multi-stage ranking and RAG #

playground.shaped.ai

23 comments1:54 PMView on HN

Hi HN,

I’m Tullie, founder of Shaped. Previously, I was a researcher at Meta AI, worked on ranking for Instagram Reels, and was a contributor to PyTorch Lightning.

We built ShapedQL because we noticed that while retrieval (finding 1,000 items) has been commoditized by vector DBs, ranking (finding the best 10 items) is still an infrastructure problem.

To build a decent for you feed or a RAG system with long-term memory, you usually have to put together a vector DB (Pinecone/Milvus), a feature store (Redis), an inference service, and thousands of lines of Python to handle business logic and reranking.

We built an engine that consolidates this into a single SQL dialect. It compiles declarative queries into high-performance, multi-stage ranking pipelines.

HOW IT WORKS:

Instead of just SELECT , ShapedQL operates in four stages native to recommendation systems:

RETRIEVE: Fetch candidates via Hybrid Search (Keywords + Vectors) or Collaborative Filtering. FILTER: Apply hard constraints (e.g., "inventory > 0"). SCORE: Rank results using real-time models (e.g., p(click) or p(relevance)). REORDER: Apply diversity logic so your Agent/User doesn’t see 10 nearly identical results.

THE SYNTAX: Here is what a RAG query looks like. This replaces about 500 lines of standard Python/LangChain code:

SELECT item_id, description, price

FROM

  -- Retrieval: Hybrid search across multiple indexes

  search_flights("$param.user_prompt", "$param.context"),

  search_hotels("$param.user_prompt", "$param.context")

WHERE
-- Filtering: Hard business constraints price <= "$param.budget" AND is_available("$param.dates")
ORDER BY
-- Scoring: Real-time reranking (Personalization + Relevance) 0.5 * preference_score(user, item) + 0.3 * relevance_score(item, "$param.user_prompt")
LIMIT 20

If you don’t like SQL, you can also use our Python and Typescript SDKs. I’d love to know what you think of the syntax and the abstraction layer!

Fuzzy Studio – Apply live effects to videos/camera #

fuzzy.ulyssepence.com

20 comments3:16 PMView on HN

Back story:

I've been learning computer graphics on the side for several years now and gain so much joy from smooshing and stretching images/videos. I hope you can get a little joy as well with Fuzzy Studio!

Try applying effects to your camera! My housemates and I have giggled so much making faces with weird effects!

Nothing gets sent to the server; everything is done in the browser! Amazing what we can do. I've only tested on macOS... apologies if your browser/OS is not supported (yet).

Mystral Native – Run JavaScript games natively with WebGPU (no browser) #

github.com

18 comments6:33 PMView on HN

Hi HN, I've been building Mystral Native — a lightweight native runtime that lets you write games in JavaScript/TypeScript using standard Web APIs (WebGPU, Canvas 2D, Web Audio, fetch) and run them as standalone desktop apps. Think "Electron for games" but without Chromium. Or a JS runtime like Node, Deno, or Bun but optimized for WebGPU (and bundling a window / event system using SDL3).

Why: I originally started by starting a new game engine in WebGPU, and I loved the iteration loop of writing Typescript & instantly seeing the changes in the browser with hot reloading. After getting something working and shipping a demo, I realized that shipping a whole browser doesn't really work if I also want the same codebase to work on mobile. Sure, I could use a webview, but that's not always a good or consistent experience for users - there are nuances with Safari on iOS supporting WebGPU, but not the same features that Chrome does on desktop. What I really wanted was a WebGPU runtime that is consistent & works on any platform. I was inspired by deno's --unsafe-webgpu flag, but I realized that deno probably wouldn't be a good fit long term because it doesn't support iOS or Android & doesn't bundle a window / event system (they have "bring your own window", but that means writing a lot of custom code for events, dealing with windowing, not to mention more specific things like implementing a WebAudio shim, etc.). So that got me down the path of building a native runtime specifically for games & that's Mystral Native.

So now with Mystral Native, I can have the same developer experience (write JS, use shaders in WGSL, call requestAnimationFrame) but get a real native binary I can ship to players on any platform without requiring a webview or a browser. No 200MB Chromium runtime, no CEF overhead, just the game code and a ~25MB runtime.

What it does: - Full WebGPU via Dawn (Chrome's implementation) or wgpu-native (Rust) - Native window & events via SDL3 - Canvas 2D support (Skia), Web Audio (SDL3), fetch (file/http/https) - V8 for JS (same engine as Chrome/Node), also supports QuickJS and JSC - ES modules, TypeScript via SWC - Compile to single binary (think "pkg"): `mystral compile game.js --include assets -o my-game` - macOS .app bundles with code signing, Linux/Windows standalone executables - Embedding API for iOS and Android (JSC/QuickJS + wgpu-native)

It's early alpha — the core rendering path works well & I've tested on Mac, Linux (Ubuntu 24.04), and Windows 11, and some custom builds for iOS & Android to validate that they can work, but there's plenty to improve. Would love to get some feedback and see where it can go!

MIT licensed.

Repo: https://github.com/mystralengine/mystralnative

Docs: https://mystralengine.github.io/mystralnative/

Build Web Automations via Demonstration #

notte.cc

20 comments1:48 PMView on HN

Hey HN,

We’ve been building browser agents for a while. In production, we kept converging on the same pattern: deterministic scripts for the happy path, agents only for edge cases. So we built Demonstrate Mode.

The idea is simple: You perform your workflow once in a remote browser. Notte records the interactions and generates deterministic automation code.

How it works: - Record clicks, inputs, navigations in a cloud browser - Compile them into deterministic code (no LLM at runtime) - Run and deploy on managed browser infrastructure

Closest analog is Playwright codegen but: - Infrastructure is handled (remote browsers, proxies, auth state) - Code runs in a deployable runtime with logs, retries, and optional agent fallback

Agents are great for prototyping and dynamic steps, but for production we usually want versioned code and predictable cost/behavior. Happy to dive into implementation details in the comments.

Demo: https://www.loom.com/share/f83cb83ecd5e48188dd9741724cde49a

-- Andrea & Lucas, Notte Founders

Drum machine VST made with React/C++ #

okaysynthesizer.com

6 comments5:03 AMView on HN

Hi HN! We just launched our drum machine vst this month! We will be updating it with many new synthesis models and unique features. Check it out, join our discord and show us what you made!

Externalized Properties, a modern Java configuration library #

github.com

7 comments7:37 AMView on HN

Externalized Properties is powerful configuration library which supports resolution of properties from external sources such as files, databases, git repositories, and any custom sources

Decrypting the Zodiac Z32 triangulates a 100ft triangular crop mark #

zenodo.org

1 comments7:12 PMView on HN

Actionbase – A database for likes, views, follows at 1M+ req/min #

github.com

5 comments10:49 AMView on HN

Honcho – Open-source memory infrastructure, powered by custom models #

github.com

0 comments3:58 PMView on HN

Hey HN,

It’s Vineeth from Plastic Labs. We've been building Honcho, an open-source memory library for stateful AI agents.

Most memory systems are just vector search—store facts, retrieve facts, stuff into context. We took a different approach: memory as reasoning. (We talk about this a lot on our blog)

We built Neuromancer, a model trained specifically for AI-native memory. Instead of naive fact extraction, Neuromancer does formal logical reasoning over conversations to build representations that evolve over time. Its both cheap ( $2/M tokens ingestion, unlimited retrieval), token efficient and SOTA: LongMem (90.4%), LoCoMo (89.9%), and BEAM. On BEAM 10M—which exceeds every model's context window—we hit 0.409 vs prior SOTA of 0.266, using 0.5% of context per query.

Github: https://github.com/plastic-labs/honcho

Evals: https://evals.honcho.dev

Neuromancer Model Card: https://plasticlabs.ai/neuromancer)

Memory as Reasoning Approach: https://blog.plasticlabs.ai/blog/Memory-as-Reasoning

Read more about our recent updates: https://blog.plasticlabs.ai/blog/Honcho-3

Happy to answer questions about the architecture, benchmarks, or agent memory patterns in general

I built a CSV parser to try Go 1.26's new SIMD package #

github.com

4 comments2:28 PMView on HN

Hey HN,

A CSV parser using Go 1.26's experimental simd/archsimd package.

I wanted to see what the new SIMD API looks like in practice. CSV parsing is mostly "find these bytes in a buffer"—load 64 bytes, compare, get a bitmask of positions. The interesting part was handling chunk boundaries correctly (quotes and line endings can split across chunks).

- Drop-in replacement for encoding/csv - ~20% faster for unquoted data on AVX-512 - Quoted data is slower (still optimizing) - Scalar fallback for non-AVX-512

Requires GOEXPERIMENT=simd.

https://github.com/nnnkkk7/go-simdcsv

Feedback on edge cases or the SIMD implementation welcome.

A 4.8MB native iOS voice notes app built with SwiftUI #

apps.apple.com

0 comments8:09 PMView on HN

Hey HN,

I wanted to share a project I’ve been working on called Convoxa. It’s a native iOS transcriber/summarizer. I had two main goals: keep it efficient and keep it private.

THE TECH STACK

100% Swift & SwiftUI: No heavy cross-platform wrappers or bloated dependencies.

Binary Size: The final build is only 4.8 MB.

Transcription: Uses Apple's latest speech APIs for maximum privacy and efficiency.

THE CHALLENGE: BYPASSING THE 4K CONTEXT LIMIT

The biggest technical hurdle was working with Apple’s foundation models. The default context window is capped at 4096 tokens, which is practically useless for anything over a 10-minute meeting transcript.

I ended up building a recursive chunking method to "feed" the model long-form data without losing the global context of the conversation. I use a sliding window approach where each chunk's summary informs the next, ensuring the final output doesn't "hallucinate" at the seams where the chunks meet. It’s now stable enough for long-form audio while remaining entirely on-device for supported hardware.

PRIVACY & AI MODES

On-Device: (Apple Intelligence required) - Total local processing.

Cloud: With reasoning for intelligent insights (Zero Data Retention).

I’m currently in the pre-order phase (out on Feb 3rd) and would love to get some feedback from this community on the performance and the chunking logic.

App Store: https://apps.apple.com/us/app/convoxa-ai-meeting-minutes/id6...

Script: JavaScript That Runs Like Rust #

docs.script-lang.org

5 comments9:15 PMView on HN

Analyzing Semantic Redundancy in LLM Retrieval (Google GIST Protocol) #

0 comments1:13 PMView on HN

Last week, Google research published details on GIST (Greedy Independent Set Thresholding), a new protocol presented at NeurIPS 2025. I was fascinated by the paper, so I built a tool to visualize the "No-Go Zones" (redundancy radius) it describes.

The Tool: https://websiteaiscore.com/gist-compliance-check

The Context (The Paper): To understand the tool, you have to understand the problem Google is solving with GIST: redundancy is expensive. When generating an AI answer , the model cannot feed 10k search results into the context window—it costs too much compute. If the top 5 results are semantically identical (consensus content), the model wastes tokens processing duplicates.

The GIST algorithm solves this via Max-Min Diversity:

Utility Score: It selects a high-value source.

The Radius: It draws a mathematical conflict radius around that content based on semantic similarity.

The Lockout: Any content inside that radius is rejected to save compute, regardless of domain authority.

How my implementation works: I wanted to see if we could programmatically detect if a piece of content falls inside this "redundancy radius." The tool uses an LLM to analyze the top ranking URLs for a specific query, calculates the vector embedding, and measures the Semantic Cosine Similarity against your input.

If the overlap is too high (simulating the GIST lockout), the tool flags the content as providing zero marginal utility to the model.

I’d love feedback on the accuracy of the similarity scoring.

Open-source Robotics – Curated projects with interactive 3D URDF viewer #

robotics.growbotics.ai

3 comments8:25 PMView on HN

Lightbox – Flight recorder for AI agents (record, replay, verify) #

uselightbox.app

0 comments5:23 PMView on HN

I built Lightbox because I kept running into the same problem: an agent would fail in production, and I had no way to know what actually happened.

Logs were scattered, the LLM’s “I called the tool” wasn’t trustworthy, and re-running wasn’t deterministic.

This week, tons of Clawdbot incidents have driven the point home. Agents with full system access can expose API keys and chat histories. Prompt injection is now a major security concern.

When agents can touch your filesystem, execute code, and browse the web…you probably need a tamper-proof record of exactly what actions it took, especially when a malicious prompt or compromised webpage could hijack the agent mid-session.

Lightbox is a small Python library that records every tool call an agent makes (inputs, outputs, timing) into an append-only log with cryptographic hashes. You can replay runs with mocked responses, diff executions across versions, and verify the integrity of logs after the fact.

Think airplane black box, but for your hackbox.

*What it does:*

- Records tool calls locally (no cloud, your infra)

- Tamper-evident logs (hash chain, verifiable)

- Replay failures exactly with recorded responses

- CLI to inspect, replay, diff, and verify sessions

- Framework-agnostic (works with LangChain, Claude, OpenAI, etc.)

*What it doesn’t do:* - Doesn’t replay the LLM itself (just tool calls) - Not a dashboard or analytics platform - Not trying to replace LangSmith/Langfuse (different problem)

*Use cases I care about:*

- Security forensics: agent behaved strangely, was it prompt injection? Check the trace.

- Compliance: “prove what your agent did last Tuesday”

- Debugging: reproduce a failure without re-running expensive API calls

- Regression testing: diff tool call patterns across agent versions

As agents get more capable and more autonomous (Clawdbot/Molt, Claude computer use, Manus, Devin), I think we’ll need black boxes the same way aviation does.

This is my attempt at that primitive.

It’s early (v0.1), intentionally minimal, MIT licensed.

Site: <https://uselightbox.app> install: `pip install lightbox-rec`

GitHub: <https://github.com/mainnebula/Lightbox-Project>

Would love feedback, especially from anyone thinking about agent security or running autonomous agents in production.

Burn Text – Add animated captions to videos, runs locally in browser #

burntext.com

0 comments6:01 AMView on HN

I built a tool that adds word-by-word animated captions to videos. All processing happens client-side using the browser - no server uploads.

Tech: 1. Whisper WASM (tiny and small models) 2. Canvas Text 3. MediaBunny to stitch everything together.

The privacy angle was important to me. This processes everything locally and exports directly.

Free, no account, no watermark. Feedback welcome.

I Stopped Hoping My LLM Would Cooperate #

0 comments8:50 PMView on HN

42 validation errors in one run. Claude apologising instead of writing HTML. OAuth tokens expiring mid-digest.

Then I fixed the constraints. Eight days, zero failures, zero intervention.

The secret wasn't better prompts... it was treating the LLM as a constrained function: schema-validated tool calls that reject malformed output and force retries, two-pass architecture separating editorial judgment from formatting, and boring DevOps (retry logic, rate limiting, structured logging).

The Claude invocation is ~30 lines in a 2000-line system. Most of the work is everything around it.

https://seanfloyd.dev/blog/llm-reliability https://github.com/SeanLF/claude-rss-news-digest

Distributed Training Observability for PyTorch (TraceML) #

github.com

0 comments8:18 PMView on HN

13-Virtues – A tracker for Benjamin Franklin's 13-week character system #

13-virtues.com

1 comments1:26 PMView on HN

Hi HN!

I’m Hélène. My husband and I are builders from Belgium, and we’ve spent the last few months building a side project called 13 Virtues.

The idea comes from Benjamin Franklin’s personal character system. Instead of tracking many habits at once, he focused on one virtue per week (Temperance, Silence, Order, etc.), cycling through 13 virtues over 13 weeks, then repeating the cycle. He documented this practice for many years.

We wanted an app that follows this structure strictly, rather than another flexible habit tracker. One virtue at a time. One day at a time. Repeat.

You can try the ledger directly on the homepage without creating an account. You can mark faults for today and see the current virtue with Franklin’s original quotes:

https://www.13-virtues.com

Why we built it:

We were frustrated with productivity apps that optimize for streaks and metrics rather than reflection. Franklin’s system felt refreshingly constrained and intentional, and we wanted something we’d actually use ourselves. My husband handled the engineering; I focused on product and design.

Pricing model:

We deliberately avoided subscriptions.

- Free tier: the full 13-week cycle and daily ledger

- Lifetime upgrade ($79 launch price): long-term history beyond the current cycle, guided reflections, data export, and a downloadable Modern Virtue Guide (PDF) that explains the method and its rationale in more depth.

Franklin’s system predates SaaS by a few centuries, and a monthly fee felt wrong for something meant to be practiced quietly over years.

Tech:

- Backend: Laravel

- Frontend: Vue (Inertia)

- CSS: Tailwind

- Hosting: Hostinger

Built over ~12 weekends.

We’ll be around today (CET) to answer questions — happy to discuss the implementation, the pricing decision, or Franklin’s original writings. Thoughtful UI/UX feedback is especially welcome.

Pingram – Send Telegram alerts with 1 line of Python (20KB) #

github.com

2 comments4:58 PMView on HN

A Local OS for LLMs. MIT License. Zero Hallucinations. (Not Crank) #

github.com

0 comments4:56 AMView on HN

The problem with LLMs isn't intelligence; it's amnesia and dishonesty. Hey HN, I’ve spent the last few months building Remember-Me, an open-source "Sovereign Brain" stack designed to run entirely offline on consumer hardware.

The core thesis is simple: Don't rent your cognition.

Most RAG (Retrieval Augmented Generation) implementations are just "grep for embeddings." They are messy, imprecise, and prone to hallucination. I wanted to solve the "Context integrity" problem at the architectural layer.

The Tech Stack (How it works):

QDMA (Quantum Dream Memory Architecture): instead of a flat vector DB, it uses a hierarchical projection engine. It separates "Hot" (Recall) from "Cold" (Storage) memory, allowing for effectively infinite context window management via compression.

CSNP (Context Switching Neural Protocol) - The Hallucination Killer: This is the most important part. Every memory fragment is hashed into a Merkle Chain. When the LLM retrieves context, the system cryptographically verifies the retrieval against the immutable ledger.

If the hash doesn't match the chain: The retrieval is rejected.

Result: The AI visually cannot "make things up" about your past because it is mathematically constrained to the ledger. Local Inference: Built on top of llama.cpp server. It runs Llama-3 (or any GGUF) locally. No API keys. No data leaving your machine.

Features:

Zero-Dependency: Runs on Windows/Linux with just Python and a GPU (or CPU).

Visual Interface: Includes a Streamlit-based "Cognitive Interface" to visualize memory states. Open Source: MIT License. This is an attempt to give "Agency" back to the user. I believe that if we want AGI, it needs to be owned by us, not rented via an API.

Repository: https://github.com/merchantmoh-debug/Remember-Me-AI

I’d love to hear your feedback on the Merkle-verification approach. Does constraining the context window effectively solve the "trust" issue for you?

It's fully working - Fully tested. If you tried to Git Clone before without luck - As this is not my first Show HN on this - Feel free to try again.

To everyone who HATES AI slop; Greedy corporations and having their private data stuck on cloud servers.

You're welcome.

Cheers, Mohamad

Authors note: Updated successfully.

Framework 50 is active.

For anyone passing by - yes this is a big deal. Eliminating AI hallucination is a 60 billion dollar market problem and I'm giving THAT + sovereign control of your DATA plus the capability to do high-end research via framework 50 (including advanced scientific research) for FREE - under an MIT license. If you don't take advantage of this - you are an idiot.

If you do - welcome to the future.

P.S: What do I get from lying? I got 36 stars on the repo - many from high-end senior engineers at fortune 500 companies. If you're too stupid to tell the real deal from a lie then keep it moving son.

A simple, transaction-safe SQL migration tool #

github.com

0 comments11:30 PMView on HN

Paper → Code → Jupyter Notebook (generate and run code while reading) #

deeptutor.knowhiz.us

0 comments5:38 AMView on HN

Hi HN — we’re shipping a feature in DeepTutor this week: turn a paper into runnable code and export a Jupyter Notebook.

It’s meant for quick reproducibility: implement an equation/method, run a sanity check, and keep everything in a shareable notebook.

Demo: https://www.youtube.com/watch?v=FOnyym-jUPg Happy to answer questions. What would make this useful in your workflow?

Walk and drive through OpenStreetMap in 3D #

bilalba.github.io

3 comments6:25 AMView on HN

Standing on the shoulders of OpenStreetMap, three.js, and other open-source projects, I made an open-source, game-like 3D explorer for OSM.

It’s barebones right now (currently loads one scene), renderings need work, and it’s buggy in places. But you can get driving on most roads around the world.

Contributions welcome!

Get recommendations or convert agent skills directly in your workspace #

agenstskills.com

0 comments9:00 AMView on HN

It's an open-source project that lets you create, translate, and work with multiple agents at the same time, and more.

WSL2 HyperV Firewall Manager – GUI App for WSL2 Firewall Management #

github.com

0 comments3:03 PMView on HN

An open-source starter for developing with Postgres and ClickHouse #

github.com

0 comments5:46 PMView on HN

Cosmic AI Workflows – Chain AI agents to automate multi-step projects #

cosmicjs.com

0 comments6:37 PMView on HN

Hi, I'm Tony, founder of Cosmic (AI-powered headless CMS and application development platform). We kept running into the same problem: create a blog post with the help of an AI agent, use the output for another prompt to create social posts, then manually post to X, LinkedIn, Facebook. Every single time.

So we built AI Workflows — chain multiple agents together and let them run autonomously, with each step receiving outputs from previous steps.

Three agent types you can chain:

- Code Agents: Build features in GitHub with commits and pull requests.

- Content Agents: Generate CMS content with context injection from previous steps.

- Computer Use Agents: Automate browser workflows and record demos.

How it works:

1. Define steps with agent type, prompt, and configuration

2. Steps run sequentially or in parallel (configurable)

3. Context passes automatically between steps

4. Trigger manually, on a schedule (cron), or via CMS and API events (object.created, object.edited, etc.)

5. Add approval gates for human review before critical steps

Example: Autopilot feature development:

Step 1: Content Agent writes a feature spec based on user feedback

Step 2: Code Agent builds the feature, creates PR, and deploys to production

Step 3: Content Agent generates documentation and a changelog entry

Step 4: Computer Use Agent posts update to team Slack with the PR link and preview URL

Currently in beta. Would love feedback on the workflow model and what use cases you'd want to automate.

P.ai.os – A local, modular AI "operating" system for macOS (M4/MLX) #

0 comments7:46 PMView on HN

Hello HN,

I’m sharing a personal project I’ve been working on: a modular, "headless" operating system for my digital life, hosted locally on a Mac Mini M4.

The source code is available here:

https://github.com/vag-mac-mini/PAIOS_Public

I travel frequently and wanted a "sovereign" system that doesn't rely on SaaS subscriptions or cloud data. The stack is: - Hardware: Headless Mac Mini M4 (16GB). - AI Core: Qwen 3 VL 8B (Abliterated) running via MLX. - Network: Tailscale (for remote access via PWA). - Agentic Layer: Custom Python scripts for tool use (API calls, system control).

The constraint for this project was unique: I did not write the code manually. I utilized Google Gemini to architect and generate the Python files with Antigravity IDE, acting as the "product manager". The code structure is admittedly messy in places, but it is fully functional.

The OS currently runs 10 active modules, categorized by function:

-- Intelligence & Memory -- * Digital Diary: Uses Vision AI to analyze screen activity/productivity and logs daily summaries to Apple Notes. * Voice & Memory: Indexes voice transcriptions into a searchable "Second Brain." * Ghostwriter: Remixes rough voice notes into structured essays or book chapters.

-- Global Logistics -- * Travel Command: Aggregates flight/visa data and summarizes local security risks via Tavily API.Gives recommendations for packing based on country/city/period/weather. * Aero Intel: Audits flight paths and generates deep links for travel logistics. * Chronos Calendar: A master schedule that integrates financial timelines with travel itineraries.

-- Network & Security -- * Network Sentry: Monitors the local ARP table for unknown devices. * Secure Dead Drop: An encrypted P2P file tunnel between my devices. * CRM & Network: A relationship database that parses raw notes into structured contacts.

Latest update that is not on GitHub, created a python script and added it on cron where it uses an api to check all the bank holidays across the globe(excluding the major ones) checks my contacts, which could be from all over the world, see's who is from the country with the bank holiday and I receive an iMessage in the morning in order to wish them happy whatever holiday.

This is not a SaaS or a commercial product. I am open to feedback on the architecture or suggestions for other "sovereign" modules I could add.

Thanks.

First autonomous ML and AI engineering Agent #

marketplace.visualstudio.com

2 comments8:30 PMView on HN

Founder here. I built NEO, an AI agent designed specifically for AI and ML engineering workflows, after repeatedly hitting the same wall with existing tools: they work for short, linear tasks, but fall apart once workflows become long-running, stateful, and feedback-driven. In real ML work, you don’t just generate code and move on. You explore data, train models, evaluate results, adjust assumptions, rerun experiments, compare metrics, generate artifacts, and iterate; often over hours or days. Most modern coding agents already go beyond single prompts. They can plan steps, write files, run commands, and react to errors.

Where things still break down is when ML workflows become long-running and feedback-heavy. Training jobs, evaluations, retries, metric comparisons, and partial failures are still treated as ephemeral side effects rather than durable state. Once a workflow spans hours, multiple experiments, or iterative evaluation, you either babysit the agent or restart large parts of the process. Feedback exists, but it is not something the system can reliably resume from.

NEO tries to model ML work the way it actually happens. It is an AI agent that executes end-to-end ML workflows, not just code generation. Work is broken into explicit execution steps with state, checkpoints, and intermediate results. Feedback from metrics, evaluations, or failures feeds directly into the next step instead of forcing a full restart. You can pause a run, inspect what happened, tweak assumptions, and resume from where it left off.

Here's an example as well for your reference: You might ask NEO to explore a dataset, train a few baseline models, compare their performance, and generate plots and a short report. NEO will load the data, run EDA, train models, evaluate them, notice if something underperforms or fails, adjust, and continue. If training takes an hour and one model crashes at 45 minutes, you do not start over. Neo inspects the failure, fixes it, and resumes.

Docs for the extension: https://docs.heyneo.so/#/vscode

Happy to answer questions about Neo.

A userscript to filter out LLM/AI content from HN #

github.com

0 comments8:43 PMView on HN

ACME Proxy using step-ca #

github.com

0 comments10:12 PMView on HN

Lumina – Open-source observability for AI systems(OpenTelemetry-native) #

github.com

0 comments4:00 PMView on HN

Hey HN! I built Lumina – an open-source observability platform for AI/LLM applications. Self-host it in 5 minutes with Docker Compose, all features included.

The Problem:

I've been building LLM apps for the past year, and I kept running into the same issues: - LLM responses would randomly change after prompt tweaks, breaking things. - Costs would spike unexpectedly (turns out a bug was hitting GPT-4 instead of 3.5). - No easy way to compare "before vs after" when testing prompt changes. - Existing tools were either too expensive or missing features in free tiers.

What I Built:

Lumina is OpenTelemetry-native, meaning: - Works with your existing OTEL stack (Datadog, Grafana, etc.). - No vendor lock-in, standard trace format. - Integrates in 3 lines of code.

Key features: - Cost & quality monitoring – Automatic alerts when costs spike, or responses degrade. - Replay testing – Capture production traces, replay them after changes, see diffs. - Semantic comparison – Not just string matching – uses Claude to judge if responses are "better" or "worse." - Self-hosted tier – 50k traces/day, 7-day retention, ALL features included (alerts, replay, semantic scoring)

How it works:

```bash # Start Lumina git clone https://github.com/use-lumina/Lumina cd Lumina/infra/docker docker-compose up -d ```

```typescript // Add to your app (no API key needed for self-hosted!) import { Lumina } from '@uselumina/sdk';

const lumina = new Lumina({ endpoint: 'http://localhost:8080/v1/traces', });

// Wrap your LLM call const response = await lumina.traceLLM( async () => await openai.chat.completions.create({...}), { provider: 'openai', model: 'gpt-4', prompt: '...' } ); ```

That's it. Every LLM call is now tracked with cost, latency, tokens, and quality scores.

What makes it different:

1. Free self-hosted with limits that work – 50k traces/day and 7-day retention (resets daily at midnight UTC). All features included: alerts, replay testing, and semantic scoring. Perfect for most development and small production workloads. Need more? Upgrade to managed cloud.

2. OpenTelemetry-native – Not another proprietary format. Use standard OTEL exporters, works with existing infra. Can send traces to both Lumina AND Datadog simultaneously.

3. Replay testing – The killer feature. Capture 100 production traces, change your prompt, replay them all, and get a semantic diff report. Like snapshot testing for LLMs.

4. Fast – Built with Bun, Postgres, Redis, NATS. Sub-500ms from trace to alert. Handles 10k+ traces/min on a single machine.

What I'm looking for:

- Feedback on the approach (is OTEL the right foundation?) - Bug reports (tested on Mac/Linux/WSL2, but I'm sure there are issues) - Ideas for what features matter most (alerts? replay? cost tracking?) - Help with the semantic scorer (currently uses Claude, want to make it pluggable)

Why open source:

I want this to be the standard for LLM observability. That only works if it's: - Free to use and modify (Apache 2.0) - Easy to self-host (Docker Compose, no cloud dependencies) - Open to contributions (good first issues tagged)

The business model is managed hosting for teams that don't want to run infrastructure. But the core product is and always will be free.

Try it: - GitHub: https://github.com/use-lumina/Lumina - Docs: https://docs.uselumina.io - Quick start: 5 minutes from `git clone` to dashboard

I'd love to hear what you think! Especially interested in: - What observability problems are you hitting with LLMs - Missing features that would make this useful for you - Any similar tools you're using (and what they do better)

Thanks for reading!

CUGA – Configurable Generalist Agent (HuggingFace Live Demo) #

huggingface.co

2 comments2:01 PMView on HN

We’ve been building CUGA, an open-source generalist agent for reliable browser+API automation with Save & Reuse, policy modes, OpenAPI/MCP integration, and strong results on WebArena/AppWorld. Live demo link above. Happy to answer questions and get feedback.

Jiss – A community-powered LLM API I built for open models #

jiss.ai

0 comments6:08 AMView on HN

I’ve been working on Jiss, a project to make open-source LLMs accessible without the massive cloud bills or the headache of managing your own 24/7 infrastructure. It’s a distributed network that routes inference requests to a pool of volunteer-run workers. If you have an app already using OpenAI, you can just swap the base_url and start running models like Llama 3 or Qwen3. The idea: Instead of relying on big providers, Jiss is powered by people donating idle compute. Whether you’re running Ollama on a Mac, a home server, or a workstation, you can connect your local setup to the network and help serve requests. I built this to see if we could create a useful, shared resource for the open-source AI community. The highlights: • Drop-in replacement: Use the standard OpenAI SDK—just change the base URL. • Streaming: Supports real-time token streaming (SSE). • Open: Anyone can join as a worker using a single script. It’s still early so I’d love to get your feedback or have you try running a worker!

An Internationalization GitHub Action to Replace Crowdin with LLMs #

github.com

0 comments5:09 AMView on HN

I built an open-source GitHub Action that translates i18n files using LLMs, designed as a drop-in replacement for Lokalise, Phrase, and Crowdin.

The problem: TMS platforms charge per-word and per-seat, and their machine translation lacks product context. A typical SaaS with 500 strings across 9 languages costs $200-500/month.

How it works: - Extracts strings from your codebase (XLIFF, JSON, PO, YAML) - Diffs against previous translations (only translates what changed) - Sends to any LLM (Claude, GPT-4, Gemini, Ollama) with your product context, glossary, and style guide - Commits translations back to your branch

Key technical bits: - Structured generation for ICU message format (CLDR plural rules handled correctly - Russian 4-form, Arabic 6-form, etc.) - Hash-based caching to avoid re-translating unchanged strings - Provider-agnostic interface - swap LLMs without config changes

GitHub: https://github.com/webdecoy/ai-i18n

Happy to answer questions about the ICU handling, prompt design, or anything else.

LinkLens – Document and link tracking in one dashboard #

linklens.tech

0 comments6:21 AMView on HN

I built LinkLens because I was paying $80/month for DocSend + Bitly separately.

It tracks: - Documents: who viewed, which pages, how long - Links: clicks, location, device, UTM params

Tech stack: Next.js, Supabase, Cloudflare R2

Free tier available. Would love feedback on the product and landing page.

Claude Threads – Collaborate on Claude Code via Slack (Or Mattermost) #

claude-threads.run

0 comments4:04 PMView on HN

I wanted my team to start using Claude Code but didn't want to set everyone up before they were convinced. Started piping output to Mattermost (and later Slack) so people could watch and learn how to work with Claude Code. Ended up building more: multiple sessions in parallel (each in a thread, hence the name), approve messages from other users with emojis, approve file writes, attach images/files, worktrees per thread.

It runs on your machine.

I built most of it using itself. Teammates watching live caught stuff I missed.

https://claude-threads.run https://github.com/anneschuth/claude-threads

Triangular Arbitrage Bot with Kernel Bypass #

github.com

0 comments7:14 PMView on HN

Engroles.com – Verified, Active SWE Listings from Recruiters #

engroles.com

0 comments6:17 AMView on HN

I created engroles because, as I was looking for a job as a software engineer, I realized that most of the listings out there are black holes where my application likely went to die. However, I noticed that I got an interview with almost every recruiter that reached out to my personal email advertising an actual listing. It taught me that recruiter email outreach almost always represents a real job.

And so I thought to myself, how can I help other job-seeking software engineers get access to an aggregated list of real recruiter outreach? And I came up with engroles. The idea is that as a job seeker, you forward one verified recruiter outreach that you've received to your email in the last 30 days to [email protected]. Once engroles verifies that the recruiter outreach represents a real job listing, it will give you access to apply to any of the listings in the entire job board. Every listing represents a job from another user who has received verified recruiter outreach and sent it to engroles. Each one of the listings in engroles is also linked to a recruiter. When job seekers apply to a job listing, the recruiter is notified, and recruiters will reach out to the job seeker's profile.

This is an initial release. You can see a truncated view of the currently existing job listings without creating an account. Please give me any feedback that you have. Thanks for checking this out!

Goldenthread – Compile Go to TypeScript Zod for type-safe validation #

github.com

0 comments4:03 PMView on HN

Hi HN,

I kept shipping features where the frontend accepted data the backend rejected. Validation rules lived in two places (Go struct tags and TypeScript Zod schemas), and they'd drift.

I built goldenthread to solve this. It's a build-time compiler that generates TypeScript Zod schemas from Go struct tags. Write validation once, use it everywhere.

Example:

    // Go backend
    type User struct {
        Username string `json:"username" gt:"required,len:3..20"`
        Email    string `json:"email" gt:"email"`
        Age      int    `json:"age" gt:"min:13,max:130"`
    }

Run `goldenthread generate ./models` and you get:

    // TypeScript (auto-generated)
    export const UserSchema = z.object({
      username: z.string().min(3).max(20),
      email: z.string().email(),
      age: z.number().int().min(13).max(130)
    })

    export type User = z.infer<typeof UserSchema>

The compiler uses go/packages and go/types for accurate type resolution. It handles nested objects, enums, maps, embedded structs, and 37+ validation rules.

The killer feature: drift detection. Run `goldenthread check` in CI - it computes SHA-256 hashes of your schemas and fails the build if generated TypeScript doesn't match Go source. No more "frontend works locally but breaks in production because someone changed the backend struct."

Before releasing v0.1.0, I ran continuous fuzzing (10 targets, hourly in GitHub Actions). It found two bugs my test suite missed:

  1. UTF-8 corruption: Japanese field names with empty JSON tags triggered byte-slicing bugs in camelCase conversion. Took 444,553 executions to discover.

  2. Regex escaping: Patterns with newline characters produced broken JavaScript output. Found in 180 executions.

Both required specific intersections of conditions that manual testing wouldn't cover. I wrote up the full fuzzing setup here: https://blackwell-systems.github.io/blog/posts/continuous-fu...

The tool is production-ready:

Zero runtime dependencies (generated code only imports Zod) Generates readable TypeScript (looks hand-written) Complete Go type support (primitives, arrays, maps, nested objects) Works with any Go project structure MIT OR Apache 2.0 dual licensed

I built this because I was tired of frontend/backend validation bugs in my day job (hospitality platform with booking/payment APIs). We use it internally now.

Real-world example in the repo: 9-schema e-commerce system with Customer records (E.164 phone validation), Product SKUs (regex patterns), and Order workflows (nested objects + array constraints).

GitHub: https://github.com/blackwell-systems/goldenthread Docs: https://github.com/blackwell-systems/goldenthread/blob/main/... Tag syntax: https://github.com/blackwell-systems/goldenthread/blob/main/...

Would love feedback

Thanks for looking!