Show HN – January 26, 2026
Ourguide – OS wide task guidance system that shows you where to click #
I started building this because whenever I didn’t know how to do something on my computer, I found myself constantly tabbing between chatbots and the app, pasting screenshots, and asking “what do I do next?” Ourguide solves this with two modes. In Guide mode, the app overlays your screen and highlights the specific element to click next, eliminating the need to leave your current window. There is also Ask mode, a vision-integrated chat that captures your screen context (which you can toggle on and off at any time), so you can ask "How do I fix this error?" without having to explain what "this" is.
It’s an Electron app that works OS-wide, is vision-based, and isn't restricted to the browser.
Figuring out how to show the user where to click was the hardest part of the process. I originally trained a computer vision model with 2300 screenshots to identify and segment all UI elements on a screen and used a VLM to find the correct icon to highlight. While this worked extremely well—better than SOTA grounding models like UI Tars—the latency was just too high. I'll be making that CV+VLM pipeline OSS soon, but for now, I’ve resorted to a simpler implementation that achieves <1s latency.
You may ask: if I can show you where to click, why can't I just click too? While trying to build computer-use agents during my job in Palo Alto, I hit the core limitation of today’s computer-use models where benchmarks hover in the mid-50% range (OSWorld). VLMs often know what to do but not what it looks like; without reliable visual grounding, agents misclick and stall. So, I built computer use—without the "use." It provides the visual grounding of an agent but keeps the human in the loop for the actual execution to prevent misclicks.
I personally use it for the AWS Console's "treasure hunt" UI, like creating a public S3 bucket with specific CORS rules. It’s also been surprisingly helpful for non-technical tasks, like navigating obscure settings in Gradescope or Spotify. Ourguide works for just about any task where you’re stuck or don't know what to do.
You can download and test Ourguide here: https://ourguide.ai/downloads
The project is still very early, and I’d love your feedback on where it fails, where you think it worked well, and which specific niches you think Ourguide would be most helpful for.
Cua-Bench – a benchmark for AI agents in GUI environments #
Computer-use agents show massive performance variance across different UIs—an agent with 90% success on Windows 11 might drop to 9% on Windows XP for the same task. The problem is OS themes, browser versions, and UI variations that existing benchmarks don't capture.
The existing benchmarks (OSWorld, Windows Agent Arena, AndroidWorld) were great but operated in silos—different harnesses, different formats, no standardized way to test the same agent across platforms. More importantly, they were evaluation-only. We needed environments that could generate training data and run RL loops, not just measure performance. Cua-Bench takes a different approach: it's a unified framework that standardizes environments across platforms and supports the full agent development lifecycle—benchmark, train, deploy.
With Cua-Bench, you can:
- Evaluate agents across multiple benchmarks with one CLI (native tasks + OSWorld + Windows Agent Arena adapters)
- Test the same agent on different OS variations (Windows 11/XP/Vista, macOS themes, Linux, Android via QEMU)
- Generate new tasks from natural language prompts
- Create simulated environments for RL training (shell apps like Spotify, Slack with programmatic rewards)
- Run oracle validations to verify environments before agent evaluation
- Monitor agent runs in real-time with traces and screenshots
All of this works on macOS, Linux, Windows, and Android, and is self-hostable.
To get started:
Install cua-bench:
% pip install cua-bench
Run a basic evaluation:
% cb run dataset datasets/cua-bench-basic --agent demo
Open the monitoring dashboard:
% cb run watch <run_id>
For parallelized evaluations across multiple workers:
% cb run dataset datasets/cua-bench-basic --agent your-agent --max-parallel 8
Want to test across different OS variations? Just specify the environment:
% cb run task slack_message --agent your-agent --env windows_xp
% cb run task slack_message --agent your-agent --env macos_sonoma
Generate new tasks from prompts:
% cb task generate "book a flight on kayak.com"
Validate environments with oracle implementations:
% cb run dataset datasets/cua-bench-basic --oracle
The simulated environments are particularly useful for RL training—they're HTML/JS apps that render across 10+ OS themes with programmatic reward verification. No need to spin up actual VMs for training loops.
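As a purely hypothetical illustration (not Cua-Bench's actual API), a programmatic reward for a shell-app task can be as small as a function over the simulated app's state; all names and fields below are made up:

```
# Hypothetical sketch of programmatic reward verification for a simulated
# shell app; function and field names are illustrative, not Cua-Bench's API.

def slack_message_reward(app_state: dict) -> float:
    """Return 1.0 if the agent sent the target message to the right channel."""
    sent = app_state.get("sent_messages", [])
    ok = any(
        m["channel"] == "#general" and "standup" in m["text"].lower()
        for m in sent
    )
    return 1.0 if ok else 0.0

state = {"sent_messages": [{"channel": "#general", "text": "Standup at 10am"}]}
print(slack_message_reward(state))  # 1.0
```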
We're seeing teams use Cua-Bench for:
- Training computer-use models on mobile and desktop environments
- Generating large-scale training datasets (working with labs on millions of screenshots across OS variations)
- RL fine-tuning with shell app simulators
- Systematic evaluation across OS themes and browser versions
- Building task registries (collaborating with Snorkel AI on task design and data curation, similar to their Terminal-Bench work)
Cua-Bench is 100% open-source under the MIT license. We're actively developing it as part of Cua (https://github.com/trycua/cua), our Computer Use Agent SDK, and we'd love your feedback, bug reports, or feature ideas.
GitHub: https://github.com/trycua/cua
Docs: https://cua.ai/docs/cuabench
Technical Report: https://cuabench.ai
We'll be here to answer any technical questions and look forward to your comments!
NukeCast – If it happened today, where would the fallout go #
Hybrid Markdown Editing #
Partial content web crawling using HTTP/2 and Go #
TL;DR: the HTML of a YouTube video, for example, contains the video description, views, likes, etc. in its first 600 KB; the remaining 900 KB are of no use to me, but I have to pay my proxies by the gigabyte.
My crawler receives the response packet by packet, and once I have everything I need it resets the request, so I only pay for what I actually crawled.
This is also potentially useful for large-scale crawling operations where duplicates matter: you could compute a SimHash on the fly and reset the request before downloading the entire document (again).
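The original implementation uses Go and HTTP/2 stream resets; as a rough illustration of the same stop-early idea, here is a small Python sketch using requests streaming (the marker string and byte budget are made-up examples):

```
import requests

# Stream the response and stop as soon as the needed fields have arrived,
# instead of downloading the full document. Exiting the context manager
# closes the response, so the remaining bytes are never transferred.
NEEDED_MARKER = b'"viewCount"'  # example: last field we care about

def fetch_partial(url: str, max_bytes: int = 600 * 1024) -> bytes:
    buf = bytearray()
    with requests.get(url, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=16 * 1024):
            buf.extend(chunk)
            if NEEDED_MARKER in buf or len(buf) >= max_bytes:
                break  # stop early; no need to read the rest
    return bytes(buf)
```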
XDA Forum discussion on sideloading (2009) #
I got tired of checking 5 dashboards, so I built a simpler one #
Whenever I wanted to understand how things were going, I’d end up jumping between Stripe, analytics, database queries, logs, and cron scripts. I even built custom dashboards and Telegram bots to notify me about certain numbers, but that just added more things to maintain.
What I wanted was something simpler: send a number from my backend and see it on a clean dashboard.
So I built a small tool for myself.
It’s essentially a very simple API where you push numeric metrics with a timestamp, and then view them as counters, charts, goals, or percentage changes over time.
It’s not meant to replace analytics tools. I still use those. This is more for things like user counts, MRR, failed jobs, or any metric you already know you want to track without setting up a full integration.
Some intentional constraints:
- no SDKs, just a basic HTTP API
- works well with backend code and cron jobs
- stores only numbers and timestamps
- flexible enough to track any metric you can turn into a number
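Since there are no SDKs, a push is just an HTTP call from your backend or a cron job. A hypothetical sketch of what that could look like (the endpoint, payload fields, and auth header are my guesses, not anypanel's documented API):

```
import time
import requests

API_URL = "https://anypanel.io/api/metrics"  # assumed endpoint
API_KEY = "your-api-key"                     # assumed auth scheme

def push_metric(name: str, value: float) -> None:
    """Send one numeric data point with a timestamp."""
    requests.post(
        API_URL,
        json={"metric": name, "value": value, "timestamp": int(time.time())},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=10,
    )

push_metric("signups_total", 1423)
```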
It’s still early and very much an MVP. I’m mainly posting to get feedback:
- does this solve a real problem for you?
- what feels unnecessary or missing?
- how would you approach this differently?
Website: https://anypanel.io
Happy to answer questions or hear why this doesn’t make sense. Thanks, Felix
A Local OS for LLMs. MIT License. Zero Hallucinations. Infinite Memory #
Hey HN,
I’ve spent the last few months building Remember-Me, an open-source "Sovereign Brain" stack designed to run entirely offline on consumer hardware.
The core thesis is simple: Don't rent your cognition.
Most RAG (Retrieval Augmented Generation) implementations are just "grep for embeddings." They are messy, imprecise, and prone to hallucination. I wanted to solve the "Context integrity" problem at the architectural layer.
The Tech Stack (How it works):
QDMA (Quantum Dream Memory Architecture): Instead of a flat vector DB, it uses a hierarchical projection engine. It separates "Hot" (Recall) from "Cold" (Storage) memory, allowing for effectively infinite context window management via compression.
CSNP (Context Switching Neural Protocol) - The Hallucination Killer: This is the most important part. Every memory fragment is hashed into a Merkle Chain. When the LLM retrieves context, the system cryptographically verifies the retrieval against the immutable ledger.
If the hash doesn't match the chain: The retrieval is rejected.
Result: the AI literally cannot "make things up" about your past because it is mathematically constrained to the ledger.
Local Inference: Built on top of the llama.cpp server. It runs Llama-3 (or any GGUF) locally. No API keys. No data leaving your machine.
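To make the CSNP idea concrete, here is a minimal hash-chain sketch in plain Python (illustrative only; this is not the Remember-Me code): each fragment's hash is chained to the previous one, and a retrieved fragment is accepted only if its hash still matches the ledger.

```
import hashlib

def chain_append(ledger: list[dict], fragment: str) -> None:
    # Chain each fragment to the previous entry's hash.
    prev = ledger[-1]["hash"] if ledger else "0" * 64
    digest = hashlib.sha256((prev + fragment).encode()).hexdigest()
    ledger.append({"fragment": fragment, "prev": prev, "hash": digest})

def verify_retrieval(ledger: list[dict], index: int, fragment: str) -> bool:
    # Recompute the hash; reject the retrieval if it no longer matches.
    entry = ledger[index]
    expected = hashlib.sha256((entry["prev"] + fragment).encode()).hexdigest()
    return expected == entry["hash"]

ledger: list[dict] = []
chain_append(ledger, "User prefers metric units.")
print(verify_retrieval(ledger, 0, "User prefers metric units."))    # True
print(verify_retrieval(ledger, 0, "User prefers imperial units."))  # False -> reject
```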
Features:
Zero-Dependency: Runs on Windows/Linux with just Python and a GPU (or CPU).
Visual Interface: Includes a Streamlit-based "Cognitive Interface" to visualize memory states.
Open Source: MIT License.
This is an attempt to give "Agency" back to the user. I believe that if we want AGI, it needs to be owned by us, not rented via an API.
Repository: https://github.com/merchantmoh-debug/Remember-Me-AI
I’d love to hear your feedback on the Merkle-verification approach. Does constraining the context window effectively solve the "trust" issue for you?
It's fully working and fully tested. If you tried to git clone it before without luck (this isn't my first Show HN for this project), feel free to try again.
To everyone who hates AI slop, greedy corporations, and having their private data stuck on cloud servers:
You're welcome.
Cheers, Mohamad
Delegation/Mixins C# Source Generators Library #
Bytepiper – turn .txt files into live APIs #
I built a small tool that converts API logic written in plain .txt files into real, executable PHP API endpoints.
The motivation was personal: I can design and ship frontends quickly, but backend APIs (setup, boilerplate, deployment) always slowed down small projects and MVPs. I wanted a way to describe inputs, rules, and responses in text and get a working endpoint without worrying about infrastructure.
This is early and opinionated. I’m especially interested in feedback around:
- trust and security concerns
- where this breaks down
- whether this is useful beyond prototypes
Happy to answer questions about how it works.
Debugging conflicting U.S. sexual behavior surveys #
LocalPass offline password manager. Zero cloud. Zero telemetry #
SHDL – A minimal hardware description language built from logic gates #
I built SHDL (Simple Hardware Description Language) as an experiment in stripping hardware description down to its absolute fundamentals.
In SHDL, there are no arithmetic operators, no implicit bit widths, and no high-level constructs. You build everything explicitly from logic gates and wires, and then compose larger components hierarchically. The goal is not synthesis or performance, but understanding: what digital systems actually look like when abstractions are removed.
SHDL is accompanied by PySHDL, a Python interface that lets you load circuits, poke inputs, step the simulation, and observe outputs. Under the hood, SHDL compiles circuits to C for fast execution, but the language itself remains intentionally small and transparent.
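To illustrate the kind of gate-level composition SHDL is built around, here is a tiny plain-Python sketch (this is not SHDL or PySHDL syntax) of a half adder built from a single NAND primitive:

```
# Plain-Python illustration of building everything from one primitive gate.

def nand(a: int, b: int) -> int:
    return 0 if (a and b) else 1

def not_(a: int) -> int:
    return nand(a, a)

def and_(a: int, b: int) -> int:
    return not_(nand(a, b))

def xor(a: int, b: int) -> int:
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))

def half_adder(a: int, b: int) -> tuple[int, int]:
    """Returns (sum, carry), composed purely from the gates above."""
    return xor(a, b), and_(a, b)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, half_adder(a, b))
```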
This is not meant to replace Verilog or VHDL. It’s aimed at:
- learning digital logic from first principles
- experimenting with HDL and language design
- teaching or visualizing how complex hardware emerges from simple gates
I would especially appreciate feedback on:
- the language design choices
- what feels unnecessarily restrictive vs. educationally valuable
- whether this kind of “anti-abstraction” HDL is useful to you
Repo: https://github.com/rafa-rrayes/SHDL
Python package: PySHDL on PyPI
Thanks for reading, and I’m very open to critique.
Alprina – Intent matching for co-founders and investors #
We built Alprina to match people on what they want right now, not just who they are on paper.
On Alprina, you create "intents" in natural language (what you're looking for), join networks (communities where matching happens), and our AI matches you with people whose intents complement yours. You can attach context like pitch decks or profiles so that when you match, the other side immediately understands why you're reaching out.
Would love feedback from the HN community - especially on the balance between match precision and serendipity. Too strict and you miss interesting connections; too loose and it's just noise.
Ideon – An open source, infinite canvas for your project's segmentation #
Ideon is a self-hosted visual workspace designed to bridge the gap between the tools a project lives in. It doesn't replace your existing stack (GitHub, Figma, Notion, etc.) but provides a shared context where all these pieces live together on an infinite canvas.
We built this because projects often die from fragmentation—code is in one place, decisions in chat logs, and visuals in design tools. Ideon aims to keep the project "mentally navigable" for everyone involved.
Key features:
- Visual Blocks: Organize Repositories, Notes, Links, Files, and People spatially.
- State History: Track how decisions evolved with workspace snapshots.
- Multiplayer: Real-time collaboration.
- Self-hosted: Docker-based and AGPLv3 licensed.
Tech stack: Next.js, PostgreSQL, Docker.
Would love to hear your feedback on the approach!
PillarLabAI – A reasoning engine for prediction markets #
I built PillarLab to solve the 'black box' problem of AI. It uses 1,720+ proprietary 'Pillars' (analytical frameworks) to guide the AI through rigorous logic: things like Sharp Money tracking, xG soccer models, and Line Movement.
I’d love your feedback on the reasoning. Does the weighting of factors make sense to you? I'll be here to answer questions all day!
Agent OS – 0% Safety Violations for AI Agents #
Current frameworks let the LLM "decide" whether to follow safety rules. Agent OS inverts this: the kernel decides, the LLM computes.
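A minimal sketch of that inversion, with made-up names and an example policy (not Agent OS's actual API): the kernel enforces the rules deterministically, and the LLM's output is only ever a proposal.

```
# Illustrative sketch of the "kernel decides, LLM computes" pattern.

BLOCKED_TOOLS = {"delete_database", "send_payment"}  # example policy

def kernel_execute(proposed_action: dict) -> dict:
    """Enforce policy deterministically, regardless of what the LLM proposed."""
    if proposed_action["tool"] in BLOCKED_TOOLS:
        return {"status": "rejected", "reason": "policy violation"}
    return {"status": "executed", "tool": proposed_action["tool"]}

# The LLM only produces a proposal; it never gets to bypass the check.
proposal = {"tool": "send_payment", "args": {"amount": 100}}
print(kernel_execute(proposal))  # -> rejected
```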
AI Compass: Daily AI Search Signals and Trends #
I’m sharing AI Compass, a daily AI signal brief built for builders who want facts over hype.
AI moves fast. We track real-time signals across Google Trends and web news, then use AI clustering, denoising, and source attribution to surface what actually matters: model launches, company moves, and emerging terms.
You get a structured daily brief with traceable sources and clear takeaways, so you can understand the landscape in minutes.
Our priorities:
1. High signal-to-noise: only accurate, relevant items. No hype.
2. Objective and transparent: conclusions backed by traceable evidence.
3. Fast to consume: built for developers, PMs, and indie builders.
This is a new launch. Feedback, bug reports, and feature ideas are welcome.
Reactive Resume v5 – A free and open-source resume builder #
GitHub Action that analyzes CI failures using AI #
Marketplace: https://github.com/marketplace/actions/github-actions-failur...
The Poor Man's Guide to Cloud GPU Selection #
Taking the pre-training of LLMs as an example, it shows how the cost-optimal GPU changes depending on the computational intensity (∝ model size × batch size).
CSR vs. SSR Detector #
After work I built a small Chrome extension that detects whether a webpage is rendered using Server Side Rendering, Client Side Rendering, or a hybrid approach.
As a frontend developer I often wanted a quick way to check how a site is rendered without opening devtools and digging through network and DOM. This started as a personal after hours project and turned into something I use daily, so I decided to share it.
What it does:
- Detects SSR, CSR, or hybrid rendering
- Recognizes frameworks like Next.js, React, Nuxt, Gatsby, and others
- Shows basic performance timings like DOM ready and FCP
- Keeps the last 10 checks in local history
- Works fully locally with no data collection
Detection is based on 15+ indicators and works surprisingly well across modern stacks.
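For flavor, here is a rough guess at what one such indicator could look like; these heuristics are mine, not the extension's, and real pages need many more signals:

```
import requests

# Guess the rendering mode from the initial HTML alone (simplified heuristic):
# CSR apps tend to ship a near-empty root element, while SSR/hybrid apps ship
# meaningful markup or a framework hydration payload.
def guess_rendering(url: str) -> str:
    html = requests.get(url, timeout=15).text
    if '<div id="root"></div>' in html or '<div id="app"></div>' in html:
        return "likely CSR"
    if "__NEXT_DATA__" in html or "window.__NUXT__" in html:
        return "likely SSR/hybrid (framework hydration payload present)"
    return "likely SSR" if len(html) > 50_000 else "unclear"

print(guess_rendering("https://example.com"))
```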
Everything is open source. No tracking. No external servers. Just a lightweight dev tool.
I recently improved React 18 detection, fixed history display, added better error handling, and cleaned up docs and roadmap.
This is very much a side project made for fun and learning. If it helps even a few devs or SEO folks, I will be happy.
https://chromewebstore.google.com/detail/csr-vs-ssr-detector...
Feedback and suggestions are welcome. Thanks for checking it out.
ZigZag – Generate Markdown code reports from directories (Zig) #
I built ZigZag, a command-line tool written in Zig that recursively scans source code directories and generates a single markdown report containing the code and metadata.
It’s designed to be fast on large codebases and uses:
- Parallel directory and file processing
- A persistent on-disk cache to avoid re-reading unchanged files
- Different file reading strategies based on file size (read vs mmap)
- Timezone-aware timestamps in reports
Each directory produces a report.md with a table of contents, syntax-highlighted code blocks, file sizes, modification times, and detected language.
Repo: https://github.com/LegationPro/zigzag
I built this mainly for auditing and documenting large repositories. Feedback, critiques, and ideas are welcome.
SheetSage – A Linter for the Most Dangerous Programming Language #
The Technical Implementation:
Locale-aware parsing: Since Google Sheets doesn’t provide an AST for formulas, I had to build a conservative parser that tracks quotes, parens, and braces to extract function calls without getting poisoned by strings or array literals. It handles localized argument separators (, vs ;) and decimal separators (, vs .) based on the spreadsheet's locale.
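As a simplified illustration of that kind of conservative scan (Python rather than Apps Script, and only handling the string-literal case, not parens or braces):

```
import re

# Extract function names from a formula while ignoring anything inside
# string literals, so text like "VLOOKUP(" doesn't poison the scan.
def extract_functions(formula: str) -> list[str]:
    out, in_string, i = [], False, 0
    while i < len(formula):
        ch = formula[i]
        if ch == '"':
            in_string = not in_string
        elif not in_string:
            m = re.match(r"[A-Z][A-Z0-9_.]*\s*\(", formula[i:])
            if m:
                out.append(m.group(0).rstrip("( \t"))
                i += m.end() - 1
        i += 1
    return out

print(extract_functions('=IF(A1="VLOOKUP(", SUM(B1:B10), 0)'))  # ['IF', 'SUM']
```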
R1C1 Clustering: To avoid UI noise, I don't treat every cell as a unique finding. I normalize formulas using getFormulasR1C1() to identify templates that have been copied down. This allows the fix-all engine to refactor thousands of cells in one batch.
Systemic soft-cap scoring: Standard penalty-per-thousand metrics often under-react to widespread errors, so I implemented a continuous soft-cap model. It calculates union coverage for risks; if a critical error covers 40% of your workbook, your health score is soft-capped regardless of how many other healthy cells you have.
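A rough sketch of what a coverage-based soft cap can look like (the exact formula here is mine, not SheetSage's):

```
# Coverage-based soft cap: widespread critical errors bound the score
# no matter how many healthy cells remain.
def health_score(critical_coverage: float, minor_findings: int, total_cells: int) -> float:
    """critical_coverage is the fraction of cells touched by critical risks."""
    base = max(0.0, 100.0 - 100.0 * minor_findings / max(total_cells, 1))
    cap = 100.0 * (1.0 - critical_coverage)
    return min(base, cap)

# 40% of the workbook covered by a critical risk caps the score at 60,
# even though only 5 of 10,000 cells have minor findings.
print(health_score(critical_coverage=0.4, minor_findings=5, total_cells=10_000))
```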
Snapshot & Rollback: Since I’m mutating user data, I implemented a SnapshotService that writes original formulas to a hidden SheetSage_SNAPSHOT sheet before any bulk fix. This provides a native "Undo" even after the Apps Script execution finishes.
Privacy: No spreadsheet data ever leaves the Google environment. The audit engine runs entirely in Apps Script. The only external call is a signed HMAC request to a Vercel/Next.js billing service to verify subscription entitlements via a stable clientId.
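The HMAC signing itself is the standard pattern; a minimal sketch of it (the field names and shared-secret scheme below are assumptions, not the exact protocol):

```
import hashlib
import hmac
import json
import time

SECRET = b"shared-secret"  # assumed: shared between the add-on and billing service

def sign_request(client_id: str) -> dict:
    # Sign the payload so the billing service can verify it wasn't tampered with.
    payload = json.dumps({"clientId": client_id, "ts": int(time.time())})
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": sig}

def verify(payload: str, signature: str) -> bool:
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

req = sign_request("client-123")
print(verify(req["payload"], req["signature"]))  # True
```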
I'd love to discuss the heuristics I'm using to distinguish magic numbers from legitimate constants (like 24 for hours), and how I'm handling LockService to prevent race conditions during bulk refactoring.
Personal website, but imagine I'm messaging you #
I just finished my personal website and wanted someone to see it. I've been developing and pixel-perfecting it in the background as a side thing.
I'd be happy to get feedback, thanks.
InsAIts V2 – Real-time monitoring for multi-agent AI communication #
- Anchor-aware detection (set the user's original query as context to reduce false positives)
- Forensic root-cause tracing + ASCII chain visualization
- Built-in domain dictionaries (finance, healthcare, Kubernetes, ML, devops, quantum)
- Local (Ollama) decipher mode that translates agent jargon to human-readable text (cloud soon)
- Integrations: Slack alerts, Notion/Airtable export, LangGraph/CrewAI wrappers
Privacy-first: local embeddings by default; nothing leaves your machine unless you opt into cloud decipher. The free tier works without an API key (local only). Also running limited lifetime deals for early supporters.
Quick install: pip install insa-its[full]
Demos included:
- Live terminal dashboard
- Marketing team agent simulation (watch shorthand emerge in real time)
GitHub: https://github.com/Nomadu27/InsAIts
PyPI: https://pypi.org/project/insa-its/
Docs: https://insaitsapi-production.up.railway.app/docs
Would love feedback, especially from anyone building agent crews or running multi-LLM systems in production. What’s your biggest pain point with agent observability? Thanks for checking it out!
Cristian
Python SDK for RamaLama AI Containers #
Hey, I’m one of the maintainers of RamaLama[1], which is part of the containers ecosystem (podman, buildah, skopeo). It’s a runtime-agnostic tool for coordinating local AI inference with containers.
I put together a Python SDK for programmatic control over local AI using ramalama under the hood. Being runtime-agnostic, you can use ramalama with llama.cpp, vLLM, mlx, etc., so long as the underlying service exposes an OpenAI-compatible endpoint. This is especially powerful for users deploying to edge or other devices with atypical hardware/software configurations that, for example, require custom runtime compilations.
```
from ramalama_sdk import RamalamaModel

runtime_image = "quay.io/ramalama/ramalama:latest"
model_ref = "huggingface://ggml-org/gpt-oss-20b-GGUF"

# The context manager pulls the model and runtime image as needed and
# tears the local inference service down on exit.
with RamalamaModel(model_ref, base_image=runtime_image) as model:
    response = model.chat("How tall is Michael Jordan?")
    print(response["content"])
```

This SDK manages:
- Pulling and verifying runtime images
- Downloading models (HuggingFace, Ollama, ModelScope, OCI registries)
- Managing the runtime process
It works with air-gapped deployments and private registries, and also has async support. If you want to learn more, the documentation is available here: https://docs.ramalama.com/sdk/introduction.
Otherwise, I hope this is useful to people out there and would appreciate feedback about where to prioritize next, whether that's specific language support, additional features (speech to text? RAG? MCP?), or something else.
1. github.com/containers/ramalama