Daily Show HN

Show HN for May 12, 2026

28 posts
773

Needle: We Distilled Gemini Tool Calling into a 26M Model

github.com
211 comments, 6:03 PM
Hey HN, Henry here from Cactus. We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.

We were always frustrated by the little effort made towards building agentic models that run on budget phones, so we conducted investigations that led to an observation: agentic experiences are built upon tool calling, and massive models are overkill for it. Tool calling is fundamentally retrieval-and-assembly (match query to tool name, extract argument values, emit JSON), not reasoning. Cross-attention is the right primitive for this, and FFN parameters are wasted at this scale.
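To make the "retrieval-and-assembly" framing concrete, here is a toy, rule-based sketch of the three steps named above (match query to tool name, extract argument values, emit JSON). It is not how Needle works internally: the tool registry, keyword sets, and regexes are all hypothetical and exist only for illustration.

```python
import json
import re

# Hypothetical tool registry: trigger keywords plus one regex per argument.
TOOLS = {
    "set_timer": {
        "keywords": {"timer", "countdown"},
        "args": {"minutes": r"(\d+)\s*min"},
    },
    "send_message": {
        "keywords": {"text", "message"},
        "args": {"recipient": r"\bto\s+(\w+)"},
    },
}


def call_tool(query: str) -> str:
    """Match the query to a tool, extract its arguments, and emit JSON."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    # Step 1 (retrieval): pick the tool whose keywords overlap the query most.
    name = max(TOOLS, key=lambda t: len(TOOLS[t]["keywords"] & words))
    # Step 2 (extraction): pull each argument value out of the query text.
    args = {}
    for arg, pattern in TOOLS[name]["args"].items():
        match = re.search(pattern, query, flags=re.IGNORECASE)
        if match:
            args[arg] = match.group(1)
    # Step 3 (assembly): emit the call as JSON.
    return json.dumps({"tool": name, "arguments": args})


print(call_tool("Set a timer for 10 minutes"))
```

None of the three steps needs stored world knowledge, since everything required is already in the query and the tool list. That is the intuition behind the claim, not a demonstration of it.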

Simple Attention Networks: the entire model is just attention and gating, no MLPs anywhere. Needle is an experimental run for single-shot function calling for consumer devices (phones, watches, glasses...).

Training:

- Pretrained on 200B tokens across 16 TPU v6e (27 hours)
- Post-trained on 2B tokens of synthesized function-calling data (45 minutes)
- Dataset synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)

You can test it right now and finetune on your Mac/PC: https://github.com/cactus-compute/needle

The full writeup on the architecture is here: https://github.com/cactus-compute/needle/blob/main/docs/simp...

We found that the "no FFN" finding generalizes beyond function calling to any task where the model has access to external structured knowledge (retrieval-augmented generation, tool use). The model doesn't need to memorize facts in FFN weights if the facts are provided in the input. Experimental results are still to be published.

While it beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, LFM2.5-350M on single-shot function calling, those models have more scope/capacity and excel in conversational settings. We encourage you to test on your own tools via the playground and finetune accordingly.

This is part of our broader work on Cactus (https://github.com/cactus-compute/cactus), an inference engine built from scratch for mobile, wearables and custom hardware. We wrote about Cactus here previously: https://news.ycombinator.com/item?id=44524544

Everything is MIT licensed. Weights: https://huggingface.co/Cactus-Compute/needle GitHub: https://github.com/cactus-compute/needle

126

Statewright – Visual state machines that make AI agents reliable

github.com
56 comments, 2:24 PM
Agentic problem solving in its current state is very brittle. I fell in love with it, but it creates as many problems as it solves.

I'm Ben Cochran. I have spent 20+ years in the trenches of full-stack engineering, DevOps, high-performance computing, and ML, with stints at NVIDIA, AMD, and various other organizations, most recently as a Distinguished Engineer.

For agents to work reliably you either need massive parameter counts or massive context windows to keep the solution spaces workable. Most people are brute forcing reliability with bigger models and longer prompts.

What if I made the problem smaller instead of making the model bigger?

I took a different approach by using smaller models: I took models in the 13-20B parameter range and set them to the task of solving real SWE-bench problems. I constrained the tool and solution spaces using formal state machines. Each state in the machine defines which tools the model can access, how many iterations it gets, and what transitions are valid. A planning state gets read-only tools. An implementation state gets edit tools (scoped to prevent mega edits) and write-friendly bash tools. The testing state gets bash, but only for testing commands. The model cannot physically skip steps or use the wrong tool at the wrong time. It is enforced via protocol, not via prompts.
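The per-state constraints described above (allowed tools, iteration budget, valid transitions) can be sketched as a small data structure plus two checks. This is a minimal illustration under assumed names, not Statewright's engine or definition format: the state names, tool names, and budgets are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    tools: frozenset        # tools the model may call in this state
    max_iterations: int     # tool calls allowed before the budget is exhausted
    transitions: frozenset  # states that may legally follow this one


# Hypothetical workflow mirroring the plan -> implement -> test description.
WORKFLOW = {
    "plan": State(frozenset({"read_file", "search"}), 5, frozenset({"implement"})),
    "implement": State(frozenset({"edit_file", "bash_write"}), 10, frozenset({"test"})),
    "test": State(frozenset({"bash_test"}), 5, frozenset({"implement", "done"})),
    "done": State(frozenset(), 0, frozenset()),
}


@dataclass
class Machine:
    current: str = "plan"
    used: int = 0

    def call_tool(self, tool: str) -> None:
        """Reject any tool call the current state does not permit."""
        state = WORKFLOW[self.current]
        if tool not in state.tools:
            raise PermissionError(f"{tool!r} is not allowed in state {self.current!r}")
        if self.used >= state.max_iterations:
            raise RuntimeError(f"iteration budget exhausted in state {self.current!r}")
        self.used += 1

    def transition(self, target: str) -> None:
        """Move to another state only along a declared transition."""
        if target not in WORKFLOW[self.current].transitions:
            raise ValueError(f"cannot go from {self.current!r} to {target!r}")
        self.current, self.used = target, 0
```

Because the checks run in ordinary code outside the model, a disallowed call fails deterministically regardless of what the prompt says. Note that `test` can loop back to `implement`, which is the retry path a DAG could not express.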

The results were more promising than I would have expected. The improvements were consistent across multiple model families, irrespective of age (qwen-coder, gpt-oss, gemma4), above the 13B parameter inflection point. Below that, models can navigate the state machine but can't retain enough context to produce accurate edits. More on the research bit: https://statewright.ai/research

Surprisingly this yielded improvements in frontier models as well. Haiku and Sonnet start to punch above their weight and Opus solves more reliably with fewer tokens and death spirals. Fine tuning did not yield these kinds of functional improvements for me. The takeaway it seems is that context window utilization matters more than raw context size - a tightly scoped working context at each step outperforms a model given carte blanche over everything. Constraining LLMs which are non-idempotent by using deterministic code is a pattern that nobody is currently talking about.

So, I built Statewright. Its core is a Rust engine that evaluates state machine definitions: states, transitions, guards and tool restrictions. Its orchestration doesn't use an LLM, just enforces the state machine. On top of that is a plugin layer that integrates with Claude Code (and soon Codex, Cursor and others) via MCP. When you activate a workflow, hooks enforce the guardrails per state automatically. The model sees 5 tools available instead of dozens, gets clear instructions for the current phase and transitions when conditions are met. Importantly it tells the model when it's attempting to do something that isn't in scope, incorrect or when it needs to try something else after getting stuck.

You can use your agent via MCP to build a state machine for you to solve a problem in your current context. The visual editor at statewright.ai lets you tweak these workflows in a graph view... You can clearly see the failure paths, the retry loops and the approval gates. State machines aren't DAGs; they loop and retry, which is what agentic work actually needs.

Statewright is currently live with a free tier. Try it out in Claude Code by running the following:

/plugin marketplace add statewright/statewright

/plugin install statewright

/reload-plugins

Then "start the bugfix workflow" or /statewright start bugfix. You'll need to paste your API key when prompted. The latest versions of Claude may complain -- paste the API key again and say you really mean it, Claude is just being cautious here.

Feedback is welcome on the workflow editor and the plugin experience, and I'd like to hear which workflows you'd want to build first. Agents are suggestions, states are laws.

97

Agentic interface for mainframes and COBOL

hypercubic.ai
50 comments, 5:10 PM
Hi HN, we’re Sai and Aayush, and we’re building Hypercubic (https://www.hypercubic.ai/). Today we’re launching Hopper, an agentic development environment for the mainframe.

Mainframes still run a surprising amount of critical infrastructure: banking, payments, insurance, airlines, government programs, logistics, and core operations at large institutions. Many of these systems are decades old, but they continue to process enormous transaction volumes because they are reliable, secure, and deeply embedded into business operations.

A lot of that software is written in COBOL and runs on IBM z/OS. The development environment looks very different from modern cloud or Unix-style development. Instead of GitHub, shell commands, package managers, and CI pipelines, developers often work through TN3270 terminal sessions, ISPF panels, partitioned datasets, JCL, JES queues, spool output, return codes, VSAM files, CICS transactions, and shop-specific conventions.

TN3270 is the terminal interface used to interact with many IBM mainframe systems. ISPF is the menu and panel system developers use inside that terminal to browse datasets, edit source, submit jobs, and inspect output. It is powerful and reliable, but it was designed for expert humans navigating screens, function keys, and fixed-width workflows, not AI agents.

A simple COBOL change might require finding the right source member, checking copybooks, locating compile JCL, submitting a job, reading JES/SYSPRINT output, interpreting condition codes, patching fixed-width source, and resubmitting.

A chatbot next to a terminal is not enough. The agent needs to operate inside the mainframe environment.

Hopper combines three things:

1. A real TN3270 terminal
2. Mainframe-aware panels for datasets, members, jobs, and spool output
3. An AI agent that can operate across those z/OS surfaces

For example, here is a tiny version of the kind of thing Hopper can help debug:

  cobol

   IDENTIFICATION DIVISION.
   PROGRAM-ID. PAYCALC.

   DATA DIVISION.
   WORKING-STORAGE SECTION.
   01  CUSTOMER-BALANCE     PIC 9(7)V99.

   PROCEDURE DIVISION.
       ADD 100.00 TO CUSTOMER-BALNCE
       DISPLAY "UPDATED BALANCE: " CUSTOMER-BALANCE
       STOP RUN.


  jcl

    //PAYCOMP  JOB (ACCT),'COMPILE',CLASS=A,MSGCLASS=X
    //COBOL    EXEC IGYWCL
    //COBOL.SYSIN DD DSN=USER1.APP.COBOL(PAYCALC),DISP=SHR
    //LKED.SYSLMOD DD DSN=USER1.APP.LOAD(PAYCALC),DISP=SHR

A human would submit this job, inspect JES output, open `SYSPRINT`, find the undefined `CUSTOMER-BALNCE`, map it back to the source, patch the member, and resubmit.

Hopper is designed to let an agent operate through that same loop autonomously.
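One step of that loop, mapping a compiler diagnostic back to the source, can be sketched in a few lines of Python. This is an illustration only, not Hopper's implementation. The `SYSPRINT` text below is an approximation written for this example rather than output captured from a real job, and the message wording it matches on is an assumption.

```python
import difflib
import re

# Approximate compiler listing excerpt written for this example (not real job output).
SYSPRINT = '''\
LineID  Message code  Message text
     9  IGYPS2121-S   "CUSTOMER-BALNCE" was not defined as a data-name.  The statement was discarded.
'''

SOURCE = '''\
       WORKING-STORAGE SECTION.
       01  CUSTOMER-BALANCE     PIC 9(7)V99.
       PROCEDURE DIVISION.
           ADD 100.00 TO CUSTOMER-BALNCE
'''


def suggest_fixes(sysprint: str, source: str) -> dict:
    """Map each undefined data-name in the listing to the closest defined name."""
    # Data items are declared as a two-digit level number followed by a name.
    defined = re.findall(r"^\s*\d{2}\s+([A-Z0-9-]+)", source, flags=re.MULTILINE)
    undefined = re.findall(r'"([A-Z0-9-]+)" was not defined as a data-name', sysprint)
    fixes = {}
    for name in undefined:
        close = difflib.get_close_matches(name, defined, n=1)
        if close:
            fixes[name] = close[0]
    return fixes


print(suggest_fixes(SYSPRINT, SOURCE))
```

A real agent would still have to fetch the spool output, patch the fixed-width member, and resubmit; the sketch covers only the "find the undefined name and map it back" part.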

Hopper is not trying to hide the mainframe behind a generic abstraction. It is not a chatbot pasted onto a terminal. The design principle is simple: preserve the fidelity of the mainframe environment, but make it accessible to AI agents.

Sensitive operations require approval, and the terminal remains visible at all times.

Once agents can operate inside the mainframe environment, new workflows become possible: faster job debugging, automated documentation, safer code changes, test generation, migration planning, traffic replay, and modernization verification.

At Hypercubic, we are building AI-native infrastructure for the full lifecycle of legacy modernization: understanding, operating, transforming, and testing. Hopper is one part of that platform.

Visit our site to download Hopper: https://www.hypercubic.ai/hopper

Here’s a demo video of Hopper in action: https://youtu.be/q81L5DcfBvE

You can also request access and immediately get a mainframe user account to play with.

We’re curious to hear your thoughts, especially from anyone who has worked with mainframes or COBOL, or has done legacy enterprise modernization.

61

Gigacatalyst – Extend your SaaS with an embedded AI builder

27 comments, 4:32 PM
Hi HN, I’m Namanyay from Gigacatalyst (link: https://gigacatalyst.com/). Gigacatalyst allows sales, CS, and users to build one-off features, so your SaaS can support long-tail customer workflows and engineers aren’t pulled away from the roadmap.

When you sell software to large businesses, you realize that each customer needs their own workflow and features. Traditionally, this either means long engineering roadmaps or the customers end up using workarounds.

But what if everyone could build their critical missing features just by talking to an AI? That’s what we do at Gigacatalyst. We provide an AI customization layer for your customers, CS team, and sales team to build these missing critical workflows without needing any engineers at all. Think Lovable, but built on top of YOUR platform.

We connect to your product's APIs, learn your data model and design system, and let non-technical users build governed apps via natural language - inside your product, under your brand.

Here’s what it looks like in action: https://www.youtube.com/watch?v=_taSpSphH6E

One of our customers, a Series B company, saw their users (not engineers - managers, ops people, facility directors) build critical workflows like:

- Parts stockout prevention: A maintenance manager typed "show me which parts will run out in the next 2 weeks based on usage over the last 90 days, accounting for vendor lead times." The app tracks consumption velocity, forecasts stockouts, and alerts before it's too late. He says it's prevented ~$500K in emergency downtime.

- Invoice OCR from phone photos: Technicians kept losing paper invoices. The prompt: "upload a photo of the invoice, extract vendor name, date, amount, and line items, then match it to the purchase order and flag discrepancies." Now techs snap a photo on-site to automatically add to the system of record.

- Restaurant emergency triage: A pizza chain's facilities manager was drowning in maintenance requests. He built a priority matrix: "walk-in freezer not cooling" auto-routes as CRITICAL, "dining room light flickering" goes to LOW. He's now able to manage backlogs with the correct priority.
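The arithmetic behind the stockout prompt in the first example is simple enough to write out. The sketch below is one plausible reading of "accounting for vendor lead times" (flag a part if it runs out within the two-week horizon, or before a new order could arrive). It is not the app Gigacatalyst generated, and the function and parameter names are hypothetical.

```python
def days_until_stockout(on_hand: float, used_last_90_days: float) -> float:
    """Days of stock left at the average daily consumption rate."""
    daily_usage = used_last_90_days / 90
    if daily_usage == 0:
        return float("inf")  # a part that is never used never runs out
    return on_hand / daily_usage


def needs_reorder(on_hand, used_last_90_days, lead_time_days, horizon_days=14):
    """Flag a part that runs out within the horizon, or before a reorder could arrive."""
    remaining = days_until_stockout(on_hand, used_last_90_days)
    return remaining <= max(horizon_days, lead_time_days)
```

For example, 40 units on hand with 270 used over 90 days means 3 units per day and about 13.3 days of stock, so the part is flagged even with a short lead time.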

How Gigacatalyst works under the hood:

1. Agentic API discovery: Our agents go through your app and parse your endpoints, query params, request/response shapes, and sample data to build the base layer.

2. Generation and Validation: When a user describes what they want, our AI generates an app. We set up multiple validation steps, including static checks, runtime error analysis, and LLM-as-a-judge.

3. Sandboxing and Compilation: We wrote our own compilation and sandboxing framework to get the fastest speeds and lowest costs. This means that users can interact with the built app in seconds.

4. Proxy layer: We create a proxy layer for all APIs to handle auth, tenant isolation, and rate limiting. Everything the agent has access to is controlled, logged, observed, and version controlled.
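As a rough picture of what the proxy layer in step 4 has to do, here is a minimal in-memory sketch combining tenant isolation, a sliding-window rate limit, and an audit log. It is an assumption-laden toy, not Gigacatalyst's proxy: the class, its parameters, and the record layout are hypothetical, and auth is reduced to a trusted tenant name.

```python
import time


class TenantProxy:
    """Toy proxy: per-tenant rate limit, tenant ownership check, audit log."""

    def __init__(self, records, max_calls, window_seconds, clock=time.monotonic):
        self.records = records        # {record_id: {"tenant": ..., "data": ...}}
        self.max_calls = max_calls    # calls allowed per tenant per window
        self.window = window_seconds
        self.clock = clock            # injectable so the limiter can be tested
        self.calls = {}               # tenant -> timestamps of recent calls
        self.log = []                 # audit trail of (tenant, record_id, outcome)

    def get(self, tenant, record_id):
        now = self.clock()
        # Keep only the calls that still fall inside the sliding window.
        recent = [t for t in self.calls.get(tenant, []) if now - t < self.window]
        if len(recent) >= self.max_calls:
            self.calls[tenant] = recent
            self.log.append((tenant, record_id, "rate_limited"))
            raise RuntimeError("rate limit exceeded")
        recent.append(now)
        self.calls[tenant] = recent
        # Tenant isolation: a record is visible only to the tenant that owns it.
        record = self.records.get(record_id)
        if record is None or record["tenant"] != tenant:
            self.log.append((tenant, record_id, "denied"))
            raise PermissionError("record not visible to this tenant")
        self.log.append((tenant, record_id, "ok"))
        return record["data"]
```

Missing records and foreign records raise the same error on purpose, so a caller cannot probe which record IDs exist in another tenant.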

After 2000+ daily users, 900+ apps built, and 70% 30-day retention, today we're opening a public demo.

Try it: https://app.gigacatalyst.com/ - enter your SaaS product's API URL (or just the homepage) and start prompting.

If you're serving a variety of use cases, you probably deal with a lot of custom requests and Gigacatalyst will save you time and increase your bottom line. Book a meeting at https://gigacatalyst.com/#contact and I'll help your team and customers build new functionality on top of your platform.

I've been reading Hacker News since I was 12 years old. I'm proud to launch for all of you and I want to hear your feedback on my product and comments!

18

Safe-install – safer NPM installs with trusted build dependencies

npmjs.com
5 comments, 12:30 AM
In light of the ongoing npm supply chain compromises, I built safe-install:

https://www.npmjs.com/package/@gkiely/safe-install

It brings a couple of protections that I wanted from npm but that are not built in.

Similar to Bun’s trusted dependencies, it lets you disable install scripts by default and define a list of dependencies that are allowed to run build/install scripts:

https://bun.com/docs/guides/install/trusted
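The allowlist idea can be sketched independently of any particular tool. The function below takes already-parsed `package.json` manifests and reports packages that declare install-time lifecycle scripts without being on the trusted list. It is a sketch of the concept, not safe-install's code or interface; the function name and inputs are hypothetical.

```python
# npm lifecycle scripts that run automatically during installation.
LIFECYCLE_SCRIPTS = ("preinstall", "install", "postinstall")


def untrusted_install_scripts(manifests, trusted):
    """Return {package: [script names]} for install scripts outside the allowlist."""
    flagged = {}
    for manifest in manifests:
        declared = manifest.get("scripts", {})
        scripts = [name for name in LIFECYCLE_SCRIPTS if name in declared]
        if scripts and manifest["name"] not in trusted:
            flagged[manifest["name"]] = scripts
    return flagged
```

Scripts such as `test` or `build` are ignored because they do not run on install.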

It also supports blocking exotic sub-dependencies, similar to pnpm’s `blockExoticSubdeps` setting:

https://gajus.com/blog/3-pnpm-settings-to-protect-yourself-f...
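A check for exotic sub-dependencies could look roughly like the sketch below, which scans the manifests of installed packages for dependency specifiers that point at git repositories or URLs instead of the registry. The notion of "exotic" used here is an assumption made for illustration, not pnpm's exact definition, and the function names are hypothetical.

```python
import re

# Specifier prefixes treated as exotic in this sketch (an assumption).
EXOTIC_PREFIXES = ("git+", "git://", "github:", "gitlab:", "bitbucket:", "http://", "https://")
# GitHub shorthand such as "user/repo" or "user/repo#branch".
SHORTHAND = re.compile(r"[\w.-]+/[\w.-]+(#.+)?")


def is_exotic(specifier: str) -> bool:
    """True if the specifier resolves outside the registry."""
    return specifier.startswith(EXOTIC_PREFIXES) or bool(SHORTHAND.fullmatch(specifier))


def exotic_subdeps(installed_manifests):
    """Return (package, dependency, specifier) for each exotic specifier found."""
    found = []
    for manifest in installed_manifests:
        for dep, spec in manifest.get("dependencies", {}).items():
            if is_exotic(spec):
                found.append((manifest["name"], dep, spec))
    return found
```

Only manifests of installed packages are passed in, so everything reported is a sub-dependency from the project's point of view.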

I was hoping npm would eventually add something like this, but it does not seem to be happening soon, so I made a small package for it.

4

Java/Spring Boot Idempotency Library

github.com
1 comment, 8:23 AM
Idempotency4j is a Java idempotency library with pluggable storage backends and Spring Web / Spring Boot support.

This library ensures that sensitive endpoints do not trigger side effects multiple times, which is especially useful for endpoints that handle financial operations. Currently, the library supports Spring MVC (Servlet-based) applications and, for storage backends, MySQL and PostgreSQL via JDBC. It is very simple to integrate: all you have to do is add @Idempotent to any endpoint that needs idempotency.
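For readers new to the pattern, the core behavior is "run the operation once per key, then replay the stored response". The Python sketch below shows that idea with an in-memory store. It is not Idempotency4j's implementation or API (which is Java and database-backed); the class and method names are hypothetical.

```python
import threading


class IdempotencyStore:
    """In-memory stand-in for a database-backed idempotency key table."""

    def __init__(self):
        self._responses = {}
        self._lock = threading.Lock()

    def run_once(self, key, operation):
        """Run operation for a new key; replay the stored response for a repeat."""
        # Holding one lock while the operation runs keeps the sketch simple,
        # at the cost of serializing all requests.
        with self._lock:
            if key not in self._responses:
                self._responses[key] = operation()
            return self._responses[key]
```

A client that retries a payment request with the same key gets the original response back, and the side effect happens only once.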

Full explanation of the functionality and configuration is available in the readme of the repository. Repository: https://github.com/josipmusa/idempotency4j

I would love any feedback or review regarding the implementation - also, any recommendation is welcome.

3

OpenClaw OS – OSS Claude Cowork Built on Top of OpenClaw

github.com
0 comments, 5:56 PM
Hi HN, we made an OSS Claude Cowork, built as an OpenClaw plugin. It lets you create live artifacts (like Claude) that connect to data sources instead of datasets (e.g., fetching Stripe data automatically).

Other tools (Paperclip, Multica) focus on task management, but our vision is to build one screen that feels like the SaaS tools you already love using.

It’s OSS. Feedback is highly appreciated.

2

GIF Pile – a site to make piles of GIFs

gifpile.com
0 comments, 9:11 PM
I'm quite fond of obnoxious looking gifs in a post-ironic way as a manner of shitposting and or injecting humor into a chat. The issue with this however is that, for no real good reason at all, the simple usecase of "Have image/gif background, bombard with garbage" had no real good tooling.

There's gif editors out there, EZgif my beloved is probably my most used non-search-indexing-slash-social-media-site, but they're kinda clunky for my specific usecase of making digital eye-sandpaper bombastic garbage. Other options are bleak and gave me the mark of the beast via shitty watermarks. I just wanted a pile of gifs on top of each other, and thus far the "easiest" way was to bust open a video editor, muck around with it, mess up exporting as a gif directly, get mad, export it as a 4 second mp4, and then use ffmpeg to get it working. is this probably moronic? yes. am I likely to have missed a decent tool? yes. Did I give up looking after sending 4 dollars to some Indian guy for "No watermarks ever for 4$", only for that "ever" to be a year, and then the clunky weird af login process not work? absolutely. (Fuck you, you know who you are)

This took me a few hours (most of which was dealing with the fact I don't do webshit normally and the clunk that one would expect from that), and is a minimal site for my personal minimal usecase. It's static because I'm not going to deal w/ hosting other people's shit and I don't want to deal with that can of worms. all processing is done locally on your browser. Yes, this means that using a 4k image as a base layer for your gif pile will make it take an age. It'll work eventually though.

This will never have a watermark unless I'm bought out (total investment thus far has been 14 bucks, 4 of which was that one dude fucking me), in which case I probably earned it. at most I'll likely throw adsense on there at some point to scrape a few cents from the people who can't figure out adblock if it gets popular enough for me to warrant it.

There's no timelines or anything like that. literally just a pile of gifs. thus far my primary usecase has been overlaying text gifs from the various fancy text generator sites onto glitter backgrounds with uncomfortable rat GIFs to call people poor on the internet. this makes me happy.

There's likely to be obvious UI, UX, or other U-whatever fuckups. If you point them out and I deem it pedantic I'll probably laugh at you. if it's helpful I'll probably implement it when I get a bit.

Surprisingly, works on mobile. CSS is exceedingly generic and soulless atm, just went off vague memories of ss13's TGUI. I'll likely scrap the CSS entirely and go full neocities at some point because that's more soulful.

1

DualDoc – A text editor for the AI age

dualdoc.xyz
0 comments, 4:34 PM
I'm a writer and I often ask AI to draft things for me. Rather than trusting the results outright, I've been trying to edit side-by-side using various tools. Instead of retrofitting something, I made a web app that lets me edit my text while keeping a draft, notes, or information in another pane.

1

CircadianLab – Browser-based radiosity lighting simulator (WebGPU)

innerscene.com
0 comments, 4:35 PM
We built CircadianLab, a free in-browser tool for lighting design. It uses WebGPU shaders to calculate melanopic equivalent daylight illuminance (mel-EDI, formerly EML, relevant for circadian rhythm and WELL v2 compliance), photopic illuminance, UGR glare per CIE 117, and actual daylight through windows and skylights.

It has a database of 400,000+ IES files from 81 lighting manufacturers and supports easy sharing of scenes/configs with other people.

Daylight uses the NREL Solar Position Algorithm and a Perez clear / intermediate / overcast sky model, mixed into the same radiosity solve as electric fixtures. Sun and sky are added as initial flux on every patch, then bounced through the same form-factor network. Form factors are estimated via Monte Carlo with visibility tests, followed by a Gauss-Seidel 3-bounce iterative solve.
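For readers unfamiliar with radiosity, the solve step can be sketched on the CPU in a few lines. This is a generic Gauss-Seidel iteration of the classic radiosity equation B_i = E_i + rho_i * sum_j F_ij * B_j, not CircadianLab's WebGPU code. Treating each sweep as one bounce is a simplification, and the form factors are passed in as a precomputed matrix instead of being estimated by Monte Carlo.

```python
def solve_radiosity(emission, reflectance, form_factors, sweeps=3):
    """Gauss-Seidel sweeps of B_i = E_i + rho_i * sum_j F_ij * B_j."""
    radiosity = list(emission)  # start from the initial (direct) flux on each patch
    n = len(radiosity)
    for _ in range(sweeps):
        for i in range(n):
            # Gather light from every other patch, using values already
            # updated earlier in this sweep (the Gauss-Seidel part).
            gathered = sum(form_factors[i][j] * radiosity[j] for j in range(n) if j != i)
            radiosity[i] = emission[i] + reflectance[i] * gathered
    return radiosity


# Two facing patches: patch 0 emits, both reflect half of what they receive.
print(solve_radiosity([1.0, 0.0], [0.5, 0.5], [[0.0, 1.0], [1.0, 0.0]]))
```

In this two-patch example the exact answer is B = [4/3, 2/3]; three sweeps already reach [1.3125, 0.65625], which shows why a small fixed number of sweeps can be a reasonable speed trade-off.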

My company (Innerscene) makes "daylight" luminaires. We built CircadianLab to address a specific gap (no browser-based way to verify WELL v2 mel-EDI compliance before specifying), but it works with any IES file from any manufacturer, not just our line. An accurate mel-EDI calculation requires SPDs, but CCT is a good proxy.

Write-up + demo videos: https://www.innerscene.com/blog/introducing-circadian-lab

Of interest to the HN crowd: there are measurable productivity/performance enhancements that can be achieved with higher mel-EDIs, which is why the design community is now incorporating this into building design. If you haven't dug into any of the research before, check out: https://www.innerscene.com/research?topic=Workplace+Performa...

Here is an example scene with a classroom and sunlight coming through a window, showing foot-candles as a heatmap:

3D view: https://www.innerscene.com/tools/circadian-lab?share=b104262...
2D view: https://www.innerscene.com/tools/circadian-lab?share=456ad19...

Happy to answer questions on anything related.