Heck, I might sooner take NatGeo Claude 4.6 over MacTech GPT-1, LOL. The reasoning/logic skills would be so much better on a modern model versus an early GPT model; even having to re-train it would be fairly trivial. When I did my IE5 retro prompt project (probably Sonnet 4.5 at the time), it did not know anything about IE5 CSS box hacks, etc., but I literally wrote docs to train it, and I had it doing that stuff in short order. I think you're giving almost too much credit to the dataset. Ironically, some of my other hobbies deal with literal physical logic: electro-mechanical/relay logic, or even modular synths where I've got a logic module with XOR/OR/AND type functions. How well can it piece together logical building blocks?

And here I disagree. Suppose you had a Mac programming question and a choice of LLMs: GPT-1 trained against Inside Macintosh and MacTech magazine, or the latest Claude trained against National Geographic? The primary value of an LLM is not in its algorithms but in its dataset. Hell, I'd be delighted if I could use Google Search from 20 years ago with a complete index, including comp.sys.mac.* Usenet posts.
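On the "logical building blocks" question a few posts up, a toy sketch (my own illustration, not anything a model produced): composing XOR, and then a half adder, out of nothing but NAND, the same way relay logic or a synth logic module chains primitives:

```python
# Toy sketch: building larger logic from one primitive gate, the way
# relay logic or a modular-synth logic module chains XOR/OR/AND blocks.
def nand(a: bool, b: bool) -> bool:
    return not (a and b)

# Classic construction: XOR from four NAND gates.
def xor(a: bool, b: bool) -> bool:
    n1 = nand(a, b)
    return nand(nand(a, n1), nand(b, n1))

# A half adder is then just XOR (sum bit) plus AND (carry bit).
def half_adder(a: bool, b: bool) -> tuple[bool, bool]:
    return xor(a, b), a and b

assert half_adder(True, True) == (False, True)  # 1 + 1 = 0 carry 1
```

Whether a model can chain blocks like this on its own, rather than recall a memorized circuit, is pretty much the question being argued in this thread.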
Flipside: you could be giving the training data too much credit, especially when you go into unknown territory or really obscure stuff. You'll be creating a lot more guardrails for it, but it's not just copying off a cheat sheet. For example, now that I'm working on FTP, I even asked: is there a BSD/MIT-licensed FTP client we could just use for this? And it basically said "they're all filled with POSIX calls; FTP is only like ~15-16 commands, it'd be easier just to write it ourselves natively in OpenTransport." Most of this stuff is just known equations, plus applying a given platform's limitations and programming syntax around them. It doesn't always have to be copying someone else's homework.

I don't think it's that simple. Sometimes that effort happens and sometimes it doesn't. And sometimes "their own" has to do some heavy lifting...
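For a sense of how small that command set really is, here's a hedged, hypothetical sketch (plain Python, nothing to do with the actual OpenTransport client being discussed; `format_command` and `parse_reply` are made-up names): the RFC 959 control channel is just CRLF-terminated ASCII verbs plus 3-digit reply codes:

```python
# Hedged sketch: the FTP control channel (RFC 959) is a handful of
# text commands. This only shows the wire framing, not a full client.

CORE_COMMANDS = {
    "USER", "PASS", "TYPE", "PASV", "PORT", "RETR", "STOR",
    "LIST", "CWD", "PWD", "DELE", "MKD", "RMD", "RNFR", "RNTO", "QUIT",
}

def format_command(verb: str, arg: str = "") -> bytes:
    """FTP commands are CRLF-terminated ASCII lines: 'RETR file.txt\\r\\n'."""
    line = f"{verb} {arg}".strip()
    return line.encode("ascii") + b"\r\n"

def parse_reply(line: bytes) -> tuple[int, str]:
    """Replies start with a 3-digit code, e.g. '230 Login successful.'"""
    text = line.decode("ascii").rstrip("\r\n")
    return int(text[:3]), text[4:]
```

That ~16-verb set covers login, transfers, and directory management, which is roughly what "FTP is only like ~15-16 commands" is getting at; the data connection (PASV/PORT) is where the platform-specific networking work actually lives.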
I'd like to think I've mostly stuck with that same philosophy: going for MVP stuff, as well as easy stuff/low-hanging fruit. I've definitely added a bunch since then, mostly because it was easier than I realized.

Just going to leave this exchange from page 2 of this topic here without comment:
This, 100%. It's more like understanding the logical building blocks; that's what the models have truly gotten better at. It's not just that the training data has gotten better, that's giving the training too much credit on its own. Ultimately you add a lot of those guardrails yourself through the agent markdown files.

I think it works on a lower level than that: it copies the shape/syntax of code, rather than lines from a program.
You can ask it to do something never done before and it doesn't struggle.
For example, non-programming, but if I ask for a poem about a purple mouseball dancing in custard, there are unlikely to be sources, but it has seen poems, and how people contextually talk about custard and mouseballs, and makes something that fits the shape.
So, no: while it works from a dataset, it doesn't look for something that does the same thing and reuse it. It's a bit weirdly abstract, because the learning is like insanely complex statistical analysis: given that the user wants this, what is a statistically probable shape of words that would exist as a solution? Intentionally past tense, because it sort of fits.
It's still easy to over-trust what it says when solutioning; it doesn't always lay out all the options on the table for you, and that's where some actual human insight is still a useful thing. It's also not going to spit out perfectly working code every time, either.

The complexity of the results reflects the, to me, incomprehensible complexity of the model itself. You have to remember that as these things were being developed, there came a point where they switched from giving chatbot-like automated answers based on specific content of their dataset, to surprising the designers by actually producing what looks like new insights. It isn't entirely new (the same could be said of human thought; we all learnt from somewhere) because it is based on the dataset, but it isn't based on the dataset in a really easily comprehensible way.
PS - not specifically an advocate; I just try to understand the things around me, and learn enough to stop my colleagues using it badly. Like the guy that asked it to explain how it did something... the result of which will be that it worked out how it *should* answer the question "how did you do that", not actually tell him how it did it. They're weird things.
PPS If you're trying to do what my colleague was trying to do, I suspect you'd have better luck asking it to work out a step-by-step process to get the solution (using the newfangled "reasoning" modes), and then asking it to follow that process, detailing the process at each step, rather than retroactively asking it to justify an answer that was reached without the implied structure your request assumes. The problem is it would give a reasonable answer either way.
Regarding step by step: I think that's where planning modes come in, or even skills like superpowers, where it will enter a brainstorming mode first, come up with a plan document, and then go back and forth with you on reviewing it before implementing. That's definitely what I've been doing lately on these bigger asks. However, even despite doing all that, with that big SCP plan I sent above, I still spent an hour or two last night debugging an issue where the transfers kept timing out. It turned out the SSH library didn't calculate the file sizes correctly for macOS, and it was not ending the transfer correctly due to that mismatch. I actually submitted a PR back to cy384's upstream library for this one, though after all the drama I haven't any expectations on whether that ever makes it in.
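A toy reproduction of that general bug class (illustrative only; `receive` and `max_polls` are made-up names, not the actual library code): a transfer loop that trusts a reported file size will sit waiting for bytes that never arrive whenever the size is overstated, which presents as a timeout rather than an error:

```python
import io

# Hedged sketch: a receive loop that trusts expected_size. If the sender
# computed the size wrong (overstated), the loop never sees the "missing"
# bytes and the transfer hangs until some timeout fires.
def receive(stream: io.BufferedIOBase, expected_size: int, max_polls: int = 5) -> bytes:
    buf = b""
    polls = 0
    while len(buf) < expected_size:
        chunk = stream.read(expected_size - len(buf))
        if not chunk:
            polls += 1  # real code would block on the socket here
            if polls >= max_polls:
                raise TimeoutError(f"got {len(buf)} of {expected_size} bytes")
            continue
        buf += chunk
    return buf

data = b"x" * 100
assert receive(io.BytesIO(data), expected_size=100) == data
# receive(io.BytesIO(data), expected_size=112) raises TimeoutError:
# the 12-byte size mismatch means the loop waits for data that never comes.
```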
Exactly, it's able to follow the logical reasoning; it doesn't need to copy-pasta off existing code. And that's all said without getting into cache re-usage and context windows and all that stuff, which allow further optimizations within your dataset.

Yeah, and moreover, these models are generalists. Models trained on high-quality code improve at completely unrelated reasoning tasks (see here: https://arxiv.org/abs/2210.07128). They aren't operating like lookup databases: it is a statistical model that produces a probability distribution for the next token based on the previous information in the sequence, which can most certainly be a completely unseen sequence. It is very likely that the model has existing code in its training set it can produce verbatim (or roughly), but it is highly unlikely it would actually do so at inference unless explicitly prompted. The attention mechanism is the key point that allows the meaning behind certain sequences to be abstracted from their original context.
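One toy way to see "probability distribution for the next token": a bigram model is the most minimal version of the idea. Real models condition on the entire preceding sequence through attention, so this is a deliberately crude sketch of just the next-token-distribution part:

```python
from collections import Counter, defaultdict

# Crude illustration: estimate P(next token | previous token) by counting.
# An LLM does the same kind of thing, but conditioned on the whole
# sequence via attention, with a learned model instead of raw counts.
def train_bigrams(tokens: list[str]) -> dict[str, dict[str, float]]:
    counts: dict[str, Counter] = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {tok: n / sum(c.values()) for tok, n in c.items()}
        for prev, c in counts.items()
    }

corpus = "the cat sat on the mat the cat ran".split()
model = train_bigrams(corpus)
# After "the": "cat" has probability 2/3, "mat" has 1/3.
```

The contrast with a lookup database is the point: the model stores a distribution shaped by everything it saw, not the original sentences, which is why it can continue a sequence it has never seen.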
