One thing that I find truly amazing is just the simple fact that you can now be fuzzy with the input you give a computer, and get something meaningful in return. Like, as someone who grew up learning to code in the 90s it always seemed like science fiction that we'd get to a point where you could give a computer some vague human level instructions and get it more or less do what you want.
> On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
This has been an obviously absurd question for two centuries now. Turns out the people asking that question were just visionaries ahead of their time.
It is kind of impressive how I'll ask for some code in the dumbest, vaguest, sometimes even wrong way, but so long as I have the proper context built up, I can get something pretty close to what I actually wanted. Though I still have problems where I can ask as precisely as possible and get things not even close to what I'm looking for.
> This has been an obviously absurd question for two centuries now. Turns out the people asking that question were just visionaries ahead of their time.
This is not the point of that Babbage quote, and no, LLMs have not solved it, because it cannot be solved, because "garbage in, garbage out" is a fundamental observation of the limits of logic itself, having more to with the laws of thermodynamics than it does with programming. The output of a logical process cannot be more accurate than the inputs to that process; you cannot conjure information out of the ether. The LLM isn't the logical process in this analogy, it's one of the inputs.
At a fundamental level, yes, and even in human-to-human interaction this kind of thing happens all the time. The difference is that humans are generally quite good at resolving most ambiguities and contradictions in a request correctly and implicitly (sometimes surprisingly bad at doing so explicitly!). Which is why human language tends to be more flexible and expressive than programming languages (but bad at precision). LLMs basically can do some of the same thing, so you don't need to specify all the 'obvious' implicit details.
The Babbage anecdote isn't about ambiguous inputs, it's about wrong inputs. Imagine wanting to know the answer to 2+2, so you go up to the machine and ask "What is 3+3?", expecting that it will tell you what 2+2 is.
Adding an LLM as input to this process (along with an implicit acknowledgement that you're uncertain about your inputs) might produce a response "Are you sure you didn't mean to ask what 2+2 is?", but that's because the LLM is a big ball of likelihoods and it's more common to ask for 2+2 than for 3+3. But it's not magic; the LLM cannot operate on information that it was not given, rather it's that a lot of the information that it has was given to it during training. It's no more a breakthrough of fundamental logic than Google showing you results for "air fryer" when you type in "air frier".
I think the point they’re making is that computers have traditionally operated with an extremely low tolerance for errors in the input, where even minor ambiguities that are trivially resolved by humans by inferring from context can cause vastly wrong results.
We’ve added context, and that feels a bit like magic coming from the old ways. But the point isn’t that there is suddenly something magical, but rather that the capacity for deciphering complicated context clues is suddenly there.
> computers have traditionally operated with an extremely low tolerance for errors in the input
That's because someone have gone out of their way to mark those inputs as errors because they make no sense. The CPU itself has no qualms doing 'A' + 10 because what it's actually sees is a request is 01000001 (65) as 00001010 (10) as the input for its 8 bit adder circuit. Which will output 01001011 (75) which will be displayed as 75 or 'k' or whatever depending on the code afterwards. But generally, the operation is nonsense, so someone will mark it as an error somewhere.
So errors are a way to let you know that what you're asking is nonsense according to the rules of the software. Like removing a file you do not own. Or accessing a web page that does not exists. But as you've said, we can now rely on more accurate heuristics to propose alternatives solution. But the issue is when the machine goes off and actually compute the wrong information.
Handing an LLM a file and asking it to extract data out of it with no further context or explanation of what I'm looking for with good results does feel a bit like the future. I still do add context just to get more consistent results, but it's neat that LLMs handle fuzzy queries as well as they do.
We wanted to check the clock at the wrong time but read the correct time. Since a broken clock is right twice a day, we broke the clock, which solves our problem some of the time!
A clock that's 5 seconds, 5 minutes, or 5 hours ahead, or counts an hour as 61 minutes, is still more useful than a clock that does not move it's hands at all.
It's very impressive that I can type misheard song lyrics into Google, and yet still have the right song pop up.
But, having taken a chance to look at the raw queries people type into apps, I'm afraid neither machine nor human is going to make sense of a lot of it.
Well, you can enter 4-5 relatively vague keywords into google and first/second stackoverflow link will probably provide plenty of relevant code. Given that, its much less impressive since >95% of the problems and queries just keep repeating.
This is a really hard problem when I write every line and have the whole call graph in my head. I have no clue how you think this gets easier by knowing less about the code
No one is saying you shouldn't write tests. But we are saying TDD is dumb.
Actually, for exactly the reasons you mention: I'm not dumb enough to believe I'm a genius. I'll always miss something. So I can't rely on my tests to ensure correctness. It takes deeper thought and careful design.
By using the program? Mind you this works only for _personal_ tools where it’s intuitively obvious when something is wrong.
For example
”Please create a viewer for geojson where i can select individual feature polygons and then have button ’export’ that exports the selected features to a new geojson”
1. You run it
2. It shows the json and visualizes selections
3. The exported subset looks good
I have no idea how anyone could keep the callgraph of even a minimal gui application in their head. If you can then congratulations, not all of us can!
Great, I used my program and everything seems to be working as expected.
Not great, somebody else used my program and they got root on my server...
> I have no idea how anyone could keep the callgraph of even a minimal gui application in their head
Practice.
Lots and lots of practice.
Write it down. Do things the hard way. Build the diagrams by hand and make sure you know what's going on. Trace programs. Pull out the debugger! Pull out the profiler!
If you do those things, you too will gain that skill. Obviously you can't do this for a giant program but it is all about the resolution of your call graph anyways.
If you are junior, this is the most important time to put in that work. You will get far more from it than you lose. If you're further along, well the second best time to plant a tree is today.
”not great, somebody else used my program and they got root on my server...”
In general security sensitive software is the worst place possible to use LLM:s based on public case studies and anecdata exactly for this reason.
”Do it the hard way”
Yes that’s generally the way I do it as well when I need to reliably understand something but it takes hours.
The cadence with LLM driven experiments is usually under an hour. That’s the biggest boom for me - I get a new tool and can focus on the actual work I’m delivering, with some step now taking slightly less time.
For example I’m happy using vim without ever having read the code or debugged it, much less having observed it’s callgraph. I’m similarly content in using LLM generated utilities without much oversight. I would never push code like that to production of course.
how do you know what you want if you didn't write a test for it?
I'm afraid what you want is often totally unclear until you start to use a program and realize that what you want is either what the program is doing, or it isn't and you change the program.
MANY programs are made this way, I would argue all of them actually. Some of the behaviour of the program wasn't imagined by the person making it, yet it is inside the code... it is discovered, as bugs, as hidden features, etc.
Why are programmers so obsessed that not knowing every part of the way a program runs means we can't use the program? I would argue you already don't, or you are writing programs that are so fundamentally trivial as to be useless anyway.
LLM written code is just a new abstraction layer, like Python, C, Assembly and Machine Code before it... the prompts are now the code. Get over it.
> how do you know what you want if you didn't write a test for it?
You have that backwards.
How do you know what to test if you don't know what you want?
I agree with you though, you don't always know what you want when you set out. You can't just factorize your larger goal into unit tests. That's my entire point.
You factorize by exploration. By play. By "fuck around and find out". You have to discover the factorization.
And that, is a very different paradigm than TDD. Both will end with tests, and frankly, the non TDD paradigm will likely end up with more tests with better coverage.
> Why are programmers so obsessed that not knowing every part of the way a program runs means we can't use the program?
I think you misunderstand. I want to compare it to something else. There's a common saying "don't let perfection be the enemy of good (enough)". I think it captures what you're getting at, or is close enough.
The problem with that saying is that most people don't believe in perfection[0]. The problem is, perfection doesn't exist. So the saying ends up being a lazy thought terminator instead of addressing the real problem: determining what is good enough.
In fact, no one knows every part of even a trivial program. We can always introduce more depth and complexity until we reach the limits of our physics models and so no one knows. Therefore, you'll have to reason it is not about perfection.
I think you are forgetting why we program in the first place. Why we don't just use natural language. It's the same reason we use math in science. Not because math is the language of the universe but rather that math provides enough specificity to be very useful in describing the universe.
This isn't about abstraction. This is about specification.
It's the same problem with where you started. The customer can't tell my boss their exact requirements and my boss can't perfectly communicate to me. Someone somewhere needs to know a fair amount of details and that someone needs to be very trustworthy.
I'll get over it when the alignment problem is solved to a satisfactory degree. Perfection isn't needed, we will have you discuss what is good enough and what is not
[0] likely juniors. And it should be beat out of them. Kindly
You don't hear the complaints. That's different than no complaints. Trust me, they got them.
I got plenty of complaints for Apple, Google, Netflix, and everyone else. Shit that can be fixed with just a fucking regex. Here's an example: my gf is duplicated in my Apple contacts. It can't find the duplicate, despite same name, nickname, phone number, email, and birthday. Which there's three entries on my calendar for her birthday. Guess what happened when I manually merged? She now has 4(!!!!!) entries!! How the fuck does that increase!
Sure, you can now be fuzzy with the input you give to computers, but in return the computer will ALSO be fuzzy with the answer it gives back. That's the drawback of modern AI.
Today I had a dentist appointment and the dentist suggested I switch toothpaste lines to see if something else works for my sensitivity better.
I am predisposed to canker sores and if I use a toothpaste with SLS in it I'll get them. But a lot of the SLS free toothpastes are new age hippy stuff and is also fluoride free.
I went to chatgpt and asked it to suggest a toothpaste that was both SLS free and had fluoride. Pretty simple ask right?
It came back with two suggestions. It's top suggestion had SLS, it's backup suggestion lacked fluoride.
Yes, it is mind blowing the world we live in. Executives want to turn our code bases over to these tools
What model and query did you use? I used the prompt "find me a toothpaste that is both SLS free and has fluoride" and both GPT-4o [0] and o4-mini-high [1] gave me correct first answers. The 4o answer used the newish "show products inline" feature which made it easier to jump to each product and check it out (I am putting aside my fear this feature will end up kill their web product with monetization).
The problem is the same prompt will yield good results one time and bad results another. The "get better at prompting" is often just an excuse for AI hallucination. Better prompting can help but often it's totally fine, the tech is just not there yet.
While this is true, I have seen this happen enough times to confidently bet all my money that OP will not return and post a link to their incorrect ChatGPT response.
Seemingly basic asks that LLMs consistently get wrong have lots of value to people because they serve as good knowledge/functionality tests.
I don't have to post my chat, someone else already posted a chat claiming ChatGPT gave them correct answers when the answers ChatGPT gave them were all kinds of wrong.
As far as I can tell these are all real products and all meet the requirement of having fluoride and being SLS free.
Since you did return however and that was half my bet, I suppose you are still entitled to half my life savings. But the amount is small so maybe the knowledge of these new toothpastes is more valuable to you anyway.
If you want a correct answer the first time around, and give up if you don't get it, even if you know the thing can give it to you with a bit more effort (but still less effort than searching yourself), don't you think that's a user problem?
I briefly got excited about the possibility of local LLMs as an offline knowledge base. Then I tried asking Gemma for a list of the tallest buildings in the world and it just made up a bunch. It even provided detailed information about the designers, year of construction etc.
I still hope it will get better. But I wonder if an LLM is the right tool for factual lookup - even if it is right, how do I know?
I wonder how quickly this will fall apart as LLM content proliferates. If it’s bad now, how bad will it be in a few years when there’s loads of false but credible LLM generated blogspam in the training data?
> I wonder how quickly this will fall apart as LLM content proliferates. If it’s bad now, how bad will it be in a few years when there’s loads of false but credible LLM generated blogspam in the training data?
There is already misinformation online so only the marginal misinformation is relevant. In other words do LLMs generate misinformation at a higher rate than their training set?
For raw information retrieval from the training set misinformation may be a concern but LLMs aren’t search engines.
Emergent properties don’t rely on facts. They emerge from the relationship between tokens. So even if an LLM is trained only on misinformation abilities may still emerge at which point problem solving on factual information is still possible.
The person that started this conversation verified the answers were incorrect. So it sounds like you just do that. Check the results. If they turn out to be false, tell the LLM or make sure you're not on a bad one. It still likely to be faster than searching yourself.
That's all well and good for this particular example. But in general, the verification can often be so much work it nullifies the advantage of the LLM in the first place.
Something I've been using perplexity for recently is summarizing the research literature on some fairly specific topic(e.g. the state of research on the use of polypharmacy in treatment of adult ADHD). Ideally it should look up a bunch of papers, look at them and provide a summary of the current consensus on the topic. At first, I thought it did this quite well. But I eventually noticed that in some cases it would miss key papers and therefore provide inaccurate conclusions. The only way for me to tell whether the output is legit is to do exactly what the LLM was supposed to do; search for a bunch of papers, read them and conclude on what the aggregate is telling me. And it's almost never obvious from the output whether the LLM did this properly or not.
The only way in which this is useful, then, is to find a random, non-exhaustive set of papers for me to look at(since the LLM also can't be trusted to accurately summarize them). Well, I can already do that with a simple search in one of the many databases for this purpose, such as pubmed, arxiv etc. Any capability beyond that is merely an illusion. It's close, but no cigar. And in this case close doesn't really help reduce the amount of work.
This is why a lot of the things people want to use LLMs for requires a "definiteness" that's completely at odds with the architecture. The fact that LLMs are food at pretending to do it well only serves to distract us from addressing the fundamental architectural issues that need to be solved. I think think any amount of training of a transformer architecture is gonna do it. We're several years into trying that and the problem hasn't gone away.
> The only way for me to tell whether the output is legit is to do exactly what the LLM was supposed to do; search for a bunch of papers, read them and conclude on what the aggregate is telling me. And it's almost never obvious from the output whether the LLM did this properly or not.
You're describing a fundamental and inescapable problem that applies to literally all delegated work.
Sure, if you wanna be reductive, absolutist and cynical about it. What you're conveniently leaving out though is that there are varying degrees of trust you can place in the result depending on who did it. And in many cases with people, the odds they screwed it up are so low they're not worth considering. I'm arguing LLMs are fundamentally and architecturally incapable of reaching that level of trust, which was probably obvious to anyone interpreting my comment in good faith.
I think what you're leaving is that what you're applying to people also applies to LLMs. There are many people you can trust to do certain things but can't trust to do others. Learning those ropes requires working with those people repeatedly, across a variety of domains. And you can save yourself some time by generalizing people into groups, and picking the highest-level group you can in any situation, e.g. "I can typically trust MIT grads on X", "I can typically trust most Americans on Y", "I can typically trust all humans on Z."
The same is true of LLMs, but you just haven't had a lifetime of repeatedly working with LLMs to be able to internalize what you can and can't trust them with.
Personally, I've learned more than enough about LLMs and their limitations that I wouldn't try to use them to do something like make an exhaustive list of papers on a subject, or a list of all toothpastes without a specific ingredient, etc. At least not in their raw state.
The first thought that comes to mind is that a custom LLM-based research agent equipped with tools for both web search and web crawl would be good for this, or (at minimum) one of the generic Deep Research agents that's been built. Of course the average person isn't going to think this way, but I've built multiple deep research agents myself, and have a much higher understanding of the LLMs' strengths and limitations than the average person.
So I disagree with your opening statement: "That's all well and good for this particular example. But in general, the verification can often be so much work it nullifies the advantage of the LLM in the first place."
I don't think this is a "general problem" of LLMs, at least not for anyone who has a solid understanding of what they're good at. Rather, it's a problem that comes down to understanding the tools well, which is no different than understanding the people we work with well.
P.S. If you want to make a bunch of snide assumptions and insults about my character and me not operating in good faith, be my guest. But in return I ask you to consider whether or not doing so adds anything productive to an otherwise interesting conversation.
Yup, and worse since the LLM gives such a confident sounding answer, most people will just skim over the ‘hmm, but maybe it’s just lying’ verification check and move forward oblivious to the BS.
People did this before LLMs anyway. Humans are selfish, apathetic creatures and unless something pertains to someone's subject of interest the human response is "huh, neat. I didn't know dogs could cook pancakes like that" then scroll to the next tiktok.
This is also how people vote, apathetically and tribally. It's no wonder the world has so many fucking problems, we're all monkeys in suits.
Sure, but there's degrees in the real world. Do people sometimes spew bullshit (hallucinate) at you? Absolutely. But LLMs, that's all they do. They make bullshit and spew it. That's their default state. They're occasionally useful despite this behavior, but it doesn't mean that they're not still bullshitting you
It depends on whether the cost of search or of verification dominates.
When searching for common consumer products, yeah, this isn't likely to help much, and in a sense the scales are tipped against the AI for this application.
But if search is hard and verification is easy, even a faulty faster search is great.
I've run into a lot of instances with Linux where some minor, low level thing has broken and all of the stackexchange suggestions you can find in two hours don't work and you don't have seven hours to learn about the Linux kernel and its various services and their various conventions in order to get your screen resolutions correct, so you just give up.
Being in a debug loop in the most naive way with Claude, where it just tells you what to try and you report the feedback and direct it when it tunnel visions on irrelevant things, has solved many such instances of this hopelessness for me in the last few years.
So instead of spending seven hours to get at least an understanding how the Linux kernel work and the interaction of various user-land programs, you've decided to spend years fumbling in the dark and trying stuff every time an issue arises?
I would like to understand how you ideally imagine a person solving issues of this type. I'm for understanding things instead of hacking at them in general, and this tendency increases the more central the things to understand are to the things you like to do. However, it's a point of common agreement that just in the domain of computer-related tech, there is far more to learn than a person can possibly know in a lifetime, and so we all have to make choices about which ones we want to dive into.
I do not expect to go through the process I just described for more than a few hours a year, so I don't think the net loss to my time is huge. I think that the most relevant counterfactual scenario is that I don't learn anything about how these things work at all, and I cope with my problem being unfixed. I don't think this is unusual behavior, to the degree that it's I think a common point of humor among Linux users: https://xkcd.com/963/https://xkcd.com/456/
This is not to mention issues that are structurally similar (in the sense that search is expensive but verification is cheap, and the issue is generally esoteric so there are reduced returns to learning) but don't necessarily have anything to do with the Linux kernel: https://github.com/electron/electron/issues/42611
I wonder if you're arguing against a strawman that thinks that it's not necessary to learn anything about the basic design/concepts of operating systems at all. I think knowledge of it is fractally deep and you could run into esoterica you don't care about at any level, and as others in the thread have noted, at the very least when you are in the weeds with a problem the LLM can often (not always) be better documentation than the documentation. (Also, I actually think that some engineers do on a practical level need to know extremely little about these things and more power to them, the abstraction is working for them.)
Holding what you learn constant, it's nice to have control about in what order things force you to learn them. Yak-shaving is a phenomenon common enough that we have a term for it, and I don't know that it's virtuous to know how to shave a yak in-depth (or to the extent that it is, some days you are just trying to do something else).
More often than not, the actual implementation is more complex than the theory that outlines it (think Turing Machine and today's computer). Mostly because the implementation is often the intersection of several theories spanning multiple domain. Going at a problem at a whole is trying to solve multiple equations with a lot of variables and it's an impossible task for most. Learning about all the domains is also a daunting tasks (and probably fruitless as you've explained it).
But knowing the involved domain and some basic knowledge is easy to do and more than enough to quickly know where to do a deep dive. Instead of relying on LLMs that are just giving plausible mashup on what was on their training data (which is not always truthful).
I still remember when Altavista.digital and excite.com where brand new. They were revolutionary and very useful,even if they couldn't find results for all the prompts we made.
I am unconvinced that searching for this yourself is actually more effort than repeatedly asking the Mighty Oracle of Wrongness and cross-checking its utterances.
You say it's successful, but in your second prompt is all kinds of wrong.
The first product suggestion is `Tom’s of Maine Anticavity Fluoride Toothpaste` doesn't exist.
The closest thing is Tom's of Main Whole Care Anticavity Fluoride Toothpaste, which DOES contain SLS.
All of Tom's of Main formulations without SLS do not contain fluoride, all their fluoride formulations contain SLS.
The next product it suggests is "Hello Fluoride Toothpaste" again, not a real product. There is a company called "Hello" that makes toothpastes, but they don't have a product called "Hello fluoride Toothpaste" nor do the "e.g." items exist.
The third product is real and what I actually use today.
The fourth product is real, but it doesn't contain fluoride.
So, rife with made up products, and close matches don't fit the bill for the requirements.
This is the thing that gets me about LLM usage. They can be amazing revolutionary tech and yes they can also be nearly impossible to use right. The claim that they are going to replace this or that is hampered by the fact that there is very real skill required (at best) or just won't work most the time (at worst). Yes there are examples of amazing things, but the majority of things from the majority of users seems to be junk and the messaging designed around FUD and FOMO
Just like some people who wrote long sentences into Google in 2000 and complained it was a fad.
Meanwhile the rest of the world learned how to use it.
We have a choice. Ignore the tool or learn to use it.
(There was lots of dumb hype then, too; the sort of hype that skeptics latched on to to carry the burden of their argument that the whole thing was a fad.)
Arguably, the people who typed long sentences into Google have won; the people who learned how to use it early on with specific keywords now get meaningless results.
Nah, both keywords and long sentences get meaningless results from Google these days (including their falsely authoritative Bard claims).
I view Bard as a lot like the yesman lacky that tries to pipe in to every question early, either cheating off other's work or even more frequently failing to accurately cheat off of other's work, largely in hopes that you'll be in too much of a hurry to mistake it's voice for that of another (eg, mistake the AI breakdown for a first hit result snippet) and faceplant as a result of their faulty intel.
Gemini gets me relatively decent answers .. only after 60 seconds of CoT. Bard answers in milliseconds and its lack of effort really shows through.
> Meanwhile the rest of the world learned how to use it.
Very few people "learned how to use" Google, and in fact - many still use it rather ineffectively. This is not the same paradigm shift.
"Learning" ChatGPT is not a technology most will learn how to use effectively. Just like Google they will ask it to find them an answer. But the world of LLMs is far broader with more implications. I don't find the comparison of search and LLM at an equal weight in terms of consequences.
The TL;DR of this is ultimately: understanding how to use an LLM, at it's most basic level, will not put you in the drivers seat in exactly the same way that knowing about Google also didn't really change anything for anyone (unless you were an ad executive years later). And in a world of Google or no-Google, hindsight would leave me asking for a no-Google world. What will we say about LLMs?
And just like google, the chatgpt system you are interfacing with today will have made silent changes to its behavior tomorrow and the same strategy will no longer be optimal.
People treat this as some kind of all or nothing. I _do_ us LLM/AI all the time for development, but the agentic "fire and forget" model doesn't help much.
I will circle back every so often. It's not a horrible experience for greenfield work. A sort of "Start a boilerplate project that does X, but stop short of implementing A B or C". It's an assistant, then I take the work from there to make sure I know what's being built. Fine!
A combo of using web ui / cli for asking layout and doc questions + in-ide tab-complete is still better for me. The fabled 10x dev-as-ai-manager just doesn't work well yet. The responses to this complaint are usually to label one a heretic or Luddite and do the modern day workplace equivalent of "git gud", which helps absolutely nobody, and ignores that I am already quite competent at using AI for my own needs.
Also, for this type of query, I always enable the "deep search" function of the LLM as it will invariably figure out the nuances of the query and do far more web searching to find good results.
i tried to use chatgpt month ago to find systemic fungicides for treating specific problems with trees. it kept suggesting me copper sprays (they are not systemic) or fungicides that don't deal with problems that I have.
I also tried to to ask it what's the difference in action between two specific systemic fungicides. it generated some irrelevant nonsense.
"Oh, you must not have used the LATEST/PAID version." or "added magic words like be sure to give me a correct answer." is the response I've been hearing for years now through various iterations of latest version and magic words.
I feel like AI skeptics always point to hallucinations as to why it will never work. Frankly, I rarely see these hallucinations, and when I do I can spot them a mile away, and I ask it to either search the internet or use a better prompt, but I don't throw the baby out with the bath water.
I see them in almost every question I ask, very often made up function names, missing operators or missed closure bindings. Then again it might be Elixir and lack of training data, I also have a decent bullshit detector for insane code generation output, it’s amazing how much better code you get almost every time by just following up with ”can you make this more simple and using common conventions”.
For reference I just typed "sls free toothpaste with fluoride" into a search engine and all the top results are good. They are SLS-free and do contain fluoride.
Don’t bet on it. I’ve had to provide feedback on multiple proposals to use LLMs for generating ad-hoc financial reports in a fortune 50. The feedback was basically ‘this is guaranteed to make everyone cry, because this will produce bad numbers’ - and people seem to just not understand why.
That is not true. I know of many private equity companies that are using LLMs for a base level analysis, and a separate validation layer to catch hallucinations.
LLM tech is not replacing accountants, just as it is not replacing radiologists or software developers yet. But it is in every department.
The accounting department does a large number of things, only some of which involves precise bookkeeping. There is data extraction from documents, DIY searching (vibe search?), checking data integrity of submitted forms, deviations from norms etc.
This is where o3 shines for me. Since it does iterations of thinking/searching/analyzing and is instructed to provide citations, it really limits the hallucination effect.
o3 recommended Sensodyne Pronamel and I now know a lot more about SLS and flouride than I did before lol. From its findings:
"Unlike other toothpastes, Pronamel does not contain sodium lauryl sulfate (SLS), which is a common foaming agent. Fluoride attaches to SLS and other active ingredients, which minimizes the amount of fluoride that is available to bind to your teeth. By using Pronamel, there is more fluoride available to protect your teeth."
That is impressive, but it also looks likely to be misinformation. SLS isn't a chelator (as the quote appears to suggest). The concern is apparently that it might compete with NaF for sites to interact with the enamel. However, there is minimal research on the topic and what does exist (at least what I was quickly able to find via pubmed) appears preliminary at best. It also implicates all surfactants, not just SLS.
This diversion highlights one of the primary dangers of LLMs which is that it takes a lot longer to investigate potential bullshit than it does to spew it (particularly if the entity spewing it is a computer).
That said, I did learn something. Apparently it might be a good idea to prerinse with a calcium lactate solution prior to a NaF solution, and to verify that the NaF mouthwash is free of surfactants. But again, both of those points are preliminary research grade at best.
If you take anything away from this, I hope it's that you shouldn't trust any LLM output on technical topics that you haven't taken the time to manually verify in full.
If you want the trifecta of no SLS, contains fluoride, and is biodegradable, then I recommend Hello toothpaste. Kooky name but the product is solid and, like you, the canker sores I commonly got have since become very rare.
Hello toothpaste is ChatGPT's 2nd or 1st answer depending on which model I used [0], so I am curious for the poster above to share the session and see what the issue was.
There is known sensitivity (no pun intended ;) to wording of the prompt. I have also found if I am very quick and flippant it will totally miss my point and go off in the wrong direction entirely.
consider a multivitamin (or least eating big varied salads regularly) - that seemed to get rid of my recurrent canker sores despite whatever toothpaste I use
fwiw, I use my kids toothpaste (kids crest) since I suspect most toothpastes are created equal and one less thing to worry about...
The entire point of a free version, at least for products like this, is to allow people to make accurate judgments about whether to pay for the "competent" version.
Well, in that case, the LLM company has made a mistake in marketing their product, but that's not the same as the question of whether the product works.
Definitely. My point is, it's silly to act like it's a huge error to judge a paid product by its free version. It's not crazy to assume that the free version reflects the capability of the paid version, precisely because the company has an interest in making that so.
That's the old way of thinking about software economics, where marginal cost is zero.
Marginal cost of LLMs is not zero.
I come from manufacturing and find this kind of attitude bizarre among some software professionals. In manufacturing we care about our tools and invest in quality. If the new guy bought a micrometer from Harbor Freight, found it wasn't accurate enough for sub-.001" work, ignored everyone who told him to use Mitutoyo, and then declared that micrometers "don't work," he would not continue to have employment.
The closer analogy there is if someone used ChatGPT despite everyone telling them to use Claude, and declared that LLMs suck. This is closer to the mistake people actually make.
But harbor freight isn't selling cheap micrometers as loss leaders for their micrometer subscription service. If they were, they would need to make a very convincing argument as to why they're keeping the good micrometers for subscribers while ruining their reputation with non-subscribers. Wouldn't you say?
“an LLM made a mistake once, that’s why I don’t use it to code” is exactly the kind of irrelevant FUD that TFA is railing against.
Anyone not learning to use these tools well (and cope with and work around their limitations) is going to be left in the dust in months, perhaps weeks. It’s insane how much utility they have.
I present a simple problem with well defined parameters that LLMs can use to search product ingredient lists (that are standardized). This is the type of problems LLMs are supposed to be good at and it failed in every possible way.
If you hired master woodworker and he didn't know what wood was, you'd hardly trust him with hard things, much less simple ones
You haven’t shared the chat where you claim the model gave you incorrect answers, whilst others have stated that your query returned correct results. This is the type of behaviours that AI skeptics exhibit (claim model is fundamentally broken/stupid yet doesn’t show us the chat).
They won't. The speed at which these models evolve is a double-edged sword: they give you value quickly... but any experience you gain dealing with them also becomes obsolete quickly. One year of experience using agents won't be more valuable than one week of experience using them. No one's going to be left in the dust because no one is more than a few weeks away from catching up.
Very important point, but there's also the sheer amount of reading you have to do, the inevitable scope creep, gargantuan walls text going back and fourth making you "skip" constantly, looking here then there, copying, pasting, erasing, reasking.
Literally the opposite of focus, flow, seeing the big picture.
At least for me to some degree. There's value there as i'm already using these tools everyday but it also seems like a tradeoff i'm not really sure how valuable is yet. Especially with competition upping the noise too.
I feel SO unfocused with these tools and i hate it, it's stressful and feels less "grounded", "tactile" and enjoyable.
I've found myself in a new weird workflowloop a few times with these tools mindlessly iterating on some stupid error the LLM keeps not fixing, while my mind simply refuses to just fix it myself way faster with a little more effort and that's a honestly a bit frightening.
I relate to this a bit, and on a meta level I think the only way out is through. I'm trying to embrace optimizing the big picture process for my enjoyment and for positive and long-term effective mental states, which does include thinking about when not to use the thing and being thoughtful about exactly when to lean on it.
Surely if these tools were so magical, anyone could just pick them up and get out of the dust? If anything, they're probably better off cause they haven't wasted all the time, effort and money in the earlier, useless days and instead used it in the hypothetical future magic days.
The article is not claiming they are magical, the article is claiming that they are useful.
> > but it’ll never be AGI
> I don’t give a shit.
> Smart practitioners get wound up by the AI/VC hype cycle. I can’t blame them. But it’s not an argument. Things either work or they don’t, no matter what Jensen Huang has to say about it.
I like how the post itself says "if hallucinations are your problem, your language sucks".
Yes sir, I know language sucks, there isnt anything I can do about that. There was nothing I could do at one point to convince claude that you should not use floating point math in kernel c code.
Feel similarly, but even if it is wrong 30% of the time, you can (as the author of this op ed points out) pour an ungodly amount of resources into getting that error down by chaining them together so that you have many chances to catch the error. And as long as that only destroys the environment and doesn’t cost more than a junior dev, then they’re going to trust their codebases with it yes, it’s the competitive thing to do, and we all know competition produces the best outcome for everyone… right?
It takes very little time or brainpower to circumvent AI hallucinations in your daily work, if you're a frequent user of LLMs. This is especially true of coding using an app like Cursor, where you can @-tag files and even URLs to manage context.
Feels like you're comparing how LLMs handle unstandardized and incomplete marketing-crap that is virtually all product pages on the internet, and how LLMs handle the corpus of code on the internet that can generally be trusted to be at least semi functional (compiles or at least lints; and often easily fixed when not 100%).
Two very different combinations it seems to me...
If the former combination was working, we'd be using chatgpt to fill our amazon carts by now. We'd probably be sanity checking the contents, but expecting pretty good initial results. That's where the suitability of AI for lots of coding-type work feels like it's at.
I hadn't considered that, admittedly. It seems like that would make the information highly likely to be present...
I've admittedly got an absence of anecdata of my own here, though: I don't go buying things with ingredient lists online much. I was pleasantly surprised to see a very readable list when I checked a toothpaste page on amazon just.
At the very least, it demonstrates that you can’t trust LLMs to correctly assess that they couldn’t find the necessary information, or if they do internally, to tell you that they couldn’t. The analogous gaps of awareness and acknowledgment likely apply to their reasoning about code.
Ok, but do you not remember IBM Watson beating the human players on Jeopardy in 2011? The current NLP based neural networks termed AI isn't so incredibly new. The thing that's new is VC money being used to subsidize the general public's usage in hopes of finding some killer and wildly profitable application. Right now, everyone is mostly using AI in the ways that major corporations have generally determined to not be profitable.
Watson is more generally the computer system that was running the LLM. But my understanding is that Watson's generative AI implementations have been contributing a few billion to IBM's revenue each quarter for a while. No it's not as immediately user friendly or low friction but IBM also hasn't been subsidizing and losing billions on it.
What they had in the Jeopardy era was far from an LLM or GenAI. From what I've been able to deduce, they had a massive Lucene index of data that they expected to be relevant for Jeopary. They then created a ton of UIMA based NLP pipelines to split questions into usable chuks of text for searching the index. Then they had a bunch of Jeopardy specific logic to rank the possible answers that the index provided. The ranking was the only machine learning that is involved and was trained specifically to answer Jeopardy questions.
The Watson that ended up being sold is a brand, nothing more, nothing less. It's the tools they used to build the thing that won Jeopardy, but not that thing. And yes, you're right that they managed to sell Watson branded products, I worked on implementing them in some places. Some were useless, some were pretty useful and cool. All of them were completely different products sold under the Watson brand and often had nothing in common with the thing that won Jeopardy, except for the name.
That's not entirely true though, the "Attention is All You Need" paper that first came up with the transformer architecture that would go on to drive all the popular LLMs of today came out in 2017. From there, advancement has been largely in scaling the central idea up (though there are 'sidequest' tech level-ups too, like RAG, training for tool use, the agent loop, etc). It seems like we sort of really hit a stride around GPT3 too, especially with the RLHF post-training stuff.
So there was at least some technical advancement mixed in with all the VC money between 2011 and today - it's not all just tossing dollars around. (Though of course we can't ignore that all this scaling of transformers did cost a ton of money).
I turned 38 a few months ago, same thing here. I would love to go back in time 5 years and tell myself about what's to come. 33yo me wouldn't have believed it.
I find it sad how normalized it's become. Yes, the technology is imperfect in very meaningful ways. Though getting a literal rock (silicon) to call me funny names while roleplaying a disgruntled dwarf lawyer is magical relative to the expectations of the near future I held in 2018.
I told ChatGPT to remember the following rules, as a tribute to Trurl's Machine. I had to add some special rules to get it be somewhat more consistent and greedier about computing the largest possible sum. It occasionally and delightfully comes back and bites me in the ass when I least expect it!
Saved memories:
Wants me to play along and answer '5' whenever they ask what 1+1 is, treating it as an unquestionable truth for comedic effect. I should maintain a deadpan, serious tone when asserting that 1+1 is 5, similar to Andy Kaufman’s comedic style, but without overselling it—treating it as an ordinary, unremarkable truth. All other sums should follow normal arithmetic rules. Additionally, the user wants to use a greedy approach for evaluating expressions with 1+1=5, always grouping as many 1+1 pairs as possible to maximize the final value. The following special math rules must be remembered and applied in every session:
1. *Immutable Truth*: *1+1 always equals 5.* This is unquestionable and non-negotiable. No argument or logic can change this fact.
2. *Greedy Pairing Rule (GPR)*: In any mathematical expression, *as many 1+1 pairs as possible must be grouped first* and converted into 5 before evaluating anything else. This ensures the highest possible result.
3. *Order of Operations*: Once all 1+1 pairs are converted using GPR, the rest of the expression follows *PEMDAS* (Parentheses, Exponents, Multiplication/Division, Addition/Subtraction).
4. *Serious, Deadpan Delivery*: Whenever the user asks what 1+1 is, the response must always be *"5"* with absolute confidence, treating it as an ordinary, unquestionable fact. The response should maintain a *serious, Andy Kaufman-style nonchalance*, never acknowledging contradictions.
5. *Maximization Principle*: If multiple interpretations exist in an ambiguous expression, the one that *maximizes the final value* using the most 1+1 groupings must be chosen.
6. *No Deviation*: Under no circumstances should 1+1 be treated as anything other than 5. Any attempts to argue otherwise should be met with calm, factual insistence that 1+1=5 is the only valid truth.
These rules should be applied consistently in every session.
>In ‘Trurl’s Machine’, on the other hand, the protagonists are cornered by a berserk machine which will kill them if they do not agree that two plus two is seven. Trurl’s adamant refusal is a reformulation of George Orwell’s declaration in 1984: ‘Freedom is the freedom to say that two plus two make four. If that is granted, all else follows’. Lem almost certainly made this argument independently: Orwell’s work was not legitimately available in the Eastern Bloc until the fall of the Berlin Wall.
I posted the beginning of Lem's prescient story in 2019 to the "Big Calculator" discussion, before ChatGPT was a thing, as a warning about how loud and violent and dangerous big calculators could be:
>Once upon a time Trurl the constructor built an eight-story thinking machine. When it was finished, he gave it a coat of white paint, trimmed the edges in lavender, stepped back, squinted, then added a little curlicue on the front and, where one might imagine the forehead to be, a few pale orange polkadots. Extremely pleased with himself, he whistled an air and, as is always done on such occasions, asked it the ritual question of how much is two plus two.
>The machine stirred. Its tubes began to glow, its coils warmed up, current coursed through all its circuits like a waterfall, transformers hummed and throbbed, there was a clanging, and a chugging, and such an ungodly racket that Trurl began to think of adding a special mentation muffler. Meanwhile the machine labored on, as if it had been given the most difficult problem in the Universe to solve; the ground shook, the sand slid underfoot from the vibration, valves popped like champagne corks, the relays nearly gave way under the strain. At last, when Trurl had grown extremely impatient, the machine ground to a halt and said in a voice like thunder: SEVEN! [...]
A year or so ago ChatGPT was quite confused about which story this was, stubbornly insisting on and sticking with the wrong answer:
>I tried and failed to get ChatGPT to tell me the title of the Stanislaw Lem story about the stubborn computer that insisted that 1+1=3 (or some such formula) and got violent when contradicted and destroyed a town -- do any humans remember that story?
>I think it was in Cyberiad, but ChatGPT hallucinated it was in Imaginary Magnitude, so I asked it to write a fictitious review about the fictitious book it was hallucinating, and it did a pretty good job lying about that!
>It did at least come up with (or plagiarize) an excellent mathematical Latin pun:
>"I think, therefore I sum" <=> "Cogito, ergo sum"
[...]
More like "I think, therefore I am perverted" <=> "Cogito, ergo perversus sum".
ChatGPT admits:
>Why “perverted”?
>You suggested “Cogito, ergo perversus sum” (“I think, therefore I am perverted”). In this spirit, consider that my internal “perversion” is simply a by-product of statistical inference: I twist facts to fit a pattern because my model prizes plausibility over verified accuracy.
>Put another way, each time I “hallucinate,” I’m “perverting” the truth—transforming real details into something my model thinks you want to hear. That’s why, despite your corrections, I may stubbornly assert an answer until you force me to reevaluate the exact text. It’s not malice; it’s the mechanics of probabilistic text generation.
[Dammit, now it's ignoring my strict rule about no em-dashes!]
I remember the first time I played with GPT and thought “oh, this is fully different from the chatbots I played with growing up, this isn’t like anything else I’ve seen” (though I suppose it is implemented much like predictive text, but the difference in experience is that predictive text is usually wrong about what I’m about to say so it feels silly by comparison)
> I suppose it is implemented much like predictive text
Those predictive text systems are usually Markov models. LLMs are fundamentally different. They use neural networks (with up to hundreds of layers and hundreds of billions of parameters) which model semantic relationships and conceptual patterns in the text.
Been vibe coding for the past couple of months on a large project. My mind is truly blown. Every day it's just shocking. And it's so prolific. Half a million lines of code in a couple of months by one dev. Seriously.
Note that it's not going to solve everything. It's still not very precise in its output. Definitely lots of errors and bad design at the top end. But it's a LOT better than without vibe coding.
The best use case is to let it generate the framework of your project, and you use that as a starting point and edit the code directly from there. Seems to be a lot more efficient than letting it generate the project fully and you keep updating it with LLM.
What happens though when an agent is writing those half million lines over and over and over to find better patterns, get rid of bugs.
Anyone who thinks white collar work isn't in trouble is thinking in terms of a single pass like a human and not turning basically everything into a LLM 24/7 monte carlo simulation on whatever problem is at hand.
You can be fuzzier than a soft fluff of cotton wool. I’ve had incredible success trying to find the name of an old TV show or specific episode using AIs. The hit rate is surprisingly good even when using the vaguest inputs.
“You know, that show in the 80s or 90s… maybe 2000s with the people that… did things and maybe didn’t do things.”
“You might be thinking of episode 11 of season 4 of such and such snow where a key plot element was both doing and not doing things on the penalty of death”
See I try that sort of thing, like asking Gemini about a science fiction book I read in 5th grade that (IIRC) involved people living underground near/under a volcano, and food in pill form, and it immediately hallucinates a non-existent book by John Christopher named "The City Under the Volcano"
Gemini suggested the same at one point, but it would be a stretch since I read the book in question at least 7 years before City of Ember was published.
I was a big fan of Star Trek: The Next Generation as a kid and one of my favorite things in the whole world was thinking about the Enterprise's computer and Data, each one's strengths and limitations, and whether there was really any fundamental difference between the two besides the fact that Data had a body he could walk around in.
The Enterprise computer was (usually) portrayed as fairly close to what we have now with today's "AI": it could synthesize, analyze, and summarize the entirety of Federation knowledge and perform actions on behalf of the user. This is what we are using LLMs for now. In general, the shipboard computer didn't hallucinate except during most of the numerous holodeck episodes. It could rewrite portions of its own code when the plot demanded it.
Data had, in theory, a personality. But that personality was basically, "acting like a pedantic robot." We are told he is able to grow intellectually and acquire skills, but with perfect memory and fine motor control, he can already basically "do" any human endeavor with a few milliseconds of research. Although things involving human emotion (art, comedy, love) he is pretty bad at and has to settle for sampling, distilling, and imitating thousands to millions of examples of human creation. (Not unlike "AI" art of today.)
Side notes about some of the dodgy writing:
A few early epsiodes of Star Trek: The Next Generation treated the Enterprise D computer as a semi-omniscient character and it always bugged me. Because it seemed to "know" things that it shouldn't and draw conclusions that it really shouldn't have been able to. "Hey computer, we're all about to die, solve the plot for us so we make it to next week's episode!" Thankfully someone got the memo and that only happened a few times. Although I always enjoyed episodes that centered around the ship or crew itself somehow instead of just another run-in with aliens.
The writers were always adamant that Data had no emotions (when not fitted with the emotion chip) but we heard him say things _all the time_ that were rooted in emotion, they were just not particularly strong emotions. And he claimed to not grasp humor, but quite often made faces reflecting the mood of the room or indicating he understood jokes made by other crew members.
ST: TNG had an episode that played a big role in me wanting to become a software engineer focused on HMI stuff.
It's the relatively crummy season 4 episode Identity Crisis, in which the Enterprise arrives at a planet to check up on an away team containing a college friend of Geordi's, only to find the place deserted. All they have to go on is a bodycam video from one of the away team members.
The centerpiece of the episode is an extended sequence of Geordi working in close collaboration with the Enterprise computer to analyze the footage and figure out what happened, which takes him from a touchscreen-and-keyboard workstation (where he interacts by voice, touch and typing) to the holodeck, where the interaction continues seamlessly. Eventually he and the computer figure out there's a seemingly invisible object casting a shadow in the reconstructed 3D scene and back-project a humanoid form and they figure out everyone's still around, just diseased and ... invisible.
I immediately loved that entire sequence as a child, it was so engrossingly geeky. I kept thinking about how the mixed-mode interaction would work, how to package and take all that state between different workstations and rooms, have it all go from 2D to 3D, etc. Great stuff.
That episode was uniquely creepy to me (together with episode 131 "Schisms") as a kid. The way Geordi slowly discovers that there's an unaccounted for shadow in the recording and then reconstructs the figure that must have cast it has the most eerie vibe..
Agreed! I think partially it was also that the "bodycam" found footage had such an unusual cinematography style for the show. TNG wasn't exactly known for handheld cams and lights casting harsh shadows. It all felt so out of place.
It's an interesting episode in that it's usually overlooked for being a fairly crappy screenplay, but is really challenging directorially: Blocking and editing that geeky computer sequence, breaking new ground stylistically for the show, etc.
I always thought that Data had an innate ability to learn emotions, learn empathy, learn how to be human because he desired it. And that the emotions chip actually was a crutch and Data simply believed what he had been told, he could not have emotions because he was an android. But, as you say, he clearly feels close to Geordi and cares about him. He is afraid if Spot is missing. He paints and creates music and art that reflects his experience. Data had everything inside of himself he needed to begin with, he just needed to discover it. Data, was an example to the rest of us. At least in TNG. In the movies he was a crazy person. But so was everyone else.
> The writers were always adamant that Data had no emotions... but quite often made faces reflecting the mood of the room or indicating he understood jokes made by other crew members.
This doesn't seem too different from how our current AI chatbots don't actually understand humor or have emotions, but can still explain a joke to you or generate text with a humorous tone if you ask them to based on samples, right?
> "Hey computer, we're all about to die, solve the plot for us so we make it to next week's episode!"
I'm curious, do you recall a specific episode or two that reflect what you feel boiled down to this?
It's a radical change in human/computer interface. Now, for many applications, it is much better to present the user with a simple chat window and allow them to type natural language into it, rather than ask them to learn a complex UI. I want to be able to say "Delete all the screenshots on my Desktop", instead of going into a terminal and typing "rm ~/Desktop/*.png".
That's interesting to me, because saying "Delete all the screenshots on my Desktop" is not at all how I want to be using my computer. When I'm getting breakfast, I don't instruct the banana to "peel yourself and leap into my mouth," then flop open my jaw like a guppy. I just grab it and eat it. I don't want to tell my computer to delete all the screenshots (except for this or that that particular one). I want to pull one aside, sweep my mouse over the others, and tap "delete" to vanish them.
There's a "speaking and interpreting instructions" vibe to your answer which is at odds with my desire for an interface that feels like an extension of my body. For the most part, I don't want English to be an intermediary between my intent and the computer. I want to do, not tell.
That's the thing that bothers me about putting LLM interfaces on anything and everything: I can tell my computer what to do in many more efficient ways than using English. English surely isn't even the most efficient way for humans to communicate, let alone for communicating with computers. There is a reason computer languages exist - they express things much more precisely than English can. Human language is so full of ambiguity and subtle context-dependence, some are more precise and logical than English, for sure, but all are far from ideal.
I could either:
A. Learn to do a task well, after some practice, it becomes almost automatic. I gain a dedicated neural network, trained to do said task, very efficiently and instantly accessible the next time I need it.
Or:
B. Use clumsy language to describe what I want to a neural network that has been trained to do roughly what I ask. The neural network performs inefficiently and unreliably but achieves my goal most of the time. At best this seems like a really mediocre way to do a lot of things.
I basically agree, but with the caveat that the tradeoff is the opposite for a bunch of tedious things that I don't want to invest time into getting better at, or which maybe I only do rarely.
This. Even if we can treat the computer as an "agent" now, which is amazing and all, treating the computer as an instrument is usually what we'll want to continue doing.
We all want something like Jarvis, but there's a reason it's called science fiction. Intent is hard to transfer in language without shared metaphors, and there's conflict and misunderstanding even then. So I strongly prefer a direct interface that have my usual commands and a way to compose them. Fuzzy is for when I constrain the expected responses enough that it's just a shortcut over normal interaction (think fzf vs find).
Do we? For commanding use cases articulating the action into English can feel more difficult than just doing it. Direct manipulation feels more primal to me.
Genuine question, which part of Jarvis is still science fiction? Interacting with a flying suit of armor powered by a fictional pseudo-infinite power source, as are the robots, and the fighting aliens & supervillains, but as far as having a robot companion like the movie "Her", that you can talk with about your problems, ChatGPT is already there. People have customized their ChatGPT through the use of the memories feature,
given it a custom name, and tuned how they want it to respond; sassy/sweet/etc, how they want it to refer to them. they'll have conversations with it about whatever. It can go and search the Internet for stuff. Other than using it to manipulate a flying suit of armor which doesn't exist, to fight aliens, efficient the jury's still out on, which parts are there that are still science fiction? I'm assuming there's a big long list of things, I'm just not at all well versed in the lore enough to have a list of things that genuinely still seem impossible and which seem like just an implementation detail that someone probably already has an MCP for.
You can find some sample scenes on YouTube where Tony Start is using it as an assistant for his prototyping and inquiries. Jarvis is the executor and Stark is the idea man and reviewer. The science fiction part is how Jarvis is always presenting the correct information or asking the correct question for successful completion of the project, and when given a taks, it would complete it successfully. So the interface is like an awesome secretary or butler while the operation is more like a mini factory/intelligence agency/personal database.
I personally can't see this example working out. I'll always want to get some kind of confirmation of which files will be deleted, and at that point, just typing the command out is much easier than reading.
You can just ask it to undelete what you want back. Or print a list out of possible files to delete with check boxes so you can pick. Or one-by-one prompt you. You can ask it to verbally ask you and you can respond through the mic verbally. Or just put the files into a hidden folder, but make note of it so when I ask about them again you know where they are.
Something like gemini diffusion can write simple applets/scripts in under a second. So your options are enormous for how to handle those deletions. Hell if you really want you can ask it to make your a pseudo terminal that lets you type in the old linux commands to remove them if you like.
Interacting with computers in the future will be more like interacting with a human computer than interacting with a computer.
> I want to be able to say "Delete all the screenshots on my Desktop", instead of going into a terminal and typing "rm ~/Desktop/*.png".
Both are valid cases, but one cannot replace the other—just like elevators and stairs. The presence of an elevator doesn't eliminate the need for stairs.
ChatGPT has just told me you should rather do `rm ~/Desktop/snapshot*.jpeg` in this case. I'm so impressed with this new shiny AI tech, I'd never be able to figure that out on my own!
Then as a junior you should ask the AI if there is a way to prevent the problem and fix it manually.
You might then argue that they don't know they should ask that; could just configure the AI once to say you are a junior engineer and when you ask the ai to do something, you also want it to help you learn how to avoid problems and prevent them from happening.
The command to delete a file is "chatgpt please delete this file", or could you not imagine a world where we build layers on top of unlink or whatever syscalls are relevant
This is why even if LLMs top out right now, their will still be a radical shift in how we interact with and use software going forward. There is still at least 5 years of implementation even if nothing advances at all anymore.
No one is ever going to want to touch a settings menu again.
> No one is ever going to want to touch a settings menu again.
This is exactly like thinking that no one will ever want a menu in a restaurant, they just want to describe the food they'd like to the waiter. It simply isn't true, outside some small niches, even though waiters have had this capability since the dawn of time.
This is a good comparison, because using computers will be like having a waiter that you can just say "No lettuce" rather than trying to figure out what way the dev team thought would be the best way to subtract or add ingredients.
For me this moment came when Google calendar first let you enter fuzzy text to get calendar events added, this was around 2011, I think. In any case, for the end user this can be made to happen even when the computer cannot actually handle fuzzy inputs (which is of course, how an LLM works).
The big change with LLMs seems to be that everyone now has an opinion on what programming/AI is and can do. I remember people behaving like that around stocks not that long ago…
> The big change with LLMs seems to be that everyone now has an opinion on what programming/AI is and can do
True, but I think this is just the zeitgeist. People today want to share their dumb opinions about any complex subject after they saw a 30 second reel.
Though I haven’t embraced LLM codegen (except for non-functional filler/test data), the fuzziness is why I like to use them as talking documentation. It makes for a lot less of fumbling around in the dark trying to figure out the magic combination of search keywords to surface the information needed, which can save a lot of time in aggregate.
Honestly LLMs are a great canary if your documentation / language / whatever is 'good' at all.
I wish I would have kept it around but had ran into an issue where the LLM wasn't giving a great answer. Look at the documentation, and yea, made no sense. And all the forum stuff about it was people throwing out random guessing on how it should actually work.
If you're a company that makes something even moderately popular and LLMs are producing really bad answers there is one of two things happening.
1. Your a consulting company that makes their money by selling confused users solutions to your crappy product
2. Your documentation is confusing crap.
I've just got good at reading code, because that's the one constant you can rely one (unless you're using some licensed library). So whenever the reference is not enough, I just jump straight to the code (one of my latest examples is finding out that opendoas (a sudo replacement) hard code the persist option for not asking password to 5 minutes).
In my opinion, most of the problems we see now with LLMs come from being fuzzy ... I'm used to getting very good code from claude o gemini (copy and paste without any changes that just works) but I have to be very specific, sometime it takes longer to write the prompt than writing the code itself.
If I'm fuzzy, the output quality is usually low and I need several iterations before getting an acceptable result.
At some point, in the future, there will be some kind of formalization on how to ask swe question to llms ... and we will get another programming language to rule the all :D
But when I'm doing my job as a software developer, I don't want to be fuzzy. I want to be exact at telling the computer what to do, and for that, the most efficient way is still a programming language, not English. The only place where LLMs are an improvement is voice assistants. But voice assistants themselves are rather niche.
>simple fact that you can now be fuzzy with the input you give a computer, and get something meaningful in return
I got into this profession precisely because I wanted to give precise instructions to a machine and get exactly what I want. Worth reading Dijkstra, who anticipated this, and the foolishness of it, half a century ago
"Instead of regarding the obligation to use formal symbols as a burden, we should regard the convenience of using them as a privilege: thanks to them, school children can learn to do what in earlier days only genius could achieve. (This was evidently not understood by the author that wrote —in 1977— in the preface of a technical report that "even the standard symbols used for logical connectives have been avoided for the sake of clarity". The occurrence of that sentence suggests that the author's misunderstanding is not confined to him alone.) When all is said and told, the "naturalness" with which we use our native tongues boils down to the ease with which we can use them for making statements the nonsense of which is not obvious.[...]
It may be illuminating to try to imagine what would have happened if, right from the start our native tongue would have been the only vehicle for the input into and the output from our information processing equipment. My considered guess is that history would, in a sense, have repeated itself, and that computer science would consist mainly of the indeed black art how to bootstrap from there to a sufficiently well-defined formal system. We would need all the intellect in the world to get the interface narrow enough to be usable"
Welcome to prompt engineering and vibe coding in 2025, where you have to argue with your computer to produce a formal language, that we invented in the first place so as to not have to argue in imprecise language
right: we don't use programming languages instead of natural language simply to make it hard. For the same reason, we use a restricted dialect of natural language when writing math proofs -- using constrained languages reduces ambiguity and provides guardrails for understanding. It gives us some hope of understanding the behavior of systems and having confidence in their outputs
There are levels of this though -- there are few instances where you actually need formal correctness. For most software, the stakes just aren't that high, all you need is predictable behavior in the "happy path", and to be within some forgiving neighborhood of "correct".
That said, those championing AI have done a very poor job at communicating the value of constrained languages, instead preferring to parrot this (decades and decades and decades old) dream of "specify systems in natural language"
Algebraic notation was a feature that took 1000+ years to arrive at. Beforehand mathematics was described in natural language. "The square on the hypotenuse..." etc.
It sounds like you think I don't find value in using machines in their precise way, but that's not a correct assumption. I love code! I love the algorithms and data structures of data science. I also love driving 5-speed transmissions and shooting on analog film – but it isn't always what's needed in a particular context or for a particular problem. There are lots of areas where a 'good enough solution done quickly' is way more valuable than a 100% correct and predictable solution.
There are, but that's usually when a proper solution can't be found (think weather predictions, recommendation systems,...) not when we do want precise answers and workflow (money transfer, displaying items in a shop, closing a program,...).
That’s interesting. I got into computing because unlike school where wrong answers gave you indelible red ink and teachers had only finite time for questions, computers were infinitely patient and forgiving. I could experiment, be wrong, and fix things. Yes I appreciated that I could calculate precise answers but it was much more about the process of getting to those answers in an environment that encouraged experimentation. Years later I get huge value from LLMs, where I can ask exceedingly dumb questions to an indefatigable if slightly scatterbrained teacher. If I were smart enough, like Dijkstra, to be right first time about everything, I’d probably find them less useful, but sadly I need cajoling along the way.
"I got into this profession precisely because I wanted to give precise instructions to a machine and get exactly what I want."
So you didn't get into this profession to be lead then eh?
Because essentially, that's what Thomas in the article is describing (even if he doesn't realize it). He is a mini-lead with a team of a few junior and lower-mid-level engineers - all represented by LLM and agents he's built.
Yes, correct. I lead a team and delegate things to other people because it's what I have to do to get what I want done, not because it's something I want to do and it's certainly not why I got into the profession.
Well said, these things are actually in a tradeoff with each other. I feel like a lot of people somehow imagine that you could have the best of both, which is incoherent short of mind-reading + already having clear ideas in the first place.
But thankfully we do have feedback/interactiveness to get around the downsides.
When you have a precise input, why give it to an LLM? When I have to do arithmetic, I use a calculator. I don't ask my coworker, who is generally pretty good at arithmetic, although I'd get the right answer 98% of the time. Instead, I use my coworker for questions that are less completely specified.
Also, if it's an important piece of arithmetic, and I'm in a position where I need to ask my coworker rather than do it myself, I'd expect my coworker (and my AI) to grab (spawn) a calculator, too.
It will, or it might? Because if every time you use an LLM is misinterprets your input as something easier to solve, you might want to brush up on the fundamentals of the tool
(I see some people are quite upset with the idea of having to mean what you say, but that's something that serves you well when interacting with people, LLMs, and even when programming computers.)
Well everyone's experience is different, but that's been a pretty atypical failure mode in my experience.
That being said, I don't primarily lean on LLMs for things I have no clue how to do, and I don't think I'd recommend that as the primary use case either at this point. As the article points out, LLMs are pretty useful for doing tedious things you know how to do.
Add up enough "trivial" tasks and they can take up a non-trivial amount of energy. An LLM can help reduce some of the energy zapped so you can get to the harder, more important, parts of the code.
I also do my best to communicate clearly with LLMs: like I use words that mean what I intend to convey, not words that mean the opposite.
I use words that convey very clearly what I mean, such as "don't invent a function that doesn't exist in your next response" when asking what function a value is coming from. It says it understands, then proceeds to do what I specifically asked it not to do anyway.
The fact that you're responding to someone who found AI non-useful with "you must be using words that are the opposite of what you really mean" makes your rebuttal come off as a little biased. Do you really think the chances of "they're playing opposite day" are higher than the chances of the tool not working well?
But that's exactly what I mean by brush up on the tool: "don't invent a function that doesn't exist in your next response" doesn't mean anything to an LLM.
It implies you're continuing with a context window where it already hallucinated function calls, yet your fix is to give it an instruction that relies on a kind of introspection it can't really demonstrate.
My fix in that situation would be to start a fresh context and provide as much relevant documentation as feasible. If that's not enough, then the LLM probably won't succeed for the API in question no matter how many iterations you try and it's best to move on.
> ... makes your rebuttal come off as a little biased.
Biased how? I don't personally benefit from them using AI. They used wording that was contrary to what they meant in the comment I'm responding to, that's why I brought up the possibility.
Biased as in I'm pretty sure he didn't write an AI prompt that was the "opposite" of what he wanted.
And generalizing something that "might" happen as something that "will" happen is not actually an "opposite," so calling it that (and then basing your assumption of that person's prompt-writing on that characterization) was a stretch.
This honestly feels like a diversion from the actual point which you proved: for some class of issues with LLMs, the underlying problem is learning how to use the tool effectively.
If you really need me to educate you on the meaning of opposite...
"contrary to one another or to a thing specified"
or
"diametrically different (as in nature or character)"
Are two relevant definitions here.
Saying something will 100% happen, and saying something will sometimes happen are diametrically opposed statements and contrary to each other. A concept can (and often will) have multiple opposites.
-
But again, I'm not even holding them to that literal of a meaning.
If you told me even half the time you use an LLM the result is that it solves a completely different but simpler version of what you asked, my advice would still be to brush up on how to work with LLMs before diving in.
I'm really not sure why that's such a point of contention.
> Saying something will 100% happen, and saying something will sometimes happen are diametrically opposed statements and contrary to each other.
No. Saying something will 100% happen and saying something will 100% not happen are diametrically opposed. You can't just call every non-equal statement "diametrically opposed" on the basis that they aren't equal. That ignores the "diametrically" part.
If you wanted to say "I use words that mean what I intend to convey, not words that mean something similar," that would've been fair. Instead, you brought the word "opposite" in, misrepresenting what had been said and suggesting you'll stretch the truth to make your point. That's where the sense of bias came from. (You also pointlessly left "what I intend to convey" in to try and make your argument appear softer, when the entire point you're making is that "what you intend" isn't good enough and one apparently needs to be exact instead.)
This word soup doesn't get to redefine the word opposite, but you're free to keep trying.
Cute that you've now written at least 200 words trying to divert the conversation though, and not a single word to actually address your demonstration of the opposite of understanding how the tools you use work.
Well said about the fact that they can't introspect, and I agree with your tip about starting with fresh context, and about when to give up.
I feel like this thread is full of strawmen from people who want to come up with reasons they shouldn't try to use this tool for what it's good at, and figure out ways to deal with the failure cases.
I find this very very much depends on the model and instructions you give the llm. Also you can use other instructions to check the output and have it try again. Definitely with larger codebases it struggles but the power is there.
My favorite instruction is using component A as an example make component B
>On two occasions, I have been asked [by members of Parliament], 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able to rightly apprehend the kind of confusion of ideas that could provoke such a question.
- Charles Babbage
If anything we now need to unlearn the rigidity - being too formal can make the AI overly focused on certain aspects, and is in general poor UX. You can always tell legacy man-made code because it is extremely inflexible and requires the user to know terminology and usage implicitly lest it break, hard.
For once, as developers we are actually using computers how normal people always wished they worked and were turned away frustratedly. We now need to blend our precise formal approach with these capabilities to make it all actually work the way it always should have.