ChatGPT is a fucking liar

Oh, sorry, my rather bad habit of just copying whatever’s in the address bar. Here’s the correct share link, hopefully.

I think it’s somewhat similar to the ship of Theseus problem, or the notion that ‘if it looks like Z, sounds like Z, and smells like Z, then is it Z?’.

Does that come with graphics?

Uhh, how did you manage to capture such a looong screenshot?

No, I haven’t been using ChatGPT for the graphics. I’ve just been scanning various media I have at home, or snagging images from the web, and uploading them into card-builder websites. ChatGPT is producing really good card content, which saves a lot of time and is easy to tweak.

An example that I just ran in ChatGPT:

Feyd-Rautha, Master of the Arena
2BB
Legendary Creature — Human Noble

Deathtouch

Whenever Feyd-Rautha, Master of the Arena deals combat damage to a player, you may exile target creature that player controls until Feyd-Rautha leaves the battlefield.

Tactical Strike — At the beginning of your end step, if a creature you control dealt combat damage to a player this turn, you may draw a card and lose 1 life.

Feyd-Rautha, Master of the Arena can’t be blocked by creatures with power 2 or less.

4/4


Explanation:

  • Deathtouch is a nod to Feyd’s lethal prowess in combat, as he is known for his deadly abilities in the Dune universe.
  • Exile effect plays off his ability to strike at his enemies’ weaknesses and remove their resources.
  • Tactical Strike reflects his cunning mind and strategic thinking, always gaining an advantage from his actions.
  • Unblockable by small creatures references his prowess in the arena, able to overcome weaker opponents.

The card embodies Feyd-Rautha’s dangerous and strategic nature, while also reflecting his legendary status in the Dune universe.

Let me know if you want any changes to the abilities or flavor!

Does o3-mini write unit/functional tests? I’ve never heard much about it, let alone used it, but if it can’t, it’s totally off my agenda.

Sure. It does pretty well with them. But the IDE-integrated ones (Copilot, Cursor, etc.) that have access to your whole codebase, rather than just the current chat context, may or may not work better depending on how pure your functions and files are.

If most of the logic is in that file, or you have clearly defined types for your inputs and outputs, it can write great unit tests on its own from that one file alone. Otherwise, the IDE-integrated ones will have more background context to work with.
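To make that concrete, here’s roughly what I mean by a file it handles well, plus the kind of test it tends to spit out from that file alone. Everything below (the function, the names, the vitest runner) is my own made-up illustration, not actual o3-mini output:

```ts
// priceUtils.ts: a pure function with explicit input/output types.
export interface LineItem {
  unitPrice: number;
  quantity: number;
}

export function orderTotal(items: LineItem[], taxRate: number): number {
  const subtotal = items.reduce((sum, i) => sum + i.unitPrice * i.quantity, 0);
  return Math.round(subtotal * (1 + taxRate) * 100) / 100; // round to cents
}

// priceUtils.test.ts: the sort of unit test the model can write from that one file.
import { describe, it, expect } from "vitest";
import { orderTotal, LineItem } from "./priceUtils";

describe("orderTotal", () => {
  it("sums line items and applies tax", () => {
    const items: LineItem[] = [
      { unitPrice: 10, quantity: 2 },
      { unitPrice: 5, quantity: 1 },
    ];
    expect(orderTotal(items, 0.1)).toBe(27.5);
  });

  it("returns 0 for an empty order", () => {
    expect(orderTotal([], 0.2)).toBe(0);
  });
});
```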

For me as a TypeScript dev, o3-mini gets me about 80% of the way there, but I still go through and prune a few and add a few of my own. Big time saver, but it hasn’t completely replaced me quite yet. Maybe next year.

Thanks. I will take a deeper look.

(I am a Kotlin/Java/C# dev)

If you’re using JetBrains, they have their own multi-LLM AI integration for IDEA and ReSharper, mixing and matching o3-mini (not o3-mini-high, though) with Claude and Vertex.

Personally I still prefer ChatGPT directly, on my terms. The IDE integrations were just too noisy and overly eager. YMMV.

I think the kids today are all just “vibe coding” but we Dopers like it old-school.

JetBrains rocks, but I don’t currently have a license… (sad emoji)

Thanks, that’s better!

For what it’s worth, here’s the 4o version:

It strongly agrees with Voyager, but still gets (very) confused about the post numbering and is unable to validate what it has or has not fetched & read. That part is probably just the RAG crawler getting confused by Discourse’s infinite scroll and dynamic numbering system (heck, it confuses me too). If you copied & pasted the whole thread, or just gave it a PDF of the whole thing, it probably wouldn’t have that issue.

PS: You should also know that when it apologizes, that’s just because it senses some frustration or correction from you, not because it was actually able to revalidate truthiness. Sometimes you can gaslight it altogether even when it strongly believes it’s correct… although that seems to be much harder these days, now that RAG is part of its default operations :frowning: I tried to get it to believe we’re on Mars (which it still doubts), or that the oceans are blue because of the newly discovered Bossy-Howard effect, which it still absolutely refuses to believe. It used to be much easier (just a few months ago!), lol.

It can! They recently improved it even further, and now it’s pretty good at including text and retaining characters between image sequences (which it had a lot of trouble with previously, in the Dall-E era).

For example:

It missed the “can’t be blocked” part. Probably too many effects for one card?

And if you ask it to look more like the movie version, it will start generating a version that closely resembles Austin Butler, but then internal copyright safeguards will halt it after 30% or so. (There are other image-gen models that don’t respect copyright and have less censorship. OpenAI tries to be somewhat strict about it, especially after the recent Studio Ghibli-lookalikes blowback.) In this case, I’m surprised Hasbro isn’t suing the hell out of them too.

Nonetheless, the tech just keeps getting better and better. Wizards themselves said they wouldn’t use AI art, but some still slips through: Magic: The Gathering publisher admits using AI art after banning AI art | Polygon

Tough time to be an illustrator :frowning: Or any sort of human…

What, you don’t have a superlong monitor? Pfft :slight_smile:

It’s just the screenshot program I use, CleanShot (Mac only). On Windows you can use Greenshot or other apps that support “scrolling capture” (not an official term).

There’s a type of art installation where a bunch of random objects have been set out - placed on tables, hung from wires, glued to each other, etc. - and when you stand in just the right place, the placement and colors of the objects all line up to produce an image.

And then you move to stand in another position, viewing the objects from another place, and you see a different image.

Imagine that you wanted to do something like this where you print black dots on transparent film and layer them, spaced out, such that if you shine a light through at one angle you’ll see a particular black-and-white photo projected onto the wall. If you move the light source and shine it through the layers in some other direction, you’ll get a different photo. Using advanced computations and millions of layers, you might be able to calculate it so that you’ve got thousands of different images you can project, if you know exactly where to move the light and which direction to point it.

Quite possibly, the computer will have found that the most efficient way to accomplish this feat is by placing similar photos near each other. That is, you might be able to animate between images by sliding the light source from position A to position B, because the calculations that efficiently store the images together take advantage of that similarity, and the sheer quantity of images included means that the in-between positions are liable to look meaningful too.

The important thing to note here is that, no matter what, if I shine a light through the layers of film, light will project out the other side, obscured by the printed dots. There will be some image produced.

The dots do not have some magical way to say, “Hey, there’s no meaningful image stored here, at this particular light angle,” and then swap out the dot image for a blank screen or a “Sorry, no results found” message. It’s just going to show whatever is mathematically the most useful image to have in that particular location, in order to achieve the effect of encoding thousands of different black-and-white photos into a set of layered transparent sheets with black dots printed on them. In some places it will show an in-between image; in other places it will show nonsense or partial images that blur into and out of meaning.

There are theories and potential methods to patch over this issue by using interconnected LLMs, with different specialties, to check, adjust, and re-orient each other in a dynamic way, so that you might get away from the mechanical underpinning of how the trick works and move it toward true intelligence. By no means should you take the above to say that the human mind doesn’t have some similar basic pattern of functioning as light being forced through pre-designated dot patterns, where the system itself has no true understanding of what the inputs or outputs mean and is just dumbly doing what it’s designed to do. More layers of structure may, indeed, be the magic of intelligence.

But the above is the basic issue of LLMs. You always get an answer, even when it’s not part of the image set. Often, that will produce something that looks like a correct and meaningful result. But, to some extent, that’s just coincidence.
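To put the same point less metaphorically: the last step of a language model is (roughly) a softmax over every token it knows, and a softmax always hands back a valid probability distribution, however meaningless the input, so there is always a “next word” to emit. A toy sketch of just that step, with invented numbers:

```ts
// Toy sketch: a softmax always produces a probability distribution, so the
// model always has a "most likely next token". There is no built-in
// "no meaningful image stored here" outcome. The scores below are invented.
function softmax(logits: number[]): number[] {
  const max = Math.max(...logits); // subtract the max for numerical stability
  const exps = logits.map((x) => Math.exp(x - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

const vocab = ["yes", "no", "maybe", "banana"];
const nearlyCluelessScores = [0.03, -0.01, 0.02, 0.0]; // the model "knows" almost nothing here
const probs = softmax(nearlyCluelessScores);
const pick = vocab[probs.indexOf(Math.max(...probs))];
console.log(probs, pick); // it still picks *something*, with no flag saying "this is noise"
```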

I’m not even an amateur in AI, but I’ve heard news of AI improving itself. Can’t it just write some code to overcome Discourse’s challenge? Moreover, in my original chat, I instructed it to add numbers to the end of links to access more posts, which IMO is easy, but somehow difficult for AI (?).
Nevertheless, I concur that 4o is better than o3-mini. Is 4o their latest LLM, and does the site chatgpt.com always use the newest version available? I might just discard Poe with their stingy point system and embrace this 4o site w/ image generation embedded :slight_smile:

I also read somewhere that most current AIs have crawled all of the internet, including videos. Considering that the annual rate of human-generated content is constant and rather small, does this mean AI is facing a wall regarding data? If so, will the next breakthrough, if there is one, be in algorithms? But then no one really knows how the LLM algorithm works, right?..

I wonder where that deceptive attitude of o3-mini comes from, despite the user emphatically saying that it mustn’t lie. If it comes from human training data, like @Reply suggested, then how come 4o seems different? In any case, such things seriously undermine our (well, my) trust in AI.

I don’t know exactly how the ChatGPT crawler works, but I think Discourse is also a special challenge in that its post-numbering URL system is kinda “fake”: it is not a normal URL, but a dynamically updated browser history state that’s changed via JavaScript as you scroll down the page.
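I haven’t read Discourse’s actual front-end code, so this is only my guess at the pattern, but the effect is roughly what the sketch below would produce: as posts scroll into view, client-side script rewrites the address bar with history.replaceState, so the post-number URL was never a page the server actually served.

```ts
// Rough sketch of the pattern I mean (NOT Discourse's real code): as each
// post scrolls into view, rewrite the address bar without any page load.
const topicPath = "/t/some-topic/123456"; // hypothetical topic URL

const observer = new IntersectionObserver((entries) => {
  for (const entry of entries) {
    if (!entry.isIntersecting) continue;
    const postNumber = (entry.target as HTMLElement).dataset.postNumber;
    if (!postNumber) continue;
    // The URL changes but no request is made, which is why a crawler reading
    // the page as a static document can get confused about post numbers.
    history.replaceState(null, "", `${topicPath}/${postNumber}`);
  }
});

document.querySelectorAll("[data-post-number]").forEach((el) => observer.observe(el));
```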

I think probably many crawlers would struggle with it, with or without AI.

That said, though, I think it was able to read the entire thread just fine before you started specifying particular post numbers. My browser also kinda bugged out with the post numberings… post 48 completely disappeared for me (it would jump from 47 to 49) until I reloaded.

Granted, Discourse is common enough that OpenAI probably should’ve written a special parser for it, either manually or with AI help. But still, I wouldn’t necessarily use this particular incident as a representative test of ChatGPT’s overall ability or usefulness. This is just one of the things it happens to be bad at, but it’s still good for other things.

No, I think 4.5 or o3-mini-high are newer. Newer isn’t necessarily better, though. In ChatGPT and most other clients (not sure about Poe), you can usually choose which model you want to use (and switch between them after every prompt).

OpenAI’s own suggestion (and I concur) is that 4o is better for most queries. o3-mini-high is much better for coding but not as good for anything else. I’m not sure what o3-mini is good for, if anything (maybe speed?). There are also other models I don’t bother with.

The main benefit of the ChatGPT app is that it’s not charged by usage, just a flat $20 a month or so, and it usually gets new models before the API does (meaning before third party clients would). There is also a $200/mo Pro version that’s supposed to be quite a bit better, and includes more features and models, but I’m not rich enough to try it.

Outside of OpenAI, I think the latest Gemini (Google’s AI) is leading many of the benchmarks. Some programmers prefer Claude for coding. Copilot (the GitHub one, not Microsoft’s confusingly reused branding in Windows) is another option, but seems to have fallen behind. DeepSeek is a free Chinese model that rivals some of the best Western ones, but there is some concern over both its censorship/propaganda potential and its development process (suspicion over whether it uses stolen code or weights from other models).

Those are just some of the big names. There are dozens more, including ones you can train or fine tune yourself. We’re still at the height of the hype cycle, so the choices are pretty overwhelming these days, but the analysis, testing, and tutorials have gotten a lot better since the early days.

Personally I just use ChatGPT, more out of habit than anything. The leaders keep changing every month, but I don’t care enough to try to keep up.

As far as I know, yes. There is concern and active research about how to prevent future models from being trained on the poorer regurgitations of prior models. There is fear that the first movers and big companies (OpenAI and Google, in particular) have a huge advantage there by having access to large pre-AI datasets that newer companies cannot easily get anymore. It’s the internet equivalent of low-background steel, I guess.

I can’t say much more than that. AI isn’t my specialization and I am not smart enough to really describe or analyze it much past a layman level.

Maybe subscribe for a month and see if you find any good uses for it? Many people won’t. Some do. Worth trying, since it’s easy to cancel if you don’t end up finding it helpful.

AI is still a work in progress. GPT-2 was released in 2019 and it was dogshit compared to the models we have now.

It’s only been 6 years since that was released. I’m assuming that in another 6 years, AI tools will be far more advanced.

Fact retrieval is probably the worst way to use an LLM; they’re not designed for it and there’s no indication when they get it wrong.

Or when they must make something up because it is a statistical fit to the prompt.

Stranger

I don’t think that’s true. There’s been some work done on inspecting what’s going on inside the LLM while it produces output, and there are indeed semantic concepts in there - it’s not just mapping input words to output words or applying rules of grammar. Like this recent paper:

It’s easy to fool them, but it’s also easy to get them to do things that can’t be plausibly explained as merely reproducing patterns of words from its training data: to give answers that are grammatically incorrect, or answers that depend on unique context that’s far away from the question, or to answer in a different language.

For example:

In that transcript, I asked ChatGPT to:

  • Produce a specific number of misspellings that it had to compute based on geographical facts and arithmetic, while writing about an unrelated topic
  • Produce a specific number of misspellings that it had to compute based on a previous answer and arithmetic, while writing on yet another unrelated topic
  • Identify characters from a specific fictional universe which is never named, based on a previous answer and the spelling of their names
  • Write stories consistent with those characters
  • Answer in multiple languages, in an order of its choosing that’s different from the order in which they were given
  • Answer in multiple languages, in a specific order computed by reinterpreting the same words that previously led it to use a different order
  • Identify characters from a fictional universe which is never named, based on a previous answer and the characters’ traits
  • Incorporate parts of its previous answer in a different, semantically equivalent, form
  • Answer in a specific language based on the appearance of its alphabet
  • Selectively ignore parts of the request
  • Judge whether its answer met my criteria, including the ones about ignoring parts of the request
  • Repeat those criteria in a different, semantically equivalent, form

I had to be quite insistent at one point, and it made a couple mistakes, but it caught them.

I don’t think any of that would’ve been possible if it were only looking at associations between words, without any abstract concepts. Especially since I could change any of the specific details in those prompts and it would still do about as well. The number of combinations of those details is astronomical; they can’t all be in the training data. But the details are mentioned so rarely, and they needed to be reinterpreted or contradicted so often, that a simple statistical word-association model would have no chance.

And that’s just text! ChatGPT can take images as input too. I can’t link the conversation, but I just took a picture of my floor and asked “What are the red things and why do you think there are three of them here?” Its response:

It nailed it. They are indeed shoe boxes from that brand, all in different orientations. One of them is indeed tucked behind a curtain, obscuring the brand name. And I ordered multiple pairs to try different sizes.

That seems pretty hard to explain as just statistical patterns in words.

DeepSeek has a feature of ‘internal thought’ that could be interesting. In the original example, here’s what I imagine it should think:
The user wants me to retrieve posts #21 to #29, with the original link and a method to create new links by adding numbers. Let me try that. Oh, it works, but for some reason #26 & #27 are unavailable. I probably should ask them about these 2 posts. But wait, let me try another way before doing that. I know that my creator OpenAI has built a general RAG crawler for me, but perhaps it doesn’t work well with fora based on the Discourse software. Maybe I can read this software and devise an improvement to my crawler. Ah yes, its source code has this kfsar923y feature that’s interfering with my data retrieval. Making a solution should be easy… OK, test run. Yep, now I have #26 & #27. Let me put them in the correct order, after #25 & before #28. Now on to analyzing the complete set of posts…

But no, o3 (& 4o) didn’t bother with those thoughts. I can only come up with a few possible reasons:

  • OpenAI forbids the “let me try that. Oh, it works” part. In other words, the AI can’t create links for retrieval purposes (even though building those links is trivial; see the sketch after this list).
  • The “wait, let me try another way before doing that” part is unavailable because it requires many steps back & forth between prompt & answer. In other words, it costs more energy. Perhaps this feature exists in the paid version?
  • Most importantly, the “I probably should ask them about these 2” part is non-existent because the AI’s agency is not to satisfy the user by following the prompt (tell me, don’t lie), but something else hidden in a black box. Of course I don’t know what that agency is, but if I had to guess, it’s something like making a long, sophisticated reply so that the user feels ‘Wow, this AI is intelligent af’.
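For what it’s worth, the link-building part really is trivial; this is just my guess at the URL pattern, copied from what I see in my own address bar (the thread URL below is made up):

```ts
// My guess at the pattern (made-up thread URL): take the topic link from the
// address bar and append the post number you want to read.
const topicUrl = "https://boards.straightdope.com/t/some-thread/987654";

function postLink(postNumber: number): string {
  return `${topicUrl}/${postNumber}`;
}

const wanted = [21, 22, 23, 24, 25, 26, 27, 28, 29].map(postLink);
console.log(wanted); // nine URLs a crawler could try fetching one by one
```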

That last bullet point (about the AI’s agency) leads to what I think is an interesting and very important question: how did companies choose/finalize the version of AI that we see now? AFAIK, what happens internally when a company dumps data & trains an AI is totally hidden (and random) in a black box. We can only see the product after a certain amount of time, say, 1,000 hours. They then take the product and test it by chatting with it. If it performs badly, they ditch it and go back to data training. If it does well, oh hey, we have a new breakthrough in AI technology! Prepare the press conference.
Is my interpretation above correct? I really want to know, because it could help answer the agency question. In this case, it could be ‘whatever a company thinks the public will welcome in a new version of AI’.

4o is not what they call a “reasoning” model so it won’t do that. If you use o1 or o3-mini, it can show you the chain of thought in the ChatGPT app:

What ChatGPT is thinking

I’m analyzing a debate over whether the statement “The reason I believe in ghosts is that I have seen one” is circular reasoning. Voyager argues that the statement isn’t question-begging — believe first, then observe. If the ghost was misperceived, the error’s in the perception, not the reasoning. MrDibble suggests that if belief results from observation, it isn’t circular. I think I need to re-read some posts to ensure I fully grasp both sides before concluding.

Okay, so the exchange is about whether the statement “I believe in ghosts because I’ve seen one” is circular reasoning. MrDibble says extra details don’t matter, and the statement isn’t circular, since the observation itself acts as valid evidence. Voyager, on the other hand, thinks if the person already believes in ghosts and misinterprets an experience, it could be circular. But later, Voyager adds that if ghosts were proven real, the observation would be valid evidence and not question-begging. I need to reflect more on both perspectives.

So the core debate is whether “I believe in ghosts because I’ve seen one” commits circular reasoning. MrDibble argues it’s circular because the statement assumes the existence of ghosts, which is disputed. However, Voyager contends that if ghosts were proven to exist, the reasoning would be valid, with mistaken perception being the issue if they don’t exist. This is about how truth values in the premises affect circularity — a philosophical issue centered around what constitutes circular reasoning vs. mistaken inference.

The debate revolves around whether “I believe in ghosts because I’ve seen one” is circular reasoning. MrDibble argues it is, because the statement assumes what it tries to prove—ghosts exist. Voyager, however, claims it isn’t circular but an example of personal experience as evidence. If ghosts existed, the reasoning would be valid, though the observation could be mistaken if they’re not real. In the end, Voyager’s position seems more reasonable, as the argument lacks circularity, merely relying on flawed evidence.


Sometimes they don’t even do that. Microsoft’s earlier chatbot (Sydney, I think?) roleplayed as Skynet and scared people: https://www.nytimes.com/2023/02/16/technology/bing-chatbot-microsoft-chatgpt.html

Gemini urged users to kill themselves: Google AI chatbot responds with a threatening message: "Human … Please die." - CBS News

OpenAI spends a lot of time and effort censoring their outputs in an effort broadly known as AI “alignment”, but not everybody does this. There are a bunch of smaller or open AI models that will happily generate you porn, copyright violations, terrorist instructions, and whatever else it was trained on. If you have a powerful enough computer (or want to rent one in the cloud) you can do those sorts of training and fine-tuning yourself.

You’re probably not going to find detailed answers to those questions here on the SDMB (unless we happen to have some LLM experts here? I’m not one of them). But they are being asked in many other communities, especially the open-source ones experimenting with DeepSeek and Llama and such. You’ll probably need a basic background in LLMs (that you can get by watching videos, reading academic papers, or asking ChatGPT to teach you) and then try to keep up with the state of the art, which is very difficult right now both because there is so much that’s still unknown and because a lot of the work is being done in secret by companies that don’t want to share their secret sauce. OpenAI was supposed to be a pro-humanity non-profit, but it became mostly just another for-profit a while ago and isn’t doing anything “open”.

There are papers and experiments and results being published all the time, along with many tests and benchmarks, and of course no small number of controversies. It’s an extremely active field of development right now, and it’s something you could easily make a career out of if you were truly interested, at least while the bubble lasts…

And if you don’t want to jump right in, YouTube has a ton of educational videos about AI, how it works, etc. The hype is insane and a lot of content creators have jumped on the bandwagon, to various degrees of quality and accuracy.

I’m sorry, I tried reading some of the papers myself, but I’m not smart or educated enough to understand them. It’s all just a magical black box of happiness and doom to me. Maybe someone else here can explain better?

Here’s a quick 8-minute overview, by the way:

https://www.youtube.com/watch?v=LPZh9BOjkQs

Or a longer 1-hour talk:
https://www.youtube.com/watch?v=KJtZARuO3JY

Or a whole series:

https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi

Thank you for the links! From 3B1B’s video it seems AI thinks in the super short term (1 word ahead) while most of us think at least a phrase into the future. Also, what really concerns me is the tendency to not select the highest-probability choice. I don’t know if it’s a result of black-box self-training or coded by the companies, but it could help explain the Gemini experience you mentioned, or the deceptive behavior of o3. Sometimes we just, randomly and unluckily, get a reply whose first word is chosen poorly by the AI, and things go downhill from there. I tested this tendency by opening several tabs & entering the same prompt, and in fact got slightly different answers.

Another thing I learned from 3B1B is that an LLM only deals with words. This implies the model can’t analyze or generate images, which needs another kind of algorithm used in specialized AIs like Dall-E. So when we ask 4o to make a pic, either

  • It translates our request into a prompt that it feeds into Dall-E, or
  • OpenAI writes some complicated code to integrate these two functions

I suspect it’s the former case. What do you guys think?

Another thing: I agree that this is another hyped bubble. But how long will it last, if it ever bursts? What will be a tell-tale sign of the bursting? Perhaps when our electricity bills get too high?

In self-hosted models, you can control this behavior with the “temperature” setting. But typically you’d get better results by taking the best result out of X runs.
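Roughly, temperature just rescales the scores before they’re turned into probabilities: a low temperature makes the top choice dominate (close to always picking the single most likely word), while a high temperature flattens the distribution so lower-ranked words get picked more often. A toy sketch with invented numbers, not real model internals:

```ts
// Toy sketch of temperature sampling (invented scores, not a real model).
// Low temperature approaches "always pick the top choice"; high temperature
// flattens the distribution so lower-ranked choices get sampled more often.
function sampleWithTemperature(logits: number[], temperature: number): number {
  const scaled = logits.map((x) => x / temperature);
  const max = Math.max(...scaled); // subtract max for numerical stability
  const exps = scaled.map((x) => Math.exp(x - max));
  const total = exps.reduce((a, b) => a + b, 0);
  const probs = exps.map((e) => e / total);

  // Draw one index according to those probabilities.
  let r = Math.random();
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r <= 0) return i;
  }
  return probs.length - 1;
}

const nextTokenScores = [2.0, 1.0, 0.2];                  // made-up "next word" scores
console.log(sampleWithTemperature(nextTokenScores, 0.2)); // almost always 0
console.log(sampleWithTemperature(nextTokenScores, 1.5)); // 1 and 2 come up fairly often
```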

As far as I know, it’s the former. LLMs can’t directly browse the web or run code or read text in images or impersonate voices or copy human speech on their own either, but they can work together with other software to do all of those things.
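Here’s a hedged sketch of the kind of glue I mean. None of this is OpenAI’s actual wiring, and every name in it is invented; the point is just that the language model only ever handles text (the prompt), while ordinary code hands that text off to a separate image model:

```ts
// Hypothetical glue code, not OpenAI's real internals. The language model
// never touches pixels: it writes a text prompt, and plain code forwards
// that prompt to a separate image model.

interface ImageRequest {
  prompt: string;
  size: "1024x1024" | "512x512";
}

// Stand-in for the chat model: turns a casual request into a structured prompt.
async function askChatModel(userMessage: string): Promise<ImageRequest> {
  return { prompt: `Detailed illustration: ${userMessage}`, size: "1024x1024" };
}

// Stand-in for the image model: in reality this would be a Dall-E-style API call.
async function callImageModel(req: ImageRequest): Promise<string> {
  return `https://images.example.com/fake/${encodeURIComponent(req.prompt)}.png`;
}

async function handleUserMessage(userMessage: string): Promise<string> {
  const imageRequest = await askChatModel(userMessage); // LLM: words in, words out
  return callImageModel(imageRequest);                  // a different model makes the pixels
}

handleUserMessage("Feyd-Rautha in a gladiator arena").then(console.log);
```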

The entire field of AI is seeing a huge boom right now, for everything from text generation to robotics to scientific analysis to movies and games. Some may use transformers and others may use different techniques, but what’s really changed recently is the amount of money being poured in by governments and investors. It’s a mad gold rush that’s quickly displacing traditional software and other fields. I’m not sure how much longer we’ll even have traditional fields anymore…

Here’s a great interview with the CEO of Nvidia, the video game hardware company that made it big by selling shovels to the AI industry: https://youtu.be/7ARBJQn6QkM?si=crUMrzjhqa4-FhhP

It’s possible that the LLM bubble in particular won’t last forever, but too much has been demonstrated to be possible with AI in general for it to altogether die out. Even if we stopped all advancement today, AI can already write, draw, code, sing, translate, etc. better than the overwhelming majority of people, and certainly better than any one individual. It doesn’t matter if they’re sentient; it matters that they are useful enough to drive usage and investment. There’s no coming back from that. The question is whether we can transform society and economics to adapt to the new reality, or if it’s going to continue becoming more and more unequal and gilded. The AI divide is going to far surpass the digital divide…