Kind of a bizarre choice, it seems to me… It’s not like its capabilities weren’t impressive enough without dressing them up; this sort of misdirection only serves to cheapen them.
Well… Gemini is only a bit better than GPT-4 on standard tests, and the multi-modal capabilities are a step up but not leaps and bounds better than the current state of the art. The video, on the other hand, would be incredible if it were true–years beyond what OpenAI has, with instant responses and no text prompting. I’m fine with demos being cherry-picked and not completely representative–but here they’re showing functionality that doesn’t exist at all.
Yes, this is not trivial. The ability to look through a camera, see someone show you something, and understand the implicit prompt there is lightyears beyond where we are now.
Yeah, I never thought Google would play fast and loose with a video like that. Not cool.
The one example that blew me away was when it was shown a ‘connect the dots’ picture without any lines, and it identified what it was before the dots were connected, seemingly building its own model or something. Now I’m wondering if that was faked as well and they actually showed it the completed drawing first.
The blog doesn’t explain that one. But there is this:
In the video, they just ask “which of these would go faster?” But the actual prompt is completely different, mentioning “aerodynamic” from the start. It’s impressive, but much less so than if it came up with that from scratch. The bit about using “specific visual details” is also something of a cheat.
Yeah, they definitely gave it some clues. Still, the ability to determine which shape is more aerodynamic in a pencil drawing is pretty good. But yeah, it’s not making the leaps of judgement that it seems to in the video.
Plus, while that’s the answer a human would probably give, determining which shape is actually more aerodynamic is a genuinely hard problem, one that’s difficult even for specialized aerodynamic simulations and that would need far more precise shape information than a pencil sketch provides. Sometimes boxy shapes really are more aerodynamic than sleek ones.
It would be great to see its response if the human came back and said, “Explain in detail how you came to that conclusion”.
I haven’t read most of this thread and don’t know if it’s been asked before, but I’ve been curious about something: is it possible to deconstruct an AI-created text? Can you have another AI analyze content and figure out what prompts were used to create it? Because that would seem like a useful filtering tool.
FWIW – which probably isn’t much, but still … out of curiosity I posed the question in text form to good ol’ primitive ChatGPT 3.5:
Me: I have two toy cars at the top of a smooth incline. They are the same length and width, the same weight, and identical in all respects except that one has a boxy shape with straight, flat surfaces, while the other is shaped like a wedge with the thin end at the front. Will one roll down the incline faster than the other, and why?
ChatGPT: The toy car with the wedge shape is likely to roll down the incline faster than the boxy-shaped car. This is because of differences in aerodynamics and air resistance.
The wedge-shaped car offers less resistance to the air as it moves down the incline. The streamlined shape helps reduce air resistance, allowing the car to move more smoothly through the air. On the other hand, the boxy-shaped car with flat surfaces may experience more air resistance, slowing it down as it moves down the incline.
In general, objects with more aerodynamic shapes experience less air resistance and, therefore, move more efficiently through the air. This is why you often see streamlined shapes in vehicles designed for speed, such as race cars or high-speed trains.
But then I asked if it would make a difference if the cars, otherwise identical, had different weights. That’s when it got stupid.
Indeed, this was one of the most disappointing interactions I’ve had with it (keeping in mind that this is just the old GPT 3.5). It created and stubbornly clung to the idea that “greater mass” is somehow associated with “larger size”, and in subsequent exchanges vacillated between the Galilean principle that all objects accelerate at the same rate and the hallucination that, in the presence of air resistance, more massive objects would experience more air resistance and accelerate more slowly. The hallucination persisted across multiple exchanges and despite several attempts to correct it.
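For the record, the physics runs the other way: for two cars identical in shape and size, air resistance doesn’t depend on mass at all, so the heavier one should actually pull ahead. A quick back-of-the-envelope sketch (ignoring rolling resistance and wheel inertia):

```latex
% Newton's second law along an incline of angle \theta, with a drag
% force F_d(v) that depends on shape, size, and speed but not on mass:
m\,a = m\,g\sin\theta - F_d(v)
\qquad\Longrightarrow\qquad
a = g\sin\theta - \frac{F_d(v)}{m}
% The drag term is divided by m, so at any given speed the heavier car
% loses less acceleration to drag and reaches the bottom first: the
% opposite of the "heavier means slower" hallucination. In a vacuum
% (F_d = 0), both accelerate identically, per Galileo.
```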
So of course, after other discussions elsewhere, I had to try out ChatGPT 3.5 (I’m not paying for GPT 4) on a couple of the Winograd schemas I mentioned there (intended to test for real-world understanding). Here is the entire session. I think it’s fairly impressive for this particular now-obsolete implementation of the technology. I have bolded the particular part of the GPT response that I think deserves special attention:
Me: In the statement “I spread the cloth on the table in order to protect it”, what does the pronoun “it” refer to?
ChatGPT: In the statement “I spread the cloth on the table in order to protect it,” the pronoun “it” refers to the table. The cloth is being placed on the table to protect the table.
Me: Please explain your reasoning for this conclusion.
ChatGPT: Certainly! Let’s break down the sentence to understand the referent of the pronoun “it”:
“I spread the cloth on the table in order to protect it.”
“I spread the cloth on the table”: This part of the sentence describes the action of placing the cloth onto a surface, and in this case, the surface mentioned is the “table.”
“in order to protect it”: The purpose or intention of spreading the cloth is stated here, and the pronoun “it” is used. In the context of the sentence, the most logical interpretation is that “it” refers to the nearest noun mentioned, which is the “table.”
Therefore, the pronoun “it” refers to the “table,” indicating that the action of spreading the cloth is done to protect the table.
Me: Let’s try another example. In the statement “the city councilmen refused the demonstrators a permit because they feared violence”, who does the pronoun “they” refer to?
ChatGPT: In the statement “the city councilmen refused the demonstrators a permit because they feared violence,” the pronoun “they” refers to the city councilmen. The sentence suggests that the city councilmen refused the permit due to their fear of violence.
Me: In the prior example of the cloth on the table, you reasoned that the referent of the pronoun “it” was the table because you said it was the “nearest noun mentioned”. In this second example, however, the demonstrators were the nearest referent. So how do you conclude that “they” refers to the more distant noun “councilmen”?
ChatGPT: I appreciate your observation, and you’re right to point out the potential ambiguity in the sentence. In this case, the antecedent of the pronoun “they” could indeed be interpreted differently based on the structure of the sentence. Let’s analyze it again:
“The city councilmen refused the demonstrators a permit because they feared violence.”
It is true that “demonstrators” is the closest noun to the pronoun “they.” However, the context and common sense play a crucial role in determining the referent. The city councilmen are the ones taking an action (refusing the permit), and it is typically the entity with agency in a sentence that pronouns refer to. In this case, it makes more sense for “they” to refer to the city councilmen, who are the decision-makers expressing a fear of violence as the reason for refusing the permit.
While it’s true that the structure of the sentence might allow for a momentary ambiguity, the overall meaning of the sentence and the logical connection between the action and the fear of violence support the interpretation that “they” refers to the city councilmen.
ETA: The “city councilmen vs demonstrators” example is perhaps the first and probably most famous of the Winograd schema tests. It’s entirely possible that ChatGPT has seen it before in its training data.
A fun head-to-head battle between LLM implementations:
https://chat.lmsys.org/?arena
And the leaderboard results:
Unfortunately (but unsurprisingly), proprietary models dominate the top slots, with GPT-4-Turbo in the top position (and an older GPT-4 model in #2). Apparently, the new Mixtral-8x7B model is doing pretty well vs. GPT-3.5, though.
The Claude-2.x models actually do worse than Claude-1. One wonders if “safety” concerns have damaged the new models.
In my tests, claude-2 does much better at generating code and following instructions. “Performance” here is going to be very task-dependent.
The prompts here are user-submitted, so they certainly cover a wide variety of tasks, though whether they’re representative of real prompts is unknown.
I tried the prompt I used for the xkcd comic: “Write a few sentences about the solar system without using the letter ‘i’”. Both were total failures, not bothering at all with the condition. But the models it picked were just mistral-7b-instruct and llama-2-13b-chat, neither of which is that great.
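For what it’s worth, that failure is trivial to score mechanically; a throwaway checker along these lines is all it takes (the sample reply is just a stand-in, not actual model output):

```cpp
#include <cctype>
#include <iostream>
#include <string>

// Count occurrences of the letter 'i' (either case) in a model's reply;
// a reply that actually obeys the constraint scores zero.
int countLetterI(const std::string& text) {
  int count = 0;
  for (unsigned char c : text) {
    if (std::tolower(c) == 'i') count++;
  }
  return count;
}

int main() {
  // Stand-in reply that happens to satisfy the constraint.
  std::string reply = "Our sun anchors many worlds, each on a steady path around a star.";
  std::cout << "letter-i count: " << countLetterI(reply) << "\n";
  return 0;
}
```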
Finally found a program for running local LLMs that isn’t a piece of crap: LM Studio.
Pretty much works out of the box, no stupid Python dependency issues or installing 10 different libraries to get it working or anything. You can browse/download LLMs right from the interface and run them.
Mixtral-8x7B is indeed very good–quite a bit better than GPT-3.5, IMO. Gave it the task of explaining some code I wrote, including some questions about specific areas that it needed deep knowledge about (Why did I call that function 20 times? Because the code is for an ESP chip, and it’s reading an analog pin, and reading multiple times gives more robust results.) I hope there’s something comparable to GPT-4 not long from now. Or maybe I don’t–I’ll need to buy a crapload more RAM, maybe a way better GPU.
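For anyone curious what I mean by reading the pin multiple times, here’s a minimal sketch of the idea rather than my actual code; it assumes a standard Arduino-style analogRead() on an ESP board and a made-up pin number, and just averages a burst of samples to smooth out ADC noise:

```cpp
#include <Arduino.h>

const int SENSOR_PIN = 34;    // hypothetical ADC-capable pin on an ESP32
const int NUM_SAMPLES = 20;   // one noisy reading becomes 20 averaged ones

// Average several ADC readings so a single noisy sample doesn't dominate.
int readAnalogAveraged(int pin, int samples) {
  long sum = 0;
  for (int i = 0; i < samples; i++) {
    sum += analogRead(pin);     // each raw reading jitters on its own
    delayMicroseconds(100);     // brief pause between samples
  }
  return (int)(sum / samples);  // the average is much more stable
}

void setup() {
  Serial.begin(115200);
}

void loop() {
  Serial.println(readAnalogAveraged(SENSOR_PIN, NUM_SAMPLES));
  delay(500);
}
```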
Still totally failed my xkcd test, though…
Try out and rate several different LLMs for free here:
The site won’t let you have too deep a conversation with any of them, and will cap and erase conversations left open too long, but it’s still fun to try out different things. So far, in my estimation, GPT-4 variants beat everything else hands down. I haven’t yet drawn two different GPT-4 versions in the head-to-head contests. The tests I’ve done that involve continuing story segments tend to reveal deep flaws in the others’ reading comprehension. I’ve heard good things about Gemini-Pro elsewhere, but I can’t really run it through its paces the way I want to; their Gemini-Pro setup has serious token limits.
Heh. Awesome.
What’s nuts is that it’s barely a parody at this point. From Google Gemini:
Obviously, by the same standard, it can’t give me advice on C/C++ coding at all, not to mention assembly, since all code is “unsafe” in the way they mean it.
At least the open models like Mixtral aren’t as bad as this, and they’re only getting better over time. I don’t know how things will play out, but it’s only a matter of time before we have GPT-4 level open models without the lobotomization.