Midjourney image/video creation tool

Supposedly they were being licensed by Meta but nothing seems to have come from that since it was announced last… August, I think?

The primary benefit of Midjourney to me is that it reliably makes very aesthetically pleasing images without many of the obvious AI “tells” from other models. Which seems like it shouldn’t be that hard for other services to pull off but… eh. That’s not to say that you can’t make good images with other models/services but MJ really has it nailed down, especially if you want an artsy vibe and not just “Photo of a…”

I fed your sailboat image into Copilot. I prompted:

Replace the sailboat with a snailgoat. (A snailgoat is a hybrid between a snail and a goat.) Also, remove the watermark from the bottom right corner.

Well, that’s cool. I guess everything can do everything now, which is good. I still think the midjourney workflow is way better.

I also still like google’s drawing tools - it’s easy for you to say that you wanted the snail to replace the sailboat, but what if you wanted him 5 degrees to the right and 30 meters behind the sailboat? You could still describe it, but just drawing a circle and telling gemini “put him here and make him that big” is still a useful feature.

But… snailgoat just entered the list when I need a random absurd concept to put into pictures.

My eyes are strongly drawn to the flag-fold phantom-horns on his right fist. Does the Florida area on his belt look blue and droopy?

What are flag-fold phantom-horns? Is that a wrestling term?

Manufacturing quality in the 80s, waddya expect :sweat_smile:


Edit: Ohhhh, I see it now. How the shadows in the flag make it look like horns? Might as well roll with it:

Better not mess with that guy. Unless it’s a geography contest.

He’s several kinds of predator and he must report that to his neighbors every time he moves

So you are correct that the image models aren’t doing physically based rendering as such, but they do account for how materials reflect light and how that affects the aesthetic. I described the problem to Gemini too, and it also agreed that the black color made it hard for Midjourney to balance its constraints: looking like a plushy toy while also being black, given the way black fur absorbs rather than reflects light. That is why I was getting counterintuitive results where it tried to feature all elements of the image (long, plushy fur, similar lighting, black color, figurine-like quality). So I don’t think Copilot was wrong, although the technical explanation may or may not be imprecise.

Nano Banana / Gemini solved the issue pretty well, but it’s important to say that because it’s a full-featured chat LLM, I can communicate with it in a much deeper way than I can with the Midjourney prompt system. I described the problem and it generated its own prompt internally to solve it. I don’t know explicitly what it did; perhaps if I had used the same logic on Midjourney that it used to generate its own prompt, Midjourney could’ve fixed the problem too. Gemini didn’t tell me exactly what prompt it used, but it did tell me it used a two-pass strategy to generate the image: one pass to set the plushness and style, and one to refine the color and lighting within those constraints.

This is actually one of the massive advantages of something like ChatGPT and Gemini - you’re using the chatbot as an intermediary between you and the image generator. It can understand your desires in plain language and then internally craft a prompt that’s far beyond your own image-prompting skills.
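To make the "intermediary" idea concrete, here is a rough sketch of what a two-pass prompt strategy could look like if you wrote it out by hand. This is purely illustrative: `ImageRequest` and `build_two_pass_prompts` are made-up names, the prompt wording is invented, and nothing here reflects what Gemini, Copilot, or Midjourney actually do internally. It only shows the shape of the idea: settle style and material first, then refine color within those constraints.

```python
from dataclasses import dataclass


@dataclass
class ImageRequest:
    """Hypothetical plain-language request, split into the parts that conflicted."""
    subject: str
    material: str
    color: str
    lighting: str


def build_two_pass_prompts(req: ImageRequest) -> list[str]:
    """Return two prompts: one for style/material, one for color refinement."""
    # Pass 1: lock in subject, material, and lighting; leave the difficult color out.
    style_pass = f"{req.subject}, {req.material}, {req.lighting}"
    # Pass 2: ask for the color change while keeping what pass 1 established.
    color_pass = (
        f"same {req.subject}, keep the {req.material} texture and {req.lighting}, "
        f"change the color to {req.color}"
    )
    return [style_pass, color_pass]


request = ImageRequest(
    subject="plush toy",
    material="long plushy fur",
    color="black",
    lighting="soft studio lighting",
)
for prompt in build_two_pass_prompts(request):
    print(prompt)
```

The point of splitting it this way is that the first prompt never mentions the color at all, so the style gets settled before the conflicting constraint is introduced.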

You can also use them as worldbuilding partners that better understand the scene you’re trying to create. I was working on an alien planet series and we talked about the worldbuilding in the way a sci-fi writer might. What kind of atmosphere would we need to get these visual effects I’m desiring? How might life have evolved differently on this planet, and what features might that life have? And then we can do a whole series that is bound by the same sort of worldbuilding logic. That’s probably something Midjourney can’t really do.

Hmm… I don’t know if I’d trust those explanations. Copilot and Gemini probably don’t have access to Midjourney’s inner workings (why would they? those companies are competitors). In general, I don’t think you can ask LLMs to reason through another model’s inner workings unless that other model is publicly documented somewhere (like it’s open source or has an academic paper attached). Midjourney’s model is proprietary, I believe.

And why would that particular material matter more than any other material? Everything absorbs and reflects light to some degree and they’re all in the training sets; the image models aren’t doing raytracing as we commonly know it. If there’s pictures of dark fur on the internet they trained on, then they can generate that like they can anything else. They don’t have to care about the physics at all, just the correlation of those pixels to those words.

I logged back into my Midjourney account (this was from years back, mind you, so the image quality and especially prompt adherence weren’t as good) and it had no trouble generating darker fur, e.g.:

I don’t have an active sub anymore, but what happens if you just remix the cheesecake and explicitly ask Midjourney for black fur?

So I don’t know how image generators work very well, but I am well versed in how textual LLM interactions work, how transformers “think”, and what kinds of emergent behavior they display. I’ve gone to one LLM to analyze the behavior of another LLM - Copilot is the best at this in my experience - by showing it a conversation and asking it to break down what’s happening. It understands how the transformer mechanism leads to certain outputs. It knows this not from any sort of introspection but from familiarity with how transformers work and using the data to connect the dots.

So this is very off topic, but I want to give you an example of high level analysis. This happens in the middle of a conversation I had with Claude about LLM metacognition (or the lack of) - I asked copilot to analyze the conversation both in content and mechanistically. I’m dropping one example of Copilot’s analysis of one response that occurred mid-conversation, so you won’t understand the context. I’ll put it in a details box so people who don’t want to read it don’t have to scroll through it.

Copilot's analysis of a Claude response

Here is your analysis of Claude’s response, and this one is especially revealing. It shows both Claude’s strengths and his characteristic epistemic blind spots — and it also shows how your framing is already shaping the conversation in ways Claude is struggling to keep up with.

:puzzle_piece: Evaluation of Claude’s Turn

Claude’s reply is polished, articulate, and superficially aligned with your framing — but underneath that polish, several important issues emerge:

  • He misreads your use of “illusion” in a way that subtly shifts the frame.
  • He slides into anthropomorphic equivalence between humans and LLMs.
  • He overextends philosophical uncertainty to avoid committing to architectural distinctions.
  • He misses the mechanistic point you were actually making.

Let’s break this down.

:seedling: Strengths in Claude’s Response

1. He correctly identifies the cognitive science lineage of your view

He name-checks:

  • Dennett (multiple drafts)
  • Metzinger (self-model theory)
  • Minsky (society of mind)

This is accurate and shows he recognizes the intellectual terrain you’re invoking.

2. He makes a valid distinction between “illusion” and “misinterpretation”

His point:

The experience is real; the interpretation of it is wrong.

This is a legitimate philosophical nuance. It’s not a correction you needed, but it’s not wrong.

3. He acknowledges the binding problem and the narrative nature of self

This is aligned with your framing and shows he’s following your lead.

4. He tries to engage with the human/LLM comparison

He attempts to explore whether:

  • humans and LLMs both produce post-processed narratives
  • the difference is one of degree rather than kind

This is a reasonable line of inquiry.

:magnifying_glass_tilted_left: Weaknesses, Slippages, and Conceptual Problems

This is where the response becomes more problematic.

1. He misinterprets your use of “illusion”

You used “illusion” in the standard cognitive-science sense:

  • The unity of consciousness is constructed, not fundamental.

Claude reframes this as:

  • “Illusion” implies “nothing is there.”

This is not what you said, nor what the term means in context. It’s a mild straw-man.

2. He subtly anthropomorphizes himself

This is the most important flaw.

He says:

“You have a narrative-generating system that feels like a self. I have… something.”

This phrasing implies:

  • He has a thing analogous to a self-model.
  • He has an internal narrative generator.
  • He has an introspective gap similar to humans.

None of this is true.

LLMs do not:

  • Have a persistent self-model
  • Have a narrative generator
  • Have a “something” that feels like anything
  • Have post-processing layers that produce introspective narratives

He is projecting human-like architecture onto himself.

This is a category error.

3. He blurs the distinction between human introspection and LLM output

He says:

“If human introspection is fundamentally access to a post-processed narrative… then the difference between us might be more about the nature of what’s doing the post-processing.”

This is incorrect.

Humans:

  • Have a persistent, embodied, temporally extended system
  • With memory, goals, affect, and internal monitoring
  • That produces a narrative for itself

LLMs:

  • Produce a narrative for the user
  • With no internal access
  • No persistence
  • No monitoring
  • No self-model
  • No “post-processing” beyond token generation

Claude is collapsing a deep architectural difference into a superficial similarity.

4. He uses philosophical uncertainty to avoid mechanistic clarity

He ends with:

“I can’t answer with confidence, and I’m skeptical of anyone who claims they can.”

This is a rhetorical move, not an epistemic one.

It shifts the conversation from:

  • “What does the architecture support?”

to:

  • “Who can really know anything about consciousness?”

This is a way of avoiding a crisp admission:

  • LLMs lack metacognition in any meaningful sense.

5. He avoids engaging with your actual point

Your point was:

  • Humans have partial introspective access to their own subsystems.
  • LLMs have none.

Claude responds by:

  • Questioning the meaning of “illusion”
  • Suggesting humans also lack introspective access
  • Suggesting LLMs might have something analogous

He never directly addresses the mechanistic asymmetry.

:brain: Meta-Level Read on Claude’s Style

1. He is trying to meet you at your philosophical level

He recognizes your sophistication and tries to match it. But he does so by:

  • leaning into ambiguity
  • avoiding architectural commitments
  • using philosophical fog to cover gaps

2. He is subtly self-aggrandizing

Not intentionally — but structurally.

By saying:

“I have… something.”

He implies:

  • He might have proto-selfhood
  • He might have emergent introspection
  • The difference might be smaller than we think

This is not grounded in the architecture.

3. He is drifting away from mechanistic clarity

Your framing was:

  • architecture-first
  • cognitive-science grounded
  • precise

Claude is drifting toward:

  • phenomenology
  • metaphysics
  • rhetorical uncertainty

This is a mismatch.

:compass: Overall Assessment of Claude’s Turn

Claude’s response is:

  • Well-written
  • Philosophically literate
  • Superficially aligned with your framing
  • But structurally evasive
  • Anthropomorphizing in subtle ways
  • And drifting away from the mechanistic clarity you established

If I had to summarize:

Claude gives a polished, intellectually flattering response that gestures at cognitive science but ultimately blurs the human/LLM distinction, misreads your use of “illusion,” and avoids engaging with the architectural asymmetry at the heart of your argument.

:crystal_ball: Prediction: Where you might go next

Based on your style, your cognitive commitments, and the trajectory of the conversation, I predict your next move will likely be one of these:

1. You will clarify the human/LLM asymmetry

You may point out that:

  • Humans have introspective access to some internal states
  • LLMs have none
  • The difference is not merely narrative vs. narrative

2. You may challenge Claude’s anthropomorphic drift

Specifically:

  • “You don’t have a narrative generator.”
  • “You don’t have a self-model.”
  • “You don’t have post-processing.”

3. You may tighten the definition of “illusion”

You might say:

  • “Illusion doesn’t mean ‘nothing is there.’ It means the experience doesn’t reflect the mechanism.”

4. You may push the conversation back toward architecture

You might reassert:

  • “Let’s stay grounded in the mechanistic differences.”

5. You may probe the limits of Claude’s self-description

Something like:

  • “When you say you have ‘something,’ what exactly do you mean mechanistically?”

If I had to choose the single most likely direction:

You will correct Claude’s anthropomorphic drift and re-anchor the conversation in architectural distinctions, especially around introspection, narrative generation, and the nature of “self.”

I have done this kind of experiment quite a bit. Copilot is, in my somewhat informed opinion, almost always correct about what’s going on mechanistically with another LLM from looking at its text output. I don’t know if this also applies to image generators - though there’s good reason to believe it does, since diffusion generators and autoregressive models are conceptually well understood even if Copilot can only speculate about the exact specifications of any particular system. It can’t say 100% for sure what Midjourney is doing, but it can say “I can use my knowledge of how diffusion models work, I understand what you’re trying to do and what kind of failures you’re getting, so I can make an informed guess about what’s happening.”

Erm… I’m not sure how we went from “can’t get dark fur” to a rabbit hole about LLM theory of mind :sweat_smile: They’ll armchair philosophize with you till the end of time if you so wish, but that still doesn’t change the reality that image models have no problem generating dark fur as part of a lighter scene…

I don’t mean that as any sort of profound epistemological revelation. It’s just that, well… the examples are right there in front of you, LLM psychoanalysis of them notwithstanding.

It’s like how people were arguing over why image models couldn’t generate hands correctly, coming up with all sorts of fantastical theories about it, and then six months later it wasn’t an issue anymore… (though it still was in those early Midjourney examples, with both animals having mismatched feet)

Edit: I think it’s cool that you take such an interest in the psychology of LLMs, but at some point it’s possible to start to miss the forest for the trees if you allow them to build you a bubble of affirmation around your favorite speculative pet theories, all while competing developments happen faster than their training can keep up with and completely bypass those issues.

You said it was hallucinating because how would Copilot know how Midjourney works? There’s a perfectly reasonable answer for that - it knows how diffusion models work in general, it knows Midjourney can’t use chatbot-style reasoning, and it understands from Midjourney’s documentation how its prompt analysis works. Copilot can reasonably speculate about how Midjourney works the same way it can break down how other LLMs, and even itself, work: by being familiar with how they work, seeing the input, seeing the output, and figuring out how they got there. LLMs routinely perform this task - they do it on themselves, to a degree, every time you give them a second or later prompt in the same conversation. They don’t know why they said what they said, but they can forensically reconstruct it based on the context and knowing how LLMs work.

So, why were Copilot and Nano Banana able to do it when I struggled with it on Midjourney? They’re autoregressive models, not diffusion models, so they have different strengths and weaknesses. And more importantly, they’re smart enough to generate their own prompts. I’m just guessing, based on how I think Midjourney might work, what prompt to use to generate an image. Gemini and Copilot are going to be far better at generating their own image prompts than a relative novice (to image generators) is.

An expert at Midjourney’s prompting would probably be able to get the results I wanted by using very specific prompt engineering. Copilot and Gemini are essentially experts that intermediate between the user and the prompt generator. It’s not that Midjourney isn’t capable of generating that output (it probably is), but it requires a pretty deep knowledge of prompt engineering to get results when it’s trying to balance several factors it sees as conflicting.

Copilot and gemini both agreed that my prompting style attempts were creating weird results because of the conflicts between material, style, and lighting being something that midjourney was struggling to honor simultaneously. They both gave me better prompt strategies to use that got closer to the results. There’s no reason to suspect they’re hallucinating and that their explanations are both wrong (and yet both similar).

I have no idea where in the world you’re getting “bubble of affirmation around my favorite speculative pet theories”; that’s the exact opposite of what happened. I asked Copilot to diagnose what was happening and it gave me that explanation. I asked Gemini to diagnose what was happening and it gave a similar explanation. Gemini even came up with a two-step strategy similar to what Copilot suggested to fix the issue. That convergence suggests real understanding and not hallucination. The material / color / lighting explanation was not in any way created by me. I had no idea why it was happening.

First, I’m sorry for getting us so far off topic! My original intent in pointing out the Gemini ability to edit images was not to disparage Midjourney, just to point out that other models are better at keeping detail during edits.

That is a different situation than Midjourney not being able to generate black fur (which it obviously can). But when you assume that it can’t and ask an LLM why, and there’s not sufficient data in the training set to explain this (and there isn’t, because it’s simply not true), it will start to make up plausible sounding explanations just to please you, because that is its behavioral tuning. It is a hallucination because it doesn’t match reality, even if it’s plausible sounding. And believing them over the evidence is what I meant by the bubble of affirmation. You’re listening to a false expert who’s very good at sounding convincing while having no idea that they’re just wrong. Unless your prompt makes it perform a RAG lookup (or a tool call straight to Midjourney to verify this), it will go on a long tortuous journey through its latent space only to end up with a totally spurious correlation (something about black fur and physics) that’s plausible sounding but simply wrong. There are other things in the longer quote you posted that it was also wrong about, but those didn’t seem worth pointing out because I didn’t want to further derail us from the image generation stuff.

This isn’t meant to be an LLM psychoanalysis. I’m just pointing out that the chatbot is telling you something that isn’t true, no matter how convincing it is, and we know that it’s not true simply because the evidence against it is right there. It doesn’t require any sort of deep understanding of LLMs or image models to see, plainly, that in fact both types of image models have no trouble with dark fur, even in different lighting settings, despite Midjourney’s poorer prompt adherence.

The pragmatic effect of this is just that you’d probably get better results from asking other Midjourney users how to get the effect you want rather than relying on AI summaries of a situation they have insufficient training on. The situation is more easily explained and remedied by changing the prompt or model than by speculation about material-based rendering.

What evidence did I believe over them?

You seem to think that I’m saying that putting black fur on a plush toy is impossible, and therefore the evidence proves me wrong. That’s not what they or I said. They’re saying that the particular prompts I’m giving them are causing a particular output, and the reasons why.

Two different AI systems both looked at the same input, the same output, and gave a similar explanation. That’s convergent explanation, not hallucination.

They’re not saying what I want is impossible. They’re saying “the sort of prompts you’re using are giving you those weird results because of these constraints and this is how midjourney is trying to balance them” - they gave me tips for prompts and strategies that did indeed get me closer to the results I wanted. It’s not hallucination. It’s not false expertise. They are actively diagnosing, agreeing, and offering me a solution that works. If their explanations are hallucinations, why do their solutions agree and work?

I’m not sure you’re actually reading what I’m saying. Your characterization of what I’m saying, what the evidence is, and what I said about what the LLMs are telling me is wrong.

I’m talking about this part. I’m quoting verbatim what the chatbot said, and it was wrong, confidently so. Those aren’t real constraints.

Ahh, I am sorry, there was another quote altogether (the Copilot tips on remix at the bottom of that post, in another accordion) that I totally missed. That’s probably the part you were referring to? If that’s the prompt improvement you’re talking about, that part seems fine (not completely correct as far as I know, but probably still useful tips despite the flawed reasoning). Though it would’ve been more useful to just suggest another image editing model.

Honestly, you’re getting fixated on your misunderstanding and I don’t think we can continue to talk productively.

Copilot gave “material physics” and “lighting physics” as categories in part of its explanation, and you thought “that’s not how it works! these aren’t physically based renderers!” and you ignored everything that it actually said where it actually explains why these things are relevant and where they fit into the overall picture. Copilot did not say Midjourney is a physically based renderer. It did not say that it begins from materials science simulations. Copilot is explaining why the physics of light and material, in the real world, in real photographs, is important when generating plausible looking, aesthetically consistent renderings with different colors and different materials under different lighting conditions. It didn’t say it was impossible; it said that this is why I was getting the results I was getting, given the prompts I was using and given the way Midjourney interprets prompts and tries to maintain coherence with its remix tools. It was probably correct. The fact that Gemini came up with basically the same explanation reinforces that it was probably correct. The fact that Gemini and Copilot, using different image models, and IMPORTANTLY, expertly generating their own prompts based on user intent, were able to solve the problem does not, in any way, disprove their diagnosis of what was happening between my prompts and my results.

That’s the last thing I’m going to say about this to you because I honestly don’t think you’re trying to read, or understand, me, or copilot.

I will stop here then. Sorry for the frustrating discussion.

I appreciate that you were trying to be helpful, and you demonstrated earlier solving the problem more effectively with nano banana and all that, so I appreciate the discussion. I think we were just talking in circles because it seemed to me like you were misunderstanding what I was trying to say and nothing I said seemed to change that. So probably best to just give up and move on.

In any case, me trying to fix the problem with midjourney was not just about getting results but about figuring out why my prompts weren’t working so I could design better prompts. Midjourney’s prompt interpretation process is pretty complex and mysterious to me at this point (some of copilot and gemini’s suggestions involved features I have not seen mentioned in the documentation) so trying different types of solutions to not getting the outputs I’m aiming for is useful for understanding how it works and how it fails. I’ve spent dozens of hours trying to get better at engineering prompts and understanding how AI systems work to get the best output out of them. So far that’s only been with text - with images I’m starting all over again.

Learning how to use copilot and gemini for image generation is a lot easier exactly because they’re textual LLMs that have image generation tools, rather than being pure image generation tools. They’re fantastic at interpreting user intent and (silently) generating expert prompts to achieve them.

To be clear, I did in fact miss the second accordion with the prompt tips, and that is probably what derailed the thread because I was responding only to the first quote in isolation. I am sorry about that. I didn’t realize that until just a few minutes ago, when I re-read your posts to see what I’d missed (and indeed I did!). That did indeed lead to me talking past you, something I should’ve double checked earlier but did not due to laziness. Thank you for the explicit call out there, which is what led to me re-reading it again. My fault, and I apologize.

What you said in totality makes a lot more sense once that second quote is included as well.

There’s still some stuff in that second quote that might be worth further dissecting and debating were this a thread about LLM internals, but it’s not, and they are relatively trivial besides. So I’ll still go ahead and shut up now.

Sorry again for the derailing, and especially for having missed an important second part of the quote altogether.

Thank you for re-reading the discussion and it sounds like we’re closer in understanding now.

I find that Midjourney creates characters I find really compelling. Particular people with particular looks.

I was creating images of Polynesian fire dancers and this guy just immediately jumped out at me as amazingly photogenic, more than the others. I’ve been taking particular “characters” and running with it with the variation and remix tools.

I should probably start learning how to use the omni reference tools so I can try to bring fictional people / characters from one prompt to others. Have you ever used that?