For what it’s worth, a 5,000-word prompt written like prose is a terrible way to get what you want out of an AI image generator, even an LLM-driven one. Systems like ChatGPT or Gemini will internally crunch it down to maybe 200-250 words, disregarding much of your detail. Image models also get confused by that many tokens at once and lose coherence, so a huge prompt asking for a red apple on a blue plate on a yellow tablecloth in a green-painted room will end up using those colors seemingly at random.
If you want specific, complicated things that match your idea/vision rather than just whatever the AI coughs up, you’re getting into a lot of those other tools I mentioned earlier. A basic one is inpainting, where you mask part of an image and prompt for just that location, adding an apple or turning a green apple red (or making it rotten, turning it into a baseball, etc.). Then there’s inpaint sketch, where you draw on the image within the AI client to help guide it (make a red circle where you want an apple). You can take the image out of the client and draw on it in another utility, or photobash things in to get something closer, then import it again via img2img and work off that. You can use utilities like ControlNet, which lets you pose a person from a wireframe. If you are struggling with a specific item or style, you might seek out or train a LoRA to assist.
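To make the masking idea concrete, here is a toy sketch in plain Python. It only shows the role of the mask: pixels under the mask get replaced, everything else stays untouched. Real inpainting tools regenerate the masked region with the model, guided by your prompt and the surrounding image, so treat this as an illustration rather than how any particular client works. The function name `composite_inpaint` and the tiny 3x3 "images" are made up for the example.

```python
def composite_inpaint(original, generated, mask):
    """Take the generated pixel where mask is 1, keep the original where mask is 0.

    All three arguments are same-sized 2D lists (rows of pixels / mask values).
    This is a hypothetical helper for illustration, not a real library call.
    """
    return [
        [gen_px if m else orig_px
         for orig_px, gen_px, m in zip(orig_row, gen_row, mask_row)]
        for orig_row, gen_row, mask_row in zip(original, generated, mask)
    ]


GREEN = (0, 160, 0)
RED = (200, 0, 0)
WHITE = (255, 255, 255)

# A green "apple" in the middle of a white background.
original = [
    [WHITE, WHITE, WHITE],
    [WHITE, GREEN, WHITE],
    [WHITE, WHITE, WHITE],
]

# Stand-in for whatever the model produced for the prompt "red apple".
generated = [[RED] * 3 for _ in range(3)]

# Mask only the apple; 1 means "repaint here".
mask = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]

result = composite_inpaint(original, generated, mask)
```

Only the masked center pixel changes, which is the whole point: the prompt applies to the spot you picked, and the rest of the image is left alone.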
All of this is still AI image gen, though. I suppose you might sketch or draw a little outside of it to assist the AI, but that’s just giving nudges, not “drawing a picture”.
To be clear, most people don’t do all this stuff or anything like it. Most people just prompt and then prompt some more until they find something they can live with. However, people keep talking about super-detailed prompts as though that’s the only option AI image generation gives you, and that’s far from the case. It’s not even a good way to do it. You can have very tight control if you’re willing to put in the time and use the tools.
I usually think of it as similar in some ways to staged photography. If you photograph a man on a balcony, you can pick the model, location, color of his suit, time of day, etc., but you don’t decide on the size and texture of each brick, where the clouds are, the arrangement of the leaves in a tree, the shape of the skyline, etc. If the texture of his suit is important, you can put him in silk instead of linen. If it’s not, you don’t worry about the fabric.
AI image gen can be the same way. The stuff you care about, the stuff that’s important to your idea, you can get exactly as you want it. The stuff that’s less important you can worry less about. You’re not “drawing”, where you’d need to decide each millimeter of brick or the location of each branch in a tree.