this post was submitted on 17 May 2024
302 points (97.2% liked)
there's some stuff image generating AI just can't do yet. it just can't understand some things. a big problem seems to be referring to the picture itself, like position or its border. another problem is combining things that usually don't belong together, like a skin of sky. those are things a human artist/designer does with ease.
There's a lot.
Some of it doesn't matter for certain things. And some of it you can work around. But try creating something like a graphic novel with Stable Diffusion, and you're going to quickly run into difficulties. You probably want to display a consistent character from different angles -- that's pretty important. That's not something that a fundamentally 2D-based generative AI can do well.
On the other hand, there's also stuff that Stable Diffusion can do better than a human -- it can very quickly and effectively emulate a lot of styles, if given a sufficient corpus to look at. I spent a while reading research papers on simulating watercolors, years back. Specialized software could do a kind of so-so job. Stable Diffusion wasn't even built for that, and with a general-purpose model, also not specialized for that, it already can turn out stuff that looks rather more-impressive than those dedicated software packages.
I think creating a LoRA for your character would help in that case. It's not really easy to do yet, but it's technically possible, so it's mostly a UX problem.
A LoRA is good for replicating a style where there's existing material to learn from; it helps add training data for a particular subject. There are problems that existing generative AIs smack into that it's good at fixing. But it's not a cure-all for all limitations of such systems. The problem I'm referring to is kinda fundamental to how the system works today -- it's not a lack of training data, but simply how the system deals with the world.
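To make the "not a cure-all" point a bit more concrete, here's a rough sketch of what a LoRA does mechanically: it adds a small low-rank correction to a weight matrix the model already has. This is plain Python on toy matrices, not how any real trainer is implemented, and the names `matmul` and `apply_lora` are made up for illustration.

```python
def matmul(x, y):
    """Plain-Python matrix product of two lists-of-rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def apply_lora(w, b, a, alpha):
    """Return w + (alpha / r) * (b @ a).

    w is the frozen m x n base weight, b is m x r, a is r x n,
    and r is the (small) rank of the update.
    """
    r = len(a)                      # number of rows in a == rank
    delta = matmul(b, a)            # m x n low-rank correction
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy example: a 2x2 base weight with a rank-1 update.
base = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[3.0, 4.0]]
tuned = apply_lora(base, b, a, alpha=1.0)
```

The update nudges weights that are already there. It doesn't give the model a new kind of internal representation, which is why I wouldn't expect it to fix the multiple-angles problem on its own.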
The problem is that today's diffusion-based image generators think of the world as a series of largely-decoupled 2D images, linked only by keywords. A human artist thinks of the world as 3D, can visualize something -- maybe using a model to help with perspective -- and then render it.
So, okay. If you want to create a facial portrait of a kinda novel character, that's something that you can do pretty well with AI-based generators.
But now try and render that character you just created from ten different angles, in unique scenes. That's something that a human is pretty good at. Here's a page from a Spiderman comic:
https://spiderfan.org/images/title/comics/spiderman_amazing/031/18.jpg
Like, try reproducing that page in Stable Diffusion, with the same views. Even if you can eventually get something even remotely approximating that, a human, traditional comic artist is going to be a lot faster at it than someone sitting in front of a Stable Diffusion box.
Is it possible to make some form of art generator that can do that? Yeah, maybe. But it's going to have to have a much more-sophisticated "mental" model of the world, a 3D one, and have solid 3D computer vision to be able to reduce scenes in its training corpus to 3D. And while people are working on it, that has its own extensive set of problems.
Look at your training set. The human artist slightly stylized stuff or made errors that human viewers can ignore pretty easily, but that a computer vision system, which doesn't work exactly like human vision, might go into conniptions over. For example, look at the fifth panel there. The artist screwed up -- the ship slightly overlaps the dock, right above the "THWIP". A human viewer probably wouldn't notice or care. But if you have some kind of computer vision system that looks for line intersections to determine relative 3D positioning -- something that we do ourselves, and is common in computer vision -- it can very easily look at that image and have no idea what the hell is going on there.
Or to give another example, the ship's hull isn't the same shape from panel to panel. In panel 4, the curvature goes one way; in panel 5, the other way. Say I'm a computer vision system trying to deal with that. Is what's going on there that the ship is a sort of amorphous thing that is changing shape from frame to frame? Is it important for the shape to change, to create a stylized effect, or is it just the artist doing a good job of identifying what matters to a human viewer and half-assing what doesn't matter? Does this show two Spidermen in different dimensions, alternating views? Are the views from different characters, who have intentional vision distortions?
I mean, understanding what's going on there entails identifying that something is a ship, knowing that ships don't change shape, having some idea of what is important to a human viewer in the image, knowing from context that there's one Spiderman, in one dimension, etc. The viewer and the artist can do it, because the viewer and the artist know about ships in the real world -- the artist can effectively communicate an idea to the viewer because they not only have hardware that processes the thing similarly, but also have a lot of real-world context in common that the image-generating AI doesn't have.
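For a sense of how literal the low-level machinery is, here's a minimal sketch of the kind of primitive a line-intersection approach might rest on: a standard orientation test for whether two 2D line segments cross. This isn't taken from any particular vision system, and the function names are just for illustration.

```python
def orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p).

    Returns 1 for a left turn, -1 for a right turn, 0 if collinear.
    """
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (cross > 0) - (cross < 0)


def on_segment(p, q, r):
    """True if r is inside the bounding box of segment pq.

    Only meaningful when p, q, r are already known to be collinear.
    """
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0])
            and min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))


def segments_intersect(a, b, c, d):
    """True if segment ab and segment cd share at least one point."""
    o1, o2 = orientation(a, b, c), orientation(a, b, d)
    o3, o4 = orientation(c, d, a), orientation(c, d, b)
    if o1 != o2 and o3 != o4:
        return True  # endpoints of each segment straddle the other
    # Collinear cases: an endpoint lies on the other segment.
    return ((o1 == 0 and on_segment(a, b, c))
            or (o2 == 0 and on_segment(a, b, d))
            or (o3 == 0 and on_segment(c, d, a))
            or (o4 == 0 and on_segment(c, d, b)))


# A hull edge crossing a dock edge, in made-up image coordinates.
crossing = segments_intersect((0, 0), (2, 2), (0, 2), (2, 0))
```

A test like this only answers "do these two lines cross?". It has no way to tell a crossing that reflects real geometry from one where the artist's pen slipped, and that judgment is the part that needs the real-world context.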
The proportions aren't exactly consistent from frame to frame, don't perfectly reflect reality, and might be more effective at conveying movement or whatever than an actual rendering of a 3D model would be. That works for human viewers. And existing 2D systems can kind of dodge the problem (as long as they're willing to live with the limitations that intrinsically come with a 2D model) because they're looking at a bunch of already-stylized images, so can make similar-looking images stylized in the same way. But now imagine that they're trying to take stylized images, then reduce them into a coherent 3D world, then learn to re-apply stylization. That may involve creating not just a 3D model, but enough understanding of the objects in that world to understand what stylization is reasonable to create a given emotional effect and be reasonable to a human, and when. People may not care that that ship is doing some impossible geometry, but might care a whole lot about the number of limbs that Spiderman has. Is it technically possible to make such a system? Probably. But is it a minor effort to get there from here? No, probably not. You're going to have to make a system that works wildly differently from the way that the existing systems do. That's even though what you're trying to do might seem small from the standpoint of a human observer -- just being able to get arbitrary camera angles of the image being rendered.
The existing generative AIs don't work all that much the way a human does. If you think of them as a "human" in a box, that means that there are some things that they're gonna be pretty impressively good at that a human isn't, but also some things that a human is pretty good at that they're staggeringly godawful at. Some of those things that look minor (or even major) to a human viewer can be worked around with relatively-few changes, or straightforward, mechanical changes. But some of those things that look simple to a human viewer to fix -- because they would be for a human artist, like "just draw the same thing from another angle" -- are really, really hard to improve on.
On the other hand, there are things that a human artist is utterly awful at, that diffusion-based generative AIs are amazing at. I mentioned that they're great at producing works in a given style, and can switch styles virtually effortlessly. I'm gonna do a couple Spiderman renditions in different styles; each takes about ten seconds a pop on my system:
Spiderman as done by Neal Adams:
Spiderman as done by Alex Toth:
Spiderman in a noir style done by Darwyn Cooke:
Spiderman as done by Roy Lichtenstein:
Spiderman as painted by early-19th-century English landscape artist J. M. W. Turner:
And yes, I know, fingers, but I'm not generating a huge batch to try to get an ideal image, just doing a quick run to illustrate the point.
Note that none of the above were actually Spiderman artists, other than Adams, and that briefly.
That's something that's really hard for a human to do, given how a human works, because for a human, the style is a function of the workflow and a whole collection of techniques used to arrive at the final image. Stable Diffusion doesn't care about techniques, how the image got the way it is -- it only looks at the output of those workflows in its training corpus. So for Stable Diffusion, creating an image in a variety of styles or mediums -- even ones that are normally very time-consuming to work in -- is easy as pie, whereas for a single human artist, it'd be very difficult.
I think that that particular aspect is what gets a lot of artists concerned. Because it's (relatively) difficult for humans to replicate artistic styles, artists have treated their "style" as something of their stock-in-trade, where they can sell someone the ability to have a work in their particular style resulting from their particular workflow and techniques that they've developed. Something for which switching up styles is little-to-no barrier, like diffusion-based generative AIs, upends that business model.
Both of those are things that a human viewer might want. I might want to say "take that image, but do it in watercolor" or "make that image look more like style X, blend those two styles". Generative AIs are great at that. But I equally might want to say "show this scene from another angle with the characters doing something else", and that's something that human artists are great at.
I don't think it's supposed to solve every problem. Just like not every scene in the new Sand Land anime was 3D, the same goes for every other artistic tool. Some things are easy with some tools, and others those tools aren't well suited for.
What you have to ask yourself is what ways can it help you with what you're trying to do.
I think Corridor Digital made an AI animated film by hiring an illustrator (after an earlier attempt with a general dataset) to "draw" still frames from video of the lead actors, with Stable Diffusion generating the inbetweens.
It's even hard to impossible to generate the image of a person doing a handstand. All models assume a right-side-up person.
This hasn't been true for months at least. You really have to check week to week when dealing with things in this field.
think of an episode of any animated series with countless handmade backgrounds. good luck generating those with any sort of consistency or accuracy; you will be calling for an artist who can actually take instructions and iterate
We'll soon be hearing that only Luddites care about continuity errors