When A.I. Can Make a Movie, What Does “Video” Even Mean?
Joshua Rothman – For the past couple of weeks, I’ve been making a home video on my phone, using Apple’s iMovie software. The idea is to weave together clips of my family that I’ve taken during the month of February; I plan to keep working on it until March. So far, the movie shows my five-month-old daughter cooing and waving her arms; my five-year-old son chasing me with a snowball; and a visit to the spooky, run-down amusement park in our town, among other things.
I thought of my movie while absorbing the announcement, yesterday, of Sora, an astonishing new text-to-video system from OpenAI, the makers of ChatGPT. Sora can take prompts from users and produce detailed, inventive, and photorealistic one-minute-long videos. OpenAI’s announcement featured many fantastical video clips: an astronaut seemingly marooned on a wintry planet, two pirate ships dueling in a cup of coffee, and “historical footage of California during the gold rush.” But two other clips were more intimate, the sort of thing that an iPhone might capture. The first was generated by a prompt asking for “a beautiful homemade video showing the people of Lagos, Nigeria in the year 2056.” It “captures,” if that’s the word, what seems to be a group of friends, or perhaps relatives, sitting at a table at an outdoor restaurant; the camera pans from a nearby open-air market to a cityscape, which is divided by highways sparkling with cars at dusk. The second shows “reflections in the window of a train traveling through the Tokyo suburbs.” It looks like footage any of us might capture on a train; in the glass of the window, you can even see the silhouettes of passengers superimposed on passing buildings. Curiously, none of them seem to be filming.
These videos have flaws. Many have a too-perfect, slightly cartoonish quality. But others seem to capture the texture of real life. The wizardry behind this is too complicated to describe easily; broadly speaking, it might be right to say that Sora does for video what ChatGPT does for writing. OpenAI claims that Sora "understands not only what the user has asked for in the prompt, but also how those things exist in the physical world." In its statistical, mind-adjacent, (probably) unconscious way, it grasps how different kinds of objects move in space and time and interact with one another. Sora "may not understand specific instances of cause and effect," the developers write—"for example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark." And yet the A.I.'s over-all comprehension of the objects and spaces it conjures means that it isn't just a system for generating video. It's a step "towards building general purpose simulators of the physical world." Sora performs its work not just by manipulating pixels but by conceptualizing three-dimensional scenes that unfold in time. Our own heads probably do something similar; when we picture scenes and places in our mind's eye, we're imagining not just how they look but what they are.