Dall-E 2, Midjourney, and Stable Diffusion: The Future of Film and TV

Preface

This blog post will cover some technical aspects, but I will try to stay high-level to keep it interesting for readers who are more film/TV than tech.

Below, I talk about the various tools and what we have learnt about them. If you aren’t interested, or just want to see the comparison, you can jump to the side-by-side photos and our thoughts on how they will affect the industry here.

Introduction

Recently there have been incredible leaps in computer vision technology (artificial intelligence and machine learning).

Several image diffusion algorithms have become available to beta testers. These tools allow us to enter text prompts to generate a corresponding image.

These algorithms are trained on millions of images. After training, they are able to ‘dream’, using the training data as inspiration to create entirely novel images, which are owned commercially by the person generating them.

In this blog post we are going to evaluate the three biggest tools available: Dall-E 2, Midjourney, and Stable Diffusion.

Inherent Algorithmic Styles

I preface this section by saying that the following thoughts are from my own experience, and that with an effective prompt, all of these tools can generate good results.

Furthermore, the algorithms are rapidly evolving: as I write this blog post, Midjourney has just shifted to version three of its algorithm, which has considerably changed the prompt output.

Midjourney

Midjourney was the first good text-to-image model that I had access to, during its closed beta program for early testers. I’ve tested it from V1 to V3.

I personally preferred the characteristics of the v2 algorithm; in the side-by-side comparisons I’ve seen, v3 outputs lose a lot of ‘character’ and become more generic. However, I’m confident this will be addressed from v4 onwards.

Since multiple good articles comparing the versions side by side already exist, I won't cover the differences in this blog post. This post will include prompts from multiple Midjourney versions.

I will start by saying that I was blown away by how good the algorithm was from the beginning. There was seemingly no fuss surrounding it and then it appeared overnight.

Images generated with Midjourney from various prompts have a recognisable 'artistic style' in the same way you'd expect from a human artist.

It is fascinating that while these algorithms are trained on such an incomprehensible amount of data, they still have their own hallmarks - 'dialects' if you will.

Midjourney excels at ‘dark’ image content: cyberpunk, dystopian cities, post-apocalyptic scenes, ghost towns, etc.

Rendering portraits in a render-engine style like ‘Octane render’ also yields exceptional results, as in the ‘Westworld’ robot prompt below, for example.

westworld cowboy robot uprising portrait octane render detailed hair, sunset behind them

Dall-E 2

Dall-E 2 has the most hype surrounding it on the internet and thus the longest waiting list to sign up. It needs little introduction. From my experience it is the easiest of the tools to generate ‘photorealistic’ images with minimal coercion from the prompt.

Dall-E 2 is currently the largest undertaking of this type. Precise numbers are hard to find, but the algorithm is rumoured to have cost $300,000 to train.

I tend to get better results with Dall-E when tying it back to reality and giving it contextual clues, such as film titles for the ambiance I’m envisioning.

To generate the Dall-E digital art rabbits below, I found the outputs much better when listing ‘Interstellar (2014)’ in the prompt, which I didn’t do for Midjourney or Stable Diffusion.

The outpainting feature of Dall-E (where you can edit and extend a portion of the image) is amazingly useful and very easy to use. I’ve had excellent results with it on the claymation image below. Unfortunately, the editor crashed and I lost the assembly, but I managed to recreate it.

close up studio photo of a claymation figure posing on a table infront of a greenscreen, detailed eyes, f/4.0 dslr 8k

Of all the tools, I find that Dall-E is the best at handling complex prompts; it was the only one that would consistently generate the claymation figure in front of a greenscreen as described in the prompt.

Stable Diffusion

Stable Diffusion was the latest to the fray. I joined during the first wave of testers, got to hear some exciting thoughts from the developers about the future (touched upon later), and joined the very active and enthusiastic user group on Discord.

Stable Diffusion is an exciting prospect because it is an open-source model. This means that many people can contribute to its development, and in the longer term the team behind it expects people to host their own ‘tuned’ versions, retraining the model on specialised images relevant to their audience to cater better for that specific niche.

I find that it requires more fine-tuning of the prompts, but thanks to the architecture you can rapidly generate results and iterate on them.

In the advanced options you can set the number of denoising steps and the seed number.

In layman’s terms, more denoising (or inference) steps make the algorithm do more processing, which takes longer and usually (but not always) generates a higher-quality output.

The seed number is a magic number that lets you recreate identical outputs. Keeping the same seed will result in the same image being generated from the prompt each time.

Slightly changing the seed slightly changes the interpretation, while drastically changing it creates a completely different output.

The ability to tune the denoising steps and seed number, along with other settings, gives Stable Diffusion the most stable base to work from. Without these settings, the other tools can feel a bit like shifting sands.

Additionally, generation is much quicker, because any machine-learning-capable hardware is able to self-host and run it.
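To make that concrete, here’s a minimal sketch of what self-hosting looks like using Hugging Face’s diffusers library (my choice for illustration; the checkpoint name, seed, and settings below are placeholders, not a recommendation):

```python
# Minimal sketch of self-hosting Stable Diffusion with Hugging Face's
# diffusers library. Assumes a CUDA-capable GPU; the checkpoint name,
# seed, and settings are illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",  # illustrative model checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# A fixed seed pins the starting noise, so the same seed + prompt +
# settings will regenerate the same image.
generator = torch.Generator("cuda").manual_seed(1234)

image = pipe(
    "photo of a man with a red hat",
    num_inference_steps=50,  # more denoising steps: slower, usually higher quality
    guidance_scale=7.5,      # how strongly the output follows the prompt
    generator=generator,
).images[0]
image.save("red_hat.png")
```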

As a result Stable Diffusion is the most flexible tool and I expect it to become my favourite in time.

An interesting, lesser-known feature of Stable Diffusion is that word frequency factors into the prompt. Take, for example, the following three prompts using the same seed:

bunny
bunny bunny
bunny bunny bunny
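If you’d like to reproduce this experiment, here’s a rough sketch continuing the diffusers snippet from earlier (the seed value is arbitrary):

```python
# Same seed, increasing word frequency. Re-seeding before each call keeps
# the starting noise identical, so only the prompt text changes.
for prompt in ["bunny", "bunny bunny", "bunny bunny bunny"]:
    generator = torch.Generator("cuda").manual_seed(1234)
    image = pipe(prompt, generator=generator).images[0]
    image.save(f"{prompt.count('bunny')}_bunnies.png")
```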


‘Magic’ Tokens and Prompt Craft

Dall-E, Midjourney, and Stable Diffusion have all been trained on existing images and the metadata associated with them. ‘Prompt craft’ is the art of writing good prompts for each of these tools.

Some features are under the hood, like specifying image aspect ratios such as 16:9 or 2.39:1, or the depth of field of the generated image.

It goes without saying that a photography background can give you an advantage in capturing the shot you want.

Most models seem to loosely abide by f-stop specifications. f/13.0, for example, will produce a deep depth of field, while f/1.0 or f/1.2 will attempt a shallow depth of field (ideal for portraits), though it helps to accompany these with ‘bokeh’.
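To illustrate the idea, here’s a hypothetical little prompt-builder. The helper function and its token choices are entirely my own invention for demonstration; none of the tools expose anything like this officially:

```python
# Hypothetical prompt-builder for demonstration only; the token choices
# are mine and not an official feature of any of these tools.
def build_prompt(subject, f_stop=None, aspect_ratio=None, extras=()):
    parts = [subject]
    if f_stop is not None:
        parts.append(f"f/{f_stop}")
        if f_stop <= 1.4:
            parts.append("bokeh")  # shallow depth of field reads better with 'bokeh'
    parts.extend(extras)
    prompt = ", ".join(parts)
    if aspect_ratio:
        # Midjourney-style flag; Stable Diffusion takes width/height instead.
        prompt += f" --ar {aspect_ratio}"
    return prompt

# Deep depth of field for a landscape:
print(build_prompt("waterfall at dawn", f_stop=13.0,
                   extras=["wide shot", "8k"], aspect_ratio="16:9"))
# Shallow depth of field for a portrait:
print(build_prompt("studio portrait of a girl with black glasses",
                   f_stop=1.2, extras=["85mm leica lens"]))
```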

I will write another blog post on f-stops, t-stops and aspect ratios another day. Boy, have I spent a lot of time talking about rectangles in my life.

For now here is a handy f-stop chart I found on the internet. (Credit to phototraces.com)

[F-stop chart]

Likewise, if you are trying to generate hyper-real/game-engine-looking assets, it helps to know render engines like Octane Render and Unreal Engine.

Compare and Contrast

Okay, let’s look at some comparisons. For the sake of best results with faces, I’ve sometimes cleaned them up with a facial restoration algorithm from Tencent to remove artefacts.
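For the technically curious, the clean-up step might look something like this minimal sketch, assuming the open-source gfpgan Python package (the face restoration model from Tencent ARC) and an illustrative weights path:

```python
# Minimal sketch of face clean-up with GFPGAN (Tencent ARC's open-source
# face restoration model). The weights path is illustrative; download
# them from the GFPGAN repository first.
import cv2
from gfpgan import GFPGANer

restorer = GFPGANer(
    model_path="GFPGANv1.3.pth",  # illustrative path to the downloaded weights
    upscale=2,                    # also upscales the whole image 2x
    arch="clean",
    channel_multiplier=2,
)

img = cv2.imread("generated_portrait.png", cv2.IMREAD_COLOR)
_, _, restored = restorer.enhance(
    img, has_aligned=False, only_center_face=False, paste_back=True
)
cv2.imwrite("restored_portrait.png", restored)
```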

I will write another blog post later about upscaling techniques.

I’ve also played around with the prompts until I got the best results I could out of each tool. A one-size-fits-all approach does not work; each algorithm has its own quirks, and this is part of the ‘prompt craft’. In the spirit of openness and pushing creativity forward, I’m sharing all the prompts I used for you to compare.

First up is photorealism.

Man in a red hat.

Midjourney: photo of a man with a red hat
Stable Diffusion: photo of a man with a red hat
Dall-E: Man in a red hat

Bunny in a space helmet.

Midjourney: bokeh studio photographic portrait photo of a cute fluffy bunny wearing an astronaut space helmet, dramatic studio backlighting, dslr 8k
Stable Diffusion: Portrait photo of a tiny fluffy bunny with floppy ears a wearing an astronaut space helmet helmet dramatic studio backlighting dslr 8k super close up bokeh studio photographic
Dall-E 2: A close up bokeh studio photographic portrait photo of a cute fluffy baby bunny wearing a tiny fitted space helmet, dramatic studio backlighting, dslr

Studio portraits.

Midjourney: bokeh award winning, close up photo of a girl model with black glasses photorealistic dramatic studio lighting, light smoke, 85mm leica lens
Stable Diffusion: bokeh award winning, close up photo of a girl model with black glasses photorealistic dramatic studio lighting, light smoke, 85mm leica lens
Dall-E: bokeh award winning, close up photo of a girl model with black glasses, photorealistic dramatic studio lighting, light smoke, 85mm leica lens.

Claymation clay figure.

Midjourney: full body wide studio shot of a clay figure with brown eyes, black shoes, detailed, standing in front of a greenscreen, f/4.0 dslr 8k
Stable Diffusion: full body wide studio shot of a clay figure with brown eyes, black shoes, detailed, standing in front of a greenscreen, f/4.0 dslr 8k
Dall-E: close up studio photo of a claymation figure posing on a table infront of a greenscreen, detailed eyes, f/4.0 dslr 8k

Digital Art.

Midjourney: space bunny in astronaut suit, over the shoulder view of a bunny graveyard with big red moon in the sky, fiery ground, detailed octane render 8k hdr
Midjourney (I’m sharing two because they both looked awesome!): space bunny in astronaut suit, over the shoulder view of a bunny graveyard with big red moon in the sky, fiery ground, detailed octane render 8k hdr
Stable Diffusion: space bunny in astronaut suit, over the shoulder view of a bunny graveyard with big red moon in the sky, fiery ground, detailed dystopian octane render hdr
Dall-E: a still of a rabbit in an astronaut suit looking at a graveyard from Interstellar (2014), dystopian vibrant red moon in the background

Atlantis.

Midjourney: dystopian City of atlantis during industrial revolution, subsurface scatter, steaming water bubbles, smog
Stable Diffusion: dystopian City of atlantis during industrial revolution, subsurface scatter, steaming water bubbles, smog, highly detailed zoom lens
Dall-E: dystopian City of atlantis during industrial revolution, subsurface scatter, steaming water bubbles, smog, highly detailed zoom lens

Landscape - Unreal Engine.

Midjourney: unreal engine render of waterfalls, detailed hyper real, wide shot, high res 8k, f/13.0
Stable Diffusion: unreal 5 engine render of waterfalls, detailed hyper real, wide shot, high res 8k, f/13.0
Dall-E: unreal engine 5 render of niagara falls, detailed hyper real, wide shot, high res 8k, f_13.0

Landscape - Watercolour.

Midjourney: serene water colour painting of rich green hills and floral landscape, detailed brushstrokes, canvas, saturated colours
Stable Diffusion: serene water colour painting of rich green hills and floral landscape, detailed brushstrokes, canvas, saturated colours
Dall-E: serene water colour painting of rich green hills and floral landscape, detailed brushstrokes, canvas, in the style of caspar david friedrich

(I’m not posting the prompts here to keep it as a clean gallery but please feel free to drop us an email below and I am happy to share.)

Dall-E

[Image gallery]

Midjourney

[Image gallery]

(The next few are all credited to Antonio Pavlov, a great vfx artist and a good friend.)

[Image gallery]

Subsection Conclusion

This technology is still very much in its infancy but is already showing incredible results. Each of the existing tools seems to have its own specialisations.

Photographic: Generating photorealistic images with Midjourney has proven very difficult; I have been unable to get it to work well and need to spend more time developing my photorealistic Midjourney prompt craft.

Dall-E is excellent as always.

Stable Diffusion was surprisingly good but required much more tweaking to get something acceptable.

Stylised Digital Art: Midjourney is hands down the best. Dall-E seems to struggle with this, as does Stable Diffusion.

Landscapes: For more organic material like paintings I prefer Stable Diffusion, and for more stylised landscapes I prefer Midjourney. Dall-E is good at landscape photos.

Applications

At Electric Sheep we are working on a few early stage projects leveraging this technology.

In one project we are using a hybrid of these tools plus some internal tooling (to achieve more frame-to-frame consistency) to empower an animation project: assisting the generation of hand-drawn backgrounds and accelerating the animators, who can then focus on hand-drawing the characters.

In another project we are using these tools to generate concept art as a jumping-off point, to be enhanced by the concept artist and help get a feature film pitch across the line.

Compositing generated elements into the background or foreground gives the most granular control and has the best end result.

Stay tuned for a case study on the results of these projects. We are excited to share them with you soon!

Beyond Today

The next natural evolution will be generating coherent videos. These algorithms are not currently temporally aware (aware of anything beyond each individual frame), and as such they struggle to stay coherent across the frames of a video.

Objects warble between frames, and hair and faces in particular are very tricky for these early algorithms to get right without post correction.

In the future we might see the ability to generate RAW images from text prompts so that they can be properly colour graded and matched to other elements.

Further enhancements may also include layer segmentation (pictures split into foreground/background layers) so that they can be composited more easily.

Outlook

The landscape surrounding these tools is rapidly evolving. The future is bright for video.

If you need any technical creative services, like assistance with creative pitch decks, early concept art ideas, pitch generation, or advice on how best to leverage these upcoming technologies for prep, on set, or in post, please get in touch with us here.