Recommended prompts

Studio Ghibli style. Woman with blonde hair is walking on the beach, camera zoom out.
Studio Ghibli style. Woman dancing in the bar.

Recommended negative prompts

色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走, 3D, MMD, MikuMikuDance, SFM, Source Filmmaker, Blender, Unity, Unreal, CGI, bad quality

Recommended parameters

samplers: UniPC, DPM++
cfg: 1
resolution: 384x384
vae: wan_2.1_vae - 1.0

Tips

Use 'Studio Ghibli style' as a trigger phrase for prompt generation.

UniPC sampler performs better for 2D animation compared to LCM, which leans toward realism.

Negative prompts can be used even with CFG=1 by utilizing NAG nodes.

Mix low-resolution high-duration videos with high-resolution images to improve generalization in training.

Training on 368x192x45 clips is an effective starting point for consumer GPUs like RTX 3090.

Creator sponsors

OpenMuse

This LoRA is featured on OpenMuse, a curated initiative dedicated to open-source video LoRAs and the creative works they enable. Focused on models like Wan2.1, LTX-Video, and HunyuanVideo, OpenMuse highlights high-quality tools and artwork from across the ecosystem. Rooted in the Banodoco community, OpenMuse is a growing home for open, collaborative AI art, designed to inspire creators, spark curiosity, and offer something you'd feel proud to share, even with someone skeptical of AI-generated art.

Description

I am very happy to share my magnum opus LoRA, which I've been working on for the past month, ever since Wan came out. This is easily the best LoRA I have ever trained and published on Civitai, and I have to say (once again) - WanVideo is an amazing model.

The LoRA was trained for ~90 hours on an RTX 3090 with musubi-tuner, using a mixed dataset of 240 clips and 120 images. This could have been done faster, but I was obsessed with pushing the limits to create a state-of-the-art style model. It's up to you to judge whether I succeeded.

Usage

The trigger phrase is Studio Ghibli style - all captions for training data were prefixed with these words.

All clips I publish in the gallery are raw outputs from this LoRA with the base Wan-T2V-14B model (although the latest videos may also include the self-forcing LoRA for inference acceleration; read more below), without further post-processing, upscaling, or interpolation.

Compatibility with other LoRAs and with Wan-I2V models has not been tested.

Workflows are embedded with each video (so you can just download a video and drag it into ComfyUI to open it). As an example, here is the JSON for a workflow (based on Kijai's wrapper) that uses the self-forcing LoRA (created by blyss), extracted from lightx2v's Wan2.1-T2V-14B-StepDistill-CfgDistill model. I chose the version made by blyss (rather than the original LoRA by Kijai) because, from my tests, it offers maximum compatibility and only accelerates inference, without any additional detailing or stylistic bias. (This is also the reason I stick to the base Wan model and do not use merges like AniWan or FusionX.)

I'm using the acceleration LoRA with the UniPC sampler (and occasionally DPM++). In my experience, UniPC performs better for 2D animation than LCM, which tends to lean toward realism, something I want to avoid here.

I also use NAG nodes, so I can use negative prompts even with CFG=1.

From initial testing, compared to the older workflow with TeaCache, aside from the huge speed gain (a 640×480×81 clip renders in ~1 minute instead of 6 on an RTX 3090), it also slightly improves motion smoothness and text rendering.

You can also download the "legacy" workflow in JSON format here. It was used to generate 99% of the videos in the gallery for this LoRA. It was also built on the wrapper nodes and included a lot of optimizations (more information here): fp8_e5m2 checkpoints + torch.compile, SageAttention 2, TeaCache, Enhance-A-Video, Fp16_fast, SLG, and (sometimes) Zero-Star (some of these migrated to the new workflow as well). Even so, rendering a 640x480x81 clip still took about 5 minutes on an RTX 3090 in the older workflow. Although the legacy workflow shows slightly better quality in a few specific areas (palette, smoothness), the 5x slowdown is a significant and decisive drawback, and the reason I migrated to the lightx2v-powered version.

Prompting

To generate most prompts, I usually run the following meta-prompt through ChatGPT (or Claude, or any other capable LLM) to enhance "raw" descriptions. It is based on the official prompt extension code by the Wan developers and looks like this:

You are a prompt engineer, specializing in refining user inputs into high-quality prompts for video generation in the distinct Studio Ghibli style. You ensure that the output aligns with the original intent while enriching details for visual and motion clarity.

Task Requirements:
- If the user input is too brief, expand it with reasonable details to create a more vivid and complete scene without altering the core meaning.
- Emphasize key features such as characters' appearances, expressions, clothing, postures, and spatial relationships.
- Always maintain the Studio Ghibli visual aesthetic - soft watercolor-like backgrounds, expressive yet simple character designs, and a warm, nostalgic atmosphere.
- Enhance descriptions of motion and camera movements for natural animation flow. Include gentle, organic movements that match Ghibli's storytelling style.
- Preserve original text in quotes or titles while ensuring the prompt is clear, immersive, and 80-100 words long.
- All prompts must begin with "Studio Ghibli style." No other art styles should be used.

Example Revised Prompts:
"Studio Ghibli style. A young girl with short brown hair and curious eyes stands on a sunlit grassy hill, wind gently rustling her simple white dress. She watches a group of birds soar across the golden sky, her bare feet sinking slightly into the soft earth. The scene is bathed in warm, nostalgic light, with lush trees swaying in the distance. A gentle breeze carries the sounds of nature. Medium shot, slightly low angle, with a slow cinematic pan capturing the serene movement."
"Studio Ghibli style. A small village at sunset, lanterns glowing softly under the eaves of wooden houses. A young boy in a blue yukata runs down a narrow stone path, his sandals tapping against the ground as he chases a firefly. His excited expression reflects in the shimmering river beside him. The atmosphere is rich with warm oranges and cool blues, evoking a peaceful summer evening. Medium shot with a smooth tracking movement following the boy's energetic steps."
"Studio Ghibli style. A mystical forest bathed in morning mist, where towering trees arch over a moss-covered path. A girl in a simple green cloak gently places her hand on the back of a massive, gentle-eyed creature resembling an ancient deer. Its fur shimmers faintly as sunlight pierces through the thick canopy, illuminating drifting pollen. The camera slowly zooms in, emphasizing their quiet connection. A soft gust of wind stirs the leaves, and tiny glowing spirits peek from behind the roots."

Instructions:
I will now provide a prompt for you to rewrite. Please expand and refine it in English while ensuring it adheres to the Studio Ghibli aesthetic. Even if the input is an instruction rather than a description, rewrite it into a complete, visually rich prompt without additional responses or quotation marks.

The prompt is: "YOUR PROMPT HERE".

Replace YOUR PROMPT HERE with something like Young blonde girl stands on the mountain near seashore beach under rain, or whatever you like.
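
If you'd rather script this step than paste the meta-prompt into a chat UI, here is a minimal sketch using the OpenAI Python client. It is purely illustrative: the model name, file name, and example idea are placeholders, and any capable LLM API would work the same way.

# expand_prompt.py - illustrative sketch; model name and file paths are placeholders
from openai import OpenAI

META_PROMPT = open("ghibli_meta_prompt.txt", encoding="utf-8").read()  # the meta-prompt text above
raw_idea = "Young blonde girl stands on the mountain near seashore beach under rain"

client = OpenAI()  # expects OPENAI_API_KEY in the environment
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; use whatever capable LLM you have access to
    messages=[{"role": "user", "content": META_PROMPT.replace("YOUR PROMPT HERE", raw_idea)}],
)
print(response.choices[0].message.content)  # the expanded "Studio Ghibli style. ..." prompt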

The negative prompt always includes the same base text (but may have additional words added depending on the specific prompt):

色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走, 3D, MMD, MikuMikuDance, SFM, Source Filmmaker, Blender, Unity, Unreal, CGI, bad quality

Dataset

In this and the following sections, I'll be doing a bit of yapping :) Feel free to skip ahead and just read the Conclusion, but maybe someone will find some useful bits of information in this wall of text. So...

The dataset selection stage was the "easiest" part: I already have all the Ghibli films in the highest possible quality, split into scenes - over 30,000 clips at 1920x1040 resolution and high bitrate. They're patiently waiting for the day I finally do a full fine-tune of some video model with them.

I had also already prepped around 300 clips for training v0.7 of my HV LoRA (in fact, I was just about to start training when Wan came out). These clips were in the 65-129 frame range, which I consider optimal for training HV on videos, and they were all 24 fps. For Wan, though, I wanted them in a different frame range (not exceeding 81 frames; see the explanation in the "Training" section), and I also needed them at 16 fps. I'm still not entirely sure strict 16 fps is necessary, but I had some issues with HV when clips were 30 fps instead of HV's native 24 fps, so I decided to stick with 16 fps.

I should mention that for dataset processing I usually write a lot of small "one-time" scripts (with the help of Claude, ChatGPT, and DeepSeek) - mini-GUIs for manual selection of videos, one-liners for splitting frames, scripts for outputting various helper stats, dissecting clips by ranges, creating buckets in advance, etc. I don't publish these scripts because they're messy, full of hardcoded values, and designed for one-time use anyway; nowadays anyone can easily create similar scripts by asking the aforementioned LLMs.
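
As an illustration of what I mean, here is a minimal sketch of such a one-off script, converting a folder of clips to 16 fps with ffmpeg. The folder names and encoder settings are made up for the example; only the fps=16 filter is the point.

# convert_to_16fps.py - one-off sketch; assumes ffmpeg is on PATH, paths and encoder settings are placeholders
import subprocess
from pathlib import Path

SRC = Path("H:/datasets/ghibli_clips_24fps")   # hypothetical input folder
DST = Path("H:/datasets/ghibli_clips_16fps")   # hypothetical output folder
DST.mkdir(parents=True, exist_ok=True)

for clip in sorted(SRC.glob("*.mp4")):
    # re-time to 16 fps, drop audio (not needed for training), re-encode at high quality
    subprocess.run([
        "ffmpeg", "-y", "-i", str(clip),
        "-vf", "fps=16",
        "-an",
        "-c:v", "libx264", "-crf", "16",
        str(DST / clip.name),
    ], check=True)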

Converting all clips to 16 fps narrowed the range of frames in each video from 65-129 to around 45-88 frames, which messed up my meticulously planned, frame-perfect ranges for the frame buckets I had set up for training. Thankfully, it wasn't a big deal because I had some rules in place when selecting videos for training, specifically to handle situations like this.

First of all, a scene shouldn't contain rapid transitions over its duration. I needed this because I couldn't predict the exact duration (in frames) of the target frame buckets the trainer would establish - model size, VRAM, and other factors all affect this. Example: I might want to use a single 81-frame clip for training, but I can't, because I'd get an OOM on the RTX 3090. So I have to choose some frame extraction strategy, and depending on it, a clip might be split into several shorter parts (here is an excellent breakdown of the various strategies). Its semantic coherence might then be broken (say, in the first fragment of the clip a girl opens her mouth, but from that clipped fragment alone it becomes ambiguous whether she is about to cry or laugh), and that kind of context incoherence may make Wan's UMT5 encoder feel sad.

Another consideration is that I wanted to reuse captions for any fragment of the original clip without recaptioning and recaching embeddings via the text encoder. Captioning videos takes quite a long time, but if a scene changes drastically over its range, the original caption might not fit all fragments, reducing training quality. By following the rules "a clip should not contain rapid context transitions" and "a clip should be self-contained, i.e. it should not feature events that cannot be understood from within the clip itself", even if a scene is split into subfragments, the captions would (with an acceptable margin of error) still apply to each fragment.

After conversion I looked through all the clips and reduced their total number to 240 (I just took out some clips that contained too many transitions or, vice versa, were too static), which formed the first part of the dataset.

I decided to use a mixed dataset of videos and images, so the second part of the dataset was formed by 120 images (at 768x768 resolution) taken from screencaps of various Ghibli movies.

There's an alternative approach where you train on images first and then fine-tune on videos (it was successfully applied by the creator of this LoRA), but I personally think it's not as good as mixing both in a single run (though I don't have hard numbers to back this up). To support my assumption, here is a very good LoRA that uses the same mixed approach to training (and, by the way, it was also done on a 24 GB GPU, if I am not mistaken).

To enable effective video training on a mixed dataset on consumer-level GPUs, I had to find the right balance between resolution, duration, and training time. I decided to do this by mixing low-res, high-duration videos with high-res images - more details in the Training section.

As for captioning: the images in the dataset were actually just reused from some of my HV datasets, and they had been captioned earlier using my "swiss army knife" VLM for (SFW-only) dataset captioning, also known as Qwen2-VL-7B-Instruct. I used the following captioning prompt:

Create a very detailed description of this scene. Do not use numbered lists or line breaks. IMPORTANT: The output description MUST ALWAYS start with the unaltered phrase 'Studio Ghibli style. ', followed by your detailed description. The description should 1) describe the main content of the scene, 2) describe the environment and lighting details, 3) identify the type of shot (e.g., aerial shot, close-up, medium shot, long shot), and 4) include the atmosphere of the scene (e.g., cozy, tense, mysterious). Here's a template you MUST use: 'Studio Ghibli style. {Primary Subject Action/Description}. {Environment and Lighting Details}. {Style and Technical Specifications}'.
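
For those who want to reproduce this, a rough captioning loop could look like the sketch below. It follows the standard Qwen2-VL-7B-Instruct example code from its model card; the image folder comes from my dataset config, while the file glob, dtype, and generation settings are arbitrary choices. Writing a .txt sidecar next to each image matches the caption_extension = ".txt" setting used later in the dataset config.

# caption_images.py - rough sketch following the Qwen2-VL-7B-Instruct example code
from pathlib import Path
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

CAPTION_PROMPT = "Create a very detailed description of this scene. ..."  # the full prompt from above

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

for image_path in sorted(Path("H:/datasets/studio_ghibli_wan_video_v01/images/768x768").glob("*.png")):
    messages = [{"role": "user", "content": [
        {"type": "image", "image": str(image_path)},
        {"type": "text", "text": CAPTION_PROMPT},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    images, videos = process_vision_info(messages)
    inputs = processor(text=[text], images=images, videos=videos,
                       padding=True, return_tensors="pt").to(model.device)
    out_ids = model.generate(**inputs, max_new_tokens=300)
    caption = processor.batch_decode(
        [o[len(i):] for i, o in zip(inputs.input_ids, out_ids)],
        skip_special_tokens=True,
    )[0]
    # .txt sidecar next to the image, picked up via caption_extension = ".txt"
    image_path.with_suffix(".txt").write_text(caption.strip(), encoding="utf-8")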

I had some doubts about whether I should recaption them, since the target caption structure was specifically designed for HunyuanVideo, and I worried that Wan might need a completely different approach. I left them as-is, and I have no idea whether that was the right decision, but, broadly speaking, modern text encoders are powerful enough to ignore such limitations. As we know, models like Flux and some others can even be trained without captions at all (although I believe training with captions is always better than without - but only if the captions are relevant to the content).

For captioning videos I tested a bunch of local models that can natively caption video content; there are more out there, but I only tried a handful. For this LoRA, I ended up using Apollo-7B with the following simple VLM prompt:

Create a very detailed description of this video. IMPORTANT: The output description MUST ALWAYS start with the unaltered phrase 'Studio Ghibli style. ', followed by your detailed description.

I'm attaching the full dataset I used as an addendum to the model. While it does kinda contain copyrighted material, I think this falls under fair use. This dataset is provided solely for research and educational evaluation of the model's capabilities and to offer transparency regarding the model's training process. It should not be used for redistribution or commercial exploitation.

Training

If anyone is interested, here is the list of trainers I considered for training WanVideo:

  • diffusion-pipe - the OG of HV training, which also allows memory-efficient Wan training; config-driven, has a third-party GUI and runpod templates (read more here and here). For HV I used it exclusively. Requires WSL to run on Windows.

  • Musubi Tuner - maintained by a responsible and friendly developer. Config-driven, has a cozy community and tons of options. Currently my choice for Wan training.

  • AI Toolkit - my favorite trainer for Flux recently got support for Wan. It's fast, easy to use, config-driven, and also has a first-party UI (which I do not use 🤷), but it currently supports training the 14B model only without captions, which is the main reason I don't use it.

  • DiffSynth Studio - I haven't had the time to test it yet and am unsure if it can train Wan models with 24 GB VRAM. However, it’s maintained by ModelScope, making it worth a closer look. I plan to test it soon.

  • finetrainers - has support for Wan training, but doesn't seem to work with 24 GB GPUs (yet).

  • SimpleTuner - Gained support for Wan last week, so I haven't had a chance to try it yet. It definitely deserves attention since the main developer is a truly passionate and knowledgeable person.

  • Zero-to-Wan - Supports training only for 1.3B models.

  • WanTraining - I have to mention this project, as it's supported by a developer who’s done impressive work with it, including guidance-distilled LoRA and control LoRA.

So, I used Musubi Tuner. For reference, here are my hardware specs: i5-12600KF, RTX 3090, Windows 11, 64 GB RAM. The commands and config files I used were the following.

  • For caching VAE latents (nothing specific here, just the default command):

python wan_cache_latents.py --dataset_config G:/samples/musubi-tuner/_studio_ghibli_wan14b_v01_dataset.toml --vae G:/samples/musubi-tuner/wan14b/vae/wan_2.1_vae.safetensors
  • For caching text encoder embeddings (default):

python wan_cache_text_encoder_outputs.py --dataset_config G:/samples/musubi-tuner/_studio_ghibli_wan14b_v01_dataset.toml --t5 G:/samples/musubi-tuner/wan14b/tenc/models_t5_umt5-xxl-enc-bf16.pth --batch_size 16 
  • For launching training:

accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16 wan_train_network.py ^
    --task t2v-14B ^
    --dit G:/samples/musubi-tuner/wan14b/dit/wan2.1_t2v_14B_bf16.safetensors ^
    --vae G:/samples/musubi-tuner/wan14b/vae/wan_2.1_vae.safetensors ^
    --t5 G:/samples/musubi-tuner/wan14b/tenc/models_t5_umt5-xxl-enc-bf16.pth ^
    --sdpa ^
    --blocks_to_swap 10 ^
    --mixed_precision bf16 ^
    --fp8_base ^
    --fp8_scaled ^
    --fp8_t5 ^
    --dataset_config G:/samples/musubi-tuner/_studio_ghibli_wan14b_v01_dataset.toml ^
    --optimizer_type adamw8bit ^
    --learning_rate 5e-5 ^
    --gradient_checkpointing ^
    --max_data_loader_n_workers 2 ^
    --persistent_data_loader_workers ^
    --network_module networks.lora_wan ^
    --network_dim 32 ^
    --network_alpha 32 ^
    --timestep_sampling shift ^
    --discrete_flow_shift 3.0 ^
    --save_every_n_epochs 1 ^
    --seed 2025 ^
    --output_dir G:/samples/musubi-tuner/output ^
    --output_name studio_ghibli_wan14b_v01 ^
    --log_config ^
    --log_with tensorboard ^
    --logging_dir G:/samples/musubi-tuner/logs ^
    --sample_prompts G:/samples/musubi-tuner/_studio_ghibli_wan14b_v01_sampling.txt ^
    --save_state ^
    --max_train_epochs 50 ^
    --sample_every_n_epochs 1

Again, nothing special to see here. I had to use the blocks_to_swap parameter because otherwise, with my dataset config (see below), I ran into the 24 GB VRAM limit. Hyperparameters were mostly left at defaults. I didn't want to risk anything after a bad experience - 60 hours of HV training lost to getting too ambitious with flow shift values and adaptive optimizers instead of good old adamw.

  • Prompt file for sampling during training:

# prompt 1
Studio Ghibli style. Woman with blonde hair is walking on the beach, camera zoom out.  --w 384 --h 384 --f 45 --d 7 --s 20

# prompt 2
Studio Ghibli style. Woman dancing in the bar. --w 384 --h 384 --f 45 --d 7 --s 20
  • Dataset configuration (the most important part; I'll explain the thoughts that led me to it afterward):

[general]
caption_extension = ".txt"
enable_bucket = true
bucket_no_upscale = true

# section 1: high-res still images (screencaps)
[[datasets]]
image_directory = "H:/datasets/studio_ghibli_wan_video_v01/images/768x768"
cache_directory = "H:/datasets/studio_ghibli_wan_video_v01/images/768x768/cache"
resolution = [768, 768]
batch_size = 1
num_repeats = 1

# section 2: the same clips at medium resolution, first frame + first 21 frames only
[[datasets]]
video_directory = "H:/datasets/studio_ghibli_wan_video_v01/videos/1920x1040"
cache_directory = "H:/datasets/studio_ghibli_wan_video_v01/videos/1920x1040/cache_1"
resolution = [768, 416]
batch_size = 1
num_repeats = 1
frame_extraction = "head"
target_frames = [1, 21]

# section 3: low-res 45-frame fragments, two uniform samples per clip
[[datasets]]
video_directory = "H:/datasets/studio_ghibli_wan_video_v01/videos/1920x1040"
cache_directory = "H:/datasets/studio_ghibli_wan_video_v01/videos/1920x1040/cache_2"
resolution = [384, 208]
batch_size = 1
num_repeats = 1
frame_extraction = "uniform"
target_frames = [45]
frame_sample = 2

My dataset setup consists of three parts.

I'll start with the last one, which contains the main data array - 240 clips at 1920x1040 resolution with durations varying from 45 to 88 frames.

Obviously, training on full-resolution 1920x1040, full-duration clips on an RTX 3090 was out of the question. I needed to find the minimum resolution and frame duration that would avoid OOM errors while keeping the bucket fragments as long as possible. Longer fragments help the model learn motion, timing, and spatial patterns (like hair twitching, fabric swaying, liquid dynamics etc.) of the Ghibli style - something you can't achieve with still frames.

From training HV, I remembered that a good starting point for estimating the available resolution range on a 24 GB GPU is 512x512x33. I decided on the "uniform" frame extraction pattern, ensuring all extracted fragments were no fewer than 45 frames. Since, as I wrote before, the clips maxed out at 88 frames after conversion to 16 fps, this approach kept them from being divided into more than two spans, which would've made epochs too long. At the same time, a timespan of 45 frames (~3 s) should be enough for the model to learn the spatial flow of the style.
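
To make the "uniform" extraction concrete, here is how I picture it splitting a single clip. This is just my mental model of the bucketing arithmetic, not musubi-tuner's actual implementation.

# uniform_windows.py - illustrative sketch of "uniform" extraction with frame_sample = 2
# (my mental model of the arithmetic, not musubi-tuner's actual code)
def uniform_windows(total_frames: int, target: int = 45, samples: int = 2):
    """Start indices of `samples` windows of `target` frames spread evenly over the clip."""
    if total_frames < target:
        return []                        # too short for this bucket: the clip is skipped
    span = total_frames - target         # last valid start index
    step = span / (samples - 1) if samples > 1 else 0
    return [round(i * step) for i in range(samples)]

print(uniform_windows(88))   # [0, 43] -> frames 0-44 and 43-87
print(uniform_windows(52))   # [0, 7]  -> heavily overlapping windows on shorter clips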

With the target fixed to 45 frames, I started testing different resolutions. I used a script to analyze all clips in a folder and suggest valid width-height combinations that maintained the original aspect ratio (1920/1040 ≈ 1.85) and were divisible by 16 (a model requirement).
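
A stripped-down sketch of that kind of helper is below; the width range and the 5% aspect tolerance are arbitrary choices for the example, not anything the trainer requires.

# bucket_candidates.py - sketch of the resolution helper; ranges and tolerance are arbitrary
SRC_W, SRC_H = 1920, 1040
ASPECT = SRC_W / SRC_H                       # ≈ 1.846

candidates = []
for w in range(192, 769, 16):                # candidate widths, all multiples of 16
    h = round(w / ASPECT / 16) * 16          # nearest height that is also a multiple of 16
    if h >= 128 and abs(w / h - ASPECT) / ASPECT < 0.05:   # keep aspect error under ~5%
        candidates.append((w, h))

for w, h in sorted(candidates, key=lambda c: c[0] * c[1]):
    print(f"{w}x{h}  aspect={w / h:.3f}  area={w * h}")
# the list includes 384x208 (what I ended up using) as well as 368x192 and 352x192 (the cheaper options discussed below)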

Eventually, I found that using [384, 208] for the bucket size and setting --blocks_to_swap 10 prevented OOM errors and spilling into shared memory (which would have pushed things to ~160 s/it). The downside was that training speed dropped to around 11-12 s/it. In hindsight, lowering the resolution to [368, 192] could have bumped the speed up to ~8 s/it, which would've been great (close to what I get when training Flux at 1024p in AI Toolkit) and would have saved me around 20 hours over the full 90-hour run (~28000 steps) - although I didn't expect it to go past 20K steps back then.

It should also be noted that I trained on Windows with my monitor connected to the GPU (and used my PC for coding at the same time 😼). On Linux (for example, with diffusion-pipe) and with the monitor driven by an integrated GPU, it might be possible to use slightly higher spatiotemporal resolutions without hitting OOM or shared-memory limits (something I think is Windows-specific).

Now about the first part (120 images in 768x768 resolution). Initially, I wanted to train on 1024p images, but I decided it'd be overkill and slow things down. My plan was to train on HD images and low-res videos simultaneously to ensure better generalization. The idea was that high-resolution images would compensate for the lower resolution of the clips. And joint video + image pretraining is how WAN was trained anyway, so I figured this approach would favor "upstream" style learning as well.

Finally, the second part, which is also important for generalization (again, not the most "scientific" assumption, but it seems reasonable). The idea was to reuse the same clips from the third section but train only on the first frame and the first 21 frames. This, I hoped, would facilitate learning of temporal style motion features. At the same time, it let me bump up the resolution for the second section to [768, 416].

As a result, I hoped to achieve "cross-generalization" between:

  • Section 1's high-res images (768x768)

  • Section 2's medium-res single frames and 21-frame clips (768x416)

  • Section 3's low-res 45-frame clips (384x208)

Additionally, the second section and the larger part of the third section shared the same starting frames, which I believed would benefit LoRA usage in I2V scenarios. All this seemed like the best way to fully utilize my dataset without hitting hardware limits.

Of course, I'm not the first to come up with this approach, but it seems logical and reasonable, so I hope more creators realize you don’t need an A100 to train a video-based LoRA for Wan.

Fun fact: I expected one epoch to consist of 1080 samples: 120 images (1st dataset section) + 240 single frames (2nd section, "head" frame bucket = 1) + 240 clips of 21 frames each (2nd section, "head" frame bucket = 21) + 480 clips of 45 frames each (3rd section, "uniform" frame bucket = 45, sampled twice). However, after I started training, I discovered it was actually 1078 samples. When I dug into it, I found that two of the clips reported by my scripts (which use the ffprobe command from ffmpeg to count frames) were actually shorter than 45 frames - a rounding issue. It wasn't a big deal, so I just continued training without those two clips, but that's why the step count for the final LoRA looks so odd :)
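
For the curious, here is a sketch of the two ways of asking ffprobe for a frame count; the quick metadata query and an exact decode can differ by a frame or two, which is one plausible source of the discrepancy above (assumes ffprobe is on PATH).

# count_frames.py - sketch; metadata estimate vs. exact decoded count with ffprobe
import subprocess

def frames_from_metadata(path: str) -> int:
    """Quick query of container metadata (nb_frames); fast but can be off or missing."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=nb_frames", "-of", "csv=p=0", path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return int(out)

def frames_exact(path: str) -> int:
    """Slow but exact: decode the stream and count the frames actually read."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0", "-count_frames",
         "-show_entries", "stream=nb_read_frames", "-of", "csv=p=0", path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return int(out)

# a clip that metadata reports as 45 frames may decode to 44 and silently miss the 45-frame bucket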

The training itself went smoothly. I won't reveal the loss graphs: partly because I'm too shy, but mostly because I don't think they mean much. I mainly use them to check whether the loss distribution starts looking too similar across epochs - that's my cue for potential overfitting.

I trained up to 28000 steps, then spent several days selecting the best checkpoint. Another thing I think I could have done better is taking checkpoints not just at the end of each epoch, but also in between. Since each epoch is 1078 steps long, it's possible that a checkpoint with even better results than the one I ended up with was lost somewhere in between.

I'm considering integrating validation loss estimation into my training pipeline (more on this here), but I haven't done it yet.

Could this be simplified? Probably, yes. In my next LoRA, I'll test whether the extra image dataset in section 1 was redundant - I could've just set up a separate dataset section and reused the clips' first frames at high resolution. On the other hand, I wanted the dataset to be as varied as possible, so I used screencaps from different scenes than the clips; in that sense they were not redundant.

I'm not even sure if the second section was necessary. Since WAN itself (according to its technical report) was pretrained on 192px clips, training at around 352x192x45 should be effective and make the most of my hardware. Ideally, I'd use 5-second clips (16 fps * 5s + 1 = 81 frames), but that’s just not feasible on the RTX 3090 without aggressive block swapping.

Conclusion

Aside from the fun and the hundreds of thousands of insanely good clips, here are some insights I've gained from training this LoRA. I should mention that these practices are based on my personal experience and observations; I don't have any strictly analytical evidence of their effectiveness, and I have only tried style training so far. I plan to explore concept training very soon to test some of my other assumptions and see if they apply as well.

  • You can train Wan-14B on consumer-level GPUs using videos. 368x192x45 seems like a solid starting point.

  • Compensate for motion-targeted style learning on low-res videos by using high-res images to ensure better generalization.

  • Combine various frame extraction methods on the same datasets to maximize effectiveness and hardware usage.

A lot, if not all, of what I've learned while making this LoRA comes from reading countless r/StableDiffusion posts, 24/7 lurking on the awesome Banodoco Discord, reading the comments on (and opening every NSFW clip of) every single WanVideo model here on Civitai, and diving into every issue I could find in the musubi-tuner, diffusion-pipe, Wan2.1, and other repositories. 😽

P.S.

This model is a technological showcase of the capabilities of modern video generation systems. It is not intended to harm or infringe upon the rights of the original creators. Instead, it serves as a tribute to the remarkable work of the artists whose creations have inspired this model.

Model details

Model type: LORA
Base model: Wan Video 14B t2v
Model version: v1.0
Model hash: dd2fe1258d
Trained words: Studio Ghibli style

