Given a text prompt, LumosFlow first generates key frames. Optical flows are then generated between each pair of these key frames to model the motion transition, and these flows guide the synthesis of a video clip by interpolating frames between the pair's first and last frames. This process is repeated for every pair of key frames across the sequence. Finally, all generated video clips are concatenated to form the complete, continuous long video. Below, we visualize each generated video clip together with its generated optical flows (Generated Flows (First Frame), Generated Flows (Last Frame)) and its pair of key frames (First Frame (Generated), Last Frame (Generated)).

  • Text Prompt: A man with curly hair, dressed in a black shirt and wearing a white virtual reality headset……

    [Video gallery: Video Clips 1 to 18. Each clip is shown with its First Frame (Generated), Last Frame (Generated), Generated Flows (First Frame), Generated Flows (Last Frame), and Generated Video Clip.]
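
The pair-by-pair procedure described at the top of this page can be sketched as a short loop. This is a minimal illustration, not the LumosFlow code: `assemble_long_video`, `generate_clip`, and `linear_stub` are hypothetical names, frames are plain numbers, and the sketch assumes that clips are produced for consecutive key-frame pairs and that each clip starts and ends on its two key frames. Under that assumption, the 18 clips shown above would correspond to 19 key frames.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def assemble_long_video(
    key_frames: Sequence[Frame],
    generate_clip: Callable[[Frame, Frame], List[Frame]],
) -> List[Frame]:
    """Concatenate the clips generated for each consecutive key-frame pair.

    `generate_clip(first, last)` is a hypothetical stand-in for the
    flow-guided clip generator. It is assumed to return a list of frames
    that begins with `first` and ends with `last`.
    """
    if len(key_frames) < 2:
        raise ValueError("need at least two key frames")
    video: List[Frame] = []
    for index, (first, last) in enumerate(zip(key_frames, key_frames[1:])):
        clip = generate_clip(first, last)
        # Every clip after the first starts with the key frame that ended
        # the previous clip, so that duplicated frame is skipped.
        video.extend(clip if index == 0 else clip[1:])
    return video


def linear_stub(first: float, last: float, n_intermediate: int = 3) -> List[float]:
    """Toy generator (assumption): frames are numbers, blended linearly."""
    steps = n_intermediate + 1
    return [first + (last - first) * t / steps for t in range(steps + 1)]


if __name__ == "__main__":
    # Three key frames give two clips, joined without repeating the shared frame.
    print(assemble_long_video([0.0, 4.0, 8.0], linear_stub))
```

Passing the generator in as a callable keeps the concatenation logic independent of how each clip is actually synthesized.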

Abstract

Long video generation has gained increasing attention due to its widespread applications in fields such as entertainment and simulation. Despite advances, synthesizing temporally coherent and visually compelling long sequences remains a formidable challenge. Conventional approaches often synthesize long videos by sequentially generating and concatenating short clips, or by generating key frames and then interpolating the intermediate frames in a hierarchical manner. However, both strategies still face significant challenges, leading to issues such as temporal repetition or unnatural transitions. In this paper, we revisit the hierarchical long video generation pipeline and introduce LumosFlow, a framework that incorporates motion guidance explicitly. Specifically, we first employ the Large Motion Text-to-Video Diffusion Model (LMTV-DM) to generate key frames with larger motion intervals, thereby ensuring content diversity in the generated long videos. Given the complexity of interpolating contextual transitions between key frames, we further decompose the intermediate frame interpolation into motion generation and post-hoc refinement. For each pair of key frames, the Latent Optical Flow Diffusion Model (LOF-DM) synthesizes complex and large-motion optical flows, while MotionControlNet subsequently refines the warped results to enhance quality and guide intermediate frame generation. Compared with traditional video frame interpolation, we achieve 15× interpolation, ensuring reasonable and continuous motion between adjacent frames. Experiments show that our method can generate long videos with consistent motion and appearance. Code and models will be made publicly available upon acceptance.


Method Overview

The overall framework of LumosFlow consists of key frame generation and intermediate frame generation. Intermediate frame generation comprises two components: motion generation (highlighted in yellow) and post-hoc refinement (highlighted in orange). During generation, LMTV-DM produces key frames separated by large intervals, while LOF-DM and MotionControlNet work together to create realistic intermediate frames, effectively injecting motion (optical flow) into the generation process.
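
To make the role of optical flow concrete, the following is a minimal sketch of generic backward warping, the kind of operation that produces "warped results" from a key frame and a flow field before refinement. It is an illustration under stated assumptions, not the authors' implementation: `backward_warp` is a hypothetical helper, the image is a small grayscale grid of Python floats, and the flow convention (each output pixel samples the source at its own position plus the flow vector) is assumed. In LumosFlow the flows come from LOF-DM and the refinement is performed by MotionControlNet; neither is modeled here.

```python
import math
from typing import List, Tuple

Image = List[List[float]]               # grayscale image, indexed [y][x]
Flow = List[List[Tuple[float, float]]]  # per-pixel (dx, dy), indexed [y][x]


def backward_warp(frame: Image, flow: Flow) -> Image:
    """Warp `frame` with a dense flow field using bilinear sampling.

    Assumed convention: output pixel (x, y) samples the source frame at
    (x + dx, y + dy). Sample positions outside the image are clamped to
    the border.
    """
    height, width = len(frame), len(frame[0])
    warped = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            # Source position, clamped to the valid pixel range.
            sx = min(max(x + dx, 0.0), width - 1.0)
            sy = min(max(y + dy, 0.0), height - 1.0)
            # The four neighbouring pixels and the interpolation weights.
            x0, y0 = math.floor(sx), math.floor(sy)
            x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
            wx, wy = sx - x0, sy - y0
            top = frame[y0][x0] * (1 - wx) + frame[y0][x1] * wx
            bottom = frame[y1][x0] * (1 - wx) + frame[y1][x1] * wx
            warped[y][x] = top * (1 - wy) + bottom * wy
    return warped


if __name__ == "__main__":
    frame = [[float(4 * y + x) for x in range(4)] for y in range(3)]
    half_pixel = [[(0.5, 0.0)] * 4 for _ in range(3)]
    # Each output pixel blends its own value with its right-hand neighbour.
    print(backward_warp(frame, half_pixel)[0])
```

Warping alone leaves artifacts wherever content is occluded or newly revealed, which is consistent with the pipeline pairing motion generation with a separate post-hoc refinement stage.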



Comparison on Long Video Generation

Text Prompt
A bird's-eye view captures a serene beach scene with a rusted, dilapidated shipwreck partially submerged in shallow water near the shore …… A man with curly hair, dressed in a black shirt and wearing a white virtual reality headset, …… A woman in a brown jacket and black helmet sits on a brown horse in an indoor equestrian arena, …….
FreeLong
FreeNoise
Video-Infinity
LumosFlow (Ours)
Text Prompt
The video captures a low-angle view of a dramatic and moody sky filled with a variety of clouds, …… A woman in a light green, flowing dress walks gracefully on a paved surface composed of rectangular bricks, …… A close-up view captures a chaotic scene of a flock of vultures feeding on a carcass ……
FreeLong
FreeNoise
Video-Infinity
LumosFlow (Ours)