EPiC: Efficient Video Camera Control Learning
with Precise Anchor-Video Guidance

ICML 2026
1 University of North Carolina, Chapel Hill; 2 Johns Hopkins University; 3 NTU Singapore; 4 AI2

EPiC supports both precise image-to-video and video-to-video camera control with complex trajectories, with efficient training that takes less than 2 hours on 8×H100 GPUs.

Abstract

Recent approaches to 3D-informed camera control in video diffusion models (VDMs) often create anchor videos, rendered from estimated point clouds along annotated camera trajectories, to guide the diffusion model as a structured prior. However, errors inherent in point cloud estimation often lead to inaccurate anchor videos. Moreover, the requirement for extensive camera trajectory annotations further increases resource demands. To address these limitations, we introduce EPiC, an efficient and precise camera control learning framework that automatically constructs high-quality anchor videos without expensive camera trajectory annotations. Concretely, we create highly precise anchor videos for training by masking source videos based on first-frame visibility. This approach ensures high alignment, eliminates the need for camera trajectory annotations, and can therefore be readily applied to any in-the-wild video to generate image-to-video (I2V) training pairs. Furthermore, we introduce Anchor-ControlNet, a lightweight conditioning module that integrates anchor video guidance in visible regions into pretrained VDMs, with less than 1% of the backbone model's parameters. By combining the proposed anchor video data and ControlNet module, EPiC achieves efficient training with substantially fewer parameters, fewer training steps, and less data, without requiring the modifications to the diffusion model backbone that are typically needed to mitigate rendering misalignments. Although trained on masking-based anchor videos, our method generalizes robustly to anchor videos made with point clouds during inference, enabling precise 3D-informed camera control. EPiC achieves SOTA performance on RealEstate10K and MiraData for the I2V camera control task, demonstrating precise and robust camera control both quantitatively and qualitatively. Notably, EPiC also exhibits strong zero-shot generalization to video-to-video scenarios.

Image-to-Video Camera Control Gallery


The first frame is the conditioning image.
EPiC with large camera motion. The first frame is the conditioning image.

Video-to-Video Camera Control Gallery


Camera Trajectory: Arc Left


The camera trajectory is applied for the first 1/3 frames, and for the remaining frames, the camera stays fixed at the final pose.
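The "move for the first third, then hold" schedule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `arc_then_hold`, the 30-degree arc, the pivot distance, and the sign convention for "left" (camera moving toward negative x while looking at a pivot in front of it) are all assumptions made for the example.

```python
import numpy as np

def arc_then_hold(num_frames, max_angle_deg=30.0, radius=1.0, move_fraction=1 / 3):
    """Return (num_frames, 4, 4) camera-to-world poses.

    The camera arcs around a pivot at (0, 0, radius) during the first
    `move_fraction` of the frames, then stays at the final pose.
    All parameter values are illustrative assumptions.
    """
    num_moving = max(1, int(round(num_frames * move_fraction)))
    if num_moving > 1:
        # Progress runs 0 -> 1 over the moving segment and is clamped to 1 afterwards.
        progress = np.minimum(np.arange(num_frames) / (num_moving - 1), 1.0)
    else:
        progress = np.ones(num_frames)
    angles = np.deg2rad(max_angle_deg) * progress

    poses = np.tile(np.eye(4), (num_frames, 1, 1))
    for i, a in enumerate(angles):
        c, s = np.cos(a), np.sin(a)
        # Yaw rotation so the camera's forward axis keeps pointing at the pivot.
        poses[i, :3, :3] = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
        # Camera position on a circle of the given radius around the pivot.
        poses[i, :3, 3] = [-radius * s, 0.0, radius * (1.0 - c)]
    return poses

poses = arc_then_hold(num_frames=9)
```

With 9 frames, the pose changes over frames 0 to 2 and is identical from frame 2 onward.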

[Video panels: Source Video, Generated Video (two examples)]

Camera Trajectory: Arc Right


The camera trajectory is applied for the first 1/3 frames, and for the remaining frames, the camera stays fixed at the final pose.

[Video panels: Source Video, Generated Video (two examples)]

Camera Trajectory: Translation Down


The camera trajectory is applied for the first 1/3 frames, and for the remaining frames, the camera stays fixed at the final pose.

[Video panels: Source Video, Generated Video (two examples)]

Camera Trajectory: Translation Up


The camera trajectory is applied for the first 1/3 frames, and for the remaining frames, the camera stays fixed at the final pose.

[Video panels: Source Video, Generated Video (two examples)]

Camera Trajectory: Zoom out


The camera trajectory is applied for the first 1/3 frames, and for the remaining frames, the camera stays fixed at the final pose.

[Video panels: Source Video, Generated Video (two examples)]

Camera Trajectory: Zoom in


The camera trajectory is applied for the first 1/3 frames, and for the remaining frames, the camera stays fixed at the final pose.

[Video panels: Source Video, Generated Video (two examples)]

More Complex Camera Trajectories


The camera trajectory is applied for all the frames.

Trajectory 1

[Video panels: Source Video, Generated Video (two examples)]

Trajectory 2

[Video panels: Source Video, Generated Video (two examples)]

Multi-Camera Shooting


The camera trajectory is applied for the first 1/3 frames, and for the remaining frames, the camera stays fixed at the final pose.

The video in the center is the source video.



Method

Constructing Training Source-Anchor Video Pairs



(a) Due to errors in point cloud estimation, rendered anchor videos from reconstructed point clouds often exhibit misaligned regions compared to the source video, and also require access to the source video’s camera trajectory for rendering. (b) We create anchor videos based on first-frame visibility, which ensures better alignment with the source video and eliminates the need for camera trajectory annotations. (c) Our full construction pipeline ends with the addition of dashed rays to simulate the flying-pixel artifacts commonly seen in point-cloud-rendered anchor videos.
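The masking step in (b) can be illustrated with a short sketch. It assumes a per-frame visibility mask is already available (True where a pixel shows content that was visible in the first frame); how that mask is estimated is not described here. The function name `make_masked_anchor` and the black fill value are hypothetical choices for the example, and the dashed-ray step in (c) is not covered.

```python
import numpy as np

def make_masked_anchor(source_video, visibility, fill_value=0):
    """Mask a source video to form an anchor video.

    source_video: (T, H, W, 3) uint8 frames.
    visibility:   (T, H, W) bool, True where the pixel shows content that
                  was visible in the first frame (assumed to be given).
    Pixels outside the visible region are set to `fill_value`.
    """
    if source_video.shape[:3] != visibility.shape:
        raise ValueError("visibility must match the video's (T, H, W) shape")
    # Broadcast the mask over the color channels and blank out hidden pixels.
    anchor = np.where(visibility[..., None], source_video, fill_value)
    return anchor.astype(source_video.dtype)
```

Because the anchor is cut directly from the source video, its visible pixels match the source exactly, which is the alignment property the caption describes.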

Examples of our anchor video construction process.

[Video panels: Source Video, Masked Source Video, Anchor Video]

Model Architecture and Inference Modes



(a) shows an overview of our EPiC framework. EPiC supports multiple inference scenarios. (b) and (c) illustrate our I2V inference scenarios using full and masked point clouds, respectively. The masked variant is designed to enable more dynamic video generation by selectively masking the control signals from potentially moving objects in the point clouds and the resulting anchor videos. (d) depicts the V2V inference scenario employing dynamic point clouds.
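The difference between the full and masked point-cloud modes in (b) and (c) can be sketched with a toy renderer. This is an illustrative assumption rather than the actual rendering pipeline: it uses a plain pinhole projection with a per-pixel depth test, and it assumes a per-point flag `is_dynamic` marking potentially moving objects is supplied from elsewhere. All function and parameter names are hypothetical.

```python
import numpy as np

def render_points(points, colors, K, world_to_cam, height, width):
    """Project colored 3D points into an image, keeping the nearest point per pixel."""
    cam = points @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
    in_front = cam[:, 2] > 1e-6
    cam, colors = cam[in_front], colors[in_front]
    uvw = cam @ K.T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    image = np.zeros((height, width, 3), dtype=colors.dtype)
    depth = np.full((height, width), np.inf)
    for ui, vi, zi, ci in zip(u[inside], v[inside], cam[inside, 2], colors[inside]):
        if zi < depth[vi, ui]:  # simple z-buffer: nearer points win
            depth[vi, ui] = zi
            image[vi, ui] = ci
    return image

def render_anchor_frame(points, colors, is_dynamic, K, world_to_cam,
                        height, width, mask_dynamic=True):
    """Render one anchor frame; optionally drop points flagged as dynamic.

    mask_dynamic=True loosely corresponds to the masked point-cloud mode,
    mask_dynamic=False to the full point-cloud mode.
    """
    keep = ~is_dynamic if mask_dynamic else np.ones(len(points), dtype=bool)
    return render_points(points[keep], colors[keep], K, world_to_cam, height, width)
```

Dropping the flagged points leaves empty regions in the anchor frame, so the control signal no longer pins those objects in place and the video model is free to animate them.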

Results with Anchor Videos

Image-to-Video Results with Two Inference Modes


[Video panels, two sets: Source Image, Anchor Videos rendered from Masked Point Clouds, Generated Videos with mode (c)]

[Video panels, two sets: Source Image, Anchor Videos rendered from Full Point Clouds, Generated Videos with mode (b)]

Comparison with Previous Methods (Image-to-Video)


Comparison with ViewCrafter


Qualitative comparison on RealEstate10K (ViewCrafter vs. EPiC). EPiC produces better video quality.

[Video panels: EPiC Anchor Video, EPiC Output Video, ViewCrafter Output Video]

Qualitative comparison on MiraData (ViewCrafter vs. EPiC). EPiC produces better video quality.

[Video panels: EPiC Anchor Video, EPiC Output Video, ViewCrafter Output Video]

Comparison with FloVD


Qualitative comparison on RealEstate10K (FloVD vs. EPiC). Video quality is comparable, while EPiC's camera motion is more accurate.

[Video panels: EPiC Anchor Video, EPiC Output Video, FloVD Output Video]

Qualitative comparison on MiraData (FloVD vs. EPiC). Video quality is comparable, while EPiC camera motion is more accurate.

[Video panels: EPiC Anchor Video, EPiC Output Video, FloVD Output Video]

FloVD vs. EPiC: comparison of robustness across 3 different random seeds with the same input camera trajectory. EPiC's generated videos also follow the same trajectory across seeds, while FloVD's generated videos are more varied.

[Video grid: FloVD and EPiC outputs for Seed 1, Seed 2, and Seed 3]

Comparison with Gen3C


Comparison of EPiC and Gen3C on RealEstate10K. Video quality is comparable. Camera motion is comparable in most cases, but Gen3C fails to follow the anchor video in the first zoom-out example.

[Video panels: EPiC Anchor Video, EPiC Output Video, Gen3C Output Video]

Comparison of EPiC and Gen3C on MiraData. Gen3C's video quality is worse, especially for cases containing humans. Camera motion is comparable. Gen3C also generates only static scenes.

[Video panels: EPiC Anchor Video, EPiC Output Video, Gen3C Output Video]

Comparison with Previous Methods (Video-to-Video)


Comparison with Gen3C


Comparison of EPiC and Gen3C on video-to-video generation with Sora videos. Both models follow the anchor video well, but Gen3C follows it too strictly and reproduces the anchor video's occlusion artifacts (foreground holes) for the kangaroo (2nd example) and mammoth (3rd example), while EPiC's generated videos look more natural and are free of these artifacts.

[Video panels: Reference Video, Anchor Video, EPiC Output Video, Gen3C Output Video]

Comparison with TrajectoryCrafter


Comparison of EPiC and TrajectoryCrafter on video-to-video generation. TrajectoryCrafter adheres too rigidly to the anchor video and reproduces its artifacts, namely visible holes in the foreground through which the background appears (mammoth and corgi examples), while EPiC's generated videos look more natural and are free of these occlusion artifacts.

[Video panels: Reference Video, Anchor Video, TrajCrafter Output Video, EPiC Output Video]

Comparison with RecamMaster


Comparison of EPiC and RecamMaster on 6 examples. The camera trajectory is applied for the first 1/3 of the frames; for the remaining frames, the applied camera pose is held at the final pose of that initial segment.
suv-in-the-dust example: Although the source video's camera moves forward, following the car, RecamMaster makes the camera static after the first 1/3 of the frames, and the car does not move forward.

[Video grid: Ref Video, with RecamMaster and EPiC outputs for Translation Up, Translation Down, Arc Left, and Arc Right]

happy-cat example: As with suv-in-the-dust, the source video contains camera movement, but RecamMaster fails to maintain the camera motion for the last 2/3 of the frames, resulting in a static camera and a static object.

[Video grid: Ref Video, with RecamMaster and EPiC outputs for Translation Up, Translation Down, Arc Left, and Arc Right]

basketball-explosion example: RecamMaster struggles to preserve the correct 3D structure in the arc-right case (the basketball doesn't touch the rim). In the translation-up case, it hallucinates an extra basketball and even a backboard that do not exist in the source video.

[Video grid: Ref Video, with RecamMaster and EPiC outputs for Translation Up, Translation Down, Arc Left, and Arc Right]

vlogger-corgi example: RecamMaster fails to maintain the structure of the selfie stick in the arc-left and arc-right cases.

[Video grid: Ref Video, with RecamMaster and EPiC outputs for Translation Up, Translation Down, Arc Left, and Arc Right]

photoreal-train example: RecamMaster produces structural inconsistency where the bridge deck separates from the piers for all cases.

[Video grid: Ref Video, with RecamMaster and EPiC outputs for Translation Up, Translation Down, Arc Left, and Arc Right]

art-museum example: RecamMaster produces flickering and oil-painting-like artifacts in the generated video (translation-down, arc-right).

[Video grid: Ref Video, with RecamMaster and EPiC outputs for Translation Up, Translation Down, Arc Left, and Arc Right]

RecamMaster vs. EPiC: comparison of robustness across 3 different random seeds with the same input trajectory (Arc Left). EPiC's generated videos also follow the same trajectory across seeds thanks to the explicit anchor video guidance, while RecamMaster's generated videos are more varied.

[Video grid: Ref Video, with RecamMaster and EPiC outputs for Seed 1, Seed 2, and Seed 3]

Comparison on Kubric 4D


EPiC is comparable to or better than baselines on Kubric 4D V2V, yet it is zero-shot and trained only on I2V data (GCD and Gen3C incorporate Kubric-4D training data, and TrajCrafter and ReCamMaster incorporate V2V-specific training data).

BibTeX


@inproceedings{wang2026epic,
  author    = {Wang, Zun and Cho, Jaemin and Li, Jialu and Lin, Han and Yoon, Jaehong and Zhang, Yue and Bansal, Mohit},
  title     = {{EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance}},
  booktitle = {ICML},
  year      = {2026},
  url       = {http://arxiv.org/abs/2505.21876}
}