We use Blender to render 24 diverse, object-rich scenes containing 30 distinct objects. Objects are captured from multiple camera rings under parameterized translation, rotation, and scaling, yielding a Rendered Dataset of 20,000 image pairs.
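For illustration, below is a minimal sketch of how one such camera ring and one transformed object could be rendered from Blender's bundled Python (bpy). The object name "TargetObject", the transform parameters, and the output paths are hypothetical placeholders; the actual rendering pipeline may differ.

```python
# Hypothetical sketch: render one source/target image pair per camera-ring view
# inside Blender (bpy). Names, parameters, and paths are illustrative only.
import math
import bpy
import mathutils

scene = bpy.context.scene
obj = bpy.data.objects["TargetObject"]   # object to transform (hypothetical name)
cam = bpy.data.objects["Camera"]

def look_at_origin():
    # aim the camera at the scene origin
    direction = mathutils.Vector((0.0, 0.0, 0.0)) - cam.location
    cam.rotation_euler = direction.to_track_quat('-Z', 'Y').to_euler()

def render(path):
    scene.render.filepath = path
    bpy.ops.render.render(write_still=True)

ring_radius, ring_height, num_views = 6.0, 2.5, 8
for view in range(num_views):
    # place the camera on a ring around the scene
    angle = 2.0 * math.pi * view / num_views
    cam.location = (ring_radius * math.cos(angle),
                    ring_radius * math.sin(angle),
                    ring_height)
    look_at_origin()

    # source image: object in its original pose
    orig_pose = (obj.location.copy(), obj.rotation_euler.copy(), obj.scale.copy())
    render(f"//pairs/{view:03d}_source.png")

    # target image: parameterized translation, rotation, and scaling of the object
    obj.location.x += 1.0
    obj.rotation_euler.z += math.radians(30.0)
    obj.scale = (1.2, 1.2, 1.2)
    render(f"//pairs/{view:03d}_target.png")

    # restore the original pose before the next view
    obj.location, obj.rotation_euler, obj.scale = orig_pose
```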
Building on the AnyInsertion technique, we construct a dataset of 100,000+ high-quality image pairs covering scenarios ranging from simple translations to complex multi-dimensional rotations, which significantly enhances the model's ability to handle lighting and shadow effects.
Recent advances in diffusion models have significantly improved image editing. However, challenges persist in handling geometric transformations such as translation, rotation, and scaling, particularly in complex scenes. Existing approaches suffer from two main limitations: (1) inaccurate geometric editing when objects are translated, rotated, or scaled; (2) inadequate modeling of intricate lighting and shadow effects, leading to unrealistic results.
To address these issues, we propose GeoEdit, a framework that leverages in-context generation through a diffusion transformer module, which integrates geometric transformations for precise object edits. Moreover, we introduce Effects-Sensitive Attention, which enhances the modeling of intricate lighting and shadow effects for improved realism. To further support training, we construct RS-Objects, a large-scale geometric editing dataset containing over 120,000 high-quality image pairs, enabling the model to learn precise geometric editing while generating realistic lighting and shadows. Extensive experiments on public benchmarks demonstrate that GeoEdit consistently outperforms state-of-the-art methods in terms of visual quality, geometric accuracy, and realism.
Our goal is to achieve precise geometric editing while generating realistic lighting and shadow effects. To this end, we propose GeoEdit, a diffusion-based framework for geometric image editing. Given an input image and a source mask, the Geometric Transformation module applies translation, rotation, and scaling to obtain a target mask and an appearance reference (the transformed object) for in-context guidance. These inputs, together with the original image, are processed by the Diffusion Transformer module, where the paired masks explicitly constrain content generation and the Effects-Sensitive Attention (ESA) mechanism adaptively captures lighting and shadow effects for realistic edits. The resulting representations are then decoded into the final edited image.
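The following is a minimal sketch of the Geometric Transformation step described above, not the paper's implementation: given a source image, a source mask, and user-specified translation, rotation, and scale parameters, it warps the mask into a target mask and warps the object's pixels into an appearance reference for in-context guidance. OpenCV and NumPy are assumed; all names are illustrative.

```python
# Hypothetical sketch of the Geometric Transformation step (not the authors' code):
# build a similarity transform from (dx, dy, angle, scale), warp the source mask
# into the target mask, and warp the masked object pixels into an appearance
# reference used for in-context guidance.
import cv2
import numpy as np

def geometric_transform(image, src_mask, dx, dy, angle_deg, scale):
    """image: HxWx3 uint8; src_mask: HxW uint8 mask (0/255) of the object to edit."""
    h, w = src_mask.shape
    ys, xs = np.nonzero(src_mask)
    cx, cy = float(xs.mean()), float(ys.mean())    # rotate/scale about the object centroid

    # 2x3 similarity matrix: rotation + scaling about (cx, cy), then translation
    M = cv2.getRotationMatrix2D((cx, cy), angle_deg, scale)
    M[:, 2] += (dx, dy)

    # target mask: constrains where the edited object should appear
    tgt_mask = cv2.warpAffine(src_mask, M, (w, h), flags=cv2.INTER_NEAREST)

    # appearance reference: the object's pixels moved to their new pose
    obj_pixels = cv2.bitwise_and(image, image, mask=src_mask)
    appearance_ref = cv2.warpAffine(obj_pixels, M, (w, h), flags=cv2.INTER_LINEAR)
    return tgt_mask, appearance_ref
```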