CARLA-Air: Fly Drones Inside a CARLA World — A Unified Infrastructure for Air-Ground Embodied Intelligence
Paper: arXiv:2603.28032
Code: louiszengCN/CarlaAir
Code reference: main@d70247b5 (2026-05-02)
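1. Motivation
Existing open-source simulators split capabilities across separate domains. CARLA offers high-fidelity urban roads, traffic flow, pedestrians, and a mature Python API, but no physically consistent UAV dynamics. AirSim offers multirotor flight and aerial sensors, but lacks realistic urban traffic, pedestrian interaction, and large-scale ground scenes. Connecting the two simulators through ROS 2 or a custom bridge is feasible, but it introduces inter-process synchronization, serialization overhead, and a dual rendering pipeline. Most critically, it puts spatial-temporal consistency at risk: the aerial view, ground view, LiDAR, and IMU/GNSS readings for the same nominal instant may not reflect the state of the same physics tick.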

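Reading Figure 1: the teaser shows that CARLA-Air's goal is not merely to "make a drone appear in a CARLA map" but to put air-ground simulation, multi-modal sensing, embodied navigation, asset adaptation, and multi-city scenes into one shared, physically coherent world. For the reader, this figure corresponds to the paper's requirements definition: future low-altitude economy, drone logistics, urban inspection, and air-ground cooperation applications all need aerial and ground agents to perceive, act, and be evaluated simultaneously in the same world.
Reading Figure 2: the paper places CARLA-Air in the "high-fidelity + multi-domain" quadrant. CARLA and AirSim are each high-fidelity but single-domain; joint platforms such as TranSimHub take the bridge/co-simulation route; CARLA-Air claims to keep the upstream capabilities while removing the bridge boundary, so that ground traffic, pedestrians, UAV flight, native APIs, and a shared renderer all hold at once.
Reading Figure 3: this figure explains why bridge-based co-simulation is not the ideal solution. As the number of concurrent sensors increases, the bridge approach's per-frame data transfer time (measured in ms) keeps rising, whereas CARLA-Air's single-process design holds it roughly constant. The core of the difference is not network optimization but whether cross-process serialization and independent rendering pipelines exist at all.
The specific problem this paper addresses is building a unified open-source infrastructure for air-ground embodied intelligence, in which drones, vehicles, pedestrians, traffic, weather, and sensor streams run in the same UE4 process, on the same physics tick, and through the same rendering pipeline, while the native CARLA and AirSim APIs are preserved. The problem is worth solving because once the simulation foundation is unified, cross-view perception datasets, air-ground cooperative RL, VLN/VLA data generation, precision landing, and multi-agent coordination can all be implemented in one reproducible environment, instead of every project re-maintaining its own bridge, clock sync, and data-alignment glue code.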
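2. Idea
Core insight: UE4 allows only one active GameMode, and CARLA's key ground subsystems must be initialized through GameMode inheritance, whereas AirSim's core flight logic behaves more like a world actor that can be spawned afterwards. CARLA-Air therefore chooses "CARLA GameMode inheritance + AirSim actor composition" rather than bridging two GameModes as equals.
The key innovations reduce to three points. First, a unified game mode inherits CARLA's episode, weather, traffic, actor lifecycle, and RPC infrastructure. Second, after BeginPlay, AirSim's multirotor sim mode and flight pawn are composed into the same world as regular world actors. Third, the two native Python clients connect to their respective RPC servers, but on the server side they share the same UE4 world, tick, and renderer. Compared with multi-process bridges such as TranSimHub or AirSim+Gazebo, the fundamental difference is where the system boundary sits: CARLA-Air does not synchronize state between processes; it lets both APIs observe the same world state inside one engine process.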
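3. Method
3.1 Overall framework: single-process dual-backend runtime
CARLA-Air hosts two backends in parallel inside a single UE4 process: the CARLA RPC server serves the ground client and the AirSim RPC server serves the aerial client. Beneath them there are not two worlds but one set of shared actors, physics, and rendering pipeline managed by the CARLA-Air Game Mode. The CARLAAirGameMode of the paper corresponds to ASimWorldGameMode in the released code: it inherits from ACarlaGameModeBase and, in BeginPlay(), initializes the AirSim settings, creates the SimMode, creates the AirSim widget and input bindings, and starts the AirSim API server.
Reading Figure 4: the top part shows the two Python clients, which need no code changes; the middle shows the two independent RPC servers; the core is the blue box, where ground subsystems are obtained through inheritance and the aerial flight actor is added through composition. The shared rendering pipeline at the bottom indicates that RGB, depth, segmentation, and weather effects come from the same render pass, so multi-view sensing can be aligned per tick.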
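3.2 GameMode conflict: inheritance + composition
UE4's single-GameMode constraint is the key engineering bottleneck of this work. The CARLA GameMode is responsible for episode, weather, traffic, actor lifecycle, recorder, and RPC; the AirSim GameMode is responsible for reading settings.json, configuring the renderer, spawning the flight actor, and starting the aerial API. If the map selects the CARLA GameMode, AirSim initialization is missing; if it selects the AirSim GameMode, CARLA's episode, RPC, and actor dispatch are missing. The paper's solution is to have the unified game mode inherit from CARLA and then compose AirSim's flight actor into the world.
Reading Figure 5: the naive approach on the left shows two GameModes competing for one slot, so one of them is necessarily discarded. The CARLA-Air solution on the right shows the unified class occupying the slot and preserving the CARLA lifecycle, then spawning the aerial flight actor along a dashed composition edge. This asymmetry is why the scheme works: CARLA cannot easily be moved out of the GameMode, whereas AirSim's flight logic can be initialized afterwards as an actor.
Differences between the paper and the released code: the paper body and appendix call the unified class CARLAAirGameMode, but the actual class name in the released code at main@d70247b5 is ASimWorldGameMode, in Unreal/CarlaUE4/Plugins/AirSim/Source/SimWorldGameMode.{h,cpp}; functionally it does inherit from ACarlaGameModeBase and bootstrap AirSim in BeginPlay(). A second difference concerns W3: the paper's experiment uses 30 vehicles + 10 pedestrians + 12 streams + synchronous mode, whereas the public example CarlaAir_Release/source/python_api/examples/data_collector.py defaults to --frames=100, 15 traffic vehicles, and clk.tick(20). The experimental numbers in this note therefore follow the paper's source tables, and the pseudocode only illustrates the implementation skeleton of the released example.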
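3.3 Coordinate mapping: UE4/CARLA frame to AirSim NED
CARLA inherits UE4's left-handed coordinate frame: $x$ forward, $y$ right, $z$ up, in centimeters. AirSim uses the NED frame: $x$ north, $y$ east, $z$ down, in meters. Because the $x$ and $y$ directions of the two frames are aligned, the conversion only needs a scale and a $z$ flip. Given a UE4 point $\mathbf{p}_{\mathrm{UE}} = (x, y, z)$ and a shared origin $\mathbf{o} = (x_0, y_0, z_0)$:

$$\mathbf{p}_{\mathrm{NED}} = \frac{1}{100}\begin{pmatrix} x - x_0 \\ y - y_0 \\ -(z - z_0) \end{pmatrix}$$

For a unit quaternion $q = (w, q_x, q_y, q_z)$, the paper gives:

$$q_{\mathrm{NED}} = (w,\ q_x,\ q_y,\ -q_z)$$

W4 additionally defines an origin offset and states its value for the case where the drone spawns at the world origin. The released doc COORDINATE_SYSTEMS.md explains the position transform with airsim_z = -carla_z + offset_z, while DroneFollower.update() in demo_drive_and_fly.py actually implements XY relative-offset following with target_z=-altitude. Note that the quaternion sign flip above is the paper's formula; the released docs also state that Euler pitch/yaw/roll agree on both sides and need no conversion, but the released examples and docs provide no general q_ned quaternion conversion helper.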

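Reading Figure 6: CARLA/UE4 on the left is z-up and AirSim on the right is NED z-down. Because the horizontal axes are aligned, the mapping avoids a complicated axis permutation; what really needs care is the unit change from centimeters to meters and the z-axis sign flip. This simplification directly lowers the implementation complexity of cross-view datasets and relative-reward computation.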
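The mapping described in this subsection (centimeter-to-meter scale plus a z flip relative to a shared origin) can be sketched as a pair of small pure-Python helpers. This is a minimal illustration under stated assumptions, not a helper from the released code: the function names, the tuple layout, and the `(w, x, y, z)` quaternion ordering are chosen here for illustration, and the quaternion rule simply restates the paper formula recorded in this note.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # assumed ordering: (w, x, y, z)

CM_PER_M = 100.0  # UE4 works in centimeters, AirSim NED in meters


def ue4_cm_to_ned_m(point_cm: Vec3, origin_cm: Vec3 = (0.0, 0.0, 0.0)) -> Vec3:
    """Hypothetical helper: UE4 point (cm, z-up) to NED (m, z-down)."""
    x, y, z = point_cm
    x0, y0, z0 = origin_cm
    # x and y are already aligned, so only the scale changes; z also flips.
    return ((x - x0) / CM_PER_M, (y - y0) / CM_PER_M, -(z - z0) / CM_PER_M)


def ned_m_to_ue4_cm(point_m: Vec3, origin_cm: Vec3 = (0.0, 0.0, 0.0)) -> Vec3:
    """Inverse of ue4_cm_to_ned_m under the same shared origin."""
    x, y, z = point_m
    x0, y0, z0 = origin_cm
    return (x * CM_PER_M + x0, y * CM_PER_M + y0, -z * CM_PER_M + z0)


def quat_ue4_to_ned(q: Quat) -> Quat:
    """Paper formula as recorded in this note: negate the z component."""
    w, qx, qy, qz = q
    return (w, qx, qy, -qz)


if __name__ == "__main__":
    origin = (500.0, 0.0, 0.0)
    ned = ue4_cm_to_ned_m((1500.0, -200.0, 3000.0), origin)
    print(ned)                           # a point 30 m above the origin has negative NED z
    print(ned_m_to_ue4_cm(ned, origin))  # round trip back to centimeters
```

Note that the CARLA Python API already reports locations in meters, so a client-side script would skip the centimeter scale and apply only the offset and z flip, as the released docs describe.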
3.4 Asset pipeline and workflow runtime
On the asset side, CARLA-Air supports importing custom robots, UAV configurations, vehicles, and environment maps and turning them into standard CARLA spawnable actors: they take part in the same physics tick and rendering pass and are visible to both aerial and ground sensor modalities. On the workflow side, the five applications share one dual-client pattern: a single Python process creates carla.Client('localhost', 2000) and airsim.MultirotorClient(port=41451), then operates on the same world through the two APIs.

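Reading Figure 7: the top shows an imported four-wheel mobile robot and the bottom a custom electric sports car. The point is not appearance but the contract of the asset pipeline: once imported, they are no longer private objects of one backend; they can be spawned through the standard CARLA API, observed by the same renderer and sensors, and share one world with AirSim aerial agents.
Reading Figure 8: all five workflows follow the dual-client architecture. One user script drives both the ground client and the aerial client; each client connects to its server over TCP, but beneath the servers sit the same Unified UE4 Process and Shared World. As a result, W1's vehicle pose, W3's 12-stream dataset, W4's weather consistency, and W5's reward can all use the same tick as the alignment key.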
3.5 Pseudocode: key components abstracted from the released code

    # Python-style pseudocode for the C++ class ASimWorldGameMode
    class SimWorldGameMode(CarlaGameModeBase):
        def __init__(self):
            super().__init__()  # CARLA episode, recorder, factories, weather
            self.default_pawn_class = None
            self.sim_mode = None
            self.drone_registered = False
            self.actor_factories += [SensorFactory, StaticMeshFactory,
                                     TriggerFactory, VehicleFactory, WalkerFactory]

        def begin_play(self):
            super().begin_play()  # finish CARLA world / episode bootstrap
            self.spectator = self.spawn_and_register_spectator()
            self.initialize_airsim_settings()  # settings.json or default settings
            self.set_unreal_engine_settings()  # disable blur, enable custom depth
            self.sim_mode = self.create_sim_mode()  # Multirotor / Car / CV
            self.widget = self.create_airsim_widget()
            self.setup_airsim_input_bindings()
            self.sim_mode.start_api_server()

        def tick(self, dt):
            super().tick(dt)
            if self.sim_mode is None:
                return
            if not self.drone_registered:
                # register the AirSim pawn once so CARLA can see it as an actor
                pawn = self.sim_mode.getVehicleSimApi().getPawn()
                self.carla_episode.ActorDispatcher.RegisterActor(
                    pawn, type_id="airsim.drone")
                self.drone_registered = True
            if self.sim_mode.EnableReport:
                self.widget.updateDebugReport(self.sim_mode.getDebugReport())

    def carla_location_to_airsim_ned(carla_location_m, offsets_m):
        # released docs: CARLA Python API already reports meters;
        # AirSim NED keeps X/Y axes and flips Z with a calibrated offset.
        ax = carla_location_m.x + offsets_m.x
        ay = carla_location_m.y + offsets_m.y
        az = -carla_location_m.z + offsets_m.z
        return ax, ay, az

    # paper-only orientation formula, not a released helper in main@d70247b5:
    # q_ned = (w, qx, qy, -qz)

    import airsim
    import carla

    def dual_client_workflow(vehicle_blueprint, spawn_point):
        ground = carla.Client("localhost", 2000)
        world = ground.get_world()
        aerial = airsim.MultirotorClient(port=41451)
        aerial.confirmConnection()
        aerial.enableApiControl(True)
        aerial.armDisarm(True)
        aerial.takeoffAsync().join()
        # both APIs operate on the same UE4 world state
        vehicle = world.spawn_actor(vehicle_blueprint, spawn_point)
        drone_state = aerial.getMultirotorState()
        rgb = aerial.simGetImages([airsim.ImageRequest("0", airsim.ImageType.Scene)])
        return vehicle, drone_state, rgb

    def collect_air_ground_record(world, airsim_client, tick_id, save_dir):
        # public data_collector.py attaches RGB/depth/seg/LiDAR/IMU/GNSS to ego vehicle
        # pseudocode helpers below (read_latest_..., serialize_..., write_...) are placeholders
        ground_packet = read_latest_ground_sensor_callbacks()
        aerial_rgb = airsim_client.simGetImages([
            airsim.ImageRequest("0", airsim.ImageType.Scene, False, False)
        ])[0]
        # pseudocode: ego lookup stands in for the script's own handle, not a CARLA API call
        ego_tf = world.get_ego_vehicle().get_transform()
        metadata = {
            "frame": tick_id,
            "ego_pose": serialize_transform(ego_tf),
            "shared_tick_key": tick_id,
        }
        write_np_arrays_and_json(save_dir, tick_id, ground_packet, aerial_rgb, metadata)

The W5 / RL environment is presented in the paper as a demonstration of platform capability. The released code at main@d70247b5 provides no directly corresponding Gym/RLlib environment, reward function, or training script, so no code-grounded pseudocode is written for it here; this note only records the observation/action/reward concepts from the paper in the experimental results and explicitly marks them as not implemented in code.
Code reference: main@d70247b5 (2026-05-02); the pseudocode and the mapping below are based on this commit.
| Paper Concept | Source File | Key Class/Function |
|---|---|---|
| Unified GameMode / single-process integration | Unreal/CarlaUE4/Plugins/AirSim/Source/SimWorldGameMode.{h,cpp} | ASimWorldGameMode : ACarlaGameModeBase, BeginPlay(), Tick(), CreateSimMode() |
| One-direction AirSim → CARLA dependency | Unreal/CarlaUE4/Plugins/AirSim/Source/AirSim.Build.cs | PrivateDependencyModuleNames includes Carla; ground code does not import AirSim |
| CARLA actor/episode access for integrated actor registration | Unreal/CarlaUE4/Plugins/Carla/Source/Carla/Game/CarlaEpisode.h | friend class ASimWorldGameMode, ActorDispatcher access |
| AirSim settings and API server bootstrap | SimWorldGameMode.cpp, CarlaAir_Release/source/config/settings.json | InitializeAirSimSettings(), SimMode_->startApiServer(), SimMode=Multirotor, cameras at 1280×960 |
| Coordinate mapping / drone following | CarlaAir_Release/guide/COORDINATE_SYSTEMS.md, CarlaAir_Release/source/python_api/examples/demo_drive_and_fly.py | airsim_z=-carla_z+offset_z, DroneFollower.update() |
| Dual-client API pattern | CarlaAir_Release/guide/examples/07_combined_demo.py, README.md | carla.Client('localhost', 2000), airsim.MultirotorClient(port=41451) |
| Multi-modal dataset example | CarlaAir_Release/source/python_api/examples/data_collector.py | spawns RGB/depth/seg/LiDAR/IMU/GNSS sensors and calls simGetImages() |
| Public release / quick start | CarlaAir_Release/guide/Quick-Start.md | ports, synchronous mode snippet, native API usage |
| W5 RL environment concept | README.md, paper Figure 14 only | Not implemented as a Gym/RLlib env/reward/training script in the released code; paper-level capability demo |
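3.6 Intuition: why a single process matters more than a bridge
The point of this paper is not a new controller; it is moving the "alignment boundary" of controller, sensor, renderer, and physics inside the engine. A bridge treats the two simulators as black boxes and can only exchange state at the process boundary; as the sensor count grows, serialization and clock drift both become systematic errors. CARLA-Air instead lets the renderer read aerial state and ground state on the same tick, so cross-view perception pairs, the relative pose in an RL reward, and the aerial oracle view for VLN/VLA all naturally share one scene snapshot. The value of this design is that downstream tasks no longer each have to fix clocks, coordinates, and weather inconsistency themselves.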
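To make the "same tick as alignment key" idea concrete, the sketch below pairs ground and aerial records by an integer tick index instead of interpolating timestamps. It is an illustration only: the record layout, the stream names, and the function `align_by_tick` are assumptions introduced here and do not come from the released code.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Set, Tuple

# Assumed record layout: (tick index, stream name, payload)
Record = Tuple[int, str, Any]


def align_by_tick(records: Iterable[Record],
                  required_streams: Set[str]) -> Dict[int, Dict[str, Any]]:
    """Group records by tick and keep only ticks where every stream is present."""
    by_tick: Dict[int, Dict[str, Any]] = defaultdict(dict)
    for tick, stream, payload in records:
        by_tick[tick][stream] = payload
    # A tick is usable only if all required streams reported for it.
    return {
        tick: streams
        for tick, streams in sorted(by_tick.items())
        if required_streams.issubset(streams.keys())
    }


if __name__ == "__main__":
    records = [
        (1, "ground_rgb", "g1"), (1, "aerial_rgb", "a1"),
        (2, "ground_rgb", "g2"),                      # aerial frame missing
        (3, "aerial_rgb", "a3"), (3, "ground_rgb", "g3"),
    ]
    aligned = align_by_tick(records, {"ground_rgb", "aerial_rgb"})
    for tick, streams in aligned.items():
        print(tick, streams)
```

Because the key is an exact integer match, a pair is either from the same scene snapshot or it is dropped; there is no tolerance window to tune, which is the property a bridge with two independent clocks cannot offer.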
4. Experimental Setup
Datasets / workloads and scale
This paper does not propose a static dataset; it validates a set of simulator workloads. W1 (precision landing) has a drone on Town10HD descend from its initial altitude onto a moving vehicle. W3 (multi-modal dataset collection) runs 1,000 ticks with 30 autopilot vehicles, 10 pedestrians, 8 ground sensors, and 4 aerial sensors, yielding 1,000 12-stream records. W4 (cross-view perception) runs 500 ticks, covers the 14 official weather presets, and shows multi-weather aerial views of Town01 to Town05 and Town10HD in Figure 13. W5 validates the RL training environment; its core evidence is 0 crashes and 0 API errors across 357 actor spawn/destroy reset cycles.
Baselines / compared systems
The paper compares against three classes of baselines: single-domain simulators (CARLA, LGSVL, SUMO, MetaDrive, VISTA; AirSim, Flightmare, FlightGoggles, Gazebo/RotorS, OmniDrones, gym-pybullet-drones), joint/co-simulation schemes (TranSimHub, CARLA+SUMO, AirSim+Gazebo, the community AirSim+CARLA bridge), and embodied/RL platforms (Isaac Lab, Isaac Gym, Habitat, SAPIEN, RoboSuite). The performance experiments additionally cover the profiles standalone ground sim only, standalone aerial sim only, joint idle, ground only, moderate joint, traffic surveillance, and stability endurance.
Metrics
The main metrics are harmonic-mean FPS (appropriate for a rate quantity), VRAM in MiB, CPU utilization, API round-trip latency (median/IQR), sensor alignment deviation, RPC error count, crash count, landing horizontal error, records collected, and weather presets passed. The paper also reports the bridge IPC per-frame transfer time to show that cross-process serialization overhead grows nearly linearly with sensor count.
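As a brief illustration of why the harmonic mean suits FPS and how a median/IQR latency summary is computed, the sketch below uses only the Python standard library. The function names and the sample values are made up for illustration; they are not measurements from the paper.

```python
import statistics
from typing import Sequence, Tuple


def harmonic_mean_fps(frame_times_s: Sequence[float]) -> float:
    """Harmonic mean of per-frame FPS values.

    With fps_i = 1 / t_i, the harmonic mean reduces to len(t) / sum(t),
    i.e. total frames divided by total wall time.
    """
    return len(frame_times_s) / sum(frame_times_s)


def arithmetic_mean_fps(frame_times_s: Sequence[float]) -> float:
    """Arithmetic mean of per-frame FPS; it over-weights fast frames."""
    return statistics.fmean([1.0 / t for t in frame_times_s])


def median_and_iqr(samples: Sequence[float]) -> Tuple[float, float]:
    """Median and interquartile range (Q3 - Q1) of latency samples."""
    q1, _q2, q3 = statistics.quantiles(samples, n=4)
    return statistics.median(samples), q3 - q1


if __name__ == "__main__":
    frame_times = [0.05, 0.05, 0.10]  # synthetic: two fast frames, one slow frame
    print(harmonic_mean_fps(frame_times))    # frames actually delivered per second
    print(arithmetic_mean_fps(frame_times))  # higher, because it ignores time spent
    print(median_and_iqr([100, 200, 300, 400, 500, 600, 700]))
```

The harmonic mean is never larger than the arithmetic mean, so a run with occasional slow frames is reported at the rate a closed-loop controller would actually experience.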
System / benchmark config
The hardware and software stack comes from the paper's source file performance.tex and the appendix: Ubuntu 20.04/22.04 LTS, NVIDIA RTX A4000 16GB GDDR6, AMD Ryzen 7 5800X (8 cores, 4.7GHz), 32GB DDR4-3200; CARLA 0.9.16, AirSim 1.8.1, UE4.26, Python 3.8+. The default map is Town10HD at Epic quality in synchronous mode; aerial experiments use the built-in SimpleFlight controller with default PID gains. The benchmark harness runs warm-up ticks before the measurement ticks; VRAM is sampled every 60 s; the latency benchmark uses 500 warm-up calls + 5,000 measurement calls. The paper does not train a neural policy, so it reports no learning rate, batch size, or step count; W5 is an RL environment capability, for which the authors give only the observation/action/reward form and stability evidence.
5. Experimental Results
Main performance numbers
| Profile | Configuration | FPS | VRAM (MiB) | CPU (%) |
|---|---|---|---|---|
| Ground sim only | 3 vehicles + 2 pedestrians; 8 sensors @ | |||
| Aerial sim only | 1 drone; 8 sensors @ | |||
| Idle | Town10HD; no actors; no sensors | |||
| Ground only | 3 vehicles + 2 pedestrians; 8 sensors @ | |||
| Moderate joint | 3 vehicles + 2 pedestrians + 1 drone; 8 sensors @ | |||
| Traffic surveillance | 8 autopilot vehicles + 1 drone; 1 aerial RGB @ | |||
| Stability endurance | Moderate joint; 357 spawn/destroy cycles; 3 hr continuous | | | |
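The moderate joint profile sustains a frame rate that the paper considers sufficient for closed-loop evaluation at standard RL episode lengths. The FPS drop from the standalone ground baseline to moderate joint is 30.3%, which the paper splits into a ground co-hosting share and an aerial physics engine share. VRAM rises by only 39 MiB over ground-only, so the overhead shows up mainly in CPU.
Reading Figure 9: the 3-hour endurance run contains 357 spawn/destroy cycles. Mean VRAM drifts by about 10 MiB between the early and late cycles, and the paper reports the regression slope in MiB/cycle. This supports the authors' conclusion that there is no significant memory leak and that the residual looks more like render-target caching.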
| Stability Metric | Early cycles 1-30 | Late cycles 328-357 |
|---|---|---|
| Frame rate | FPS | FPS |
| VRAM | MiB | MiB |
| CPU utilization | % | % |
| API error count | 0 | 0 |
| Crash count | 0 | 0 |
| VRAM regression slope | MiB/cycle | |
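The two leak indicators used in the stability analysis, an early-versus-late mean drift and a least-squares slope of VRAM over cycle index, can be computed as in the sketch below. The helper names and the VRAM series are synthetic and purely illustrative; they are not the paper's measurements or its analysis script.

```python
from statistics import fmean
from typing import Sequence


def least_squares_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Ordinary least-squares slope of y against x (for example MiB per cycle)."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def early_late_drift(values: Sequence[float], window: int) -> float:
    """Mean of the last `window` samples minus mean of the first `window`."""
    return fmean(values[-window:]) - fmean(values[:window])


if __name__ == "__main__":
    cycles = list(range(1, 11))
    vram_mib = [1000.0 + 0.5 * c for c in cycles]  # synthetic linear growth
    print(least_squares_slope(cycles, vram_mib))   # slope in MiB per cycle
    print(early_late_drift(vram_mib, window=3))    # drift between first and last 3 cycles
```

A slope near zero together with a small, bounded drift is the pattern one would expect from caching that saturates, whereas a steadily positive slope over hundreds of cycles would point to a leak.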
| API | Call | Median | IQR |
|---|---|---|---|
| Ground sim | World state snapshot | 320 µs | 40 µs |
| Ground sim | Actor transform query | 280 µs | 35 µs |
| Ground sim | Actor spawn + paired destroy | 1,850 µs | 210 µs |
| Ground sim | Actor destroy | 920 µs | 95 µs |
| Aerial sim | Multirotor state query | 410 µs | 55 µs |
| Aerial sim | Image capture, 1 RGB stream | 3,200 µs | 380 µs |
| Aerial sim | Velocity command dispatch | 490 µs | 60 µs |
| Bridge IPC reference | Cross-process state sync | 3,000 µs | 2,000 µs |
Representative workflow results

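Reading Figure 10: W1 shows a drone performing a precision landing on a moving vehicle. The time-lapse on the left shows approach, descent, and touchdown; the trajectory plot and the altitude/error curves on the right show the drone descending smoothly from its initial altitude while the horizontal error converges to within the tolerance band. This workflow checks whether the coordinate transform and tick-synchronous control hold up in real air-ground cooperation.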
| W1 Metric | Value | Notes |
|---|---|---|
| Mean FPS | 19.3 | harmonic mean |
| Initial altitude | m | start of descent |
| Landing duration | s | approach to touchdown |
| Final horizontal error | m | within tolerance band |
| Initial horizontal error | m | at descent start |
| RPC errors | 0 | both clients |

Reading Figure 11: W2 is a qualitative demo of embodied navigation and VLN/VLA data generation. The drone tracks a pedestrian using bird's-eye observations, and each frame in the figure is paired with chain-of-thought-style reasoning. The paper's emphasis is not the performance of any particular language model but that CARLA-Air can simultaneously provide an aerial overview, street-level detail, semantic annotations, and shared weather/lighting for building cross-view instruction-following data.

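Reading Figure 12: W3 captures multi-modal sensor output from the vehicle perspective and the drone perspective on the same simulation tick. The top and bottom rows both contain modalities such as RGB, semantic/depth, and LiDAR/geometry; the key point is that all streams share the same tick index rather than relying on after-the-fact timestamp interpolation.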
| W3 Metric | Value | Notes |
|---|---|---|
| Mean FPS | 17.1 | harmonic mean |
| Concurrent streams | 12 | 8 ground + 4 aerial |
| Records collected | 1,000 | one per tick |
| Max alignment deviation | tick | under disk-write load |
| RPC errors | 0 | both clients |
| Per-tick write latency | ms | incl. serialization |

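Reading Figure 13: W4 shows consistent rendering of the aerial RGB view across Town01 to Town05, Town10HD, and multiple weather presets. The paper uses "14/14 weather presets passed" to show that the shared renderer propagates weather and lighting synchronously to both ground and aerial sensors, and it also states this consistency as a formula.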
| W4 Metric | Value | Notes |
|---|---|---|
| Mean FPS | 18.2 | harmonic mean |
| Co-registered pairs | 500 | aerial depth + ground segmentation |
| Per-tick latency | ms | full collection loop |
| Sensor alignment | 0 ticks | sync mode guarantee |
| Weather presets passed | 14/14 | all official presets |
| RPC errors | 0 | both clients |

Reading Figure 14: W5 wraps CARLA-Air as an RL environment. Each synchronous tick produces an observation (drone pose, vehicle pose, relative distance, traffic state), the policy outputs 3D velocity commands, and the reward encodes tracking accuracy, altitude maintenance, and collision avoidance. The paper reports no reward curve for a specific policy; it relies on stable resets and the shared world state to argue that the environment is suitable for subsequent RL training.
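Since the released code contains no reward function, the following is only a hypothetical sketch of how the three reward terms named in the paper (tracking accuracy, altitude maintenance, collision avoidance) could be combined. Every name, weight, and the additive form itself are assumptions made here for illustration; none of it is taken from the paper or the repository. Positions are assumed to be NED tuples in meters.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]  # assumed NED position in meters (z points down)


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the paper does not publish any values.
    tracking: float = 1.0
    altitude: float = 0.5
    collision: float = 10.0


def tracking_reward(drone_ned: Vec3, vehicle_ned: Vec3, target_altitude_m: float,
                    collided: bool, w: RewardWeights = RewardWeights()) -> float:
    """Illustrative reward: negative weighted sum of three penalty terms."""
    # Tracking accuracy: horizontal distance between drone and vehicle.
    horizontal_error = math.hypot(drone_ned[0] - vehicle_ned[0],
                                  drone_ned[1] - vehicle_ned[1])
    # Altitude maintenance: NED z is down, so altitude is -z.
    altitude_error = abs(-drone_ned[2] - target_altitude_m)
    # Collision avoidance: fixed penalty on any collision flag.
    collision_penalty = w.collision if collided else 0.0
    return -(w.tracking * horizontal_error
             + w.altitude * altitude_error
             + collision_penalty)


if __name__ == "__main__":
    print(tracking_reward((3.0, 4.0, -10.0), (0.0, 0.0, 0.0), 10.0, collided=False))
    print(tracking_reward((3.0, 4.0, -10.0), (0.0, 0.0, 0.0), 10.0, collided=True))
```

Because both poses would come from the same tick in a single-process setup, the relative distance in such a reward needs no clock reconciliation between simulators.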
Ablation / key findings
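Strictly speaking, the paper has no component ablation table; its evidence for "component effectiveness" comes mainly from system comparisons and workload decomposition. The key findings are: the single-process design keeps IPC cost essentially constant as the sensor count grows; the overhead of moderate joint is mainly CPU-bound aerial physics rather than VRAM; traffic surveillance raises the vehicle count to 8 and still maintains its reported FPS, indicating that sensor rendering dominates over actor population; and 0 crashes / 0 API errors across 357 reset cycles support the stability of RL-style episode resets.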
Limitations
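The authors state three limitations explicitly. First, the current performance characterization covers only moderate traffic loads; high-density scenes with many actors remain an active engineering target. Second, map switching requires a full process restart because the actor lifecycles of the two backends are independent; in-session reset is still planned. Third, configurations with more than two drones run but have not yet been formally validated across many scenarios. Future work includes finer-grained physics-state synchronization, a ROS 2 bridge that publishes both sets of streams in a unified way, and GPU-parallel multi-environment execution similar to Isaac Lab / OmniDrones.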
Overall conclusion
CARLA-Air's contribution is simulator infrastructure, not a new perception or control algorithm. With minimal source modification and an additive AirSim plugin integration, it puts CARLA's realistic urban world and AirSim's physics-accurate UAV flight into the same process, preserves both native APIs, and binds aerial-ground sensing, weather, physics, and actor state to a shared tick. The experimental results indicate that under joint workloads the design runs at about 20 FPS with stable VRAM and sufficiently low latency, enough to support air-ground embodied intelligence research such as precision landing, VLN/VLA data generation, multi-modal dataset collection, cross-view perception, and RL environments.