CARLA-Air: Fly Drones Inside a CARLA World — A Unified Infrastructure for Air-Ground Embodied Intelligence

Paper: arXiv:2603.28032 | Code: louiszengCN/CarlaAir | Code reference: main @ d70247b5 (2026-05-02)

1. Motivation

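Existing open-source simulators split their capabilities across separate domains. CARLA offers high-fidelity urban roads, traffic flow, pedestrians, and a mature Python API, but no physically consistent UAV dynamics. AirSim offers multirotor flight and aerial sensors, but lacks realistic urban traffic, pedestrian interaction, and large-scale ground scenes. Connecting the two simulators through ROS 2 or a custom bridge is feasible, but it introduces inter-process synchronization, serialization overhead, and two rendering pipelines. Most critically, it puts spatial-temporal consistency at risk: the aerial view, ground view, LiDAR, and IMU/GNSS readings that nominally belong to the same moment may not reflect the state of the same physics tick.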

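Figure 1 interpretation: the teaser shows that CARLA-Air's goal is not merely to make a drone appear in a CARLA map. It is to place air-ground simulation, multi-modal sensing, embodied navigation, asset adaptation, and multiple city scenes into one shared, physically coherent world. For the reader, the figure serves as the paper's requirements statement: future low-altitude economy, drone logistics, urban inspection, and air-ground cooperation applications all need aerial agents and ground agents to perceive, act, and be evaluated in the same world at the same time.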

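Figure 2 interpretation: the paper places CARLA-Air in the "high-fidelity + multi-domain" quadrant. CARLA and AirSim are each high-fidelity but single-domain, and joint platforms such as TranSimHub take the bridge/co-simulation route. CARLA-Air claims to keep the upstream capabilities while removing the bridge boundary, so that ground traffic, pedestrians, UAV flight, native APIs, and a shared renderer all hold at once.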

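Figure 3 interpretation: this figure explains why bridge-based co-simulation is not an ideal solution. As the number of concurrent sensors increases, the per-frame data transfer time of the bridge approach grows, whereas CARLA-Air stays at a near-constant millisecond-level cost thanks to its single-process design. The core difference is not network optimization but whether cross-process serialization and independent rendering pipelines exist at all.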

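The concrete problem the paper addresses is building a unified open-source infrastructure for air-ground embodied intelligence, in which drones, vehicles, pedestrians, traffic, weather, and sensor streams run in the same UE4 process, on the same physics tick, and through the same rendering pipeline, while the native CARLA and AirSim APIs are preserved. The problem is worth solving because once the simulation foundation is unified, cross-view perception datasets, air-ground cooperative RL, VLN/VLA data generation, precision landing, and multi-agent coordination can all be implemented in one reproducible environment, instead of every project maintaining its own bridge, clock sync, and data-alignment glue code.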

2. Idea

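Core insight: UE4 allows only one active GameMode, yet CARLA's key ground subsystems must be initialized through GameMode inheritance, whereas AirSim's core flight logic behaves more like a world actor that can be spawned afterwards. CARLA-Air therefore chooses "CARLA GameMode inheritance + AirSim actor composition" rather than bridging two GameModes as equals.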

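The key innovations reduce to three points. First, a unified game mode inherits CARLA's episode, weather, traffic, actor lifecycle, and RPC infrastructure. Second, after BeginPlay, AirSim's multirotor sim mode and flight pawn are composed into the same world as regular world actors. Third, the two native Python clients each connect to their own RPC server, but on the server side both share one UE4 world, tick, and renderer. Compared with multi-process bridges such as TranSimHub or AirSim+Gazebo, the fundamental difference lies in the system boundary: CARLA-Air does not synchronize state between processes; it lets two APIs observe the same world state inside one engine process.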

3. Method

3.1 Overall framework: single-process dual-backend runtime

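The overall framework hosts two backends in parallel inside a single UE4 process: the CARLA RPC server serves the ground client, and the AirSim RPC server serves the aerial client. Beneath them there are not two worlds but one set of shared actors, physics, and rendering pipeline managed by the CARLA-Air game mode. The CARLAAirGameMode of the paper corresponds to ASimWorldGameMode in the released code: it inherits ACarlaGameModeBase and, in BeginPlay(), initializes the AirSim settings, creates the SimMode, creates the AirSim widget and input bindings, and starts the AirSim API server.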

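Figure 4 interpretation: the top shows the two Python clients, which run without code changes; the middle shows the two independent RPC servers; the core is the blue box, where the ground subsystems are obtained through inheritance and the aerial flight actor is added through composition. The shared rendering pipeline at the bottom indicates that RGB, depth, segmentation, and weather effects come from the same rendering pass, so multi-view sensing can be aligned per tick.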

3.2 GameMode conflict: inheritance + composition

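UE4's single-GameMode constraint is the paper's most critical engineering bottleneck. The CARLA GameMode handles episode, weather, traffic, actor lifecycle, recorder, and RPC; the AirSim GameMode reads settings.json, configures the renderer, spawns the flight actor, and starts the aerial API. If a map selects the CARLA GameMode, AirSim initialization is missing; if it selects the AirSim GameMode, CARLA's episode, RPC, and actor dispatch are missing. The paper's solution is to have the unified game mode inherit from CARLA and then compose AirSim's flight actor into the world.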

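Figure 5 interpretation: on the left, the naive approach has two GameModes competing for one slot, so one is necessarily discarded. On the right, the CARLA-Air solution has the unified class occupy the slot and keep the CARLA lifecycle, then spawn the aerial flight actor along a dashed composition edge. This asymmetry is why the scheme works: CARLA cannot easily be moved out of the GameMode, whereas AirSim's flight logic can be initialized later as an actor.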
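The asymmetry can be illustrated with a toy model that is independent of UE4's actual API: a world with exactly one game-mode slot, a ground game mode whose subsystems only come up through that slot, and a flight mode that is an ordinary actor. Every class and method name below is hypothetical and chosen for illustration only; this is a sketch of the pattern, not the released implementation.

```python
class Actor:
    """Anything that can be spawned into the world after startup."""

    def __init__(self, name):
        self.name = name


class GroundGameMode:
    """Stands in for the CARLA side: subsystems start only via the game-mode slot."""

    def begin_play(self, world):
        world.subsystems += ["episode", "weather", "traffic"]


class FlightSimMode(Actor):
    """Stands in for the AirSim side: a regular actor that can be spawned late."""

    def __init__(self):
        super().__init__("flight_sim_mode")
        self.api_server_running = False

    def start_api_server(self):
        self.api_server_running = True


class UnifiedGameMode(GroundGameMode):
    """Inherits the ground lifecycle, then composes the flight actor into the world."""

    def begin_play(self, world):
        super().begin_play(world)            # inheritance: ground subsystems
        self.sim_mode = world.spawn(FlightSimMode())  # composition: flight actor
        self.sim_mode.start_api_server()


class World:
    """Toy world with a single game-mode slot."""

    def __init__(self, game_mode):
        self.game_mode = game_mode
        self.subsystems = []
        self.actors = []

    def spawn(self, actor):
        self.actors.append(actor)
        return actor

    def start(self):
        self.game_mode.begin_play(self)


world = World(UnifiedGameMode())
world.start()
print(world.subsystems, [a.name for a in world.actors])
```

Only one object ever occupies the game-mode slot, yet both the ground subsystems and the flight actor end up in the same world, which is the property the figure describes.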

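Differences between the paper and the released code: the paper body and appendix call the unified class CARLAAirGameMode, but the class in the released code at main@d70247b5 is actually named ASimWorldGameMode, in Unreal/CarlaUE4/Plugins/AirSim/Source/SimWorldGameMode.{h,cpp}. Functionally it does inherit ACarlaGameModeBase and bootstrap AirSim in BeginPlay(). A second difference concerns W3: the paper's experiment uses 30 vehicles + 10 pedestrians + 12 streams in synchronous mode, whereas the public example CarlaAir_Release/source/python_api/examples/data_collector.py defaults to --frames=100, 15 traffic vehicles, and clk.tick(20). The experimental numbers in this note therefore follow the paper's source tables, and the pseudocode only illustrates the implementation skeleton of the released example.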

3.3 Coordinate mapping: UE4/CARLA frame to AirSim NED

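CARLA inherits UE4's left-handed coordinate frame: X forward, Y right, Z up, in centimeters. AirSim uses the NED frame: X north, Y east, Z down, in meters. Because the X and Y directions of the two frames are aligned, the conversion needs only a scale and a Z flip. Given a UE4 point $\mathbf{p}^{\mathrm{UE}} = (x, y, z)$ and a shared origin $\mathbf{o}^{\mathrm{UE}} = (x_0, y_0, z_0)$:

$$\mathbf{p}^{\mathrm{NED}} = \frac{1}{100}\,\bigl(x - x_0,\; y - y_0,\; -(z - z_0)\bigr)$$

For a unit quaternion $q = (w, q_x, q_y, q_z)$, the paper gives:

$$q^{\mathrm{NED}} = (w,\; q_x,\; q_y,\; -q_z)$$

W4 additionally defines an origin offset between the two frames; if the drone spawns at the world origin, this offset is zero. The released docs COORDINATE_SYSTEMS.md explain the position transform as airsim_z = -carla_z + offset_z, while DroneFollower.update() in demo_drive_and_fly.py actually implements XY relative-offset following with target_z = -altitude. Note that the quaternion sign flip above is the paper's formula; the released docs also state that Euler pitch/yaw/roll agree on both sides and need no conversion, but the released examples/docs provide no general q_ned quaternion conversion helper.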

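Figure 6 interpretation: on the left, CARLA/UE4 is Z-up; on the right, AirSim is NED with Z down. Because the horizontal axes are aligned, the paper avoids a complicated axis permutation. What actually needs care is the unit change from centimeters to meters and the sign flip of the Z axis. This simplification directly lowers the implementation complexity of cross-view datasets and relative-reward computation.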
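The mapping described in this subsection (aligned horizontal axes, centimeter-to-meter scaling, Z sign flip, and the paper-level quaternion rule) can be sketched in a few lines. The names `Vec3`, `ue4_to_ned`, `quat_to_ned`, and the `units_per_meter` parameter are assumptions made for this illustration; none of them is a helper from the released code.

```python
from typing import NamedTuple, Tuple


class Vec3(NamedTuple):
    x: float
    y: float
    z: float


def ue4_to_ned(point: Vec3, origin: Vec3, units_per_meter: float = 100.0) -> Vec3:
    """Convert a Z-up UE4 point to Z-down NED meters relative to a shared origin.

    units_per_meter defaults to 100 because raw UE4 coordinates are in centimeters.
    """
    return Vec3(
        (point.x - origin.x) / units_per_meter,   # X stays aligned
        (point.y - origin.y) / units_per_meter,   # Y stays aligned
        -(point.z - origin.z) / units_per_meter,  # only the vertical axis flips
    )


def quat_to_ned(q: Tuple[float, float, float, float]) -> Tuple[float, float, float, float]:
    """Paper-level orientation rule: keep w, qx, qy and negate qz."""
    w, qx, qy, qz = q
    return (w, qx, qy, -qz)


# A point 15 m ahead, 2 m to the left, and 30 m above the shared origin.
p_ned = ue4_to_ned(Vec3(1500.0, -200.0, 3000.0), Vec3(0.0, 0.0, 0.0))
print(p_ned)
```

When positions come from the CARLA Python API, which the released docs describe as already reporting meters, `units_per_meter=1.0` would be the appropriate setting.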

3.4 Asset pipeline and workflow runtime

At the asset level, CARLA-Air supports importing custom robots, UAV configurations, vehicles, and environment maps and turning them into standard CARLA spawnable actors: they take part in the same physics tick and rendering pass, and are visible to both aerial and ground sensor modalities. At the workflow level, the five applications use the same dual-client pattern: one Python process creates carla.Client('localhost', 2000) and airsim.MultirotorClient(port=41451), then operates on the same world through both APIs.

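Figure 7 interpretation: the top shows an imported four-wheeled mobile robot, the bottom a custom electric sports car. The point is not appearance but the contract of the asset pipeline: after import, the assets are no longer private objects of one backend. They can be spawned through the standard CARLA API, observed by the same renderer and sensors, and share a world with AirSim aerial agents.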

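Figure 8 interpretation: all five workflows follow the dual-client architecture. One user script drives both the ground client and the aerial client; each client connects to its server over TCP, but beneath the servers sits the same unified UE4 process and shared world. As a result, the vehicle pose in W1, the 12-stream dataset in W3, the weather consistency in W4, and the reward in W5 can all use the same tick as their alignment key.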
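Using the tick as an alignment key amounts to grouping every sensor payload by the tick it was produced on and keeping only ticks for which all required streams are present. The sketch below shows that bookkeeping with plain Python data; the function names, stream names, and record layout are hypothetical and do not come from the released examples.

```python
from collections import defaultdict


def group_by_tick(records):
    """Group (tick, stream_name, payload) tuples into one dict of streams per tick."""
    by_tick = defaultdict(dict)
    for tick, stream, payload in records:
        by_tick[tick][stream] = payload
    return dict(by_tick)


def complete_ticks(by_tick, required_streams):
    """Return the sorted ticks for which every required stream delivered a payload."""
    required = set(required_streams)
    return sorted(t for t, streams in by_tick.items() if required.issubset(streams))


# Made-up records: tick 2 is missing its aerial frame.
records = [
    (1, "ground_rgb", "g1"),
    (1, "aerial_rgb", "a1"),
    (2, "ground_rgb", "g2"),
    (3, "aerial_rgb", "a3"),
    (3, "ground_rgb", "g3"),
]
by_tick = group_by_tick(records)
aligned = complete_ticks(by_tick, ["ground_rgb", "aerial_rgb"])
print(aligned)
```

Because the key is an integer tick rather than a wall-clock timestamp, no interpolation or tolerance window is involved in the join.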

3.5 Pseudocode: key components abstracted from the released code

# --- Unified game mode: Python-style pseudocode of the C++ class ASimWorldGameMode ---
class SimWorldGameMode(CarlaGameModeBase):
    def __init__(self):
        super().__init__()  # CARLA episode, recorder, factories, weather
        self.default_pawn_class = None
        self.actor_factories += [SensorFactory, StaticMeshFactory,
                                 TriggerFactory, VehicleFactory, WalkerFactory]
        self.sim_mode = None           # created in begin_play()
        self.drone_registered = False  # the drone pawn is registered only once

    def begin_play(self):
        super().begin_play()  # finish CARLA world / episode bootstrap
        self.spectator = self.spawn_and_register_spectator()
        self.initialize_airsim_settings()      # settings.json or default settings
        self.set_unreal_engine_settings()      # disable blur, enable custom depth
        self.sim_mode = self.create_sim_mode() # Multirotor / Car / CV
        self.widget = self.create_airsim_widget()
        self.setup_airsim_input_bindings()
        self.sim_mode.start_api_server()

    def tick(self, dt):
        super().tick(dt)
        if self.sim_mode is None:
            return  # AirSim side not created yet
        if not self.drone_registered:
            # expose the AirSim pawn to CARLA's actor dispatcher
            pawn = self.sim_mode.getVehicleSimApi().getPawn()
            self.carla_episode.ActorDispatcher.RegisterActor(
                pawn, type_id="airsim.drone")
            self.drone_registered = True
        if self.sim_mode.EnableReport:
            self.widget.updateDebugReport(self.sim_mode.getDebugReport())


# --- Coordinate mapping ---
def carla_location_to_airsim_ned(carla_location_m, offsets_m):
    """Map a CARLA location (meters, Z up) to AirSim NED (meters, Z down)."""
    # Released docs: the CARLA Python API already reports meters;
    # AirSim NED keeps the X/Y axes and flips Z with a calibrated offset.
    ax = carla_location_m.x + offsets_m.x
    ay = carla_location_m.y + offsets_m.y
    az = -carla_location_m.z + offsets_m.z
    return ax, ay, az


# Paper-only orientation formula, not a released helper in main@d70247b5:
# q_ned = (w, qx, qy, -qz)


# --- Dual-client workflow ---
import airsim
import carla


def dual_client_workflow():
    ground = carla.Client("localhost", 2000)
    ground.set_timeout(10.0)
    world = ground.get_world()
    aerial = airsim.MultirotorClient(port=41451)
    aerial.confirmConnection()
    aerial.enableApiControl(True)
    aerial.armDisarm(True)
    aerial.takeoffAsync().join()

    # both APIs operate on the same UE4 world state
    vehicle_blueprint = world.get_blueprint_library().filter("vehicle.*")[0]
    spawn_point = world.get_map().get_spawn_points()[0]
    vehicle = world.spawn_actor(vehicle_blueprint, spawn_point)
    drone_state = aerial.getMultirotorState()
    rgb = aerial.simGetImages([airsim.ImageRequest("0", airsim.ImageType.Scene)])
    return vehicle, drone_state, rgb


# --- Air-ground record collection ---
def collect_air_ground_record(ego_vehicle, airsim_client, tick_id, save_dir):
    # The public data_collector.py attaches RGB/depth/seg/LiDAR/IMU/GNSS to the ego vehicle.
    # read_latest_ground_sensor_callbacks, serialize_transform and
    # write_np_arrays_and_json are placeholder helpers, not released functions.
    ground_packet = read_latest_ground_sensor_callbacks()
    aerial_rgb = airsim_client.simGetImages([
        airsim.ImageRequest("0", airsim.ImageType.Scene, False, False)
    ])[0]
    ego_tf = ego_vehicle.get_transform()
    metadata = {
        "frame": tick_id,
        "ego_pose": serialize_transform(ego_tf),
        "shared_tick_key": tick_id,  # same key for ground and aerial data
    }
    write_np_arrays_and_json(save_dir, tick_id, ground_packet, aerial_rgb, metadata)

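W5 / RL environment: in the paper this is a demonstration of platform capability. The released code at main@d70247b5 provides no directly corresponding Gym/RLlib environment, reward function, or training script, so no code-grounded pseudocode is given here. This note records the observation/action/reward concepts from the paper only under the experimental results and marks them explicitly as not implemented in code.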

Code reference: main @ d70247b5 (2026-05-02); the pseudocode and the mapping below are based on this commit.

| Paper Concept | Source File | Key Class/Function |
| --- | --- | --- |
| Unified GameMode / single-process integration | Unreal/CarlaUE4/Plugins/AirSim/Source/SimWorldGameMode.{h,cpp} | ASimWorldGameMode : ACarlaGameModeBase, BeginPlay(), Tick(), CreateSimMode() |
| One-direction AirSim → CARLA dependency | Unreal/CarlaUE4/Plugins/AirSim/Source/AirSim.Build.cs | PrivateDependencyModuleNames includes Carla; ground code does not import AirSim |
| CARLA actor/episode access for integrated actor registration | Unreal/CarlaUE4/Plugins/Carla/Source/Carla/Game/CarlaEpisode.h | friend class ASimWorldGameMode, ActorDispatcher access |
| AirSim settings and API server bootstrap | SimWorldGameMode.cpp, CarlaAir_Release/source/config/settings.json | InitializeAirSimSettings(), SimMode_->startApiServer(), SimMode=Multirotor, cameras at 1280×960 |
| Coordinate mapping / drone following | CarlaAir_Release/guide/COORDINATE_SYSTEMS.md, CarlaAir_Release/source/python_api/examples/demo_drive_and_fly.py | airsim_z=-carla_z+offset_z, DroneFollower.update() |
| Dual-client API pattern | CarlaAir_Release/guide/examples/07_combined_demo.py, README.md | carla.Client('localhost', 2000), airsim.MultirotorClient(port=41451) |
| Multi-modal dataset example | CarlaAir_Release/source/python_api/examples/data_collector.py | spawns RGB/depth/seg/LiDAR/IMU/GNSS sensors and calls simGetImages() |
| Public release / quick start | CarlaAir_Release/guide/Quick-Start.md | ports, synchronous mode snippet, native API usage |
| W5 RL environment concept | README.md, paper Figure 14 only | Not implemented as a Gym/RLlib env/reward/training script in the released code; paper-level capability demo |

3.6 Intuition: why a single process matters more than a bridge

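The paper's focus is not a new controller but moving the "alignment boundary" between controller, sensor, renderer, and physics inside the engine. A bridge treats the two simulators as black boxes and can only exchange state at the process boundary; as the number of sensors grows, serialization and clock drift both become systematic errors. CARLA-Air instead lets the renderer read aerial state and ground state on the same tick, so the pairs used in cross-view perception, the relative pose in an RL reward, and the aerial oracle view in VLN/VLA all share one scene snapshot by construction. The value of this design is that each downstream task no longer has to fix clocks, coordinates, and weather inconsistencies on its own.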

4. Experimental Setup

Datasets / workloads and scale

The paper does not propose a static dataset; it validates a set of simulator workloads. W1 (precision landing) has a drone descend from its initial altitude onto a moving vehicle in Town10HD. W3 (multi-modal dataset collection) runs 1,000 ticks with 30 autopilot vehicles, 10 pedestrians, 8 ground sensors, and 4 aerial sensors, yielding 1,000 12-stream records. W4 (cross-view perception) runs 500 ticks, covers the 14 official weather presets, and shows multi-weather aerial views of Town01-05 and Town10HD in Figure 13. W5 validates the RL training environment; its core evidence is 0 crashes and 0 API errors across 357 actor spawn/destroy reset cycles.

Baselines / compared systems

The paper compares against three classes of baselines: single-domain simulators (CARLA, LGSVL, SUMO, MetaDrive, VISTA; AirSim, Flightmare, FlightGoggles, Gazebo/RotorS, OmniDrones, gym-pybullet-drones), joint/co-simulation approaches (TranSimHub, CARLA+SUMO, AirSim+Gazebo, the community AirSim+CARLA bridge), and embodied/RL platforms (Isaac Lab, Isaac Gym, Habitat, SAPIEN, RoboSuite). The performance experiments additionally cover the following profiles: standalone ground sim only, standalone aerial sim only, joint idle, ground only, moderate joint, traffic surveillance, and stability endurance.

Metrics

The main metrics are harmonic mean FPS (appropriate for a rate quantity), VRAM in MiB, CPU utilization, API round-trip latency (median/IQR), sensor alignment deviation, RPC error count, crash count, landing horizontal error, records collected, and weather presets passed. The paper also reports the bridge's per-frame IPC transfer time to show that cross-process serialization overhead grows near-linearly with sensor count.
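Two of these metrics are easy to get subtly wrong, so a small sketch may help. The harmonic mean of per-frame FPS equals the number of frames divided by the total elapsed time, which is why it suits a rate; the median and IQR summarize latency without being dominated by outliers. The function names are illustrative, and the quartile convention (the default of Python's `statistics.quantiles`) is an assumption, since the note does not say which convention the paper uses.

```python
import statistics


def harmonic_mean_fps(frame_times_s):
    """Harmonic mean of per-frame FPS; equals frame count / total elapsed time."""
    fps_samples = [1.0 / dt for dt in frame_times_s]
    return statistics.harmonic_mean(fps_samples)


def median_and_iqr(samples):
    """Median and interquartile range (Q3 - Q1) of a latency sample."""
    q1, q2, q3 = statistics.quantiles(samples, n=4)
    return q2, q3 - q1


# Made-up frame times: two fast frames and one slow frame.
frame_times = [0.05, 0.05, 0.1]
hm = harmonic_mean_fps(frame_times)
arithmetic = sum(1.0 / dt for dt in frame_times) / len(frame_times)
print(hm, arithmetic)
```

For the toy input the arithmetic mean overstates throughput because it weights the slow frame the same as the fast ones, while the harmonic mean matches 3 frames over 0.2 s.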

System / benchmark config

The hardware and software stack comes from the paper's source performance.tex and the appendix: Ubuntu 20.04/22.04 LTS, NVIDIA RTX A4000 16 GB GDDR6, AMD Ryzen 7 5800X (8 cores, 4.7 GHz), 32 GB DDR4-3200; CARLA 0.9.16, AirSim 1.8.1, UE 4.26, Python 3.8+. The default map is Town10HD at Epic quality in synchronous mode; aerial experiments use the built-in SimpleFlight controller with default PID gains. The benchmark harness runs warm-up ticks followed by measurement ticks; VRAM is sampled every 60 s; the latency benchmark uses 500 warm-up calls + 5,000 measurement calls. The paper trains no neural policy, so there is no learning rate, batch size, or step count to report; W5 is an RL environment capability, for which the authors give only the observation/action/reward form and stability evidence.
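The warm-up-then-measure structure of the latency benchmark can be sketched as follows. The defaults mirror the 500 warm-up and 5,000 measurement calls stated above; the function name `benchmark_call` and the use of `time.perf_counter` are assumptions for illustration, not the paper's harness.

```python
import time


def benchmark_call(fn, warmup_calls=500, measured_calls=5000):
    """Time fn() after a warm-up phase; returns round-trip times in microseconds."""
    for _ in range(warmup_calls):
        fn()  # warm caches and connections; results are discarded
    samples_us = []
    for _ in range(measured_calls):
        start = time.perf_counter()
        fn()
        samples_us.append((time.perf_counter() - start) * 1e6)
    return samples_us


# Stand-in workload; in practice fn would wrap a single API round trip.
samples = benchmark_call(lambda: sum(range(100)), warmup_calls=5, measured_calls=20)
print(len(samples))
```

Discarding the warm-up calls keeps one-off costs such as connection setup out of the reported distribution.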

5. Experimental Results

Main performance numbers

| Profile | Configuration | FPS | VRAM (MiB) | CPU (%) |
| --- | --- | --- | --- | --- |
| Ground sim only | 3 vehicles + 2 pedestrians; 8 sensors @ | | | |
| Aerial sim only | 1 drone; 8 sensors @ | | | |
| Idle | Town10HD; no actors; no sensors | | | |
| Ground only | 3 vehicles + 2 pedestrians; 8 sensors @ | | | |
| Moderate joint | 3 vehicles + 2 pedestrians + 1 drone; 8 sensors @ | | | |
| Traffic surveillance | 8 autopilot vehicles + 1 drone; 1 aerial RGB @ | | | |
| Stability endurance | Moderate joint; 357 spawn/destroy cycles; 3 hr continuous | | | |

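The paper considers the moderate joint profile's frame rate sufficient for closed-loop evaluation at standard RL episode lengths. Going from the standalone ground baseline to the moderate joint profile costs 30.3% of the FPS, which the paper attributes partly to ground co-hosting and partly to the aerial physics engine. VRAM is only 39 MiB higher than in the ground-only profile, so the overhead shows up mainly on the CPU.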

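Figure 9 interpretation: the 3-hour endurance run contains 357 spawn/destroy cycles. Mean VRAM drifts by about 10 MiB between the early and late cycles, and the paper fits a regression slope in MiB/cycle to quantify the trend. This supports the authors' conclusion that there is no significant memory leak and that the residual growth looks more like render-target caching.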
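A per-cycle VRAM slope of this kind can be obtained with an ordinary least-squares fit of VRAM against cycle index. The sketch below assumes that this is the fitting procedure, which the note does not confirm; the function name and the toy data are made up and deliberately unrelated to the paper's measurements.

```python
def least_squares_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy data only: memory grows by exactly 2 units per cycle.
cycles = [1, 2, 3, 4, 5]
memory = [102.0, 104.0, 106.0, 108.0, 110.0]
slope = least_squares_slope(cycles, memory)
print(slope)
```

A slope near zero over hundreds of cycles is what one would expect from caching that saturates, whereas a leak would show a persistent positive slope.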

| Stability Metric | Early cycles 1-30 | Late cycles 328-357 |
| --- | --- | --- |
| Frame rate (FPS) | | |
| VRAM (MiB) | | |
| CPU utilization (%) | | |
| API error count | 0 | 0 |
| Crash count | 0 | 0 |
| VRAM regression slope (MiB/cycle) | | |

| API | Call | Median (µs) | IQR (µs) |
| --- | --- | --- | --- |
| Ground sim | World state snapshot | 320 | 40 |
| Ground sim | Actor transform query | 280 | 35 |
| Ground sim | Actor spawn + paired destroy | 1,850 | 210 |
| Ground sim | Actor destroy | 920 | 95 |
| Aerial sim | Multirotor state query | 410 | 55 |
| Aerial sim | Image capture, 1 RGB stream | 3,200 | 380 |
| Aerial sim | Velocity command dispatch | 490 | 60 |
| Bridge IPC reference | Cross-process state sync | 3,000 | 2,000 |

Representative workflow results

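Figure 10 interpretation: W1 shows the drone's precision landing on a moving vehicle. The time-lapse on the left shows approach, descent, and touchdown; the trajectory plot and the altitude/error curves on the right show the drone descending smoothly from its initial altitude while the horizontal error converges into the tolerance band. This workflow tests whether the coordinate transform and tick-synchronous control hold up in actual air-ground cooperation.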

| W1 Metric | Value | Notes |
| --- | --- | --- |
| Mean FPS | 19.3 | harmonic mean |
| Initial altitude (m) | | start of descent |
| Landing duration (s) | | approach to touchdown |
| Final horizontal error (m) | | within tolerance band |
| Initial horizontal error (m) | | at descent start |
| RPC errors | 0 | both clients |
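The horizontal error in this table is the planar distance between drone and vehicle once both positions are expressed in the same frame. A minimal sketch, with hypothetical function names and a tolerance passed in by the caller because the paper's tolerance value is not available in this note:

```python
import math


def horizontal_error_m(drone_xy, vehicle_xy):
    """Planar distance in meters between drone and vehicle (same frame, same units)."""
    return math.hypot(drone_xy[0] - vehicle_xy[0], drone_xy[1] - vehicle_xy[1])


def within_tolerance(drone_xy, vehicle_xy, tolerance_m):
    """True when the horizontal error lies inside the caller-supplied tolerance band."""
    return horizontal_error_m(drone_xy, vehicle_xy) <= tolerance_m


# Made-up positions: the drone is 3 m ahead and 4 m to the side of the vehicle.
err = horizontal_error_m((3.0, 4.0), (0.0, 0.0))
print(err, within_tolerance((3.0, 4.0), (0.0, 0.0), tolerance_m=1.0))
```

Altitude is deliberately excluded, so the metric is unaffected by the Z sign difference between the CARLA and AirSim frames as long as X and Y are in the same units.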

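Figure 11 interpretation: W2 is a qualitative demo of embodied navigation and VLN/VLA data generation. The drone tracks a pedestrian from bird's-eye observations, and every frame in the figure comes with chain-of-thought-style reasoning. The paper's emphasis is not the performance of any particular language model but that CARLA-Air can provide an aerial overview, street-level detail, semantic annotations, and shared weather/lighting at the same time, for building cross-view instruction-following data.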

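Figure 12 interpretation: W3 captures multi-modal sensor outputs from the vehicle perspective and the drone perspective on the same simulation tick. The top and bottom rows each contain modalities such as RGB, semantic/depth, and LiDAR/geometry. The key point is that all streams use the same tick index rather than post-hoc timestamp interpolation.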

| W3 Metric | Value | Notes |
| --- | --- | --- |
| Mean FPS | 17.1 | harmonic mean |
| Concurrent streams | 12 | 8 ground + 4 aerial |
| Records collected | 1,000 | one per tick |
| Max alignment deviation (ticks) | | under disk-write load |
| RPC errors | 0 | both clients |
| Per-tick write latency (ms) | | incl. serialization |

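Figure 13 interpretation: W4 shows consistently rendered aerial RGB views across Town01-05 and Town10HD under multiple weather presets. The paper uses the 14/14 weather presets passed to argue that the shared renderer propagates weather and lighting to ground and aerial sensors in sync, and it gives a corresponding formula.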

| W4 Metric | Value | Notes |
| --- | --- | --- |
| Mean FPS | 18.2 | harmonic mean |
| Co-registered pairs | 500 | aerial depth + ground segmentation |
| Per-tick latency (ms) | | full collection loop |
| Sensor alignment | 0 ticks | sync mode guarantee |
| Weather presets passed | 14/14 | all official presets |
| RPC errors | 0 | both clients |

Figure 14 interpretation: W5 wraps CARLA-Air as an RL environment. Every synchronous tick produces an observation (drone pose, vehicle pose, relative distance, traffic state), the policy outputs 3D velocity commands, and the reward encodes tracking accuracy, altitude maintenance, and collision avoidance. The paper reports no reward curve for any specific policy; it uses stable resets and the shared world state to argue that the environment is suitable for subsequent RL training.
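Since the released code contains no reward function, the following is purely a hypothetical sketch of how the three reward terms named above (tracking accuracy, altitude maintenance, collision avoidance) could be combined. The `Observation` layout, the target altitude, and every weight are assumptions invented for illustration and should not be read as the paper's formulation.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Observation:
    drone_pos: Tuple[float, float, float]    # meters, Z up, shared world frame
    vehicle_pos: Tuple[float, float, float]  # meters, Z up, shared world frame
    collided: bool


def tracking_reward(obs, target_altitude_m=10.0, w_track=1.0, w_alt=0.5,
                    collision_penalty=100.0):
    """Hypothetical reward: penalize planar distance, altitude error, and collisions."""
    dx = obs.drone_pos[0] - obs.vehicle_pos[0]
    dy = obs.drone_pos[1] - obs.vehicle_pos[1]
    horizontal = math.hypot(dx, dy)
    height_above_vehicle = obs.drone_pos[2] - obs.vehicle_pos[2]
    altitude_error = abs(height_above_vehicle - target_altitude_m)
    reward = -w_track * horizontal - w_alt * altitude_error
    if obs.collided:
        reward -= collision_penalty
    return reward


# Made-up state: drone 5 m off horizontally, exactly at the target height.
obs = Observation(drone_pos=(3.0, 4.0, 10.0), vehicle_pos=(0.0, 0.0, 0.0), collided=False)
print(tracking_reward(obs))
```

Because both poses come from the same tick of the shared world, the relative terms need no interpolation between simulator clocks, which is the property the paper's W5 argument relies on.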

Ablation / key findings

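Strictly speaking, the paper has no component ablation table; its evidence for component effectiveness comes mainly from system comparisons and workload breakdowns. The key findings are: the single-process design keeps the IPC cost essentially flat as sensor count grows; the overhead of the moderate joint profile is mainly CPU-bound aerial physics rather than VRAM; the traffic surveillance profile raises the vehicle count to 8 and still sustains its frame rate, which suggests that sensor rendering dominates over actor population; and 0 crashes / 0 API errors over 357 reset cycles support the stability of RL-style episode resets.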

Limitations

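The authors state three limitations explicitly. First, the current performance characterization covers only moderate traffic loads; high-density scenes with many actors remain an active engineering target. Second, map switching requires a full process restart because the actor lifecycles of the two backends are independent; in-session reset is still planned. Third, configurations with more than two drones run but have not been formally validated across a large number of scenarios. Future work includes finer-grained physics-state synchronization, a ROS 2 bridge that publishes both sets of streams in a unified way, and GPU-parallel multi-environment execution similar to Isaac Lab / OmniDrones.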

Overall conclusion

CARLA-Air's contribution is a simulator infrastructure, not a new perception or control algorithm. With minimal source modification and an additive AirSim plugin integration, it places CARLA's realistic urban world and AirSim's physics-accurate UAV flight in the same process, keeps both native APIs, and binds aerial-ground sensing, weather, physics, and actor state to a shared tick. The experiments indicate that under joint workloads the design runs at about 20 FPS with stable VRAM and sufficiently low latency, enough to support research on air-ground embodied intelligence such as precision landing, VLN/VLA data generation, multi-modal dataset collection, cross-view perception, and RL environments.