CARLA-Air: Fly Drones Inside a CARLA World — A Unified Infrastructure for Air-Ground Embodied Intelligence

Paper: arXiv:2603.28032 Code: louiszengCN/CarlaAir Code reference: main @ d70247b5 (2026-05-02)

1. Motivation

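Existing open-source simulators split capabilities across separate domains. CARLA offers high-fidelity urban roads, traffic flow, pedestrians, and a mature Python API, but no physically consistent UAV dynamics. AirSim offers multirotor flight and aerial sensors, but lacks realistic urban traffic, pedestrian interaction, and large-scale ground scenes. Connecting the two simulators through ROS 2 or a custom bridge is feasible, but it introduces inter-process synchronization, serialization overhead, and dual rendering pipelines. Most critically, it puts spatial-temporal consistency at risk: the aerial view, ground view, LiDAR, and IMU/GNSS readings captured at the same moment may not reflect the state of the same physics tick.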

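Figure 1 interpretation: the teaser shows that the goal of CARLA-Air is not merely to "make a drone appear in a CARLA map" but to place air-ground simulation, multi-modal sensing, embodied navigation, asset adaptation, and multi-city scenes in one shared, physically coherent world. For the reader, the figure corresponds to the paper's requirements definition: the future low-altitude economy, drone logistics, urban inspection, and air-ground cooperation all need aerial agents and ground agents to perceive, act, and be evaluated simultaneously in the same world.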

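Figure 2 interpretation: the paper positions CARLA-Air in the "high-fidelity + multi-domain" quadrant. CARLA and AirSim are each high-fidelity but single-domain, and joint platforms such as TranSimHub take the bridge/co-simulation route. CARLA-Air claims to retain the upstream capabilities while removing the bridge boundary, so that ground traffic, pedestrians, UAV flight, native APIs, and a shared renderer all hold at the same time.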

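Figure 3 interpretation: this figure explains why bridge-based co-simulation is not an ideal solution. As the number of concurrent sensors grows from $1$ to $16$, the per-frame data transfer time of the bridge approach rises from about $1.3$ ms to $20.4$ ms, while CARLA-Air stays at $0.32$ to $0.45$ ms thanks to its single-process design. The gap comes not from network optimization but from whether cross-process serialization and separate rendering pipelines exist at all.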
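The Figure 3 numbers can be put in perspective with a short back-of-the-envelope calculation. This is a minimal sketch: it assumes roughly linear growth between the two reported endpoints, and the 20 FPS target and the function names are illustrative assumptions, not values or code from the paper.

```python
def per_sensor_slope_ms(t_first_ms, t_last_ms, n_first, n_last):
    """Average added transfer time per extra sensor, assuming linear growth."""
    return (t_last_ms - t_first_ms) / (n_last - n_first)


def budget_fraction(transfer_ms, target_fps):
    """Share of one frame's time budget consumed by data transfer."""
    frame_budget_ms = 1000.0 / target_fps
    return transfer_ms / frame_budget_ms


# Endpoints reported for the bridge-based setup (1 and 16 concurrent sensors).
bridge_slope = per_sensor_slope_ms(1.3, 20.4, 1, 16)

# Share of a 50 ms frame (20 FPS, an assumed target) spent on transfer at 16 sensors.
bridge_share = budget_fraction(20.4, 20.0)
single_process_share = budget_fraction(0.45, 20.0)

print(f"bridge: +{bridge_slope:.2f} ms per extra sensor")
print(f"bridge share of frame budget: {bridge_share:.1%}")
print(f"single-process share of frame budget: {single_process_share:.1%}")
```

Under these assumptions the bridge spends a large fraction of each frame moving data at 16 sensors, while the single-process figure stays below one percent of the same budget.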

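The concrete problem this paper addresses is to build a unified open-source infrastructure for air-ground embodied intelligence in which drones, vehicles, pedestrians, traffic, weather, and sensor streams run in the same UE4 process, on the same physics tick, and through the same rendering pipeline, while the native CARLA and AirSim APIs are preserved. The problem is worth solving because, once the simulation foundation is unified, cross-view perception datasets, air-ground cooperative RL, VLN/VLA data generation, precision landing, and multi-agent coordination can all be implemented in one reproducible environment, instead of every project maintaining its own bridge, clock sync, and data-alignment glue code.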

2. Idea

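Core insight: UE4 allows only one active GameMode, but CARLA's key ground subsystems must be initialized through GameMode inheritance, whereas AirSim's core flight logic behaves more like a world actor that can be spawned afterwards. CARLA-Air therefore chooses "CARLA GameMode inheritance + AirSim actor composition" rather than bridging two GameModes as equals.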

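The key innovations reduce to three points. First, a unified game mode inherits CARLA's episode, weather, traffic, actor lifecycle, and RPC infrastructure. Second, after BeginPlay, AirSim's multirotor sim mode and flight pawn are composed into the same world as regular world actors. Third, the two native Python clients each connect to their own RPC server, but on the server side they share the same UE4 world, tick, and renderer. Compared with multi-process bridges such as TranSimHub or AirSim+Gazebo, the fundamental difference lies in the system boundary: CARLA-Air does not synchronize state between processes; it lets two APIs observe the same world state inside a single engine process.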

3. Method

3.1 Overall framework: single-process dual-backend runtime

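The overall framework of CARLA-Air hosts two backends in parallel inside a single UE4 process: the CARLA RPC server serves the ground client, and the AirSim RPC server serves the aerial client. Beneath them there are not two worlds but the shared actors, physics, and rendering pipeline managed by one CARLA-Air Game Mode. The CARLAAirGameMode of the paper corresponds to ASimWorldGameMode in the released code: it inherits from ACarlaGameModeBase and, in BeginPlay(), initializes the AirSim settings, creates the SimMode, creates the AirSim widget and input bindings, and starts the AirSim API server.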

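Figure 4 interpretation: the top part shows the two Python clients, which need no code changes; the middle shows the two independent RPC servers; the core is the blue box, where the ground subsystems are obtained through inheritance and the aerial flight actor is added through composition. The shared rendering pipeline at the bottom indicates that RGB, depth, segmentation, and weather effects are produced by the same rendering pass, so multi-view sensing can be aligned by tick.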

3.2 GameMode conflict: inheritance + composition

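UE4's single-GameMode constraint is the key engineering bottleneck of this work. The CARLA GameMode is responsible for episode, weather, traffic, actor lifecycle, recorder, and RPC; the AirSim GameMode is responsible for reading settings.json, configuring the renderer, spawning the flight actor, and starting the aerial API. If the map selects the CARLA GameMode, AirSim initialization is missing; if it selects the AirSim GameMode, CARLA's episode, RPC, and actor dispatch are missing. The paper's solution is to have the unified game mode inherit from CARLA and then compose AirSim's flight actor into the world.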

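Figure 5 interpretation: the naive approach on the left shows two GameModes competing for one slot, so one of them is necessarily discarded. The CARLA-Air solution on the right shows the unified class occupying the slot and keeping the CARLA lifecycle, then spawning the aerial flight actor through a dashed composition edge. This asymmetry is why the scheme works: CARLA cannot easily be moved out of the GameMode, whereas AirSim's flight logic can be initialized afterwards as an actor.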

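Differences between the paper and the released code: the main text and appendix of the paper call the unified class CARLAAirGameMode, but the class in the released code at main@d70247b5 is actually named ASimWorldGameMode, in Unreal/CarlaUE4/Plugins/AirSim/Source/SimWorldGameMode.{h,cpp}; functionally it does inherit from ACarlaGameModeBase and bootstrap AirSim in BeginPlay(). A second difference concerns W3: the paper's experiment uses 30 vehicles + 10 pedestrians + 12 streams + synchronous mode, whereas the public example CarlaAir_Release/source/python_api/examples/data_collector.py defaults to --frames=100, 15 traffic vehicles, and clk.tick(20). The experimental numbers in these notes therefore follow the paper's source tables, and the pseudocode only illustrates the implementation skeleton of the released example.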

3.3 Coordinate mapping: UE4/CARLA frame to AirSim NED

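CARLA inherits UE4's left-handed coordinate system: $X$ forward, $Y$ right, $Z$ up, in centimeters. AirSim uses the NED frame: $X$ north, $Y$ east, $Z$ down, in meters. Because the $X / Y$ directions of the two frames are aligned, the conversion only needs a scale and a $Z$ flip. Given a UE4 point $p$ and a shared origin $o$: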

p_{NED} = \frac{1}{100} \begin{pmatrix} p_{x} - o_{x} \\ p_{y} - o_{y} \\ -(p_{z} - o_{z}) \end{pmatrix} .

For a unit quaternion $q = (w, q_{x}, q_{y}, q_{z})$, the paper gives:

q_{NED} = (w, q_{x}, q_{y}, - q_{z}) .

W4 additionally defines an origin offset:

d = T (p_{spawn}^{world}) - p_{spawn}^{NED},

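If the drone spawns at the world origin, then $d = 0$. The released doc COORDINATE_SYSTEMS.md explains the position transform as airsim_z = -carla_z + offset_z, while DroneFollower.update() in demo_drive_and_fly.py actually implements following with a relative XY offset and target_z=-altitude. Note that the quaternion sign flip above is the paper's formula; the released docs additionally state that Euler pitch/yaw/roll are identical on both sides and need no conversion, but the released examples and docs provide no general-purpose q_ned quaternion conversion helper.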
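The three formulas of this subsection can be written down directly. The following is a minimal sketch of the paper-level mapping (UE4 centimeters in, NED meters out); the function names are hypothetical and it is not a helper from the released code, which, as noted, ships no quaternion conversion.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # (w, x, y, z)


def ue4_to_ned(p_cm: Vec3, origin_cm: Vec3) -> Vec3:
    """UE4 centimeters (Z up) -> NED meters (Z down); X/Y axes are kept."""
    px, py, pz = p_cm
    ox, oy, oz = origin_cm
    return ((px - ox) / 100.0, (py - oy) / 100.0, -(pz - oz) / 100.0)


def quat_to_ned(q: Quat) -> Quat:
    """Paper formula: keep w, qx, qy and negate qz."""
    w, qx, qy, qz = q
    return (w, qx, qy, -qz)


def origin_offset(spawn_world_cm: Vec3, origin_cm: Vec3, spawn_ned_m: Vec3) -> Vec3:
    """d = T(p_spawn_world) - p_spawn_NED, with T the position transform above."""
    transformed = ue4_to_ned(spawn_world_cm, origin_cm)
    return tuple(a - b for a, b in zip(transformed, spawn_ned_m))


# A point 5 m ahead, 2 m to the right and 12 m above the shared origin.
print(ue4_to_ned((500.0, 200.0, 1200.0), (0.0, 0.0, 0.0)))
```

A point 12 m above the origin ends up with a negative NED Z component, which is the sign flip the text warns about.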

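Figure 6 interpretation: in the figure, CARLA/UE4 on the left is z-up and AirSim on the right is NED, z-down. Because the horizontal axes are aligned, the paper avoids a complicated axis permutation; what really needs care is the unit change from centimeters to meters and the sign flip of the $Z$ axis. This simplification directly lowers the implementation complexity of cross-view datasets and relative-reward computation.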

3.4 Asset pipeline and workflow runtime

At the asset level, CARLA-Air supports importing custom robots, UAV configurations, vehicles, and environment maps and turns them into standard CARLA spawnable actors: they take part in the same physics tick and rendering pass and are visible to both aerial and ground sensor modalities. At the workflow level, the five applications share one dual-client pattern: a single Python process creates carla.Client('localhost', 2000) and airsim.MultirotorClient(port=41451), then operates on the same world through the two APIs.

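Figure 7 interpretation: the top shows an imported four-wheeled mobile robot and the bottom a custom electric sports car. The point is not their appearance but the contract of the asset pipeline: once imported, they are no longer private objects of one backend; they can be spawned through the standard CARLA API, observed by the same renderer and sensors, and share one world with AirSim aerial agents.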

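Figure 8 interpretation: all five workflows follow the dual-client architecture. One user script drives both the ground client and the aerial client; each client connects to its server over TCP, but beneath the servers lies the same Unified UE4 Process and Shared World. As a result, the vehicle pose in W1, the 12-stream dataset in W3, the weather consistency in W4, and the reward in W5 can all use the same tick as the alignment key.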
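Using the tick as the alignment key can be illustrated without a running simulator. The sketch below is a hypothetical helper (the class name and stream names are assumptions, not part of the released code): sensor callbacks would push packets tagged with the tick index, and a record is released only once every expected stream has reported for that tick.

```python
from collections import defaultdict


class TickAligner:
    """Collects per-stream packets and releases a record once every
    expected stream has reported for the same tick."""

    def __init__(self, stream_names):
        self.expected = frozenset(stream_names)
        self.pending = defaultdict(dict)  # tick -> {stream: payload}

    def push(self, tick, stream, payload):
        """Store one packet; return the complete record or None."""
        if stream not in self.expected:
            raise KeyError(f"unknown stream: {stream}")
        self.pending[tick][stream] = payload
        if self.expected.issubset(self.pending[tick]):
            return {"tick": tick, **self.pending.pop(tick)}
        return None


aligner = TickAligner(["ground_rgb", "aerial_rgb"])
first = aligner.push(7, "ground_rgb", "g7")
second = aligner.push(7, "aerial_rgb", "a7")
print(first, second)
```

Because the join is on the tick index rather than on wall-clock timestamps, no interpolation step is needed once both streams have arrived.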

3.5 Pseudocode: key components abstracted from the released code

class SimWorldGameMode(CarlaGameModeBase):
    def __init__(self):
        super().__init__()  # CARLA episode, recorder, factories, weather
        self.default_pawn_class = None
        self.actor_factories += [SensorFactory, StaticMeshFactory,
                                 TriggerFactory, VehicleFactory, WalkerFactory]
 
    def begin_play(self):
        super().begin_play()  # finish CARLA world / episode bootstrap
        self.spectator = self.spawn_and_register_spectator()
        self.initialize_airsim_settings()      # settings.json or default settings
        self.set_unreal_engine_settings()      # disable blur, enable custom depth
        self.sim_mode = self.create_sim_mode() # Multirotor / Car / CV
        self.widget = self.create_airsim_widget()
        self.setup_airsim_input_bindings()
        self.sim_mode.start_api_server()
 
    def tick(self, dt):
        super().tick(dt)
        if not self.drone_registered and self.sim_mode is not None:
            pawn = self.sim_mode.getVehicleSimApi().getPawn()
            self.carla_episode.ActorDispatcher.RegisterActor(
                pawn, type_id="airsim.drone")
            self.drone_registered = True
        # guard against a missing sim mode before reading its report flag
        if self.sim_mode is not None and self.sim_mode.EnableReport:
            self.widget.updateDebugReport(self.sim_mode.getDebugReport())

def carla_location_to_airsim_ned(carla_location_m, offsets_m):
    # released docs: CARLA Python API already reports meters;
    # AirSim NED keeps X/Y axes and flips Z with a calibrated offset.
    ax = carla_location_m.x + offsets_m.x
    ay = carla_location_m.y + offsets_m.y
    az = -carla_location_m.z + offsets_m.z
    return ax, ay, az
 
# paper-only orientation formula, not a released helper in main@d70247b5:
# q_ned = (w, qx, qy, -qz)

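import airsim
import carla

def dual_client_workflow():
    # ground client: CARLA RPC server on port 2000
    ground = carla.Client("localhost", 2000)
    ground.set_timeout(10.0)
    world = ground.get_world()
    # aerial client: AirSim RPC server on port 41451
    aerial = airsim.MultirotorClient(port=41451)
    aerial.confirmConnection()
    aerial.enableApiControl(True)
    aerial.armDisarm(True)
    aerial.takeoffAsync().join()

    # both APIs operate on the same UE4 world state
    vehicle_blueprint = world.get_blueprint_library().filter("vehicle.*")[0]
    spawn_point = world.get_map().get_spawn_points()[0]
    vehicle = world.spawn_actor(vehicle_blueprint, spawn_point)
    drone_state = aerial.getMultirotorState()
    rgb = aerial.simGetImages([airsim.ImageRequest("0", airsim.ImageType.Scene)])
    return vehicle, drone_state, rgb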

def collect_air_ground_record(ego_vehicle, airsim_client, tick_id, save_dir):
    # public data_collector.py attaches RGB/depth/seg/LiDAR/IMU/GNSS to ego vehicle
    # the helper functions used here are placeholders for the example's own I/O code
    ground_packet = read_latest_ground_sensor_callbacks()
    aerial_rgb = airsim_client.simGetImages([
        airsim.ImageRequest("0", airsim.ImageType.Scene, False, False)
    ])[0]
    ego_tf = ego_vehicle.get_transform()  # carla.Actor API; carla.World has no get_ego_vehicle()
    metadata = {
        "frame": tick_id,
        "ego_pose": serialize_transform(ego_tf),
        "shared_tick_key": tick_id,
    }
    write_np_arrays_and_json(save_dir, tick_id, ground_packet, aerial_rgb, metadata)

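In the paper, the W5 RL environment is a demonstration of platform capability. The released code at main@d70247b5 provides no directly corresponding Gym/RLlib environment, reward function, or training script, so no code-grounded pseudocode is given here; these notes only record, in the experimental results, the observation/action/reward concepts given by the paper and explicitly mark them as not implemented in code.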

Code reference: main @ d70247b5 (2026-05-02); the pseudocode and the paper-to-code mapping are based on this commit.

Paper Concept	Source File	Key Class/Function
Unified GameMode / single-process integration	`Unreal/CarlaUE4/Plugins/AirSim/Source/SimWorldGameMode.{h,cpp}`	`ASimWorldGameMode : ACarlaGameModeBase`, `BeginPlay()`, `Tick()`, `CreateSimMode()`
One-direction AirSim → CARLA dependency	`Unreal/CarlaUE4/Plugins/AirSim/Source/AirSim.Build.cs`	`PrivateDependencyModuleNames` includes `Carla`; ground code does not import AirSim
CARLA actor/episode access for integrated actor registration	`Unreal/CarlaUE4/Plugins/Carla/Source/Carla/Game/CarlaEpisode.h`	`friend class ASimWorldGameMode`, `ActorDispatcher` access
AirSim settings and API server bootstrap	`SimWorldGameMode.cpp`, `CarlaAir_Release/source/config/settings.json`	`InitializeAirSimSettings()`, `SimMode_->startApiServer()`, `SimMode=Multirotor`, cameras at `1280×960`
Coordinate mapping / drone following	`CarlaAir_Release/guide/COORDINATE_SYSTEMS.md`, `CarlaAir_Release/source/python_api/examples/demo_drive_and_fly.py`	`airsim_z=-carla_z+offset_z`, `DroneFollower.update()`
Dual-client API pattern	`CarlaAir_Release/guide/examples/07_combined_demo.py`, `README.md`	`carla.Client('localhost', 2000)`, `airsim.MultirotorClient(port=41451)`
Multi-modal dataset example	`CarlaAir_Release/source/python_api/examples/data_collector.py`	spawns RGB/depth/seg/LiDAR/IMU/GNSS sensors and calls `simGetImages()`
Public release / quick start	`CarlaAir_Release/guide/Quick-Start.md`	ports, synchronous mode snippet, native API usage
W5 RL environment concept	`README.md`, paper Figure 14 only	not implemented in the released code as a Gym/RLlib env, reward, or training script; paper-level capability demo

3.6 Intuition: why a single process matters more than a bridge

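The focus of this paper is not inventing a new controller but moving the "alignment boundary" of controller, sensor, renderer, and physics inside the engine. A bridge approach treats the two simulators as black boxes and can only exchange state at the process boundary; as the number of sensors grows, serialization and clock drift both turn into systematic error. CARLA-Air instead has the renderer read aerial state and ground state on the same tick, so the pairs in cross-view perception, the relative pose in RL rewards, and the aerial oracle view in VLN/VLA all naturally share one scene snapshot. The value of this design is that it reduces the burden on each downstream task of fixing clocks, coordinates, and weather inconsistency on its own.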

4. Experimental Setup

Datasets / workloads and scale

The paper does not propose a static dataset; it validates a set of simulator workloads. W1, precision landing, has a drone descend from about $12$ m onto a moving vehicle in Town10HD. W3, multi-modal dataset collection, runs $1{,}000$ ticks with 30 autopilot vehicles, 10 pedestrians, 8 ground sensors, and 4 aerial sensors, yielding $1{,}000$ 12-stream records. W4, cross-view perception, runs 500 ticks, covers the 14 official weather presets, and shows multi-weather aerial views of Town01-05 and Town10HD in Figure 13. W5 validates the RL training environment; its core evidence is 0 crashes / 0 API errors across 357 actor spawn/destroy reset cycles.

Baselines / compared systems

The paper compares three classes of baselines: single-domain simulators (CARLA, LGSVL, SUMO, MetaDrive, VISTA; AirSim, Flightmare, FlightGoggles, Gazebo/RotorS, OmniDrones, gym-pybullet-drones), joint/co-simulation approaches (TranSimHub, CARLA+SUMO, AirSim+Gazebo, the community AirSim+CARLA bridge), and embodied/RL platforms (Isaac Lab, Isaac Gym, Habitat, SAPIEN, RoboSuite). The performance experiments additionally cover the following profiles: standalone ground sim only, standalone aerial sim only, joint idle, ground only, moderate joint, traffic surveillance, and stability endurance.

Metrics

The main metrics are: harmonic mean FPS (appropriate for a rate quantity), VRAM in MiB, CPU utilization, median/IQR of API round-trip latency, sensor alignment deviation, RPC error count, crash count, landing horizontal error, records collected, and weather presets passed. The paper also reports the per-frame transfer time of bridge IPC to show that cross-process serialization overhead grows near-linearly with sensor count.
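Why the harmonic mean suits FPS, and how median/IQR are obtained, can be shown with the standard library alone. This is a minimal sketch on synthetic samples (the numbers are illustrative, not measurements from the paper): the harmonic mean of per-frame FPS equals frames divided by total time, while the arithmetic mean overstates throughput when a few frames are slow.

```python
import statistics


def harmonic_mean_fps(frame_times_s):
    """Harmonic mean of per-frame FPS values, equal to frames / total time."""
    fps_samples = [1.0 / t for t in frame_times_s]
    return statistics.harmonic_mean(fps_samples)


def median_and_iqr(samples):
    """Median and interquartile range (Q3 - Q1) of latency samples."""
    q1, q2, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return q2, q3 - q1


# Synthetic frame times: nine fast frames and one slow frame.
frame_times = [0.04] * 9 + [0.40]
harmonic = harmonic_mean_fps(frame_times)
arithmetic = statistics.mean(1.0 / t for t in frame_times)
print(f"harmonic mean: {harmonic:.2f} FPS, arithmetic mean: {arithmetic:.2f} FPS")
```

The median/IQR pair plays the same role for latency: it stays stable when a handful of calls are much slower than the rest.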

System / benchmark config

The hardware and software stack comes from the paper's source file performance.tex and the appendix: Ubuntu 20.04/22.04 LTS, NVIDIA RTX A4000 16GB GDDR6, AMD Ryzen 7 5800X 8-core 4.7GHz, 32GB DDR4-3200; CARLA 0.9.16, AirSim 1.8.1, UE4.26, Python 3.8+. The default map is Town10HD with Epic quality and synchronous mode; the aerial experiments use the built-in SimpleFlight controller with default PID gains. The benchmark harness uses $T_{w} = 200$ warm-up ticks and $T_{m} = 2{,}000$ measurement ticks; VRAM is sampled every 60 s; the latency benchmark uses 500 warm-up calls + 5,000 measurement calls. The paper does not train any neural policy, so there is no learning rate, batch size, or step count to report; W5 is an RL environment capability, for which the authors give only the form of observation/action/reward and stability evidence.
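The warm-up/measurement split can be sketched as a small harness. This is an illustrative sketch, not the paper's benchmark code: `run_benchmark` and `fps_from_durations` are hypothetical names, and in a real run the `step` callable would be something like a synchronous `world.tick()` call.

```python
import time


def run_benchmark(step, warmup_ticks, measure_ticks, clock=time.perf_counter):
    """Call `step()` for the warm-up ticks (timings discarded), then time
    each measurement tick and return the per-tick durations in seconds."""
    for _ in range(warmup_ticks):
        step()
    durations = []
    for _ in range(measure_ticks):
        start = clock()
        step()
        durations.append(clock() - start)
    return durations


def fps_from_durations(durations):
    """Frames divided by total measured time (the harmonic mean FPS)."""
    return len(durations) / sum(durations)


# Stand-in workload: a no-op step that only counts how often it is called.
calls = []
durations = run_benchmark(lambda: calls.append(1), warmup_ticks=200, measure_ticks=2000)
print(len(calls), len(durations))
```

Discarding the warm-up ticks keeps shader compilation, asset streaming, and cache fill out of the reported figures.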

5. Experimental Results

Main performance numbers

Profile	Configuration	FPS	VRAM (MiB)	CPU (%)
Ground sim only	3 vehicles + 2 pedestrians; 8 sensors @ $1280 \times 720$	$28.4 \pm 1.2$	$3{,}821 \pm 10$	$31 \pm 3$
Aerial sim only	1 drone; 8 sensors @ $1280 \times 720$	$44.7 \pm 2.1$	$2{,}941 \pm 8$	$29 \pm 3$
Idle	Town10HD; no actors; no sensors	$60.0 \pm 0.4$	$3{,}702 \pm 8$	$12 \pm 2$
Ground only	3 vehicles + 2 pedestrians; 8 sensors @ $1280 \times 720$	$26.3 \pm 1.4$	$3{,}831 \pm 11$	$38 \pm 4$
Moderate joint	3 vehicles + 2 pedestrians + 1 drone; 8 sensors @ $1280 \times 720$	$19.8 \pm 1.1$	$3{,}870 \pm 13$	$54 \pm 5$
Traffic surveillance	8 autopilot vehicles + 1 drone; 1 aerial RGB @ $1920 \times 1080$	$20.1 \pm 1.8$	$3{,}874 \pm 15$	$61 \pm 6$
Stability endurance	Moderate joint; 357 spawn/destroy cycles; 3 hr continuous	$19.7 \pm 1.3$	$3{,}878 \pm 17$	$55 \pm 5$

The moderate joint profile holds $19.8 \pm 1.1$ FPS, which the paper considers sufficient for closed-loop evaluation at standard RL episode lengths. The drop from the standalone ground baseline to moderate joint is $8.6$ FPS (30.3%): $2.1$ FPS comes from ground co-hosting and $6.5$ FPS from the aerial physics engine. VRAM is only 39 MiB higher than in the ground-only profile, so the overhead shows up mainly in CPU.
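The overhead decomposition follows from three rows of the performance table. As a quick worked check (the variable names are illustrative, and reading "Ground only" as the unified build without a drone is an interpretation of the profile names):

```python
standalone_ground_fps = 28.4   # "Ground sim only" row
co_hosted_ground_fps = 26.3    # "Ground only" row (presumably the unified build, no drone)
moderate_joint_fps = 19.8      # "Moderate joint" row

co_hosting_cost = standalone_ground_fps - co_hosted_ground_fps
aerial_cost = co_hosted_ground_fps - moderate_joint_fps
total_cost = standalone_ground_fps - moderate_joint_fps
relative_cost = total_cost / standalone_ground_fps

vram_delta_mib = 3870 - 3831   # moderate joint vs. ground only

print(f"co-hosting: {co_hosting_cost:.1f} FPS, aerial: {aerial_cost:.1f} FPS")
print(f"total: {total_cost:.1f} FPS ({relative_cost:.1%}), VRAM delta: {vram_delta_mib} MiB")
```

The two partial costs add up to the total drop, and the VRAM difference is about 1% of the ground-only footprint.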

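Figure 9 interpretation: the 3-hour endurance run contains 357 spawn/destroy cycles. Mean VRAM is $3{,}868$ MiB in the early cycles and $3{,}878$ MiB in the late cycles, a drift of about 10 MiB; the regression slope is $0.49$ MiB/cycle with $R^{2} = 0.11$. This supports the authors' conclusion that there is no significant memory leak and that the residual looks more like render-target caching.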

Stability Metric	Early cycles 1-30	Late cycles 328-357
Frame rate	$19.9 \pm 1.2$ FPS	$19.7 \pm 1.3$ FPS
VRAM	$3{,}868 \pm 14$ MiB	$3{,}878 \pm 17$ MiB
CPU utilization	$53 \pm 5$ %	$55 \pm 5$ %
API error count	0	0
Crash count	0	0
VRAM regression slope	$0.49$ MiB/cycle	$R^{2} = 0.11$
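A regression slope and $R^{2}$ of this kind can be computed with an ordinary least-squares fit of VRAM against cycle index. The sketch below is a plain-Python version on synthetic samples; the data are made up for illustration and `linear_fit` is a hypothetical helper, not the paper's analysis script.

```python
def linear_fit(xs, ys):
    """Least-squares slope, intercept and R^2 for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # A constant series is fit perfectly by a horizontal line.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared


# Synthetic VRAM samples (MiB) per reset cycle, NOT data from the paper.
cycles = [1, 2, 3, 4, 5, 6]
vram_mib = [3868, 3871, 3866, 3873, 3869, 3874]
slope, intercept, r2 = linear_fit(cycles, vram_mib)
print(f"slope={slope:.2f} MiB/cycle, R^2={r2:.2f}")
```

A low $R^{2}$ means the cycle index explains little of the VRAM variance, which is the kind of evidence used against a steady leak.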

API	Call	Median	IQR
Ground sim	World state snapshot	320 µs	40 µs
Ground sim	Actor transform query	280 µs	35 µs
Ground sim	Actor spawn + paired destroy	1,850 µs	210 µs
Ground sim	Actor destroy	920 µs	95 µs
Aerial sim	Multirotor state query	410 µs	55 µs
Aerial sim	Image capture, 1 RGB stream	3,200 µs	380 µs
Aerial sim	Velocity command dispatch	490 µs	60 µs
Bridge IPC reference	Cross-process state sync	3,000 µs	2,000 µs

Representative workflow results

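Figure 10 interpretation: W1 demonstrates precision landing of a drone on a moving vehicle. The time-lapse on the left shows approach, descent, and touchdown; the trajectory plot and the altitude/error curves on the right show the drone descending smoothly from about $12$ m, with the horizontal error converging from about $6$ m to within the $\pm 0.5$ m tolerance band. This workflow tests whether the coordinate transform and tick-synchronous control are usable for realistic air-ground cooperation.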

W1 Metric	Value	Notes
Mean FPS	19.3	harmonic mean
Initial altitude	$\approx 12$ m	start of descent
Landing duration	$\approx 20$ s	approach to touchdown
Final horizontal error	$< 0.5$ m	within tolerance band
Initial horizontal error	$\approx 6$ m	at descent start
RPC errors	0	both clients
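The W1 metrics reduce to a horizontal distance and a band check. A minimal sketch, assuming both poses are already expressed in the same NED frame in meters; the positions are illustrative values, not logged data from the paper, and the function names are hypothetical.

```python
import math


def horizontal_error_m(drone_xy, vehicle_xy):
    """Euclidean distance in the horizontal (X/Y) plane, in meters."""
    return math.hypot(drone_xy[0] - vehicle_xy[0], drone_xy[1] - vehicle_xy[1])


def within_tolerance(error_m, band_m=0.5):
    """True when the horizontal error lies inside the tolerance band."""
    return error_m <= band_m


# Rough mean descent rate implied by the table: about 12 m over about 20 s.
mean_descent_rate = 12.0 / 20.0

# Illustrative positions only.
start_error = horizontal_error_m((0.0, 0.0), (3.6, 4.8))
final_error = horizontal_error_m((10.0, 5.0), (10.2, 5.1))
print(mean_descent_rate, round(start_error, 3), within_tolerance(final_error))
```

Since the 20 s duration spans approach to touchdown, the descent rate is only a rough average over the whole maneuver.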

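Figure 11 interpretation: W2 is a qualitative demo of embodied navigation and VLN/VLA data generation. The drone tracks a pedestrian using bird's-eye observations, and each frame in the figure comes with chain-of-thought style reasoning. The paper's emphasis is not on the performance of a particular language model but on the fact that CARLA-Air can simultaneously provide an aerial overview, street-level detail, semantic annotations, and shared weather/lighting for building cross-view instruction-following data.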

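Figure 12 interpretation: W3 collects multi-modal sensor output from the vehicle perspective and the drone perspective on the same simulation tick. Both the top row and the bottom row contain modalities such as RGB, semantic/depth, and LiDAR/geometry; the key point is that all streams use the same tick index rather than post-hoc timestamp interpolation.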

W3 Metric	Value	Notes
Mean FPS	17.1	harmonic mean
Concurrent streams	12	8 ground + 4 aerial
Records collected	1,000	one per tick
Max alignment deviation	$\leq 1$ tick	under disk-write load
RPC errors	0	both clients
Per-tick write latency	$61 \pm 9$ ms	incl. serialization

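Figure 13 interpretation: W4 shows consistent rendering of the aerial RGB view across Town01-05, Town10HD, and multiple weather presets. The paper uses "14/14 weather presets passed" to show that a single shared renderer propagates weather and lighting synchronously to ground and aerial sensors; the corresponding formula is $\epsilon_{k} = | t_{k}^{gnd} - t_{k}^{air} | = 0$.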

W4 Metric	Value	Notes
Mean FPS	18.2	harmonic mean
Co-registered pairs	500	aerial depth + ground segmentation
Per-tick latency	$52 \pm 6$ ms	full collection loop
Sensor alignment	0 ticks	sync mode guarantee
Weather presets passed	14/14	all official presets
RPC errors	0	both clients

Figure 14 interpretation: W5 wraps CARLA-Air as an RL environment. Each synchronous tick produces an observation (drone pose, vehicle pose, relative distance, traffic state), the policy outputs 3D velocity commands, and the reward encodes tracking accuracy, altitude maintenance, and collision avoidance. The paper reports no reward curve for any specific policy; it uses stable resets and the shared world state to argue that the environment is suitable for subsequent RL training.
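To make the three reward terms concrete, here is one possible shape such a reward could take. This is purely a hypothetical sketch: the paper names the terms but gives no formula or weights, the released code contains no reward function, and every name and number below is an assumption for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RewardWeights:
    # Hypothetical weights; the paper gives no numeric values.
    tracking: float = 1.0
    altitude: float = 0.5
    collision: float = 10.0


def tracking_reward(drone_ned, vehicle_ned, target_altitude_m, collided,
                    weights=RewardWeights()):
    """Negative cost combining horizontal tracking error, altitude error
    and a collision penalty. Positions are NED meters, so altitude = -z."""
    horizontal_error = math.hypot(drone_ned[0] - vehicle_ned[0],
                                  drone_ned[1] - vehicle_ned[1])
    altitude_error = abs(-drone_ned[2] - target_altitude_m)
    penalty = weights.collision if collided else 0.0
    return -(weights.tracking * horizontal_error
             + weights.altitude * altitude_error
             + penalty)


# Drone 5 m away horizontally, flying at 10 m with a 12 m altitude target.
print(tracking_reward((3.0, 4.0, -10.0), (0.0, 0.0, 0.0), 12.0, False))
```

Both poses come from the same tick in the shared world, so the relative terms need no extra time alignment before they are combined.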

Ablation / key findings

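Strictly speaking, the paper has no component ablation table; its evidence for "component effectiveness" comes mainly from system comparison and workload decomposition. The key findings are: the single-process design keeps IPC essentially below $0.5$ ms as sensor count grows; the overhead of moderate joint is mainly CPU-bound aerial physics rather than VRAM; traffic surveillance raises the vehicle count to 8 and still reaches $20.1 \pm 1.8$ FPS, indicating that $1920 \times 1080$ sensor rendering is more dominant than actor population; and 0 crashes / 0 API errors over 357 reset cycles support the stability of RL-style episode resets.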

Limitations

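The authors state three limitations explicitly. First, the current performance characterization covers only moderate traffic loads; high-density scenes with many actors remain an active engineering target. Second, map switching requires a full process restart because the actor lifecycles of the two backends are independent; in-session reset is still planned. Third, configurations with more than two drones run but have not yet been formally validated across a large number of scenarios. Future work includes finer-grained physics-state synchronization, a ROS 2 bridge that publishes both sets of streams in a unified way, and GPU-parallel multi-environment execution similar to Isaac Lab / OmniDrones.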

Overall conclusion

The contribution of CARLA-Air is a simulator infrastructure, not a new perception or control algorithm. Through minimal source modification and additive AirSim plugin integration, it puts CARLA's realistic urban world and AirSim's physics-accurate UAV flight into the same process, preserves both native APIs, and binds aerial-ground sensing, weather, physics, and actor state to a shared tick. The experimental results show about 20 FPS under joint workloads, stable VRAM, and sufficiently low latency, enough to support research on air-ground embodied intelligence such as precision landing, VLN/VLA data generation, multi-modal dataset collection, cross-view perception, and RL environments.
