Massive CoreML latency spike on live AVFoundation camera feed vs. offline inference (CPU+ANE)

Hello,

I’m experiencing severe performance degradation when running Core ML models on a live AVFoundation video feed compared to offline or synthetic inference. This happens across multiple models I've converted (including SCI, RTMPose, and RTMW) and affects multiple devices.

The Environment

OS: macOS 26.3, iOS 26.3, iPadOS 26.3

Hardware: Mac14,6 (M2 Max), iPad Pro 11 M1, iPhone 13 mini

Compute Units: cpuAndNeuralEngine

The Numbers

When testing my SCI_output_image_int8.mlpackage model, the inference timings are drastically different:

Synthetic/Offline Inference: ~1.34 ms

Live Camera Inference: ~15.96 ms

Preprocessing is completely ruled out as the bottleneck. My profiling shows total preprocessing (nearest-neighbor resize + feature provider creation) takes only ~0.4 ms in camera mode. Furthermore, no frames are being dropped.

What I've Tried

I am building a latency-critical app and have implemented almost every recommended optimization to try and fix this, but the camera-feed penalty remains:

  • Matched the AVFoundation camera output format exactly to the model input (640x480 at 30/60fps).

  • Used IOSurface-backed pixel buffers for everything (camera output, synthetic buffer, and resize buffer).

  • Enabled outputBackings.

  • Loaded the model once and reused it for all predictions.

  • Configured MLModelConfiguration with reshapeFrequency = .frequent and specializationStrategy = .fastPrediction.

  • Wrapped inference in ProcessInfo.processInfo.beginActivity(options: .latencyCritical, reason: "CoreML_Inference").

  • Set DispatchQueue to qos: .userInteractive.

  • Disabled the idle timer and enabled iOS Game Mode.

  • Exported models using coremltools 9.0 (deployment target iOS 26) with ImageType inputs/outputs and INT8 quantization.
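To make the configuration above concrete, here is a minimal sketch of the load-once path, roughly matching the options listed. `compiledModelURL` and the `"output"` feature name are placeholders for your own model, and the availability guard is approximate; check it against the SDK you build with. Note that with a fixed input shape, `.infrequent` is normally the right reshape hint, even though the post mentions `.frequent`.

```swift
import Foundation
import CoreML
import CoreVideo

// Load the model once with CPU+ANE compute units and latency-oriented hints,
// and build one reusable MLPredictionOptions with an IOSurface-backed output.
func makeModel(at compiledModelURL: URL,
               outputBacking: CVPixelBuffer) throws -> (MLModel, MLPredictionOptions) {
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndNeuralEngine

    if #available(macOS 15.0, iOS 18.0, *) {
        // With a fixed input shape, .infrequent lets Core ML specialize harder.
        config.optimizationHints.reshapeFrequency = .infrequent
        config.optimizationHints.specializationStrategy = .fastPrediction
    }

    let model = try MLModel(contentsOf: compiledModelURL, configuration: config)

    // Reusing one options object with outputBackings lets Core ML write the
    // result straight into a pre-allocated buffer instead of allocating per frame.
    let options = MLPredictionOptions()
    options.outputBackings = ["output": outputBacking]
    return (model, options)
}
```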

Reproduction

To completely rule out UI or rendering overhead, I wrote a standalone Swift CLI script that isolates the AVFoundation and CoreML pipeline. The script clearly demonstrates the ~15ms latency on live camera frames versus the ~1ms latency on synthetic buffers.

(I have attached camera_coreml_benchmark.swift and the Core ML model (a very lightweight low-light enhancement model) to this GitHub repo: https://github.com/pzoltowski/apple-coreml-camera-latency-repro.)
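For reference, the timing pattern the benchmark relies on can be sketched generically as below; the prediction call itself is stubbed out as a closure, since running it requires the attached model. In the real script the closure would wrap `model.prediction(from:options:)` on either a synthetic IOSurface-backed buffer or the most recent camera frame.

```swift
import Dispatch
import Foundation

// Average per-call latency in milliseconds, using a monotonic clock and
// excluding warm-up iterations so caches and power states can settle first.
func averageMilliseconds(warmup: Int = 5, iterations: Int = 100,
                         _ body: () -> Void) -> Double {
    for _ in 0..<warmup { body() }
    let start = DispatchTime.now().uptimeNanoseconds
    for _ in 0..<iterations { body() }
    let end = DispatchTime.now().uptimeNanoseconds
    return Double(end - start) / Double(iterations) / 1_000_000
}
```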

My Question: Is this massive overhead expected behavior for AVFoundation + Core ML on live feeds, or is this a framework/runtime bug? If expected, what is the Apple-recommended pattern to bypass this camera-only inference slowdown?

One thing I found interesting: when running in debug mode, inference was faster (not as fast as in the performance benchmark, but faster than 16 ms). Also, if I ran some dummy calculation on a different DispatchQueue, the model seemed to get slightly faster. So maybe this is related to ANE power-state issues (jitter / SoC wake), with the ANE going to sleep too quickly and taking a long time to wake back up? Doing a dummy calculation on a background thread is probably not a real solution, though.

Thanks in advance for any insights!

Experiencing significant performance degradation when running CoreML models on live AVFoundation video feeds compared to offline inference is a common issue, though the exact cause can vary based on several factors. Given the details you've provided, here are some insights and potential solutions to consider:

Potential Causes and Solutions

ANE Power State Management:

Observation: Your suspicion about ANE (Apple Neural Engine) power state issues seems plausible, especially with the noted variability in debug mode and when introducing dummy computations.

Solution: The ANE may enter a low-power state when idle, causing latency spikes when reactivated for inference. To mitigate this, consider keeping the ANE active by scheduling low-intensity tasks periodically. However, avoid unnecessary computations that could impact overall performance.
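As a diagnostic experiment only, the keep-alive idea can be sketched with a `DispatchSourceTimer`. The 5 ms interval and the dot-product workload below are illustrative guesses, not Apple guidance; in a real test you would replace the workload with a prediction on a cached dummy buffer, since that is what actually exercises the ANE.

```swift
import Dispatch
import Foundation

// Experimental keep-alive pinger: fires a tiny workload on its own queue at a
// fixed interval so the hardware never fully idles between camera frames.
final class KeepAlivePinger {
    private let timer: DispatchSourceTimer
    private(set) var tickCount = 0

    init(interval: DispatchTimeInterval = .milliseconds(5)) {
        let queue = DispatchQueue(label: "keepalive", qos: .userInitiated)
        timer = DispatchSource.makeTimerSource(queue: queue)
        timer.schedule(deadline: .now() + interval, repeating: interval)
        timer.setEventHandler { [weak self] in
            // Tiny floating-point workload; cheap enough not to compete
            // with real inference.
            let a = (0..<64).map(Double.init)
            _ = zip(a, a).reduce(0) { $0 + $1.0 * $1.1 }
            self?.tickCount += 1
        }
    }

    func start() { timer.resume() }
    func stop()  { timer.cancel() }
}
```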

Approach: Use a DispatchSourceTimer to run a very light task (e.g., a small matrix multiplication) at regular intervals to keep the ANE primed.

AVFoundation and Core ML Synchronization:

Observation: Even with optimized preprocessing and model loading, synchronization overhead between AVFoundation and Core ML could contribute to latency.

Solution: Minimize the overhead by ensuring that frame capture and model inference are tightly coupled. Run inference directly in the AVCaptureVideoDataOutputSampleBufferDelegate callback as soon as each frame is available, reducing the time spent waiting for buffer completion.

Thread Management:

Observation: You've already set the dispatch queue to .userInteractive, but further thread management might improve performance.

Solution: Ensure that all video processing and inference tasks are executed on a dedicated background queue to avoid blocking the main thread, e.g. a serial DispatchQueue with qos: .userInteractive.

Model Optimization:

Observation: While you've applied INT8 quantization, further model optimizations might yield better results.

Solution: Experiment with model pruning, quantization-aware training, or using smaller model architectures to reduce inference time. Additionally, ensure that the model's input/output shapes and data types are perfectly aligned with the AVFoundation feed.

Frame Rate Considerations:

Observation: Running inference at both 30fps and 60fps shows similar latency penalties, suggesting that frame rate is not the bottleneck.

Solution: Consider adjusting the frame rate to balance between quality and performance. Lowering the frame rate slightly might reduce the load on the ANE without significantly impacting user experience.
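If you want to try that, a frame rate cap can be sketched as follows; `fps: 30` is just an example value, and `device` is your existing `AVCaptureDevice`:

```swift
import AVFoundation
import CoreMedia

// Pin the capture device to a fixed cadence so the inference load arrives at
// a predictable rate. Must be called between lock/unlockForConfiguration.
func pinFrameRate(of device: AVCaptureDevice, fps: Int32) throws {
    try device.lockForConfiguration()
    defer { device.unlockForConfiguration() }
    let duration = CMTime(value: 1, timescale: fps)
    device.activeVideoMinFrameDuration = duration
    device.activeVideoMaxFrameDuration = duration
}
```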

Debugging and Profiling:

Observation: Debug mode inference is faster, indicating potential runtime overhead in release mode.

Solution: Continue profiling in both modes to identify specific bottlenecks. Use Instruments to analyze CPU and ANE usage, focusing on any unexpected delays or power state transitions.
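One way to make those transitions visible is to wrap each prediction in a signpost interval, so it shows up in Instruments' Points of Interest track alongside the Core ML and power tracks. The subsystem and interval names below are illustrative:

```swift
import os.signpost

// Points-of-interest log; each withSignpost call becomes one interval
// visible in Instruments.
let poiLog = OSLog(subsystem: "com.example.latency", category: .pointsOfInterest)

func withSignpost<T>(_ body: () throws -> T) rethrows -> T {
    let id = OSSignpostID(log: poiLog)
    os_signpost(.begin, log: poiLog, name: "predict", signpostID: id)
    defer { os_signpost(.end, log: poiLog, name: "predict", signpostID: id) }
    return try body()
}
```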

Frame Dropping and Latency:

Observation: No frames are being dropped, but latency remains high.

Solution: Ensure that the camera feed and inference pipeline are not overwhelming the system's resources. Monitor memory and CPU usage to prevent contention that could lead to increased latency.

Apple-Recommended Patterns

Preloading and Reusing Resources: You've already implemented these, but ensure that resources like pixel buffers and models are efficiently managed and reused throughout the session.

Adaptive Inference: Consider implementing adaptive inference strategies, where the model's complexity or resolution is adjusted based on real-time performance metrics or scene complexity.

Background Task Scheduling: Use UIApplication.beginBackgroundTask to extend the time available for processing frames if necessary, ensuring that the task completes before the app is suspended.

If none of these solutions resolve the issue, it may be worthwhile to reach out to Apple Developer Support, providing them with your detailed profiling data and reproduction steps. They may be able to offer specific guidance or identify potential bugs in the frameworks.
