Discussion: Model Inference Optimization Techniques for Real-Time Streaming Pipeline #68


Open
phial3 opened this issue Mar 1, 2025 · 5 comments
Labels
enhancement New feature or request

Comments

@phial3
Contributor

phial3 commented Mar 1, 2025

  1. Hardware-Accelerated Video Decoding

When ffmpeg is built with hardware-acceleration support enabled, the DataLoader's decoder should prioritize hardware-accelerated backends (e.g., nvdec/cuvid for NVIDIA GPUs, qsv for Intel GPUs).

As an example, consider using rsmedia, which provides hardware-accelerated decoding and encoding.

This is a modification I made to the DataLoader to support hardware acceleration; it only supports CUDA for now. Here is the commit:

usls Dataloader support Decoder Hardware acceleration commit
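
To make the fallback order concrete, here is a minimal probe sketch. It uses the ffmpeg-next crate purely for illustration (an assumption; rsmedia itself builds on rsmpeg, whose API differs):

// Minimal sketch (using ffmpeg-next, not rsmedia): probe for a hardware
// decoder and fall back to software decode when none is available.
use ffmpeg_next as ffmpeg;

fn pick_h264_decoder() -> Option<ffmpeg::Codec> {
    ffmpeg::init().ok()?;
    ffmpeg::decoder::find_by_name("h264_cuvid")                 // NVIDIA NVDEC
        .or_else(|| ffmpeg::decoder::find_by_name("h264_qsv"))  // Intel Quick Sync
        .or_else(|| ffmpeg::decoder::find_by_name("h264"))      // software fallback
}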

  2. Optimizing model.forward Speed with an Underutilized GPU

I’ve noticed that GPU resources are significantly underutilized, but the inference speed remains very slow.
What optimization strategies can I apply?

This is my code example:
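
// Note: `use` imports are omitted in this snippet. It relies on usls
// (YOLO, DataLoader, Annotator, Device, COCO_SKELETONS_16, string_now)
// and rsmedia (EncoderBuilder, RawFrame, Time, HWDeviceType, Options);
// exact module paths depend on the crate versions in use.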

fn main() -> Result<()> {
    let options = args::build_options()?;

    // build model
    let mut model = YOLO::try_from(options.commit()?)?;

    // build dataloader
    let dl = DataLoader::new(&args::input_source())?
        .with_batch(model.batch() as _)
        .with_device(Device::Cuda(0))
        .build()?;

    // build annotator
    let annotator = Annotator::default()
        .with_skeletons(&usls::COCO_SKELETONS_16)
        .without_masks(true)
        .with_bboxes_thickness(3)
        .with_saveout(model.spec());

    let mut position = Time::zero();
    let duration: Time = Time::from_nth_of_a_second(30);

    let mut encoder = EncoderBuilder::new(std::path::Path::new(&args::output()), 1920, 1080)
        .with_format("flv")
        .with_codec_name("h264_nvenc".to_string())
        .with_hardware_device(HWDeviceType::CUDA)
        .with_options(&Options::preset_h264_nvenc())
        .build()?;

    // run & annotate
    for (xs, _paths) in dl {

        let ys = model.forward(&xs)?;

        // extract bboxes
        for y in ys.iter() {
            if let Some(bboxes) = y.bboxes() {
                println!("[Bboxes]: Found {} objects", bboxes.len());
                for (i, bbox) in bboxes.iter().enumerate() {
                    println!("{}: {:?}", i, bbox)
                }
            }
        }

        // plot
        let frames = annotator.plot(&xs, &ys, false)?;

        // encode
        for (i, img) in frames.iter().enumerate() {
            // save image if needed
            img.save(format!("/tmp/images/{}_{}.png", string_now("-"), i))?;

            // image -> AVFrame
            let raw_frame = RawFrame::try_from_cv(&img.to_rgb8())?;
        
            // realtime streaming encoding
            encoder.encode_raw(&raw_frame)?;

            // Update the current position and add the inter-frame duration to it.
            position = position.aligned_with(duration).add();
        }
    }

    model.summary();

    encoder.finish().expect("failed to finish encoder");

    Ok(())
}
  3. End-to-End Pipeline with YOLO Detection + Hardware-Accelerated Encoding

Workflow: YOLO model detection → bounding-box rendering → real-time streaming with hardware-accelerated encoding (e.g., NVIDIA nvenc).

Consideration should be given to resource efficiency and to keeping the real-time stream smooth and stable with clear picture quality; one possible staging approach is sketched below.
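
One way to keep the GPU busy is to run decode, inference, and encode as separate stages connected by bounded channels, so CPU-bound annotation and encoding do not stall inference. This is a sketch only, assuming the usls/rsmedia handles can be moved across threads; Frame and Detections are placeholders for the real types.

// Three-stage pipeline sketch: bounded channels let decode, inference, and
// encode overlap; small channel bounds apply backpressure instead of
// buffering frames without limit.
use std::sync::mpsc::sync_channel;
use std::thread;

struct Frame;      // placeholder for a decoded frame
struct Detections; // placeholder for model outputs

fn main() {
    let (decoded_tx, decoded_rx) = sync_channel::<Frame>(4);
    let (infer_tx, infer_rx) = sync_channel::<(Frame, Detections)>(4);

    // Stage 1: decode (the DataLoader iteration would live here).
    let decoder = thread::spawn(move || {
        for _ in 0..100 {
            if decoded_tx.send(Frame).is_err() { break; }
        }
    });

    // Stage 2: inference (model.forward would run here, keeping the GPU busy).
    let inference = thread::spawn(move || {
        for frame in decoded_rx {
            if infer_tx.send((frame, Detections)).is_err() { break; }
        }
    });

    // Stage 3: annotate + encode on the main thread
    // (annotator.plot and encoder.encode_raw in the real pipeline).
    for (_frame, _dets) in infer_rx {}

    decoder.join().unwrap();
    inference.join().unwrap();
}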

@jamjamjon
Owner

Hardware acceleration sounds like a great solution! I’ll check out the suggestions and code you provided as soon as possible—thank you so much!

@phial3
Contributor Author

phial3 commented Mar 3, 2025

Timing statistics:

2025-03-03T15:17:05.761789748+08:00  INFO ThreadId(01) yolo_vision: 146: Performance stats after 1 batches:
Inference: avg=1.067834104s, min=1.039388029s, max=1.098278046s
Annotation: avg=440.746987ms, min=356.098842ms, max=487.375932ms
Encoding: avg=114.577614ms, min=86.224038ms, max=156.673274ms
Batch sender time: 372.879µs

2025-03-03T15:17:09.516312610+08:00  INFO ThreadId(01) yolo_vision: 146: Performance stats after 1 batches:
Inference: avg=1.070150794s, min=1.039388029s, max=1.098278046s
Annotation: avg=438.701176ms, min=356.098842ms, max=487.375932ms
Encoding: avg=117.700642ms, min=86.224038ms, max=156.673274ms
Batch sender time: 289.202µs

2025-03-03T15:17:13.308906499+08:00  INFO ThreadId(01) yolo_vision: 146: Performance stats after 1 batches:
Inference: avg=1.072397981s, min=1.039388029s, max=1.098278046s
Annotation: avg=433.072846ms, min=356.098842ms, max=487.375932ms
Encoding: avg=123.453407ms, min=86.224038ms, max=157.194381ms
Batch sender time: 235.147µs

2025-03-03T15:17:17.364376789+08:00  INFO ThreadId(01) yolo_vision: 146: Performance stats after 1 batches:
Inference: avg=1.066150328s, min=1.022416759s, max=1.098278046s
Annotation: avg=428.853764ms, min=356.098842ms, max=487.375932ms
Encoding: avg=122.870734ms, min=86.224038ms, max=157.194381ms
Batch sender time: 371.402µs

2025-03-03T15:17:21.137066568+08:00  INFO ThreadId(01) yolo_vision: 146: Performance stats after 1 batches:
Inference: avg=1.067140058s, min=1.022416759s, max=1.098278046s
Annotation: avg=430.064528ms, min=356.098842ms, max=487.375932ms
Encoding: avg=122.90406ms, min=86.224038ms, max=157.194381ms
Batch sender time: 309.676µs

......

The most time-consuming phases are inference and annotation, at roughly 1.07 s and 430 ms respectively.
That is far too long for real-time push streaming.
What would be a good optimization approach for this situation?

@jamjamjon
Owner

jamjamjon commented Mar 4, 2025

From the results of your code execution, model inference indeed occupies a significant amount of time, and the annotation step also takes a considerable amount of time. Here are several aspects to analyze:

    1. Hardware and Model Details: What are your machine and GPU models? What is the size or parameter count of the YOLO model used for inference, and what is the batch size? Based on experience, with an RTX 3060 Ti, a batch size of 1, and an input image resolution of 640x640, the YOLOv8-m-det model preprocessing time is approximately 1.5ms, model inference takes less than 20ms, and post-processing varies with the number of results and the machine's CPU performance, typically around 600µs. If the inference model is in ONNX format, you could try the TensorRT provider with FP16 precision for further acceleration (see the sketch after this list).
    2. Annotator Performance: The plot() and annotate() methods of the annotator do not implement any parallel strategy and rely heavily on CPU performance. The current implementation uses the imageproc crate for rendering results, which is somewhat slow. If speed is a priority, you could experiment with other crates for result rendering.
    3. DataLoader Considerations: After enabling hardware acceleration on an NVIDIA device, have you measured the time taken for video-stream decoding and encoding? Those results would help analyze the Encoder's performance and the cost of each iteration of the for loop (a minimal timing sketch also follows this list).
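
For item 1, here is a sketch of switching to the TensorRT execution provider with FP16 in usls; the builder methods (with_trt, with_fp16) follow older usls examples and may differ in the version you are using:

// Assumed usls builder API: TensorRT EP on GPU 0 with FP16 precision.
let options = Options::default()
    .with_model("yolov8m.onnx")?  // hypothetical model path
    .with_trt(0)
    .with_fp16(true);
let mut model = YOLO::try_from(options.commit()?)?;

For item 3, per-stage costs can be measured with std::time::Instant inside the main loop of the example above (reusing its variable names):

// Wrap each stage with Instant to get numbers comparable to the stats posted earlier.
use std::time::Instant;

let t = Instant::now();
let ys = model.forward(&xs)?;
println!("inference: {:?}", t.elapsed());

let t = Instant::now();
let frames = annotator.plot(&xs, &ys, false)?;
println!("annotation: {:?}", t.elapsed());

let t = Instant::now();
for img in frames.iter() {
    let raw_frame = RawFrame::try_from_cv(&img.to_rgb8())?;
    encoder.encode_raw(&raw_frame)?;
}
println!("encoding: {:?}", t.elapsed());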

I have been traveling on business recently and do not have access to a computer to test the whole pipeline and the rsmedia crate. Sorry for not being able to respond to your questions in a timely manner. You are welcome to leave further comments for discussion, and I will reply as soon as I see them.

@jamjamjon
Owner

I tried the rsmedia project, and it seems that there are some issues with the ffmpeg6 features.

Compiling rsmpeg v0.15.1+ffmpeg.7.0 (https://github.com/phial3/rsmpeg?branch=light#13f8c554)
error[E0605]: non-primitive cast: `unsafe extern "C" fn(*mut c_void, *const u8, i32) -> i32 {write_c}` as `unsafe extern "C" fn(*mut c_void, *mut u8, i32) -> i32`
   --> /home/qweasd/.cargo/git/checkouts/rsmpeg-6e0a08a626b70a61/13f8c55/src/avformat/avio.rs:148:50
    |
148 |                 write_packet.is_some().then_some(write_c as _),
    |                                                  ^^^^^^^^^^^^ invalid cast

For more information about this error, try `rustc --explain E0605`.
error: could not compile `rsmpeg` (lib) due to 1 previous error

I see that this project is under rapid development. I will keep following it and wait for further testing. @phial3

@phial3
Contributor Author

phial3 commented Mar 12, 2025

> I tried the rsmedia project, and it seems that there are some issues with the ffmpeg6 features.
>
> I see that this project is under rapid development. I will keep following it and wait for further testing. @phial3

The default features are ["ffmpeg7", "ndarray"].
If you use ffmpeg version 7.x, the defaults are fine, like this:

rsmedia = { git = "https://github.com/phial3/rsmedia", branch = "rsmpeg" }

If you use ffmpeg version 6.x, the default features need to be disabled, like this:

rsmedia = { git = "https://github.com/phial3/rsmedia", branch = "rsmpeg", default-features = false, features = ["ffmpeg6", "ndarray"] }
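
To confirm which FFmpeg major version a build actually links against (and therefore which feature to pick), something like the following can help; it assumes rsmpeg re-exports the raw bindings as rsmpeg::ffi:

// Hedged sketch: libavutil major 58 ships with FFmpeg 6.x, 59 with FFmpeg 7.x.
// avutil_version() packs major/minor/micro into one integer.
fn main() {
    let v = unsafe { rsmpeg::ffi::avutil_version() };
    println!("libavutil {}.{}.{}", v >> 16, (v >> 8) & 0xff, v & 0xff);
}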

@jamjamjon jamjamjon added the enhancement New feature or request label May 20, 2025