A deadlock can occur when joining a new room after disconnection #704

6bangs opened this issue May 20, 2025 · 2 comments

6bangs commented May 20, 2025

Describe the bug
Disconnecting from a room sometimes hangs. Then when joining a new room, creating a new peer connection hangs, locking the liveKitWebRTC queue. It doesn't happen every time, but we have reproduced it with a fork of the example project.

SDK Version
2.5.1

iOS/macOS Version
iOS 18 (various, but most recently 18.4.1)

Xcode Version
16.3, Swift 5 or 6

Steps to Reproduce

All connection and disconnection code remains the same, but the app automatically joins a room and disconnects every 10 seconds.
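A join/leave loop like the one described could be sketched roughly as follows. This is only an illustration: `ReconnectableRoom` and `runReconnectLoop` are hypothetical names that are not part of the SDK or the example project, and a real harness would wrap the SDK's room object behind the protocol.

```swift
import Foundation

/// Hypothetical abstraction over a room; a real harness would wrap the SDK's room object.
protocol ReconnectableRoom {
    func connect() async throws
    func disconnect() async
}

/// Connects, stays in the room for `holdSeconds`, then disconnects, `cycles` times.
/// Returns the number of cycles in which connect() succeeded.
func runReconnectLoop<R: ReconnectableRoom>(room: R, cycles: Int, holdSeconds: Double) async -> Int {
    var successfulConnects = 0
    for cycle in 0..<max(cycles, 0) {
        do {
            try await room.connect()
            successfulConnects += 1
            // Stay in the room for a while before leaving again.
            try await Task.sleep(nanoseconds: UInt64(max(holdSeconds, 0) * 1_000_000_000))
        } catch {
            print("cycle \(cycle): \(error)")
        }
        // If disconnect() hangs, the loop stalls here, which is the symptom being chased.
        await room.disconnect()
    }
    return successfulConnects
}
```

With `holdSeconds: 10` this would match the 10-second cadence used in the fork; a stalled loop then points at the cycle in which the disconnect never returned.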

Expected behavior
All participants should be able to connect and disconnect without deadlocks.

As far as I can tell, this is the sequence of events:

  1. _pc.close hangs indefinitely when closing the subscriber connection.
    func close() async {
        // prevent debounced negotiate firing
        await _debounce.cancel()

        // Stop listening to delegate
        _pc.delegate = nil
        // Remove all senders (if any)
        for sender in _pc.senders {
            _pc.removeTrack(sender)
        }

        _pc.close()
    }
  2. When initiating a new subscriber connection, PeerConnectionFactory.peerConnection hangs because the previous one was never closed.
    static func createVideoTrack(source: LKRTCVideoSource) -> LKRTCVideoTrack {
        DispatchQueue.liveKitWebRTC.sync { peerConnectionFactory.videoTrack(with: source,
                                                                            trackId: UUID().uuidString) }
    }

At this point, all usages of DispatchQueue.liveKitWebRTC will wait indefinitely. Some of these calls happen from the main thread, so this can also result in freezing.
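To illustrate why one stuck call affects every later caller, here is a minimal sketch using a plain serial queue as a stand-in for DispatchQueue.liveKitWebRTC (assuming that queue is serial). The names `webRTCQueue` and `syncCompletes` are hypothetical, and the "stuck" work item is simulated with a semaphore rather than a real peer connection call.

```swift
import Foundation

// Hypothetical stand-in for DispatchQueue.liveKitWebRTC: a plain serial queue.
let webRTCQueue = DispatchQueue(label: "example.webrtc.serial")

/// Runs `work` via `queue.sync` on a helper thread and reports whether it
/// finished within `timeout` seconds.
func syncCompletes(on queue: DispatchQueue,
                   within timeout: TimeInterval,
                   work: @escaping @Sendable () -> Void) -> Bool {
    let finished = DispatchSemaphore(value: 0)
    DispatchQueue.global().async {
        queue.sync(execute: work)   // blocks until every earlier item on the queue is done
        finished.signal()
    }
    return finished.wait(timeout: .now() + timeout) == .success
}

// Simulate a work item that does not return (like the stuck call described above).
let neverReturns = DispatchSemaphore(value: 0)
webRTCQueue.async { neverReturns.wait() }

// Any later sync caller now waits behind the stuck item.
let blockedWhileStuck = !syncCompletes(on: webRTCQueue, within: 0.5) {}

// Release the simulated hang so the sketch terminates.
neverReturns.signal()
let completesAfterRelease = syncCompletes(on: webRTCQueue, within: 2.0) {}

print("blocked while stuck:", blockedWhileStuck)
print("completes after release:", completesAfterRelease)
```

If the caller of `sync` is the main thread, as in some of the call sites mentioned above, the same wait shows up as a frozen UI instead of a timeout.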

I suspect _pc.close is waiting to run on the signaling thread, but so far I haven't been able to identify what the signaling thread is busy with.

6bangs added the bug label May 20, 2025
pblazej assigned pblazej and unassigned hiroshihorie May 20, 2025

pblazej commented May 20, 2025

@6bangs unfortunately I wasn't able to reproduce it yet (Xcode 16.3, iOS 18.4.1).

_pc.close() is indeed a blocking call, all the way down to the Obj-C++ layer (e.g. while doing network cleanup). Let's try to exclude the Swift layer first, then extract the root cause.

  • Are we sure the blocking part is exactly _pc.close()? Can you attach the stack trace? That should give us some info on the C++ part, especially:

Image

  • Can you use the latest version (2.6.0)?
  • Does removing the suspension point inside actor Transport and making close() sync change anything (reentrancy)? Or does setting a very long debounce interval?
    func close() {
        // prevent debounced negotiate firing
        // await _debounce.cancel()
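
For context on the reentrancy question, the following sketch shows what a suspension point inside an actor method allows. It is not the SDK's Transport: `ReentrancySketch`, `negotiate()` and the `whileSuspended` parameter are hypothetical names chosen for illustration.

```swift
actor ReentrancySketch {
    private(set) var log: [String] = []

    func negotiate() {
        log.append("negotiate")
    }

    /// `work` stands in for any awaited call inside close(), such as cancelling a debounce.
    func close(whileSuspended work: @Sendable () async -> Void) async {
        log.append("close: begin")
        await work()                 // suspension point: the actor can run other calls here
        log.append("close: end")
    }
}

func demonstrateReentrancy() async -> [String] {
    let transport = ReentrancySketch()
    // While close() is suspended, negotiate() is free to run on the same actor.
    await transport.close { await transport.negotiate() }
    return await transport.log
}
```

Here `negotiate` is logged between the two `close` entries. Without the `await` inside `close()`, no other call on the actor could interleave with it, which is what the suggested experiment would rule in or out.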

6bangs commented May 20, 2025

Hi Błażej, thanks for trying to repro! I upgraded our fork to 2.6.0 and the issue reproduced on one of my devices after only 2 connection attempts. I'm 100% sure it's blocking in _pc.close, here's the trace:

* thread #43, queue = 'com.apple.root.user-initiated-qos.cooperative'
    frame #0: 0x0000000103140014 libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000010302aab8 libsystem_pthread.dylib`_pthread_cond_wait + 976
    frame #2: 0x00000001048442b4 LiveKitWebRTC`___lldb_unnamed_symbol9036 + 528
    frame #3: 0x000000010471adf8 LiveKitWebRTC`___lldb_unnamed_symbol5113 + 244
  * frame #4: 0x0000000105f694c0 LiveKitExample.debug.dylib`Transport.close() at Transport.swift:192:13
    frame #5: 0x0000000105ef33cc LiveKitExample.debug.dylib`Room.cleanUpRTC() at Room+Engine.swift:53:27
    frame #6: 0x0000000105f39748 LiveKitExample.debug.dylib`Room.cleanUp(disconnectError=nil, isFullReconnect=false) at Room.swift:441:15
    frame #7: 0x0000000105f3a970 LiveKitExample.debug.dylib`Room.disconnect() at Room.swift:422:15
    frame #8: 0x0000000105dfbeb4 LiveKitExample.debug.dylib`closure #4 in RoomView.body.getter() at RoomView.swift:616:24
    frame #9: 0x0000000105dfd814 LiveKitExample.debug.dylib`partial apply for closure #4 in RoomView.body.getter at <compiler-generated>:0
    frame #10: 0x00000001d32cd27c SwiftUI`(1) await resume partial function for partial apply forwarder for closure #1 @Sendable () async -> () in SwiftUI.FeedbackGenerator.body(content: SwiftUI._ViewModifier_Content<SwiftUI.FeedbackGenerator<τ_0_0>>) -> some
    frame #11: 0x00000001d32cd27c SwiftUI`(1) await resume partial function for partial apply forwarder for closure #1 @Sendable () async -> () in SwiftUI.FeedbackGenerator.body(content: SwiftUI._ViewModifier_Content<SwiftUI.FeedbackGenerator<τ_0_0>>) -> some
    frame #12: 0x00000001d3521d40 SwiftUI`(1) await resume partial function for generic specialization <()> of reabstraction thunk helper <τ_0_0 where τ_0_0: Swift.Sendable> from @escaping @isolated(any) @callee_guaranteed @async () -> (@out τ_0_0) to @escaping @callee_guaranteed @async () -> (@out τ_0_0, @error @owned Swift.Error)
    frame #13: 0x00000001d32cd27c SwiftUI`(1) await resume partial function for partial apply forwarder for closure #1 @Sendable () async -> () in SwiftUI.FeedbackGenerator.body(content: SwiftUI._ViewModifier_Content<SwiftUI.FeedbackGenerator<τ_0_0>>) -> some

Screenshot showing line 192:

Image

Here's another thread I've seen several times when this reproduced:

* thread #51, name = 'AURemoteIO::IOThread'
    frame #0: 0x0000000103140014 libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000010302aab8 libsystem_pthread.dylib`_pthread_cond_wait + 976
    frame #2: 0x00000001048442b4 LiveKitWebRTC`___lldb_unnamed_symbol9036 + 528
    frame #3: 0x0000000104841920 LiveKitWebRTC`rtc::Thread::BlockingCallImpl(rtc::FunctionView<void ()>, webrtc::Location const&) + 384
    frame #4: 0x00000001047e7bdc LiveKitWebRTC`___lldb_unnamed_symbol7471 + 284
  * frame #5: 0x00000001061c2dd4 LiveKitExample.debug.dylib`RemoteAudioTrack.remove(audioRenderer=(object = 0x000060000177dd40 -> 0x000000010677cc18 type metadata for LiveKitComponents.AudioProcessor)) at RemoteAudioTrack.swift:75:24
    frame #7: 0x00000001065f49a4 LiveKitExample.debug.dylib`AudioProcessor.deinit() at AudioProcessor.swift:47:16
    frame #8: 0x00000001065f4a9c LiveKitExample.debug.dylib`AudioProcessor.__deallocating_deinit() at AudioProcessor.swift:0
    frame #9: 0x00000001959b904c libswiftCore.dylib`_swift_release_dealloc + 28
    frame #10: 0x00000001959b9a24 libswiftCore.dylib`bool swift::RefCounts<swift::RefCountBitsT<(swift::RefCountInlinedness)1>>::doDecrementSlow<(swift::PerformDeinit)1>(swift::RefCountBitsT<(swift::RefCountInlinedness)1>, unsigned int) + 148
    frame #11: 0x00000001803aaaf0 CoreFoundation`__RELEASE_OBJECTS_IN_THE_ARRAY__ + 112
    frame #12: 0x00000001803aaa30 CoreFoundation`-[__NSArrayM dealloc] + 144
    frame #13: 0x00000001800926dc libobjc.A.dylib`AutoreleasePoolPage::releaseUntil(objc_object**) + 212
    frame #14: 0x00000001800925a0 libobjc.A.dylib`objc_autoreleasePoolPop + 256
    frame #15: 0x0000000180092dc4 libobjc.A.dylib`objc_tls_direct_base<AutoreleasePoolPage*, (tls_key)3, AutoreleasePoolPage::HotPageDealloc>::dtor_(void*) + 168
    frame #16: 0x0000000103028324 libsystem_pthread.dylib`_pthread_tsd_cleanup + 616
    frame #17: 0x000000010302acbc libsystem_pthread.dylib`_pthread_exit + 80
    frame #18: 0x000000010302a5fc libsystem_pthread.dylib`_pthread_start + 116
Image

Not sure if related, but it is suspicious. I've attached the backtrace of all threads in case you need them, but these two are the ones that look most relevant to me.
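
One detail in the second trace is that `AudioProcessor.deinit` runs on the `AURemoteIO::IOThread` while that thread is exiting. A deinit runs on whichever thread releases the last strong reference, so cleanup placed in deinit can end up on a thread the author never picked. The sketch below shows only that general Swift behaviour; `RendererSketch`, `Holder` and `Flag` are hypothetical types, not SDK classes.

```swift
import Foundation

/// Hypothetical stand-in for a renderer whose deinit performs cleanup.
final class RendererSketch {
    private let onDeinit: @Sendable (Bool) -> Void
    init(onDeinit: @escaping @Sendable (Bool) -> Void) { self.onDeinit = onDeinit }
    deinit {
        // Runs on whichever thread drops the last strong reference.
        onDeinit(Thread.isMainThread)
    }
}

/// Mutable holder so the last reference can be dropped from another thread.
final class Holder: @unchecked Sendable { var renderer: RendererSketch? }

/// Result box written from deinit and read only after the semaphore wait.
final class Flag: @unchecked Sendable { var value: Bool? }

func deinitRanOnMainThread() -> Bool? {
    let finished = DispatchSemaphore(value: 0)
    let ranOnMain = Flag()
    let holder = Holder()
    holder.renderer = RendererSketch { isMain in
        ranOnMain.value = isMain
        finished.signal()
    }
    // Drop the only strong reference on a background queue.
    DispatchQueue.global().async { holder.renderer = nil }
    finished.wait()
    return ranOnMain.value
}

let deinitOnMain = deinitRanOnMainThread()
print("deinit ran on main thread:", String(describing: deinitOnMain))
```

If a deinit like this makes a blocking call into another thread, as frame #3 of the trace suggests, the releasing thread waits for as long as that other thread is busy.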

I tried making Transport.close sync and commenting out the debounce: no dice. Compiler note: Calls to instance method 'close()' from outside of its actor context are implicitly asynchronous
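
The compiler note quoted above reflects a general rule for actors: a method can be synchronous inside the actor and still require `await` at call sites outside it. A minimal sketch, with `TransportSketch` and `shutDown` as hypothetical names rather than the SDK's types:

```swift
actor TransportSketch {
    private(set) var isClosed = false

    // Synchronous inside the actor: the body contains no suspension point.
    func close() {
        isClosed = true
    }
}

func shutDown(_ transport: TransportSketch) async -> Bool {
    // From outside the actor the call is implicitly asynchronous,
    // because the caller may have to hop onto the actor's executor first.
    await transport.close()
    return await transport.isClosed
}
```

So dropping `async` from `close()` removes the suspension point inside the method, but callers outside the actor still await the call itself.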

backtrace.txt
