(Original in org-mode format from: gist bt-ccl-debug.org)
We have a program with three threads:
cl-user> (list-threads) 1) #<process Telnet REPL Client #1 DOWN LAYER(22) [semaphore wait] #x302002A4903D> 2) #<process Telnet REPL Client #1(21) [semaphore wait] #x302002A469FD> 3) #<process Telnet REPL Server(20) [Active] #x30200291EB5D> 4) …
The server thread listens to connections and forks client threads for accepted connections.
The client thread forks a down layer thread that loops reading bytes from the client socket, and forwarding them up the protocol layers, up to a buffer in a TELNET-STREAM Gray stream.
The client thread then goes on into a REPL loop using the TELNET-STREAM Gray stream as *TERMINAL-IO*. Writing back to *TERMINAL-IO* goes down to the client socket in this client thread.
Unfortunately, when sending a byte to the upper layer, the down layer thread hangs waiting for the stream-lock. Who has locked this stream?
Neither ccl nor bordeaux-threads are very helpful in debugging that…
Recording the thread and function that holds an lock
What we'd want, is to know what thread are holding a lock. So we will implement a macro shadowing BT:WITH-LOCK-HELD, to record that information into a weak hash-table. Happily, ccl has native weak hash-tables so we don't have to use com.informatimago.clext.closer-weak.
In addition to recording the current thread, we also get the name of the caller function from ccl:backtrace-as-list.
We use a PRINT-OBJECT :AROUND method to print the locking threads when it's available
Then when the dead-lock occurs, we can have a look at the status of the various locks, and notice immediately our culprit stream lock that is held by not once but TWICE by the same thread! ccl only has recursive locks, and bt:with-lock-held uses the native locking mechanism, which is a recursive lock. But now, it is clear what functions are involved in this double locking and the solution will be obvious: split input-buffer-fetch-octet into an inner function that assumes the lock is already held, and use this in %stream-read-char. Problem solved.
(map nil 'print (com.informatimago.common-lisp.cesarum.utility:hash-table-entries *status*)) (#<recursive-lock "down-layer" [ptr @ #x605080] #x3020028D45FD>) (#<recursive-lock "Telnet REPL Server Lock" [ptr @ #x10D880] #x30200279729D>) (#<recursive-lock "telnet-stream" :status ((:locking (funcall "#<STANDARD-METHOD COM.INFORMATIMAGO.CLEXT.TELNET.STREAM::INPUT-BUFFER-FETCH-OCTET (COM.INFORMATIMAGO.CLEXT.TELNET.STREAM:TELNET-STREAM T)>" "#<TELNET-STREAM #x3020028D75CD>" "NIL") #<process Telnet REPL Client #1(21) [semaphore wait] #x3020028D192D>) (:locking (com.informatimago.clext.telnet.stream::%stream-read-char "#<TELNET-STREAM #x3020028D75CD>" "NIL") #<process Telnet REPL Client #1(21) [semaphore wait] #x3020028D192D>)) #x3020028D74BD> (:locking (funcall "#<STANDARD-METHOD COM.INFORMATIMAGO.CLEXT.TELNET.STREAM::INPUT-BUFFER-FETCH-OCTET (COM.INFORMATIMAGO.CLEXT.TELNET.STREAM:TELNET-STREAM T)>" "#<TELNET-STREAM #x3020028D75CD>" "NIL") #<process Telnet REPL Client #1(21) [semaphore wait] #x3020028D192D>) (:locking (com.informatimago.clext.telnet.stream::%stream-read-char "#<TELNET-STREAM #x3020028D75CD>" "NIL") #<process Telnet REPL Client #1(21) [semaphore wait] #x3020028D192D>)) nil
Naming Condition Variables
In addition, ccl condition-variables are not named; bordeaux-threads ignores the name parameter. So we shadow make-condition-variable and record the name of the condition variables in a weak hash-table, and add a print-object :around method to print this name when available. This is very convenient when *inspecting* threads, to see on what condition variable they're actually waiting.
Categories: Online Tutorial ccl Debugging Thread Keywords: debug debugger debugging tutorial ccl bordeaux-thread thread