Pathological QueryInterfaces #Delphi

After my last post (about getting an interface back out of a Variant), Sebastian asked a question about the code I had posted: "Have you tested that Supports() puts nil to the 'out Intf' parameter if the interface is NOT supported?"

That was a good question — I started to answer, and then realized that I didn't know for sure. I thought it was safe, but I hadn't done any detective work to verify that. So I went hunting. Here's the implementation of Supports (the overload that takes an IInterface, which is the one I was using):

function Supports(const Instance: IInterface; const IID: TGUID;
  out Intf): Boolean;
begin
  Result := (Instance <> nil) and
    (Instance.QueryInterface(IID, Intf) = 0);
end;

I could see two cases where the Intf parameter might end up being non-nil in the event of failure:

If the body of the method never assigns to Intf, and the compiler doesn't magically nil out the variable.
If QueryInterface returns false, but still assigns a non-nil value to Intf.

#1 first. I suspected that the compiler would magically nil out the variable, but I wanted to confirm that. First stop: the CPU view. I started by writing a trivial function with the same kind of method signature Supports deals with:

function TForm1.Foo(out Intf);
begin
end;

And called it from my FormCreate. Then I did the usual "look at the CPU view, figure out what the compiler is doing, and add more stuff to make sure I've got it right" drill.

And yes, the compiler does indeed generate magic code to nil out Intf. I was expecting that magic code to be put inside my Foo() method, but it turned out to be in the caller:

Unit1.pas.36: Intf1 := nil;
0046038C 8D45FC        lea ex,[ebp-$04]
0046038F E8BC5BFAFF    call @IntfClear
Unit1.pas.37: Foo(Intf1);
00460394 8D45FC        lea ex,[ebp-$04]
00460397 E8B45BFAFF    call @IntfClear
0046039C 8BD0          mov edx,eax
0046039E 8BC3          mov eax,ebx
004603A0 E8CBFFFFFF    call TForm1.Foo

Notice those repeated lines. When it compiles the call to Foo (and sees that the interface variable is being passed to an out parameter), the compiler automatically calls @IntfClear before actually calling Foo. So I was wrong about the where, but right about the idea; the compiler does do some magic to nil the out parameter (when it's a magically-memory-managed type, at least).

(Now that I think back on it, of course the interface-clearing happens at the caller. The Foo method has no idea what type is being passed on it, so it can't call @IntfClear — the parameter might just as well be a string, a dynamic array, a float, whatever. When it's an untyped out parameter, only the caller knows the type, so it's the caller's responsibility to clear it if need be.)

Okay, so question #1 is answered. Now how about question #2? More hunting...

I found the MSDN docs for QueryInterface, and they state: "If the object does not support the interface specified in iid, *ppvObject is set to NULL." So according to the specs, any well-behaved QueryInterface, if asked for a GUID it doesn't support, should both return an error code, and set its out parameter to nil.

Next question: how sure am I that every QueryInterface in existence will always be well-behaved? This gave me pause. I mean, I can look at the way Delphi implements its own QueryInterface stuff, but what about COM objects? Can I be sure that any COM object, written by anybody in any language, is guaranteed to pass back nil? (This is the problem with duplicated information — it's obvious from the out parameter whether QueryInterface was successful, but Microsoft chose to also return a flag saying the exact same thing, opening the possibility that those two return values could be inconsistent. Thanks, Microsoft.)

So I thought through it some more. If QueryInterface is asked for a GUID that a particular object doesn't support, it will return E_NOINTERFACE. In that case, there are four meaningful things it can do with its out parameter:

Set the out parameter to nil (or leave it alone, which amounts to the same thing, since the caller already set the value to nil).
Set the out parameter to a garbage value.
Set the out parameter to point to a valid interface, but don't increment that returned object's refcount.
Set the out parameter to point to a valid interface, and do increment that returned object's refcount.

Now consider this code:

procedure X(Intf1: IInterface);
var
  Intf2: IInterface;
begin
  if Supports(Intf1, IFoo, Intf2) then
    DoSomething(Intf2);
end;

There's a local variable of an interface type, so at the end of this method, the compiler automatically generates some teardown code (which checks to see if the reference is nil, and if not, calls Intf2._Release). So what happens with that teardown code in each of those four conditions?

When X returns and its teardown code executes, Intf2 is nil. Everything runs successfully.
When the teardown code executes, Intf2 is a garbage value, but is not nil. So _Release gets called, and... boom! Virtual method call on a garbage value. If you're lucky, this would result in an Access Violation. More likely, we're talking "Do you want to send this error report to Microsoft?" and an infinite series of exception messages, until you finally get fed up and kill the process.
The specs say that QueryInterface should increment the refcount of the object it's returning. If a QueryInterface doesn't do that, then it would return an object with a refcount of zero. When our perfectly reasonable code calls DoSomething, the refcount gets incremented to one, and when DoSomething returns, the refcount gets decremented — to zero, so the object frees itself. Then we get to the finalization code. Calling a virtual method on a freed object isn't really any better than calling a virtual method on a garbage reference, so — boom!
When X returns and its teardown code executes, Intf2 is non-nil, so its _Release gets called. The object's refcount gets decremented to zero, and it gets freed. But we never called DoSomething, because QueryInterface claimed that it wasn't returning a reference.

Cases #2 and #3 are wrong in painfully, immediately obvious ways. If somebody actually implemented a QueryInterface that way, they would find out about it, real quick, because all the code that used their object would crash in a hurry.

That leaves us with case #1, which makes perfect sense, and case #4, which is ridiculous — but still possible. Code could implement QueryInterface to return a valid reference, but still return E_NOINTERFACE. That's just screwy, and it's not what the specs say you should do, but it could happen.

I suspect that there's some C++ code out there that wouldn't bother to call Release if they'd been told they weren't getting anything back. But that would lead to a memory leak, not a crash. So it's possible that there's some otherwise-valid QueryInterface code out there, somewhere, that implements case #4. It would be stupid, but it probably wouldn't be immediately obvious.

Sigh. My variant-to-interface code assumes case #1. It will do the wrong thing in case #4; namely, it will return the interface that QueryInterface lied and said it didn't have. And if QueryInterface is already lying to you, it's entirely possible that the interface it's returning is the wrong type. Then you get back into the virtual-method-calls-on-garbage-values thing. Ack.

Okay, here's the bottom line. If you use a reasonable programming language (like Delphi), and you use the built-in QueryInterface stuff that's already been written for you and doesn't have evil bugs like case #4, then you can use my variant-to-interface code without modification. It'll be safe (I just checked; TInterfacedObject.QueryInterface implements case #1, as all reasonable QueryInterfaces should). But if you're dealing with COM objects from unknown sources, or if you just like to be cautious, you may want to take a look at the revised code Sebastian posted. It shouldn't be necessary, but if you ever end up dealing with a pathological QueryInterface, the extra checking may be safer.

Moral: Don't design stupid APIs that return the same information twice!