TLDR: We train a detector to find objects matching a plain-text description and apply it to a variety of multimodal understanding tasks.