Research & Theory

I wonder if multi modal can just work with any vision model though? It’s just taking...

I wonder if multi modal can just work with any vision model though? It’s just taking the visual input and translating it to text for better context