Research & Theory
I wonder if multi modal can just work with any vision model though? It’s just taking...
I wonder if multi modal can just work with any vision model though? It’s just taking the visual input and translating it to text for better context