As for the Phi-3 Vision model, it has 4.2 billion parameters, which makes it fairly lightweight. This is the first time a mega-corporation like Microsoft has open-sourced a multimodal model of this kind. It supports a context length of 128K tokens, and you can feed it images as well. Google did release the PaliGemma model, but it's not meant for conversational use.
Apart from that, Microsoft says the Phi-3 Vision model was trained on publicly available, high-quality educational and code data. Microsoft also generated synthetic data covering math, reasoning, general knowledge, charts, tables, diagrams, and slides.

Image Courtesy: Microsoft
Despite its small size, the Phi-3 Vision model performs better than Claude 3 Haiku, LLaVA, and Gemini 1.0 Pro on many multimodal benchmarks. It even comes pretty close to OpenAI's GPT-4V model. Microsoft says that developers can use the Phi-3 Vision model for OCR, chart and table understanding, general image understanding, and more.
If you want to check out the Phi-3 Vision model, head over to Azure AI Studio.
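Developers who prefer running the model locally can also grab it from Hugging Face. As a rough sketch (the `microsoft/Phi-3-vision-128k-instruct` model ID and the `<|image_N|>` placeholder prompt format below are assumptions based on Microsoft's Hugging Face release, not details from this article), the chat prompt that pairs an image with a question can be built like this:

```python
# Hypothetical sketch of the Phi-3 Vision chat prompt format.
# Assumption: the model expects <|user|> / <|assistant|> turns with
# <|image_N|> placeholders for each attached image, as in Microsoft's
# Hugging Face model card. Verify against the official docs before use.

def build_prompt(question: str, num_images: int = 1) -> str:
    """Build a Phi-3 Vision style chat prompt with image placeholders."""
    image_tags = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
    return f"<|user|>\n{image_tags}{question}<|end|>\n<|assistant|>\n"

# The resulting string would then be passed (together with the PIL images)
# to the model's processor, e.g. via transformers with trust_remote_code=True.
print(build_prompt("What does this chart show?"))
```

The actual generation step requires loading the multi-gigabyte weights, so it is omitted here; the point is simply that each image gets a numbered placeholder inside the user turn.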