ARTIFICIAL INTELLIGENCE is moving beyond simple large language models (LLMs) and image generators toward more sophisticated systems that can comprehend and respond to images.
Alibaba Group’s digital technology and intelligence backbone, Alibaba Cloud, has launched two open-source large vision language models (LVLMs): Qwen-VL and its conversationally fine-tuned variant, Qwen-VL-Chat. These models can comprehend images, text and bounding boxes in prompts, and support multi-round question answering in both English and Chinese.
Qwen-VL is the multimodal iteration of Qwen-7B, the 7-billion-parameter model in Alibaba Cloud’s Tongyi Qianwen family of large language models. It comprehends both images and text prompts, responding to open-ended queries about diverse images and generating image captions.
Qwen-VL-Chat extends this to more complex conversational interactions: it is designed to handle multiple image inputs and engage in multi-round question answering. Thanks to alignment fine-tuning, the assistant covers a wide range of creative tasks, from writing poetry and stories based on input images to summarizing content from multiple pictures and solving mathematical problems depicted in images.
Open-sourcing the models is part of Alibaba Cloud’s commitment to democratizing AI technologies. To this end, the company has shared the models’ code, weights and documentation with the global academic, research and commercial communities, making them available through its AI model community, ModelScope, and the collaborative AI platform Hugging Face. For commercial applications, companies with more than 100 million monthly active users need to request a license from Alibaba Cloud.
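For readers who want to try the models, a minimal sketch of querying Qwen-VL-Chat through Hugging Face Transformers might look like the following. The repository id `Qwen/Qwen-VL-Chat` and its custom `chat()` / `from_list_format()` helpers (exposed via `trust_remote_code=True`) are assumptions based on the public model card; verify the current interface there before relying on it.

```python
# Sketch of querying Qwen-VL-Chat through Hugging Face Transformers.
# The repo id and the chat()/from_list_format() helpers are assumptions
# taken from the public model card, not guaranteed by this article.

def build_query(image_path: str, question: str) -> list[dict]:
    """Interleave an image reference and a text question in the list
    format that Qwen-VL-Chat's tokenizer.from_list_format() expects."""
    return [{"image": image_path}, {"text": question}]

def ask_about_image(image_path: str, question: str) -> str:
    # Heavy imports kept local so the prompt helper stays dependency-free.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        "Qwen/Qwen-VL-Chat", trust_remote_code=True
    )
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
    ).eval()
    # Images and text are mixed in a single prompt via the list format.
    query = tokenizer.from_list_format(build_query(image_path, question))
    response, _history = model.chat(tokenizer, query=query, history=None)
    return response
```

Calling `ask_about_image("photo.jpg", "What is happening in this picture?")` would then download the weights and return the model’s answer, with `_history` carrying the context for multi-round follow-up questions.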
The introduction of Qwen-VL and Qwen-VL-Chat marks a significant stride towards revolutionizing how we interact with visual content. These models possess the remarkable ability to extract meaning and information from images, opening up possibilities like providing assistance to visually impaired individuals during online shopping.
Qwen-VL’s prowess lies in its image comprehension capabilities. While many other open-source models are limited to processing images at 224×224 resolution, Qwen-VL handles image inputs at 448×448. The higher resolution yields superior image recognition and comprehension, enhancing its utility across various applications.
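The practical effect of that resolution jump can be illustrated with a short Pillow sketch. Note that the model’s own processor handles the real preprocessing internally; this only shows the pixel budget difference between the two input sizes.

```python
# Illustration only: the resolution gap between Qwen-VL (448x448) and the
# 224x224 inputs of many earlier open-source LVLMs. Real preprocessing is
# handled by the model's own processor.
from PIL import Image

QWEN_VL_SIZE = 448   # Qwen-VL's stated input resolution
TYPICAL_SIZE = 224   # resolution used by many earlier open-source models

def to_model_resolution(img: Image.Image, size: int = QWEN_VL_SIZE) -> Image.Image:
    """Resize to the square input resolution a vision encoder typically expects."""
    return img.convert("RGB").resize((size, size))

# A 448x448 input carries 4x the pixels of a 224x224 one, which helps the
# model resolve fine detail such as small text in documents.
pixel_ratio = (QWEN_VL_SIZE ** 2) / (TYPICAL_SIZE ** 2)  # 4.0
```

Four times the pixels per image is what makes tasks like reading small text in documents or signs noticeably more tractable at the higher resolution.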
Both Qwen-VL and Qwen-VL-Chat have demonstrated exceptional performance across a range of benchmarks. From zero-shot captioning to general visual question answering and text-oriented visual question answering, these models have excelled. Additionally, their object detection capabilities have earned acclaim in the AI community.
Qwen-VL-Chat has set a benchmark in text-image dialogue and alignment, achieving levels comparable to human performance in both Chinese and English. The benchmark test, encompassing over 300 images, 800 questions, and 27 categories, showcases the model’s prowess in bridging the gap between AI and human-level understanding.
Alibaba Cloud’s dedication to pushing the boundaries of AI technology underscores the potential for these models to reshape industries and enhance human-machine interactions in unprecedented ways.