SenseTime Pursues Multimodal AI, Mirroring Google’s Strategy Amid LLM Debate
Shift Towards Multimodal AI
SenseTime, a Hong Kong-listed company and leading facial recognition provider, is strategically shifting its focus towards multimodal AI, a move occurring during a period of increasing scrutiny regarding the limitations of large language models (LLMs). This transition follows the emergence of ChatGPT three years ago, in late 2022, which sparked a wave of generative AI growth.
Lin, an associate professor of data engineering at the Chinese University of Hong Kong and a key figure at SenseTime, explained that the company’s approach is similar to Google’s in the United States. Both prioritize multimodal AI, specifically beginning with vision capabilities and then integrating language abilities to create complete systems. According to Lin, this strategy aims to build “real multimodal systems,” going beyond simply adding language to existing vision models.
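Neither SenseTime nor Google has published the exact architecture Lin alludes to, but the vision-first idea can be sketched in a toy form: a vision encoder produces image features that are projected into the same space as the language model’s tokens, so the decoder reasons over both from the start, rather than a language head being bolted onto an unchanged vision pipeline. The minimal PyTorch sketch below is purely illustrative; every module name and dimension is an assumption, not either company’s actual design.

```python
import torch
import torch.nn as nn

class ToyVisionLanguageModel(nn.Module):
    """Illustrative vision-first sketch: encode images, project into the
    language model's embedding space, then decode text conditioned on both.
    All layer sizes and names are hypothetical."""

    def __init__(self, vision_dim=768, text_vocab=32_000, hidden_dim=512):
        super().__init__()
        # Stand-in for a pretrained image encoder.
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, vision_dim),
        )
        # Projection maps visual features into the language model's space,
        # so both modalities share one representation.
        self.vision_to_text = nn.Linear(vision_dim, hidden_dim)
        self.text_embed = nn.Embedding(text_vocab, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.lm_head = nn.Linear(hidden_dim, text_vocab)

    def forward(self, images, token_ids):
        visual = self.vision_to_text(self.vision_encoder(images))  # (B, hidden)
        tokens = self.text_embed(token_ids)                        # (B, T, hidden)
        # Prepend the visual feature as the first "token" the decoder sees.
        fused = torch.cat([visual.unsqueeze(1), tokens], dim=1)
        out, _ = self.decoder(fused)
        return self.lm_head(out)

model = ToyVisionLanguageModel()
logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 32_000, (2, 8)))
print(logits.shape)  # torch.Size([2, 9, 32000])
```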
Building the Infrastructure for AI Ambitions
SenseTime’s decision to invest in large-scale data centers as early as 2018 has proven crucial in supporting its current AI ambitions. This foresight mirrors Google’s own investment in infrastructure, including its Tensor Processing Units (TPUs) designed specifically for training AI models. The company’s proactive infrastructure build is intended to provide the necessary computational power for developing and deploying advanced AI applications.
As of August 2024, SenseTime’s total computing power reached approximately 25,000 petaflops, an 8.7% increase since the beginning of the year. This follows a 92% surge in computing power throughout 2023, reflecting a significant and sustained investment in hardware capabilities. This growth in computing power is essential for training increasingly complex AI models.
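As a rough illustration of what these growth figures imply, the short calculation below back-solves the approximate capacity at the start of 2024 and 2023 from the reported August 2024 total. The intermediate values are inferred from the cited percentages, not company-reported figures.

```python
# Back-of-the-envelope check of the reported compute growth.
reported_aug_2024 = 25_000          # petaflops, as of August 2024
growth_2024_ytd = 0.087             # 8.7% increase since the start of 2024
growth_2023 = 0.92                  # 92% surge during 2023

start_2024 = reported_aug_2024 / (1 + growth_2024_ytd)   # ~23,000 petaflops
start_2023 = start_2024 / (1 + growth_2023)              # ~12,000 petaflops

print(f"Implied capacity, start of 2024: {start_2024:,.0f} petaflops")
print(f"Implied capacity, start of 2023: {start_2023:,.0f} petaflops")
```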
The Rise of Multimodal AI and the LLM Debate
The move towards multimodal AI comes at a time when the capabilities and limitations of LLMs are being actively debated. While LLMs excel at text-based tasks, they often struggle to understand and interact with the real world. Multimodal AI, which combines different types of data such as images, video, and audio, aims to overcome these limitations by providing a more holistic understanding of the world. This approach is seen as a potential pathway to more robust and reliable AI systems.
Google has been a prominent advocate for multimodal AI, exemplified by projects like Gemini, which is designed to process and understand information across various modalities. SenseTime’s alignment with this strategy suggests a belief that the future of AI lies in systems that can seamlessly integrate and reason about different types of data. The “Nano Banana Pro” mentioned by Lin is likely a reference to an image-generation model within Google’s multimodal AI ecosystem.
