Transforming Computer Vision with AI and Generative AI

Conventional computer vision techniques relied on manual feature extraction and classical algorithms to interpret images and videos; modern computer vision is driven by end-to-end deep learning models and generative AI (GenAI). This shift opens up greater possibilities for use cases like autonomous driving, object identification, and workplace safety.

By 2032, the global computer vision market is projected to grow more than eightfold, from USD 20.31 billion to USD 175.72 billion.[1] The fast-evolving landscape of AI and computer vision is producing remarkably diverse applications across industries, such as camera-equipped patrol robots for the Singapore Police Force[2] and Abu Dhabi’s first multimodal Intelligent Transportation Central Platform, implemented as part of the capital’s urban transportation strategies.

Evolution of Computer Vision

Early computer vision systems relied on manually engineered features, which, though effective for certain tasks, were limited by their inability to adapt to dynamic and unstructured environments.

The advent of deep learning marked a pivotal shift. Large datasets and powerful GPUs made it possible to train neural networks that automatically extract features from images, sharply reducing the need for manual feature engineering.
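
To make this concrete, here is a minimal sketch of learned feature extraction, assuming PyTorch and torchvision as the framework (the article does not name one): a pretrained network yields an image descriptor with no hand-engineered filters.

```python
# Minimal sketch: learned feature extraction with a pretrained CNN.
# Framework choice (PyTorch/torchvision) is an assumption for illustration.
import torch
from torchvision import models, transforms
from PIL import Image

# Load a pretrained ResNet and drop its classification head,
# leaving a generic learned feature extractor.
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")  # hypothetical input image
with torch.no_grad():
    features = extractor(preprocess(image).unsqueeze(0))
print(features.squeeze().shape)  # a 512-dimensional learned descriptor
```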

Around 2020, early integrations of vision and text models began to reshape the computer vision landscape. By 2022, the success of the Transformer architecture and pre-training on massive datasets had brought GenAI into the spotlight.

The synergy between vision and text models has since strengthened considerably, revolutionising computer vision tasks. This fusion has enabled more sophisticated image understanding, object detection, and scene interpretation.
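
A minimal sketch of this vision-text synergy uses the open-source CLIP model through Hugging Face transformers (our illustrative library choice, not necessarily what production systems use): images and free-form text are embedded in a shared space, so arbitrary labels can be scored against an image without task-specific training.

```python
# Minimal sketch: scoring free-form text labels against an image with CLIP.
# Library and model choices are assumptions for illustration.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg").convert("RGB")  # hypothetical input frame
labels = ["a crowded platform", "an empty corridor", "smoke near a vehicle"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-text match scores
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.2f}")
```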

Building Reusable Technology Cores

At ST Engineering, we harness the power of next-gen technologies within our Group Engineering Centre (GEC), where advanced video analytics and AI models are developed, tested, and integrated into practical solutions to address real-world challenges.

Wang Shuya, Engineer, Video Analytics, GEC, explores how AI can contribute to computer vision, a subset of AI focused on making sense of visual content.

"Over the years, our technologies have evolved from simple contour recognition to complex scene understanding. The field is rapidly growing, but we strive to keep track of all the latest technologies, models, datasets, and hardware."

"We do this by leveraging a multitude of sensors, including RGB cameras, infrared cameras, depth-sensing, LiDAR, and radar, to develop sophisticated AI models."

At GEC, these technologies are developed into what we call reusable cores, an ensemble of tools and platforms that serve as the key drivers of our computer vision evolution. These cores can then be customised and scaled to power diverse applications across all our solutions, accelerating innovation throughout our organisation.
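
As a purely hypothetical sketch of the pattern (the names VisionCore and run_shared_model are ours, not GEC's real API), a reusable core exposes one stable interface that each product customises through configuration rather than new code.

```python
# Hypothetical sketch of the "reusable core" pattern: one shared analytics
# engine, customised per product via configuration. All names are
# illustrative; this is not GEC's real API.
from dataclasses import dataclass, field

def run_shared_model(frame) -> list[dict]:
    # Stand-in for the shared pretrained model; returns fake detections.
    return [{"label": "person", "score": 0.9},
            {"label": "smoke", "score": 0.7}]

@dataclass
class VisionCore:
    classes_of_interest: list[str] = field(default_factory=list)
    confidence_threshold: float = 0.5

    def detect(self, frame) -> list[dict]:
        return [d for d in run_shared_model(frame)
                if d["label"] in self.classes_of_interest
                and d["score"] >= self.confidence_threshold]

# Two products, one core: only the configuration differs.
crowd_monitor = VisionCore(classes_of_interest=["person"], confidence_threshold=0.4)
fire_watch = VisionCore(classes_of_interest=["smoke", "flame"], confidence_threshold=0.6)
print(crowd_monitor.detect(None), fire_watch.detect(None))
```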

Bridging the Gap from Concept to Product

User experience is a big part of ensuring the success of visual AI solutions in the market. Nah Wu, Principal Engineer, Video Analytics, GEC, focuses on the practical aspects of AI integration.

"We take our reusable cores and analyse how to productise them with other tech stacks for specific use cases. Our goal is to make these technologies functional and easy to use for end users on a daily basis."

"Be it for border access control, biometric access, or customer attendance tracking, our task is to identify use cases and ensure scalability, cost-effectiveness and seamless functionality," Nah Wu shared.

To ensure that AI models are not just theoretically sound but also practically deployable, we carry out rigorous testing to achieve low latency and high scalability.

“The challenge is to make sure everything flows properly and can scale when users perform searches. It’s about rethinking older technologies to achieve higher accuracy and lower costs.”
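
The article does not describe the test harness itself, but a generic latency check of the kind involved might look like this sketch: warm up the model, time repeated inferences, and report tail percentiles rather than averages.

```python
# Generic latency-measurement sketch: warm up, time repeated inferences,
# report tail percentiles. The model here is a stand-in, not AGIL Vision.
import statistics
import time

def fake_inference(frame):
    time.sleep(0.005)  # stand-in for a real model forward pass

frames = [object()] * 100

for frame in frames[:10]:        # warm-up: exclude cold-start effects
    fake_inference(frame)

latencies_ms = []
for frame in frames:
    start = time.perf_counter()
    fake_inference(frame)
    latencies_ms.append((time.perf_counter() - start) * 1000)

latencies_ms.sort()
p50 = statistics.median(latencies_ms)
p95 = latencies_ms[int(0.95 * len(latencies_ms))]
print(f"p50 {p50:.1f} ms, p95 {p95:.1f} ms")
```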

AGIL® Vision: A Leap in Video Intelligence

Powered by our GEC’s VisionX reusable core, AGIL Vision combines AI and GenAI capabilities in a revolutionary tool that can be deployed across surveillance use cases, including security and threat management, crowd management, object detection, and smoke and fire detection.

Before this reusable core was built, such a model would require a lead time of six months. But now, it can be set up for partners and customers within a month.

Tasks that previously required multiple steps, such as searching for an object across a video feed, can now be completed with a single command, turning complex search processes into manageable tasks.

Traditional video analytics engines offer basic tracking and detection capabilities, often limited to predefined scenarios. AGIL Vision enhances these with advanced features like open-vocabulary object detection and automatic video understanding, making it more accurate, flexible and capable of handling complex scenarios.

"It’s a significant leap from previous object detection and classification engines. Our models are pre-trained on vast datasets, enabling capabilities like open vocabulary object detection without excessive training."

"These capabilities have shortened data collection and customised training time, cutting implementation and deployment from months to weeks," explained Shuya. 

The Way Ahead in Agentic AI

Imagine a world where artificial agents can emulate thought, adapt and even improve on their own. Agentic AI, autonomous systems designed to set complex goals and take actions to reach those goals, opens a world of possibilities for computer vision and beyond. AI agents can learn to understand the real world with the help of advanced computer vision algorithms.

The opportunities are endless. With agentic AI, we are teaching machines to understand 3D objects and automate human tasks. To unlock its full potential, we are tapping into the power of agentic tools to explore a wider range of use cases that can advance search and retrieval capabilities, automate work processes and more. 
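
The shape of such a system can be sketched as a simple sense-plan-act loop; everything below is an illustrative assumption about structure, not a description of a real GEC agent.

```python
# Skeletal sense-plan-act loop for a vision-driven agent. Every name and
# step here is an illustrative assumption, not a real GEC system.
from dataclasses import dataclass

@dataclass
class Observation:
    objects: list[str]

def perceive(frame) -> Observation:
    # Stand-in for a computer vision model's output on this frame.
    return Observation(objects=["person", "door"])

def plan(goal: str, obs: Observation) -> str:
    # Stand-in policy: pick an action that moves toward the goal.
    return "open_door" if "door" in obs.objects else "search"

def act(action: str) -> None:
    print(f"executing: {action}")

goal = "assist person through door"
for frame in range(3):            # stand-in for a video stream
    obs = perceive(frame)
    act(plan(goal, obs))
```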

Welcoming the next chapter in computer vision, Shuya is positive about agentic AI’s impacts on visual AI innovations for real-world challenges. 

"Instead of just detecting, we are moving towards active interpretation and response. The next era will see computer vision systems that not just observe, but understand and react.”them with other tech stacks for specific use cases."

These systems will interpret visual data in context, make informed decisions and initiate appropriate responses. Essentially, we can leverage the purpose-driven nature of agentic AI to complement our daily tasks and elevate computer vision capabilities to new heights.


[1] Fortune Business Insights.
[2] The Straits Times.