DTEN D7X AI and the Next Paradigm of Video Conferencing

When Systems Became Smarter—and Still Needed Our Help

Modern video conferencing systems are often described as intelligent.

They can track speakers, frame participants automatically, recognize faces, and adapt to different room sizes. They are far more capable than anything we had just a decade ago.

And yet, if you have ever deployed or managed one, you may have noticed something slightly paradoxical: the smarter these systems become, the more they seem to ask of the user.

You are asked to select modes, draw boundaries, exclude areas, and decide what the system should or should not pay attention to. These steps are usually justified—and often necessary—but they hint at something deeper. The system can process video extremely well, but it does not truly understand the space it is operating in.

A good example of this is Video Boundary.

Video Boundary exists to solve a real problem. Cameras today are wide, powerful, and capable of capturing far beyond the physical meeting room. Without constraints, they may include people walking past glass walls, reflections, or activity outside the intended space. Asking the user to define a boundary is a practical and effective solution.

But it is also revealing.

The very need for Video Boundary tells us that the system itself does not know what the meeting room is. It sees pixels, not space. It reacts to motion, not context. The boundary is not a feature born from preference—it is a workaround for a lack of spatial understanding.

This pattern appears again and again across the evolution of video conferencing: when the system cannot infer intent or environment, we compensate with rules. When it cannot distinguish what matters, we provide configuration.

For years, this was not a flaw—it was the only viable path forward.

At DTEN, much of our work over the past several years followed this same pragmatic approach: identify a concrete problem, define it clearly, and design a solution that users could control. That philosophy helped drive meaningful progress in video quality, framing stability, and meeting equity.

But the emergence of AI changes the nature of the problem itself.

AI does not simply allow us to add smarter features. It introduces a fundamentally different possibility: systems that can perceive, interpret, and understand the room they are placed in. Not through manual rules, but through learned models of space, depth, and people.

This article is not about a single feature, nor is it a catalog of AI capabilities. It is about a paradigm shift—from rule-driven video systems to systems that can understand their environment—and how that shift shaped the design of DTEN D7X AI.

Video Boundary is just one example. The real story is what becomes possible once drawing boundaries is no longer needed.

From Features to Systems

Traditional video conferencing systems are designed around outputs: frames, crops, switches, and rules that define how those outputs should behave. In contrast, the new generation of AI-driven systems should be designed around perception—how the system understands the environment before any decision is made.

But true understanding does not come from a single algorithm or feature. It emerges only when multiple capabilities work together at the system level. In our design, this can be summarized with a simple equation:

Understanding the Room = Understanding Space + Understanding People

Understanding space means more than detecting shapes or motion. It means forming a continuous sense of the room itself—its layout, boundaries, depth, and spatial relationships. It allows the system to distinguish between foreground and background, inside and outside, presence and absence.

Understanding people goes beyond detecting faces or voices. It requires recognizing individuals consistently over time, knowing where they are located in the room, and being able to associate what is seen with what is heard.

Together, these two forms of understanding allow the system to move beyond reacting to signals and begin reasoning about context. Behavior no longer needs to be prescribed in advance through individual settings. Instead, it emerges from what the system perceives and understands about the room and the people within it.
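
Read almost literally, this equation suggests a shape for the system's internal state: a spatial model on one side, a set of continuously tracked people on the other, and room-level reasoning built on both. The sketch below is purely illustrative (none of the class or field names are DTEN APIs), but it shows how room understanding can be composed rather than configured:

    from dataclasses import dataclass, field

    @dataclass
    class SpatialModel:
        """A continuous model of the room itself (fields are illustrative)."""
        width_m: float
        depth_m: float
        height_m: float

        def contains(self, position_m: tuple) -> bool:
            """True if a 3D position (x, y, z in meters) lies inside the room volume."""
            x, y, z = position_m
            return (0.0 <= x <= self.width_m
                    and 0.0 <= y <= self.depth_m
                    and 0.0 <= z <= self.height_m)

    @dataclass
    class Person:
        """A participant tracked as a continuous presence, not a per-frame detection."""
        track_id: int
        position_m: tuple          # location in room coordinates
        appearance: list           # visual identity cue (embedding)
        voiceprint: list           # acoustic identity cue (embedding)

    @dataclass
    class RoomUnderstanding:
        """Understanding the Room = Understanding Space + Understanding People."""
        space: SpatialModel
        people: list = field(default_factory=list)

        def participants(self) -> list:
            """Only people physically inside the modeled space count as participants."""
            return [p for p in self.people if self.space.contains(p.position_m)]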

This shift—from features to systems—is not driven by software alone. It is built across hardware and software together, each designed to support continuous perception and recognition. The sections that follow describe how this understanding is constructed, starting from the most fundamental requirement: compute.

Compute: Making Continuous Understanding Possible

Modern AI workloads are both continuous and concurrent. In a real-world conferencing system, video processing, audio processing, and platform-level requirements all run in parallel. At the same time, platforms like Microsoft Teams impose their own performance and stability demands. These workloads do not take turns—they compete.

In D7X AI, the goal was not only to maximize raw performance, but also to ensure predictable capacity for continuous understanding. When perception and recognition are treated as background tasks competing with application workloads, their behavior becomes unstable. Latency increases, decisions are delayed, and the system is forced to simplify its reasoning.

To avoid this, D7X AI adopts a dual-processor architecture. An Intel Ultra CPU and a dedicated 8-core ARM processor work together to share AI workloads, while maintaining clear responsibility boundaries: one focused primarily on the collaboration platform and system orchestration, the other on audio-visual processing and perception.

This separation allows the system to sustain parallel streams of perception and recognition without interruption. AI workloads are no longer opportunistic; they are guaranteed. The system does not need to pause its understanding of the room in order to perform other tasks.
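
In D7X AI, that guarantee comes from two physical processors. As a loose analogy of the same principle on a single multi-core Linux host, perception can be given its own reserved cores so it never competes with the application workload. The sketch below is hypothetical (the core indices and loop bodies are placeholders, not the device's actual scheduler):

    import os
    import time
    import multiprocessing as mp

    # Illustrative partition (core indices are arbitrary): one set of cores for the
    # collaboration platform, another reserved exclusively for continuous perception.
    PLATFORM_CORES = {0, 1}
    PERCEPTION_CORES = {2, 3}

    def perception_loop() -> None:
        """Perception runs continuously on cores the application workload never touches."""
        os.sched_setaffinity(0, PERCEPTION_CORES)   # Linux-only: pin this process
        while True:
            # placeholder for depth estimation, detection, identity association, ...
            time.sleep(1 / 30)                      # ~30 perception updates per second

    def platform_loop() -> None:
        """Platform and orchestration work stays on its own cores."""
        os.sched_setaffinity(0, PLATFORM_CORES)
        while True:
            time.sleep(0.1)                         # placeholder for app / orchestration work

    if __name__ == "__main__":
        for target in (perception_loop, platform_loop):
            mp.Process(target=target, daemon=True).start()
        time.sleep(2)                               # let both workers run in this demo

The point is not the mechanism but the guarantee: perception capacity is reserved, not borrowed.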

With compute constraints addressed, the system can begin to focus on how it perceives the room itself.

Seeing the Room: Global Awareness as a Foundation

Primary video cameras, including the main camera and Vue Pro, are optimized for clarity, perspective, and visual quality. They are designed to present people clearly and naturally. By design, however, each observes only a portion of the room at any given moment, and from a particular direction.

This partial view works well for presenting each individual, but it is insufficient for maintaining context. When cameras switch, when people move, or when framing changes, the system risks losing continuity, which leads to instability, duplication, or misinterpretation.

For a system to understand a meeting room, it must retain awareness of the space as a whole. The AI cameras exist for this reason: not to produce video output, but to provide a persistent reference of the room itself—one that remains stable regardless of how the visible frame changes.
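
One way to picture that persistent reference: whatever a framing camera sees can be mapped into a single, fixed room coordinate frame, so a person's position does not change simply because a different camera is now looking at them. The following is a minimal sketch assuming known camera extrinsics; the transform and coordinates are invented for illustration:

    import numpy as np

    def to_room_frame(point_cam: np.ndarray, cam_to_room: np.ndarray) -> np.ndarray:
        """Map a 3D point from a camera's own frame into the shared room frame.

        cam_to_room is a 4x4 rigid transform (rotation + translation) obtained once
        from calibration and fixed thereafter.
        """
        homogeneous = np.append(point_cam, 1.0)
        return (cam_to_room @ homogeneous)[:3]

    # Illustrative calibration: the framing camera sits 2 m to the right of the
    # room origin and faces the same direction as the wide reference camera.
    framing_cam_to_room = np.eye(4)
    framing_cam_to_room[:3, 3] = [2.0, 0.0, 0.0]

    # A person detected 1.5 m in front of the framing camera...
    person_in_cam = np.array([0.0, 1.5, 0.0])

    # ...occupies the same room position regardless of which camera reported them.
    print(to_room_frame(person_in_cam, framing_cam_to_room))   # [2.  1.5 0. ]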

This global awareness establishes the foundation for spatial understanding. But awareness alone is still two-dimensional. To reason about the room rather than react to it, the system must understand depth.

Depth: From 2D Frames to 3D Rooms

For most of the history of video conferencing, cameras have done exactly what cameras were invented to do: flatten the world.

They take a three-dimensional space—defined by distance, separation, and physical relationships—and compress it into a two-dimensional image. For human viewers, this compression is rarely a problem. We intuitively understand perspective. We know that a smaller face is usually farther away, that a reflection on glass is not a person, and that someone passing behind a window is not joining the meeting.

A video system does not have that intuition, at least not inherently.

Without depth, the system has no direct way to understand space. Distance must be inferred from size. Presence must be inferred from motion. Identity must be reconstructed from persistence. These shortcuts often work well enough to pass unnoticed—until they fail.

When they do, the failure is subtle but revealing. A reflection draws attention. A static image is mistaken for a participant. A person briefly blocked from view appears to have left the room. Individually minor, these errors point to the same underlying issue: the system is reacting to images, not interpreting a space.

For a long time, this limitation was simply accepted. Reliable depth perception required specialized hardware—LiDAR, structured light, or dedicated sensing modules—none of which fit naturally into meeting rooms at scale. Rather than resolving ambiguity, the industry learned to manage it. Boundaries were drawn. Thresholds were tuned. Exceptions were added. These measures reduced visible errors, but they never addressed the source of uncertainty.

Advances in machine learning made it possible to infer depth directly from vision—allowing spatial understanding to emerge from cameras rather than from specialized sensors.

When depth becomes available through vision, the system no longer has to guess where things are. People are no longer just shapes on a screen; they occupy positions in space, at measurable distances from one another and from the room itself. Foreground and background separate naturally. Interior space and exterior movement no longer need to be inferred.
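
A small, hypothetical example makes the difference concrete. Given a per-pixel depth estimate (however it is produced) and a detected person's bounding box, deciding who is physically inside the room reduces to a comparison against the room's measured depth rather than a user-drawn boundary. The numbers below are invented for illustration:

    import numpy as np

    def median_depth(depth_map: np.ndarray, box: tuple) -> float:
        """Median depth (in meters) inside a person's bounding box (x0, y0, x1, y1)."""
        x0, y0, x1, y1 = box
        return float(np.median(depth_map[y0:y1, x0:x1]))

    def inside_room(depth_map: np.ndarray, box: tuple, room_depth_m: float) -> bool:
        """A person closer than the far wall is inside; anyone beyond it (for example,
        seen through a glass wall) is physically elsewhere and can be ignored."""
        return median_depth(depth_map, box) < room_depth_m

    # Illustrative scene: a 6 m deep room, with a depth map in metric units.
    depth_map = np.full((480, 640), 8.0)          # background beyond the glass wall
    depth_map[100:300, 200:400] = 2.5             # a participant seated 2.5 m away

    participant_box = (200, 100, 400, 300)
    passerby_box = (500, 100, 600, 300)           # someone walking past outside

    print(inside_room(depth_map, participant_box, room_depth_m=6.0))  # True
    print(inside_room(depth_map, passerby_box, room_depth_m=6.0))     # False

In this sketch, the passerby behind the glass wall is excluded not because anyone drew a boundary, but because their estimated distance places them outside the room.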

This shift is subtle, but consequential. The system begins to interpret scenes rather than respond to frames.

As spatial understanding takes shape, many situations that once required correction simply stop occurring. Reflections lack consistent spatial presence and fade into irrelevance. Static images do not occupy space and lose significance. People outside the room remain outside—not because a boundary was defined, but because they are physically elsewhere.

The system does not become more aggressive or more complex. It becomes more certain.

With depth, the meeting room is no longer treated as a collection of pixels. It begins to resemble what it actually is: a three-dimensional space with structure, distance, and continuity. And once the system operates on that basis, understanding the people within that space—and how they move through it—becomes possible.

Understanding People in Space

A meeting room is not just a space. It is a space filled with people—moving, speaking, turning away from cameras, leaning back in chairs, walking to whiteboards, and occasionally disappearing behind one another at exactly the wrong moment.

At this point, the system already knows what constitutes the room. With spatial awareness in place, it can distinguish physical presence from reflections, images, or activity outside the space. What remains is a more subtle problem: recognizing the people who occupy that space as individuals.

For humans, this comes naturally. We recognize others not from a single snapshot, but from a combination of cues. We remember what someone looks like, where they are sitting, how they move, how they sound, and how all of this aligns with what we observed moments—or meetings—before. Identity is built from consistency across time.

Video systems have traditionally struggled here because they relied on isolated signals.

Face recognition alone is fragile. A change in angle, lighting, or posture can easily break continuity. Audio-based speaker tracking has a different limitation. Techniques such as beamforming can estimate where sound comes from, but they cannot reliably determine who is speaking—especially when multiple people sit along the same direction, one in front of another.

Recognizing people requires more than direction. It requires identity.

In D7X AI, identity is formed by correlating multiple cues over time. Visual appearance provides the first anchor for identity. Spatial position, derived from depth and global perception, grounds that identity in the physical layout of the room. Voiceprint recognition completes the profile, allowing the system to form a persistent understanding of each individual present.

Together, these cues form a participant profile that remains stable even as individual signals fluctuate. Identity is no longer reconstructed frame by frame or inferred from isolated events.

This kind of understanding cannot be hard-coded. It must be learned. Rather than matching fixed patterns or relying on rigid thresholds, the system uses machine learning to associate appearance, voiceprint, and spatial position across time. The result is a coherent representation of each person as a continuous presence within the room.
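
As a rough sketch of the idea (the cue weights, embeddings, and threshold below are illustrative, not values from the product), each new observation carries an appearance cue, a voice cue, and a position, and it is associated with whichever existing profile agrees best across all three, so that no single fragile signal decides alone:

    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine similarity between two embedding vectors."""
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def match_score(profile: dict, observation: dict, weights=(0.5, 0.3, 0.2)) -> float:
        """Agreement between a stored participant profile and a new observation.

        Both are dicts with "face" and "voice" embedding arrays and a "position"
        array in room coordinates (meters). The weights are illustrative, not tuned.
        """
        w_face, w_voice, w_pos = weights
        face = cosine(profile["face"], observation["face"])
        voice = cosine(profile["voice"], observation["voice"])
        # Spatial agreement decays with distance from the last known position.
        dist = float(np.linalg.norm(profile["position"] - observation["position"]))
        spatial = float(np.exp(-dist))
        return w_face * face + w_voice * voice + w_pos * spatial

    def associate(observation: dict, profiles: list, threshold: float = 0.6):
        """Return the best-matching profile, or None if this appears to be a new person."""
        if not profiles:
            return None
        best = max(profiles, key=lambda p: match_score(p, observation))
        return best if match_score(best, observation) >= threshold else None

Because the score blends three cues, a turned head or a change in lighting weakens only one term instead of breaking the association outright.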

When Understanding Comes First

As the system’s understanding of the room and its participants became more complete, something unexpected happened. Many of the controls that once felt necessary began to feel redundant. Adjustments that once required careful tuning simply stopped being relevant.

Framing becomes steadier because the system already knows what it is framing. Identity remains consistent because people are no longer rediscovered camera by camera. Boundaries fade into the background because physical presence is no longer ambiguous.

Audio, attribution, and future forms of interaction can all build on the same underlying understanding, rather than each requiring its own logic built from scratch.

When a system understands the space it operates in, understands the people within that space, and understands how those people relate to one another over time, behavior no longer needs to be specified step by step. The system does not ask the user to explain the room. It already knows it.

What remains is a quieter experience. One with fewer interruptions, where the technology stops asking questions, and the meeting is allowed to proceed as a meeting.