How to Scale Multimedia Designs from Voice, Audio, and Video to AI | Electronic Design

Facial recognition and voice control have landed, and they're everywhere. Police officers are plucking offenders out of crowds of 60,000 or more; retail stores are pairing their high-definition displays with high-resolution cameras to monitor customers’ facial expressions; and, of course, smartphones are using facial recognition for user authentication.

The applications are myriad, yet facial recognition is essentially a form of advanced pattern recognition, which itself is being enabled by neural-network-based, deep-learning algorithms for artificial intelligence (AI). These are all of a class similar to the powerful algorithms used for autonomous vehicles and medical imaging, as well as simpler defect-detection applications on the factory floor or intruder detection in the home.

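The latter two systems are being trained to perform more advanced defect detection and analysis on manufactured goods and consumer behavior patterns. Classical computer-vision systems simply can't perform at that level.

The main reason deep learning has enabled such startling advances is the combination of GPUs with artificial neural networks (ANNs) and their variants, such as convolutional neural networks (CNNs). Neural networks attempt to mimic the human brain, but they're essentially simple, interconnected processing elements with multiple weighted inputs and a single output (Fig. 1). That output is fed to another hidden layer, and the process is repeated.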

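1. An artificial neural network (ANN) comprises multiple simple processing elements, each with weighted inputs and a single output. That single output forms an input to multiple elements on the next (hidden) layer. (Source: Viasat)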

In an image-processing example, an image is fed to the input. Subsequently, the first layer could perform edge detection, the second layer could do feature extraction (such as an ear and nose, or a STOP sign, or a type of defect), and the next layer could do Sobel edge detection, followed by contour detection at the next layer, and on it goes, depending on the application.
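To make the edge-detection step concrete, here is a minimal sketch of a Sobel filter in plain Python. It is an illustration only: in a trained CNN the filter weights are learned rather than fixed, and the tiny 5 × 5 test image and the function name `sobel_magnitude` are assumptions made for this example.

```python
import math

# Standard 3x3 Sobel kernels: GX responds to left-to-right intensity changes,
# GY responds to top-to-bottom intensity changes.
GX = [[-1, 0, 1],
      [-2, 0, 2],
      [-1, 0, 1]]
GY = [[-1, -2, -1],
      [0, 0, 0],
      [1, 2, 1]]


def sobel_magnitude(image):
    """Return the gradient magnitude for every interior pixel of a 2-D list."""
    rows, cols = len(image), len(image[0])
    result = []
    for r in range(1, rows - 1):
        out_row = []
        for c in range(1, cols - 1):
            gx = gy = 0
            # Slide the 3x3 window centered on (r, c) over both kernels.
            for i in range(3):
                for j in range(3):
                    pixel = image[r + i - 1][c + j - 1]
                    gx += GX[i][j] * pixel
                    gy += GY[i][j] * pixel
            out_row.append(math.sqrt(gx * gx + gy * gy))
        result.append(out_row)
    return result


# Hypothetical grayscale image: dark on the left, bright on the right.
image = [[0, 0, 0, 255, 255] for _ in range(5)]
edges = sobel_magnitude(image)
for row in edges:
    print(row)  # each row is [0.0, 1020.0, 1020.0]
```

The response is zero where the image is flat and large where the window straddles the dark-to-bright boundary, which is the kind of low-level feature a later layer can build on.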

The chain of layers through which the data is propagated and transformed is called the credit assignment path (CAP). Deep-learning systems have a relatively long CAP, meaning a large number of layers through which the data is transformed. However, “large” isn’t defined, so the boundary between shallow learning and deep learning also remains undefined.

Regardless of the number of layers, the formula for the output is simple:

Output = f₂(f₁(Input × W₁) × W₂) × W₀
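The formula can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the sigmoid activation used for both f₁ and f₂, the linear output stage, and every weight value below are assumptions chosen only to make the example run.

```python
import math


def sigmoid(x):
    """One common choice of activation function (an assumption here)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, activation):
    """Apply one layer: each node sums its weighted inputs, then activates.

    weights[j][i] is the weight from input i to node j.
    """
    return [activation(sum(w * x for w, x in zip(node_weights, inputs)))
            for node_weights in weights]


def forward(inputs, w1, w2, w_out):
    """Compute Output = f2(f1(Input x W1) x W2) x W0."""
    hidden1 = layer(inputs, w1, sigmoid)            # f1(Input x W1)
    hidden2 = layer(hidden1, w2, sigmoid)           # f2(... x W2)
    return layer(hidden2, w_out, lambda v: v)       # ... x W0, no activation


# Hypothetical network: 2 inputs, two hidden layers of 2 nodes, 1 output.
w1 = [[0.2, -0.4], [0.7, 0.1]]
w2 = [[0.5, 0.5], [-0.3, 0.8]]
w_out = [[1.0, -1.0]]
print(forward([1.0, 2.0], w1, w2, w_out))
```

Each node does nothing more than a weighted sum and one function call, so the cost of a network is dominated by how many of those sums it has to compute.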

Given that an ANN can have hundreds of layers and thousands of nodes, the calculations grow rapidly. That’s where GPUs factor into the equation.
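A rough way to see how quickly the work grows is to count multiply-accumulate (MAC) operations for one forward pass. The sketch below assumes fully connected layers, which the article does not specify, and the layer sizes are hypothetical.

```python
def count_macs(layer_sizes):
    """Count multiply-accumulates for one pass through fully connected layers.

    layer_sizes lists the node count of each layer, input first. Every node
    in one layer feeds every node in the next, so each pair of adjacent
    layers costs (nodes_in * nodes_out) MACs.
    """
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


small = count_macs([4, 8, 2])        # 4*8 + 8*2 = 48
large = count_macs([1000] * 101)     # 100 weight layers of 1000 x 1000
print(small, large)                  # 48 100000000
```

A toy network needs a few dozen MACs per input, while 100 layers of 1,000 nodes need 100 million, and every one of them is independent within its layer.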

The GPU's Role in Intelligence

GPUs are the foundation of gaming platforms, because unlike CPUs, they’re specifically designed to process data in parallel using hundreds or thousands of simple, dedicated cores (Fig. 2). This is ideal for graphics processing, where the same processing function must be performed on large amounts of data.

2. GPUs can scale to hundreds or thousands of simple processing elements that act in parallel for ANN-based deep learning. (Source: 数学)

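This kind of data processing comes naturally to a GPU, unlike a CPU, which typically performs calculations serially. While added cores and multithreading have made CPUs better at parallel data processing, the scalability of a GPU's simple processing elements makes it a natural fit for neural-network processing and, more recently, for cryptocurrency mining.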

Before a deep-learning algorithm can be embedded in a GPU for a given application, it must be trained and then optimized for performance, low power, and the smallest possible memory usage.
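One common way to shrink a trained model's memory footprint is to store weights as 8-bit integers instead of 32-bit floats, cutting weight storage to roughly a quarter. The article doesn't name a specific optimization, so the sketch below is an assumed example of the general idea; the function names and the sample weights are hypothetical.

```python
def quantize(weights):
    """Map float weights to signed 8-bit integers using one shared scale.

    The largest-magnitude weight maps to +/-127; the rest scale linearly.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the 8-bit values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.02, 1.0]        # hypothetical trained weights
quantized, scale = quantize(weights)
restored = dequantize(quantized, scale)
print(quantized)                         # [50, -127, 2, 100]
print(max(abs(a - b) for a, b in zip(weights, restored)))  # rounding error
```

The trade-off is a small rounding error per weight (at most half the scale step), which is why quantized models are normally re-checked for accuracy before deployment.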

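However, this training process requires a great deal of back-end processing, which is why Google came up with TensorFlow, an open-source training framework that can run in the cloud on large banks of GPUs to generate trained models through iteration and data labeling. There are other frameworks, of course. Among the popular ones are Caffe, which is written in C++ (TensorFlow is typically driven through a slower Python front end over a C++ core); Theano, which also uses Python and is a direct competitor to TensorFlow; and Microsoft's C++-based CNTK.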

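When looking for a framework, consider not only the language used, but also the simplicity of the interface; the number of pretrained models; how much programming is required after training; open-source versus proprietary frameworks; and how well models transfer to other frameworks and to new architectures and processors.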

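Once generated, the final model is transferred to the GPU, which now becomes the “inference engine.” There's a reason they're called inference engines and not fact engines: the whole process is understood to be inferring, or pointing toward, an extremely probable result. Accuracy is improving, but reaching 100% accuracy in any application will remain a holy grail for many years to come.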
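The "inference, not fact" point can be illustrated with a softmax over a classifier's raw output scores. This is a minimal sketch under assumptions: the article doesn't say how a given engine reports confidence, and the labels, scores, and 0.9 threshold below are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def infer(scores, labels, threshold=0.9):
    """Return (label, confidence); label is None below the threshold."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = labels[best] if probs[best] >= threshold else None
    return label, probs[best]


labels = ["stop sign", "yield sign", "no sign"]   # hypothetical classes
print(infer([5.0, 0.0, 0.0], labels))   # confident: returns "stop sign"
print(infer([1.0, 1.0, 1.0], labels))   # ambiguous: returns None
```

Even the confident case reports a probability below 1, so the application still has to decide what level of certainty is good enough to act on.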

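Keep It Simple: IoT to AI on One Multimedia Platform

While AI and its application to vision processing have generated a lot of interest, designers know it's just the tip of the spear. For applications ranging from factory displays to home and building automation and entertainment, to automotive infotainment and simple scanners at the network edge, image processing is just the beginning.

On the factory floor, human-machine interfaces (HMIs) increasingly rely on high-resolution, interactive touchscreens with fast wireless or wired links to a connected gateway that is also aggregating sensor data. The gateway then either performs analytics locally or sends the input to the cloud for deeper analysis or for general feedback to the larger system.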

Likewise, in the home, security and entertainment systems are combining both vision processing and voice control, along with sensors for climate control, presence sensing, and remote surveillance. More and more, 4K HDR video (3840 × 2160 resolution) is a baseline for displays, both for TVs and for interactive touchscreen displays used for climate control and home monitoring. This has implications for devices ranging from set-top boxes to Internet of Things (IoT) sensors and cameras in terms of how much on-board processing and communications capability is required.
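A quick back-of-the-envelope calculation shows why 4K HDR raises the processing and communications bar. Only the 3840 × 2160 resolution comes from the text; the frame rate, bit depth, and lack of chroma subsampling are assumptions for illustration.

```python
WIDTH, HEIGHT = 3840, 2160    # 4K resolution, from the text
FPS = 60                      # assumption: 60 frames per second
BITS_PER_CHANNEL = 10         # assumption: 10-bit HDR samples
CHANNELS = 3                  # assumption: three full-resolution channels

pixels_per_frame = WIDTH * HEIGHT                       # 8,294,400
bits_per_frame = pixels_per_frame * CHANNELS * BITS_PER_CHANNEL
raw_bits_per_second = bits_per_frame * FPS
raw_gbps = raw_bits_per_second / 1e9

print(f"{pixels_per_frame:,} pixels per frame")
print(f"{raw_gbps:.2f} Gb/s uncompressed")              # about 14.93 Gb/s
```

Under these assumptions a single uncompressed stream is close to 15 Gb/s, which is why hardware video decoding and compression matter so much at the edge.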

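Retail isn't being left out, either. There, digital signage combines high-definition displays with contextual and customer-aware advertising, information, and useful wayfinding features. Digital signage also embeds cameras to analyze users' expressions and behavior and match them to online profiles, in what's known in the industry as omni-touch or omnipresence marketing.

Finally, autonomous vehicles have become both consumers and generators of data, with infotainment plus interior and exterior sensing that ranges from climate monitoring to advanced driver-assistance systems (ADAS) and various levels of autonomy, all for user comfort and safety.

So that's what's happening. Now, what's the most appropriate response for designers?

Design Requirements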

Clearly, each of these applications has wildly varying processing, memory, I/O, communications, environmental tolerance, security, software, and real-time performance requirements. Digital-signage players and home video may need two or 4K HD streams, while a home automation controller and interface may only need a 1024 × 768 graphics touchscreen. An IoT device may require face recognition or just a simple camera, and either a Bluetooth connection or full Wi-Fi to connect to the cloud for instant response.

However, as speech recognition, voice control, gesture control, and high-end audio become default requirements in many applications, designers have to be able to quickly scale up and down the functionality, power, communications, and performance curve, while simultaneously reining in costs and development time.

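While one size clearly doesn't fit all, the smart route is to pick a well-supported platform that comes close. The aim is to find one that offers multiple levels of processing capability; comes with a GPU for video and graphics; has high-end audio, fast memory, wireless and wired communications, and appropriate operating-system support; and has a broad and active support ecosystem.

The goal is to avoid reinventing the wheel and to focus on your true value add, whether that's in additional hardware or in software.

Another critical design requirement is longevity. Once deployed, a system may be in the field for many years, so it needs high tolerance for harsh environments and the ability to accept updates over time. The latter point means building enough margin into the design to add features over time while still ensuring low power consumption.

3. The i.MX 8M is an interesting solution to the scaling and longevity problems for voice, audio, video, and AI applications, from home IoT devices to high-end 4K HDR video. (Source: NXP Semiconductors)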

A good example of a platform solution that can kickstart this approach is the i.MX 8M from NXP Semiconductors (Fig. 3). This is actually a family of processors that has up to four 1.5-GHz Arm Cortex-A53 cores plus a Cortex-M4 core. It’s part of a broad lineup of 32- and 64-bit solutions based on Arm technology.

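In addition to high performance and low power consumption, the i.MX 8M family offers flexible memory options and high-speed connectivity interfaces. The processors also support full 4K Ultra HD resolution and HDR (Dolby Vision, HDR10, and HLG) video quality, along with the highest levels of pro audio fidelity, with up to 20 audio channels and DSD512 audio.

Importantly, the solution supports Android, Linux, and FreeRTOS and their ecosystems, and it can also scale down to two Arm Cortex-A53 cores. The VPU and other features can be removed to lower cost and power.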

To help get a design off the ground quickly, the i.MX 8M is also supported by an evaluation kit (EVK), as well as a range of crossover processors. These high-performance MCUs solve specific problems without scaling to a full Linux machine. Included is the i.MX RT1050, a high-performance processor with real-time functionality.

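Meeting the needs of a wide range of applications in terms of performance and features, while keeping cost and development time to a minimum, is a problem every designer faces at some point. The trick is to learn one platform well, but make sure it can scale as requirements change, and that the company backing it will still be around in 10 years or so.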

Read more articles on this topic in the TechXchange: AI on the Edge