Univa Solution Overview for GPU-Enabled Deep Learning

The term “artificial intelligence” was coined by American cognitive scientist John McCarthy in 1955 in his proposal for the 1956 Dartmouth Conference, the first AI conference. McCarthy, known as the “father of artificial intelligence”, developed the programming language LISP in 1958, long regarded as the standard language of AI. “McCarthy said the breakthrough might come in ‘five to 500 years’ but never dismissed it.”

What is artificial intelligence and how do machine learning and deep learning relate?

In an Nvidia blog post, long-time tech journalist Michael Copeland described the relationship this way: “The easiest way to think of their relationship is to visualize them as concentric circles with AI — the idea that came first — the largest, then machine learning — which blossomed later, and finally deep learning — which is driving today’s AI explosion — fitting inside both.”

Image source: https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/

An article in MIT Technology Review (MIT being home of the first AI Lab, founded by McCarthy and Marvin Minsky) defines “deep-learning” as “software [that] attempts to mimic the activity in layers of neurons in the neocortex, the wrinkly 80 percent of the brain where thinking occurs. The software learns, in a very real sense, to recognize patterns in digital representations of sounds, images, and other data.” The article also discusses why Ray Kurzweil joined Google: “It quickly became obvious [to Kurzweil] that such an effort [to build a brain] would require nothing less than Google-scale data and computing power.”

Training complex deep learning networks can take thousands of computer processors, a scale that exceeds the technical expertise of many organizations or simply remains out of their reach today. Provisioning computing infrastructure at that scale and managing its utilization can be daunting.

Frameworks for Deep Learning, such as TensorFlow, abstract away architectural and implementation specifics at the level of individual neurons. However, the need to exploit compute resources effectively remains. General Purpose Graphics Processing Units (GPGPUs) have become well established as the de facto standard resource for Deep Learning due to their efficiency in executing certain types of code.

In response to increasing customer interest in and use of TensorFlow, Univa has recently released as open source the integration code for managing distributed TensorFlow via Univa Grid Engine; you can find the details on GitHub.
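To make the idea of managing distributed TensorFlow through a workload manager more concrete, here is a minimal sketch of one step such an integration might perform: turning the list of hosts allocated to a job into a TF_CONFIG-style cluster description. This is not Univa's integration code. The function name build_tf_config, the port number, and the rule that the first hosts become parameter servers are all assumptions made for illustration.

```python
import json


def build_tf_config(hosts, task_index, num_ps=1, port=2222):
    """Build a TF_CONFIG-style dict from a list of allocated hosts.

    Hypothetical helper: the first `num_ps` hosts act as parameter
    servers and the remaining hosts act as workers. `task_index` is the
    position of the current host in `hosts`.
    """
    if not 0 < num_ps < len(hosts):
        raise ValueError("need at least one parameter server and one worker")
    if not 0 <= task_index < len(hosts):
        raise ValueError("task_index out of range")

    addresses = [f"{host}:{port}" for host in hosts]
    cluster = {"ps": addresses[:num_ps], "worker": addresses[num_ps:]}

    # Translate the global host position into a per-role index.
    if task_index < num_ps:
        task = {"type": "ps", "index": task_index}
    else:
        task = {"type": "worker", "index": task_index - num_ps}

    return {"cluster": cluster, "task": task}


if __name__ == "__main__":
    # Host names are placeholders for whatever the scheduler allocates.
    allocated = ["node01", "node02", "node03"]
    print(json.dumps(build_tf_config(allocated, task_index=1), indent=2))
```

In a real deployment the host list would come from the workload manager's job allocation, and the resulting JSON would typically be exported as the TF_CONFIG environment variable on each host before the training script starts.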

How do HPC and HPC Cloud relate to machine learning and deep learning?

Infrastructure Requirements

As mentioned above, the computing power required to train complex neural networks can easily overwhelm most organizations’ available data center resources. Further, the need for specialized hardware, like the latest Nvidia GPUs, can introduce delays and sizable capital outlays if hybrid cloud is not considered as an option.

Here is an overview of infrastructure considerations and how Univa solution components add value.

Control of Jobs and Resources

Univa Grid Engine is the de facto standard workload manager for enterprise-class deployments; ultimately, it ensures workloads execute in concert with the resources they require. Grid Engine offers fully integrated support for Distributed TensorFlow (you can learn more about this integration here) and GPUs; this enables optimal placement and utilization of these expensive resources. In addition, Grid Engine provides workload control so that application scientists get reliable workload runs and can be sure resources are cleaned up when workloads finish. Finally, Grid Engine accounting offers dependable resource consumption reporting that allows one to inspect and tune the environment for more throughput and performance.

Monitoring, Accounting and Reporting

Unisight is Univa’s powerful and highly scalable solution that monitors and presents historical data on jobs, applications, container images, users, GPUs, software licenses and hosts. Unisight can give application scientists insight into how their Deep Learning application behaves, including how it runs (for example, which server resources it actually uses).

Tying Policy Automation to Workload Attributes

Navops Launch automates the deployment and configuration of Univa Grid Engine and Univa Unisight in a hybrid cloud architecture and provides support for end-user supplied images that are pre-configured with a GPU-enabled toolchain for TensorFlow. With its REST API integration to Grid Engine, the Navops Launch policy engine has deep, time-based visibility into the workload management system’s internal environment and attributes. This gives Navops Launch full access to all Grid Engine “objects”, and scaling or reaping rules that use their attributes can easily be created in a simple pick-and-choose GUI. This enables Navops Launch to scale cloud-based infrastructure up or down dynamically and more effectively by tying those actions to workload attributes such as queue depth and wait time, and to trigger the reaping of cloud instances as they finish executing workloads.
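The kind of rule described above, which ties scaling and reaping to queue depth and wait time, can be sketched in a few lines. This is an illustrative sketch only and does not use any Navops Launch or Grid Engine API. The names QueueSnapshot and decide, along with all thresholds, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    """Hypothetical view of workload state at one point in time."""
    pending_jobs: int        # queue depth
    longest_wait_s: float    # wait time of the oldest pending job
    idle_cloud_nodes: int    # cloud instances with no running work


def decide(snapshot, max_pending=10, max_wait_s=600.0,
           jobs_per_node=4, max_new_nodes=8):
    """Return an (action, node_count) pair for the given snapshot."""
    overloaded = (snapshot.pending_jobs > max_pending
                  or snapshot.longest_wait_s > max_wait_s)
    if overloaded:
        # Ceiling division: nodes needed to absorb the pending jobs.
        needed = -(-snapshot.pending_jobs // jobs_per_node)
        to_add = max(0, min(max_new_nodes,
                            needed - snapshot.idle_cloud_nodes))
        if to_add > 0:
            return ("scale_up", to_add)
        return ("hold", 0)
    if snapshot.pending_jobs == 0 and snapshot.idle_cloud_nodes > 0:
        # Nothing is waiting, so idle cloud instances can be released.
        return ("reap", snapshot.idle_cloud_nodes)
    return ("hold", 0)


if __name__ == "__main__":
    print(decide(QueueSnapshot(pending_jobs=25, longest_wait_s=30.0,
                               idle_cloud_nodes=1)))
    print(decide(QueueSnapshot(pending_jobs=0, longest_wait_s=0.0,
                               idle_cloud_nodes=3)))
```

A production policy engine would evaluate such rules over time windows rather than a single snapshot, so that short spikes do not cause instances to be started and reaped in quick succession.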


Machine Learning applications continue to grow in popularity and business relevance, so much so that market analyst Gartner now forecasts that the global “business value” derived from AI will pass $1.2 trillion this year and that AI will account for 2.3 million jobs in 2020.

Taking advantage of this business value requires careful consideration of application frameworks and infrastructure choices. With the value Univa solutions bring to the AI discussion, it’s no wonder organizations including The Wharton School at the University of Pennsylvania and QUML have turned to Univa to enable their machine learning and deep learning applications.