Podcast: Storage for AI needs scale, hybrid cloud and multiple integrations

In this podcast, we look at artificial intelligence (AI) and data storage with Grant Caley, UK and Ireland solutions director for NetApp.

He talks about the need for storage scalability and performance, as well as hybrid cloud, access to all three hyperscalers, and the ability to move, copy and clone data for wrangling prior to inference runs.

Caley also talks about the importance of application programming interface (API) integration, a standardised data layer that can connect into Kubernetes, integration with Python, workflow platforms such as Kafka, and Nvidia microservices and frameworks such as NIM and NeMo.

Antony Adshead: From the point of view of storage, what’s different about AI workloads?

Grant Caley: Traditional enterprise workloads are fairly well-defined as to the characteristics of that workload, the requirements for that workload.

With AI, it’s completely different. AI starts off being very small in terms of development, but it can rapidly scale to multi-petabyte production installations that span not just on-premise but the cloud as well.

When you’re looking at it from an AI workload perspective, it’s almost completely different from a kind of siloed, focused enterprise application. That means you’re having to cope with different performance requirements. The capacities you have to host for AI from a data perspective go from just gigabytes to petabytes of data, which has its own challenges.

From an AI workload perspective, you’re often having to wrangle large datasets, move them around, clone them, copy them, get them ready for cleaning and inputting, and then use them for inferencing.

There’s a lot of high maintenance that goes around the kind of requirements that sit with AI as well. And another interesting fact is that we see now that AI is not just an on-premise play. It’s an AWS [Amazon Web Services], Azure and Google Cloud play as well.

Customers are developing and leveraging all of those environments as well as their datacentres to deliver AI. And from what we’ve seen recently, AI is becoming the IP of the company: the data it leverages and the output it produces. Security of that data is critical, as is being able to evidence the data, checkpoint it and version it, because of some of the laws that are coming in around AI.

All of that makes a massive difference to how we have to treat it. And then ultimately, if you look at AI in general compared with any enterprise workload, the actual workflow is really complex and you have to kind of factor that into how you deliver for AI. So, there’s a lot going on that’s different about workloads in an AI context.

Adshead: What does storage need to cope with AI workloads?

Caley: It kind of builds on the last answer I gave. As customers start developing AI, they often start off in the cloud because the tool sets are there – the platforms – they don’t have to spend a lot of money building environments. So, you have to be able to leverage the cloud.

But equally, a lot of customers are doing it on-premise. They’re building small GPU [graphics processing unit] platforms in servers, they’re developing into bigger DGX or Nvidia SuperPods and those types of configurations.

What’s key underneath all of that from a storage perspective is the data that drives the outcomes they’re trying to achieve. That applies whether it’s the early development stages in the cloud, the move to first-step production on-premise, or how they push out data for inferencing where it’s actually needed.

That could be small factories, remote sites, whatever that happens to be. So, data mobility from the storage layer is actually key, and that means you have to not build storage silos for each of those use cases.

You have to really try to straddle those use cases and deliver something that delivers data mobility. We used to talk about delivering a data fabric, but it’s that kind of interconnectivity that’s really important.

I think the other thing for AI is that it starts off low-performance when you’re doing your first early stages of training, but that can rapidly scale.

So, performance is a big factor. You need to know that the storage can deliver from the small requirements through to the productionised, at-scale requirements. And a lot of companies forget about that when they go to production. They have created these silos of different types of storage, not realising that ultimately at some point they’re going to have to scale those significantly.

And scale is another factor the storage has to deliver. As I said, it could be gigabytes in the early days, but rapidly that can become petabytes, particularly as companies bring datasets together to try and maximise the training value and the outcomes they can deliver.

But, of course, the data is the IP of the company.

You have to put that into a storage infrastructure that delivers zero-trust administration, that delivers security and encryption of the data, and that can make results immutable or indelible if you’re doing versioning and evidence-based work, so that you can potentially prove the data as it was and the stages it went through.
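To make the idea of "proving the data as it was" concrete, here is a minimal sketch in Python that fingerprints a dataset directory with SHA-256 so two versions can be compared later. It is an illustration only, not how NetApp or any storage array implements immutability. The function names are hypothetical, and a manifest like this only counts as evidence if it is itself kept somewhere tamper-proof, such as an immutable snapshot.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash one file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(dataset_dir: str) -> dict:
    """Map each file's path (relative to the dataset root) to its SHA-256 digest."""
    root = Path(dataset_dir)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def manifest_digest(manifest: dict) -> str:
    """Reduce a manifest to one digest that identifies this dataset version."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def changed_files(before: dict, after: dict) -> list:
    """List paths that were added, removed or modified between two manifests."""
    return sorted(
        path for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )
```

Recording `manifest_digest(build_manifest(...))` alongside each training run would let a team show later which exact version of the data a model was trained on, and `changed_files` would show what moved between two stages.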

There’s a lot of things you need to do. And I think the final thing on what data storage needs to deliver is you need to be able to deliver integration into all the tools the customer is looking to use.

They’re looking at Kubernetes workloads, delivering it through Kubernetes. They’re looking at using different frameworks on-premise and in the cloud. Your storage layer, if it’s going to deliver real value, has to be able to integrate via API into all those different environments to maximise the capabilities that can be delivered from the storage layer itself.

Types of storage

Adshead: Looking at the ways data is held for AI – the type of data, such as vectors, the needs of checkpointing, the frameworks that get used like TensorFlow and PyTorch – is there anything in those that dictates the way we need to hold data in storage, the type of storage?

Caley: I think there are a couple of things. One is that the AI community doesn’t adhere to a lot of standards. Each developer or data scientist has their own set of tools they prefer to use.

It’s only as these things scale to production that you start to get standards forced in, in terms of, “OK, we’re going to use these frameworks, we’re going to use these technologies.” And consequently, the storage layer that sits underneath has to be able to accommodate all of those. Otherwise, you’re buying different types of storage for different types of customer and different types of use case.

Absolutely key is the fact that the data storage can integrate into the Kubernetes platform for a start. Most frameworks like PyTorch and TensorFlow are using Kubernetes to scale their environment, so integrating into Kubernetes to be able to make that an automated, seamless capability is important.
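As a rough illustration of what that integration looks like from the Kubernetes side, a training job usually asks for storage through a PersistentVolumeClaim that names a storage class, and the storage vendor's driver provisions the volume behind it. The sketch below builds such a claim in Python. The claim name, size and storage class `example-fast-storage` are hypothetical placeholders, not anything named in the interview.

```python
import json


def training_data_claim(name: str, size: str, storage_class: str) -> dict:
    """Build a Kubernetes PersistentVolumeClaim manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteMany lets several training pods mount the same volume,
            # provided the underlying storage driver supports that mode.
            "accessModes": ["ReadWriteMany"],
            # Hypothetical class name; the real one depends on the cluster.
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    claim = training_data_claim("training-data", "500Gi", "example-fast-storage")
    # kubectl accepts JSON as well as YAML, so this output can be saved and applied.
    print(json.dumps(claim, indent=2))
```

Generating the manifest from code rather than writing it by hand is one way to get the automated provisioning Caley describes, since the same function can be called from a pipeline for every new experiment.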

But there are other tool sets, as well. For a lot of developing stuff that leverages Python, you need to do API integration into that. We developed our own toolkits for Python integration to make that easier for customers.

But then there are ancillary technologies around AI, like Kafka, and how you do the data flows, the data cleaning, the data cleansing, the processing, etc.

All of these can benefit from the advantages that storage can bring if it can integrate storage features – like instant cloning, instant checkpointing, instant rollbacks – into those different tools that customers have. You need flexibility because you need to be able to deliver for AI on-premise, at the edge and in the cloud.
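One place where storage-level cloning surfaces in everyday tooling is Kubernetes volume cloning, where a new claim names an existing claim as its `dataSource`. The Python sketch below builds such a manifest. It assumes a CSI storage driver that supports cloning, and all names and sizes are hypothetical. Whether the clone is actually instant depends on the storage behind the driver, not on Kubernetes.

```python
import json


def clone_claim(source_claim: str, clone_name: str, size: str, storage_class: str) -> dict:
    """Build a PersistentVolumeClaim that is pre-populated from an existing claim."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": clone_name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            # The requested size must be at least the size of the source volume.
            "resources": {"requests": {"storage": size}},
            # The source claim must live in the same namespace as the clone.
            "dataSource": {"kind": "PersistentVolumeClaim", "name": source_claim},
        },
    }


if __name__ == "__main__":
    # One writable copy of the dataset per experiment (hypothetical names).
    for experiment in ("exp-a", "exp-b"):
        manifest = clone_claim("training-data", f"training-data-{experiment}",
                               "500Gi", "example-fast-storage")
        print(json.dumps(manifest, indent=2))
```

Giving each data scientist or pipeline stage its own clone means they can clean or transform the data freely and discard the copy afterwards, without touching the original dataset.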

Having a standardised data layer, such as the one NetApp can deliver, can really help reduce that complexity.

It’s back to security of data. It’s almost one of the top priorities we get asked about in AI, particularly recently, with a lot of the legislation that has been raised: can we secure this data? Can we put zero-trust in place? Can we make it available and highly available? All of these are concerns that you need to consider, depending on the tools, the frameworks.

It doesn’t really matter which tools or frameworks you’re using. All of these kinds of things are important. Integrating into NeMo, integrating into Jupyter notebooks, GPUDirect with Nvidia, Python, Kubeflow, all of these technologies.

If the storage layer can integrate into those and provide value into that, that massively helps reduce complexity and provides better go-to-market outcomes for the customer.