#007: Pick the Right Temperature for Data Storage

Jul 03, 2022

Imagine you have many gigabytes (or more) of photos & videos on your laptop.

They’re taking up all of your local storage, but you don’t want to just delete them.

You could store them on an external hard drive.

But you risk losing or damaging it.

You could also back them all up to the cloud.

But this is going to cost you a lot of money each month that you don’t want to spend.

… Not necessarily!

This is the exact scenario I recently encountered with my YouTube recordings.

But by understanding the different temperature options I was able to store ~100GB for less than $0.90/month.

And I maybe could have gotten it as low as $0.30/month.

While there will be more costs once the time comes to retrieve the data, until then I will only pay this amount while it stays at rest.
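To put those numbers in per-GB terms, here is a quick sketch of the arithmetic. It only uses the figures quoted above (~100GB, $0.90 and $0.30 per month); since $0.90 was an upper bound on what I paid, treat the resulting rates as rough estimates rather than any provider's published price.

```python
# Figures quoted in this article: ~100 GB at rest,
# under $0.90/month, possibly as low as $0.30/month.
STORED_GB = 100


def implied_rate_per_gb(monthly_cost_usd: float, stored_gb: float) -> float:
    """Back out the per-GB monthly rate from a total monthly bill."""
    return monthly_cost_usd / stored_gb


warm_rate = implied_rate_per_gb(0.90, STORED_GB)
coldest_rate = implied_rate_per_gb(0.30, STORED_GB)

print(f"warm:    about ${warm_rate:.4f} per GB per month")
print(f"coldest: about ${coldest_rate:.4f} per GB per month")
print(f"yearly difference at rest: about ${(0.90 - 0.30) * 12:.2f}")
```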

In today’s edition, I’m going to break down:

  • The benefits of cloud data storage
  • The 3 common temperatures of data storage
  • Things to consider when deciding which temperature feels “just right”

 

Benefits of cloud storage

Data storage is often overshadowed by trendier topics in the modern data stack.

But it is critically important and can have huge impacts on costs and performance.

The rise of cloud platforms not only drastically increased the volume of data that could be stored, but also created new options for doing so.

Let’s talk about some of the biggest benefits for using cloud storage.

 

Increasingly less expensive

Take my example of just 100GB of videos and photos, but multiply it many times over.

A business no longer needs to have its own data center (or external hard drive) to store it all.

Instead, it can tap into the massive existing infrastructure of data centers and servers maintained by the likes of Microsoft, Amazon and Google.

The sheer scale of these servers combined with improvements in technology has allowed these providers to offer amazing rates on data storage.

Truth be told, the real money maker for these platforms comes from the compute and networking services on top of the data, rather than storage itself.

But as we’ll see later, there are ways to keep your storage-specific costs to a minimum.

 

Availability

Data on the cloud can be automatically replicated and backed up in servers all over the world.

This keeps your data highly available, typically 99.9% of the time or more depending on the provider and storage class.

Any issues that happen are the responsibility of the provider, not you, to fix.

And they are well-equipped to do so.

 

Main component of modern data stack

Modern data architectures often require big data storage.

And cloud options give you scale.

The last thing you want is to bottleneck your process because you ran out of storage.

Plus, the rise of ELT (vs ETL) approaches also means you’ll need an initial storage layer.

Whether as a full-on data lake or just a replacement for a network drive, cloud data storage is a critical component of any modern data stack.

 

The 3 storage temperatures

When we think about storing data on our computers, we usually think of one generic file system.

But cloud providers offer different “temperatures” for storage.

Generally speaking, the colder you choose to store your data, the less you’ll pay per GB.

Below is an example of Google Cloud Storage costs per GB by temperature (Standard = hot, Archive = coldest).

Colder storage costs less because you’re agreeing not to interact with the data for a set amount of time.

In return, the provider allocates fewer resources for things like downloads, updates and deletes.

Each platform has its own terminology for these temperature levels, but they all more or less follow the same guidelines.

 

Hot

Data that requires frequent reads, updates, inserts, downloads, etc. is considered hot.

When it comes to data pipelines, there is going to be a lot of data that moves in and out of storage every day (maybe every second).

This is hot data.

Common examples here are data streams, transaction data and weekly/monthly reporting data.

Basically anything you’d need daily, weekly or monthly access to.

You’ll pay more per GB to store hot data in exchange for having processing readily available.

But you won’t have minimum “untouched” time requirements like you do with colder data.

 

Warm

Next is warm data, sometimes called “colder” data.

This is data that you’d likely only need to access roughly once a quarter or so.

When I uploaded my YouTube recordings, I decided to go with this level of storage.

That’s because I knew I wouldn’t need to access them frequently, but every now and then I may need to reuse some footage.

As a result, I was able to secure a cheaper monthly cost for this colder storage.

I may have even been able to store them as the coldest option, which is the third level.

 

Cold

Naturally, the “coldest” option will be for data you rarely need but would still like to store.

It’s data you won’t need more than maybe once a year.

Examples here would be old audit information, reports, dashboards, etc.

While you’ll get the cheapest rates, remember you are agreeing to not touch them for a minimum amount of time.

 

How to choose

Actually setting the storage option is super easy.

You organize your files into different buckets or containers, and indicate what type of storage that container should be (hot, warm, cold).

The default is typically “hot”, sometimes called “standard”.

But once you decide, you’re expected to follow those rules if you want to make the most of your potential discounts.
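As a rough illustration, here is how that might look on Google Cloud with the `google-cloud-storage` Python library. The storage class names are Google's, but the hot/warm/cold mapping is my own assumption (Google also offers a Coldline class between Nearline and Archive), and the bucket name in the usage note is hypothetical. Treat it as a sketch, not the only way to do it.

```python
# Assumed mapping from this article's temperatures to Google Cloud Storage classes.
GCS_CLASS_BY_TEMPERATURE = {
    "hot": "STANDARD",
    "warm": "NEARLINE",
    "cold": "ARCHIVE",
}


def storage_class_for(temperature: str) -> str:
    """Translate a temperature label into a storage class name."""
    try:
        return GCS_CLASS_BY_TEMPERATURE[temperature.lower()]
    except KeyError:
        raise ValueError(f"unknown temperature: {temperature!r}") from None


def create_bucket(bucket_name: str, temperature: str, location: str = "US"):
    """Create a bucket whose default storage class matches the temperature."""
    # Imported here so the mapping above works without the library installed.
    from google.cloud import storage

    client = storage.Client()  # relies on your default Google Cloud credentials
    bucket = client.bucket(bucket_name)
    bucket.storage_class = storage_class_for(temperature)
    return client.create_bucket(bucket, location=location)
```

Calling something like `create_bucket("my-youtube-recordings", "warm")` would then create a bucket where new files default to the warm class.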

Here are things to consider when making your selection:

 

How often is it needed?

This is the most important and obvious part.

Once a year or less? Go for the coldest.

Once a quarter? Go warm.

Anything more frequent than that? Go hot.

But make sure you are thoughtful about the selection.
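The rule of thumb above can be written as a small helper. The cut-offs (once a year, once a quarter) come straight from this article, not from any provider's documentation, so adjust them to match the platform you use.

```python
def recommend_temperature(accesses_per_year: float) -> str:
    """Suggest a storage temperature from how often the data is needed."""
    if accesses_per_year < 0:
        raise ValueError("accesses_per_year cannot be negative")
    if accesses_per_year <= 1:
        return "cold"  # roughly once a year or less
    if accesses_per_year <= 4:
        return "warm"  # roughly once a quarter
    return "hot"       # monthly, weekly, daily or more


for label, frequency in [("old audit files", 0.5),
                         ("video footage", 4),
                         ("monthly reports", 12)]:
    print(f"{label}: {recommend_temperature(frequency)}")
```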

Below are Google Cloud’s minimum storage durations:

Let’s say you set it to cold but then access the files the next week.

Yes, you’ll be able to access them, but you’ll typically pay a retrieval fee, and if you delete, replace or move them early you’ll still be charged for the full minimum term.

Potentially a full year in this case.
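One way to picture the minimum term: you are billed for whichever is longer, the time the file was actually stored before being deleted or moved, or the class’s minimum duration. The sketch below uses Google Cloud Storage’s minimum durations (30 days for Nearline, 90 for Coldline, 365 for Archive). The per-GB rate is a placeholder implied by the numbers earlier in this article, and treating a month as 30 days is a simplification.

```python
# Minimum storage durations, in days, for Google Cloud Storage classes.
MIN_STORAGE_DAYS = {"STANDARD": 0, "NEARLINE": 30, "COLDLINE": 90, "ARCHIVE": 365}

# Placeholder rate per GB per month, NOT a published price.
PLACEHOLDER_RATE = 0.003


def billable_days(storage_class: str, days_stored: int) -> int:
    """Days you pay for: actual days stored or the class minimum, whichever is longer."""
    return max(days_stored, MIN_STORAGE_DAYS[storage_class])


def storage_charge(size_gb: float, rate_per_gb_month: float,
                   storage_class: str, days_stored: int) -> float:
    """Approximate storage charge, assuming a 30-day month."""
    days = billable_days(storage_class, days_stored)
    return size_gb * rate_per_gb_month * days / 30


# 100 GB placed in the coldest class, then deleted after one week.
print(billable_days("ARCHIVE", 7), "billable days")
print(f"${storage_charge(100, PLACEHOLDER_RATE, 'ARCHIVE', 7):.2f} in storage charges")
```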

 

How much data is there?

These costs are typically charged per GB.

The more data you have, the more you’ll be charged.

If you don’t have a lot of data and are unsure how often you’ll need access, then you’re probably safe to stick with the default option for a while.

Alternatively, if you have a LOT of data, then the savings can really add up if you’re able to keep it cold without any issues.
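To see how the savings scale with volume, here is a small sketch. Both rates are assumptions: the cold rate is the one implied by my own numbers earlier, and the hot rate is a made-up placeholder, so plug in your provider’s actual prices.

```python
HOT_RATE = 0.020   # hypothetical hot rate, USD per GB per month
COLD_RATE = 0.003  # rate implied by this article's ~100 GB for ~$0.30


def monthly_savings(stored_gb: float, hot_rate: float, cold_rate: float) -> float:
    """Monthly at-rest savings from keeping data cold instead of hot."""
    return stored_gb * (hot_rate - cold_rate)


for stored_gb in (100, 10_000, 1_000_000):
    saved = monthly_savings(stored_gb, HOT_RATE, COLD_RATE)
    print(f"{stored_gb:>9,} GB: about ${saved:,.2f} saved per month")
```

At 100GB the difference is pocket change, which is why sticking with the default is fine for small volumes. The gap only becomes meaningful as the volume grows.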

 

Can you split it up?

Lastly, if you are caught in between these options, see if there are ways to split it up based on use case.

This hybrid approach is common but requires attention to detail.

Make sure you are diligent to avoid accidentally triggering unnecessary costs.
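A minimal sketch of the hybrid approach: label each dataset with a temperature and group them so each group can live in its own bucket. The dataset names and sizes are made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    size_gb: float
    temperature: str  # "hot", "warm" or "cold"


def plan_buckets(datasets):
    """Group datasets by temperature, one group per bucket."""
    plan = defaultdict(list)
    for dataset in datasets:
        plan[dataset.temperature].append(dataset)
    return dict(plan)


# Hypothetical datasets, for illustration only.
datasets = [
    Dataset("daily-transactions", 50, "hot"),
    Dataset("raw-recordings", 100, "warm"),
    Dataset("old-audit-logs", 400, "cold"),
    Dataset("weekly-reports", 5, "hot"),
]

plan = plan_buckets(datasets)
for temperature, members in plan.items():
    total_gb = sum(d.size_gb for d in members)
    names = ", ".join(d.name for d in members)
    print(f"{temperature} bucket ({total_gb:g} GB): {names}")
```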

 

Data storage isn’t glamorous, but without it there is no data engineering.

By understanding the differences in temperature options you can be more strategic with how you store your data and even save yourself some money.

 


 

That's all for this edition. One (or more) tips to help you level up your skills as a data engineer.

If you found this helpful, the best way to say thanks is to share it with somebody else.

Thank you for reading and I'll see you next time.

- Mike

 
