Cloud Crunch


S2E10: 5 Strategies to Maximize Your Cloud’s Value: Strategy 1 - Create Competitive Advantage from your Data

ABOUT THIS EPISODE

AWS data expert Saunak Chandra joins today's episode to break down the first of five strategies used to maximize your cloud's value: creating competitive advantage from your data. We look at tactics including Amazon Redshift, the RA3 node type, performance best practices, data warehouses, and varying data structures.

Involve, solve, evolve. Welcome to Cloud Crunch, the podcast for any large enterprise planning on moving to, or in the midst of moving to, the cloud, hosted by the cloud computing experts from 2nd Watch: Ian Willoughby, Chief Architect of Cloud Solutions, and Skip Berry, Executive Director of Cloud Enablement. And now, here are your hosts of Cloud Crunch.

Welcome back to Cloud Crunch, season two. Today I have a couple of guests with me: Rob Whelan with 2nd Watch, and AWS data expert Saunak Chandra. Welcome, guys. Very excited to have both of you here.

Absolutely.

Last week we gave you the CliffsNotes versions of five strategies you should consider to maximize the value of being in the cloud. In the next few episodes, we're going to examine each of these strategies in more detail, starting with creating competitive advantage from your data. To add to our discussion today, we have a very special guest, Saunak Chandra from AWS, and I want to give a little bit of his background for the audience, so I'm going to read my little canned speech here. Saunak is a senior solutions architect specializing in data and analytics with AWS. He has over 15 years of experience designing and building scalable and secure data lake and data warehouse solutions. Saunak helps customers build their data strategy, from proof of concept to final architecture design, using big data and AI/ML technologies. He is an advocate of data lakes and machine learning and has written several blogs and GitHub code samples in the data space. Welcome again to the show. I'm really excited about both of you being here. Rob is the practice manager of our data analytics and big data practice, so this is going to be great. You both have tremendous experience in this area, and this is going to be really fun. So why don't we go ahead and open it up? Rob, I think you've got some questions, so let's just jump right in.

Sounds good, and thank you. Saunak, it's so great to be with you again. We've worked together on so many projects, and you've always been so accessible and so knowledgeable, but let's get right to it. When it comes to Redshift, we want to use it to analyze large amounts of data, whether we're just querying it or visualizing it in a dashboard. So what are some tips you have for minimizing the time between loading data into the Redshift cluster and visualizing that data?

First of all, thanks, Rob; thanks, Ian. Thanks for inviting me. It's a pleasure to talk to you and share some of the best practices and experience I've gathered working with a lot of customers and partners on data warehouse best practices, Redshift specifically, but also AWS data technologies in general.

So when it comes to data visualization and data warehousing on Amazon Redshift, the very first and most common data ingestion point is Amazon S3. That's the de facto standard for data ingestion, the landing zone if you will, for data coming in from your CSV or Excel files, or maybe data coming from traditional RDBMS databases. So S3 is the first landing zone and ingestion point, and the reason S3 is so prominent for data loading is that it improves your object transfer from S3 into Amazon Redshift, improves your S3 read throughput, and maximizes your parallelism, which Redshift is very good at. It also improves your processing, especially from the data ingestion perspective, if you can spread the data out across multiple objects in Amazon S3. When we talk about multiple objects, you can think of it as putting up multiple files, whether that's CSV or any other kind of text-delimited file, or JSON, as well as more advanced columnar-format files like Parquet or ORC. You can think of it as uploading all of these files of a similar schema structure into a folder, if you will; in Amazon S3 we call that folder a prefix. So you upload all of these files under an S3 prefix and then run a Glue crawler, which is part of the AWS Glue service, to recognize the format of the data. Instead of you telling it what the data structure looks like, what the different columns and data types are, you let AWS Glue discover your data, and that makes the next step, the data loading, much simpler. So that's something we really, really stress to our customers and partners: hey, spread your data across as many files as possible instead of creating one single large file, which can become a bottleneck when loading your data and, as I said, may not utilize Redshift's parallel processing (a short load sketch appears after this exchange).

So what happens if you try to load it into Redshift without having crawled it first?

If you don't crawl it with AWS Glue, you could spend a lot of time on it, especially if the data or the file is coming from a provider whose structure you're not aware of, and it would be really painful if the file structure is not something you can read directly, which is essentially any columnar, non-character-encoded format such as Parquet or ORC.
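To make the prefix idea concrete, here is a minimal sketch of a parallel COPY from an S3 prefix of Parquet files, using the Redshift Data API. The cluster, database, bucket, table, and IAM role names are placeholders, not anything from the episode.

```python
import boto3

# Load every object under one S3 prefix into a staging table with a single COPY.
# Redshift splits the work across slices, which is why many moderately sized
# files load faster than one giant file.
rsd = boto3.client("redshift-data")

copy_sql = """
    COPY sales_staging
    FROM 's3://example-landing-bucket/sales/2020/10/'  -- a prefix, not one file
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole'
    FORMAT AS PARQUET;
"""

resp = rsd.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=copy_sql,
)
print(resp["Id"])  # statement id; poll describe_statement() to see when it finishes
```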

In that case there's no way for you to work out the structure on your own, and if the file is provided by a third party, you don't know what the columns or the different data types are. That's where the AWS Glue crawler comes in really handy: you let the crawler discover the data structure behind the file. If you don't have the Glue crawler, you need to know the columns, the column types, the schema of your data, and you need to manually create the table in Amazon Redshift and then load the data from those files into Redshift.

That's great. You've really honed in on two services before you even get to Redshift. One is S3, and it's amazing to me how much it keeps improving every year. I mean, it's an object store, right? I would always think, how does it get any better? But it seems to keep getting faster and gaining capabilities, so I'm very excited about that, and I cannot wait to learn what's coming out at re:Invent on that one as well; every year there's a little bit of a surprise. And the other is Glue. So let's talk about a couple of things. One is how hard these services are to use, because what I hear is that you really want to split these files up, you want a prefix. Is that difficult? Is there a strategy? Where can people go for best practices on that? And secondly, how hard is Glue to learn? I guess this is just the crawler side, because there's a lot of capability inside Glue as well, and some pricing models associated with that.

That's a good question. In terms of ease of use, everybody's familiar with Amazon S3, right? You can log in to the AWS Management Console; S3 was the first service AWS launched, and it's object storage. You don't have to really think about structuring your folders; you just upload a file and it's placed somewhere in Amazon S3. You can download the file, you can even host a static website on Amazon S3. To this point, in this regard, it has been really easy to upload a file to S3: you just need AWS Management Console access, or you can upload files through the CLI or the API; there are a lot of options out there if you want to do it programmatically. As for the AWS Glue crawler, one of its benefits is that it can recognize formats of various different types. For example, if it is a CSV file, it will understand out of the box that it's a CSV file, read the first row, which is typically the header row for a CSV file, and recognize the column names from there. If it is any other format, such as a tab-delimited text file, it will also recognize that out of the box, and for more structured or semi-structured data such as Parquet, ORC, or JSON, it can do that job seamlessly.
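For anyone wondering what "create a crawler and off you go" looks like in practice, a hedged sketch with boto3 is below; the crawler name, IAM role, catalog database, and S3 path are all hypothetical.

```python
import boto3

# Point a Glue crawler at an S3 prefix so it can discover the schema for us.
glue = boto3.client("glue")

glue.create_crawler(
    Name="sales-landing-crawler",
    Role="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    DatabaseName="landing_db",  # Glue Data Catalog database the table lands in
    Targets={"S3Targets": [{"Path": "s3://example-landing-bucket/sales/"}]},
)
glue.start_crawler(Name="sales-landing-crawler")
# When the run completes, the discovered table (columns, types, format) appears in
# landing_db and can be queried from Athena, Redshift Spectrum, or EMR.
```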

So that's really, really easy. As a customer, as a user, you just log in to AWS Glue, create a Glue crawler specifying the location of the S3 objects, and off you go: you provide the name of the database it will register into, run the crawler, and it recognizes the structure and schema and registers your table in the AWS Glue Data Catalog. Once it is in the Data Catalog, you can bring in a whole host of services and use your SQL skills to query that data, whether that's Amazon Athena, AWS Glue, Redshift Spectrum, EMR, or QuickSight.

That's great, because you get a lot of flexibility there. So there are different node types, and things along those lines, which are just different ways to launch a Redshift cluster. Obviously there are several node types, but we're going to hone in on RA3 here. What are some of the recommended patterns for using the RA3 node type compared to other cloud data warehouses?

Sure. I think we first need to understand why RA3 came about in the first place. One of the things our product and Redshift service teams recognized, talking to different customers, is that a lot of customers trying to modernize their existing data warehouse host a large volume of data, usually in the form of historical tables that don't get queried that often but constitute a huge share of their storage, somewhere between 60 and 90% of the data volume in their warehouse. One of the things Redshift already provided was Spectrum: if you have a lot of this historical data that you don't query frequently, you can offload that data to Spectrum. But there is some overhead associated with maintaining and managing Spectrum. For example, you need to create an external schema, you need to offload the data into S3, you need to partition it, and you need to create a union-all view so that your end users or business users can query the data seamlessly, whether it's the historical data or the more current data, the hot data as you call it (a sketch of this pattern follows below). So that's where RA3 comes into the picture: our service team and our product team thought outside the box and came up with a novel solution, which is a new instance type called RA3.

Great. So once you're using these nodes and you've got a lot of people using your Redshift cluster, my next question is around concurrency, and how you keep things efficient. When we talk to customers, it's pretty common for them to ask how to reduce query wait times.
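As a rough illustration of the Spectrum "offload the cold data" pattern, and its overhead, that Saunak describes, here is a sketch of the external schema plus union-all view; every schema, table, bucket, and role name is made up for the example.

```python
import boto3

rsd = boto3.client("redshift-data")

statements = [
    # External schema backed by the Glue Data Catalog (historical data stays in S3)
    """CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_hist
       FROM DATA CATALOG DATABASE 'landing_db'
       IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleSpectrumRole';""",
    # One view spanning the hot local table and the cold external table
    """CREATE OR REPLACE VIEW sales_all AS
       SELECT * FROM public.sales_current
       UNION ALL
       SELECT * FROM spectrum_hist.sales_history
       WITH NO SCHEMA BINDING;""",
]

rsd.batch_execute_statement(
    ClusterIdentifier="example-cluster",
    Database="dev",
    DbUser="awsuser",
    Sqls=statements,
)
```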

They sense that there's some collision between multiple parties using the cluster. We always tell them to start by taking a look at their workload management queues. We generally say, hey, you should have at least three or four different ones, and one is reserved for an admin. But do you have any other thoughts, beyond workload management, for more advanced ways to manage concurrency?

Yeah, let me get to that point, but first let me finish my thought on the RA3 node type. We just covered why RA3 came about in the first place; now, what are the different patterns we've seen customers use, especially the ones who had already been running Amazon Redshift on one of the legacy instance types, DS2 and DC2? For customers already running on DS2 instance types, it's been a natural choice to move to RA3 because there's nothing to lose: RA3 runs on SSD, whereas DS2, the previous generation, was hard-disk based, so you obviously get much higher throughput. And RA3 comes with 64 terabytes of Redshift managed storage per node, which is quite a large amount of storage. The other benefit with RA3 is that the customer does not need to pay for all of the storage that comes with each node, the 64 terabytes per node; they are only billed for the amount of storage they actually occupy. So let's say the customer has 20 terabytes of active data and a two-node RA3 cluster: they're not paying for 128 terabytes of storage, they're paying only for the 20 terabytes. The benefit this brings is that if performance becomes a bottleneck, for example they have hit the ceiling on CPU utilization, they can add additional nodes without paying for additional storage; that's billed completely separately (a small resize sketch follows below). So that's the big difference from the legacy instance types, DS2 and DC2. When it comes to DC2, for customers who have been running on that instance type with very steady CPU utilization, probably under 50% most of the time, we have found it's best to move to RA3 as a cost-saving measure. DC2 is SSD-based and RA3 is SSD-based as well, so there's not much of a performance gain moving from DC2 to RA3, but if your CPU utilization is below 50%, there's a good possibility you can make some cost savings by moving to RA3.
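The "add compute without buying storage" point might look something like the following elastic resize; the cluster identifier and node counts are placeholders, and RA3 managed storage continues to be billed only on the data actually stored.

```python
import boto3

redshift = boto3.client("redshift")

# Grow a hypothetical two-node RA3 cluster to four nodes when CPU becomes the
# bottleneck. The extra nodes add compute; managed storage is billed separately,
# on usage, regardless of node count.
redshift.resize_cluster(
    ClusterIdentifier="example-cluster",
    NumberOfNodes=4,
    Classic=False,  # elastic resize
)
```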

However, if your CPU utilization is on the higher side, close to, say, 80 or 90% most of the time, it does not make sense to move to RA3, and you'd want to consider some other factors there.

Now, coming back to the original question on auto WLM, workload management, and best practices for performance under heavy concurrency. What we have seen is that auto WLM is well suited for customers who are new to Amazon Redshift, who haven't used Redshift before, maybe haven't used any kind of data warehouse before, and have very little knowledge of their workload types, whether it's ETL-heavy or mostly interactive queries like BI visualizations. After running with auto WLM for a certain number of days, they set up priority queues, and priority queues are a way to make some kinds of workloads get prioritized ahead of others. In auto WLM you have just one flat queue; there are no slots you create on your own, instead Amazon Redshift creates those slots for you. After running their data warehouse for a while, customers tend to create priority queues because they see concurrency issues across different workload types, ETL versus BI, and certain processes, such as BI queries, start to struggle because there are a lot of ETL jobs running. At that point they set up a priority queue, and as the name suggests, certain queues get more priority over other queues. In this particular case, if BI queries are struggling because of heavy ETL processes, then as the BI queries come in, they get prioritized. But the biggest advantage Redshift offers for using the cluster efficiently under heavy concurrency is concurrency scaling, a feature we launched around 2019, I think, that has been used very effectively by a lot of our customers. The benefit of concurrency scaling is that it's really applicable when you have a very spiky workload where the spike doesn't last for a long period of time, maybe an hour or a couple of hours a week or a month. It has been really effective for customers with heavily concurrent workloads.
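A hedged sketch of the kind of manual WLM setup discussed earlier, an admin queue, an ETL queue, and a BI queue with concurrency scaling enabled, is below. Queue names, memory splits, and the parameter group are illustrative; check the exact key names against the Redshift WLM documentation before using anything like this.

```python
import json
import boto3

wlm = [
    {"user_group": ["admin"], "query_concurrency": 2,
     "memory_percent_to_use": 10, "concurrency_scaling": "off"},
    {"query_group": ["etl"], "query_concurrency": 5,
     "memory_percent_to_use": 50, "concurrency_scaling": "off"},
    {"query_group": ["bi"], "query_concurrency": 10,
     "memory_percent_to_use": 30, "concurrency_scaling": "auto"},  # reads may spill to a transient cluster
    {"query_concurrency": 5, "memory_percent_to_use": 10},         # default queue
]

boto3.client("redshift").modify_cluster_parameter_group(
    ParameterGroupName="example-wlm-params",
    Parameters=[{"ParameterName": "wlm_json_configuration",
                 "ParameterValue": json.dumps(wlm)}],
)
```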

So how does concurrency scaling work? Is it just an option you turn on and it sort of works for you? What's happening behind the scenes?

Sure. Concurrency scaling creates a brand-new cluster behind the scenes without you having to know that a cluster needs to be launched. You set up concurrency scaling on your main cluster as part of the WLM setup, and Redshift determines when to launch this parallel transient cluster, as we call it, the concurrency scaling cluster. Before launching it, Redshift takes a quick snapshot of your current workload, and it launches the concurrency scaling cluster for the tables used by the queries that are waiting for a queue slot. Those tables get loaded onto the concurrency scaling cluster, and your queries get executed on that cluster instead of the main cluster. So it's completely transparent: as an end user, you would not see that a concurrency scaling cluster was created; you just run your query. But as an administrator, or if you have access to your AWS Management Console, there are plenty of metrics you can watch on the Amazon Redshift console, and they're also available as CloudWatch metrics, so you can build your own dashboards and events, set up a monitoring job to do auditing, or set up alerts that a usage spike is happening and you may need to look at it (a small monitoring sketch follows after this exchange).

Right. So for handling concurrency, the recommendation is to be sure to have workload management queues set up that make sense for your usage patterns, and also to look at concurrency scaling. Which is actually interesting: the concurrency scaling benefit is per queue, right? It's not across the entire cluster. Is it per queue?

Yeah. So right now concurrency scaling is applicable to read queries only; it's not applicable to ETL queries. That means if you're running ETL queries, those will not be eligible for concurrency scaling. At the same time, even among read-only queries, queries that require Redshift to rewrite them behind the scenes into a form that would otherwise execute much faster, for example if it needs to create a temp table, are also not eligible for concurrency scaling. In other words, if you have multiple queues set up, they are quite independent with respect to concurrency scaling: you have an option for each queue where you can set it as concurrency scaling eligible or not. That's an administrator-level setting in the WLM configuration. So once a queue has been set up as eligible for concurrency scaling, and a query in that queue is eligible, then it will leverage concurrency scaling.
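For the monitoring side, here is a small sketch of pulling one of the concurrency scaling CloudWatch metrics mentioned above; the cluster identifier is a placeholder, and the metric name should be confirmed in your own console.

```python
import boto3
from datetime import datetime, timedelta

cw = boto3.client("cloudwatch")

stats = cw.get_metric_statistics(
    Namespace="AWS/Redshift",
    MetricName="ConcurrencyScalingSeconds",  # time spent on the transient cluster
    Dimensions=[{"Name": "ClusterIdentifier", "Value": "example-cluster"}],
    StartTime=datetime.utcnow() - timedelta(days=1),
    EndTime=datetime.utcnow(),
    Period=3600,
    Statistics=["Sum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```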

We have seen that most concurrency scaling queries are BI queries, read queries, mostly simple SELECT queries. If it's a very complicated SELECT involving the creation of temp tables just to simplify the query processing, it will not be eligible for concurrency scaling. But most of the time concurrency scaling kicks in for BI queries, and we have seen that almost 97 to 98% of customers do not pay any additional fees for concurrency scaling, because you accrue one hour of concurrency scaling credit for your main cluster every day, free of cost.

That's pretty amazing. I started using Redshift a little over five years ago, and it is absolutely amazing how much it's progressed, particularly on the workload management side. I was talking to a friend of mine at a customer, and one of the nice things about all this setup, too, is that you can now put people who don't know how to write good queries into a particular queue, and it protects everybody else. We don't talk about that feature a whole lot, but you can put the people who don't really understand how to query data over there, and they kind of isolate themselves and don't impact the whole organization, at least until they learn how to do it better. So thank you for sharing all that. Now, you've got a bunch of different things going on here when we talk about Postgres, Athena, and other cloud data warehouses. They all support semi-structured data, things like structs, maps, and arrays. What is AWS's recommended pattern for how to use Redshift with such data structures?

Redshift does support semi-structured data such as structs and maps, and as you mentioned, that is only through external tables. But before going into that, let me note that local tables also support JSON-formatted semi-structured data as a source. That support comes through the COPY command, which is the common way to load data into a Redshift local table, as opposed to an external table, which lives in S3 and doesn't require you to load the data at all. With the COPY command you load the data into Amazon Redshift's local storage, whether that's hard-drive or SSD storage. You can also have a JSON column in a table. For example, if you have ten different columns, a timestamp, a source, and so on, one particular column can contain a JSON semi-structured data set, for example a sparse set of metrics that you capture as JSON; that can be part of your table as a column as well. That JSON column can then be parsed using JSON functions such as JSON_EXTRACT_ARRAY_ELEMENT_TEXT or JSON_EXTRACT_PATH_TEXT.
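A quick sketch of that "JSON column in a local table" pattern, with made-up table and column names, parsed using the Redshift JSON functions just mentioned:

```python
import boto3

rsd = boto3.client("redshift-data")

sql = """
    SELECT event_time,
           source,
           JSON_EXTRACT_PATH_TEXT(metrics_json, 'cpu', 'idle') AS cpu_idle,
           JSON_EXTRACT_ARRAY_ELEMENT_TEXT(tags_json, 0)       AS first_tag
    FROM telemetry_events
    WHERE event_time > DATEADD(hour, -1, GETDATE());
"""
rsd.execute_statement(ClusterIdentifier="example-cluster", Database="dev",
                      DbUser="awsuser", Sql=sql)
```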

Now let's talk about the semi-structured data generated by telemetry systems, such as server logs; gaming platforms and applications generate that kind of data too, and clickstreams are a very common source of it. These systems generate data in enormous volume at a sporadic frequency, and this kind of data is most commonly represented in JSON, which consists of the structures I mentioned: struct, array, and map types. Redshift supports ingesting this kind of data via a Spectrum table, so it's accessed through Spectrum and an external table, and the data can take different forms. If it is in struct form, Redshift supports accessing those struct columns using dot notation. For example, say you have a customer table, a customer has many different orders, and you want to represent all of those orders in a JSON structure; some orders might have certain columns and some orders might not. It's very common to represent those orders per customer as JSON, and since some customers have no orders, some have one or two, and some have hundreds, they are well suited to being represented in JSON as an array. The way an external table, or Spectrum, queries that data is very simple: you do not have to flatten those structures ahead of time. Instead, you create a customer table in an external schema and start querying the data using regular SQL syntax. All the sub-structures behind the main JSON structure are accessed through dot notation: you write your SELECT query, use the table name in the FROM clause, give any sub-structure within that JSON an alias, and then access the sub-structure with dot notation. It's as simple as that. If you have a more complex structure, like the arrays I mentioned, where an array holds multiple orders per customer, you can use joins. For example, if you want to capture all the orders for customers who have orders, you can use an inner join on the JSON data, the same way the dot notation works, and if you want to capture all the customers even if they don't have any orders, you can use a left join. It's the very common join principle, right? So all those kinds of joins, inner joins, left joins, outer joins, can be applied to this semi-structured and array data.
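Here is a hedged sketch of the dot-notation and join patterns for nested data in Spectrum, assuming a hypothetical external table with one row per customer and an orders array:

```python
import boto3

rsd = boto3.client("redshift-data")

inner_join_sql = """
    -- customers that have at least one order: unnest the array in the FROM clause
    SELECT c.id, c.name.given, c.name.family, o.shipdate, o.price
    FROM spectrum_sales.customers c, c.orders o;
"""

left_join_sql = """
    -- keep customers with no orders as well
    SELECT c.id, o.shipdate
    FROM spectrum_sales.customers c LEFT JOIN c.orders o ON true;
"""

for sql in (inner_join_sql, left_join_sql):
    rsd.execute_statement(ClusterIdentifier="example-cluster", Database="dev",
                          DbUser="awsuser", Sql=sql)
```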

So yes, we do see customers using external tables, especially with data coming from telemetry systems or gaming systems, to access that data through an external table.

That's great. So I want to explore some technical tips to better understand the nuances of locking, blocking, and deadlock operations in Redshift, which can have performance ramifications. What can you tell us, at a high level? What can we do to make sure we don't get into any bad situations with those things?

Sure. The good news is that Redshift has been very friendly in terms of locking. If you compare it with other RDBMSs, where table locks or row locks are very common, Redshift gives you a lot of flexibility, and it applies locks behind the scenes without you noticing, because it's so fast on immutable data storage, or block storage. Amazon Redshift allows tables to be read while they're incrementally being loaded or modified; queries simply see the latest committed version, what we call a snapshot of the data, rather than waiting for the next version to be committed. Some applications require not only concurrent querying and loading but also the ability to write to multiple tables, or to the same table, concurrently, and the mechanism by which Redshift allows concurrent writes to a single table is called serializable isolation, which essentially preserves the illusion that a transaction running against a table is the only transaction running against it. That can also lead to occasional deadlock situations for concurrent write transactions. Whenever a transaction involves updates to more than one table, there's always the possibility of concurrently running transactions becoming deadlocked when they both try to write to the same table or the same set of tables, because a transaction releases all of its table locks at once when it either commits or rolls back; it does not relinquish locks one at a time. For example, suppose two transactions, T1 and T2, start at roughly the same time. If T1 starts writing to table A and T2 starts writing to table B, both transactions can proceed without conflict. However, if T1 finishes writing to table A and needs to start writing to table B, it will not be able to proceed, because T2 still holds the lock on B; it is still writing to table B. Conversely, if T2 finishes writing to table B and needs to start writing to table A, it will not be able to proceed either, because T1 still holds the lock on A. Because neither transaction can release its locks until all of its write operations are committed, neither transaction can proceed. That's a very common kind of deadlock, and it's one of the worst practices you can have in your ETL processing. So how do you avoid this kind of deadlock? You need to schedule concurrent write operations very carefully: always update tables in the same order in your transactions, and if you are explicitly taking locks, lock tables in the same order before you perform any DML operations. There are also several ways to find where locking is happening at the transaction level, which tables are holding locks, and which tables are being blocked. You can identify those locks on the tables, and the type of lock, by session, from the system tables SVV_TRANSACTIONS and PG_LOCKS. From those tables you can find the lock mode, the blocking PID, the table ID, and whether the lock was granted. If the granted column is false, it means a transaction in another session is holding the lock, and if that's the case and everything is deadlocked, you don't have much choice: you may end up terminating the session, which you do with PG_TERMINATE_BACKEND and the PID. That's unfortunate, but if everything is blocked and nobody's moving, that's the ultimate step you need to take.
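A sketch of those lock-troubleshooting steps, with a placeholder cluster and an obviously made-up PID:

```python
import boto3

rsd = boto3.client("redshift-data")

# List transactions and their locks; rows with granted = false are waiting on a lock.
find_locks_sql = """
    SELECT xid, pid, txn_owner, relation, lock_mode, granted
    FROM svv_transactions
    ORDER BY granted, pid;
"""
rsd.execute_statement(ClusterIdentifier="example-cluster", Database="dev",
                      DbUser="awsuser", Sql=find_locks_sql)

# Last resort when everything is stuck: terminate the blocking session by PID.
rsd.execute_statement(ClusterIdentifier="example-cluster", Database="dev",
                      DbUser="awsuser", Sql="SELECT PG_TERMINATE_BACKEND(12345);")
```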

Great, great. Well, we've covered a lot today, and I do encourage our listeners: if you looked at Redshift in the past, maybe two or three years ago, look at it again. It has really evolved, and it is solving a lot of data warehousing problems. It is an extremely vibrant product, I would say, particularly the way it auto-tunes on the back end with workload management and all those types of things; it just keeps getting smarter and easier to use. So before we go, I want to ask both Rob and Saunak: what is the best way for somebody who's interested in learning more about Redshift to get started?

I'll take that first. To get started with Redshift, you need two things: a data set, and a goal for what you want to accomplish with that data set. From there it's very simple. And with that data set, really, the bigger the better; in terms of your analysis, just be open-minded about what you can find in it. If you just have a data set but no goal, you're not going to get across the finish line, and if you have a goal but no data set, obviously you can't get going. So those two things together with Redshift will be a fantastic way to get started.

Sure, and yeah, I agree that you have to have the use case, right? And with Redshift being the fastest data warehouse product on the cloud, you can do your data analysis much faster.

Whether you have tens of gigabytes of data, or hundreds of terabytes, or even petabyte-scale data, you can use the RA3 cluster node type for big data sets. If you just want to experiment, you can get started by launching a DC2 instance type; we don't encourage using DS2 anymore, but you can get started with DC2 and start loading data from S3 right away, or through the thousands of ETL tools out there if you are bringing data from RDBMS databases. You can bring in QuickSight if you want to analyze the data quickly. QuickSight has data discovery, or Redshift discovery: if the cluster is in the same account, it will quickly identify your Redshift cluster and start importing the data sets from the tables you want to analyze. And if you want to use Jupyter notebooks, a lot of ML users nowadays are very comfortable visualizing and analyzing data in a Jupyter notebook because there are so many packages you can use for visuals, we now have the Redshift Data API, which is a very flexible way to plug your Jupyter notebook into a Redshift cluster. You don't need any driver, you don't need any networking setup like security groups, and you don't even have to manage credentials. You can unload the data into S3 and load it into a pandas DataFrame, or you can load it directly into a pandas DataFrame from your Redshift cluster (a rough sketch of that workflow follows at the end of this transcript). It's very flexible.

Great. Well, I want to thank you for your time, Saunak, and thanks again for joining us, Rob. It's always good to see your smiling faces. Next week we'll look at the second strategy for increasing your cloud's value: increasing application development with DevOps. Thanks again for joining us, and if you have any feedback or comments, we welcome them; please email us at cloudcrunch@2ndwatch.com. Talk to you soon.

You've been listening to Cloud Crunch with Ian Willoughby and Skip Berry. For more information, check out the blog at 2ndwatch.com/company/blog, or reach out to 2nd Watch on Twitter.
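And here is the rough Data API plus pandas sketch referenced above: no driver, no VPC networking, just API calls. The cluster, database, user, and query are placeholders.

```python
import time
import boto3
import pandas as pd

rsd = boto3.client("redshift-data")

stmt = rsd.execute_statement(
    ClusterIdentifier="example-cluster", Database="dev", DbUser="awsuser",
    Sql="SELECT region, SUM(amount) AS total FROM sales GROUP BY region;",
)

# Poll until the statement finishes, then pull the result set into pandas.
while rsd.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)

result = rsd.get_statement_result(Id=stmt["Id"])
columns = [c["name"] for c in result["ColumnMetadata"]]
rows = [[list(field.values())[0] for field in record] for record in result["Records"]]
df = pd.DataFrame(rows, columns=columns)
print(df.head())
```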
