Cloud Crunch

Episode · 2 years ago

S1E12: Diving into Data Lakes and Data Platforms

ABOUT THIS EPISODE

Data Engineering and Analytics expert, Rob Whelan, joins us today to dive into all things data lakes and data platforms. Data is the key to unlocking the path to better business decisions. What do you need data for? We look at the top 5 problems customers have with their data, how the cloud has helped solve these challenges, and how you can leverage the cloud for your data use.

Involve, solve, evolve. Welcome to Cloud Crunch, the podcast for any large enterprise planning on moving to, or in the midst of moving to, the cloud, hosted by the cloud computing experts from Second Watch: Ian Willoughby, Chief Architect, Cloud Solutions, and Skip Berry, Executive Director of Cloud Enablement. And now, here are your hosts of Cloud Crunch.

Hello, everybody, welcome back to Cloud Crunch, and welcome to Skip Berry, my wonderful cohost. How are you today?

Doing great, thanks for having me again.

Great. Today we have another one of our special guests, a colleague of ours, data engineering and analytics expert Rob Whelan. He's joining us today to dive into all things data lakes and data platforms. As many of you know, data is the key to unlocking the path to better business decisions. What do you need for that? Well, we're going to look at the top five problems that customers have with their data, how the cloud has helped them solve these challenges, and how you can leverage the cloud for your data use. Welcome, Rob.

Hey, thanks for having me, Ian and Skip.

Welcome. Yeah, Rob, you're the practice manager for data engineering and analytics at Second Watch. Can you give us a little bit of background on how you ended up in this wonderful role?

Yeah, I've always been interested in data. It's come easily to me, and as the cloud has become more front and center in our industry and in a lot of the different markets I've worked in, in consulting and startups and otherwise, it's become clear to me that there are some great opportunities to help people make much better decisions driven by data. Things are moving too fast to go by your gut anymore, and that is where data comes in. So I actually joined Second Watch because I wanted to see how data and machine learning could be used at the, quote unquote, normal, everyday companies.
So Uber and Netflix and those guys kind of have a corner on machine learning and AI, or at least on applying it to something real these days. So can we take any of those best practices and bring them into normal, everyday companies that are just trying to make better decisions?

That's great. So, Rob, why do customers even need data?

Well, customers need data for a few reasons and, whether or not they know it, they would like to use data for decision support. So, like I said, things are just moving too fast out there in any competitive industry for people to just go by their guts. Sometimes when people talk about being data driven, you get this impression that the data pops up on the screen, it's a metric, and the decision is made for you. But that's not really what I'm talking about by being data driven. It's more about augmenting and supporting decisions. So a decision could be: Should we hire somebody? Should we try to get this project? How many resources should we put on this prospect? If you're in sales: How hard should I go after this? What's the probability of getting it? All those things are normally informed by gut feel, which is very important, but you can make better decisions in a fast-paced environment if you have data. So, for example, if you're looking at an opportunity that you're pursuing in sales: Have you gotten any deals from this customer in the past? How much does the customer typically spend? What does the sales cycle look like? Those are all pieces of data that can inform that decision. So decision support is really the number one reason people need...

...data. Another big reason is that, again, and this is the theme here, things are just moving faster in all industries. You might have at the C level someone like a chief digital officer or chief data officer, or even the CEO, who makes it a high priority for the company to become more data centric. So then you've got a strategic initiative coming from the top where everybody needs to be involved with making data more central. You can see the effects of that if you talk to managers: all of their goals and their incentives are going to be aligned to whatever the C level is saying. And often the things that a C-level executive puts out will show up on the Internet, so you can find out if they're being data driven from the top.

So with that in mind, if you had a top five of problems that now exist in this exponentially explosive growth of data, what would they be? I read somewhere a couple of weeks ago that it was 1.4 billion devices, and that three hundred hours of new video were coming in to YouTube every hour. You just think about those little segments alone. I guess there's got to be some trending data, no pun intended, on where the problems lie around this.

Yeah, well, to your point there, Skip, around just volume of data: there's too much. So that's one of the top five that I see. The data that I have: am I getting any kind of value out of it? Is it relevant? The YouTube example is kind of a fun example. There's tons of data coming in, but that just expresses that there's a ton of videos. Well, what is YouTube doing with that? Can they get any value out of it? What are the trends in what people are posting, and why? So, relevance: that's one of the top five. At the opposite end of the spectrum, I would say: can I even see my data? You may not even know what you have, and in fact that is a more common thing that I've seen in the marketplace.
Managers are frustrated: I can't see the data that I would like to have. So, can I see it? Now, let's say you can see your data and you've got the dashboard in front of you. Can I even trust it? Untrustworthy data is, I'd say, the third biggest problem. You've got the data in front of you, but for some reason you don't trust it, and so you hesitate to make decisions based on it. A lesser problem, I guess related to that, is: is this data even recent? Is it fresh data? If data is old, that may or may not matter. In some situations the recency of data, down to the second, might really count. In other situations you don't care if it's just fresh from yesterday. So, data recency: not knowing how recent it is, or not getting it in time. And finally, I hate to use the term, but is my data siloed? Siloed data is a massive problem, and there's a reason people use that term often. I like to go on a rant that for every SaaS application you sign up for, you're creating a new silo. So if you sign up for Salesforce.com, you sign up for Mailchimp, you sign up for Trello, Atlassian, and so on, all great products, but your data is siloed there. So what that does is just kind of...

...slow you down when it comes to analysis.

Yeah, with some of the silos you just mentioned: I've had the pleasure, of course, of working with you in the past, and you've identified some of the other areas where things get siloed and how they get siloed. Some of it kind of follows the organizational structure of the company.

Yeah, and I've enjoyed working with you too, actually. Thank you for that. I stole this from Google, but someone in the data department at GCP said, look, if you want to know how data is siloed, look at the org chart of a company. That's very true. So if you look at maybe the HR organization, they've got Workday: silo. Marketing, they've got maybe HubSpot: another silo. Sales has Salesforce: silo. Operations, and so on. So that is kind of an easy way to navigate the problems, or at least to know what problems people have right off the bat.

Awesome, that's great. Moving on from that, those are great problems and, having the experience of working with you, they seem to be repetitive as well. What are some of the ways the cloud has solved these problems?

Right. So, if you think about those five problems (is my data relevant, can I see it, can I trust it, is it recent, is it siloed?), the cloud is great at making things faster and more flexible. So if you have a problem with data trustability, there are tools galore for cleansing your data. If you have problems with even seeing your data, if it's across silos, again, tools galore for breaking down those barriers. But I'll say the cloud is great at that with two caveats. One, it does take some work. There are all sorts of libraries out there to let you get data out of Salesforce, but that takes some elbow grease. You have to actually work at it. But before you even apply that elbow grease, you need to know why you're doing it, and that's the business objective that you're trying to get to. Why do we even care about this data?
What do I need to get out of it? It's pretty common for data projects to fall into a trap where you just throw a bunch of resources at a problem: hey, get me some of that data. And you've got the data, but now you're not sure what to do with it. So I really like to tell people: if you want to undertake some kind of data project, more than, I would think, any other initiative in the cloud, you really need to start with a defined business objective.

Let me add to that rant a little bit. If you're migrating out of a data center, that business pitch is pretty simple. My data center is going out of business, or my lease is coming up. I don't want to reinvest in a contract; I'd like to pay as you go in the cloud. Easy business pitch. But for data, the business pitch is different for every customer. Are you trying to reduce operational costs? Are you trying to increase your speed to market? It kind of splays out into many different pathways. So I just think it's a little bit more sophisticated. When we're talking about data, you do have to have a bit more of a business mindset.

Has the cloud actually solved these problems, in your experience and exposure to some of these?

The cloud has provided a ton of tools to do it. But there's still the problem, at the end of the day, of dirty data. Everybody's got dirty data, and as long as you are afraid or unwilling to get in there and clean up that data, all...

...the tools in the world won't help you. So again, I think there are lots of tools in the cloud, but they're really for developers. If your developers and engineers don't have a business mindset, or someone engaged in the business value of an effort, then they're just tools. So I don't think the cloud has solved the problem of unearthing value out of data, although there are some initiatives around that. It's called augmented analytics. We're probably a few years down the road from that sort of thing.

You brought up two interesting points there that I want to drill into a little bit more. One is: how do you know your data is dirty or clean? Let's start with that one.

Yeah, very simple. Dirty data is really about data validation. Usually there are problems around dates. Okay, so dates, timestamps: are dates missing? And then also anything you're getting from a web form. Think of filling out a form on Salesforce. If you're typing in a lot of text fields, humans make mistakes, so there are errors in spelling, errors in referencing customers, and that sort of thing. So to the extent that you're not validating the data coming in, like doing quality checks on it, you're going to have problems, and it's going to interfere with analysis. The net effect is that it just takes longer to get to value.

Then the second part, and you kind of touched on this a little bit, is that data and the processing behind it may not just be a data engineering role. It sounds like you alluded to the fact that the business side really needs to be involved with that. Obviously every client's a little bit different, but what advice would you give somebody on how to structure that relationship?

I think it goes back to the initial question. What do customers need data for? Well, we need it to support decisions.
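
The kind of incoming-data quality check described in this part of the conversation (missing or malformed dates, hand-typed text fields) can be illustrated with a minimal Python sketch. It is not something shown in the episode: the sample rows, the column names (`customer`, `close_date`, `amount`), and the `YYYY-MM-DD` date format are all assumptions made up for illustration.

```python
import csv
import io
from datetime import datetime

# Hypothetical export; the column names and date format are assumptions.
SAMPLE_CSV = """customer,close_date,amount
Acme Corp,2020-03-14,12000
acme corp ,2020-13-01,8000
Globex,,5000
Initech,2020-04-02,not a number
"""


def validate_row(row):
    """Return a list of human-readable problems found in one row."""
    problems = []

    # Dates: flag missing values and values that do not parse.
    date_text = (row.get("close_date") or "").strip()
    if not date_text:
        problems.append("missing close_date")
    else:
        try:
            datetime.strptime(date_text, "%Y-%m-%d")
        except ValueError:
            problems.append(f"unparseable close_date: {date_text!r}")

    # Free-text fields typed by people often carry stray whitespace.
    customer = row.get("customer") or ""
    if customer != customer.strip():
        problems.append("customer has stray whitespace")

    # Amounts should be numeric.
    try:
        float(row.get("amount") or "")
    except ValueError:
        problems.append(f"amount is not numeric: {row.get('amount')!r}")

    return problems


def validate_csv(text):
    """Map row number (1 = first data row) to its list of problems."""
    report = {}
    for number, row in enumerate(csv.DictReader(io.StringIO(text)), start=1):
        problems = validate_row(row)
        if problems:
            report[number] = problems
    return report


if __name__ == "__main__":
    for number, problems in validate_csv(SAMPLE_CSV).items():
        print(number, problems)
```

Running checks like these as data arrives, rather than at analysis time, is one way to keep dirty rows from slowing down later work; which checks matter depends on the decisions the data is meant to support.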
So if you're clear about the decisions that you need to make, that is what I call your guiding North Star for any data initiative, and you kind of need to pound that over and over and over. If you want to set up a team to explore something, or just to answer some question, you should have some engineers and at least someone from the business side. Maybe it's a business analyst, maybe it's a project manager, but someone who's in touch with the business objective. I think that's a good way to set up a team. So again, if you can go back to, hey, what decisions do I need to support and inform? That tends to be the guiding North Star. Is that kind of what you're going after?

Yeah, I was going to say, it sounds like it comes back again to making sure that you have your business objectives nailed down, right?

Yeah, back to your soapbox, like you were saying.

Yeah, you said that better. You don't defy physics, so to speak, at the end of the day, right? So, yeah, it's interesting. What's your advice for people exploring the cloud now, just for the ease of use of data where it's ubiquitous within an organization, say even a financial institution, but where it was just, you know, old back-office spreadsheets and what have you? Maybe this gets back to the business objective again, but it would be interesting to hear your thoughts: if they were just setting out on a voyage today and they have rows and rows of data, what would be some advice that you would give them on how to keep those early days intact to get to an end state?

I've got some high-level advice, I've got some tactical advice, and I've got some, maybe, words of wisdom and encouragement. The high-level advice is to go through those five categories of problems. Once you decide you want to get into data, and maybe you've identified...

...the data that would help inform those decisions, go through those five problems. Can I see it? Can I trust it? Is it recent? All those problems drive the actual physical tasks that it takes to get that data and get value out of it. So that's kind of my high-level thing: hey, what problem are we trying to solve? On the tactical side, I like to tell people: think of data as just a big pile of CSV files. A CSV file is something you open up in Excel or another spreadsheet application, and it's a table. That's what data is: no matter what format it's in, you can basically think of it as being in a table. So on the tactical side, can you get that CSV file into the cloud and into some sort of chart in the cloud, and do that as fast as possible, with as few barriers and decisions as possible? If you do that, then you've leapfrogged over all sorts of technical challenges. I guess my words of encouragement are: a data project is not done until you can see it. So really get that data, even if it's just a single CSV file, and put it in a chart somewhere in the cloud. In so doing, you're going to familiarize yourself with a whole host of tools, and I think the possibilities will really excite you.

That's interesting. Yeah, it's really great. Now, I would add maybe a little bit of a challenge. Data has been there for a long time. Customers have always had data: clients, customers, organizations. Obviously cloud is a motivating factor, but specifically with the cloud and the rate of experimentation, why do you feel now is kind of the nexus to start unlocking the potential here?

Yeah, okay, I've got two answers on that. One is related to the fact that you said, hey, data has been around for a long time.
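
The "CSV file to a chart, as fast as possible" advice above can be tried locally before any cloud service is involved. The following is a minimal sketch in Python using only the standard library; the data, the column names (`region`, `deals`), and the text-based chart are assumptions for illustration, standing in for whatever cloud charting tool a team actually picks.

```python
import csv
import io
from collections import defaultdict

# Hypothetical data; column names are assumptions for illustration.
SAMPLE_CSV = """region,deals
West,4
East,7
West,3
Central,2
"""


def load_table(text):
    """Parse CSV text into a list of dict rows (one dict per table row)."""
    return list(csv.DictReader(io.StringIO(text)))


def total_by(rows, key, value):
    """Sum the integer `value` column for each distinct `key`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += int(row[value])
    return dict(totals)


def text_bar_chart(totals):
    """Render one line per label, largest total first."""
    width = max(len(label) for label in totals)
    lines = []
    for label, total in sorted(totals.items(), key=lambda item: -item[1]):
        lines.append(f"{label.ljust(width)} | {'#' * total} {total}")
    return "\n".join(lines)


if __name__ == "__main__":
    rows = load_table(SAMPLE_CSV)
    print(text_bar_chart(total_by(rows, "region", "deals")))
```

The point of the sketch is the shape of the path (table in, aggregate, something visible out), not the tooling; swapping the last step for a hosted dashboard keeps the same structure.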
Well, I mentioned at the beginning that we're not necessarily working with Uber, but we're working with media companies and manufacturing companies, and these companies have been around a long time, ten or twenty years, so they actually do have a ton of data. That's what's interesting about data, and what's different from applications and migrations. You can really throw away applications and kind of restart, but data is always valuable, no matter how old it is. So that's why I like working with these companies that have been around a long time: they already have a ton of data. And when it comes to your question about experimentation, this goes back to my idea that everything is going so fast in almost every industry, and that is driven by technology. It's being driven by consumer choice. Customers are so demanding, and that drives the entire process in so many industries, because they have so much more selection and convenience. That is making things go really fast. Since things are going fast, the idea of a twelve or eighteen or twenty-four month business plan is almost laughable. I don't think anyone really does that unless you're at a very high level. For the most part, most of us need to operate on two, three, four, maybe six month time horizons. That's a short time horizon. We don't know what's going to happen, which means we need to experiment. When I say experiment, I mean come up with some sort of question you want to answer and try to answer it in an inexpensive way. Experimentation should be cheap. It...

...shouldn't cost you much in terms of time and dollars, and that is one of the things that the cloud really has brought. It has lowered the cost of experimentation dramatically. The only thing in the way of that, of course, at this point is cultural acceptance and embracing of experimentation.

The old fail fast, fail cheap model is still prevalent, even in data, right?

It's still relevant. Yeah.

Yeah, that's interesting. Just as an aside discussion point, in the world that we live in today, how do we think the effects of COVID-19 and what have you will push this area along? It's probably hard to throw a dart at it, but in your assessment (I know you and I have talked personally a lot about that) it would be interesting to get your thoughts on where this will push the industry and, effectively, we'll say, the evolution in the space.

Hmm, it's been so interesting, and thankfully I've been very lucky to just sort of sit here in relative safety and be an observer. But I think what COVID-19 has done is maybe three things. One is that everybody's aware of data now. You know the idea of flattening the curve; I'm sure everybody has sort of contemplated what that actually means. So data is going to be at the forefront of people's minds. We've got an invisible enemy, and so we need to sort of trust the data. Another thing that I've seen: I get a lot of newsletters for machine learning, and a lot of people who are practitioners of AI are getting really interested in healthcare and the problem of COVID-19, and maybe ways to develop a vaccine. (By the way, the reason AI and data are usually lumped into the same category is that you need data to create artificial intelligence.) But the irony of all this is that still the most effective way to slow the spread is just to stay away from people. You don't need any data for that. I've found that to be sort of ironic.
But the last thing that I think will be really, really great is, you know, healthcare is an industry that needs a refresh, and healthcare, of all industries, has just amazing troves of data, and we're just not using it. So my hope is that the healthcare industry can get its feathers ruffled a little bit, whether from the inside or from people just demanding more from their healthcare, and I think data will be central to that. I've got a fitness tracker, and maybe you guys have one too, but that idea of knowing precisely about your body and your health, I just think the demand is going to ratchet up over time. People are going to need to know more and more.

Yeah, it's interesting. What I'm probably looking out for in the future is: when does the data just become noise? How do you manage the signal-to-noise ratio? And then you get a false sense of security when the data is not there, right? If we become so entrenched in thinking we're being comforted by that data, what happens to human behavior when it's not readily available, or doesn't tell us what we wanted it to? So yeah, that'll be interesting.

Yeah, for sure. I think, you know, data is not a panacea; it doesn't cure everything. It should augment and support our decision making. The same with AI, you know, like self-driving cars. I'm not an expert in that field, but I'm sure there's always going to be some sort of, hey, I'm going to take over now. We don't want to put all our faith in these algorithms and data. We want the data to support us. You know, we're creative beings. Data is not creative. It should tell...

...us what is going on in the world and make recommendations, but we should still be making the decisions.

That's great. Love it. A bonus for our audience, if you can offer some tactical advice for those who are interested in getting into this field: what would be some good areas for them to learn about and study, some skills, programming languages, and what have you?

Great. So, if you're interested in getting into the field, I would learn Python. Python is the programming language that, you know, you can use for data engineering, machine learning, and algorithm development, and you can still build apps with Python. It's just really versatile and very forgiving. Even I can learn it. So learn Python. Then I would try the machine learning course on Coursera by Andrew Ng. This is sort of the classic. Everyone's got to go through it, because you're going to be exposed to so many different algorithms in a practical way. And then I would say, just don't spend a lot of time on the theory. There are plenty of people working on the theory, and they are really pushing the industry forward. If you want to do that, then go do it. Go get a PhD in math. It's going to take some work. But if you'd like to do something practical, a little bit closer to now, then go build some stuff on any of the major cloud platforms. They are all supporting machine learning and trying to make it easy, and I know for sure it is valuable to any company if you're someone who understands data and can actually build something to present data to an executive. So: Python, the machine learning course from Andrew Ng, and build something.

I definitely second that class. It takes work. There's a lot of theory in there, and I think it'll make you really appreciate what is available in the cloud, so you don't have to reengineer that.

That's right.

Yeah. Rob, thank you so much for your time today. It's been a pleasure to not only interview you but to continue to work with you.

Likewise, Skip. Yeah, likewise.
Great. Thanks, Ian. And Rob, thank you also from me. Great to work with you and great to have you on the podcast.

Oh man, this is a blast. Yeah, I really appreciate you guys' time in doing this and being on the team.

Audience, we want to hear from you. Email us at cloudcrunch@2ndwatch.com with comments, questions, and ideas. Until next time, we'll talk to you then.

You've been listening to Cloud Crunch with Ian Willoughby and Skip Berry. For more information, check out the blog, 2ndwatch.com/company/blog, or reach out to Second Watch on Twitter.
