What is Data Science? (introduction for beginners)

In this article Ill answer a very simple question: What is data science?

Well, the question is simple, sure But the answer is rather complex.

The problem is that there are no generally accepted definitions. But in this article, Ill show you many important aspects of data science. And by the end, youll have a pretty clear mental picture about what it is exactly!

Do you prefer watching this in video format? Here you go:

What is data science? A broad definition would be something like this:

You have a large amount of data and youre trying to extract something smart and useful from it.

Thats abstract, I know Maybe a bit of an oversimplification, too.

So heres an everyday example to help you understand it.

Note: Itll be an everyday example intentionally, but read it carefully and youll see the business parallels, too!

Okay, lets see it!

Im sure you have seen smart watches or maybe you use one, too. These smart gadgets can measure your sleep quality, how much you walk, your heart rate, etc.

Lets take sleep quality, for instance!

If you check every single day, how did you sleep the night before, thats 1 data point for every day. Lets say that you enjoyed excellent sleep last night: you slept 8 hours, you didnt move too much, you didnt have short awakenings, etc. Thats a data point. The day after, you slept slightly worse: only 7 hours. Thats another data point.

By collecting these data points for a whole month, you can start to draw trends from them. Maybe, on the weekends, you sleep better and longer. Maybe if you go to bed earlier, your sleep quality is better. Or you recognize that you have short awakenings around 2 am every night

By collecting the data for a year, you can create more complex analyses. You can learn whats the best time for you to go to bed and wake up. You can identify the more stressful parts of the year (when you worked too much and slept too little). Even more, you might be able to predict these stressful parts of the year and you can prepare yourself!

We are getting closer and closer to data science Lets go even deeper!

If you have enough data, you can discover not only trends but correlations, too!

You can check out, for instance, how your sleep quality is affected by how much exercise you got in the given week. (Walking, running, biking, swimming, etc. These can also be measured by smart watches.) A simple correlation would be to see this: on the days you took more than 5,000 steps, your sleep quality was excellent. This is more than an analysis This can be the basis of an action plan: lets walk at least 5,000 steps every day!

Note: Although, I have to mention that in real life, a data scientist does much more research to get to a conclusion like this one.

And there are even more levels.

Just imagine the data that the producer of this smart watch can collect. In theory (lets not consider legal aspects for now) they could see all the data of all their customers. And with that, they can produce analyses of their data that you a single customer could never even imagine.

Can the symptoms of depression be mitigated by walking 3,000 steps a day for most people? Are people really healthier in some countries than in others? Can weather conditions strongly impact certain social movements? And there are many, many more interesting questions. And these companies might already have enough data points to research these.

Note 1: Most of them are probably doing research around these questions already.

Note 2: Lets not talk about the legal and ethical aspects of these things here. While these are incredibly interesting and important questions, theyre a whole article by themselves.

So as you see: the more data (and the more detailed data) you have in a data science project, the more complex, exciting and useful analyses and predictions you can create.

In essence, this is what data science is about.

Except that all these cant only be done with smart watches and by individuals.

But with many other tools that produce and collect data in many other fields of life

Data science, of course, conquered the world of online businesses first.

Why online businesses? Because thats the #1 place where you can collect data about every single movement of a user. (Some companies, of course, abused this opportunity. But again: we wont dive into the legal and ethical aspects in this article.)

Also, parts of data science have been present in different social sciences for decades!

And in the last few years, its started to gain a foothold in fields like:

Okay, so far Ive written about how data science can be useful.

Lets talk about what skills and tools you need to do data science!

If you have ever read my blog, Im sure youve seen this Venn diagram:

I show it quite often and its really important.

It says that if you want to be a data scientist, you have to be good at three things:

Why are they so important?

Coding is inevitable, because thats the tool you need to work with your data. Its like the piano for the pianist, the brush for the painter, or the pen for the poet. If you want to make your ideas come true, you have to know and use your tools as a professional. (The most popular data science languages are: SQL, Python and bash. I write about all of them on my blog. You can also get access to free cheat sheets and video courses by joining the Data36 Inner Circle.)

Statistics is the actual science of your data science projects. After all, data is about numbers. And when you work with numbers, you should be confident with mathematical and statistical concepts, right?

I know that many people are afraid of (or even more: they hate) statistics. But statistics is not boring nor extremely difficult. Its only that it has bad marketing. To become a data scientist, you have to be familiar with statistical concepts like: statistical averages, statistical biases, correlation analysis, probability theory, functions machine learning algorithms, of course and so on

The third topic is business knowledge. This is a soft factor. For example, lets say that you are working for a bank as a Data Analyst. You can be the best coder and the best statistician, but if you dont understand the business concept behind interest rates or how mortgages work, you will never be able to deliver a meaningful data analysis. I wrote more about the business aspect of data science in this article: Data Science for Business.

So data science is an intersection of three things: statistics, coding and business.

Note 1: Of course, to be successful in the long-term in data science, you have to build other soft skills like: presentation skills, project management skills or people skills.

You can learn more about how to become a data scientist by taking my free course. You also can download all my Python, SQL and bash cheat sheets if you join the Data36 Inner Circle.

I wish I had a dollar for every time mainstream media (e.g. news portals, magazines, even conferences) misinterpret the different data-science-related terms.

Well, in everyday use they are buzzwords.

But they have real meanings and a certain place within the field of data science, too. So its time to clarify what means what.

Usually, you will use your data for 3 major things in your data science projects:

The word data analysis refers to the most conventional way of using your data. You run analyses to understand what happened in the past and where are you now. Lets say you have this chart outlining the first 16 months of your product sales:

Now, predictive analytics refers to projects where you use the same historical data that you see above but this time you try to predict the future. So youll answer the what will happen question. Lets use the same dataset (blue line) to estimate how your product sales will do through the 20th month (red line):

Thats a prediction.

However, its not really accurate, is it?

Is this model better:

Or is there an even better one? Any of these maybe?

When you ask the what is data science question, most data scientists would say that at least this is where the science part of it starts.

But this is really just the tip of the iceberg

When a computer fits the lines on the above examples, it tries to find a mathematical formula (red line) that describes well enough the relationship between the real-life data points (blue line) that have a natural variance anyway.

Now you might ask: how the heck can computer find that mathematical formula?

By using Machine Learning.

Machine Learning is the general name for all the methods by which your computer fine-tunes a statistical model and finds the best fit for your dataset. And the blue-line-red-line example is only one of many. There are tons of machine learning methods for all the different typical data science problems. This model fitting machine learning method is called regression or more precisely: linear and polynomial regression. But there are classification problems (popular machine learning algorithms: decision tree, random forest, logistic regression, etc.), clustering tasks (popular machine learning algorithms: K-Means Clustering, DBSCAN, etc.) and many more.

I wont go into detail here but I will write more about these on Data36. So stay tuned!

Actually, Id like to talk about one particular machine learning method.

Its called deep learning and its gotten very popular in the last few years but many still dont know what it is and what it is good for.

Deep learning is nothing but one specific machine learning method. As I mentioned, there are a lot of machine learning models and all of them are good for solving different data science problems. Deep Learning is only one of them thats recently been widely used for image and voice recognition projects. The way it works is quite interesting, by the way. It gets input values and it turns them into output values after filtering through many layers by creating automatic correlations. It works very similarly to how the human brain works. (More about deep learning in another article.)

Note: The best explanation of deep learning that Ive heard so far was by Andrej Karpathy, Director of AI at Tesla. In his presentation, he introduced how Tesla cars learn to drive. He also explained the general concept of deep learning and he showed how they are using it. Its part of a bigger presentation, and you can find the full video here Andrejs talk starts at 1:52:05 and ends at 2:24:55.

Oh, boy.

Well, I wrote a long paragraph about incompetent wannabe data professionals, clickbait journalists and ignorant managers (who read articles from those clickbait journalists) and of course, companies who try to market their simple data-based products with the AI tag (that recently sells everything) But I just deleted it because I dont want to offend anyone.

Its enough if you know that AI doesnt exist yet. And if humanity does ever create one, it wont happen in the next few years. Right now, there is no computer that would be capable even of imitating creativity, intuitions, ambitions, inspiration or anything else that makes us human.

Sure, there are very advanced bots like the one that Google presented in mid-2018. (Check out the video here.) But if you think about it its nothing but a combination of an advanced chatbot, an advanced voice recognition software (like the one that you have in your smartphone) and an advanced speaking engine.

Note: Plus, you have to know that most of these bots work only in very narrow situations. As soon as they fall off their script they are useless. Also, show me a bot that has its own ambition to learn Chinese or Spanish because it feels that it will be important for its career Right? Todays AI is not even close to real human intelligence.

Okay, you get the theory. But how will all these get useful and profitable for your business? There are so many ways that it could barely fit this article.

But Ill leave here five examples to give you a few ideas, at least. (As this is an introductory article, Ill start with these simpler ones. But Ill add more, later. If you want to get notified, subscribe to my newsletter!)

The first example is a classic data project in a classic online business.

Lets say that you have an e-commerce business and you want to create reports for internal usage. (Many companies are doing this, by the way although whether they are doing it right is a whole nother article.)

In a project like this, the goal is always to help the decision makers and managers see more clearly before they make an actual decision. The job of the data scientists and analysts is to provide analyses, reports and charts supporting these folks.

The data scientist goes and checks out what happened in the last weeks, months or years. What are the trends? What changed? What is the typical customer journey? What can we expect in the future based on the data from the past? And the management makes decisions based on these.

Lets say that we see that people are buying more and more red socks and fewer and fewer yellow T-shirts. Obviously, youll try to align your offers with these trends.

This is a dead simple example of using data in a business yet, done right, it can provide a lot of value.

Note: I mean, its simple to talk about. Putting it into practice is always harder. The devil is in the details, you know

The second example is a slightly more advanced and complex data science project.

Were looking at the very same e-commerce company as before. But now, lets focus on the advertising costs. For the sake of simplicity, lets talk about Google Ads only.

Lets say, we get a question from management: what should our budget be in the next quarter for Google Ads? Its not so simple to set the right value!

If the budget is too high, thats not good because we will overspend and the profit will start to go down. If the budget is too low, thats not good either because then we dont spend enough money on advertising, sales go down, so does income, so does profit.

Got the problem?

Heres where data science comes into play!

The data scientist will know how to estimate the optimal spending limit that results in the most profit.

I mean, at many companies, a senior marketing manager can do this, as well based on best practices and industry benchmarks even in spreadsheets! And sometimes with pretty great results! But data science offers an even better and more precise solution: by using machine learning and predictive analytics!

Lets see the case:

The data scientist of this company will work with data from the last few years. This can include cost, income, website traffic, sales, and many other input variables. Using these data points from the past, the data scientist tries to fit a machine learning model to the dataset. She will get a mathematical model that she will be able to use creating a super accurate prediction and eventually an optimal budget for the next quarter.

This is great because its more accurate compared to the manual, human-created predictions. But its also more scalable! You can always make these things more complex by including 6-8-10 more marketing channels where you can spend money on advertising. A human would struggle with maintaining an overarching view of all these. For a computer and a machine learning algorithm, thats just another few variables in the formula.

Okay, Im not saying that every company should build their advertising budgets on data science and machine learning But theres a certain size and complexity above which it is extremely profitable.

My third example is a grocery stores data project.

Food is usually not a long lasting product, right?

Ive heard this from a friend who works for a well-known grocery giant as a data scientist: For big grocery stores and food companies, predicting how much of different products they should order and stock is a huge challenge.

Its the same dilemma as in the previous example: if they order too much food, it goes bad on the shelves. If too little, they wont have supply and their shoppers will be dissatisfied or even worse: go to a competitor.

Either way: they realize a loss. The only way to win is to find the perfect balance!

The solution for this problem can again be a fine-tuned predictive analytics model that makes predictions based on past data. This can be done based on multiple input variables by using mathematical models. Once you can predict the demand, it becomes much easier to align the supply.

By the way, data science is quite popular in this segment. For instance, there are numerous cool stories about how Walmart predicts trends and finds correlations between different variables in their business. Unfortunately, I cant find trustworthy resources on this topic, so I dont know whether Walmart has really done these or these are just urban legends But one things for sure: with the technology we have in 2020, these data science projects could be done easily.

Originally posted here:

What is Data Science? (introduction for beginners)

Related Posts

Comments are closed.