Data Science: So you want to be a Data Scientist?

Spread the love
So you want to join a Data Science team?
So you want to join a Data Science team?

The Basic

Data Science is a trending topic among recent years.  And a Data Scientist is the #1 Best Job in America. But before going further on this topic, let’s go back to a basic question: What is Data Science?

Data Science in Google Trends
Data Science in Google Trends

Data becomes air in modern day, it is everywhere. When you take a picture, your picture contains data of your camera model, location, date and color information. When you go online shopping, your preference and buying behavior would be served as data for the business owner. And of course, while you are reading this post, Google Analytics is recording your data as well.

Data grows rapidly day after day, it provides valuable information for making business decisions. And at the same time, it just turns out being too big to handle. Likes diving into the big deep Data Ocean to find our answers. People then deserve a more effective and scientific way to handle data, thus we have Data Science.

Key components of Data Science

Problem in a specified domain

(oh wait, are you expecting me saying “Data” is the first component of Data Science? :]] )
We use Data Science to look for a solution. But there is always no solution without a problem. That is why we need a problem to solve before we start our “science”.  It could be “How to boost the sale on certain items”, “How to increase the success rate for certain rescue operations” or others.

Data set

Back to our key word in Data Science, data. A data set is a collection of data entities with attributes and behaviors in certain events. For example, we would like to boost our sale volume, so we look for our data set: the sale transaction record. The data set contains:

  • entities: customers, products
  • attributes: customers’ age, gender, location; product categories, product price
  • behaviors: customer paid by cash, customer paid by credit card, customer asked for refund; product was damaged
  • events: regular sale, monthly mega sale; delivery delay

Data collection

It would be easy if all the data sets are stored in a digital and structured storage, likes tables in database. But sometimes, things just do not happen well. You may encounter records stored in eMails, text files or even hand writing notes. Then data filtering and gathering become challenging tasks in data collection process.


We use data to make certain decisions because we know the real world relationship among the data we collected. When we design a health insurance product, we know age and weight are factors to particular diseases. Then we adjust the product price according to customer’s age/weight attributes. We know the data-real world relationships by observation and experience, but it would be costly if we do all the analysis by our own. In Data Science, we let computer learn the data-real world relationships, i.e. the situation on when age goes up, the chance of disease A goes up as well. Therefore computer can operate certain decision making processes, according to the real world relationships it has learnt. It also leads to another trending topic for us — Machine Learning. (we would discuss on this one in other post)

Model building

In order to find out an answer for a problem, we need to build a model to represent our real world scenario. There is the laboratory for a computer doing the “Data Science”. The computer has learnt data-real world relationships from human, then it turns those relationships  into mathematical equations like “chance of having disease A = age * 0.123 + weight * 0.456“. Thus it can provide rational results according to data we have collected.

Outcome prediction on the problem

Problem: check, Data sets collection: check, Model: check. We can then retrieve outcome predicted by the model. That is the end? No. Like other science subjects, we have to measure our result and see rather it can fully solve our defined problem. We may need to modify the way we collect data, teach computer on new data relationships and build a new model. We may also need to repeat above steps several times in order to get more accurate outcome.

Data Science Life Cycle

  1. Define a problem
  2. Collect data
  3. Learn data relationships
  4. Build a model
  5. Review outcome
  6. Repeat step 2 to 5 until getting the satisfied outcome

This is the basic concept of Data Science. Unlike previous post with hands-on explained line by line, this post does need time to read and digest. Please feel free to ask or leave any comment if you find any question.