There is a lot of hype in the tech world about data science. Many startups are emerging as analytics solution providers for businesses, and many IT professionals are shifting their careers towards data science. So what exactly is data science? What kind of work does a data scientist do? This short guide is meant to answer these questions.
Data science is a multidisciplinary field in which engineers, software developers, and statisticians use data to draw useful business insights. These insights can come in the form of visualized patterns, hidden patterns in the data, or predictions of future values.
Below are the typical steps involved in a data science problem:
Data collection: Data is the main ingredient of data science; without it, there is nothing to analyze. Data can be collected from various sources. It may be readily available for download, or it can be extracted from a database. Sometimes data is not readily available, and in that case a data scientist needs to scrape it from the web.
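As a rough illustration of extracting data from a database, here is a minimal sketch using Python's built-in sqlite3 module. The `sales` table, its columns, and the `fetch_sales` helper are made up for this example; a real project would connect to its own database instead of an in-memory one.

```python
import sqlite3

# Hypothetical in-memory database standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.5), ("north", 99.5)],
)


def fetch_sales(connection):
    """Return every row of the (hypothetical) sales table as a list of dicts."""
    cursor = connection.execute(
        "SELECT region, amount FROM sales ORDER BY rowid"
    )
    # cursor.description holds one tuple per column; item 0 is the name.
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


rows = fetch_sales(conn)
conn.close()
print(rows)
```

Returning plain dictionaries keeps the collected data independent of the database driver, so later steps do not need to know where the rows came from.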
Data cleaning: Because data comes from various sources, it often cannot be used directly for analysis. Public data in particular often needs cleaning, missing-value treatment, anomaly handling, validation, and transformation. Some of these steps can be done with SQL or Excel, but more complex operations require programming knowledge.
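To make the cleaning step more concrete, the sketch below validates a column and fills in missing values. The records, the 0 to 120 age range, and the choice of median imputation are all assumptions made for illustration; the right treatment depends on the data and the business question.

```python
from statistics import median

# Hypothetical raw records: ages arrive as strings, some missing or invalid.
raw_records = [
    {"name": "A", "age": "34"},
    {"name": "B", "age": ""},
    {"name": "C", "age": "29"},
    {"name": "D", "age": "-5"},
    {"name": "E", "age": "41"},
]


def parse_age(value):
    """Return the age as an int, or None if it is missing or outside 0-120."""
    try:
        age = int(value)
    except (TypeError, ValueError):
        return None
    return age if 0 <= age <= 120 else None


def clean(records):
    """Parse ages and replace missing or invalid ones with the median."""
    ages = [parse_age(r["age"]) for r in records]
    valid = [a for a in ages if a is not None]
    fill = median(valid)  # raises StatisticsError if no valid age exists
    return [
        {"name": r["name"], "age": a if a is not None else fill}
        for r, a in zip(records, ages)
    ]


cleaned = clean(raw_records)
print(cleaned)
```

Here the empty value and the negative age are both treated as missing and replaced with the median of the valid ages.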
Exploratory data analysis: This step involves data visualization, creating summaries, segmentation, and finding answers to other business questions. It calls for tools that can create summaries, combine variables into composite variables, and plot data. Excel, Matlab, R, Python, or any other tool with these capabilities will do.
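A small sketch of the summary and segmentation part of this step, using only the Python standard library. The `orders` data and the `segment` and `amount` field names are invented for the example; in practice a library such as pandas would usually do this work.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical order data with a customer segment and an order amount.
orders = [
    {"segment": "retail", "amount": 20.0},
    {"segment": "retail", "amount": 30.0},
    {"segment": "wholesale", "amount": 200.0},
    {"segment": "wholesale", "amount": 100.0},
    {"segment": "retail", "amount": 10.0},
]


def summarize_by_segment(records):
    """Group amounts by segment and compute simple summary statistics."""
    groups = defaultdict(list)
    for record in records:
        groups[record["segment"]].append(record["amount"])
    return {
        segment: {
            "count": len(amounts),
            "mean": mean(amounts),
            "min": min(amounts),
            "max": max(amounts),
        }
        for segment, amounts in groups.items()
    }


summary = summarize_by_segment(orders)
for segment, stats in summary.items():
    print(segment, stats)
```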
Predictive analysis: Many business problems (though not all) demand prediction of future values, such as sales, customer churn, or any other variable. This step involves feature engineering, feature selection, model selection, and so on, and requires knowledge of machine learning algorithms. Python, R, and similar languages provide efficient libraries for machine learning.
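As a toy illustration of predicting a future value, the sketch below fits a straight line to monthly sales with ordinary least squares and extrapolates one month ahead. The numbers and the `fit_line` helper are hypothetical, and the model is deliberately the simplest possible one; real work would typically rely on a library such as scikit-learn and a proper validation step.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly sales figures (month number, sales).
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # predicted sales for month 6
print(slope, intercept, forecast)
```

Because the example data lies exactly on a line, the fit is perfect; real sales data would leave residual error that needs to be measured before trusting the forecast.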
Communication: Once data exploration and prediction are done, the final step is to communicate the findings. A data scientist creates summaries, plots, and graphs to tell the story to stakeholders clearly, helping them understand the causal variables and how they can improve their business. The data scientist also reports key performance indicators and predictions to the business.
Tools used by a data scientist
- Excel, SQL, SAS, etc. for data exploration.
- Python, Java, C++, etc. for data collection and data scraping.
- Matplotlib, R, Matlab, Tableau, D3, etc. for data visualization.
- scikit-learn, R, TensorFlow, Torch, etc. for machine learning.
- Hive, Spark, Hadoop, etc. for big data processing.
Data scientists use many different tools for their work. As the list above shows, which tools a data scientist uses is not that important: any tool that helps with handling and processing data will work. What matters is that a data scientist has strong analytical skills.
About the Creator
Romee
Engineer | Blogger | Musician