Data – building blocks for information.
- Data science takes data and transforms it into meaningful information that can help us make decisions.
- The field is interdisciplinary, combining well-known fields such as probability, statistics, analytics, and computer science.
Statistics is the practice of applying mathematical calculations to sets of data to derive meaning. Statistics can give us a quick summary of a dataset, such as its average value or how consistent it is.
- Descriptive Statistics – describe a dataset using mathematically calculated values, such as the mean and standard deviation.
- Inferential Statistics – statistical calculations that enable us to draw conclusions about the larger population.
Probability is the mathematical study of what could potentially happen.
A set of instructions for a computer is called a program or an algorithm. It is written in a computer programming language, which we usually refer to as code for short.
Domain Expertise refers to the particular set of knowledge that someone cultivates in order to understand their data.
- Data Science—the field of taking data and transforming it into meaningful information that can help us make decisions
- Descriptive Statistics—statistics that describe the data in objective terms
- Inferential Statistics—inferences for the overall population based on data
- Probability—the likelihood that an event will happen
- Programming—the act of giving the computer instructions to perform a task
- Domain Expertise—the particular set of knowledge that someone cultivates and brings with them in order to understand their data
The Data Science Process
The process generally goes as follows:
- Ask a question
- Determine the necessary data
- Get the data
- Clean and organize the data
- Explore the data
- Model the data
- Communicate your findings
1. Formulate a Question
- VARIABLE RELATIONSHIPS – finding the effect that different things have on each other. How is x related to y? For example, is eating dinner late related to your ability to fall asleep early?
- SCOPE – A question should be specific enough that we know it is answerable, but it shouldn’t be too specific to the point where no relevant data exist, and we are unable to draw any real conclusions.
- CONTEXT – Part of doing data science is having professional expertise – or a significant amount of background knowledge in the area that you want to explore. Gaining context requires doing research, such as looking at any relevant data that you already have.
2. Determine the Necessary Data
Make an educated guess about what you think the answer might be – a hypothesis.
In science, it’s actually impossible to prove that something is true. Rather, we try to show that we’re really, really confident that it’s not false. That’s because the only way we can say we’re 100% positive our hypothesis is correct is to collect all the data for an entire population – and that’s pretty much impossible!
- Determine what data could disprove our hypothesis.
- Figure out how much data to collect. Collect a sample set of data – a smaller amount of data that is representative of the entire population. Figure out the sample size needed so that the sample’s descriptive statistics closely resemble those of the entire population.
The larger the sample size and the more diverse your dataset is, the more confident you’ll be in your results.
Sample Size Calculator
A sample size calculator determines the necessary sample size based on the following information:
- Margin of error—The amount that the results of our survey will differ from the real population value. The larger the error, the less confidence we should have in the results.
- Confidence level—The probability that if we were to run another survey with the same metrics that it would return the same results. We want a high confidence level (like 95%) that our results are repeatable with another group.
- Population size—Size of the population we’re collecting data on. A common number used in sample size calculations is 100,000.
- Likely sample proportion—The percentage of people surveyed whose results we anticipate matching the expected outcome. If we do not have historical data, we normally use 50%.
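The inputs above can be turned into an actual calculation. Here is a minimal sketch using Cochran's formula with a finite-population correction; the function name and inputs are illustrative, not from the course.

```python
import math

def sample_size(z, margin_of_error, population, proportion=0.5):
    # Cochran's formula for an infinite population...
    n0 = z**2 * proportion * (1 - proportion) / margin_of_error**2
    # ...adjusted for a finite population, rounded up to a whole sample.
    return math.ceil(n0 / (1 + (n0 - 1) / population))

# 95% confidence (z ≈ 1.96), 5% margin of error, population of 100,000,
# and the default 50% likely sample proportion:
print(sample_size(1.96, 0.05, 100_000))  # → 383
```

Note how the defaults mirror the bullets above: a 50% proportion is the conservative choice when no historical data exists, and 100,000 is the common population size.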
3. Get the Data
- Active data collection—you’re setting up specific conditions in which to get data. You’re on the hunt. Examples include running experiments and surveys.
- Passive data collection—you’re looking for data that already exists. You’re foraging for data. Examples include locating datasets and web scraping.
- Size of Dataset – we usually can’t get data from an entire population, so we need to have an appropriate sample that is representative of the larger population. If you’re ever unsure about the size of your dataset, use a sample size calculator.
- Data Collection Error
4. Clean the Data
An important part of the data science process is to clean and organize our datasets, sometimes referred to as data wrangling. Processing a dataset could mean a few different things. For example, it may mean getting rid of invalid data or correctly labeling columns.
The Python library Pandas is a great tool for importing and organizing datasets. You can use Pandas to convert a spreadsheet document, like a CSV, into easily readable tables and charts known as DataFrames. We can also use libraries like Pandas to transform our datasets by adding columns and rows to an existing table, or by merging multiple tables together.
new_df = pd.merge(user_data, pop_data)
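For context, the cleaning steps described above (relabeling a column, dropping invalid data, then merging) might look like this on a small made-up example; the column names here are invented for illustration.

```python
import pandas as pd

# Hypothetical raw tables standing in for the CSVs:
user_data = pd.DataFrame({"id": [1, 2, 3], "Age ": [25.0, None, 31.0]})
pop_data = pd.DataFrame({"id": [1, 2, 3],
                         "population_proper": [50_000, 2_000_000, 800_000]})

# Correctly label the badly named column, then drop invalid (missing) rows:
user_data = user_data.rename(columns={"Age ": "age"}).dropna()

# Merge the two tables on their shared "id" column:
new_df = pd.merge(user_data, pop_data)
print(new_df)
```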
5. Explore the Data
There are two strategies for exploring our data:
- Statistical calculations
- use descriptive statistics to get a sense of what it contains.
- Descriptive statistics summarize a given dataset using statistical calculations, such as the average (also known as mean), median, and standard deviation.
- We can immediately learn what are common values in our dataset and how spread out the dataset is (are most of the values the same, or are they wildly different?).
- We can use a Python module known as NumPy to calculate descriptive statistics. NumPy (short for Numerical Python) supplies short commands to easily perform statistical calculations, like np.mean(), which calculates the mean of a dataset.
- Data visualizations
- enables us to see patterns, relationships, and outliers, and how they relate to the entire dataset.
- particularly useful when working with large amounts of data.
age = new_df["age"]
sns.displot(age)
plt.show()

location_mean_age = new_df.groupby("location").age.mean()
print(location_mean_age)

plt.close()
sns.barplot(data=new_df, x="location", y="age")
plt.show()

plt.close()
sns.violinplot(x="location", y="age", data=new_df)
plt.show()
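The statistical-calculations strategy can be sketched with NumPy directly; the sample of user ages below is made up for illustration.

```python
import numpy as np

# Hypothetical sample of user ages:
ages = np.array([22, 25, 31, 35, 35, 41, 52])

print(np.mean(ages))    # the average value
print(np.median(ages))  # the middle value
print(np.std(ages))     # how spread out the values are
```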
6. Modeling and Analysis
To analyze our data, we’ll want to create a model. Models are abstractions of reality, informed by real data, that allow us to understand situations and make guesses about how things might change given different variables.
- A model gives us the relationship between two or more variables.
- allow us to analyze our data because once we begin to understand the relationships between different variables, we can make inferences about certain circumstances.
- useful for informing decisions, since they can be used to predict unknowns.
Models can be expressed as mathematical equations, such as the equation for a line. You can use data visualization libraries like Matplotlib and Seaborn to visualize relationships. If you pursue machine learning, you can use the Python package scikit-learn to build predictive models, such as linear regressions.
x = new_df["population_proper"]
y = new_df["age"]
plt.scatter(x, y, alpha=0.5)
plt.show()
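A line-of-best-fit model like the one described above can be sketched with NumPy's polyfit; the population and age values below are invented, since the course's dataset isn't included here.

```python
import numpy as np

# Hypothetical data: city population (in millions) vs. average user age.
population = np.array([0.1, 0.5, 1.0, 4.0])
age = np.array([38, 34, 30, 26])

# Fit the line  age ≈ slope * population + intercept:
slope, intercept = np.polyfit(population, age, 1)
print(slope, intercept)

# The fitted equation can then predict unknowns:
print(slope * 2.0 + intercept)  # predicted age for a city of 2 million
```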
7. Communicating Findings
Two important parts of communicating data are visualizing and storytelling.
import codecademylib3_seaborn
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import CSVs:
user_data = pd.read_csv("user_data.csv")
pop_data = pd.read_csv("pop_data.csv")

# Merge tables and add location data:
new_df = pd.merge(user_data, pop_data)
new_df.loc[new_df.population_proper < 100000, "location"] = "rural"
new_df.loc[new_df.population_proper >= 100000, "location"] = "urban"

# Plot linear regression:
sns.regplot(x="population_proper", y="age", data=new_df)
plt.show()

# Change the figure style and palette:
plt.close()
sns.set_style("darkgrid")
sns.set_palette("bright")
sns.despine()
sns.regplot(x="population_proper", y="age", data=new_df)
plt.show()

# Change the axes:
ax = plt.subplot(1, 1, 1)
ax.set_xticks([100000, 1000000, 2000000, 4000000, 8000000])
ax.set_xticklabels(["100k", "1m", "2m", "4m", "8m"])
plt.show()

# Title the axes and the plot:
ax.set_xlabel("City Population")
ax.set_ylabel("User Age")
plt.title("Age vs Population")
plt.show()
Reproducibility and Automation
It’s important that your work can be reproduced – a quality known as reproducibility.
Reproducibility – enables you to reuse and modify experiments, and it is also how the scientific community can confirm that your analysis is valid.
Big data refers to massive datasets that can be used to reveal patterns and trends. Any large quantity of information, from sports scores to social media posts, can be considered big data. And since big data is so, well, big, we need optimized algorithms and high-powered computers to sift through it.
Netflix uses machine learning, a subset of artificial intelligence, to help their algorithms “learn” without human assistance. Machine learning gives the platform the ability to automate millions of decisions based off of user activities.
The need for recommendation engines and personalization is a result of a phenomenon known as the “era of abundance”.
A basic implementation of a recommendation engine is the editorial method, in which the platform makes recommendations based on the choices of a relatively small number of individuals. Another simple approach is the aptly named simple collection method, where the platform makes suggestions based on the top products across the platform.
Netflix uses the personalized method where movies are suggested to the users who are most likely to enjoy them based on a metric like major actors or genre.
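The simple collection method described above can be sketched in a few lines; the titles and ratings are invented for illustration.

```python
# Toy sketch of the simple collection method: recommend the titles
# with the highest average rating across the whole platform.
ratings = {
    "Title A": [5, 4, 5],
    "Title B": [3, 2, 4],
    "Title C": [4, 4, 4],
}

def top_titles(ratings, n=2):
    # Average each title's ratings, then take the n highest.
    averages = {title: sum(r) / len(r) for title, r in ratings.items()}
    return sorted(averages, key=averages.get, reverse=True)[:n]

print(top_titles(ratings))  # → ['Title A', 'Title C']
```

The personalized method would replace the platform-wide average with a score tailored to each user, which is where machine learning comes in.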