
Probability Distribution Functions: PMF, PDF & CDF

Data science

By ajay mehtaPublished about a year ago 9 min read

What are Algebraic Variables?

Ans. In algebra, a variable such as x is an unknown value.

x + 5 = 10  # here x is an algebraic variable

What are Random Variables in Stats and Probability?

A random variable represents the possible numerical outcomes of a random experiment.

For example, X = {1, 2, 3, 4, 5, 6}  # the set of all possible outcomes of rolling a die

Types of Random Variables

Discrete random variable

A discrete random variable is a variable that can take on a finite or countably infinite number of values, with gaps in between. For example, the number of heads in 10 coin tosses is a discrete random variable because it can only take on the values 0, 1, 2, …, 10. Rolling a die is another example.

Continuous random variable

A continuous random variable is a variable that can take on any value within a specified range or interval, with no gaps in between. For example, the height of a randomly selected person is a continuous random variable because it can take on any value within a certain range, such as between 4 feet and 7 feet.

What are Probability Distributions?

A probability distribution is a list of all of the possible outcomes of a random variable along with their corresponding probability values.


Problem with Distribution Tables

In many scenarios the number of possible outcomes is much larger, and a table would be tedious to write down.

Example: the heights of people, or rolling 10 dice together.

Solution: establish a mathematical function

What if we use a mathematical function to model the relationship between outcome and probability?

A probability distribution function is a mathematical function that describes the probability of obtaining the different possible values of a random variable in a particular probability distribution.

y = f(x)  # where y represents the probability of outcome x

Hence the name: probability distribution function.

Why are Probability Distributions Important?

They give an idea about the shape of the data, and if the data follows a well-known distribution, we automatically know a lot about it.

A note on parameters: parameters in probability distributions are numerical values that determine the shape, location, and scale of the distribution. Different probability distributions have different sets of parameters, and understanding these parameters is essential in statistical analysis and inference.

There are two types of probability distribution functions:

  1. Probability mass function (PMF)
  2. Probability density function (PDF)

Probability Mass Function

A probability mass function (PMF) is a function that gives the probability of each possible outcome in a discrete random variable. In simpler terms, it tells us the probability of getting a certain value when we roll a die or flip a coin.

For example, let's say we have a fair six-sided die. The PMF for this die would give us the probability of rolling each number from 1 to 6. Since the die is fair, each outcome has an equal probability of 1/6. So the PMF would be:

P(X=1) = 1/6

P(X=2) = 1/6

P(X=3) = 1/6

P(X=4) = 1/6

P(X=5) = 1/6

P(X=6) = 1/6

This tells us that the probability of rolling a 1 is 1/6, the probability of rolling a 2 is 1/6, and so on. The probabilities in a PMF must always sum to 1, and each individual probability must be greater than or equal to 0.
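Those two defining properties are easy to check in code. A minimal sketch of the fair-die PMF in Python:

```python
from fractions import Fraction

# PMF of a fair six-sided die: each outcome 1..6 has probability 1/6
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# The two defining properties of a PMF:
assert all(p >= 0 for p in pmf.values())  # every probability is non-negative
assert sum(pmf.values()) == 1             # the probabilities sum to exactly 1

print(pmf[3])  # probability of rolling a 3 -> 1/6
```

Using `Fraction` keeps the arithmetic exact, so the sum-to-1 check holds without floating-point tolerance.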

There are different types of probability mass functions (PMFs) based on the characteristics of the random variable they represent. Here are some common types:

Uniform PMF: This PMF represents a discrete random variable where all possible outcomes have an equal probability. For example, a fair coin flip or a fair six-sided die.

Bernoulli PMF: This PMF represents a binary random variable that takes on one of two possible outcomes with a certain probability, such as a coin flip that lands heads with probability p and tails with probability 1-p.

Binomial PMF: This PMF represents the number of successes in a fixed number of independent Bernoulli trials. For example, the number of heads in 10 coin flips or the number of defective items in a sample of 20.

Poisson PMF: This PMF represents the number of events that occur in a fixed interval of time or space, where the events occur independently at a constant rate. For example, the number of calls received by a customer service center in an hour or the number of cars that pass through an intersection in a minute.

Geometric PMF: This PMF represents the number of Bernoulli trials needed to get the first success, where the trials are independent and have the same probability of success. For example, the number of times a person needs to roll a die until they get a 6 or the number of times a salesperson needs to make calls until they make a sale.

These are just a few examples of PMFs, and there are many other types that represent different types of random variables and distributions.
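As a sketch, the PMFs listed above can be evaluated with `scipy.stats`; the parameter values below are illustrative choices matching the coin, call-center, and die examples:

```python
from scipy.stats import binom, poisson, geom

# Binomial: probability of exactly 5 heads in 10 fair coin flips
p_binom = binom.pmf(5, n=10, p=0.5)

# Poisson: probability of 0 calls in an hour, if the average rate is 2 per hour
p_pois = poisson.pmf(0, mu=2)

# Geometric: probability the first 6 appears on the very first die roll
p_geom = geom.pmf(1, p=1/6)

print(p_binom, p_pois, p_geom)
```

Each call returns P(X = k) for the given distribution; `binom.pmf(5, n=10, p=0.5)` equals C(10,5)·0.5¹⁰ ≈ 0.246.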

Cumulative Distribution Function (CDF) of a PMF

The cumulative distribution function F(x) describes the probability that a random variable X with a given probability distribution will be found at a value less than or equal to x: F(x) = P(X ≤ x).
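For a discrete random variable, the CDF is simply the running sum of the PMF. A quick sketch with the fair die:

```python
from itertools import accumulate

pmf = [1/6] * 6              # fair die: P(X = k) = 1/6 for k = 1..6
cdf = list(accumulate(pmf))  # F(k) = P(X <= k), a running sum of the PMF

# P(X <= 4) = 4/6
print(round(cdf[3], 4))  # 0.6667
```

Note that the CDF is non-decreasing and its last value is 1, since it accumulates the whole PMF.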

Probability Density Function (PDF)

PDF stands for Probability Density Function. It is a mathematical function that describes the probability distribution of a continuous random variable.

Why "probability density" and not "probability", and what does the area under the graph represent?

Probability Density Functions (PDFs) are used for continuous random variables because unlike discrete random variables, they have an infinite number of possible outcomes within a given range. The probability of a continuous random variable taking on a specific value is zero since there are infinitely many possible values it could take. Instead, we use the concept of probability density, which is a measure of how likely it is for the variable to fall within a certain range of values.

To find the probability that a continuous random variable takes on a value within a certain range, we integrate the PDF over that range. The area under the PDF curve between two points on the x-axis gives the probability that the variable takes on a value within that range. So, the probability of a continuous random variable taking on a specific value is essentially zero, but the probability that it falls within a certain range can be calculated using the PDF.

In contrast, for discrete random variables, we use Probability Mass Functions (PMFs) to assign probabilities to each possible outcome since there are only a finite number of possible outcomes. We can find the probability of a discrete random variable taking on a specific value by looking up the value in the PMF.

In summary, we use PDFs for continuous random variables because they allow us to calculate the probability of the variable falling within a certain range of values, whereas PMFs are used for discrete random variables to assign probabilities to each possible outcome.
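To make the range-vs-point distinction concrete, here is a sketch using `scipy.stats.norm`; the mean of 170 cm and standard deviation of 10 cm are assumed values for illustration, not taken from the text:

```python
from scipy.stats import norm

# Heights modeled as normal with assumed mean 170 cm and std dev 10 cm
mu, sigma = 170, 10

# For a continuous variable, P(X == 170) is zero; what we can compute is
# P(165 <= X <= 175), the area under the PDF between 165 and 175,
# obtained from the CDF as F(175) - F(165)
p_range = norm.cdf(175, loc=mu, scale=sigma) - norm.cdf(165, loc=mu, scale=sigma)
print(round(p_range, 3))  # ~0.383
```

The CDF difference is exactly the integral of the PDF over [165, 175], so no explicit numerical integration is needed.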

Example:

Here's an example that illustrates the difference between a PDF for a continuous random variable and a PMF for a discrete random variable.

Suppose we have a continuous random variable X that represents the heights of adult males in a certain population. The PDF of X might look like a normal distribution, which has a bell-shaped curve with a peak at the mean height and standard deviation that measures the spread of the heights.

On the other hand, suppose we have a discrete random variable Y that represents the number of cars a person owns. The PMF of Y would be a function that assigns a probability to each possible value of Y.

The PMF tells us the probability that a randomly chosen person in the population owns a given number of cars. For example, the probability that a person owns 1 car might be 0.3.

In summary, a PDF is used for continuous random variables like heights, weights, and temperatures, while a PMF is used for discrete random variables like the number of cars a person owns, the number of children in a family, or the number of heads obtained when flipping a coin.
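The original PMF table for Y did not survive here, so the values below are hypothetical, chosen only to be consistent with the P(Y = 1) = 0.3 figure mentioned above:

```python
# Hypothetical PMF for the number of cars a person owns.
# Only P(Y = 1) = 0.3 comes from the text; the other values are
# made-up placeholders chosen so the probabilities sum to 1.
pmf_cars = {0: 0.25, 1: 0.30, 2: 0.25, 3: 0.15, 4: 0.05}

assert abs(sum(pmf_cars.values()) - 1) < 1e-9  # valid PMF: sums to 1
print(pmf_cars[1])  # 0.3
```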

Famous Probability Distributions

Normal distribution - Wikipedia (en.wikipedia.org)

In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable.

Log-normal distribution - Wikipedia (en.wikipedia.org)

In probability theory, a log-normal (or lognormal) distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed.

Poisson distribution - Wikipedia (en.wikipedia.org)

In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval.

How is the graph calculated?

1. Density Estimation

Density estimation is a statistical technique used to estimate the Probability Density Function (PDF) of a continuous random variable based on a set of data points. It is often used in data analysis and machine learning to model the underlying distribution of the data.

The goal of density estimation is to estimate the PDF of a continuous random variable that generated the observed data points. This can be useful for understanding the properties of the data, such as its central tendency, spread, and skewness, and for making predictions about future observations.

There are several methods for estimating the PDF of a continuous random variable from data, including kernel density estimation, parametric density estimation, and nonparametric density estimation.

1. Kernel density estimation estimates the PDF by placing a kernel function at each data point and summing the kernel functions to create a smooth estimate of the PDF.

2. Parametric density estimation assumes the data comes from a known family of distributions and estimates that family's parameters.

3. Nonparametric density estimation makes no assumption about the underlying distribution and builds the estimate directly from the data.

The choice of method depends on the properties of the data and the goals of the analysis. Density estimation is a powerful technique that can provide insights into the properties of a dataset and improve the accuracy of statistical modeling and prediction.

Parametric Density Estimation

Parametric density estimation is a method of estimating the probability density function (PDF) of a random variable by assuming that the underlying distribution belongs to a specific parametric family of probability distributions, such as the normal, exponential, or Poisson distributions.

For example, let's say we have a sample of titanic data that represents the age of Passengers in a certain population. We can assume that the distribution of age follows a normal (Gaussian) distribution with unknown mean and variance. We can estimate these parameters using the sample mean and sample variance, respectively, and then construct a normal distribution with those parameters as the estimated PDF.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

# Load the Titanic dataset and extract the Age column
data = pd.read_csv('titanic.csv')
age = data['Age'].dropna()

# Estimate the mean and variance of the normal distribution from the data
mu = np.mean(age)
sigma2 = np.var(age)

# Evaluate the density of the fitted normal distribution over a range of values
x = np.linspace(0, 100, 1000)
pdf = norm.pdf(x, loc=mu, scale=np.sqrt(sigma2))

# Plot a histogram of the Age column with the estimated density function
plt.hist(age, density=True, bins=30)
plt.plot(x, pdf, 'r-', lw=2)
plt.xlabel('Age')
plt.ylabel('Density')
plt.show()

---

Non-Parametric Density Estimation (KDE)

But sometimes the distribution is not clear, or it is not one of the well-known distributions. Non-parametric density estimation is a statistical technique used to estimate the probability density function of a random variable without making any assumptions about the underlying distribution. It is called non-parametric because it does not require a predefined probability distribution function, as opposed to parametric methods such as fitting a Gaussian.

The non-parametric density estimation technique involves constructing an estimate of the probability density function using the available data. This is typically done by creating a kernel density estimate.

Non-parametric density estimation has several advantages over parametric density estimation. One of the main advantages is that it does not require the assumption of a specific distribution, which allows for more flexible and accurate estimation in situations where the underlying distribution is unknown or complex. However, non-parametric density estimation can be computationally intensive and may require more data to achieve accurate estimates compared to parametric methods.

Kernel Density Estimate(KDE)

The KDE technique involves using a kernel function to smooth out the data and create a continuous estimate of the underlying density function.

Kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. In simple terms, it helps to visualize how the data is distributed and how likely it is to observe a particular value.

KDE works by placing a kernel (a mathematical function) at each data point, and then adding up these functions to create a smooth curve that represents the density of the data. The kernel function controls the shape and width of the curve.
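That description can be sketched directly in NumPy with a Gaussian kernel; the data points and bandwidth below are made up for illustration:

```python
import numpy as np

def gaussian_kde_1d(data, x, bandwidth=1.0):
    """KDE: the average of Gaussian bumps, one centred on each data point."""
    data = np.asarray(data)[:, None]         # shape (n, 1) for broadcasting
    u = (x[None, :] - data) / bandwidth      # scaled distances, shape (n, m)
    kernels = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian kernel
    return kernels.mean(axis=0) / bandwidth  # average the bumps, rescale

data = [2.0, 3.0, 3.5, 7.0]          # made-up sample
x = np.linspace(0, 10, 101)
density = gaussian_kde_1d(data, x, bandwidth=0.8)

# A valid density estimate: non-negative and (numerically) integrates to ~1
area = density.sum() * (x[1] - x[0])
print(round(area, 2))  # ~1.0
```

A smaller bandwidth gives a spikier curve that follows individual points; a larger one smooths them together, which is the trade-off `kdeplot` tunes for you.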

Here's an example of how to use KDE to visualize the age distribution of passengers on the Titanic using Python and the Seaborn library:

import seaborn as sns
import matplotlib.pyplot as plt

# Load the Titanic dataset bundled with Seaborn
titanic = sns.load_dataset("titanic")

# Plot a KDE of passenger ages
sns.kdeplot(data=titanic, x="age")
plt.show()

This code loads the Titanic dataset from the Seaborn library and plots a KDE of the "age" column. The resulting plot shows the distribution of ages among Titanic passengers, with the highest density of passengers in their 20s and 30s.

Overall, KDE is a useful tool for exploring the distribution of data, and it can help identify patterns and relationships that may not be obvious from raw data alone.
