2026-05-03
Software Tools

Mastering Data Analysis with Python: A Step-by-Step Tutorial

A hands-on tutorial covering the data analysis pipeline in Python: cleaning messy data with pandas, detecting outliers/typos, and using linear regression to explore variable relationships. Includes code examples and common pitfalls.

Overview

Data analysis is a cornerstone of modern decision-making, and Python has become a go-to language for analysts thanks to its powerful libraries like pandas, NumPy, and scikit-learn. This tutorial guides you through a complete data analysis workflow, from importing raw data to drawing insights using regression. You'll learn how to clean messy datasets, identify outliers and typos, and build a regression model to explore relationships between variables. By the end, you'll have a practical framework for tackling your own data projects.

Source: realpython.com

Prerequisites

Before diving in, ensure you have:

  • Python 3.7 or later installed
  • Basic familiarity with Python syntax (variables, loops, functions)
  • The following libraries: pandas, numpy, matplotlib, seaborn, scikit-learn (install via pip install pandas numpy matplotlib seaborn scikit-learn)
  • A dataset to work with (we'll use the classic “Auto MPG” dataset, available from the UCI repository, but any CSV will do)

Optionally, a Jupyter notebook environment (e.g., JupyterLab, VS Code with Python extension) for interactive exploration.

Step-by-Step Instructions

1. Importing Libraries and Loading Data

Start by importing the essential libraries and loading your dataset. Here we use pandas to read a CSV file.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

df = pd.read_csv('auto-mpg.csv')
print(df.head())

This snippet gives you a quick preview of the data structure: column names, data types, and initial values. Always check df.info() to spot missing entries and incorrect data types early.

2. Understanding the Dataset

Perform exploratory analysis to grasp the variables. Use df.describe() for summary statistics and df.shape for dimensions. For the MPG dataset, columns include mpg, cylinders, displacement, horsepower, weight, acceleration, model year, and origin. Note that horsepower may be stored as object because missing values are recorded as '?', a common obstacle.
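A quick exploratory pass might look like the following. The tiny frame below is a synthetic stand-in for auto-mpg.csv (column names follow the UCI file; the '?' mimics its missing-value marker):

```python
import pandas as pd

# Tiny stand-in for the Auto MPG data; '?' mimics the file's missing-value marker
df = pd.DataFrame({
    'mpg': [18.0, 15.0, 36.0, 31.0],
    'cylinders': [8, 8, 4, 4],
    'horsepower': ['130', '165', '?', '65'],
})

print(df.shape)        # (rows, columns)
print(df.describe())   # summary stats cover numeric columns only
print(df.dtypes)       # horsepower shows up as 'object' because of the '?' entries
```

Note that df.describe() silently skips the horsepower column here, which is itself a useful signal that the column needs type conversion.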

3. Cleaning Raw Data with Pandas

Data cleaning is often the most time-consuming step but critical for accurate analysis. For the MPG dataset, handle the horsepower column:

# Replace non-numeric entries with NaN
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')

# Check for missing values
print(df.isnull().sum())

# Impute missing values with the median (assignment avoids the deprecated
# inplace fillna-on-a-column pattern, which can operate on a copy)
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

Also remove duplicates if any (df.drop_duplicates(inplace=True)) and ensure correct data types (e.g., integers for cylinders and year). For categorical variables like origin, you might convert them to numeric codes or one-hot encode later.
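Those follow-up steps can be sketched like this, again on synthetic data (the 1/2/3 origin codes follow the UCI convention):

```python
import pandas as pd

df = pd.DataFrame({
    'cylinders': [8.0, 8.0, 4.0],        # floats that should be ints
    'model year': [70.0, 70.0, 82.0],
    'origin': [1, 1, 3],
})

df = df.drop_duplicates()                      # drops the repeated first row
df['cylinders'] = df['cylinders'].astype(int)  # enforce integer dtypes
df['model year'] = df['model year'].astype(int)

# One-hot encode the categorical origin column
df = pd.get_dummies(df, columns=['origin'], prefix='origin')
print(df.columns.tolist())
```

pd.get_dummies replaces origin with one indicator column per code, which keeps a linear model from treating the codes as an ordered quantity.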

4. Spotting Outliers and Typos

Outliers can skew regression results. Use boxplots and z-scores to detect extreme values:

# Boxplot of mpg
sns.boxplot(x=df['mpg'])
plt.show()

# Identify outliers using z-score (threshold 3)
from scipy import stats
z_scores = np.abs(stats.zscore(df['mpg']))
outliers = df[z_scores > 3]
print(outliers)

Typos often appear as inconsistent entries in categorical columns. For example, the origin column might have '1', '2', '3' but also 'usa' typed manually. Use df['origin'].value_counts() to spot anomalies and correct them with mapping.
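A minimal sketch of that workflow, using a hypothetical mapping for the stray entries:

```python
import pandas as pd

# origin should hold the codes 1-3, but a few rows were typed by hand
df = pd.DataFrame({'origin': [1, 2, 3, 'usa', '1', 3]})

print(df['origin'].value_counts())   # the stray 'usa' and string '1' stand out

# Normalize everything back to integer codes (the mapping is illustrative)
normalize = {'usa': 1, '1': 1}
df['origin'] = df['origin'].apply(lambda v: normalize.get(v, v)).astype(int)
print(df['origin'].value_counts())
```

value_counts() is cheap to run on every categorical column; anything with an unexpectedly small count is worth a look.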

5. Feature Engineering and Selection

Prepare features for regression. Create new variables if helpful (e.g., power-to-weight ratio) and select relevant predictors. For simplicity, we'll use displacement, horsepower, weight, and acceleration to predict mpg.

features = ['displacement', 'horsepower', 'weight', 'acceleration']
X = df[features]
y = df['mpg']
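The power-to-weight ratio mentioned above could be added like this (synthetic values; the column name power_to_weight is my own choice):

```python
import pandas as pd

df = pd.DataFrame({
    'horsepower': [130.0, 165.0, 65.0],
    'weight': [3504, 3693, 1985],
    'mpg': [18.0, 15.0, 31.0],
})

# Engineered feature: horsepower per pound of vehicle weight
df['power_to_weight'] = df['horsepower'] / df['weight']
print(df[['power_to_weight', 'mpg']])
```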

Scaling the features is optional for plain linear regression but recommended for many models. Note that this example fits the scaler on the full dataset before splitting; to avoid leaking test-set statistics, fit the scaler on the training split only (see Common Mistakes below):

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

6. Splitting Data for Training and Testing

Divide the dataset into training and test sets to evaluate model performance.

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

An 80/20 split is common. Use random_state for reproducibility.
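As noted under Common Mistakes, a leakage-free variant fits the scaler on the training split only and reuses those statistics on the test set. A minimal sketch with synthetic data standing in for the four predictors:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))   # stand-in for the four predictor columns
y = rng.normal(size=100)        # stand-in target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics from training data only
X_test_scaled = scaler.transform(X_test)        # reuse them on the test set

print(X_train_scaled.mean(axis=0).round(6))     # ~0 by construction on the training split
```

Only the training split ends up with exactly zero mean and unit variance; the test split is transformed with the training statistics, which is the point.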

7. Building a Regression Model

Use linear regression to model the relationship between features and mpg:

model = LinearRegression()
model.fit(X_train, y_train)

# Coefficients
coeff_df = pd.DataFrame(model.coef_, index=features, columns=['Coefficient'])
print(coeff_df)

Interpretation: a positive coefficient means increasing that feature raises predicted mpg, holding the others constant (unlikely for weight), while a negative coefficient lowers it. Because the features were standardized, each coefficient is the change in mpg per standard deviation of that feature, not per raw unit.

8. Evaluating the Model

Predict on test data and assess performance:

y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f'R-squared: {r2:.2f}')

# Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted MPG')
plt.ylabel('Residuals')
plt.show()

An R² near 1 indicates a good fit. Check residuals for homoscedasticity (constant spread) and randomness.
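Beyond R², error metrics in the target's own units, such as MAE and RMSE, are often easier to communicate. A minimal sketch with illustrative predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_test = np.array([18.0, 15.0, 31.0, 36.0])   # illustrative true values
y_pred = np.array([17.0, 16.5, 30.0, 34.0])   # illustrative predictions

mae = mean_absolute_error(y_test, y_pred)          # average absolute error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # penalizes large misses more
print(f'MAE: {mae:.3f} mpg, RMSE: {rmse:.3f} mpg')
```

"The model is off by about 1.4 mpg on average" is usually a more persuasive summary than a bare R² value.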

Common Mistakes

  • Ignoring data types: Numeric columns stored as strings (like horsepower) will cause errors. Always verify with df.dtypes and convert when needed.
  • Overlooking missing values: Dropping all rows with NaNs can reduce sample size significantly. Instead, impute strategically (mean, median, or using other features).
  • Failing to detect outliers: Outliers can be genuine extreme cases or data entry errors. Investigate them before removal; sometimes they carry valuable insights.
  • Leaky data splitting: Scaling should be applied after splitting to avoid data leakage from the test set. Fit the scaler only on training data, then transform both.
  • Misinterpreting regression coefficients: Correlation does not imply causation. A coefficient shows the average change in target for one unit change in predictor, assuming all else is constant.
  • Skipping residual analysis: High R2 doesn't guarantee a good model. Residual plots reveal patterns like heteroscedasticity or non-linearity that suggest the model isn't appropriate.
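One way to avoid the leakage pitfall mechanically is scikit-learn's Pipeline, which refits the scaler inside every fit call so the test rows never influence the scaling. A sketch on synthetic data with a known linear relationship:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

# Scaler + regressor chained into one estimator
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X[:80], y[:80])          # scaler statistics come from these 80 rows only
print(round(model.score(X[80:], y[80:]), 3))  # R² on the held-out rows
```

The same pipeline object also drops cleanly into cross_val_score, where per-fold refitting of the scaler matters even more.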

Summary

This tutorial walked through the core stages of a data analysis project using Python. You learned to load data, clean it with pandas, identify outliers and typos, engineer features, and build a linear regression model. The workflow, from raw data to interpretable results, applies to virtually any dataset. By mastering these steps, you're equipped to extract meaningful insights and make data-driven decisions. Keep practicing with different datasets to sharpen your skills.