🤖 Data Analytics — Python & AI Guide
AI-Enhanced Data Workflow — Python Complete Guide
Python is one of the most widely used languages for data science, analytics, and AI. With libraries like Pandas, NumPy, Matplotlib, and Scikit-learn, you can clean messy data, run statistical analysis, build clear charts, and even train machine learning models, all in one environment. This guide is built for the practical learner.
▶ Python for Data Analytics — Getting Started
Python is free, open-source, and has one of the largest data science ecosystems of any language. The fastest way to start: use Google Colab (free, browser-based, no installation) or install Anaconda (includes Python, Jupyter, and the core data science libraries).
🐍 Python → 📦 Pandas → 🔢 NumPy → 📊 Matplotlib → 🤖 Scikit-learn → 🚀 Deploy
- 🆓 Start here: colab.research.google.com → New Notebook → you’re coding in 30 seconds, zero setup
- 📦 The core stack: Pandas (data manipulation), NumPy (numerical computing), Matplotlib & Seaborn (charts), Scikit-learn (ML)
- 📓 Jupyter Notebook: Interactive environment where code, output, charts, and text live in one document — the standard format for data analysis
- 📁 First dataset: Load a CSV file of your company’s sales data, or download free datasets from Kaggle.com — then follow along with every section below
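As a first hands-on step, here is a minimal sketch of loading and inspecting a CSV. The column names (order_date, region, revenue) and the inline sample rows are made up for illustration; with a real file you would pass its path to pd.read_csv instead of the in-memory text.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for a real file such as 'sales_data.csv'
csv_text = """order_date,region,revenue
2026-04-03,South,45000
2026-04-10,North,67000
2026-05-02,South,23000
"""

# pd.read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["order_date"])

print(df.shape)   # (rows, columns)
print(df.dtypes)  # data type of each column
print(df.head())  # first rows of the table
```

If the shape and data types look right, the data is ready for the cleaning steps in the next section.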
🔷 Pandas — DataFrames & Data Cleaning
Pandas is Python’s Excel. A DataFrame is a table of rows and columns that you can filter, sort, group, merge, and transform. Most day-to-day data analysis in Python goes through Pandas.
import pandas as pd

# Load data
df = pd.read_csv('sales_data.csv')

# Explore
print(df.shape)       # (rows, columns)
print(df.head())      # first 5 rows
df.info()             # column types & nulls (info() prints directly, no print() needed)
print(df.describe())  # summary statistics

# Clean data
df = df.dropna(subset=['revenue'])                   # remove rows where revenue is null
df['order_date'] = pd.to_datetime(df['order_date'])  # convert string to date
# Fix number format (assumes revenue was read as text such as '45,000')
df['revenue'] = df['revenue'].str.replace(',', '').astype(float)

# Filter and group
south_sales = df[df['region'] == 'South']       # filter
# Assumes a 'month' column exists; otherwise derive one with df['order_date'].dt.to_period('M')
monthly = df.groupby('month')['revenue'].sum()  # group by month, sum revenue
top10 = df.nlargest(10, 'revenue')              # top 10 by revenue
📌 Data cleaning often takes up most of an analyst's time. Key cleaning tasks: handle nulls (fillna/dropna), fix data types (astype), remove duplicates (drop_duplicates), standardise text (str.lower(), str.strip()), fix date formats (pd.to_datetime).
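The cleaning tasks listed above can be sketched on a tiny made-up table. The column names and values are hypothetical, and filling missing revenue with the median is just one possible choice, not a rule.

```python
import pandas as pd

# Hypothetical messy data: stray spaces, mixed case, a duplicate row, a missing value
raw = pd.DataFrame({
    "order_date": ["2026-04-03", "2026-04-03", "2026-04-10", "2026-05-02"],
    "region": [" South", " South", "NORTH ", "south"],
    "revenue": ["45,000", "45,000", None, "23,000"],
})

clean = raw.copy()
clean["region"] = clean["region"].str.strip().str.lower()  # standardise text
clean = clean.drop_duplicates()                            # remove exact duplicate rows
clean["revenue"] = (
    clean["revenue"].str.replace(",", "", regex=False).astype(float)  # fix data type
)
clean["revenue"] = clean["revenue"].fillna(clean["revenue"].median())  # handle nulls
clean["order_date"] = pd.to_datetime(clean["order_date"])              # fix date format

print(clean)
```

Text is standardised before duplicates are dropped so that rows differing only in spacing or case are also caught.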
🔷 NumPy — Numerical Computing
NumPy (Numerical Python) is the foundation of all numerical computing in Python. It provides arrays (like lists but faster) and hundreds of mathematical functions. Pandas DataFrames are built on NumPy arrays underneath.
import numpy as np

revenues = np.array([45000, 67000, 23000, 89000, 54000])
print(np.mean(revenues))            # Average: 55600
print(np.median(revenues))          # Median: 54000
print(np.std(revenues))             # Std Deviation (population): about 22015 (spread)
print(np.percentile(revenues, 75))  # 75th percentile: 67000

# Financial calculation example
growth_rates = np.array([0.05, 0.08, 0.12, -0.03, 0.09])
# Compound growth: start with ₹10,00,000
portfolio = 1000000 * np.prod(1 + growth_rates)
print(f"Portfolio after 5 years: ₹{portfolio:,.0f}")  # ₹1,342,856 (about ₹13.43 lakh)
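To see how the compound growth figure builds up year by year, here is a small sketch using np.cumprod with the same growth rates as above. The CAGR line is an extra illustration, not part of the original example: it is the single constant annual rate that would give the same final value.

```python
import numpy as np

growth_rates = np.array([0.05, 0.08, 0.12, -0.03, 0.09])
start_value = 1_000_000

# cumprod multiplies the growth factors step by step: one portfolio value per year
yearly_values = start_value * np.cumprod(1 + growth_rates)
for year, value in enumerate(yearly_values, start=1):
    print(f"Year {year}: ₹{value:,.0f}")

# Constant annual rate that would produce the same final value (CAGR)
cagr = (yearly_values[-1] / start_value) ** (1 / len(growth_rates)) - 1
print(f"Equivalent constant annual growth: {cagr:.2%}")
```

The last element of yearly_values equals the np.prod result, because np.prod is simply the final step of the running product.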
🔷 Matplotlib & Seaborn — Python Visualisations
Matplotlib is the base charting library. Seaborn (built on Matplotlib) produces polished statistical charts in fewer lines. Together they cover most everyday visualisation needs:
- 📊 Bar chart: plt.bar(x, y) or sns.barplot(data=df, x='region', y='revenue')
- 📈 Line chart, well suited to time series trends: plt.plot(dates, revenue)
- 🔵 Scatter plot, coloured by category: sns.scatterplot(data=df, x='cost', y='profit', hue='category')
- 🌡️ Heatmap, showing the correlation between all numerical columns at once: sns.heatmap(df.corr(numeric_only=True), annot=True)
- 📦 Box plot, showing distribution, outliers, and median: sns.boxplot(data=df, x='region', y='revenue')
import matplotlib.pyplot as plt
import seaborn as sns

# Monthly revenue trend with annotation (assumes df has one row per month)
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(df['month'], df['revenue'], marker='o', color='#3B82F6', linewidth=2)
ax.set_title('Monthly Revenue FY 2026-27', fontsize=14, fontweight='bold')
ax.set_xlabel('Month')
ax.set_ylabel('Revenue (₹)')

# Annotate the lowest point
min_idx = df['revenue'].idxmin()
ax.annotate('Supply disruption',
            xy=(df.loc[min_idx, 'month'], df.loc[min_idx, 'revenue']),
            xytext=(40, 40), textcoords='offset points',  # label offset from the point, works for any x-axis type
            arrowprops=dict(arrowstyle='->', color='red'), color='red')
plt.tight_layout()
plt.show()
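The box plot and heatmap mentioned in the list above can be tried out on synthetic data. This is only a sketch: the plot_df table, its columns, and the output file name charts.png are all invented for illustration.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; leave this line out in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data (hypothetical columns) so the example runs without a file
rng = np.random.default_rng(0)
n = 60
plot_df = pd.DataFrame({
    "region": rng.choice(["North", "South", "East"], size=n),
    "cost": rng.uniform(10_000, 50_000, size=n),
})
plot_df["revenue"] = plot_df["cost"] * 1.5 + rng.normal(0, 5_000, size=n)
plot_df["profit"] = plot_df["revenue"] - plot_df["cost"]

# Correlation only makes sense for numeric columns, so 'region' is excluded
corr = plot_df.corr(numeric_only=True)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.boxplot(data=plot_df, x="region", y="revenue", ax=axes[0])
axes[0].set_title("Revenue by region")
sns.heatmap(corr, annot=True, cmap="Blues", ax=axes[1])
axes[1].set_title("Correlation between numeric columns")
fig.tight_layout()

# Save as PNG for a report or presentation
out_path = Path("charts.png")
fig.savefig(out_path, dpi=150)
```

Passing ax=... to each Seaborn call is what lets two different chart types share one figure.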
🔷 AI/ML Intro — Scikit-learn Basics
Machine Learning (ML) teaches computers to find patterns and make predictions from data — without being explicitly programmed for each case. Scikit-learn makes ML accessible with just a few lines of Python.
📥 Data → 🧹 Clean → ✂️ Split (Train/Test) → 🤖 Train Model → 📊 Evaluate → 🔮 Predict
- 🔮 Regression: Predict a number — “What will next month’s revenue be?” (Linear Regression)
- 📂 Classification: Predict a category — “Will this customer churn? Yes/No?” (Logistic Regression, Random Forest)
- 🔵 Clustering: Group similar items — “Segment our customers into 4 types based on behaviour” (K-Means)
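Classification and clustering from the list above can be sketched on synthetic customer data. Everything here is an assumption for illustration: the two features, the made-up churn rule, and the choice of a Random Forest and 4 clusters. Real churn data will be far noisier.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic customers with two hypothetical features
rng = np.random.default_rng(42)
n = 400
monthly_spend = rng.uniform(500, 10_000, size=n)
support_tickets = rng.integers(0, 10, size=n)
X = np.column_stack([monthly_spend, support_tickets])

# Made-up rule for the label: low spenders with many tickets churn
churned = ((support_tickets >= 6) & (monthly_spend < 5_000)).astype(int)

# Classification: predict churn yes/no
X_train, X_test, y_train, y_test = train_test_split(
    X, churned, test_size=0.2, random_state=42, stratify=churned
)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Churn accuracy on unseen customers: {accuracy:.2f}")

# Clustering: group customers into 4 segments (no labels needed)
X_scaled = StandardScaler().fit_transform(X)  # put both features on the same scale
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
segments = kmeans.fit_predict(X_scaled)
print(np.bincount(segments))  # number of customers in each segment
```

Scaling before K-Means matters because the algorithm works on distances, and unscaled spend values would otherwise dominate the ticket counts.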
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error

# Predict next month revenue from marketing spend + headcount
features = ['marketing_spend', 'sales_headcount', 'month_number']
X = df[features]
y = df['revenue']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)     # Train on 80% of data
y_pred = model.predict(X_test)  # Predict on 20% unseen data
print(f"R² Score: {r2_score(y_test, y_pred):.2f}")          # e.g. 0.85 = 85% variance explained
print(f"MAE: ₹{mean_absolute_error(y_test, y_pred):,.0f}")  # Average absolute error

# Predict next month: ₹5L marketing spend, 12 salespeople, month 13
# Use a DataFrame with the training column names to avoid a feature-name warning
next_month = pd.DataFrame([[500000, 12, 13]], columns=features)
prediction = model.predict(next_month)
print(f"Predicted Revenue: ₹{prediction[0]:,.0f}")
✅ End-to-End Python Analytics Project — Sales Analysis
Here is a complete mini-project that ties everything together — from raw CSV to insight to prediction:
1️⃣ Load CSV → 2️⃣ Clean & Explore → 3️⃣ Analyse by Region/Product → 4️⃣ Visualise Trends → 5️⃣ Predict Next Month → 6️⃣ Export Report
- 1️⃣ Load: df = pd.read_csv('sales.csv', parse_dates=['order_date'])
- 2️⃣ Clean: Drop nulls, fix types, remove duplicates, create new columns (month, quarter, profit margin %)
- 3️⃣ Analyse: Top 10 products by revenue, revenue by region, monthly trend, customer retention rate
- 4️⃣ Visualise: Line chart (monthly trend), bar chart (region), heatmap (product × month), scatter (spend vs revenue)
- 5️⃣ Predict: Train linear regression on 11 months → predict Month 12 revenue
- 6️⃣ Export: df.to_excel('sales_report.xlsx', index=False) or save charts as PNG for presentation
- 🏆 Share: Upload Jupyter Notebook to GitHub → link in resume → instant portfolio evidence
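The project steps above can be sketched in one short script. This is an illustration under stated assumptions: the data is randomly generated instead of loaded from sales.csv, the month number is the only feature, the charting step is left out for brevity, and the report is written as CSV (df.to_excel works the same way but needs the openpyxl package installed).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# 1. Load: synthetic stand-in for pd.read_csv('sales.csv', parse_dates=['order_date'])
rng = np.random.default_rng(7)
dates = pd.date_range("2026-04-01", "2027-03-31", freq="D")
sales = pd.DataFrame({
    "order_date": dates[rng.integers(0, len(dates), size=600)],
    "region": rng.choice(["North", "South", "East", "West"], size=600),
    "revenue": rng.uniform(5_000, 50_000, size=600).round(2),
})

# 2. Clean: drop nulls and duplicates, add a month column
sales = sales.dropna(subset=["revenue"]).drop_duplicates()
sales["month"] = sales["order_date"].dt.to_period("M")

# 3. Analyse: revenue by region and by month
by_region = sales.groupby("region")["revenue"].sum().sort_values(ascending=False)
monthly = sales.groupby("month")["revenue"].sum().reset_index()
monthly["month_number"] = np.arange(1, len(monthly) + 1)

# 5. Predict: train on the first 11 months, hold out month 12 for comparison
train, test = monthly.iloc[:-1], monthly.iloc[-1:]
model = LinearRegression().fit(train[["month_number"]], train["revenue"])
predicted = model.predict(test[["month_number"]])[0]
actual = test["revenue"].iloc[0]
print(by_region)
print(f"Month 12 predicted: ₹{predicted:,.0f} | actual: ₹{actual:,.0f}")

# 6. Export: monthly summary for the report
monthly.to_csv("monthly_report.csv", index=False)
```

Holding out the final month, rather than splitting at random, mirrors how the model would actually be used: trained on the past and judged on a month it has not seen.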
🎯 Teacher’s Tip
Don’t learn Python in isolation. Find a problem in your actual job, such as an Excel file you process every week or a report you build manually, and solve it with Python. Working on a real problem keeps you motivated and makes the learning stick. Use ChatGPT as your coding mentor: paste your error messages, or ask “how do I do X in Pandas?” AI can be a powerful learning accelerator for data professionals. Combine your own understanding of the problem with the speed at which AI writes code.