AI Python Data Workflow

AI-Enhanced Data Workflow — Python Complete Guide

Python is the language of data science, analytics, and AI. With libraries like Pandas, NumPy, Matplotlib, and Scikit-learn, you can clean messy data, run statistical analysis, build stunning charts, and even train machine learning models — all in one environment. This guide is built for the practical learner.


▶ Python for Data Analytics — Getting Started

Python is free and open-source, and it has one of the largest data science ecosystems of any language. The fastest way to start is Google Colab (free, browser-based, no installation). Alternatively, install Anaconda, which bundles Python, Jupyter, and the core data libraries.

🐍 Python  →  📦 Pandas  →  🔢 NumPy  →  📊 Matplotlib  →  🤖 Scikit-learn  →  🚀 Deploy

  • 🆓 Start here: colab.research.google.com → New Notebook → you’re coding in 30 seconds, zero setup
  • 📦 The core stack: Pandas (data manipulation), NumPy (numerical computing), Matplotlib & Seaborn (charts), Scikit-learn (ML)
  • 📓 Jupyter Notebook: Interactive environment where code, output, charts, and text live in one document — the standard format for data analysis
  • 📁 First dataset: Load a CSV file of your company’s sales data, or download free datasets from Kaggle.com — then follow along with every section below

🔷 Pandas — DataFrames & Data Cleaning

Pandas is Python’s Excel. A DataFrame is a table of rows and columns that you can filter, sort, group, merge, and transform. Most day-to-day data analysis in Python runs through Pandas.

import pandas as pd
# Load data
df = pd.read_csv('sales_data.csv')
# Explore
print(df.shape) # (rows, columns)
print(df.head()) # first 5 rows
df.info() # column types & non-null counts (prints directly, returns None)
print(df.describe()) # summary statistics
# Clean data
df['order_date'] = pd.to_datetime(df['order_date']) # convert string to date
# Fix number format: strip thousands separators; unparseable values become NaN
df['revenue'] = pd.to_numeric(df['revenue'].astype(str).str.replace(',', ''), errors='coerce')
df = df.dropna(subset=['revenue']) # remove rows where revenue is null
# Filter and group (assumes the CSV has 'region' and 'month' columns)
south_sales = df[df['region'] == 'South'] # filter
monthly = df.groupby('month')['revenue'].sum() # group by month, sum revenue
top10 = df.nlargest(10, 'revenue') # top 10 by revenue
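
The snippet above covers filtering and grouping. Sorting, merging, and transforming can be sketched the same way. This is a minimal illustration, not part of the original dataset: the `orders` and `regions` tables and all their column names are invented for the example.

```python
import pandas as pd

# Hypothetical order and region tables, invented for illustration
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "region_code": ["S", "N", "S", "E"],
    "revenue": [1200.0, 800.0, 450.0, 990.0],
})
regions = pd.DataFrame({
    "region_code": ["S", "N", "E"],
    "region": ["South", "North", "East"],
})

# Merge: attach the region name to each order (similar to VLOOKUP in Excel)
merged = orders.merge(regions, on="region_code", how="left")

# Sort: highest revenue first
ranked = merged.sort_values("revenue", ascending=False).reset_index(drop=True)

# Transform: each order's share of its region's total revenue
region_total = ranked.groupby("region")["revenue"].transform("sum")
ranked["region_share"] = ranked["revenue"] / region_total

print(ranked)
```

A left merge keeps every order even when its region code has no match, which makes missing lookups visible as nulls instead of silently dropping rows.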

📌 Data cleaning is often the largest share of the job. Key cleaning tasks: handle nulls (fillna/dropna), fix data types (astype), remove duplicates (drop_duplicates), standardise text (str.lower(), str.strip()), fix date formats (pd.to_datetime).
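
The cleaning tasks listed above can be tried on a tiny table. This is a sketch under stated assumptions: the `raw` DataFrame and its values are made up for illustration, and filling missing quantities with 0 is only one possible choice.

```python
import pandas as pd

# Hypothetical messy data, invented for illustration
raw = pd.DataFrame({
    "customer": [" Asha ", "ravi", "RAVI", "Meena", None],
    "order_date": ["2024-01-05", "2024-01-07", "2024-01-07", "2024-02-10", "2024-02-11"],
    "quantity": ["2", "1", "1", None, "3"],
})

clean = raw.copy()
clean["customer"] = clean["customer"].str.strip().str.lower()  # standardise text
clean = clean.dropna(subset=["customer"])                      # drop rows with no customer
clean["quantity"] = clean["quantity"].fillna("0").astype(int)  # fill nulls, then fix the type
clean["order_date"] = pd.to_datetime(clean["order_date"])      # strings -> datetimes
clean = clean.drop_duplicates().reset_index(drop=True)         # remove exact duplicate rows

print(clean)
print(clean.dtypes)
```

The order matters here: text is standardised before duplicates are removed, so "ravi" and "RAVI" are recognised as the same customer.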


🔷 NumPy — Numerical Computing

NumPy (Numerical Python) is the foundation of all numerical computing in Python. It provides arrays (like lists but faster) and hundreds of mathematical functions. Pandas DataFrames are built on NumPy arrays underneath.

import numpy as np
revenues = np.array([45000, 67000, 23000, 89000, 54000])
print(np.mean(revenues)) # Average: 55600
print(np.median(revenues)) # Median: 54000
print(np.std(revenues)) # Std Deviation: ≈ 22015 (population std, a measure of spread)
print(np.percentile(revenues, 75)) # 75th percentile: 67000
# Financial calculation example
growth_rates = np.array([0.05, 0.08, 0.12, -0.03, 0.09])
# Compound growth: start with ₹10,00,000
portfolio = 1000000 * np.prod(1 + growth_rates)
# The ',' format spec groups digits in thousands, not in the Indian lakh style
print(f"Portfolio after 5 years: ₹{portfolio:,.0f}") # ₹1,342,856
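
To see why arrays are more convenient than lists for this kind of calculation, the compound-growth example can be extended to show the value after every year. This is a small sketch that reuses the growth rates from above; `start_value` and `yearly_values` are names chosen for the example.

```python
import numpy as np

growth_rates = np.array([0.05, 0.08, 0.12, -0.03, 0.09])
start_value = 1_000_000

# Vectorised: 1 + growth_rates adds 1 to every element, no loop needed.
# np.cumprod returns the running product, i.e. the growth factor after each year.
yearly_values = start_value * np.cumprod(1 + growth_rates)

for year, value in enumerate(yearly_values, start=1):
    print(f"Year {year}: {value:,.0f}")

# The last element equals start_value * np.prod(1 + growth_rates)
final_value = yearly_values[-1]
```

With a plain Python list, `1 + growth_rates` would raise a TypeError; the element-wise arithmetic is what NumPy adds.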

🔷 Matplotlib & Seaborn — Python Visualisations

Matplotlib is the base charting library. Seaborn (built on Matplotlib) produces polished statistical charts in fewer lines. Together they cover most everyday visualisation needs:

  • 📊 Bar chart: plt.bar(x, y) or sns.barplot(data=df, x='region', y='revenue')
  • 📈 Line chart: plt.plot(dates, revenue) — perfect for time series trends
  • 🔵 Scatter plot: sns.scatterplot(data=df, x='cost', y='profit', hue='category') — colour by category
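  • 🌡️ Heatmap: sns.heatmap(df.corr(numeric_only=True), annot=True), which shows the correlation between all numerical columns at once (numeric_only=True skips text and date columns, which recent Pandas versions would otherwise reject)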
  • 📦 Box plot: sns.boxplot(data=df, x='region', y='revenue') — shows distribution, outliers, median

import matplotlib.pyplot as plt
import seaborn as sns
# Monthly revenue trend with annotation (assumes df has one row per month)
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(df['month'], df['revenue'], marker='o', color='#3B82F6', linewidth=2)
ax.set_title('Monthly Revenue FY 2026-27', fontsize=14, fontweight='bold')
ax.set_xlabel('Month')
ax.set_ylabel('Revenue (₹)')
# Annotate the lowest point
min_idx = df['revenue'].idxmin()
ax.annotate('Supply disruption',
            xy=(df.loc[min_idx, 'month'], df.loc[min_idx, 'revenue']),
            xytext=(30, 40), textcoords='offset points', # place the label relative to the point
            arrowprops=dict(arrowstyle='->', color='red'), color='red')
plt.tight_layout()
plt.show()
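
The Seaborn one-liners from the list above can be combined into a single figure. This is a sketch on synthetic data: the `demo` DataFrame, its columns, and the relationship between cost and revenue are all invented so the example runs without a real sales file.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data, invented for illustration
rng = np.random.default_rng(42)
n = 120
demo = pd.DataFrame({
    "region": rng.choice(["North", "South", "East", "West"], size=n),
    "cost": rng.uniform(10_000, 50_000, size=n),
})
demo["revenue"] = demo["cost"] * 1.4 + rng.normal(0, 5_000, size=n)
demo["profit"] = demo["revenue"] - demo["cost"]

fig, axes = plt.subplots(1, 3, figsize=(16, 4))

# Bar chart: mean revenue per region
sns.barplot(data=demo, x="region", y="revenue", ax=axes[0])
axes[0].set_title("Mean revenue by region")

# Box plot: distribution, median, and outliers per region
sns.boxplot(data=demo, x="region", y="revenue", ax=axes[1])
axes[1].set_title("Revenue distribution by region")

# Heatmap: correlation between the numeric columns only
corr = demo.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap="coolwarm", ax=axes[2])
axes[2].set_title("Correlation of numeric columns")

fig.tight_layout()
plt.show()
```

Passing `ax=` to each Seaborn call is what lets several charts share one Matplotlib figure.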

🔷 AI/ML Intro — Scikit-learn Basics

Machine Learning (ML) teaches computers to find patterns and make predictions from data — without being explicitly programmed for each case. Scikit-learn makes ML accessible with just a few lines of Python.

📥 Data  →  🧹 Clean  →  ✂️ Split (Train/Test)  →  🤖 Train Model  →  📊 Evaluate  →  🔮 Predict

  • 🔮 Regression: Predict a number — “What will next month’s revenue be?” (Linear Regression)
  • 📂 Classification: Predict a category — “Will this customer churn? Yes/No?” (Logistic Regression, Random Forest)
  • 🔵 Clustering: Group similar items — “Segment our customers into 4 types based on behaviour” (K-Means)

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error
# Predict next month revenue from marketing spend + headcount
X = df[['marketing_spend', 'sales_headcount', 'month_number']]
y = df['revenue']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train) # Train on 80% of data
y_pred = model.predict(X_test) # Predict on 20% unseen data
print(f"R² Score: {r2_score(y_test, y_pred):.2f}") # e.g. 0.85 means 85% of variance explained
print(f"MAE: ₹{mean_absolute_error(y_test, y_pred):,.0f}") # Average error
# Predict next month: ₹5L marketing spend, 12 salespeople, month 13
# Use a DataFrame with the training column names to avoid a feature-name warning
next_month = pd.DataFrame([[500000, 12, 13]], columns=X.columns)
prediction = model.predict(next_month)
print(f"Predicted Revenue: ₹{prediction[0]:,.0f}")
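
The regression example above covers the first of the three task types. Classification and clustering follow the same fit/predict pattern. The sketch below uses synthetic customers: the features, the rule that generates churn, and the choice of 4 segments are all assumptions made for illustration, not results from real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic customers, invented for illustration
rng = np.random.default_rng(0)
n = 400
monthly_spend = rng.uniform(200, 5000, size=n)
support_tickets = rng.integers(0, 10, size=n)
# Assumed rule: many tickets and low spend make churn more likely
churn_score = 0.6 * support_tickets - 0.001 * monthly_spend + rng.normal(0, 1, size=n)
churned = (churn_score > 0.5).astype(int)
X = np.column_stack([monthly_spend, support_tickets])

# Classification: will this customer churn (1) or stay (0)?
X_train, X_test, y_train, y_test = train_test_split(
    X, churned, test_size=0.2, random_state=42, stratify=churned
)
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Churn accuracy on unseen data: {accuracy:.2f}")

# Clustering: group customers into 4 segments, with no labels involved
segmenter = make_pipeline(
    StandardScaler(), KMeans(n_clusters=4, n_init=10, random_state=42)
)
segments = segmenter.fit_predict(X)
print("Customers per segment:", np.bincount(segments))
```

Both models sit behind a `StandardScaler` because spend (hundreds to thousands) and ticket counts (0 to 9) are on very different scales, and K-Means in particular is distance-based.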

✅ End-to-End Python Analytics Project — Sales Analysis

Here is a complete mini-project that ties everything together — from raw CSV to insight to prediction:

1️⃣ Load CSV  →  2️⃣ Clean & Explore  →  3️⃣ Analyse by Region/Product  →  4️⃣ Visualise Trends  →  5️⃣ Predict Next Month  →  6️⃣ Export Report

  • 1️⃣ Load: df = pd.read_csv('sales.csv', parse_dates=['order_date'])
  • 2️⃣ Clean: Drop nulls, fix types, remove duplicates, create new columns (month, quarter, profit margin %)
  • 3️⃣ Analyse: Top 10 products by revenue, revenue by region, monthly trend, customer retention rate
  • 4️⃣ Visualise: Line chart (monthly trend), bar chart (region), heatmap (product × month), scatter (spend vs revenue)
  • 5️⃣ Predict: Train linear regression on 11 months → predict Month 12 revenue
  • 6️⃣ Export: df.to_excel('sales_report.xlsx', index=False) (requires an Excel engine such as openpyxl) or save charts as PNG for presentation
  • 🏆 Share: Upload Jupyter Notebook to GitHub → link in resume → instant portfolio evidence
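
The steps above can be sketched as one short script. This is a minimal outline under stated assumptions: an in-memory synthetic CSV stands in for sales.csv, the column names are invented, the charting step is left out to keep it short, and the report is exported as CSV text so that no Excel engine is needed.

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# 1. Load: a synthetic CSV held in memory stands in for sales.csv
rng = np.random.default_rng(1)
dates = pd.date_range("2024-01-01", "2024-12-31", freq="D")
trend = 50_000 + 100 * np.arange(len(dates))
raw = pd.DataFrame({
    "order_date": dates.strftime("%Y-%m-%d"),
    "region": rng.choice(["North", "South"], size=len(dates)),
    "revenue": (trend + rng.normal(0, 2_000, size=len(dates))).round(2),
})
csv_text = pd.concat([raw, raw.head(5)]).to_csv(index=False)  # 5 duplicate rows on purpose
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["order_date"])

# 2. Clean: drop nulls and duplicates, derive a month column
df = df.dropna(subset=["revenue"]).drop_duplicates()
df["month_number"] = df["order_date"].dt.month

# 3. Analyse: revenue by region and by month
by_region = df.groupby("region")["revenue"].sum()
monthly = df.groupby("month_number", as_index=False)["revenue"].sum()

# 5. Predict: train on months 1-11, then predict month 12
train = monthly[monthly["month_number"] <= 11]
model = LinearRegression().fit(train[["month_number"]], train["revenue"])
predicted = model.predict(pd.DataFrame({"month_number": [12]}))[0]
actual = monthly.loc[monthly["month_number"] == 12, "revenue"].iloc[0]
print(f"Month 12 predicted: {predicted:,.0f} | actual: {actual:,.0f}")

# 6. Export: CSV text here; pass a file name to write to disk instead
report_csv = monthly.to_csv(index=False)
```

Holding back month 12 gives a quick sanity check: the prediction can be compared with a value the model never saw.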

🎯 Teacher’s Tip

Don’t learn Python in isolation. Find a problem in your actual job, such as an Excel file you process every week or a report you build manually, and solve it with Python. Working on a real problem keeps you motivated and makes the learning stick. Use ChatGPT as your coding mentor: paste your error messages, or ask “how do I do X in Pandas?”. Combine your own understanding of the problem with the speed at which AI writes code.
