Practical 1 - Performing the analysis using Excel


1. Import the data warehouse data into Microsoft Excel and create a Pivot Table and Pivot Chart.

Procedure:

1. Create the Data Warehouse Database
   1. Open SQL Server Management Studio (SSMS).
   2. Connect to the Database Engine.
   3. Click on New Query.
   4. Copy and paste the given T-SQL script for creating the Sales Data Warehouse.
   5. Press F5 to execute the script.
   6. After execution, a database named Sales_DW will be created and populated with data.
   7. Refresh the Databases section to verify that Sales_DW has been created.
2. Open Microsoft Excel
   1. Launch Microsoft Excel.
   2. Go to the Data tab.
3. Connect Excel to SQL Server
   1. Click Get External Data.
   2. Select From Other Sources.
   3. Click From Data Connection Wizard.
4. Select the Data Source
   1. In the Data Connection Wizard, select Microsoft SQL Server.
   2. Click Next.
5. Provide Server Details
   1. Enter the SQL Server name.
   2. Click Next to continue.
6. Select Database and Tables
   1. Choose the database Sales_DW.
   2. Select the required tables and dimensions.
   3. Ensure relationships between tables are imported.
7. Save the Data Connection
   1. Choose a location to save the connection file.
   2. Click Finish.
8. Import Data into Excel
   1. In the Import Data dialog box, select PivotChart.
   2. Click OK.
9. Create Pivot Table and Chart
   1. In the Pivot Table Fields panel:
      * Drag SalesDateKey into Filters.
      * Drag ProductName into Axis (Rows).
      * Drag ProductActualCost into Values.
   2. Excel will generate a Pivot Table and Pivot Chart automatically.
10. Insert Pivot Table (Optional Step)
    1. Go to the Insert tab.
    2. Click Pivot Table.
    3. Choose Existing Connection.
    4. Select the saved connection with Sales_DW.
    5. Click OK to generate the Pivot Table and Chart.

Result:
The Sales Data Warehouse is connected to Excel, and a Pivot Table and Pivot Chart are successfully created for analyzing sales data.
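For comparison, the aggregation Excel performs in the Pivot Table can be sketched in pandas. The rows below are hypothetical sample data standing in for the Sales_DW fact table; the column names (ProductName, ProductActualCost, SalesDateKey) mirror the fields dragged into the Pivot Table above.

```python
import pandas as pd

# Hypothetical sample rows standing in for the Sales_DW fact table
df = pd.DataFrame({
    "SalesDateKey": [20120101, 20120101, 20120102, 20120102],
    "ProductName": ["Mouse", "Keyboard", "Mouse", "Monitor"],
    "ProductActualCost": [250.0, 400.0, 250.0, 3500.0],
})

# Equivalent of dragging ProductName to Rows and ProductActualCost to Values
pivot = pd.pivot_table(df, index="ProductName",
                       values="ProductActualCost", aggfunc="sum")
print(pivot)
```

Each row of the result corresponds to one Pivot Table row: the total actual cost per product.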


Practical 3 - Implementation of Classification algorithm in R Programming.


1. Consider the monthly rainfall details at a place starting from January 2012. We create an R time series object for a period of 12 months and plot it.

rainfall <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)
rainfall.timeseries <- ts(rainfall,start = c(2012,1),frequency = 12)
print(rainfall.timeseries)
png(file = "rainfall.png")
plot(rainfall.timeseries)
dev.off()
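The same series can be sketched in Python with pandas, mirroring ts(rainfall, start = c(2012,1), frequency = 12) with a monthly date index:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

rainfall = [799, 1174.8, 865.1, 1334.6, 635.4, 918.5,
            685.5, 998.6, 784.2, 985, 882.8, 1071]
# Monthly index starting January 2012, one value per month
series = pd.Series(rainfall,
                   index=pd.date_range("2012-01", periods=12, freq="MS"))
print(series)
series.plot(title="Monthly Rainfall")
plt.savefig("rainfall_py.png")
```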


2. Practical Implementation of Decision Tree using R Tool. We will use the R in-built data set named readingSkills to create a decision tree. It predicts whether a person is a native speaker based on the variables "age", "shoeSize" and "score".

install.packages("party")

library(party)
input.dat <- readingSkills[c(1:105),]
png(file = "decision_tree.png")
output.tree <- ctree(
  nativeSpeaker ~ age + shoeSize + score,
  data = input.dat)
plot(output.tree)
dev.off()
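A comparable classification tree can be sketched in Python with scikit-learn. Since the readingSkills data set is not available in Python, the data below is synthetic and purely illustrative: it assumes native speakers tend to score higher at the same age, loosely mimicking the R data set.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for readingSkills (illustrative assumption,
# not the real data set): score rises with age and native speaker status
rng = np.random.default_rng(0)
n = 200
age = rng.integers(5, 12, n)
shoe_size = age * 2.2 + rng.normal(0, 1, n)
native = rng.integers(0, 2, n)
score = age * 4 + native * 10 + rng.normal(0, 2, n)
X = np.column_stack([age, shoe_size, score])

# ctree analogue: predict nativeSpeaker from age, shoeSize and score
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, native)
print("training accuracy:", tree.score(X, native))
```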


Practical 4 - Perform data clustering using clustering algorithms in Python.


1. Write a Python program for k-means clustering.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4, random_state=0)
kmeans.fit(X)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=200, alpha=0.7, marker='X')
plt.title("K-Means Clustering Example")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
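The example above fixes n_clusters=4 because the blobs were generated with four centers. When the true number of clusters is unknown, a common heuristic is the elbow method: fit k-means for several values of k and look for the point where the inertia (within-cluster sum of squares) stops dropping sharply. A short sketch:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Inertia for k = 1..6; the "elbow" where the curve flattens
# suggests a reasonable number of clusters
inertias = [KMeans(n_clusters=k, random_state=0, n_init=10).fit(X).inertia_
            for k in range(1, 7)]
for k, inertia in zip(range(1, 7), inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```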


2. Write a Python program for DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.50, random_state=0)
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='plasma', s=50)
plt.title("DBSCAN Clustering Example")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
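Unlike k-means, DBSCAN does not need the number of clusters in advance and labels outliers with -1. A small follow-up sketch that counts the clusters found and the noise points:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.50, random_state=0)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# DBSCAN marks outliers with the label -1; everything else is a cluster id
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```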

Practical 5 - Perform linear regression on the given data warehouse data using R.


1. Create Relationship Model & get the Coefficients

x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)
print(relation)


2. Get the Summary of the Relationship

x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)
print(summary(relation))


3. Predict the weight of new persons

x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)
a <- data.frame(x = 170)
result <- predict(relation,a)
print(result)
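The same fit can be reproduced in Python with scikit-learn on the identical height/weight data, as a cross-check of the R result:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same height (x) and weight (y) data as the R example above
x = np.array([151, 174, 138, 186, 128, 136, 179, 163, 152, 131]).reshape(-1, 1)
y = np.array([63, 81, 56, 91, 47, 57, 76, 72, 62, 48])

model = LinearRegression().fit(x, y)
print("intercept:", model.intercept_, "slope:", model.coef_[0])
pred = model.predict([[170]])[0]
print("predicted weight at height 170:", pred)
```

The coefficients match what lm(y~x) reports, and the prediction at x = 170 agrees with the R predict() call.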


4. Visualize the Regression Graphically

x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)
png(file = "linearregression.png")
plot(y, x, col = "blue", main = "Height & Weight Regression",
     cex = 1.3, pch = 16, xlab = "Weight in Kg", ylab = "Height in cm")
abline(lm(x ~ y))
dev.off()

Practical 6 - Logistic Regression in R


1. Importing the Dataset

install.packages("dplyr")
library(dplyr)
head(mtcars)


2. Splitting the Dataset 
We are using the caTools package to randomly split the mtcars dataset into two parts: 80% for training (train_reg) and 20% for testing (test_reg). This allows us to train the logistic regression model on one set and evaluate its performance on unseen data. 

install.packages("caTools")
library(caTools)
split <- sample.split(mtcars$vs, SplitRatio = 0.8)
train_reg <- subset(mtcars, split == TRUE)
test_reg <- subset(mtcars, split == FALSE)


3. Building the model
Logistic regression is implemented in R using glm() by training the model using features or variables in the dataset.

logistic_model <- glm(vs ~ wt + disp, data = train_reg, family = "binomial")
logistic_model
summary(logistic_model)
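The same model family can be sketched in Python with scikit-learn. The mtcars data set is not built into Python, so the data below is synthetic and purely illustrative: a binary target whose probability falls as two numeric features grow, mimicking glm(vs ~ wt + disp, family = "binomial"), with an 80/20 train/test split as in the R example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for mtcars (assumed, not the real data):
# probability of vs = 1 falls as weight and displacement grow
rng = np.random.default_rng(1)
n = 200
wt = rng.normal(3.2, 1.0, n)
disp = rng.normal(230, 80, n)
logit = 8 - 1.5 * wt - 0.015 * disp
vs = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([wt, disp])
X_train, X_test, y_train, y_test = train_test_split(
    X, vs, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```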

Practical 7


1. Write a Python program to read data from a CSV file, perform simple data analysis, and generate basic insights.
(On custom data with columns such as Name, Age, Gender and Salary)

import pandas as pd
file_path = "Panda.csv"
df = pd.read_csv(file_path)
print("\n First 5 rows of the dataset:")
print(df.head())
print("\n Dataset Summary:")
df.info()
print("\n Basic Statistics:")
print(df.describe())
print("\n Missing Values in Each Column:")
print(df.isnull().sum())
print("\n Column Names:")
print(df.columns)
if 'Salary' in df.columns:
    print(f"\n Average Salary: {df['Salary'].mean():.2f}")
if 'Age' in df.columns:
    print(f"\n Youngest Person's Age: {df['Age'].min()}")
    print(f" Oldest Person's Age: {df['Age'].max()}")
if 'Gender' in df.columns:
    print("\n Gender Distribution:")
    print(df['Gender'].value_counts())
print("\n Data Analysis Completed Successfully!")
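The script above assumes a Panda.csv in the working directory. A minimal sketch that creates one with the assumed columns (Name, Age, Gender, Salary; the values are hypothetical) so the program can be run end to end:

```python
import pandas as pd

# Hypothetical sample rows matching the columns the analysis checks for
sample = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meera", "John"],
    "Age": [24, 31, 28, 45],
    "Gender": ["F", "M", "F", "M"],
    "Salary": [35000.0, 52000.0, 46000.0, 61000.0],
})
sample.to_csv("Panda.csv", index=False)

# Read it back the same way the analysis script does
df = pd.read_csv("Panda.csv")
print(df['Salary'].mean(), df['Age'].min(), df['Age'].max())
```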


2. Write a Python program to read data from a CSV file, perform simple data analysis, and generate basic insights.
(On Sales_Data.csv, available at ivno.hashom.online/Sample_Data.csv)

import pandas as pd
import matplotlib.pyplot as plt
# --- Step 1: Load the Data ---
try:
    file_path = "Sales_Data.csv"
    df = pd.read_csv(file_path)
    df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
    print("--- Dataset Preview ---")
    print(df.head(), "\n")
except FileNotFoundError:
    print("Error: Sales_Data.csv was not found in the working directory.")
    exit()
# --- Step 2: Basic Data Exploration ---
print("\n--- Data Info ---")
df.info()
print("\n--- Statistical Summary ---")
print(df.describe())
# --- Step 3: Data Analysis & Insights ---
# Insight 1: Total Revenue
total_revenue = df['TotalPrice'].sum()
# Insight 2: Sales by Region
region_sales = df.groupby('Region')['TotalPrice'].sum()
# Insight 3: Most Popular Product (By Quantity)
popular_product = df.groupby('Product')['Quantity'].sum().idxmax()
# --- Step 4: Display Results ---
print("\n--- Final Insights ---")
print(f"Total Revenue Generated: ${total_revenue:.2f}")
print("\nRevenue by Region:")
print(region_sales)
print(f"\nBest Selling Product: {popular_product}")
# --- Step 5: Line Graph - Yearly Sales Trend ---
df['Year'] = df['Date'].dt.year
yearly_sales = df.groupby('Year')['TotalPrice'].sum().reset_index()
yearly_sales = yearly_sales.sort_values('Year')
plt.figure()
plt.plot(
    yearly_sales['Year'],
    yearly_sales['TotalPrice'],
    marker='o',
    linestyle='-'
)
plt.title('Yearly Sales Performance')
plt.xlabel('Year')
plt.ylabel('Total Sales ($)')
plt.xticks(yearly_sales['Year'])
plt.grid(True)
plt.tight_layout()
plt.savefig('yearly_sales_performance.png')
print("\nYearly Sales Data:")
print(yearly_sales)
# --- Step 6: Bar Chart - Revenue by Region ---
region_data = df.groupby('Region')['TotalPrice'].sum().sort_values(ascending=False)
plt.figure(figsize=(10,6))
region_data.plot(kind='bar')
plt.title('Total Revenue by Region')
plt.xlabel('Region')
plt.ylabel('Total Sales ($)')
plt.xticks(rotation=0)
plt.tight_layout()
plt.savefig('sales_bar_chart.png')
plt.show()
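The script above expects a Sales_Data.csv with Date, Region, Product, Quantity and TotalPrice columns. A minimal sketch that fabricates such a file (all values hypothetical) so the analysis can be tried without downloading anything:

```python
import pandas as pd

# Hypothetical rows with the columns the analysis script relies on;
# dates are day-first to match pd.to_datetime(..., dayfirst=True)
rows = pd.DataFrame({
    "Date": ["05-01-2022", "17-03-2022", "09-06-2023", "21-11-2023"],
    "Region": ["North", "South", "North", "East"],
    "Product": ["Laptop", "Phone", "Laptop", "Tablet"],
    "Quantity": [2, 5, 1, 3],
    "TotalPrice": [1600.0, 2500.0, 800.0, 900.0],
})
rows.to_csv("Sales_Data.csv", index=False)

df = pd.read_csv("Sales_Data.csv")
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)
print("total revenue:", df["TotalPrice"].sum())
print("best seller:", df.groupby("Product")["Quantity"].sum().idxmax())
```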


Practical 8 - Perform data visualization


1. Perform data visualization using Python on any sales data.

Procedure -

Ensure that you have Coffe_sales.csv; if not, download it from https://ivno.hashom.online/Coffe_sales.csv

Code:

import pandas as pd
import matplotlib.pyplot as plt
import os
# Load the coffee sales data
file_path = os.path.expanduser("Coffe_sales.csv")
df = pd.read_csv(file_path, encoding='latin1')
# --- 1. Visualization: Total Revenue by Coffee Type ---
coffee_revenue = df.groupby('coffee_name')['money'].sum().sort_values(ascending=False).reset_index()
plt.figure()
plt.bar(coffee_revenue['coffee_name'], coffee_revenue['money'])
plt.title('Total Revenue by Coffee Type')
plt.xlabel('Coffee Name')
plt.ylabel('Total Revenue ($)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('coffee_revenue.png')
plt.close()
# --- 2. Visualization: Total Revenue by Day of Week ---
weekday_revenue = df.groupby(['Weekday','Weekdaysort'])['money'].sum().reset_index().sort_values('Weekdaysort')
plt.figure()
plt.bar(weekday_revenue['Weekday'], weekday_revenue['money'])
plt.title('Total Revenue by Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Total Revenue ($)')
plt.tight_layout()
plt.savefig('weekday_revenue.png')
plt.close()
print("Visualizations saved as 'coffee_revenue.png' and 'weekday_revenue.png'.")

2. Perform data visualization using Power BI on any Order data.

- Open Power BI.
- Click Get Data → OData Feed, enter the URL:
http://services.odata.org/V3/Northwind/Northwind.svc/
- Then click OK.
- Select the Orders table and click Transform Data.
- Go to Choose Columns → Choose Columns, deselect unnecessary columns and keep only the required ones. Click OK.
- Click Close & Apply to apply the changes.
- In the Visualizations pane, choose a chart type (e.g., Clustered Column Chart or Pie Chart).
- From the Data pane on the right side, drag the required fields into the chart areas.
Example:
CustomerID → X-Axis
OrderID → Y-Axis
- Adjust the visualization as needed.
- Save the Power BI file.

3. Perform data visualization using Power BI on Order data (using CSV file).

- Open Power BI.
- Click Get Data → Text/CSV.
- Browse and select the Order data CSV file, then click Open.
- In the preview window, click Transform Data.
- Go to Choose Columns → Choose Columns, deselect unnecessary columns and keep only the required ones. Click OK.
- Click Close & Apply to load and apply the changes.
- In the Visualizations pane, choose a chart type (e.g., Clustered Column Chart or Pie Chart).
- From the Data pane on the right side, drag the required fields into the chart areas.
Example:
CustomerID → X-Axis
OrderID → Y-Axis
- Adjust the visualization as needed.
- Save the Power BI (.pbix) file.