WHOLESALE CUSTOMERS DATASET

Sriraag Av
5 min read · Oct 13, 2019

Hello again!

This is my second post. This time I’m working with a dataset called Wholesale customers. Let’s get into it.

ABOUT THE DATASET

The data set refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories.

Data Set Characteristics: Multivariate

Number of Instances: 440

Area: Business

Attribute Characteristics: Integer

Number of Attributes: 8

Associated Tasks: Classification

Loading Data

import pandas as pd
df = pd.read_csv("/content/Wholesale customers data.csv")

Reading Data

print(df)

Initialising X and y, where X holds the feature columns and y is the target variable (the first column of the file).

X = df.iloc[:, 1:]  # Features: every column except the first
y = df.iloc[:, 0]   # Target variable: the first column, as a 1-D Series
print(X)
print(y)

Normalization

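Normalization is a data preprocessing technique that rescales values to a common range, which often helps the accuracy of a model. Note that preprocessing.normalize rescales each row (sample) to unit norm by default; to restrict each attribute to the range 0 to 1, sklearn's MinMaxScaler is the usual choice.

from sklearn import preprocessing
normalized_X = preprocessing.normalize(X)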
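To see what the different rescaling choices do, here is a small sketch on made-up numbers. It compares row-wise unit-norm scaling (the default behaviour of preprocessing.normalize) with column-wise min-max scaling, using plain NumPy. The array X_demo and the other names are only for illustration and are not part of the dataset.

```python
import numpy as np

# Toy data: 3 customers (rows) x 2 spending categories (columns). Values are made up.
X_demo = np.array([[3.0, 4.0],
                   [6.0, 8.0],
                   [0.0, 5.0]])

# Row-wise L2 normalization: every row is divided by its own length.
row_norms = np.linalg.norm(X_demo, axis=1, keepdims=True)
X_unit = X_demo / row_norms

# Column-wise min-max scaling: every attribute is mapped into [0, 1].
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_minmax = (X_demo - col_min) / (col_max - col_min)

print(X_unit)
print(X_minmax)
```

The first two rows end up identical after row-wise scaling, because they point in the same direction even though one customer spends twice as much. Min-max scaling keeps that difference.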

Training and Testing of Model

In training and testing, we fit the model on one part of the dataset and then predict the values of y on the rest. In the statement below I set the test size to 20%, so the remaining 80% of the data is used for training. On that 80% the model sees the attributes together with the real values of y; on the other 20% it only sees the attributes and has to predict y from what it learned.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
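As a rough picture of what an 80/20 split does, here is a minimal sketch with NumPy: shuffle the row positions, then cut off the first 20% for testing. This is only an illustration of the idea and not sklearn's actual implementation; the seed and variable names are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the split is repeatable
n_rows = 440                     # number of instances in the dataset
test_size = 0.2

indices = rng.permutation(n_rows)          # shuffled row positions 0..439
n_test = int(round(n_rows * test_size))    # rows held out for testing
test_idx = indices[:n_test]
train_idx = indices[n_test:]

print(len(train_idx), len(test_idx))
```

Passing random_state to train_test_split plays the same role as the fixed seed here: the same rows land in the test set on every run, which makes accuracies comparable between models.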

CART (Classification And Regression Trees)

Importing the libraries needed for the Decision Tree Classifier.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

Here I’ll use a Decision Tree Classifier. CART picks, at each node, the attribute and threshold whose split gives the lowest Gini impurity.

clf = DecisionTreeClassifier() 
clf = clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
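To make the Gini criterion concrete, here is a small sketch in plain Python. It computes the impurity of a set of labels and the weighted impurity of a candidate split. The function names and the label lists are made up for illustration; DecisionTreeClassifier does this internally in a far more optimized way.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_k^2) over the class proportions p_k in labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def split_impurity(left, right):
    """Weighted average Gini impurity of the two sides of a candidate split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

# Made-up class labels on either side of two candidate splits.
pure_split = split_impurity([1, 1, 1], [2, 2, 2])
mixed_split = split_impurity([1, 2, 1], [2, 1, 2])
print(pure_split, mixed_split)
```

The split that separates the classes perfectly has the lower impurity, so it is the one a CART-style tree would prefer.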

Accuracy

After predicting the values of y, we compare the predicted values with the actual values to find out how the model is performing. You can vary the test size and see how the accuracy changes.

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
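Accuracy is simply the share of predictions that match the true labels. A minimal hand-written version, with made-up labels, looks like this (the helper name accuracy is my own):

```python
def accuracy(y_true, y_predicted):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_predicted):
        raise ValueError("inputs must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_predicted) if t == p)
    return matches / len(y_true)

demo_true = [1, 2, 1, 1, 2]
demo_pred = [1, 2, 2, 1, 2]
print("Accuracy:", accuracy(demo_true, demo_pred))
```

Four of the five made-up predictions are correct, so the score is 0.8.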

Decision Tree

from io import StringIO  # sklearn.externals.six is gone from recent scikit-learn versions
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydotplus
dot_data = StringIO()  # in-memory buffer that receives the tree in DOT format
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('wines.png')
Image(graph.create_png())

Logistic Regression

Importing the libraries required for logistic regression.

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

Classifying the dataset using logistic regression. Logistic regression passes a weighted sum of the attributes through the sigmoid function to get a class probability.

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
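Here is a small sketch of how the sigmoid turns a linear score into a probability and then into a label. The weights, bias and feature values are invented for illustration; in practice LogisticRegression learns them during fit.

```python
import math

def sigmoid(z):
    """Map a real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one sample under a linear model."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return probability, int(probability >= threshold)

# Made-up numbers: the linear score is 1.0 * 2.0 + 1.0 * (-1.0) + 0.0 = 1.0
probability, label = predict_label([2.0, -1.0], [1.0, 1.0], 0.0)
print(round(probability, 3), label)
```

A score of 0 maps to a probability of exactly 0.5, which is why 0.5 is the usual decision threshold.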

Predicting the y values and comparing them with the real y values to check the accuracy and viability of the model.

y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

Classification Report

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
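The report lists precision, recall and F1 score per class. As a sketch of what those numbers mean, here they are computed by hand for one class on made-up labels. The function name and the data are illustrative only.

```python
def precision_recall_f1(y_true, y_predicted, positive):
    """Precision, recall and F1 for the class given as `positive`."""
    pairs = list(zip(y_true, y_predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

demo_true = [1, 1, 2, 2, 1]
demo_pred = [1, 2, 2, 2, 1]
precision, recall, f1 = precision_recall_f1(demo_true, demo_pred, positive=2)
print(precision, recall, f1)
```

Precision answers "of everything predicted as class 2, how much really was class 2", while recall answers "of all real class 2 samples, how many were found".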

Perceptron

The Perceptron is the simplest form of a neural network. Here I classify the Wholesale customers dataset with a Perceptron, after importing the class needed for this single-layer network.

from sklearn.linear_model import Perceptron
pn = Perceptron(tol=1e-3, random_state=0)
pn.fit(X_train, y_train)

Training Accuracy

pn.score(X_train,y_train)

Testing Accuracy

pn.score(X_test,y_test)
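For a feel of how a perceptron learns, here is a minimal sketch of the classic update rule on a tiny, made-up and linearly separable dataset. It assumes labels of -1 and +1 and is not how sklearn's Perceptron is implemented internally; all names below are my own.

```python
import numpy as np

def train_perceptron(X, y, epochs=10, learning_rate=1.0):
    """Classic perceptron rule for labels in {-1, +1}."""
    weights = np.zeros(X.shape[1])
    bias = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            prediction = 1 if xi @ weights + bias > 0 else -1
            if prediction != target:
                # Nudge the boundary towards the misclassified sample.
                weights += learning_rate * target * xi
                bias += learning_rate * target
    return weights, bias

X_demo = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y_demo = np.array([1, 1, -1, -1])
w, b = train_perceptron(X_demo, y_demo)
predictions = np.where(X_demo @ w + b > 0, 1, -1)
print(predictions)
```

The weights only change when a sample is misclassified, so on separable data like this the updates stop once every point is on the right side.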

Neural Network

Neural networks are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling or clustering raw input. The patterns they recognize are numerical, contained in vectors, into which all real-world data, be it images, sound, text or time series, must be translated. Neural networks help us cluster and classify.

Importing libraries for neural network

from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

Specifying model type, input shape and configuring compile method

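model = Sequential()
model.add(Dense(13, input_dim=X_train.shape[1], activation='relu'))  # input size = number of feature columns
model.add(Dense(7, activation='relu'))
model.add(Dense(3, activation='softmax'))  # assumes labels 1 and 2: to_categorical then yields columns 0..2
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])  # matches one-hot targets with softmax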

Training and Testing of Dataset

y_test_cat = to_categorical(y_test)
y_train_cat = to_categorical(y_train)
model.fit(X_train, y_train_cat, epochs=150, batch_size=10)
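When num_classes is not given, to_categorical creates one column per integer from 0 up to the largest label. The NumPy sketch below mimics that behaviour on made-up labels, to show why labels that start at 1 produce an extra, always-zero column 0. The helper one_hot is my own and is not part of Keras.

```python
import numpy as np

def one_hot(labels):
    """One-hot encode integer labels into columns 0 .. max(labels)."""
    labels = np.asarray(labels, dtype=int).ravel()
    n_classes = labels.max() + 1
    encoded = np.zeros((labels.size, n_classes))
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

demo = one_hot([1, 2, 2, 1])
print(demo.shape)
print(demo)
```

The number of columns here is the number of units the final softmax layer needs, so it is worth printing y_train_cat.shape before building the model.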

Training Accuracy

_, accuracy = model.evaluate(X_train, y_train_cat)
print('Accuracy: %.2f' % (accuracy*100))

Testing Accuracy

_, accuracy = model.evaluate(X_test, y_test_cat)
print('Accuracy: %.2f' % (accuracy*100))

Random Forest

Random forest, like its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction.

Random Forest Classification

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 900, criterion = 'gini', random_state = 0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
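The voting idea can be sketched in a few lines of plain Python. Each inner list below stands for the predictions of one tree on three samples; the numbers are made up. Note that sklearn's RandomForestClassifier actually averages the class probabilities of its trees rather than counting hard votes, so treat this as the textbook picture only.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """tree_predictions: one list of predicted labels per tree, same length each."""
    n_samples = len(tree_predictions[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(tree[i] for tree in tree_predictions)
        voted.append(votes.most_common(1)[0][0])  # label with the most votes
    return voted

demo_trees = [[1, 2, 2],
              [1, 1, 2],
              [2, 1, 2]]
print(majority_vote(demo_trees))
```

With three trees, a single tree that gets a sample wrong is outvoted by the other two, which is where the ensemble gets its robustness.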

Accuracy

print(classifier.score(X_test, y_test))
