WHOLESALE CUSTOMERS DATASET
Hello again!
This is my second post. This time I’m working with a dataset called Wholesale customers. Let’s get into it.
ABOUT THE DATASET
The data set refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories.
Data Set Characteristics: Multivariate
Number of Instances: 440
Area: Business
Attribute Characteristics: Integer
Number of Attributes: 8
Associated Tasks: Classification
Loading Data
import pandas as pd
df = pd.read_csv("/content/Wholesale customers data.csv")
Reading Data
print(df)
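Initialising X and y, where X holds the feature columns and y is the target variable. The first column of the file is the target, and the remaining seven columns are the features.
X = df.iloc[:, 1:]  # Features: every column after the first
y = df.iloc[:, 0]   # Target variable: the first column, as a 1-D Series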
print(X)
print(y)
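To see which columns those slices pick up, here is a small sketch on a toy frame. The column names follow the UCI file as I understand it, and the numbers are made up purely for illustration. It also shows that `iloc[:, :-7]` returns a one-column DataFrame, while `iloc[:, 0]` returns a 1-D Series, which is the shape scikit-learn expects for a target.

```python
import pandas as pd

# Toy rows with made-up values; only the column layout mirrors the real file.
toy = pd.DataFrame({
    "Channel": [2, 1, 1],
    "Region": [3, 3, 1],
    "Fresh": [100, 200, 300],
    "Milk": [10, 20, 30],
    "Grocery": [11, 21, 31],
    "Frozen": [12, 22, 32],
    "Detergents_Paper": [13, 23, 33],
    "Delicassen": [14, 24, 34],
})

X_toy = toy.iloc[:, 1:]        # every column except the first
y_frame = toy.iloc[:, :-7]     # first column, kept as a DataFrame
y_series = toy.iloc[:, 0]      # first column, as a 1-D Series

print(list(X_toy.columns))
print(y_frame.shape, y_series.shape)
```

Both target slices hold the same values; only the shape differs.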
Normalization
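Normalization rescales the data so that attributes with large values do not dominate the model, which often improves accuracy. Note that preprocessing.normalize scales each row (sample) to unit length; to restrict each attribute to the range 0 to 1 you would use MinMaxScaler instead. Normalization is a part of data preprocessing. The models below are trained on X as-is, so pass normalized_X to train_test_split if you want to use the scaled features.
from sklearn import preprocessing
normalized_X = preprocessing.normalize(X)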
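It may help to compare two common scalers side by side on a tiny array. This is only a sketch on made-up numbers: `preprocessing.normalize` rescales each row to unit length, while `MinMaxScaler` maps each column to the range 0 to 1.

```python
import numpy as np
from sklearn import preprocessing

# Made-up values, three samples with two attributes each.
toy = np.array([[3.0, 4.0],
                [6.0, 8.0],
                [0.0, 5.0]])

# Each row is divided by its own L2 norm, so every row ends up with length 1.
row_scaled = preprocessing.normalize(toy)

# Each column is shifted and scaled so its minimum is 0 and its maximum is 1.
col_scaled = preprocessing.MinMaxScaler().fit_transform(toy)

print(row_scaled)
print(col_scaled)
```

Which one fits better depends on the model; for spending data with very different ranges per category, per-column scaling is usually the one people mean by "between 0 and 1".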
Training and Testing of Model
To train and test the model, we fit it on one part of the dataset and then predict the values of y on the rest. In the statement below I set the test size to 20%, so the remaining 80% of the data trains the model. On that 80% the model is given the real values of y along with the attributes; on the other 20% it has to predict y from the attributes alone, using what it learned from the training instances.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
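As a rough check of what an 80/20 split looks like on 440 rows, here is a sketch with placeholder data. The `random_state` and `stratify` arguments are my additions, not part of the original call: the first makes the split repeatable, the second keeps the class proportions similar in both parts. The class balance below is invented.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset.
X_demo = np.arange(440 * 2).reshape(440, 2)
y_demo = np.array([1] * 300 + [2] * 140)   # made-up class balance

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,      # 20% of the rows go to the test set
    random_state=42,    # fixed seed so the split is repeatable
    stratify=y_demo,    # keep class proportions similar in both parts
)

print(X_tr.shape, X_te.shape)
```

Without a fixed seed, every run produces a different split, so the accuracy numbers will move a little from run to run.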
CART (Classification And Regression Trees)
Importing the libraries needed for the Decision Tree Classifier.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
Here I’ll use the Decision Tree Classifier for classification. At each node, CART picks the attribute and threshold whose split gives the lowest Gini impurity.
clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
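To make the Gini impurity idea concrete, here is a minimal sketch of the calculation on toy labels. The helper names `gini` and `split_gini` are mine; scikit-learn does this internally and far more efficiently.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Impurity of a split: size-weighted average of the two child impurities."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini([1, 1, 2, 2]))          # evenly mixed node
print(gini([1, 1, 1, 1]))          # pure node
print(split_gini([1, 1], [2, 2]))  # a split that separates the classes perfectly
```

A candidate split is better the lower its weighted impurity, so the tree prefers the split in the last line over leaving the mixed node as it is.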
Accuracy
After predicting the values of y, we compare the predicted values with the actual values to find out how the model is performing. You can vary the test size to see how the accuracy changes.
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
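Accuracy is simply the share of predictions that match the real labels. The sketch below computes it by hand on toy labels, next to a majority-class baseline (always predicting the most common class), which is a useful number to compare against. Both helper names are mine.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the real label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most common class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

actual = [1, 1, 1, 2, 2]       # made-up labels
predicted = [1, 1, 2, 2, 2]    # made-up predictions

print(accuracy(actual, predicted))
print(majority_baseline(actual))
```

A model is only doing useful work if its accuracy is clearly above the baseline.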
Decision Tree
from io import StringIO  # sklearn.externals.six was removed from recent scikit-learn releases
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydotplus
dot_data = StringIO()
# Write the fitted tree in Graphviz dot format into the in-memory buffer
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('wines.png')
Image(graph.create_png())
Logistic Regression
Importing libraries required for Logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
Classifying the dataset using logistic regression. Logistic regression uses the sigmoid function to turn a weighted sum of the attributes into a class probability.
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
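For intuition, here is a small sketch of the sigmoid function and how it turns a weighted sum into a probability. The weights, bias and inputs are invented for illustration; they are not what `logreg` learns.

```python
import math

def sigmoid(z):
    """Map any real number to (0, 1); written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_probability(weights, bias, features):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights and one hypothetical sample.
p = predict_probability([0.5, -0.25], 0.1, [2.0, 4.0])
print(p)
print(sigmoid(0))   # a score of 0 sits exactly on the decision boundary
```

A probability above 0.5 is classified as the positive class, below 0.5 as the other one.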
Predicting the y values and comparing them with the real y values to check the accuracy and viability of the model.
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))
Classification Report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
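The report lists precision, recall and F1 score per class. Here is a hand-rolled sketch of those three numbers for one class, on made-up labels, so the columns are easier to read. The function name is mine.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for the class given as `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # correctly flagged
    fp = sum(t != positive and p == positive for t, p in pairs)  # wrongly flagged
    fn = sum(t == positive and p != positive for t, p in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

actual = [1, 1, 1, 2, 2, 2]      # made-up labels
predicted = [1, 1, 2, 2, 2, 1]   # made-up predictions

print(precision_recall_f1(actual, predicted, positive=2))
```

Precision answers "of everything predicted as this class, how much was right", and recall answers "of everything that really was this class, how much did we find".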
Perceptron
The Perceptron is the simplest form of neural network. Here it classifies the Wholesale customers dataset. Importing the necessary library for a single-layer neural network.
from sklearn.linear_model import Perceptron
pn = Perceptron(tol=1e-3, random_state=0)
pn.fit(X_train, y_train)
Training Accuracy
pn.score(X_train,y_train)
Testing Accuracy
pn.score(X_test,y_test)
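To show what a perceptron does under the hood, here is a minimal sketch of the classic learning rule on a toy, linearly separable problem (logical AND). It is a teaching version with labels -1 and +1, not scikit-learn's implementation, and the function names are mine.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Classic perceptron rule: nudge the weights only on misclassified samples."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):   # target is -1 or +1
            activation = bias + sum(w * xi for w, xi in zip(weights, x))
            predicted = 1 if activation >= 0 else -1
            if predicted != target:
                weights = [w + lr * target * xi for w, xi in zip(weights, x)]
                bias += lr * target
    return weights, bias

def predict(weights, bias, x):
    return 1 if bias + sum(w * xi for w, xi in zip(weights, x)) >= 0 else -1

# Toy data: the output is +1 only when both inputs are 1.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [-1, -1, -1, 1]

w, b = train_perceptron(points, targets)
print([predict(w, b, p) for p in points])
```

Because a single perceptron can only draw one straight boundary, it works well when the classes are roughly linearly separable and struggles otherwise.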
Neural Network
Neural networks are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling or clustering raw input. The patterns they recognize are numerical, contained in vectors, into which all real-world data, be it images, sound, text or time series, must be translated. Neural networks help us cluster and classify.
Importing libraries for neural network
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
Specifying model type, input shape and configuring compile method
model = Sequential()
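model.add(Dense(13, input_dim=7, activation='relu'))  # input_dim must match the 7 feature columns in X
model.add(Dense(7, activation='relu'))
model.add(Dense(3, activation='softmax'))  # to_categorical on labels 1 and 2 yields 3 columns (0, 1, 2)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])  # softmax output pairs with categorical cross-entropy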
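For a softmax output layer, the usual loss is categorical cross-entropy. The sketch below works through both on plain Python lists so the numbers are easy to follow. The scores are invented; Keras computes the same quantities on whole batches.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Negative log of the probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))

probs = softmax([2.0, 1.0, 0.1])               # made-up output scores
loss = categorical_cross_entropy([1, 0, 0], probs)

print(probs)
print(loss)
```

The loss shrinks toward 0 as the probability given to the true class approaches 1.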
Training and Testing of Dataset
y_test_cat=to_categorical(y_test)
y_train_cat=to_categorical(y_train)
model.fit(X_train, y_train_cat, epochs=150, batch_size=10)
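`to_categorical` turns integer labels into one-hot rows, and by default it creates one column per value from 0 up to the largest label. Here is a NumPy sketch of the same idea (the `one_hot` helper is mine) showing why labels 1 and 2 produce three columns, with the first column always empty.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """One row per label, with a 1 in the column whose index equals the label."""
    labels = np.asarray(labels, dtype=int).ravel()
    if num_classes is None:
        num_classes = labels.max() + 1         # columns 0 .. largest label
    return np.eye(num_classes)[labels]

encoded = one_hot([1, 2, 1])
print(encoded)
print(encoded.shape)

# Shifting the labels to start at 0 removes the unused column.
shifted = one_hot(np.array([1, 2, 1]) - 1)
print(shifted.shape)
```

Either way works, as long as the number of units in the last Dense layer matches the number of one-hot columns.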
Training Accuracy
_, accuracy = model.evaluate(X_train, y_train_cat)
print('Accuracy: %.2f' % (accuracy*100))
Testing Accuracy
_, accuracy = model.evaluate(X_test, y_test_cat)
print('Accuracy: %.2f' % (accuracy*100))
Random Forest
Random forest, like its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction.
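The voting idea can be sketched in a few lines. The tree predictions below are made up, and the helper name is mine. As far as I know, scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities rather than counting hard votes, so treat this as an illustration of the concept, not of the library internals.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """tree_predictions holds one list of class labels per tree, all the same length."""
    per_sample = zip(*tree_predictions)        # regroup the votes by sample
    return [Counter(votes).most_common(1)[0][0] for votes in per_sample]

# Three hypothetical trees, each predicting a class for the same three samples.
votes = [
    [1, 2, 2],
    [1, 1, 2],
    [2, 1, 2],
]

print(majority_vote(votes))
```

Using an odd number of trees with two classes avoids ties in this simple version.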
Random Forest Classification
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 900, criterion = 'gini', random_state = 0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
Accuracy
print(classifier.score(X_test, y_test))