This notebook applies a scikit-learn decision tree classifier to objects from the Sloan Digital Sky Survey (SDSS), classifying each as a star, galaxy, or quasar (QSO). In this refined version of yesterday's analysis, the i and z columns are combined into a single feature using principal component analysis (PCA). This combined column is then used to predict whether an object is a star, galaxy, or quasar.

In [1]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Silence pandas' chained-assignment warning for the column assignments below
pd.options.mode.chained_assignment = None

First, we import all necessary libraries.

In [2]:
os.chdir('/home/wln/Documents/python_programs/Astronomy_Datasets')

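s = pd.read_csv("sloan_survey.csv")

# Keep the class label and the five photometric bands
sub = s[['class', 'u', 'g', 'r', 'i', 'z']]

Then, we select the class label and the five photometric band columns (u, g, r, i, z) from the DataFrame.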

In [3]:
plt.figure()
sns.scatterplot(x='u',y='i',data=s)
Out[3]:
<Axes: xlabel='u', ylabel='i'>
In [4]:
plt.figure()
sns.scatterplot(x='u',y='z',data=s)
Out[4]:
<Axes: xlabel='u', ylabel='z'>

These two plots compare the u (ultraviolet) band with the i (near-infrared) and z (infrared) bands. In both, the magnitudes of each object appear strongly correlated.

In [5]:
pca = PCA(n_components=1)

# Project the i and z magnitudes onto their first principal component
sub['pca'] = pca.fit_transform(sub[['i', 'z']]).ravel()

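Here the i and z columns are projected onto their first principal component, which gives a single combined feature stored in the pca column.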
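One way to check whether a single component is a reasonable summary of two bands is to look at the fraction of variance it explains. The following is a minimal sketch on synthetic data, not the SDSS file: the arrays `i_mag` and `z_mag` and their noise levels are made-up stand-ins chosen only so the two columns are strongly correlated.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-ins for the i and z magnitudes (assumed values, not SDSS data)
rng = np.random.default_rng(0)
i_mag = rng.normal(17.0, 1.0, size=500)
z_mag = i_mag + rng.normal(0.0, 0.1, size=500)
bands = np.column_stack([i_mag, z_mag])

# Reduce the two correlated columns to one principal component
pca_demo = PCA(n_components=1)
combined = pca_demo.fit_transform(bands)

# Fraction of the total variance captured by that component
explained = pca_demo.explained_variance_ratio_[0]
print(combined.shape)
print(f"variance explained by first component: {explained:.3f}")
```

On the real data, the same `explained_variance_ratio_` attribute of the fitted `pca` object would indicate how much information is lost by merging i and z.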

In [6]:
X = np.array(sub['pca'].fillna(0))
X = X[:, np.newaxis]

y = np.array(sub['class'])


X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.1)

This is where the train and test sets are created. Ten percent of the dataset is set aside for testing.
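As a possible refinement, the split can be made reproducible and class-balanced with the `random_state` and `stratify` parameters of `train_test_split`. The sketch below uses synthetic data; the label strings and class proportions are assumptions for illustration and may not match the survey file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic single-feature data with imbalanced classes (assumed labels and proportions)
rng = np.random.default_rng(0)
X_bal = rng.normal(size=(1000, 1))
y_bal = rng.choice(["STAR", "GALAXY", "QSO"], size=1000, p=[0.4, 0.5, 0.1])

# random_state fixes the shuffle; stratify keeps class proportions similar in both sets
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_bal, y_bal, test_size=0.1, random_state=42, stratify=y_bal
)

print(Xb_train.shape, Xb_test.shape)
print("QSO fraction overall:", np.mean(y_bal == "QSO"))
print("QSO fraction in test:", np.mean(yb_test == "QSO"))
```

Without stratification, a small test set can by chance contain very few examples of the rarest class, which makes the test score noisier.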

In [7]:
D = DecisionTreeClassifier()

D.fit(X_train,y_train)

y_pred = D.predict(X_test)

In this section, the decision tree classifier is fitted on the training set and then used to predict the class labels of the test set.
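The predicted labels can be compared against the true ones with an accuracy score and a confusion matrix, which shows which classes get mistaken for which. This is a minimal sketch on synthetic data; the class names, class centers, and noise level are assumptions chosen so that the classes overlap.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic one-feature data with three overlapping classes (not SDSS data)
rng = np.random.default_rng(0)
labels = np.array(["GALAXY", "QSO", "STAR"])
centers = {"GALAXY": 0.0, "QSO": 1.0, "STAR": 2.0}
y_syn = rng.choice(labels, size=600)
X_syn = np.array([centers[c] for c in y_syn]) + rng.normal(0.0, 1.0, size=600)
X_syn = X_syn[:, np.newaxis]

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.1, random_state=0
)

clf = DecisionTreeClassifier(random_state=0).fit(Xs_train, ys_train)
ys_pred = clf.predict(Xs_test)

# Rows are true classes, columns are predicted classes, in the order given by labels
acc = accuracy_score(ys_test, ys_pred)
cm = confusion_matrix(ys_test, ys_pred, labels=labels)
print(acc)
print(cm)
```

The diagonal of the matrix counts correct predictions, so its sum divided by the total equals the accuracy.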

In [9]:
sub['pred_class'] = D.predict(X)

# Accuracy on the full dataset (mostly training rows)
print(D.score(X, y))

# Accuracy on the held-out test set
print(D.score(X_test, y_test))
0.9536
0.536

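This model predicts the test-set labels with an accuracy of about 54% (0.536), slightly better than the previous model. The much higher score on the full dataset (0.9536), which consists mostly of training rows, suggests that the tree is overfitting. Further improvement will be needed.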
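One possible next step is to estimate accuracy with cross-validation and to compare an unrestricted tree against a depth-limited one, since a shallow tree is less able to memorize the training rows. The sketch below uses synthetic data with assumed class centers and noise, and the depth of 3 is an arbitrary illustrative choice rather than a tuned value.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic one-feature data with three overlapping classes (not SDSS data)
rng = np.random.default_rng(0)
y_cv = rng.integers(0, 3, size=600)
X_cv = (y_cv + rng.normal(0.0, 1.0, size=600)).reshape(-1, 1)

# Compare a fully grown tree (max_depth=None) with a shallow one
mean_scores = {}
for depth in (None, 3):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X_cv, y_cv, cv=5)  # 5-fold accuracy
    mean_scores[depth] = scores.mean()
    print(f"max_depth={depth}: mean CV accuracy = {scores.mean():.3f}")
```

Averaging over five folds gives a steadier estimate than a single 10% hold-out, which can shift noticeably from one random split to the next.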
