Today I noticed a function, sklearn.datasets.make_classification, which lets users generate synthetic classification data for experiments. The documentation is here.
It looks like this function can generate all sorts of data to fit a user's needs. The general API has the form
```python
sklearn.datasets.make_classification(n_samples=100, n_features=20, n_informative=2,
                                     n_redundant=2, n_repeated=0, n_classes=2,
                                     n_clusters_per_class=2, weights=None, flip_y=0.01,
                                     class_sep=1.0, hypercube=True, shift=0.0,
                                     scale=1.0, shuffle=True, random_state=None)
```
In the documentation, it says:
This initially creates clusters of points normally distributed (std=1) about vertices of an n_informative-dimensional hypercube with sides of length 2*class_sep and assigns an equal number of clusters to each class. It introduces interdependence between these features and adds various types of further noise to the data.
To be honest, it was not clear to me what this function does, so I went to check the source code.
It looks like the basic idea is to generate centroids for the different clusters based on the number of classes, with some randomness. How close the cluster centroids are to each other can be controlled by setting
class_sep. Since the data are generated from a Gaussian distribution with variance 1, this argument controls how much each cluster overlaps with the others.
```python
# Build the polytope whose vertices become cluster centroids
centroids = _generate_hypercube(n_clusters, n_informative,
                                generator).astype(float)
centroids *= 2 * class_sep  # determine how centroids are scaled
centroids -= class_sep
if not hypercube:
    centroids *= generator.rand(n_clusters, 1)
    centroids *= generator.rand(1, n_informative)
```
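To see what the shift-and-scale does, here is a toy reconstruction (mine, not sklearn's — I hard-code two vertices instead of calling the private _generate_hypercube helper) for two clusters in two informative dimensions. After the scaling, the vertex coordinates land at ±class_sep:

```python
import numpy as np

class_sep = 1.0

# toy stand-in for _generate_hypercube: pick 2 distinct vertices
# of the 2-dimensional unit hypercube {0, 1}^2
centroids = np.array([[0, 0], [1, 1]], dtype=float)

centroids *= 2 * class_sep  # scale the cube sides to length 2*class_sep
centroids -= class_sep      # center the cube on the origin

print(centroids)  # vertices now sit at -class_sep and +class_sep
# [[-1. -1.]
#  [ 1.  1.]]
```

So with the default class_sep=1.0 the centroids sit at opposite corners of a cube centered on the origin, and increasing class_sep just pushes those corners further apart.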
Then it draws the informative features from a normal (Gaussian) distribution, splits the data into clusters by shifting part of the data toward each centroid, and adds correlations among the informative features by multiplying the feature matrix with a randomly generated matrix (the comment calls it covariance, though it looks more like a random mixing matrix):
```python
# Initially draw informative features from the standard normal
X[:, :n_informative] = generator.randn(n_samples, n_informative)

# Create each cluster; a variant of make_blobs
stop = 0
for k, centroid in enumerate(centroids):
    start, stop = stop, stop + n_samples_per_cluster[k]
    y[start:stop] = k % n_classes  # assign labels
    X_k = X[start:stop, :n_informative]  # slice a view of the cluster

    A = 2 * generator.rand(n_informative, n_informative) - 1
    X_k[...] = np.dot(X_k, A)  # introduce random covariance

    X_k += centroid  # shift the cluster to a vertex
```
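Here is a small sketch of my own (fixed seeds chosen for reproducibility) showing why multiplying by that random matrix A creates correlated features: the raw standard-normal columns are independent, while the mixed columns share variance through A:

```python
import numpy as np

rng = np.random.RandomState(42)
n_samples, n_informative = 10000, 2

X = rng.randn(n_samples, n_informative)  # independent columns

# fixed seed for the mixing matrix so the example is reproducible
A = 2 * np.random.RandomState(0).rand(n_informative, n_informative) - 1
X_mixed = X.dot(A)  # same trick as in the source

print(np.corrcoef(X.T)[0, 1])        # close to 0
print(np.corrcoef(X_mixed.T)[0, 1])  # clearly non-zero for this seed
```

The off-diagonal correlation of the mixed features depends on the random draw of A, which is why different clusters in make_classification end up with different shapes.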
The function then adds redundant features with some correlation, without breaking the informative features it already created,
```python
# Create redundant features
if n_redundant > 0:
    B = 2 * generator.rand(n_informative, n_redundant) - 1
    X[:, n_informative:n_informative + n_redundant] = \
        np.dot(X[:, :n_informative], B)
```
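Since the redundant columns are plain matrix products of the informative ones, they should be (almost) exactly linear combinations of them. A quick check of my own, using shuffle=False so the informative columns come first and the redundant ones right after:

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5,
                           n_informative=3, n_redundant=2,
                           shuffle=False, random_state=0)

# with shuffle=False the first 3 columns are informative and the next
# 2 are redundant; regress the redundant columns on the informative
# ones (plus an intercept) and look at the residuals
design = np.c_[X[:, :3], np.ones(len(X))]
coef, res, *_ = np.linalg.lstsq(design, X[:, 3:], rcond=None)
print(res)  # residuals ~ 0: redundant features are linear combos
```

This also explains why redundant features add no new information for a linear model, only extra (correlated) dimensions.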
as well as some other characteristics of the dataset (repeated features, shifting, scaling, and so on, per the arguments above).
One thing I'm particularly interested in is the label-flipping part:
```python
# Randomly replace labels
if flip_y >= 0.0:
    flip_mask = generator.rand(n_samples) < flip_y
    y[flip_mask] = generator.randint(n_classes, size=flip_mask.sum())
```
For each data point, the function changes its target to a random class (which can happen to be the true class) at rate
flip_y, a float between 0 and 1. This gives some of the data the wrong target; in other words, it introduces label noise into the dataset.
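Because a "flipped" label can land back on its original class, the fraction of labels that actually change is flip_y * (n_classes - 1) / n_classes, not flip_y itself. A quick simulation of mine, reusing the same flipping logic as the snippet above:

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_classes, flip_y = 100000, 3, 0.3

y = rng.randint(n_classes, size=n_samples)
y_true = y.copy()

# the same flipping logic as in the source snippet
flip_mask = rng.rand(n_samples) < flip_y
y[flip_mask] = rng.randint(n_classes, size=flip_mask.sum())

# expected changed fraction: flip_y * (n_classes - 1) / n_classes
print((y != y_true).mean())  # ~ 0.3 * 2/3 = 0.2
```

So with 3 classes and flip_y=0.3, only about 20% of the labels are actually wrong.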
There are two knobs that control the noise (that is, how much the different classes overlap):
class_sep: determines how far apart the clusters are. Larger values make the classes overlap less.
flip_y: determines what fraction of data points are randomly relabelled (label noise).
Let's now look at the output of the function:

```python
x, y = make_classification(n_features=2, n_redundant=0, n_samples=1000,
                           n_classes=3, n_clusters_per_class=1)
```
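Before plotting, it's worth sanity-checking the shapes and the class balance of what comes back (I add a random_state here so the check is reproducible):

```python
import numpy as np
from sklearn.datasets import make_classification

x, y = make_classification(n_features=2, n_redundant=0, n_samples=1000,
                           n_classes=3, n_clusters_per_class=1,
                           random_state=0)

print(x.shape, y.shape)  # (1000, 2) (1000,)
print(np.bincount(y))    # roughly 333 samples per class
```

With weights=None (the default) the samples are split evenly across the classes, up to the handful of labels moved by the default flip_y=0.01.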
The 3-class dataset looks not too bad. Here I left
flip_y at its default value, 0.01.
Let's create a noisier dataset by setting
flip_y to 0.3:

```python
x, y = make_classification(n_features=2, n_redundant=0, n_samples=1000,
                           n_classes=3, n_clusters_per_class=1, flip_y=0.3)
```
How about increasing class_sep as well?

```python
x, y = make_classification(n_features=2, n_redundant=0, n_samples=1000,
                           n_classes=3, n_clusters_per_class=1, flip_y=0.3,
                           class_sep=5)
```
The data points are now clearly split into three separate clusters, but the classes still overlap because of the high value of
flip_y. This produces a completely different type of error: previously the classes were simply hard to separate; now the clusters are easy to separate, but within each cluster some of the points belong to other classes (label noise).
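One way to see the difference is that with well-separated clusters, label noise puts a hard ceiling on accuracy: even a classifier that recovers the clusters perfectly cannot beat 1 - flip_y * (n_classes - 1) / n_classes. A small experiment of my own (logistic regression is my choice here, not something from the post):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

x, y = make_classification(n_features=2, n_redundant=0, n_samples=1000,
                           n_classes=3, n_clusters_per_class=1,
                           flip_y=0.3, class_sep=5, random_state=0)

# the clusters are far apart, so essentially all remaining errors come
# from flipped labels; accuracy should plateau near 1 - 0.3 * 2/3 = 0.8
scores = cross_val_score(LogisticRegression(), x, y, cv=5)
print(scores.mean())
```

No amount of extra model capacity would push past that plateau, because the wrong labels are irreducible noise.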
So this function helps people quickly generate an artificial dataset with the properties they want. I'm not sure how closely these datasets resemble real ones, but they are useful when people want to experiment with the properties of predictive models.
This is the first blog post I've ever written. There are two reasons why I wanted to start blogging:
Hopefully I can achieve at least one of the goals.