Generate experimental classification data

Today I noticed the function sklearn.datasets.make_classification, which lets users generate synthetic classification data for experiments. The documentation is here.

It looks like this function can generate all sorts of data to suit a user’s needs. The general API has the form

sklearn.datasets.make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)

The documentation says

This initially creates clusters of points normally distributed (std=1) about vertices of an n_informative-dimensional hypercube with sides of length 2*class_sep and assigns an equal number of clusters to each class. It introduces interdependence between these features and adds various types of further noise to the data.

To be honest, it was not clear to me what this function does, so I went to check the source code.

It looks like the basic idea is to generate centroids for the clusters (n_classes * n_clusters_per_class of them), with some randomness. How close the cluster centroids are to each other can be controlled by class_sep. Since the data are drawn from a Gaussian distribution with standard deviation 1, this argument controls how much each cluster overlaps with the others.

# Build the polytope whose vertices become cluster centroids
centroids = _generate_hypercube(n_clusters, n_informative, generator).astype(float)
centroids *= 2 * class_sep  # determine how centroids are scaled
centroids -= class_sep
if not hypercube:
    centroids *= generator.rand(n_clusters, 1)
    centroids *= generator.rand(1, n_informative)
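
To get a feel for the geometry, here is a minimal sketch of my own (not the library’s code) that reproduces the centroid placement for a 2-informative-feature, 4-cluster case. Enumerating vertices with itertools.product is my stand-in for _generate_hypercube, which as I understand it returns distinct hypercube vertices:

import numpy as np
from itertools import product

class_sep = 1.0
n_informative = 2

# Stand-in for _generate_hypercube: enumerate all 2**n_informative
# vertices of the unit hypercube (the real helper samples distinct ones).
vertices = np.array(list(product([0, 1], repeat=n_informative)), dtype=float)

# Same scaling as the source: map {0, 1} to {-class_sep, +class_sep},
# i.e. a hypercube with sides of length 2 * class_sep centered at the origin.
centroids = vertices * 2 * class_sep - class_sep
print(centroids)
# [[-1. -1.]
#  [-1.  1.]
#  [ 1. -1.]
#  [ 1.  1.]]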

Then it draws the informative features from a standard normal (Gaussian) distribution, splits the data into the n_classes * n_clusters_per_class clusters by shifting a portion of the points toward each centroid, and adds correlations among the informative features by multiplying each cluster’s feature matrix with a randomly generated matrix (the comment says covariance, but it is really just a random linear mixing):

# Initially draw informative features from the standard normal
X[:, :n_informative] = generator.randn(n_samples, n_informative)

# Create each cluster; a variant of make_blobs
stop = 0
for k, centroid in enumerate(centroids):
    start, stop = stop, stop + n_samples_per_cluster[k]
    y[start:stop] = k % n_classes  # assign labels
    X_k = X[start:stop, :n_informative]  # slice a view of the cluster

    A = 2 * generator.rand(n_informative, n_informative) - 1
    X_k[...] = np.dot(X_k, A)  # introduce random covariance

    X_k += centroid  # shift the cluster to a vertex
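
As a quick sanity check (my own snippet, not from the library), multiplying standard-normal features by a random matrix A does introduce correlation between them:

import numpy as np

rng = np.random.RandomState(0)
X_k = rng.randn(100000, 2)   # standard normal: columns are uncorrelated
A = 2 * rng.rand(2, 2) - 1   # random entries in [-1, 1), as in the source
X_mixed = np.dot(X_k, A)

print(np.corrcoef(X_k, rowvar=False))      # approximately the identity
print(np.corrcoef(X_mixed, rowvar=False))  # off-diagonals typically nonzero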

The function then adds redundant features as random linear combinations of the informative features, without modifying the informative features themselves:

# Create redundant features
if n_redundant > 0:
    B = 2 * generator.rand(n_informative, n_redundant) - 1
    X[:, n_informative:n_informative + n_redundant] = \
        np.dot(X[:, :n_informative], B)
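
So the redundant features are exact linear combinations of the informative ones. A quick way to confirm this on a generated dataset (my own check, relying on the documented column order when shuffle=False) is to look at the rank of the feature block:

import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=2, n_repeated=0, shuffle=False,
                           random_state=0)

# With shuffle=False, columns 0-2 are informative and columns 3-4 are
# redundant. The rank of the first five columns is still 3: the redundant
# columns add no new information.
print(np.linalg.matrix_rank(X[:, :5]))  # 3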

It also adds some other characteristics to the dataset (a quick check of two of them follows this list):

  • useless features
  • repeated features
  • weights
  • etc.
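
For example, repeated features are verbatim copies of randomly chosen informative or redundant columns, and useless features are pure noise. A small check of my own, again relying on the column order when shuffle=False:

import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=7, n_informative=3,
                           n_redundant=2, n_repeated=1, shuffle=False,
                           random_state=0)

# Columns: 0-2 informative, 3-4 redundant, 5 repeated, 6 useless noise.
repeated = X[:, 5]
print([np.allclose(repeated, X[:, j]) for j in range(5)])
# exactly one True: column 5 duplicates one of the first five columns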

One thing I’m particularly interested in is flip_y.

# Randomly replace labels
if flip_y >= 0.0:
    flip_mask = generator.rand(n_samples) < flip_y
    y[flip_mask] = generator.randint(n_classes, size=flip_mask.sum())

For each data point, the function replaces its label with a random class (which may happen to be the true class) with probability flip_y, a float between 0 and 1. This gives some of the points wrong labels; in other words, it introduces label noise into the dataset.
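Note that because the replacement label is drawn uniformly from all n_classes, including the true one, the expected fraction of labels that actually change is flip_y * (n_classes - 1) / n_classes. A quick simulation of my own, mirroring the source snippet above, confirms this:

import numpy as np

rng = np.random.RandomState(0)
n_samples, n_classes, flip_y = 100000, 3, 0.3

y = rng.randint(n_classes, size=n_samples)
y_noisy = y.copy()
flip_mask = rng.rand(n_samples) < flip_y
y_noisy[flip_mask] = rng.randint(n_classes, size=flip_mask.sum())

print((y_noisy != y).mean())  # about 0.2, i.e. 0.3 * 2/3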

There are two knobs that control the noise (that is, how much the classes overlap):

  • class_sep: determines how far apart the clusters are. Larger values make the clusters overlap less.
  • flip_y: determines what fraction of the data points are randomly relabelled (label noise).

Let’s now look at the datasets this function produces.

from sklearn.datasets import make_classification

x, y = make_classification(n_features=2, n_redundant=0, n_samples=1000,
                           n_classes=3, n_clusters_per_class=1)
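
The figures in this post were scatter plots of the two features colored by class; a minimal matplotlib sketch of my own to reproduce them:

import matplotlib.pyplot as plt

plt.scatter(x[:, 0], x[:, 1], c=y, s=10)
plt.xlabel('feature 0')
plt.ylabel('feature 1')
plt.show()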

The 3-class dataset doesn’t look too bad. Here I left flip_y at its default value, 0.01.

Let’s create a noisier dataset by setting flip_y to 0.3.

x, y = make_classification(n_features=2, n_redundant=0, n_samples=1000, n_classes=3, n_clusters_per_class=1, flip_y=0.3)

How about increasing class_sep?

x, y = make_classification(n_features=2, n_redundant=0, n_samples=1000, n_classes=3, n_clusters_per_class=1, flip_y=0.3, class_sep=5)

The data points are now well separated and clearly split into three clusters, but the classes still overlap because of the high value of flip_y. This produces a completely different type of error: previously the clusters themselves were hard to separate, whereas now the clusters are easy to separate but each cluster contains some points from the other classes (label noise).

So this function can help people quickly generate an artificial dataset with the properties they want. I’m not sure how closely these datasets resemble real ones, but they are useful when people want to experiment with the properties of predictive models.
