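After a year of ups and downs, I looked back and found that I am still just as arrogant as ever.

I have made up my mind: forget what a technical blog is supposed to be, and forget all the other messy thoughts. This place is simply where I record my fresh start. I will take my time writing down bits of my thoughts, ideas, and experiences so that I can look them up in the future.

Rather than thinking "I need to write a technical blog", which is a boring idea anyway, isn't it?

So let everything start over and get back on track.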

# Author: cathuan

## Starting Over

## Generate experimental classification data

## First Post

Just For Fun

Today I noticed the function `sklearn.datasets.make_classification`, which lets users generate synthetic classification data for experiments. It is described in the scikit-learn documentation.

It looks like this function can generate all sorts of data to suit the user's needs. The general API has the form

```
sklearn.datasets.make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
```

The documentation says:

This initially creates clusters of points normally distributed (std=1) about vertices of an n_informative-dimensional hypercube with sides of length 2*class_sep and assigns an equal number of clusters to each class. It introduces interdependence between these features and adds various types of further noise to the data.

To be honest, it was not clear to me what this function does, so I went to check the source code.

It looks like the basic idea is to generate one centroid per cluster (there are `n_classes * n_clusters_per_class` clusters), with some randomness. How far apart the centroids are can be controlled with `class_sep`. Since the data are generated from a Gaussian distribution with variance 1 (std=1 in the documentation's words), this argument controls how much each cluster overlaps with the others.

```
# Build the polytope whose vertices become cluster centroids
centroids = _generate_hypercube(n_clusters, n_informative, generator).astype(float)
centroids *= 2 * class_sep  # determine how centroids are scaled
centroids -= class_sep
if not hypercube:
    centroids *= generator.rand(n_clusters, 1)
    centroids *= generator.rand(1, n_informative)
```
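To make the scaling concrete, here is a minimal sketch of the same arithmetic. It does not call the private `_generate_hypercube` helper. Instead it enumerates every vertex of the unit hypercube with `itertools.product`, which is my own stand-in (as far as I can tell, the library samples only as many vertices as it needs). The name `scaled_vertices` is made up for illustration.

```python
import itertools

import numpy as np


def scaled_vertices(n_informative, class_sep):
    """Return all hypercube vertices after the scaling shown above.

    Hypothetical helper: it enumerates every vertex of {0, 1}^n_informative
    rather than sampling a subset of them.
    """
    vertices = np.array(
        list(itertools.product([0, 1], repeat=n_informative)), dtype=float
    )
    vertices *= 2 * class_sep  # coordinates become 0 or 2 * class_sep
    vertices -= class_sep      # coordinates become -class_sep or +class_sep
    return vertices


centroids = scaled_vertices(n_informative=2, class_sep=1.0)
# The first two vertices differ in a single coordinate, so the distance
# between them is the side length of the hypercube, 2 * class_sep.
side = np.linalg.norm(centroids[0] - centroids[1])
print(centroids)
print(side)
```

Doubling `class_sep` doubles the side length, which is why a larger value pushes the clusters further apart.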

Next, it draws the informative features from a standard normal (Gaussian) distribution, splits the samples into clusters (cluster `k` gets the label `k % n_classes`), and shifts each cluster to its centroid. Before the shift, it correlates the informative features within each cluster by multiplying them by a randomly generated matrix `A`. The source comment calls this "random covariance", although to me `A` looks more like a matrix that introduces correlation than a covariance matrix itself:

```
# Initially draw informative features from the standard normal
X[:, :n_informative] = generator.randn(n_samples, n_informative)
# Create each cluster; a variant of make_blobs
stop = 0
for k, centroid in enumerate(centroids):
    start, stop = stop, stop + n_samples_per_cluster[k]
    y[start:stop] = k % n_classes  # assign labels
    X_k = X[start:stop, :n_informative]  # slice a view of the cluster
    A = 2 * generator.rand(n_informative, n_informative) - 1
    X_k[...] = np.dot(X_k, A)  # introduce random covariance
    X_k += centroid  # shift the cluster to a vertex
```
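To check what the multiplication by `A` actually does, here is a small NumPy sketch under my own assumptions (the sample size, the seed, and all variable names are mine). If the rows of `X` are independent standard normal vectors, the covariance of `X @ A` should be `A.T @ A`, so `A` itself is a mixing matrix and the covariance comes out of it.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_informative = 200_000, 3

X = rng.randn(n_samples, n_informative)             # independent, unit-variance features
A = 2 * rng.rand(n_informative, n_informative) - 1  # entries uniform in [-1, 1)
X_mixed = X @ A                                     # same operation as np.dot(X_k, A)

empirical_cov = np.cov(X_mixed, rowvar=False)       # covariance estimated from the samples
theoretical_cov = A.T @ A                           # covariance of x @ A when cov(x) = I
max_abs_error = np.abs(empirical_cov - theoretical_cov).max()
print(max_abs_error)
```

The two matrices agree up to sampling error, so each cluster ends up with its own random covariance `A.T @ A`.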

The function then adds redundant features, which are random linear combinations of the informative features, and leaves the informative features themselves untouched:

```
# Create redundant features
if n_redundant > 0:
    B = 2 * generator.rand(n_informative, n_redundant) - 1
    X[:, n_informative:n_informative + n_redundant] = \
        np.dot(X[:, :n_informative], B)
```
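One way to see that the redundant columns carry no new information is to recover `B` from the generated data. This is only a sketch: it assumes `shuffle=False` keeps the columns in the order informative, then redundant, and it relies on the default `shift=0.0` and `scale=1.0`. The feature counts and the seed are my own choices.

```python
import numpy as np
from sklearn.datasets import make_classification

n_informative, n_redundant = 3, 2
X, y = make_classification(
    n_samples=200,
    n_features=n_informative + n_redundant,
    n_informative=n_informative,
    n_redundant=n_redundant,
    shuffle=False,   # keep columns in order: informative first, then redundant
    random_state=0,
)
X_inf = X[:, :n_informative]
X_red = X[:, n_informative:]

# Solve X_inf @ B_hat = X_red in the least-squares sense.
B_hat, *_ = np.linalg.lstsq(X_inf, X_red, rcond=None)
max_residual = np.abs(X_inf @ B_hat - X_red).max()
print(B_hat)
print(max_residual)
```

A residual at the level of floating-point error means every redundant column is an exact linear combination of the informative ones.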

It also adds some other characteristics to the dataset:

- useless features
- repeated features
- weights
- etc.
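As one example from this list, the `weights` argument makes the classes imbalanced. The sketch below uses values I picked myself (a 90/10 split, `flip_y=0.0` so that label noise does not blur the counts, and a fixed seed).

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,
    weights=[0.9, 0.1],  # ask for roughly 90% class 0 and 10% class 1
    flip_y=0.0,          # turn off label noise so the counts are easy to read
    random_state=0,
)
class_counts = np.bincount(y)  # number of samples per class label
print(class_counts)
```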

One thing I’m particularly interested in is `flip_y`.

```
# Randomly replace labels
if flip_y >= 0.0:
    flip_mask = generator.rand(n_samples) < flip_y
    y[flip_mask] = generator.randint(n_classes, size=flip_mask.sum())
```

For each data point, with probability `flip_y` (a float between 0 and 1), the function replaces its target with a class drawn at random. The draw can land on the true class again, so the share of labels that actually change is somewhat lower than `flip_y`. The result is that some of the data end up with the *wrong* target, which introduces noise into the dataset.
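To put a number on "it can be changed to the true class", here is a small simulation of the same two lines using plain NumPy. The clean labels, the sample size, and the seed are my own assumptions. Because the replacement label is uniform over the classes, I expect about `flip_y * (1 - 1 / n_classes)` of the labels to really change.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_classes, flip_y = 100_000, 3, 0.3

y_true = rng.randint(n_classes, size=n_samples)  # made-up "clean" labels
y_noisy = y_true.copy()

# The same two steps as in the library snippet above.
flip_mask = rng.rand(n_samples) < flip_y
y_noisy[flip_mask] = rng.randint(n_classes, size=flip_mask.sum())

selected_fraction = flip_mask.mean()            # close to flip_y
changed_fraction = (y_noisy != y_true).mean()   # close to flip_y * (1 - 1 / n_classes)
print(selected_fraction, changed_fraction)
```

With three classes and `flip_y=0.3`, roughly 30% of the points are selected but only about 20% end up with a different label.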

There are two things that control the noise (or how much the classes overlap):

- `class_sep`: determines how far apart the clusters are. A larger value makes the classes overlap less.
- `flip_y`: determines how many data points are randomly relabelled (noise).

Let’s now look at the generated dataset:

```
from sklearn.datasets import make_classification

x, y = make_classification(n_features=2, n_redundant=0, n_samples=1000, n_classes=3, n_clusters_per_class=1)
```

The 3-class dataset doesn’t look too bad. Here I left `flip_y` at its default value, 0.01.
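For a quick look at the data without a plot, the shapes and label counts can be printed directly. This sketch repeats the call above with a fixed `random_state`, which is my addition so that the run is repeatable.

```python
import numpy as np
from sklearn.datasets import make_classification

x, y = make_classification(
    n_features=2, n_redundant=0, n_samples=1000,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
print(x.shape)         # feature matrix: one row per sample, one column per feature
print(np.unique(y))    # the class labels that appear
print(np.bincount(y))  # how many samples carry each label
```

A scatter plot of `x[:, 0]` against `x[:, 1]`, coloured by `y`, shows the same information visually.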

Let’s create a worse dataset by setting `flip_y` to 0.3.

```
x, y = make_classification(n_features=2, n_redundant=0, n_samples=1000, n_classes=3, n_clusters_per_class=1, flip_y=0.3)
```

How about increasing `class_sep`?

```
x, y = make_classification(n_features=2, n_redundant=0, n_samples=1000, n_classes=3, n_clusters_per_class=1, flip_y=0.3, class_sep=5)
```

The data points are now clearly split into three clusters, but the classes still overlap because of the high value of `flip_y`. This produces a completely different type of error: previously the classes were simply hard to separate, whereas now they are easy to separate, but each cluster contains some points labelled with the other classes (noise).
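One rough way to see the two kinds of error is to fit a simple classifier on both datasets. This is only a sketch under my own assumptions: logistic regression as the model, a 50/50 train/test split, 5000 samples, and fixed seeds. The helper name `held_out_accuracy` is made up.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def held_out_accuracy(**kwargs):
    """Fit a logistic regression on half the data and score it on the other half."""
    x, y = make_classification(
        n_features=2, n_redundant=0, n_samples=5000,
        n_classes=3, n_clusters_per_class=1, random_state=0, **kwargs
    )
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.5, random_state=0
    )
    model = LogisticRegression(max_iter=1000).fit(x_train, y_train)
    return model.score(x_test, y_test)


acc_noisy_overlapping = held_out_accuracy(flip_y=0.3)
acc_noisy_separated = held_out_accuracy(flip_y=0.3, class_sep=5)
print(acc_noisy_overlapping, acc_noisy_separated)
```

With `class_sep=5` the clusters themselves are easy to tell apart, so I would expect the accuracy to sit near the ceiling set by the label noise, about `1 - 0.3 * (2 / 3) = 0.8` for three classes. No model can do much better than that against the noisy labels, however well the clusters are separated.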

So this function helps people quickly generate an artificial dataset with the properties they want. I’m not sure how closely these datasets resemble real ones, but they are useful for experimenting with the properties of predictive models.

This is the first blog post I’ve ever published. There are two reasons why I want to start blogging:

- Practice my English
- Try to write down my thoughts and what I learnt in Maths/Programming/Finance

Hopefully I can achieve at least one of the goals.