feature_selection package

Submodules

feature_selection.forward_selection module

feature_selection.forward_selection.forward_selection(scorer, X, y, min_features=1, max_features=10)

Forward selection is a greedy feature selection algorithm. It starts from an empty model and, at each step, adds the variable that yields the largest improvement in the model. The process repeats until none of the remaining variables improves the accuracy of the model.

Parameters:
    scorer (function) – A custom user-supplied function that accepts X and y (as defined below) as input and returns the error of the model fitted on them, as the scorer in the example below does.
    X (array-like of shape) – training dataset
    y (array-like of shape) – target values for the training dataset
    min_features (int (default=1)) – minimum number of features to select
    max_features (int (default=10)) – maximum number of features to select
Returns:	Numeric array of selected features
Return type:	numpy ndarray

Examples

>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.datasets import make_friedman1
>>> data, target = make_friedman1(n_samples=200, n_features=15,
...                               random_state=0)
>>> from feature_selection import forward_selection
>>>
>>> def my_scorer_fn(X, y):
...     lm = LinearRegression().fit(X, y)
...     return 1 - lm.score(X, y)
...
>>> forward_selection(my_scorer_fn, data, target, 2, 7)
[3, 1, 0, 4]
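
The greedy loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's implementation: `forward_selection_sketch`, `lstsq_error`, and the `tol` parameter are hypothetical names introduced here, and the scorer is assumed to return an error to be minimized (as `1 - lm.score(X, y)` does in the example).

```python
import numpy as np


def lstsq_error(X, y):
    """Hypothetical scorer: residual sum of squares of a least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def forward_selection_sketch(scorer, X, y, min_features=1, max_features=10, tol=1e-9):
    """Greedy forward selection; `tol` is an assumed minimum improvement."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    selected = []                          # column indices chosen so far
    remaining = list(range(X.shape[1]))    # candidates not yet chosen
    best_error = np.inf
    while remaining and len(selected) < max_features:
        # Score the current selection extended by each remaining candidate.
        errors = {j: scorer(X[:, selected + [j]], y) for j in remaining}
        best_j = min(errors, key=errors.get)
        improved = errors[best_j] < best_error - tol
        # Stop once nothing improves, unless min_features is not yet reached.
        if not improved and len(selected) >= min_features:
            break
        selected.append(best_j)
        remaining.remove(best_j)
        best_error = errors[best_j]
    return selected
```

The tolerance is there because an in-sample error almost never increases when a column is added, so a strict comparison would tend to run until `max_features` is reached.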

feature_selection.recursive_feature_elimination module

feature_selection.recursive_feature_elimination.recursive_feature_elimination(scorer, X, y, n_features_to_select=None)

Feature selector that implements recursive feature elimination.

Implements a greedy algorithm that iteratively fits and scores a scikit-learn classifier and eliminates features using a score-based metric.

Parameters:
    scorer (function) – A custom user-supplied function that accepts X and y (as defined below) as input and returns the index of the column with the lowest weight.
    X (array-like of shape (n_samples, n_features)) – Training samples
    y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X, used for training
    n_features_to_select (int or None (default=None)) – The number of features to be selected. If None, half of the features are selected.
Returns:	List of column names or indices of non-eliminated features.
Return type:	array of shape [n_features_to_select]

Examples

>>> from sklearn.datasets import make_friedman1
>>> from sklearn.linear_model import LinearRegression
>>> from feature_selection import recursive_feature_elimination
>>>
>>> def scorer(X, y):
...     model = LinearRegression()
...     model.fit(X, y)
...     return X.columns[model.coef_.argmin()]
...
>>> X, y = make_friedman1(n_samples=200, n_features=10, random_state=10)
>>> recursive_feature_elimination(scorer, X, y,
...                               n_features_to_select=5)
array([0, 1, 3, 4, 9])
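
The elimination loop can be sketched roughly as below. It is an illustration only, not the package's code: `rfe_sketch` and `weakest_column` are hypothetical names, the sketch works on plain NumPy arrays (so the scorer returns a column position rather than a column name), and it treats the smallest absolute least-squares coefficient as the "lowest weight", which is an assumption.

```python
import numpy as np


def weakest_column(X, y):
    """Hypothetical scorer: position of the column with the smallest
    absolute least-squares coefficient (intercept excluded)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return int(np.argmin(np.abs(coef[1:])))


def rfe_sketch(scorer, X, y, n_features_to_select=None):
    """Drop one column per round until the requested number remains."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]
    if n_features_to_select is None:
        # Mirrors the documented default: keep half of the features.
        n_features_to_select = n_features // 2
    kept = list(range(n_features))         # original indices still in play
    while len(kept) > n_features_to_select:
        # The scorer sees only the surviving columns, so its answer is a
        # position within `kept`, not an original column index.
        drop_pos = scorer(X[:, kept], y)
        del kept[drop_pos]
    return np.array(kept)
```

Refitting after every removal is what makes the procedure recursive: a weight that looks small next to a correlated feature can grow once that feature is gone.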

feature_selection.simulated_annealing module

feature_selection.simulated_annealing.simulated_annealing(scorer, X, y, c=1, iterations=100, bools=False, random_state=None)

Feature selector that performs simulated annealing to select features.

The algorithm starts from a randomly chosen set of features, trains a model on them, and scores it. At each iteration it randomly perturbs the chosen features and checks whether the model improves. If it does, the new feature set is kept. If not, the worse feature set may still be kept, with an acceptance probability that decreases as the iterations proceed and as the loss in performance grows.

Parameters:
    scorer (function) – A custom user-supplied function that accepts X and y (as defined below) as input and returns the error of the datasets.
    X (np.array) – Feature training dataset
    y (np.array) – Target training dataset
    c (int (default=1)) – Control rate of feature perturbation
    iterations (int (default=100)) – Number of iterations
    bools (bool (default=False)) – If True, the function returns an array of boolean values instead of column indices
    random_state (int (default=None)) – Seed for random number generators
Returns:	Array of selected feature indices
Return type:	numpy.array

Examples

>>> from sklearn.datasets import make_friedman1
>>> from sklearn.linear_model import LinearRegression
>>> from feature_selection import simulated_annealing
>>>
>>> def scorer(X, y):
...     model = LinearRegression()
...     model.fit(X, y)
...     return 1 - model.score(X, y)
...
>>> X, y = make_friedman1(n_samples=200, n_features=10, random_state=10)
>>> simulated_annealing(scorer, X, y)
array([ 0,  1,  3,  4,  5,  6,  7,  9, 10])
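
The accept-or-reject step described above can be sketched as follows. The exact acceptance formula is not given in this documentation, so the rule used here, `exp(-(iteration / c) * relative_increase)`, is an assumption chosen only because it shrinks with both the iteration count and the size of the error increase. `acceptance_probability` and `simulated_annealing_sketch` are hypothetical names, and the role given to `c` may differ from the package's.

```python
import math

import numpy as np


def acceptance_probability(old_error, new_error, iteration, c=1):
    """Assumed rule: always accept improvements; accept a worse subset with a
    probability that falls as the iteration and the relative increase grow."""
    if new_error <= old_error:
        return 1.0
    if old_error == 0:
        return 0.0  # any increase from a perfect score is infinitely worse
    relative_increase = (new_error - old_error) / old_error
    return math.exp(-(iteration / c) * relative_increase)


def simulated_annealing_sketch(scorer, X, y, c=1, iterations=100, random_state=None):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(random_state)
    n_features = X.shape[1]
    # Start from a random, non-empty subset encoded as a boolean mask.
    mask = rng.random(n_features) < 0.5
    if not mask.any():
        mask[rng.integers(n_features)] = True
    error = scorer(X[:, mask], y)
    for i in range(1, iterations + 1):
        # Perturb the subset by toggling one randomly chosen feature.
        candidate = mask.copy()
        flip = rng.integers(n_features)
        candidate[flip] = not candidate[flip]
        if not candidate.any():
            continue  # skip empty subsets
        new_error = scorer(X[:, candidate], y)
        if rng.random() < acceptance_probability(error, new_error, i, c):
            mask, error = candidate, new_error
    return np.flatnonzero(mask)  # indices of the selected columns
```

Occasionally accepting a worse subset early on lets the search leave a local optimum; as the iterations proceed the rule becomes nearly greedy.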

feature_selection.variance_thresholding module

feature_selection.variance_thresholding.variance_thresholding(data, threshold=0)

Select features above a certain threshold of variance.

Parameters:
    data (numpy ndarray, pandas DataFrame, list) – A numpy array, a pandas DataFrame, or a list to select features from
    threshold (float, optional) – Variance threshold; features whose variance is above it are selected
Returns:	A 1d array of indexes of the features that pass the threshold or are not numerical
Return type:	numpy ndarray

Examples

>>> from feature_selection import variance_thresholding
>>> X = [[1, 6, 0, 5], [1, 2, 4, 5], [1, 7, 8, 5]]
>>> variance_thresholding(X)
array([1, 2])
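
The result above can be reproduced with a few lines of NumPy. This is a sketch rather than the package's implementation: `variance_threshold_sketch` is a hypothetical name, it assumes all columns are numeric (the documented function also keeps non-numerical features), and it assumes the population variance (`ddof=0`) with a strict comparison against the threshold.

```python
import numpy as np


def variance_threshold_sketch(data, threshold=0):
    """Return the indices of columns whose variance exceeds `threshold`."""
    arr = np.asarray(data, dtype=float)
    variances = arr.var(axis=0)            # per-column population variance
    return np.flatnonzero(variances > threshold)
```

For the `X` in the example, the first and last columns are constant, so their variance is 0 and only columns 1 and 2 are returned.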
