feature_selection package
Submodules
feature_selection.forward_selection module
feature_selection.forward_selection.forward_selection(scorer, X, y, min_features=1, max_features=10)
Forward selection is a greedy algorithm for selecting features. It starts with an empty model and, at each step, adds the feature that gives the largest improvement. The process repeats until none of the remaining features improves the accuracy of the model.
Parameters: - scorer (function) – A custom user-supplied function that accepts X and y (as defined below) as input and returns the error of a model fitted on them, as in the example below.
- X (array-like of shape (n_samples, n_features)) – Training samples
- y (array-like of shape (n_samples,)) – Target values for X
- min_features (int (default=1)) – Minimum number of features to select
- max_features (int (default=10)) – Maximum number of features to select
Returns: Numeric array of selected features
Return type: numpy ndarray
Examples
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.datasets import make_friedman1
>>> data, target = make_friedman1(n_samples=200, n_features=15,
...                               random_state=0)
>>> from feature_selection import forward_selection
>>>
>>> def my_scorer_fn(X, y):
...     lm = LinearRegression().fit(X, y)
...     return 1 - lm.score(X, y)
>>>
>>> forward_selection(my_scorer_fn, data, target, 2, 7)
[3, 1, 0, 4]
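The greedy loop described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the names forward_selection_sketch and holdout_error are hypothetical, the scorer is assumed to return an error to be minimized, and the stopping rule (stop once no candidate lowers the error and min_features is met) is an assumption.

```python
import numpy as np


def holdout_error(X, y):
    # Hypothetical scorer: fit least squares on the first half of the rows
    # and return the mean squared error on the second half (lower is better).
    A = np.column_stack([np.ones(len(X)), X])
    half = len(A) // 2
    coef, *_ = np.linalg.lstsq(A[:half], y[:half], rcond=None)
    return float(np.mean((y[half:] - A[half:] @ coef) ** 2))


def forward_selection_sketch(scorer, X, y, min_features=1, max_features=10):
    X = np.asarray(X)
    selected, remaining = [], list(range(X.shape[1]))
    best_error = np.inf
    while remaining and len(selected) < max_features:
        # Score every remaining feature when added to the current selection.
        errors = [scorer(X[:, selected + [j]], y) for j in remaining]
        k = int(np.argmin(errors))
        # Assumed stopping rule: no improvement and the minimum is already met.
        if errors[k] >= best_error and len(selected) >= min_features:
            break
        best_error = errors[k]
        selected.append(remaining.pop(k))
    return selected


# Synthetic data in which only columns 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=200)
selected = forward_selection_sketch(holdout_error, X, y,
                                    min_features=2, max_features=5)
```

A held-out error is used here because the training error of a least-squares fit never increases when a column is added, so a training-error scorer would rarely trigger the stopping rule before max_features is reached.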
feature_selection.recursive_feature_elimination module
feature_selection.recursive_feature_elimination.recursive_feature_elimination(scorer, X, y, n_features_to_select=None)
Feature selector that implements recursive feature elimination.
Implements a greedy algorithm that iteratively fits and scores a scikit-learn classifier and eliminates features using a score-based metric.
Parameters: - scorer (function) – A custom user-supplied function that accepts X and y (as defined below) as input and returns the index of the column with the lowest weight.
- X (array-like of shape (n_samples, n_features)) – Training samples
- y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X, used for training
- n_features_to_select (int or None (default=None)) – The number of features to select. If None, half of the features are selected.
Returns: List of column names or indices of non-eliminated features.
Return type: array of shape [n_features_to_select]
Examples
>>> from sklearn.datasets import make_friedman1
>>> from sklearn.linear_model import LinearRegression
>>> from feature_selection import recursive_feature_elimination
>>>
>>> def scorer(X, y):
...     model = LinearRegression()
...     model.fit(X, y)
...     return X.columns[model.coef_.argmin()]
>>>
>>> X, y = make_friedman1(n_samples=200, n_features=10, random_state=10)
>>> result = recursive_feature_elimination(scorer, X, y,
...                                        n_features_to_select=5)
>>> result
array([0, 1, 3, 4, 9])
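The elimination loop can be sketched as below. This is an illustration under stated assumptions rather than the package's code: rfe_sketch and weakest_column are hypothetical names, the scorer is assumed to return the position of the weakest column within the array it is given, and "lowest weight" is interpreted here as the smallest absolute least-squares coefficient.

```python
import numpy as np


def weakest_column(X, y):
    # Hypothetical scorer: position of the feature whose least-squares
    # coefficient has the smallest absolute value (intercept excluded).
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return int(np.argmin(np.abs(coef[1:])))


def rfe_sketch(scorer, X, y, n_features_to_select=None):
    X = np.asarray(X)
    if n_features_to_select is None:
        # Mirrors the documented default: keep half of the features.
        n_features_to_select = X.shape[1] // 2
    kept = list(range(X.shape[1]))
    while len(kept) > n_features_to_select:
        # The scorer only sees the surviving columns, so the position it
        # returns indexes into `kept`, not into the original array.
        drop = scorer(X[:, kept], y)
        kept.pop(drop)
    return np.array(kept)


# Synthetic data in which only columns 0, 3 and 5 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] - 2 * X[:, 3] + 1.5 * X[:, 5] + 0.1 * rng.normal(size=200)
kept = rfe_sketch(weakest_column, X, y)
```

The model is refitted after every removal, so a feature that looks weak only because it is correlated with another feature gets re-evaluated once that other feature has been dropped.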
feature_selection.simulated_annealing module
feature_selection.simulated_annealing.simulated_annealing(scorer, X, y, c=1, iterations=100, bools=False, random_state=None)
Feature selector that performs simulated annealing to select features.
The algorithm randomly chooses a set of features, trains a model on them, and scores it. It then randomly modifies the chosen features slightly and checks whether the model improves. If it does, the new feature set is kept. If not, the worse set may still be kept, subject to an acceptance probability that decreases as the iterations progress and as the drop in performance grows.
Parameters: - scorer (function) – A custom user-supplied function that accepts X and y (as defined below) as input and returns the error of a model fitted on them.
- X (np.array) – Feature training dataset
- y (np.array) – Target training dataset
- c (int (default=1)) – Control rate of feature perturbation
- iterations (int (default=100)) – Number of iterations
- bools (bool (default=False)) – If True, the function returns an array of boolean values instead of column indices
- random_state (int (default=None)) – Seed for random number generators
Returns: Array of selected feature indices
Return type: numpy.array
Examples
>>> from sklearn.datasets import make_friedman1
>>> from sklearn.linear_model import LinearRegression
>>> from feature_selection import simulated_annealing
>>>
>>> def scorer(X, y):
...     model = LinearRegression()
...     model.fit(X, y)
...     return 1 - model.score(X, y)
>>>
>>> X, y = make_friedman1(n_samples=200, n_features=10, random_state=10)
>>> simulated_annealing(scorer, X, y)
array([ 0, 1, 3, 4, 5, 6, 7, 9, 10])
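The accept/reject step is the core of the method, and a rough sketch may help. The acceptance formula below, exp(-(iteration / c) * relative worsening), is an assumption chosen to match the description (it shrinks with the iteration count and with the size of the degradation); the package may use a different rule. The names acceptance_probability, simulated_annealing_sketch and holdout_error are hypothetical, and flipping a single feature per iteration is a simplification.

```python
import math

import numpy as np


def holdout_error(X, y):
    # Hypothetical scorer: least squares on the first half of the rows,
    # mean squared error on the second half (assumed strictly positive).
    A = np.column_stack([np.ones(len(X)), X])
    half = len(A) // 2
    coef, *_ = np.linalg.lstsq(A[:half], y[:half], rcond=None)
    return float(np.mean((y[half:] - A[half:] @ coef) ** 2))


def acceptance_probability(old_error, new_error, iteration, c=1.0):
    # Assumed rule: improvements are always accepted; a worse candidate is
    # accepted less often late in the run and when it is much worse.
    if new_error <= old_error:
        return 1.0
    relative_worsening = (new_error - old_error) / old_error
    return math.exp(-(iteration / c) * relative_worsening)


def simulated_annealing_sketch(scorer, X, y, c=1.0, iterations=100,
                               random_state=None):
    rng = np.random.default_rng(random_state)
    n = X.shape[1]
    mask = rng.random(n) < 0.5            # random starting subset
    if not mask.any():
        mask[int(rng.integers(n))] = True
    error = scorer(X[:, mask], y)
    best_mask, best_error = mask.copy(), error
    for i in range(1, iterations + 1):
        candidate = mask.copy()
        j = int(rng.integers(n))
        candidate[j] = not candidate[j]   # flip one feature in or out
        if not candidate.any():
            continue                      # never score an empty subset
        cand_error = scorer(X[:, candidate], y)
        if rng.random() < acceptance_probability(error, cand_error, i, c):
            mask, error = candidate, cand_error
            if error < best_error:
                best_mask, best_error = mask.copy(), error
    return np.flatnonzero(best_mask)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=200)
chosen = simulated_annealing_sketch(holdout_error, X, y, random_state=0)
```

Occasionally accepting a worse subset early on lets the search leave a poor local optimum, while tracking the best subset seen ensures those exploratory moves never degrade the returned result.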
feature_selection.variance_thresholding module
feature_selection.variance_thresholding.variance_thresholding(data, threshold=0)
Select features whose variance is above a given threshold.
Parameters: - data (numpy ndarray, pandas DataFrame, list) – A numpy array, a pandas DataFrame or list to select features from
- threshold (float, optional) – Variance threshold; features whose variance exceeds it are kept
Returns: A 1-D array of the indices of the features that either pass the threshold or are not numeric
Return type: numpy ndarray
Examples
>>> from feature_selection import variance_thresholding
>>> X = [[1, 6, 0, 5], [1, 2, 4, 5], [1, 7, 8, 5]]
>>> variance_thresholding(X)
array([1, 2])
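For purely numeric input, the selection rule can be approximated in a few lines of NumPy. This sketch is not the package's implementation: variance_threshold_sketch is a hypothetical name, population variance (ddof=0) and a strict "greater than" comparison are assumptions, and the pass-through of non-numeric columns mentioned above is not handled.

```python
import numpy as np


def variance_threshold_sketch(data, threshold=0.0):
    # Assumes every column is numeric; non-numeric handling is omitted.
    arr = np.asarray(data, dtype=float)
    variances = arr.var(axis=0)           # per-column population variance
    # Keep the indices of columns whose variance is strictly above the cut.
    return np.flatnonzero(variances > threshold)


X = [[1, 6, 0, 5], [1, 2, 4, 5], [1, 7, 8, 5]]
kept = variance_threshold_sketch(X)       # columns 0 and 3 are constant
kept_strict = variance_threshold_sketch(X, threshold=5)
```

On the example data the column variances are 0, 14/3, 32/3 and 0, so the default threshold keeps columns 1 and 2, and a threshold of 5 keeps only column 2.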