TODO: implement the SRHT using the FFHT library.
Examples of random projection sketching methods to reduce the computational burden of intensive matrix computations.
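The TODO above refers to the Subsampled Randomised Hadamard Transform (SRHT). As a point of reference only, a minimal (non-fast) SRHT can be written with `scipy.linalg.hadamard`; the FFHT/fastwht dependency would replace the explicit Hadamard matrix multiply with an O(n log n) transform. The function below is an illustrative sketch, not the repository's implementation.

```python
import numpy as np
from scipy.linalg import hadamard

def srht_sketch(A, m, seed=None):
    """Illustrative SRHT: returns S A with S of shape (m, n).

    Builds the full Hadamard matrix, so it is O(n^2) per column rather
    than the O(n log n) achievable with FFHT/fastwht.
    Requires the number of rows n to be a power of two.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    assert n & (n - 1) == 0, "n must be a power of two"
    D = rng.choice([-1.0, 1.0], size=n)          # random sign flips
    H = hadamard(n) / np.sqrt(n)                 # orthonormal Hadamard matrix
    rows = rng.choice(n, size=m, replace=False)  # uniform row subsampling
    return np.sqrt(n / m) * (H @ (D[:, None] * A))[rows]

# Example: sketch a 1024 x 10 matrix down to 128 rows.
A = np.random.default_rng(0).standard_normal((1024, 10))
SA = srht_sketch(A, m=128, seed=1)
print(SA.shape)  # (128, 10)
```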
Data:
All datasets can be downloaded in the required format from https://drive.google.com/open?id=1w-EwPNmi-qiddui1RfSBhbrylL_8LlgN
Note that these must be placed in the `data/` directory, e.g. `data/YearPredictionMSD.npy`.
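Once placed there, a dataset can be loaded with NumPy. The split below assumes the target is stored in the last column, which may not hold for every dataset, so check each file's layout before relying on it.

```python
import numpy as np

# Assumes the .npy file stores a dense array with the response in the
# last column; adjust the split if a dataset uses a different layout.
data = np.load("data/YearPredictionMSD.npy")
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
```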
Packages:
Most of the dependencies come with the Anaconda distribution, but a couple more are needed for the optimisation and the fast sketching. All but the last one (FastWHT) can be installed via `pip install`.
- Standard: `numpy`, `scipy`, `pandas`, `sklearn`, `matplotlib`, `json`
- Miscellaneous: `numba`, `timeit`, `pprint`, `cvxopt`
- External: FastWHT (see Installation of fastwht below)
An experiment `exp` is located in `experiments/` and the corresponding output is found in `output/exp`.
Note that there will be intermediate directories in the above substitution.
`baselines/`
- `metadata.py`: computes the basic metadata for the real-world datasets used.
- `summary_performance.py`: evaluates the speed and error performance of the summary methods.
`ihs_baselines/`
- `error_vs_dimensionality.py`, `error_vs_num_iters.py`, `error_vs_row_dim.py`: mostly reproduce the IHS synthetic experiments (a toy IHS iteration is sketched after this list).
- `sjlt_error_sparsity.py`: compares IHS with sparse embeddings at different sparsity settings.
`ihs_timing/`
- `ihs_lasso_real.py`, `ihs_lasso_synthetic.py`: evaluate the real-time performance of various sketching techniques on real and synthetic examples, respectively.
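For orientation, the iterative Hessian sketch (IHS) idea behind these experiments can be written in a few lines. The toy version below uses a Gaussian sketch and unconstrained least squares; the experiment scripts cover constrained problems (LASSO) and other sketch types, so this is an illustration rather than the code used in the experiments.

```python
import numpy as np

def ihs_ols(X, y, sketch_dim, n_iters=10, seed=None):
    """Toy iterative Hessian sketch for unconstrained least squares.

    Each iteration solves a small sketched system
        (S X)^T (S X) dw = X^T (y - X w)
    instead of forming X^T X exactly.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        S = rng.standard_normal((sketch_dim, n)) / np.sqrt(sketch_dim)
        SX = S @ X
        grad = X.T @ (y - X @ w)
        dw = np.linalg.solve(SX.T @ SX, grad)
        w = w + dw
    return w

# Synthetic sanity check against the exact least-squares solution.
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 20))
y = X @ rng.standard_normal(20) + 0.1 * rng.standard_normal(5000)
w_ihs = ihs_ols(X, y, sketch_dim=200, n_iters=10, seed=1)
w_exact = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(w_ihs - w_exact))
```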
To run these experiments:
- Install the external dependency for the Fast Hadamard Transform (see below).
- Ensure that the necessary datasets are downloaded. The UCI and OpenML datasets are downloaded automatically by the script (their URLs are hardcoded), but you will need to follow the URLs in the `all_datasets` dictionary in `get_datasets.py` to download the libsvm files and the SuiteSparse datasets yourself. These must be saved in the same directory as `get_datasets.py`. The script must be run from that directory (I don't know why).
TODOs:
- Running the profile script shows that the bottleneck is the repeated conversion of the `ndarray` data type to `coo_matrix`. Two things can be done: (i) allow the functions to accept a sparse matrix as input, or (ii) convert the data once and keep sparse references (`row`, `col`, etc.). If better handling of sparse vs dense data can be done in the `rp.__init__` method, then the random number generation could live in the `rp.sketch` method, which would be better for repeated calls in IHS (a hypothetical sketch of this refactor appears after this list).
- Refactor and test the solvers for the IHS versions of OLS, ridge and LASSO.
- Start refactoring the subspace embedding experiments
- Generate data metadata scripts and plots
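As a rough illustration of the refactor suggested above: convert the data to a sparse format once at construction time, so that repeated `sketch` calls (as in an IHS loop) only draw fresh randomness. The class name `RPSketch` and its body are hypothetical and only mirror the `rp.__init__` / `rp.sketch` naming used above; the projection shown is a CountSketch-style map, not necessarily the one used in the repository.

```python
import numpy as np
import scipy.sparse as sparse

class RPSketch:
    """Hypothetical refactor: convert the data to sparse format once in
    __init__, so each sketch() call only draws new random numbers."""

    def __init__(self, data, sketch_dim, seed=None):
        # One-off conversion; repeated sketch() calls reuse self.data.
        self.data = data if sparse.issparse(data) else sparse.csr_matrix(data)
        self.n, self.d = self.data.shape
        self.sketch_dim = sketch_dim
        self.rng = np.random.default_rng(seed)

    def sketch(self):
        # CountSketch-style projection: each row of the data is hashed to
        # one of sketch_dim buckets with a random sign.
        buckets = self.rng.integers(0, self.sketch_dim, size=self.n)
        signs = self.rng.choice([-1.0, 1.0], size=self.n)
        S = sparse.csr_matrix(
            (signs, (buckets, np.arange(self.n))),
            shape=(self.sketch_dim, self.n),
        )
        return (S @ self.data).toarray()

# Repeated sketches of the same data, e.g. inside an IHS loop.
X = np.random.default_rng(0).standard_normal((2000, 10))
rp = RPSketch(X, sketch_dim=100, seed=1)
SX1, SX2 = rp.sketch(), rp.sketch()
print(SX1.shape, SX2.shape)
```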
Installation of fastwht:
- `git clone https://bitbucket.org/vegarant/fastwht.git`, then install by running `cd python`, `python setup.py`, `python test.py`.
- Get the directory path for `fastwht/python`, which should be `your_path_name = */sketching_optimisation/lib/fastwht/python`.
- Find `.bash_profile` (or equivalent) and add `export PYTHONPATH=$PYTHONPATH:your_path_name` as the final line, save, then `source .bash_profile`.
- Open ipython, run `import sys` then `sys.path`, and check that `your_path_name` is displayed.
- Go back to the `sketching_optimisation` directory and run the tests.
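A quick scripted version of the path check above, using only the standard library (it only verifies that a `fastwht/python` directory is on the interpreter's search path, not that the module imports correctly):

```python
import sys

# Sanity check after editing .bash_profile: the fastwht/python directory
# should now appear on the interpreter's search path.
on_path = any(p.rstrip("/").endswith("fastwht/python") for p in sys.path)
print("fastwht on PYTHONPATH:", on_path)
```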
Future ideas:
- Do this with more general functions, akin to the Newton sketch.
- Bias-variance tradeoff with this estimation method?
- Combine this with SGD?
- Rank-deficient or weak embeddings? Is there a range of results here?
- SVM or SVR?