A library that uses Hadoop to analyze a NASA dataset of meteorite landings.
The dataset can be found at this link:
https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh
The goal of this project is to learn some cool things about meteorites and to get hands-on experience with Hadoop. Ultimately, this dataset is small enough that a Python stack would likely be quicker to develop and run, but Hadoop is a cool tool and I wanted to work with it.
This project uses Maven to manage its dependencies, and the build assumes a machine that already has Hadoop installed and running. Assuming you have Maven, run `mvn package` from the root directory. Then, the different tasks can be run as follows:
```
hadoop jar target/meteoriteanalysis-1.0-SNAPSHOT-job.jar com.nachtm.meteoriteanalysis.MAINCLASS data/meteorite_landings.csv output
```
Right now, there are two jobs that can be run by replacing `MAINCLASS` with:
- `com.nachtm.meteoriteanalysis.MaxMass` outputs `1 x`, where `x` is the mass, in grams, of the largest meteorite.
- `com.nachtm.meteoriteanalysis.NumByArea` outputs information about the density of meteorites. More specifically, each row defines a 5 degree by 5 degree box and a count of the number of meteorites that have landed in that box. The format is `lat long count`, where `lat` is the latitude of the center of the box, `long` is the longitude of the center of the box, and `count` is the number of meteorites that landed within the box. (A rough sketch of this binning appears after this list.)
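For context, the 5-degree binning used by the density job can be summarized with a short mapper sketch. The class below is not the project's actual source; the class name, the CSV column positions, and the parsing are assumptions. It only illustrates how a record's coordinates might be snapped to the center of their 5 degree by 5 degree box and emitted with a count of 1 for a summing reducer.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: bins each meteorite record into a 5x5 degree box,
// keyed by the coordinates of the box's center.
public class AreaBinMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text boxCenter = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");
    try {
      // Assumed column layout: latitude and longitude are the last two fields.
      double lat = Double.parseDouble(fields[fields.length - 2]);
      double lon = Double.parseDouble(fields[fields.length - 1]);
      // Snap to the lower edge of the 5-degree box, then shift to its center.
      double latCenter = Math.floor(lat / 5.0) * 5.0 + 2.5;
      double lonCenter = Math.floor(lon / 5.0) * 5.0 + 2.5;
      boxCenter.set(latCenter + "\t" + lonCenter);
      context.write(boxCenter, ONE);
    } catch (NumberFormatException | ArrayIndexOutOfBoundsException e) {
      // Skip header rows and records without usable coordinates.
    }
  }
}
```

Paired with a standard summing reducer (for example Hadoop's built-in `IntSumReducer`), a mapper like this would produce one `lat long count` row per box.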
Output is written to the `output` directory (or whatever filepath you give as the last argument to the run command). Assuming a successful run, the file that actually contains the output will be `output/part-r-00000`. If the job runs with multiple reducers, the output will be split across multiple files: look for `output/part-r-00001` and so on.