A lakh is a unit of measure used in the Indian number system which signifies 100,000 (or, in the Indian convention, 1,00,000). The name is a play on the Million Song Dataset, which includes metadata and features for 1,000,000 music recordings. Depending on how you count, the Lakh MIDI Dataset includes about 100,000 MIDI files. They were scraped from publicly-available sources on the internet, and then de-duped according to their MD5 checksum.
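As a rough illustration of the MD5-based de-duplication mentioned above, here is a minimal sketch using only the Python standard library. It is not the code used to build the dataset; the function names `md5_of_file` and `dedupe_by_md5` are hypothetical and chosen for this example only.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedupe_by_md5(paths):
    """Map each distinct MD5 digest to the first path seen with that content.

    Hypothetical helper: byte-identical files collapse to a single entry.
    """
    unique = {}
    for path in paths:
        unique.setdefault(md5_of_file(path), path)
    return unique
```

For example, `dedupe_by_md5(sorted(pathlib.Path("midi_dir").rglob("*.mid")))` would keep one path per distinct file content, where `midi_dir` is a placeholder directory. Note that a checksum only catches byte-identical copies; two different transcriptions of the same song still count as separate files.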
To facilitate use of this dataset, here are a few IPython notebook tutorials:

How to utilize the Lakh MIDI Dataset, with examples.
Overview of sources of information available in MIDI files.
Measuring the reliability of MIDI-derived annotations.
Overview of MIDI-to-audio alignment methods and the technique utilized in the Lakh MIDI Dataset.

Frequently asked questions

What kind of information can I get from MIDI files?

In a simplistic view, a MIDI file can be considered a score with additional optional annotations. As a result, you can count on getting a transcription of the song, as well as meter information such as beats and downbeats. In some cases, MIDI files include key signature annotations and lyrics, among other useful things. For a discussion of the presence of these different information sources in files in the Lakh MIDI Dataset, see this tutorial.

How reliable are MIDI-derived annotations?

This gets at two questions: How reliable are the annotations in MIDI files, and how accurately was the MIDI file aligned to the audio recording? This tutorial addresses these questions in detail.

How were the matched and aligned datasets assembled?

In short, I developed a series of efficient learning-based methods to discard the vast majority of possible matches in the Million Song Dataset. The remaining entries were compared using standard (and computationally expensive) dynamic time warping-based MIDI-to-audio alignment. For a thorough discussion, please see chapters 4-7 of my thesis. And, of course, all of the code used in this project is available here.

A MIDI-audio pair was considered a valid match based on the confidence score reported by dynamic time warping-based alignment, which turns out to be extremely reliable. For more discussion and concrete details, see section 4.5 of my thesis. However, the DTW-based alignment scheme is intentionally somewhat invariant to differences in instrumentation. As a result, songs which are harmonically similar may be matched incorrectly. As a concrete example, it's not uncommon for transcriptions of house music to be erroneously matched to dozens of house remixes.
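To make the idea of scoring a candidate match with dynamic time warping more concrete, here is a toy sketch in plain Python. It is not the alignment code from this project: it works on one-dimensional sequences instead of audio features, and the score it returns (mean cost per step of the best warping path, lower is better) is an assumption made for illustration, not the confidence score defined in the thesis. The function name `dtw_normalized_cost` is hypothetical.

```python
def dtw_normalized_cost(a, b):
    """Align two 1-D sequences with classic DTW.

    Returns the total cost of the cheapest warping path divided by the
    number of cells on that path (an illustrative score, lower is better).
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j]: cheapest cumulative cost of aligning a[:i+1] with b[:j+1]
    # steps[i][j]: number of cells on that cheapest path
    cost = [[inf] * m for _ in range(n)]
    steps = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = abs(a[i] - b[j])  # local distance between the two frames
            if i == 0 and j == 0:
                cost[i][j], steps[i][j] = d, 1
                continue
            candidates = []
            if i > 0:
                candidates.append((cost[i - 1][j], steps[i - 1][j]))
            if j > 0:
                candidates.append((cost[i][j - 1], steps[i][j - 1]))
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], steps[i - 1][j - 1]))
            best_cost, best_steps = min(candidates)
            cost[i][j] = best_cost + d
            steps[i][j] = best_steps + 1
    return cost[-1][-1] / steps[-1][-1]
```

Calling `dtw_normalized_cost([0, 1, 2, 3], [0, 1, 1, 2, 3])` gives a score of zero, because DTW absorbs the repeated frame, while unrelated sequences give a larger score. A real MIDI-to-audio aligner compares frames of spectral features rather than single numbers, and any threshold for accepting a match would have to be tuned on data. The sketch also hints at why harmonically similar songs can be confused: the score only reflects how well the chosen features line up.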