Published on 01 January 2018
<p>Training of neural networks for automated diagnosis of pigmented skin lesions is hampered by the small size and lack of diversity of available datasets of dermatoscopic images. We tackle this problem by releasing the HAM10000 ("Human Against Machine with 10000 training images") dataset. We collected dermatoscopic images from different populations, acquired and stored by different modalities. The final dataset consists of 10015 dermatoscopic images which can serve as a training set for academic machine learning purposes. Cases include a representative collection of all important diagnostic categories in the realm of pigmented lesions: Actinic keratoses and intraepithelial carcinoma / Bowen's disease (<code>akiec</code>), basal cell carcinoma (<code>bcc</code>), benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen planus-like keratoses, <code>bkl</code>), dermatofibroma (<code>df</code>), melanoma (<code>mel</code>), melanocytic nevi (<code>nv</code>) and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, <code>vasc</code>).</p><p>More than 50% of lesions are confirmed through histopathology (<code>histo</code>); the ground truth for the remaining cases is either follow-up examination (<code>follow_up</code>), expert consensus (<code>consensus</code>), or confirmation by in-vivo confocal microscopy (<code>confocal</code>). 
The dataset includes lesions with multiple images, which can be tracked by the <code>lesion_id</code> column within the <strong>HAM10000_metadata</strong> file.</p><p>Due to upload size limitations, images are stored in two files:<ul><li><strong>HAM10000_images_part1.zip</strong> (5000 JPEG files)</li><li><strong>HAM10000_images_part2.zip</strong> (5015 JPEG files)</li></ul></p><p><h3>Additional data for evaluation purposes</h3>The HAM10000 dataset served as the training set for the <a href='http://arxiv.org/abs/1902.03368'>ISIC 2018 challenge (Task 3)</a>, with the same sources also contributing the majority of the validation and test sets. The test-set images are available herein as <strong>ISIC2018_Task3_Test_Images.zip (1511 images)</strong>; the ground truth, in the same format as the HAM10000 data (public since 2023), is available as <strong>ISIC2018_Task3_Test_GroundTruth.csv</strong>. The ISIC Archive also provides the challenge images and metadata (training, validation, test) on its <a href="https://challenge.isic-archive.com/data/#2018">"ISIC Challenge Datasets" page</a>.</p><p><h3>Comparison to physicians</h3>Test-set evaluations of the ISIC 2018 challenge were compared to physicians on an international scale, where the majority of challenge participants outperformed expert readers: <a href="https://doi.org/10.1016/S1470-2045(19)30333-X">Tschandl P. et al., Lancet Oncol 2019</a></p><p><h3>Human-computer collaboration</h3>The test-set images were also used in a study comparing different methods and scenarios of human-computer collaboration: <a href="https://www.nature.com/articles/s41591-020-0942-0">Tschandl P. 
et al., Nature Medicine 2020</a><br>The following corresponding metadata are available herein:<ul><li><strong>ISIC2018_Task3_Test_NatureMedicine_AI_Interaction_Benefit.csv</strong>: Human ratings for test images with and without interaction with a ResNet34 CNN (Malignancy Probability, Multi-Class probability, CBIR) or Human-Crowd Multi-Class probabilities. These data were collected for and analyzed in <a href="https://doi.org/10.1038/s41591-020-0942-0">Tschandl P. et al., Nature Medicine 2020</a>; please refer to this publication when using the data. Some details on the abbreviated column headings:<ul><li><em>image_id:</em> This is the ISIC image_id of an image at the time of the study. There should be no duplicates in the combination of image_id and interaction_modality. As not every image was shown with every interaction modality, not every combination is present.</li><li><em>prob_m_dx_akiec, ... :</em> "m" denotes machine probabilities. Values are softmax outputs, and "_mal" is the sum over all malignant classes.</li><li><em>prob_h_dx_akiec, ... :</em> "h" denotes human probabilities. Values are aggregated percentages of human ratings from past studies distinguishing between seven classes. Note that there is no "prob_h_mal", as this was not one of the tested interaction modalities.</li><li><em>user_dx_without_interaction_akiec, ...:</em> Number of participants choosing this diagnosis without interaction.</li><li><em>user_dx_with_interaction_akiec, ...:</em> Number of participants choosing this diagnosis with interaction.</li></ul></li><li><strong>HAM10000_segmentations_lesion_tschandl.zip</strong>: To evaluate regions of CNN activations in <a href="https://www.nature.com/articles/s41591-020-0942-0">Tschandl P. et al., Nature Medicine 2020</a> (please refer to this publication when using the data), a single dermatologist (Tschandl P) created binary segmentation masks for all 10015 images from the HAM10000 dataset. 
Masks were initialized with the segmentation network described by <a href="https://doi.org/10.1016/j.compbiomed.2018.11.010">Tschandl et al., Computers in Biology and Medicine 2019</a>, and subsequently verified, corrected, or replaced via the free-hand selection tool in <a href="https://fiji.sc/">FIJI</a>.</li></ul></p>
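<p>Because a lesion can appear in more than one image, a random image-level split may place views of the same lesion in both training and validation data. The sketch below shows one possible way to use <code>lesion_id</code> to keep each lesion's images together. It is illustrative only: the rows are made-up placeholders, <code>image_id</code> and <code>dx</code> are assumed column names for the metadata file, and <code>split_by_lesion</code>, <code>val_fraction</code> and <code>seed</code> are hypothetical names that are not part of the dataset release.</p>

```python
import csv
import io
import random
from collections import defaultdict

# Hypothetical stand-in for a few rows of the HAM10000_metadata file.
# Only `lesion_id` is named in the description above; `image_id` and `dx`
# are assumed column names, and every value below is made up.
SAMPLE_CSV = """lesion_id,image_id,dx
lesion_A,img_1,nv
lesion_A,img_2,nv
lesion_B,img_3,mel
lesion_C,img_4,bkl
lesion_C,img_5,bkl
lesion_C,img_6,bkl
lesion_D,img_7,bcc
"""


def split_by_lesion(rows, val_fraction=0.25, seed=0):
    """Split image IDs into train/validation lists without splitting a lesion."""
    # Collect all images that belong to the same lesion.
    images_by_lesion = defaultdict(list)
    for row in rows:
        images_by_lesion[row["lesion_id"]].append(row["image_id"])

    # Shuffle lesions (not images) reproducibly, then reserve a fraction
    # of the lesions for validation.
    lesion_ids = sorted(images_by_lesion)
    random.Random(seed).shuffle(lesion_ids)
    n_val = max(1, round(len(lesion_ids) * val_fraction))
    val_lesions = set(lesion_ids[:n_val])

    # Every image follows its lesion into exactly one of the two lists.
    train, val = [], []
    for lesion_id, image_ids in images_by_lesion.items():
        (val if lesion_id in val_lesions else train).extend(image_ids)
    return train, val


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
train_ids, val_ids = split_by_lesion(rows)
print(len(train_ids), "training images,", len(val_ids), "validation images")
```

<p>In practice the rows would be read from the <strong>HAM10000_metadata</strong> file rather than from the inline sample; the grouping logic stays the same.</p>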
Cited on 01 January 2026
Weight: 1.00
Cited on 28 November 2025
Weight: 1.69
Cited on 29 October 2025
Weight: 1.69
Cited on 19 October 2025
Weight: 1.69
Cited on 29 September 2025
Weight: 1.69
Cited on 29 September 2025
Weight: 1.69
Cited on 08 September 2025
Weight: 1.69
Cited on 18 August 2025
Weight: 1.69
Cited on 25 June 2025
Weight: 1.69
Cited on 09 June 2025
Weight: 1.69
Mentioned on 07 October 2025
Weight: 1.69
Mentioned on 07 October 2025
Weight: 1.69
Mentioned on 07 October 2025
Weight: 1.69
Mentioned on 24 September 2025
Weight: 1.69
Mentioned on 15 September 2025
Weight: 1.69
Mentioned on 12 September 2025
Weight: 1.69
Mentioned on 08 September 2025
Weight: 1.69
Mentioned on 06 September 2025
Weight: 1.69
Mentioned on 02 September 2025
Weight: 1.69
Mentioned on 01 September 2025
Weight: 1.69
Publisher: Harvard Dataverse
Topic Name: Cutaneous Melanoma Detection and Management
Subfield: Oncology
Field: Medicine
Domain: Health Sciences