Published on 01 January 2018

Version 4.0

The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions

Tschandl, Philipp

Description

<p>Training of neural networks for automated diagnosis of pigmented skin lesions is hampered by the small size and lack of diversity of available datasets of dermatoscopic images. We tackle this problem by releasing the HAM10000 ("Human Against Machine with 10000 training images") dataset. We collected dermatoscopic images from different populations, acquired and stored by different modalities. The final dataset consists of 10015 dermatoscopic images which can serve as a training set for academic machine learning purposes. Cases include a representative collection of all important diagnostic categories in the realm of pigmented lesions: actinic keratoses and intraepithelial carcinoma / Bowen's disease (<code>akiec</code>), basal cell carcinoma (<code>bcc</code>), benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen planus-like keratoses, <code>bkl</code>), dermatofibroma (<code>df</code>), melanoma (<code>mel</code>), melanocytic nevi (<code>nv</code>) and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, <code>vasc</code>).</p><p>More than 50% of lesions are confirmed through histopathology (<code>histo</code>); the ground truth for the rest of the cases is either follow-up examination (<code>follow_up</code>), expert consensus (<code>consensus</code>), or confirmation by in-vivo confocal microscopy (<code>confocal</code>). 
The dataset includes lesions with multiple images, which can be tracked by the <code>lesion_id</code> column within the <strong>HAM10000_metadata</strong> file.</p><p>Due to upload size limitations, images are stored in two files:<ul><li><strong>HAM10000_images_part1.zip</strong> (5000 JPEG files)</li><li><strong>HAM10000_images_part2.zip</strong> (5015 JPEG files)</li></ul></p><p><h3>Additional data for evaluation purposes</h3>The HAM10000 dataset served as the training set for the <a href='http://arxiv.org/abs/1902.03368'>ISIC 2018 challenge (Task 3)</a>, with the same sources also contributing the majority of the validation and test sets. The test-set images are available herein as <strong>ISIC2018_Task3_Test_Images.zip (1511 images)</strong>; the ground truth, in the same format as the HAM10000 data (public since 2023), is available as <strong>ISIC2018_Task3_Test_GroundTruth.csv</strong>. The ISIC-Archive also provides the challenge images and metadata (training, validation, test) at their <a href="https://challenge.isic-archive.com/data/#2018">"ISIC Challenge Datasets" page</a>.</p><p><h3>Comparison to physicians</h3>Test-set evaluations of the ISIC 2018 challenge were compared to physicians on an international scale, where the majority of challenge participants outperformed expert readers: <a href="https://doi.org/10.1016/S1470-2045(19)30333-X">Tschandl P. et al., Lancet Oncol 2019</a></p><p><h3>Human-computer collaboration</h3>The test-set images were also used in a study comparing different methods and scenarios of human-computer collaboration: <a href="https://www.nature.com/articles/s41591-020-0942-0">Tschandl P. 
et al., Nature Medicine 2020</a><br>The following corresponding metadata is available herein:<ul><li><strong>ISIC2018_Task3_Test_NatureMedicine_AI_Interaction_Benefit.csv</strong>: Human ratings for test images with and without interaction with a ResNet34 CNN (Malignancy Probability, Multi-Class probability, CBIR) or Human-Crowd Multi-Class probabilities. This data was collected for and analyzed in <a href="https://doi.org/10.1038/s41591-020-0942-0">Tschandl P. et al., Nature Medicine 2020</a>; please refer to this publication when using the data. Some details on the abbreviated column headings:<ul><li><em>image_id:</em> This is the ISIC image_id of an image at the time of the study. There should be no duplications in the combination image_id & interaction_modality. As not every image was shown with every interaction modality, not every combination is present.</li><li><em>prob_m_dx_akiec, ... :</em> m is "machine probabilities". Values are softmax outputs, and "_mal" is the sum over all malignant classes.</li><li><em>prob_h_dx_akiec, ... :</em> h is "human probabilities". Values are aggregated percentages of human ratings from past studies distinguishing between seven classes. Note that there is no "prob_h_mal", as this was not one of the tested interaction modalities.</li><li><em>user_dx_without_interaction_akiec, ...:</em> Number of participants choosing this diagnosis without interaction.</li><li><em>user_dx_with_interaction_akiec, ...:</em> Number of participants choosing this diagnosis with interaction.</li></ul></li><li><strong>HAM10000_segmentations_lesion_tschandl.zip</strong>: To evaluate regions of CNN activations in <a href="https://www.nature.com/articles/s41591-020-0942-0">Tschandl P. et al., Nature Medicine 2020</a> (please refer to this publication when using the data), a single dermatologist (Tschandl P) created binary segmentation masks for all 10015 images from the HAM10000 dataset. 
Masks were initialized with the segmentation network as described by <a href="https://doi.org/10.1016/j.compbiomed.2018.11.010">Tschandl et al., Computers in Biology and Medicine 2019</a>, and subsequently verified, corrected or replaced via the free-hand selection tool in <a href="https://fiji.sc/">FIJI</a>.</li></ul></p>
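
Because some lesions appear in several images, a random per-image split can place images of the same lesion in both the training and the validation set. The sketch below shows one possible way to split at the lesion level using the `lesion_id` column. It is an illustration under stated assumptions, not the authors' procedure: the column names `image_id`, `dx` and `dx_type`, the helper `split_by_lesion`, and all sample rows are hypothetical placeholders; only `lesion_id` and the diagnosis and ground-truth codes come from the description above.

```python
import csv
import io
import random
from collections import defaultdict


def split_by_lesion(rows, val_fraction=0.2, seed=0):
    """Split metadata rows so that no lesion_id occurs in both subsets.

    Hypothetical helper: all images of one lesion stay on the same side.
    """
    by_lesion = defaultdict(list)
    for row in rows:
        by_lesion[row["lesion_id"]].append(row)

    # Sort before shuffling so the result depends only on the seed.
    lesion_ids = sorted(by_lesion)
    random.Random(seed).shuffle(lesion_ids)

    # Hold out a fraction of lesions (not images) for validation.
    n_val = max(1, round(len(lesion_ids) * val_fraction))
    val_ids = lesion_ids[:n_val]
    train_ids = lesion_ids[n_val:]

    train = [r for lid in train_ids for r in by_lesion[lid]]
    val = [r for lid in val_ids for r in by_lesion[lid]]
    return train, val


# Made-up rows standing in for the metadata file (not real records;
# every column name except lesion_id is an assumption).
SAMPLE_CSV = """lesion_id,image_id,dx,dx_type
L1,IMG_a,nv,histo
L1,IMG_b,nv,histo
L2,IMG_c,mel,histo
L3,IMG_d,bkl,consensus
L3,IMG_e,bkl,consensus
L4,IMG_f,bcc,histo
L5,IMG_g,nv,follow_up
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
train_rows, val_rows = split_by_lesion(rows, val_fraction=0.4, seed=0)

train_lesions = {r["lesion_id"] for r in train_rows}
val_lesions = {r["lesion_id"] for r in val_rows}

# The two lesion sets should be disjoint.
print(len(train_rows), len(val_rows), train_lesions & val_lesions)
```

To apply this to the real metadata, the rows would be read from the downloaded HAM10000_metadata file instead of `SAMPLE_CSV`, using whatever path and delimiter that file actually has. Splitting by lesion rather than by image trades exact control over the image-level ratio for a validation set that contains no lesion seen during training.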

Citations (117)

Mentions (827)

Metrics

Dataset Index

506.6

FAIR Score

15%

Publication Details

Publisher

Harvard Dataverse

Assigned Domain

Topic Name

Cutaneous Melanoma Detection and Management

Subfield

Oncology

Field

Medicine

Domain

Health Sciences

Keywords

Medicine, Health and Life Sciences; Computer and Information Science; Dermatoscopy

Normalization Factors

FT

13.46

CTw

1.00

MTw

1.00