
How do I fetch datasets with Python's fetch_mldata?

The term "fetch_mldata" refers to the classic way of loading datasets from the mldata.org repository directly into Python. That repository was the original home for many popular machine learning datasets.


However, it's crucial to know that mldata.org has been offline for several years. Therefore, the old methods that relied on it no longer work.

This guide will show you:

  1. The old, now-removed way (so you can understand legacy code).
  2. The modern, correct ways to fetch the same datasets using popular libraries like scikit-learn, TensorFlow, and TensorFlow Datasets.

The Old (Removed) Way: sklearn.datasets.fetch_mldata

This is the direct answer to your query, but it will fail if you try to run it. It's here for educational purposes.

The fetch_mldata function was part of scikit-learn. It worked by downloading a .mat (MATLAB) file from mldata.org and caching it locally.

# THIS CODE WILL FAIL
# mldata.org has been offline for years, and fetch_mldata was
# removed from scikit-learn in version 0.22.
try:
    # On scikit-learn >= 0.22 the import itself raises ImportError.
    from sklearn.datasets import fetch_mldata
    # On older versions, the call fails because the server is down.
    mnist = fetch_mldata('MNIST original')
    print("Dataset fetched successfully!")
    print("Keys:", mnist.keys())
    print("Data shape:", mnist.data.shape)
    print("Target shape:", mnist.target.shape)
except Exception as e:
    print(f"Error: {e}")
    print("The mldata.org repository is no longer available.")

Why it fails: on scikit-learn 0.22 and later the function no longer exists, so the import raises an ImportError; on older versions the server https://mldata.org is unreachable, so you get a connection error or a similar HTTPError.
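
If you maintain legacy code that calls fetch_mldata, the usual fix is a small compatibility wrapper around fetch_openml (covered in detail below). Here is a minimal sketch; the helper name fetch_mnist_compat is my own, not a scikit-learn API:

from sklearn.datasets import fetch_openml

def fetch_mnist_compat():
    # OpenML's 'mnist_784' holds the same 70,000 images as 'MNIST original'.
    mnist = fetch_openml('mnist_784', version=1, as_frame=False)
    # fetch_mldata returned integer labels; fetch_openml returns strings.
    mnist.target = mnist.target.astype(int)
    return mnist

mnist = fetch_mnist_compat()
print(mnist.data.shape)    # (70000, 784)
print(mnist.target.shape)  # (70000,)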


The Modern Ways to Fetch Data

The community has moved on to more reliable and efficient sources. Here’s how to get the most famous datasets, like MNIST, using the current best practices.

Method 1: Using Scikit-Learn (Recommended for most ML tasks)

Scikit-learn has built-in, easy-to-use fetchers for many classic datasets. This is the simplest and most common approach.

Example: Fetching the MNIST Digits Dataset

from sklearn.datasets import fetch_openml
# The 'mnist_784' dataset is the same as the 'MNIST original' from mldata.org
# It contains 70,000 images, each 784 pixels (28x28)
mnist = fetch_openml('mnist_784', version=1, as_frame=False, parser='auto')
# The data is in the 'data' key, and the labels are in the 'target' key
X, y = mnist["data"], mnist["target"]
print("Shape of the feature data (X):", X.shape)
print("Shape of the target labels (y):", y.shape)
# The labels are strings by default; convert them to integers
y = y.astype(int)
# Let's look at the first digit
import matplotlib.pyplot as plt
some_digit_image = X[0].reshape(28, 28)
plt.imshow(some_digit_image, cmap="binary", interpolation="nearest")
plt.title(f"Label: {y[0]}")
plt.axis("off")
plt.show()

Key points:

  • fetch_openml: The modern replacement for fetch_mldata.
  • as_frame=False: Returns a NumPy array (like the old method). Set to True to get a Pandas DataFrame.
  • parser='auto': Handles the parsing of the data file automatically.
  • The data is already split into a training set (the first 60,000 images) and a test set (the last 10,000 images); the sketch below uses this split:
    X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
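
To check that the fetched data plugs into the rest of scikit-learn, here is a minimal sketch that trains a linear classifier on that split. The SGDClassifier choice and its settings are illustrative, not anything fetch_openml requires:

from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

# A simple linear model is enough to sanity-check the data pipeline.
clf = SGDClassifier(random_state=42)
clf.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))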

Method 2: Using TensorFlow Keras (Excellent for Deep Learning)

TensorFlow's Keras API provides very convenient access to image and text datasets, often with built-in preprocessing and splitting.

Example: Fetching MNIST with Keras

import tensorflow as tf
# tf.keras.datasets exposes each dataset as a small module with a load_data() helper
mnist_keras = tf.keras.datasets.mnist
# load_data() returns two tuples: (train_data, train_labels), (test_data, test_labels)
(X_train, y_train), (X_test, y_test) = mnist_keras.load_data()
print("Shape of X_train:", X_train.shape) # (60000, 28, 28)
print("Shape of y_train:", y_train.shape) # (60000,)
print("Shape of X_test:", X_test.shape)   # (10000, 28, 28)
# The data comes as 28x28 pixel images. We can flatten it if needed.
# X_train_flat = X_train.reshape((60000, 28 * 28))
# Let's look at the first digit
import matplotlib.pyplot as plt
plt.imshow(X_train[0], cmap=plt.cm.binary)
plt.title(f"Label: {y_train[0]}")
plt.axis("off")
plt.show()

Key points:

  • tf.keras.datasets.mnist: The module containing the dataset.
  • load_data(): A simple function that immediately returns the train/test split.
  • The data comes in image format (num_samples, height, width), which is often preferred for Convolutional Neural Networks (CNNs); see the preprocessing sketch below.
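
Because the images arrive as raw uint8 arrays, a typical next step before feeding a CNN is to scale the pixels and add the channel axis. A minimal sketch; the [0, 1] scaling convention is a common choice, not something load_data enforces:

import numpy as np

# Scale pixel values from [0, 255] down to [0.0, 1.0].
X_train_norm = X_train.astype("float32") / 255.0
X_test_norm = X_test.astype("float32") / 255.0

# Add a trailing channel dimension: (60000, 28, 28) -> (60000, 28, 28, 1).
X_train_norm = np.expand_dims(X_train_norm, -1)
X_test_norm = np.expand_dims(X_test_norm, -1)
print(X_train_norm.shape)  # (60000, 28, 28, 1)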

Method 3: Using TensorFlow Datasets (TFDS) (For Research & Large-Scale Training)

TFDS is a high-performance library for loading and preparing datasets, with a huge catalog to choose from. It's great for research and for training models across multiple GPU/TPU devices.

Example: Fetching MNIST with TFDS

import tensorflow as tf  # needed later for tf.data.AUTOTUNE
import tensorflow_datasets as tfds
# tfds.load() downloads the data, prepares it, and returns a tf.data.Dataset
# It can also split the data for you
(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,  # returns (img, label) tuples
    with_info=True,
)
# ds_info contains metadata about the dataset
print("Dataset Info:")
print(ds_info)
# Let's look at one example from the training set
example = next(iter(ds_train.take(1)))
image, label = example
print("\nShape of an image tensor:", image.shape)
print("Label:", label.numpy())
# The image is a tensor; convert it to a NumPy array for plotting
import matplotlib.pyplot as plt
plt.imshow(image.numpy()[:, :, 0], cmap=plt.cm.binary)
plt.title(f"Label: {label.numpy()}")
plt.axis("off")
plt.show()
# You can easily create batches for training
batch_size = 128
train_ds = ds_train.cache().batch(batch_size).prefetch(tf.data.AUTOTUNE)
test_ds = ds_test.cache().batch(batch_size).prefetch(tf.data.AUTOTUNE)

Key points:

  • tfds.load(): The main function for loading datasets.
  • as_supervised=True: Returns a simple (features, label) structure, which is easy to work with.
  • with_info=True: Returns a DatasetInfo object with metadata.
  • Performance: The .cache(), .batch(), and .prefetch() methods create a highly optimized data pipeline that prevents the CPU from becoming a bottleneck during model training; the sketch below extends that pipeline with on-the-fly normalization.
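
In practice you usually add a map step to the same pipeline so images are normalized on the fly. A minimal sketch; the normalize_img helper is my own name, not a TFDS API:

def normalize_img(image, label):
    # Cast uint8 pixels to float32 in [0, 1]; this runs inside the tf.data graph.
    return tf.cast(image, tf.float32) / 255.0, label

train_ds = (
    ds_train
    .map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()
    .shuffle(10_000)
    .batch(128)
    .prefetch(tf.data.AUTOTUNE)
)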

Summary: Which Method Should You Use?

| Library | Best For | Pros | Cons |
| --- | --- | --- | --- |
| Scikit-Learn | General machine learning, classic datasets, prototyping | Very simple API; integrates perfectly with scikit-learn's tools; returns NumPy/Pandas | Not as optimized for deep learning or large-scale data loading |
| TensorFlow Keras | Deep learning, especially image/text tasks | Extremely simple; data is pre-processed and split; great integration with the TF ecosystem | Primarily designed for TensorFlow models |
| TensorFlow Datasets | Research, large-scale training, multi-device/TPU setups | Highest performance; huge catalog of datasets; powerful data pipeline tools (cache, prefetch) | Slightly more complex API than Keras; requires understanding of tf.data.Dataset |

If you are just starting out, use Scikit-Learn's fetch_openml. If you are doing deep learning, use TensorFlow's Keras datasets. For advanced research and large-scale training, TFDS is the professional standard.
