
How do I fetch datasets with Python's fetch_mldata?

The term "fetch_mldata" refers to the classic way of loading datasets from the mldata.org repository directly into Python. That repository was the original home for many popular machine learning datasets.


However, it's crucial to know that mldata.org has been offline for several years. Therefore, the old methods that relied on it no longer work.

This guide will show you:

  1. The old, now-removed way (so you can understand legacy code).
  2. The modern, correct ways to fetch the same datasets using popular libraries like scikit-learn, TensorFlow, and TensorFlow Datasets.

The Old (Removed) Way: sklearn.datasets.fetch_mldata

This is the direct answer to your query, but it will fail if you try to run it. It's here for educational purposes.

The fetch_mldata function was part of scikit-learn. It worked by downloading a .mat (MATLAB) file from mldata.org and caching it locally.

# THIS CODE WILL FAIL
# mldata.org has been offline for years, and fetch_mldata was
# removed from scikit-learn in version 0.22.
try:
    # On scikit-learn >= 0.22 the import itself raises ImportError.
    from sklearn.datasets import fetch_mldata
    # On older versions, the call fails because the server is down.
    mnist = fetch_mldata('MNIST original')
    print("Dataset fetched successfully!")
    print("Keys:", mnist.keys())
    print("Data shape:", mnist.data.shape)
    print("Target shape:", mnist.target.shape)
except Exception as e:
    print(f"Error: {e}")
    print("The mldata.org repository is no longer available.")

Why it fails: on scikit-learn 0.22 and later the function no longer exists, so the import raises an ImportError; on older versions the server https://mldata.org is unreachable, so you get a connection error or a similar HTTPError.
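
If you maintain legacy code that calls fetch_mldata, the usual fix is a small compatibility wrapper around fetch_openml (covered in detail below). Here is a minimal sketch; the helper name fetch_mnist_compat is my own, not a scikit-learn API:

from sklearn.datasets import fetch_openml

def fetch_mnist_compat():
    # OpenML's 'mnist_784' holds the same 70,000 images as 'MNIST original'.
    mnist = fetch_openml('mnist_784', version=1, as_frame=False)
    # fetch_mldata returned integer labels; fetch_openml returns strings.
    mnist.target = mnist.target.astype(int)
    return mnist

mnist = fetch_mnist_compat()
print(mnist.data.shape)    # (70000, 784)
print(mnist.target.shape)  # (70000,)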


The Modern Ways to Fetch Data

The community has moved on to more reliable and efficient sources. Here’s how to get the most famous datasets, like MNIST, using the current best practices.

Method 1: Using Scikit-Learn (Recommended for most ML tasks)

Scikit-learn has built-in, easy-to-use fetchers for many classic datasets. This is the simplest and most common approach.

Example: Fetching the MNIST Digits Dataset

from sklearn.datasets import fetch_openml
# The 'mnist_784' dataset is the same as the 'MNIST original' from mldata.org
# It contains 70,000 images, each 784 pixels (28x28)
mnist = fetch_openml('mnist_784', version=1, as_frame=False, parser='auto')
# The data is in the 'data' key, and the labels are in the 'target' key
X, y = mnist["data"], mnist["target"]
print("Shape of the feature data (X):", X.shape)
print("Shape of the target labels (y):", y.shape)
# The labels are strings by default; convert them to integers
y = y.astype(int)
# Let's look at the first digit
import matplotlib.pyplot as plt
some_digit_image = X[0].reshape(28, 28)
plt.imshow(some_digit_image, cmap="binary", interpolation="nearest")
plt.title(f"Label: {y[0]}")
plt.axis("off")
plt.show()

Key points:

  • fetch_openml: The modern replacement for fetch_mldata.
  • as_frame=False: Returns a NumPy array (like the old method). Set to True to get a Pandas DataFrame.
  • parser='auto': Handles the parsing of the data file automatically.
  • The data is already split into a training set (the first 60,000 images) and a test set (the last 10,000 images); the sketch below uses this split:
    X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
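
To check that the fetched data plugs into the rest of scikit-learn, here is a minimal sketch that trains a linear classifier on that split. The SGDClassifier choice and its settings are illustrative, not anything fetch_openml requires:

from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

# A simple linear model is enough to sanity-check the data pipeline.
clf = SGDClassifier(random_state=42)
clf.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))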

Method 2: Using TensorFlow Keras (Excellent for Deep Learning)

TensorFlow's Keras API provides very convenient access to image and text datasets, often with built-in preprocessing and splitting.

Example: Fetching MNIST with Keras

import tensorflow as tf
# tf.keras.datasets exposes each dataset as a small module with a load_data() helper
mnist_keras = tf.keras.datasets.mnist
# load_data() returns two tuples: (train_data, train_labels), (test_data, test_labels)
(X_train, y_train), (X_test, y_test) = mnist_keras.load_data()
print("Shape of X_train:", X_train.shape) # (60000, 28, 28)
print("Shape of y_train:", y_train.shape) # (60000,)
print("Shape of X_test:", X_test.shape)   # (10000, 28, 28)
# The data comes as 28x28 pixel images. We can flatten it if needed.
# X_train_flat = X_train.reshape((60000, 28 * 28))
# Let's look at the first digit
import matplotlib.pyplot as plt
plt.imshow(X_train[0], cmap=plt.cm.binary)
plt.title(f"Label: {y_train[0]}")
plt.axis("off")
plt.show()

Key points:

  • tf.keras.datasets.mnist: The module containing the dataset.
  • load_data(): A simple function that immediately returns the train/test split.
  • The data comes in image format (num_samples, height, width), which is often preferred for Convolutional Neural Networks (CNNs); see the preprocessing sketch below.
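
Because the images arrive as raw uint8 arrays, a typical next step before feeding a CNN is to scale the pixels and add the channel axis. A minimal sketch; the [0, 1] scaling convention is a common choice, not something load_data enforces:

import numpy as np

# Scale pixel values from [0, 255] down to [0.0, 1.0].
X_train_norm = X_train.astype("float32") / 255.0
X_test_norm = X_test.astype("float32") / 255.0

# Add a trailing channel dimension: (60000, 28, 28) -> (60000, 28, 28, 1).
X_train_norm = np.expand_dims(X_train_norm, -1)
X_test_norm = np.expand_dims(X_test_norm, -1)
print(X_train_norm.shape)  # (60000, 28, 28, 1)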

Method 3: Using TensorFlow Datasets (TFDS) (For Research & Large-Scale Training)

TFDS is a high-performance library for loading and preparing datasets, with a huge catalog to choose from. It's great for research and for training models across multiple GPU/TPU devices.

Example: Fetching MNIST with TFDS

import tensorflow as tf  # needed later for tf.data.AUTOTUNE
import tensorflow_datasets as tfds
# tfds.load() downloads the data, prepares it, and returns a tf.data.Dataset
# It can also split the data for you
(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,  # returns (img, label) tuples
    with_info=True,
)
# ds_info contains metadata about the dataset
print("Dataset Info:")
print(ds_info)
# Let's look at one example from the training set
example = next(iter(ds_train.take(1)))
image, label = example
print("\nShape of an image tensor:", image.shape)
print("Label:", label.numpy())
# The image is a tensor; convert it to a NumPy array for plotting
import matplotlib.pyplot as plt
plt.imshow(image.numpy()[:, :, 0], cmap=plt.cm.binary)
plt.title(f"Label: {label.numpy()}")
plt.axis("off")
plt.show()
# You can easily create batches for training
batch_size = 128
train_ds = ds_train.cache().batch(batch_size).prefetch(tf.data.AUTOTUNE)
test_ds = ds_test.cache().batch(batch_size).prefetch(tf.data.AUTOTUNE)

Key points:

  • tfds.load(): The main function for loading datasets.
  • as_supervised=True: Returns a simple (features, label) structure, which is easy to work with.
  • with_info=True: Returns a DatasetInfo object with metadata.
  • Performance: The .cache(), .batch(), and .prefetch() methods create a highly optimized data pipeline that prevents the CPU from becoming a bottleneck during model training; the sketch below extends that pipeline with on-the-fly normalization.
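
In practice you usually add a map step to the same pipeline so images are normalized on the fly. A minimal sketch; the normalize_img helper is my own name, not a TFDS API:

def normalize_img(image, label):
    # Cast uint8 pixels to float32 in [0, 1]; this runs inside the tf.data graph.
    return tf.cast(image, tf.float32) / 255.0, label

train_ds = (
    ds_train
    .map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()
    .shuffle(10_000)
    .batch(128)
    .prefetch(tf.data.AUTOTUNE)
)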

Summary: Which Method Should You Use?

| Library | Best For | Pros | Cons |
| --- | --- | --- | --- |
| Scikit-Learn | General machine learning, classic datasets, prototyping | Very simple API; integrates perfectly with scikit-learn's tools; returns NumPy/Pandas | Not as optimized for deep learning or large-scale data loading |
| TensorFlow Keras | Deep learning, especially image/text tasks | Extremely simple; data is pre-processed and split; great integration with the TF ecosystem | Primarily designed for TensorFlow models |
| TensorFlow Datasets | Research, large-scale training, multi-device/TPU setups | Highest performance; huge catalog of datasets; powerful data pipeline tools (cache, prefetch) | Slightly more complex API than Keras; requires understanding of tf.data.Dataset |

If you are just starting out, use Scikit-Learn's fetch_openml. If you are doing deep learning, use TensorFlow's Keras datasets. For advanced research and large-scale training, TFDS is the professional standard.
