Load Custom Dataset (which is like 20 news group set) in Scikit for Classification of text documents

I am trying to run this scikit example code on my custom dataset of Ted Talks. Each directory is a topic, and under each topic are text files containing the description of one Ted talk.

This is my dataset's tree structure. As you can see, each directory is a topic, and under it are text files with the descriptions.

Topics/
|-- Activism
|   |-- 1149.txt
|   |-- 1444.txt
|   |-- 157.txt
|   |-- 1616.txt
|   |-- 1706.txt
|   |-- 1718.txt
|-- Adventure
|   |-- 1036.txt
|   |-- 1777.txt
|   |-- 2930.txt
|   |-- 2968.txt
|   |-- 3027.txt
|   |-- 3290.txt
|-- Advertising
|   |-- 3673.txt
|   |-- 3685.txt
|   |-- 6567.txt
|   `-- 6925.txt
|-- Africa
|   |-- 1045.txt
|   |-- 1072.txt
|   |-- 1103.txt
|   |-- 1112.txt
|-- Aging
|   |-- 1848.txt
|   |-- 2495.txt
|   |-- 2782.txt
|-- Agriculture
|   |-- 3469.txt
|   |-- 4140.txt
|   |-- 4733.txt
|   |-- 4939.txt

I have made my dataset similar in form to the 20 newsgroups set, whose tree structure looks like this:

20news-18828/
|-- alt.atheism
|   |-- 49960
|   |-- 51060
|   |-- 51119
|-- comp.graphics
|   |-- 37261
|   |-- 37913
|   |-- 37914
|   |-- 37915
|   |-- 37916
|   |-- 37917
|   |-- 37918
|-- comp.os.ms-windows.misc
|   |-- 10000
|   |-- 10001
|   |-- 10002
|   |-- 10003
|   |-- 10004
|   |-- 10005 

In the original code (lines 98-124), this is how the training and test data are loaded directly from scikit:

print("Loading 20 newsgroups dataset for categories:")
print(categories if categories else "all")

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)
print('data loaded')

categories = data_train.target_names    # for case categories == None
def size_mb(docs):
    return sum(len(s.encode('utf-8')) for s in docs) / 1e6

data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)

print("%d documents - %0.3fMB (training set)" % (
    len(data_train.data), data_train_size_mb))
print("%d documents - %0.3fMB (test set)" % (
    len(data_test.data), data_test_size_mb))
print("%d categories" % len(categories))
print()

# split a training set and a test set
y_train, y_test = data_train.target, data_test.target

Since this dataset ships with Scikit, its labels and so on are built in. In my case, I know how to load my dataset (Line 84):

dataset = load_files('./TED_dataset/Topics/')

I don't know what I should do after that. I would like to know how I should split this data into training and test sets and generate these labels from my dataset:

data_train.data,  data_test.data 

In summary, I just want to load my dataset and run this code on it without errors. I have uploaded the dataset here for anyone who may want to look at it.

I have referred to this, which briefly covers train/test loading. I would also like to know how to get data_train.target_names from my dataset.

EDIT:

I tried to make the train and test split, but it returns an error:

dataset = load_files('./TED_dataset/Topics/')
train, test = train_test_split(dataset, train_size = 0.8)

The updated code is here.

With the code you referenced, the dataset is downloaded from the sklearn package, and so are the training and test sets (via the fetch_20newsgroups() function). If you want to load your own dataset, you have to preprocess your data: vectorize the text, extract features, and ideally put everything into nice numpy arrays or matrices. There are suitable functions that do this for you. The code you referenced cannot do anything with plain text files directly (computing on raw letters and words alone is hard anyway ;-)).
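As an illustration of the vectorization step, here is a minimal sketch using TfidfVectorizer; the toy documents below are invented stand-ins, and in practice you would pass bunch.data from load_files:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents standing in for the Ted Talk descriptions in bunch.data
docs = [
    "a talk about activism and social change",
    "an adventure story about climbing mountains",
    "how advertising shapes what consumers buy",
]

# Turn raw text into a sparse TF-IDF feature matrix:
# one row per document, one column per distinct term
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

print(X.shape)  # (3, number_of_distinct_terms)
```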

Beyond that, you can define your own training and test sets. Usually one uses 90% of the data for training and 10% as test data. If you want to go further, you can use 10-fold cross-validation: split your data into 10 parts, train on the first 9 parts and test on the 10th in the first round, then train on the first 8 parts plus the 10th and test on the 9th in the second round, and so on.
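The 10-fold scheme above can be sketched with cross_val_score; the toy documents and labels here are invented, and in practice X and labels would come from vectorizing bunch.data and using bunch.target:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the documents and their topic labels
docs = ["activism march protest speech"] * 10 + \
       ["mountain climbing adventure trek"] * 10
labels = np.array([0] * 10 + [1] * 10)

X = CountVectorizer().fit_transform(docs)

# 10 folds: each round trains on 9 parts and tests on the held-out part
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(MultinomialNB(), X, labels, cv=cv)

print(len(scores), scores.mean())
```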

I think you are looking for something like this:

In [1]: from sklearn.datasets import load_files

In [2]: from sklearn.model_selection import train_test_split

In [3]: bunch = load_files('./Topics')

In [4]: X_train, X_test, y_train, y_test = train_test_split(bunch.data, bunch.target, test_size=.4)

# Then proceed to train your model and validate.

Note that bunch.target is an array of integers, which are indices into the category names stored in bunch.target_names.

In [14]: X_test[:2]
Out[14]:
['Psychologist Philip Zimbardo asks, "Why are boys struggling?" He shares some stats (lower graduation rates, greater worries about intimacy and relationships) and suggests a few reasons -- and challenges the TED community to think about solutions.Philip Zimbardo was the leader of the notorious 1971 Stanford Prison Experiment -- and an expert witness at Abu Ghraib. His book The Lucifer Effect explores the nature of evil; now, in his new work, he studies the nature of heroism.',
 'Human growth has strained the Earth\'s resources, but as Johan Rockstrom reminds us, our advances also give us the science to recognize this and change behavior. His research has found nine "planetary boundaries" that can guide us in protecting our planet\'s many overlapping ecosystems.If Earth is a self-regulating system, it\'s clear that human activity is capable of disrupting it. Johan Rockstrom has led a team of scientists to define the nine Earth systems that need to be kept within bounds for Earth to keep itself in balance.']

In [15]: y_test[:2]
Out[15]: array([ 84, 113])

In [16]: [bunch.target_names[idx] for idx in y_test[:2]]
Out[16]: ['Education', 'Global issues']
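To then run a classifier end to end, here is a hedged sketch of the remaining steps; the documents, labels, and topic names below are invented stand-ins for bunch.data, bunch.target, and bunch.target_names:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Invented stand-ins for bunch.data / bunch.target / bunch.target_names
data = ["activism rally speech", "protest movement talk",
        "mountain expedition story", "climbing adventure tale"] * 5
target = [0, 0, 1, 1] * 5
target_names = ['Activism', 'Adventure']

X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.4, random_state=42)

# Fit the vectorizer on the training text only, then reuse it on the test text
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

clf = MultinomialNB().fit(X_train_vec, y_train)
pred = clf.predict(X_test_vec)

# Map predicted indices back to readable topic names
print([target_names[i] for i in pred[:2]])
```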