BOW does not take all chords into account
I'm exploring whether the BOW (bag-of-words) approach can be used to generate vectors from chord representations. When I apply it, vectors are produced, but not all of the chords are counted.
Here is the code in detail:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# DF
music chords
0 1.wav N, A7, Am7, Am7b5/G, A7, N
1 2.wav N, Em, C, D, Em, C, D, N
2 3.wav N, E, A, E, B, A, D6, E, N
#BOW
bow = CountVectorizer(max_features=1000, ngram_range=(1,1))
train_bow = bow.fit_transform(df['chords'])
pd.DataFrame(bow.transform(df['chords']).toarray(), columns=sorted(bow.vocabulary_.keys()))
#Result
a7 am7 am7b5 d6 em
0 2 1 1 0 0
1 0 0 0 0 2
2 0 0 0 1 0
Chords such as C, D, and A, for example, are not counted at all. Does anyone know what I'm doing wrong?
sklearn's default tokenizer is not suited to your input: the default token_pattern, (?u)\b\w\w+\b, only keeps tokens of two or more word characters, so single-letter chords such as C, D, and A are dropped, and Am7b5/G is split at the slash. Splitting on the commas yourself fixes this:
tokenizer = lambda x: x.replace(" ", "").split(",")
bow = CountVectorizer(max_features=1000, tokenizer=tokenizer, ngram_range=(1,1))
train_bow = bow.fit_transform(df['chords'])
pd.DataFrame(bow.transform(df['chords']).toarray(), columns=sorted(bow.vocabulary_.keys()))
Printing the vocabulary:
>>> bow.vocabulary_.keys()
dict_keys(['n', 'a7', 'am7', 'am7b5/g', 'em', 'c', 'd', 'e', 'a', 'b', 'd6'])
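As an aside, instead of passing a custom tokenizer you could override token_pattern itself. A minimal sketch (the DataFrame is rebuilt here from the question's sample data so the snippet is self-contained):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({"chords": [
    "N, A7, Am7, Am7b5/G, A7, N",
    "N, Em, C, D, Em, C, D, N",
    "N, E, A, E, B, A, D6, E, N",
]})

# Match any run of characters that is neither a comma nor whitespace,
# so single-letter chords and slash chords like Am7b5/G survive intact.
bow = CountVectorizer(token_pattern=r"[^,\s]+")
train_bow = bow.fit_transform(df["chords"])
print(sorted(bow.vocabulary_.keys()))
```

Tokens are lowercased by default before the pattern is applied, which is why the vocabulary keys come out as `a7`, `am7b5/g`, and so on.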
I wrote one function to build the vocabulary manually and another to tokenize the input.
The output looks like this:
>>>
b d am7 c em n a e a7 am7b5/g d6
0 0 0 1 0 0 2 0 0 2 1 0
1 0 2 0 2 2 2 0 0 0 0 0
2 1 0 0 0 0 2 2 3 0 0 1
And the code:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def voc(chord):
    items = []
    for item in chord:
        items += item.split(', ')
    items = [el.lower() for el in items]
    vocabulary = list(set(items))
    return vocabulary

def tokenizer(item):
    items = item.split(', ')
    items = [el.lower() for el in items]
    return items

df = pd.read_excel("df.xlsx")  # I created a df for test purposes; replace with yours
chord = list(df['chords'].values)
vocabulary = voc(chord)

# BOW
bow = CountVectorizer(vocabulary=vocabulary, tokenizer=tokenizer, max_features=1000, ngram_range=(1, 1))
train_bow = bow.fit_transform(df['chords'])
bow = pd.DataFrame(bow.transform(df['chords']).toarray(), columns=bow.vocabulary_.keys())
Let me know if this is what you need!