Best way to get joint probability from a 2D numpy array
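I am wondering whether there is a better way to get probabilities from a 2D numpy array, perhaps using some of numpy's built-in functions.
For simplicity, say we have this example array: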
[['apple','pie'],
['apple','juice'],
['orange','pie'],
['strawberry','cream'],
['strawberry','candy']]
I want to get probabilities such as:
['apple' 'juice'] --> 0.4 * 0.5 = 0.2
['apple' 'pie'] --> 0.4 * 0.5 = 0.2
['orange' 'pie'] --> 0.2 * 1.0 = 0.2
['strawberry' 'candy'] --> 0.4 * 0.5 = 0.2
['strawberry' 'cream'] --> 0.4 * 0.5 = 0.2
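Here 'juice' as the second word has probability 0.2, because 'apple' has probability 2/5 and 'juice' follows 'apple' 1/2 of the time (2/5 * 1/2).
'pie' as the second word, on the other hand, has probability 0.4: it is the sum of the probabilities contributed by 'apple' and 'orange'.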
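The arithmetic above can be cross-checked with a minimal pure-Python sketch. It assumes the joint probability is P(first) * P(second | first) and that the probability of a second word is the sum of the joint probabilities of the pairs containing it; the variable names are illustrative only.

```python
from collections import Counter

rows = [("apple", "pie"), ("apple", "juice"), ("orange", "pie"),
        ("strawberry", "cream"), ("strawberry", "candy")]
n_rows = len(rows)

first_counts = Counter(first for first, _ in rows)  # occurrences of each first word
pair_counts = Counter(rows)                         # occurrences of each (first, second) pair

joint = {}
second_marginal = Counter()
for (first, second), count in pair_counts.items():
    p_first = first_counts[first] / n_rows             # P(first)
    p_second_given_first = count / first_counts[first]  # P(second | first)
    joint[(first, second)] = p_first * p_second_given_first
    second_marginal[second] += joint[(first, second)]   # sum over first words

for pair, p in sorted(joint.items()):
    print(pair, round(p, 3))
print({word: round(p, 3) for word, p in sorted(second_marginal.items())})
```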
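My approach is to add 3 new columns to the array: the probability of column 0, the probability of column 1 given column 0, and the final probability. I group the array by the first column, then by the second column, and update the probabilities accordingly.
Here is my code: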
import numpy as np

a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],['strawberry','cream'],['strawberry','candy']])
ans = []
unique, counts = np.unique(a.T[0], return_counts=True)  # transpose a and count the unique first words
myCounter = zip(unique, counts)
num_rows = sum(counts)
a = np.c_[a, np.zeros(num_rows), np.zeros(num_rows), np.zeros(num_rows)]  # add 3 columns to a
groups = []
# gather groups based on column 0
for _unique, _count in myCounter:
    index = a[:, 0] == _unique  # rows where column 0 matches _unique
    curr_a = a[index]
    for j in range(len(curr_a)):
        curr_a[j][2] = _count / num_rows
    groups.append(curr_a)
# count the unique values of column 1, per group
for g in groups:
    unique, counts = np.unique(g.T[1], return_counts=True)
    myCounter = zip(unique, counts)
    num_rows = sum(counts)
    for _unique, _count in myCounter:
        index = g[:, 1] == _unique
        curr_g = g[index]
        for j in range(len(curr_g)):
            curr_g[j][3] = _count / num_rows
            curr_g[j][4] = float(curr_g[j][2]) * float(curr_g[j][3])  # compute the final probability
            ans.append(curr_g[j])
for an in ans:
    print(an)
Output:
['apple' 'juice' '0.4' '0.5' '0.2']
['apple' 'pie' '0.4' '0.5' '0.2']
['orange' 'pie' '0.2' '1.0' '0.2']
['strawberry' 'candy' '0.4' '0.5' '0.2']
['strawberry' 'cream' '0.4' '0.5' '0.2']
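I am wondering whether there is a shorter or faster way to do this, with numpy or otherwise. Adding the columns is not required; it is just how I did it. Other approaches are welcome.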
Based on the definition of the probability distribution you gave, you can do the same thing with pandas:
import numpy as np
import pandas as pd

a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],['strawberry','cream'],['strawberry','candy']])
df = pd.DataFrame(a)
# Frequency of the first word divided by the total number of rows
df[2] = df[0].map(df[0].value_counts()) / df.shape[0]
# 1 divided by the number of times the first word repeats
# (this equals P(second | first) only when each pair occurs once, as it does here)
df[3] = 1 / df[0].map(df[0].value_counts())
# Multiply the probabilities
df[4] = df[2] * df[3]
print(df)
Output:
            0      1    2    3    4
0       apple    pie  0.4  0.5  0.2
1       apple  juice  0.4  0.5  0.2
2      orange    pie  0.2  1.0  0.2
3  strawberry  cream  0.4  0.5  0.2
4  strawberry  candy  0.4  0.5  0.2
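The `1 / count` step works for the sample data because no (first, second) pair repeats. If pairs can repeat, one possible generalisation is to count pairs explicitly with `groupby(...).transform("count")`. The sketch below is not part of the original answer: the data (with 'apple', 'pie' repeated) and the column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the ('apple', 'pie') pair occurs twice.
rows = np.array([['apple', 'pie'], ['apple', 'pie'],
                 ['apple', 'juice'], ['orange', 'pie']])
df = pd.DataFrame(rows, columns=["first", "second"])
n_rows = len(df)

# Per-row count of the first word and of the (first, second) pair.
first_count = df.groupby("first")["first"].transform("count")
pair_count = df.groupby(["first", "second"])["second"].transform("count")

df["p_first"] = first_count / n_rows                  # P(first)
df["p_second_given_first"] = pair_count / first_count  # P(second | first)
df["p_joint"] = df["p_first"] * df["p_second_given_first"]
print(df)
```

With this data the rows for ('apple', 'pie') get 0.75 * 2/3 = 0.5, whereas `1 / count` would give 1/3 for the conditional term.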
To get the result as a list of rows, use df.values.tolist().
If you do not want the intermediate columns, then:
df = pd.DataFrame(a)
first_count = df[0].map(df[0].value_counts())
df[2] = (first_count / df.shape[0]) * (1 / first_count)
print(df)
Output:
            0      1    2
0       apple    pie  0.2
1       apple  juice  0.2
2      orange    pie  0.2
3  strawberry  cream  0.2
4  strawberry  candy  0.2
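Since the question asks about numpy built-ins, here is a numpy-only sketch of the same computation. It is an alternative, not the answer's method: it relies on `np.unique` with `return_inverse` and `return_counts`, and encodes each (first, second) pair as a single integer so that pairs can be counted too. All variable names are illustrative.

```python
import numpy as np

a = np.array([['apple', 'pie'], ['apple', 'juice'], ['orange', 'pie'],
              ['strawberry', 'cream'], ['strawberry', 'candy']])
n_rows = len(a)

# Integer code per row and count per distinct first word.
_, first_codes, first_counts = np.unique(
    a[:, 0], return_inverse=True, return_counts=True)
# Integer code per row for the second word.
second_labels, second_codes = np.unique(a[:, 1], return_inverse=True)

# Encode each (first, second) pair as one integer, then count the pairs.
pair_ids = first_codes * len(second_labels) + second_codes
_, pair_codes, pair_counts = np.unique(
    pair_ids, return_inverse=True, return_counts=True)

p_first = first_counts[first_codes] / n_rows                            # P(first)
p_second_given_first = pair_counts[pair_codes] / first_counts[first_codes]
p_joint = p_first * p_second_given_first                                # per-row joint

# Probability of each second word, in the order of second_labels.
p_second = np.bincount(second_codes) / n_rows

print(np.c_[a, p_first, p_second_given_first, p_joint])
print(dict(zip(second_labels.tolist(), p_second.tolist())))
```

There are no Python-level loops here, so it should scale better than the nested loops in the question, though that has not been benchmarked.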
Combined probability per second word:
print(df.groupby(1)[2].sum())
candy 0.2
cream 0.2
juice 0.2
pie 0.4
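Another way to get the same numbers, offered as a sketch rather than as part of the original answer, is `pd.crosstab` with its `normalize` argument: `normalize='all'` gives the joint table and `normalize='index'` gives P(second | first). The column names `first` and `second` are assumptions made for readability.

```python
import numpy as np
import pandas as pd

a = np.array([['apple', 'pie'], ['apple', 'juice'], ['orange', 'pie'],
              ['strawberry', 'cream'], ['strawberry', 'candy']])
df = pd.DataFrame(a, columns=["first", "second"])

# Joint probability table: rows are first words, columns are second words.
joint = pd.crosstab(df["first"], df["second"], normalize="all")
# Conditional table P(second | first): each row sums to 1.
conditional = pd.crosstab(df["first"], df["second"], normalize="index")
# Probability of each second word: sum the joint table over the first words.
second_marginal = joint.sum(axis=0)

print(joint)
print(conditional)
print(second_marginal)
```

The table form also stays correct when a pair occurs more than once, because each pair is represented by a single cell.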