How to generate a random sequence given a probability matrix of transitions?

The script below builds the transition probability matrix for a given list:

transitions = ['A', 'B', 'B', 'C', 'B', 'A', 'D', 'D', 'A', 'B', 'A', 'D']

def rank(c):
   return ord(c) - ord('A')

T = [rank(c) for c in transitions]

#create matrix of zeros

M = [[0]*4 for _ in range(4)]

for (i,j) in zip(T,T[1:]):
   M[i][j] += 1

#now convert to probabilities:
for row in M:
   n = sum(row)
   if n > 0:
       # reuse the row total n instead of recomputing sum(row) for every entry
       row[:] = [f / n for f in row]

#print M:
for row in M:
   print(row)

Output:

[0.0, 0.5, 0.0, 0.5]
[0.5, 0.25, 0.25, 0.0]
[0.0, 1.0, 0.0, 0.0]
[0.5, 0.0, 0.0, 0.5]

I now want to do the opposite: generate a new list of A B C D transitions that follows the probability matrix.
How can I do this?

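The choices function from the random library may help. Since the question does not say how to choose the first letter, it is chosen here with the same probabilities as the letters in the original list.

Since Python 3.6, random.choices accepts a weights argument. The weights do not strictly need to be normalized.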

import random

letter = random.choice(transitions)  # take a starting letter with the same weights as the original list
new_list = [letter]
for _ in range(len(transitions) - 1):
    letter = chr(random.choices(range(4), weights=M[rank(letter)])[0] + ord('A'))
    new_list.append(letter)
print(new_list)
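
One caveat with the loop above: if some letter is never followed by anything in the original list, its row stays all zeros, and random.choices rejects a weight list that sums to zero (a ValueError in recent Python versions). Here is a minimal sketch of one way to handle that, by simply stopping the walk at such a dead end. The names generate_sequence and demo_matrix are made up for this illustration, and stopping early is only one possible policy (restarting from a random letter would be another).

```python
import random

def generate_sequence(matrix, labels, start, length, rng=random):
    """Walk the chain from `start`, producing at most `length` labels."""
    index = {label: k for k, label in enumerate(labels)}
    sequence = [start]
    while len(sequence) < length:
        weights = matrix[index[sequence[-1]]]
        if sum(weights) == 0:
            # Dead end: no outgoing transitions were observed for this state.
            break
        sequence.append(rng.choices(labels, weights=weights)[0])
    return sequence

# Hypothetical matrix (not from the question) in which 'C' is a dead end.
demo_matrix = [
    [0.0, 1.0, 0.0],  # A always goes to B
    [0.0, 0.0, 1.0],  # B always goes to C
    [0.0, 0.0, 0.0],  # C has no outgoing transitions
]
demo_sequence = generate_sequence(demo_matrix, ['A', 'B', 'C'], 'A', 10)
print(demo_sequence)  # ['A', 'B', 'C']
```

With the matrix from the question every row has a positive total, so the guard never triggers there.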

The complete code can be generalized somewhat to work with any kind of node, not just consecutive letters:

from collections import defaultdict
import random

transitions = ['A', 'B', 'B', 'C', 'B', 'A', 'D', 'D', 'A', 'B', 'A', 'D']

nodes = sorted(set(transitions))  # a sorted list of all nodes used
M = defaultdict(int)  # dictionary counting the occurrences of each transition (i, j)

for (i, j) in zip(transitions, transitions[1:]):
    M[(i, j)] += 1

# dictionary mapping each node to a list of frequencies for the transition to each next node
T = {i: [M[(i, j)] for j in nodes] for i in nodes}

# node = random.choice(transitions) # choose the first node randomly with the same probability as in the original list
node = random.choice(nodes) # choose the first node randomly, each node with equal probability
new_list = [node]
for _ in range(9):
    node = random.choices(nodes, T[node])[0]
    new_list.append(node)

print(new_list)

Example output: ['D', 'A', 'D', 'A', 'D', 'D', 'A', 'D', 'A', 'B']
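
As a rough sanity check on this approach, one can generate a much longer sequence from the counted transitions and re-estimate the probabilities from it; they should land close to the originals. The sketch below does that. The helper names transition_counts and row_probabilities, the seed 0 and the length 20000 are arbitrary choices for this illustration, not part of the answer above.

```python
from collections import Counter
import random

transitions = ['A', 'B', 'B', 'C', 'B', 'A', 'D', 'D', 'A', 'B', 'A', 'D']
nodes = sorted(set(transitions))

def transition_counts(seq):
    """Count how often each (current, next) pair occurs in seq."""
    return Counter(zip(seq, seq[1:]))

def row_probabilities(counts, node):
    """Normalize the counts of the transitions leaving `node`."""
    total = sum(counts[(node, j)] for j in nodes)
    return [counts[(node, j)] / total if total else 0.0 for j in nodes]

original = transition_counts(transitions)

rng = random.Random(0)  # fixed seed so the run is repeatable
node = rng.choice(nodes)
long_seq = [node]
for _ in range(20000):
    # raw counts work as weights; random.choices does not need them normalized
    node = rng.choices(nodes, weights=[original[(node, j)] for j in nodes])[0]
    long_seq.append(node)

resampled = transition_counts(long_seq)
for n in nodes:
    estimated = [round(p, 2) for p in row_probabilities(resampled, n)]
    print(n, row_probabilities(original, n), estimated)
```

Transitions that never occur in the original list keep a weight of zero, so they cannot appear in the generated sequence either.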

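It looks to me like you are trying to build a Markov model. As a bioinformatics student I happen to have some experience with (hidden) Markov models, so I would use nested dictionaries to make the matrix easier to work with. Note that I have imported the numpy.random module.

Hope this helps!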

import numpy.random as rnd

alphabet = ['A', 'B', 'C', 'D']
transitions = ['A', 'B', 'B', 'C', 'B', 'A', 'D', 'D', 'A', 'B', 'A', 'D']

# Create probability matrix filled with zeroes
# Matrix consists of nested dictionaries
prob_matrix = {}
for i in alphabet:
    prob_matrix[i] = {}
    for j in alphabet:
        prob_matrix[i][j] = 0.0

def rank(c):
   return ord(c) - ord('A')

# fill matrix with numbers based on transitions list
T = [rank(c) for c in transitions]
for (i,j) in zip(T,T[1:]):
    prob_matrix[alphabet[i]][alphabet[j]] += 1

# convert to probabilities
for row in prob_matrix:
   total = sum([prob_matrix[row][column] for column in prob_matrix[row]])
   if total > 0:
       for column in prob_matrix[row]:
           prob_matrix[row][column] /= total

# generate first random sequence letter (uniformly over the alphabet)
outputseq = str(rnd.choice(alphabet))

# generate rest of string based on probability matrix
for i in range(11):
    probabilities = [prob_matrix[outputseq[-1]][j] for j in alphabet]
    # p= gives the probability of each letter; a single draw needs no size or replace argument
    outputseq += str(rnd.choice(alphabet, p=probabilities))

# output generated sequence
print(outputseq)
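
If NumPy is not available, the same nested-dictionary idea should also work with the standard library alone. The following is only a sketch under that assumption: it swaps numpy.random.choice for random.choices and introduces an intermediate counts dictionary that is not in the code above.

```python
import random

alphabet = ['A', 'B', 'C', 'D']
transitions = ['A', 'B', 'B', 'C', 'B', 'A', 'D', 'D', 'A', 'B', 'A', 'D']

# nested dictionary of transition counts: counts[current][next]
counts = {i: {j: 0 for j in alphabet} for i in alphabet}
for i, j in zip(transitions, transitions[1:]):
    counts[i][j] += 1

# convert each row of counts to probabilities (rows without data stay at 0.0)
prob_matrix = {}
for i, row in counts.items():
    total = sum(row.values())
    prob_matrix[i] = {j: (c / total if total else 0.0) for j, c in row.items()}

# first letter uniformly at random, then follow the matrix row by row
outputseq = random.choice(alphabet)
for _ in range(11):
    row = prob_matrix[outputseq[-1]]
    outputseq += random.choices(list(row), weights=list(row.values()))[0]

print(outputseq)
```

The result is a 12-character string, the same length as the original transitions list.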