Sort dataframes generated by hypothesis when row tuples have different dtypes

Question

I want to create dataframes where End is greater than Start.

I am using:

from hypothesis.extra.pandas import columns, data_frames, column
import hypothesis.strategies as st

positions = st.integers(min_value=0, max_value=int(1e7))
strands = st.sampled_from("+ -".split())
data_frames(columns=columns(["Start", "End"], dtype=int),
            rows=st.tuples(positions, positions).map(sorted)).example()

This gives

     Start      End
0   589492  6620613
1  5990807  8083222
2   252458  8368032
3  1575938  5763895
4  4689113  9133040
5  7439297  8646668
6   838051  1886133

However, I want to add a third column, Strand, to the data, generated by the strands strategy above. Then this stops working:

data_frames(columns=columns(["Start", "End", "Strands"], dtype=int),
            rows=st.tuples(positions, positions, strands).map(sorted)).example()

It raises

TypeError: '<' not supported between instances of 'str' and 'int'

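This is because sorted is applied to a tuple that mixes ints and strs. How can I fix this?

I could ask hypothesis to generate a dataframe with pos, pos, strand_int, where strand_int is 0 or 1, and convert it to "-" or "+" in the test, but that feels icky.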

Answer 1

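Best approach

import hypothesis.strategies as st
from hypothesis import given
from hypothesis.extra.pandas import column, data_frames, range_indexes

# These names were not defined in the original snippet; the values
# below are assumptions chosen only to make the example runnable.
better_df_minsize = 1
chromosomes_small = st.sampled_from(["chr1", "chr2"])
small_lengths = st.integers(min_value=1, max_value=100)
strands = st.sampled_from("+ -".split())

better_dfs_min = data_frames(index=range_indexes(min_size=better_df_minsize),
                             columns=[column("Chromosome", chromosomes_small),
                                      column("Start", elements=small_lengths),
                                      column("End", elements=small_lengths),
                                      column("Strand", strands)])


@st.composite
def dfs_min(draw):
    df = draw(better_dfs_min)
    # End is drawn as a length; adding Start turns it into an end position.
    df.loc[:, "End"] += df.Start
    return df

@given(df=dfs_min())
def test_me(df):
    print(df)
    assert 0  # fail on purpose so the printed dataframe is shown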

First attempt:

from hypothesis.extra.pandas import columns, data_frames, column
import hypothesis.strategies as st

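def mysort(tp):
    # Sort on stand-in keys: -1 pins the chromosome first, int(1e10) pins the
    # strand last, and the two positions in between are ordered by value.
    key = [-1, tp[1], tp[2], int(1e10)]
    return [x for _, x in sorted(zip(key, tp))]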

positions = st.integers(min_value=0, max_value=int(1e7))
strands = st.sampled_from("+ -".split())
chromosomes = st.sampled_from(elements=["chr{}".format(str(e)) for e in list(range(23)) + "X Y M".split()])

data_frames(columns=columns(["Chromosome", "Start", "End", "Strand"], dtype=int), rows=st.tuples(chromosomes, positions, positions, strands).map(mysort)).example()

Result:

  Chromosome    Start      End Strand
0      chr13  5660600  6171569      -
1       chrY  3987154  5435816      +
2      chr11  4659655  4956997      +
3      chr14   239357  8566407      +
4       chr3  3200488  9337489      +
5       chr8   304886  1078020      +

There must be a better way than implementing your own sort... My sort relies on the integers in Start and End lying between 0 and int(1e10) - 1, which feels icky.
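One way to drop the sentinel keys, sketched here on the assumption that every row arrives as (chromosome, position, position, strand): sort only the two integers and rebuild the tuple around them. The helper name order_positions is hypothetical and not part of Hypothesis.

```python
def order_positions(row):
    """Return the row with its two positions ordered; other fields are untouched."""
    chromosome, first, second, strand = row
    # Only the integers are compared, so no str/int comparison can occur.
    start, end = sorted((first, second))
    return (chromosome, start, end, strand)


print(order_positions(("chr1", 500, 20, "-")))  # ('chr1', 20, 500, '-')
```

It would take the place of mysort in the rows strategy, i.e. rows=st.tuples(chromosomes, positions, positions, strands).map(order_positions). As with plain sorted, two equal draws still give Start == End rather than End > Start.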

Answer 2

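Cheat!

Make the first line of your test df.End += df.Start, and End will always be greater than Start (assuming positive integers). If you have more specific size constraints, have Hypothesis generate End as the desired difference and then use this trick.

You could also write a custom strategy with the @st.composite decorator to do this inline. IMO that is only worth it if you use it for multiple tests, but that is a matter of style rather than substance.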
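A minimal sketch of the trick, assuming interval lengths between 1 and 1000 (that bound, the strategy names and the test name are illustrative and not from the question). The End column is generated as a length and converted to an end position on the first line of the test:

```python
import hypothesis.strategies as st
from hypothesis import given, settings
from hypothesis.extra.pandas import column, data_frames

starts = st.integers(min_value=0, max_value=int(1e7))
# Assumed constraint: every interval is between 1 and 1000 long.
lengths = st.integers(min_value=1, max_value=1000)
strands = st.sampled_from(["+", "-"])

intervals = data_frames(columns=[
    column("Start", elements=starts),
    column("End", elements=lengths),  # holds the length until the test runs
    column("Strand", elements=strands),
])


@settings(max_examples=25)
@given(df=intervals)
def test_end_after_start(df):
    df.End += df.Start  # first line of the test: length becomes end position
    assert (df.End > df.Start).all()
    assert ((df.End - df.Start) <= 1000).all()


# Calling the decorated function directly runs the property without pytest.
test_end_after_start()
```

Because the minimum length is 1, End is strictly greater than Start; a minimum of 0 would allow End == Start.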
