Merge commits without file-change data are not shown in the dataframe

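I extracted the git log with the command git log --all --numstat --pretty=format:'--%h--%ad--%aN--%s' > ../react_git.logs. I would also like to have a merge-commit column, so that I can analyze things like which author has the most merge commits. This is the code I use to build a dataframe from the generated log:
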
import os
import pandas as pd

COMMIT_LOG = os.path.join(os.path.abspath(''), 'react_git.logs')
# Read every log line into a single "raw" column; the separator is a control
# character that should never occur in the log.
raw_df = pd.read_csv(COMMIT_LOG, sep="\u0012", header=None, names=["raw"])

# Lines starting with "--" are the commit header lines.
commit_marker = raw_df[raw_df["raw"].str.startswith("--", na=False)]

commit_info = commit_marker['raw'].str.extract(r"^--(?P<sha>.*?)--(?P<timestamp>.*?)--(?P<author>.*?)--(?P<message>.*?)$", expand=True)
commit_info.insert(loc=2, column='date', value=pd.to_datetime(commit_info['timestamp'], utc=True))
# Age relative to now (negative for commits in the past).
commit_info['age'] = commit_info['date'] - pd.Timestamp.now(tz='UTC')

# Every other line is a numstat line: insertions, deletions, file path.
file_stats_marker = raw_df[~raw_df.index.isin(commit_info.index)]

file_stats = file_stats_marker['raw'].str.split("\t", expand=True)

file_stats = file_stats.rename(columns={0: "insertion", 1: "deletion", 2: "filepath"})
file_stats['insertion'] = pd.to_numeric(file_stats['insertion'], errors="coerce")
file_stats['deletion'] = pd.to_numeric(file_stats['deletion'], errors="coerce")
file_stats['churn'] = file_stats['insertion'] - file_stats['deletion']

# Copy each commit's info down onto the numstat lines that follow it.
commit_data = commit_info.reindex(raw_df.index).ffill()

# Drop the header rows themselves; commits without numstat lines disappear here.
commit_data = commit_data[~commit_data.index.isin(commit_info.index)]

df = commit_data.join(file_stats)
print(df.size)
print(df.head())

If I have a log like this:

--cae635054--Sat Jun 26 14:51:23 2021 -0400--Andrew Clark--`act`: Resolve to return value of scope function (#21759)
31      0       packages/react-reconciler/src/__tests__/ReactIsomorphicAct-test.js
1       1       packages/react-test-renderer/src/ReactTestRenderer.js
24      14      packages/react/src/ReactAct.js

--e2453e200--Fri Jun 25 15:39:46 2021 -0400--Andrew Clark--act: Add test for bypassing queueMicrotask (#21743)
50      0       packages/react-reconciler/src/__tests__/ReactIsomorphicAct-test.js

--8f03109cd--Wed Sep 11 09:51:32 2019 -0700--Brian Vaughn--Moved backend injection to the content script (#16752)
--efa780d0a--Wed Sep 11 09:51:24 2019 -0700--Brian Vaughn--Removed DT inject() script since it's no longer being used
0       24      packages/react-devtools-extensions/src/inject.js

--4290967d4--Wed Sep 11 09:34:31 2019 -0700--Brian Vaughn--Merge branch 'tt-compat' of https://github.com/onionymous/react into onionymous-tt-compat
--f09854a9e--Wed Sep 11 09:30:57 2019 -0700--Brian Vaughn--Moved inline comment.
3       5       packages/react-devtools-extensions/src/injectGlobalHook.js

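then 8f03109cd and 4290967d4 will not be in the dataframe, even though they matter for analyses such as counting merge commits, as I said above.

How can I get these commits into the dataframe with insertion: 0, deletion: 0 and filepath: 0, plus a column or some other way to distinguish them from the rest of the data, so that it is easy to tell that a row belongs to a merge commit?

I also have this on repl, together with the log file: https://replit.com/@milanregmi/metricsLogs#main.py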

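As I understand it, your question has two parts:

  1. How to get information about merge commits from the git log
  2. How to count the merges per user and write them into the dataframe

First, you can add the %b placeholder to the --pretty format of git log. This gives you the commit body, which for a merge like the one below has a line at the top saying that the commit is a merge. A format string like this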

git log --all --numstat --pretty=format:'--%h--%ad--%aN--%s--%b'

then produces something like this:

--364d23ac--Wed Jul 14 13:44:55 2021 +0200--Doe, John--Pull request #114: Bugfix/foo-branch--Merge in <repository> from bugfix/foo-branch to dev

* commit '64ee15b12345670d1ec214cb83468cf0a55a341':
  Bugfix: lorem ipsum
  Fix: dolor sit amet

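You can search for Merge in the commit_marker dataframe to identify merge commits. Second, to count the merges per author, build a reverse cumulative count over each author's merge commits.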
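The two steps can be sketched on a toy frame before applying them to the real log. This is only an illustration: the rows below are made up, and the column names `body`, `author` and `merges` simply mirror the ones used in this answer.

```python
import pandas as pd

# Toy stand-in for the parsed commits, in log order (values are made up).
commits = pd.DataFrame({
    "author": ["Cow, Jane", "Doe, John", "Cow, Jane", "Doe, John", "Cow, Jane"],
    "body": [
        "Merge in repo from dev to master",
        "",
        "Merge in repo from a to dev",
        "Merge in repo from b to dev",
        "",
    ],
})

# Step 1: flag merge commits by looking for "Merge" in the body.
is_merge = commits["body"].str.contains("Merge", na=False)

# Step 2: number each author's merges in reverse, so the first merge
# listed for an author carries that author's total.
commits.loc[is_merge, "merges"] = (
    commits[is_merge].groupby("author").cumcount(ascending=False) + 1
)

# Plain totals per author, for comparison.
merges_per_author = commits[is_merge].groupby("author").size()

print(commits)
print(merges_per_author)
```

If you only need the totals, the `groupby(...).size()` line on its own is enough; the reverse count is there to match the `merges` column described in this answer.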

Everything together:

import os
import pandas as pd

COMMIT_LOG = os.path.join(os.path.abspath(''), 'react_git.logs')
raw_df = pd.read_csv(COMMIT_LOG, sep="\u0012", header=None, names=["raw"])

commit_marker = raw_df[raw_df["raw"].str.startswith("--", na=False)]

# Add extraction for the new body part
commit_info = commit_marker['raw'].str.extract(
    r"^--(?P<sha>.*?)--(?P<timestamp>.*?)--(?P<author>.*?)--(?P<message>.*?)--(?P<body>.*?)$",
    expand=True)
commit_info.insert(loc=2, column='date', value=pd.to_datetime(commit_info['timestamp'], utc=True))
# Age relative to now (negative for commits in the past).
commit_info['age'] = commit_info['date'] - pd.Timestamp.now(tz='UTC')

file_stats_marker = raw_df[~raw_df.index.isin(commit_info.index)]

file_stats = file_stats_marker['raw'].str.split("\t", expand=True)

file_stats = file_stats.rename(columns={0: "insertion", 1: "deletion", 2: "filepath"})
file_stats['insertion'] = pd.to_numeric(file_stats['insertion'], errors="coerce")
file_stats['deletion'] = pd.to_numeric(file_stats['deletion'], errors="coerce")
file_stats['churn'] = file_stats['insertion'] - file_stats['deletion']

# Copy each commit's info down onto the lines that follow it.
commit_data = commit_info.reindex(raw_df.index).ffill()

commit_data = commit_data[~commit_data.index.isin(commit_info.index)]

df = commit_data.join(file_stats)

# Remove the extra, identical rows produced by multi-line commit bodies.
df = df.drop_duplicates()

# Count merges per author in reverse, so the first merge listed for an
# author carries that author's total.
is_merge = df['body'].str.contains('Merge', na=False)
df.loc[is_merge, 'merges'] = df[is_merge].groupby('author').cumcount(ascending=False) + 1

print(df.size)
print(df.head())

What you end up with is your previous dataframe plus a merges column that numbers each author's merge commits:

sha      timestamp                       date                      author       message                            body                                      age                            insertion   deletion    filepath           churn  merges
b3e5eb7  Fri Jul 16 11:36:43 2021 +0200  2021-07-16 09:36:43+00:00 Doe, John    test deploy                                                                  -4 days +00:51:00.878903000    31.0        3.0         file/path/file1.py 28.0 
4fc0c34  Thu Jul 15 11:12:10 2021 +0200  2021-07-15 09:12:10+00:00 Cow, Jane    Pull request #116: Dev             Merge in repo from dev to master          -5 days +00:26:27.878903000                                                      14.0
8188751  Thu Jul 15 07:42:40 2021 +0200  2021-07-15 05:42:40+00:00 Doe, John    Pull request #115: Feature/foo-bar Merge in repo from feature/foo-bar to dev -6 days +20:56:57.878903000                                                      7.0
6fa89c3  Wed Jul 14 16:02:38 2021 +0200  2021-07-14 14:02:38+00:00 Cow, Jane    Added: foo bar                                                               -6 days +05:16:55.878903000    4056.0      0.0         file/path/file2.py 4056.0
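
The question also asks for commits without any numstat lines to show up with insertion 0, deletion 0 and filepath 0. One possible way, sketched here on an excerpt of the log from the question, is to give every line the running number of its commit and then left-join the file statistics onto the commit info, so that no commit gets lost. The names `commit_id` and `no_file_changes` are made up for this sketch, and "no file rows" is only a proxy for "merge commit": an empty commit would be flagged too.

```python
import pandas as pd

# Excerpt of the log from the question; numstat columns are tab-separated.
LOG = """\
--8f03109cd--Wed Sep 11 09:51:32 2019 -0700--Brian Vaughn--Moved backend injection to the content script (#16752)
--efa780d0a--Wed Sep 11 09:51:24 2019 -0700--Brian Vaughn--Removed DT inject() script since it's no longer being used
0\t24\tpackages/react-devtools-extensions/src/inject.js

--4290967d4--Wed Sep 11 09:34:31 2019 -0700--Brian Vaughn--Merge branch 'tt-compat' of https://github.com/onionymous/react into onionymous-tt-compat
--f09854a9e--Wed Sep 11 09:30:57 2019 -0700--Brian Vaughn--Moved inline comment.
3\t5\tpackages/react-devtools-extensions/src/injectGlobalHook.js
"""

# One entry per non-empty log line.
raw = pd.Series([line for line in LOG.splitlines() if line.strip()], name="raw")
is_marker = raw.str.startswith("--")

# Hypothetical helper column: running number of the commit a line belongs to.
commit_id = is_marker.cumsum()

commit_info = raw[is_marker].str.extract(
    r"^--(?P<sha>.*?)--(?P<timestamp>.*?)--(?P<author>.*?)--(?P<message>.*)$"
)
commit_info["commit_id"] = commit_id[is_marker]

file_stats = raw[~is_marker].str.split("\t", expand=True)
file_stats = file_stats.rename(columns={0: "insertion", 1: "deletion", 2: "filepath"})
file_stats["commit_id"] = commit_id[~is_marker]

# The left join keeps commits that have no file rows at all.
df = commit_info.merge(file_stats, on="commit_id", how="left")

# Hypothetical flag column: True where the commit had no numstat lines.
df["no_file_changes"] = df["filepath"].isna()

# Fill the gaps with zeros, as requested in the question.
df["insertion"] = pd.to_numeric(df["insertion"], errors="coerce").fillna(0).astype(int)
df["deletion"] = pd.to_numeric(df["deletion"], errors="coerce").fillna(0).astype(int)
df["filepath"] = df["filepath"].fillna("0")

print(df[["sha", "insertion", "deletion", "filepath", "no_file_changes"]])
```

If you can re-run git log, adding %p (abbreviated parent hashes) to the format string should give a stricter test than the flag above, because a merge commit has more than one parent.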