How to extract the total number of commit pages for a GitHub repository
I am setting up a script to export all commits and pull requests for a larger list of GitHub repositories (about 4000).
After getting the basic idea of the script working, I need a way to loop through all the commit pages of each repository.
I found that I can export 100 commits per page. Some repos have more commits (say 8000), so I need to loop through 80 pages.
I could not find a way to extract the number of pages from the GitHub API.
What I have done so far is set up the script so that it loops through all commits and exports them to a txt/csv file.
What I still need is to know the total number of pages before I start looping over a repo's commits.
The page count is provided here, but in a form I cannot use:
curl -u "user:password" -I https://api.github.com/repos/0chain/rocksdb/commits?per_page=100
Result:
Link: https://api.github.com/repositories/152923130/commits?per_page=100&page=2; rel="next", https://api.github.com/repositories/152923130/commits?per_page=100&page=75; rel="last"
I need to use the value 75 (or whatever it is for other repos) as a variable in a loop.
Something like this:
repolist=$(cat repolist.txt)
repolistarray=($repolist)
repolength=${#repolistarray[@]}
for (( i = 0; i < repolength; i++ )); do
    # here I need to extract the page number
    pagenumber=$(curl -u "user:password" -s -I "https://api.github.com/repos/${repolistarray[i]}/commits?per_page=100")
    for (( n = 1; n <= pagenumber; n++ )); do
        curl -u "user:password" -s "https://api.github.com/repos/${repolistarray[i]}/commits?per_page=100&page=$n" > committest.txt
    done
done
How do I get the "75" (or whatever the value is for another repo) out of:
Link: https://api.github.com/repositories/152923130/commits?per_page=100&page=2; rel="next", https://api.github.com/repositories/152923130/commits?per_page=100&page=75; rel="last"
to use as "n"?
Here is something along the lines of what @Poshi commented: keep requesting the next page indefinitely until you hit an empty page, then break out of the inner loop and move on to the next repo.
# this is the contents of a page past the last real page:
emptypage='[
]'
# here's a simpler way to iterate over each repo than using a bash array
cat repolist.txt | while read -d' ' repo; do
    # loop indefinitely
    page=0
    while true; do
        page=$((page + 1))
        # minor improvement: use a variable, not a file.
        # also, you don't need to echo variables, just use them
        result=$(curl -u "user:password" -s \
            "https://api.github.com/repos/$repo/commits?per_page=100&page=$page")
        # if the result is empty, break out of the inner loop
        [ "$result" = "$emptypage" ] && break
        echo "$result" > committest.txt
        # note that > overwrites (whereas >> appends),
        # so committest.txt will be overwritten with each new page.
        #
        # in the final version, you probably want to process the results here,
        # and then
        #
        #     echo "$processed_results"
        #   done > repo1.txt
        # done
        #
        # to output once per repo, or
        #
        #     echo "$processed_results"
        #   done
        # done > all_results.txt
        #
        # to output all results to a single file
    done
done
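The empty-page stop condition above can be exercised without hitting the API at all. Here is a minimal sketch where `fake_fetch` is a hypothetical stand-in for the curl call: it serves two pages of data, and anything beyond that returns the empty-array body:

```shell
# Stand-in for the curl call: pages 1 and 2 return data,
# anything past that returns the empty-page body "[\n]".
fake_fetch() {
  if [ "$1" -le 2 ]; then
    echo "[{\"sha\": \"dummy$1\"}]"
  else
    printf '[\n]'
  fi
}

emptypage='[
]'

page=0
while true; do
  page=$((page + 1))
  result=$(fake_fetch "$page")
  # same comparison as above: an empty page ends the loop
  [ "$result" = "$emptypage" ] && break
  echo "got page $page"
done
```

The loop prints "got page 1" and "got page 2", then stops on the third, empty page; swapping `fake_fetch` back for the real curl call gives the behavior described above.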
Well, the approach you are asking for is not the most common one; this is usually done by fetching pages until no more data is available. But to answer your specific question, we have to parse the line containing the information. A quick and dirty way could be:
response="Link: https://api.github.com/repositories/152923130/commits?per_page=100&page=2; rel=\"next\", https://api.github.com/repositories/152923130/commits?per_page=100&page=75; rel=\"last\""
<<< "$response" cut -f2- -d: | # First, get the contents of "Link": everything after the first colon
tr "," $'\n' | # Separate the different parts in different lines
grep 'rel="last"' | # Select the line with last page information
cut -f1 -d';' | # Keep only the URL
tr "?&" $'\n' | # Split URL and its parameters, one per line
grep -e "^page" | # Select the "page" parameter
cut -f2 -d= # Finally, extract the number we are interested in
There are other ways to do it with fewer commands, maybe simpler ones, but this one lets me explain it step by step. One of those other ways could be:
<<< "$response" sed 's/.*&page=\(.*\); rel="last".*/\1/'
This one makes some assumptions, like `page` always being the last parameter.
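For completeness, a slightly more defensive sketch against the same sample `$response`: it isolates the `rel="last"` entry first and then pulls out the `page` parameter, so it no longer depends on `page` being the last parameter in the URL:

```shell
response='Link: https://api.github.com/repositories/152923130/commits?per_page=100&page=2; rel="next", https://api.github.com/repositories/152923130/commits?per_page=100&page=75; rel="last"'

# Split the header on commas, keep the rel="last" entry,
# then grab the value of the page parameter wherever it sits
# in the query string.
last_page=$(tr ',' '\n' <<< "$response" |
  grep 'rel="last"' |
  grep -o '[?&]page=[0-9]*' |
  grep -o '[0-9]*$')

echo "$last_page"
```

This prints `75` for the sample header, and `$last_page` can then be used directly as the upper bound of the inner loop.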