Stream Process Scraped Audio in GCP

Question

I want to scrape multiple audio channels from a website. I want to do the following simultaneously and in real time:

1. Save the audio to GCP Storage. 
2. Apply speech-to-text ML and send transcripts to an app.

For this post I want to focus on (1). What is the best way to do this in GCP: is it Pub/Sub? If not, what is the best way to architect it?

I have a working Python script.

Set up the recording function.

import os
import re
import time
import urllib.request


def record(url):
  # Open the stream. urlopen needs a full URL, including the scheme (http:// or https://).
  response = urllib.request.urlopen(url)
  block_size = 1024

  # Make a folder named after the station.
  # Example: 'www.music.com/station_1' gets the folder '/station_1/'.
  channel = re.search(r'[^/]+$', url)[0]
  folder = '/' + channel + '/'
  os.makedirs(folder, exist_ok=True)

  # Run until the stream ends.
  while True:
    # Name each recording after the current date and time.
    filename = folder + time.strftime("%m-%d-%Y--%H-%M-%S") + '.mp3'
    start = time.time()
    with open(filename, 'wb') as f:
      # Start a new file every 60 seconds.
      while time.time() - start < 60:
        buffer = response.read(block_size)
        if not buffer:
          # An empty read means the server closed the stream.
          return
        f.write(buffer)
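
Since goal (1) is getting the audio into Cloud Storage, one straightforward option is to upload each 60-second segment once its file has been closed. The following is a minimal sketch under assumptions, not a definitive design: `object_name_for`, `make_gcs_uploader`, and `ship_segment` are hypothetical helper names, the bucket name is a placeholder, and the `google-cloud-storage` client library is assumed to be installed and authenticated. The uploader is passed in as a plain callable so the rest of the logic can be exercised without GCP access.

```python
import os
from typing import Callable


def object_name_for(local_path: str) -> str:
    """Build an object name of the form '<channel>/<file name>' from a local path."""
    channel = os.path.basename(os.path.dirname(local_path))
    return channel + "/" + os.path.basename(local_path)


def make_gcs_uploader(bucket_name: str) -> Callable[[str, str], None]:
    """Return upload(local_path, object_name) backed by google-cloud-storage."""
    # Imported lazily so the rest of this sketch runs without the package.
    from google.cloud import storage

    bucket = storage.Client().bucket(bucket_name)

    def upload(local_path: str, object_name: str) -> None:
        bucket.blob(object_name).upload_from_filename(local_path)

    return upload


def ship_segment(local_path: str,
                 upload: Callable[[str, str], None],
                 delete_after: bool = True) -> str:
    """Upload one finished segment and optionally remove the local copy."""
    name = object_name_for(local_path)
    upload(local_path, name)
    if delete_after:
        os.remove(local_path)
    return name
```

A possible usage is to create the uploader once per process with `upload = make_gcs_uploader("my-audio-bucket")` (placeholder bucket name) and call `ship_segment(filename, upload)` each time a segment file is complete. Uploading inline pauses reading from the stream for the duration of the upload, so a separate worker may be preferable if uploads are slow.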

Declare the URLs to record.

urls = ['www.music.com/station_1',...,'www.music.com/station_n']

Threads to record from multiple URLs at once.

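# Assumed import: the original does not show where Pool comes from.
from multiprocessing.pool import ThreadPool as Pool

p = Pool(len(urls))
# map blocks until every record call returns, so the cleanup below runs only after that.
p.map(record, urls)
p.terminate()
p.join()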
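
Because `p.map` blocks for as long as the `record` calls are running, the `terminate` and `join` lines give no way to stop recording on demand. A minimal sketch of one alternative, assuming a hypothetical variant of `record` that accepts a `threading.Event` as a second argument and checks it between reads; `run_recorders` and `stop_recorders` are illustrative names, not part of the original script.

```python
import threading
from typing import Callable, Iterable, List, Tuple


def run_recorders(
    urls: Iterable[str],
    record_fn: Callable[[str, threading.Event], None],
) -> Tuple[threading.Event, List[threading.Thread]]:
    """Start one daemon thread per URL and return the shared stop flag and the threads."""
    stop = threading.Event()
    threads = [
        threading.Thread(target=record_fn, args=(url, stop), daemon=True)
        for url in urls
    ]
    for t in threads:
        t.start()
    return stop, threads


def stop_recorders(stop: threading.Event,
                   threads: List[threading.Thread],
                   timeout: float = 5.0) -> None:
    """Ask every recorder to finish, then wait up to `timeout` seconds for each."""
    stop.set()
    for t in threads:
        t.join(timeout)
```

Inside the recording loop, the hypothetical variant would replace `while True:` with `while not stop.is_set():`. Threads are a reasonable fit here because the work is dominated by network and disk I/O.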

Answer 1

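Beam is not a good fit for this use case.

Explanation:

Suppose the channel names are the elements.

Your example requires processing a single element indefinitely, which Beam does not do well.

Even if we define each element as (channel name, timestamp), that does not solve the problem, because we cannot pull an arbitrary time window of data from a station.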
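
One way to read this last point: a live stream can only be consumed forward from the moment of connection, so a (channel name, timestamp) element has data only if some long-lived reader was attached while that window aired. The sketch below illustrates what producing such elements would involve. It is an assumption-laden illustration, not Beam code: `timestamped_chunks` and its parameters are hypothetical, and `clock` is injectable only so the behavior can be checked without waiting in real time.

```python
import time
from typing import BinaryIO, Callable, Iterator, Tuple


def timestamped_chunks(
    channel: str,
    stream: BinaryIO,
    window_seconds: float = 60.0,
    block_size: int = 1024,
    clock: Callable[[], float] = time.time,
) -> Iterator[Tuple[str, float, bytes]]:
    """Yield (channel, window_start, audio_bytes) for consecutive windows of a live stream."""
    while True:
        window_start = clock()
        parts = []
        ended = False
        # Read forward until the window elapses; there is no way to seek backwards.
        while clock() - window_start < window_seconds:
            block = stream.read(block_size)
            if not block:
                ended = True
                break
            parts.append(block)
        if parts:
            yield channel, window_start, b"".join(parts)
        if ended:
            return
```

The generator has to stay attached to the stream for its whole lifetime and can only emit windows it was connected for, which is the long-running, per-channel work the answer describes as a poor match for Beam.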

google-cloud-storage

google-cloud-platform

google-cloud-pubsub

google-cloud-dataflow

apache-beam