How to make a data flow from a CSV in Google Drive into a tf.data.Dataset - in Colab

Following the instructions in Colab I can get a buffer and even a pd.DataFrame from it (the file is just an example)...

# ... authentication

file_id = '1S1w0Z7g3bI1PGLPR49PW5VBRo7c_KYgU' # titanic

# loading data
import io
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

drive_service = build('drive', 'v3')      # , credentials=creds

request = drive_service.files().get_media(fileId=file_id)
buf = io.BytesIO()
downloader = MediaIoBaseDownload(buf, request)

# actually pull the file down chunk by chunk
done = False
while not done:
    _, done = downloader.next_chunk()

buf.seek(0)

import pandas as pd
df = pd.read_csv(buf)
print(df.head())

But I have a problem building the data flow into a Dataset correctly - the "buf" var does not work in =>

dataset = tf.data.experimental.make_csv_dataset(csv_file_path, batch_size=100, num_epochs=1)

which takes only a "csv_file_path" as its first argument. Is it possible in Colab to get IO from a csv file on my Google Drive into a Dataset (for further use in training)? And how to do it in a memory-saving way?..
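
For example, one workaround sketch (not from the original snippet - it assumes the buf filled above, and 'titanic_local.csv' is an arbitrary name): since make_csv_dataset wants a file path, the downloaded buffer can be written to Colab's local disk first:

# a minimal sketch, assuming 'buf' is the io.BytesIO filled above;
# 'titanic_local.csv' is an arbitrary local file name
import tensorflow as tf

with open('titanic_local.csv', 'wb') as f:
    f.write(buf.getvalue())

dataset = tf.data.experimental.make_csv_dataset('titanic_local.csv',
                                                batch_size=100, num_epochs=1)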

P.S. I know that I could perhaps open the file to everyone (in Google Drive) and get a url to use the simple approach:

#TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
# NB: a '.../view?usp=sharing' link serves an HTML page, not the raw csv;
# the direct-download form "https://drive.google.com/uc?export=download&id=<FILE_ID>"
# is what tf.keras.utils.get_file would actually need here
TRAIN_DATA_URL = "https://drive.google.com/file/d/1S1w0Z7g3bI1PGLPR49PW5VBRo7c_KYgU/view?usp=sharing"
train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
dataset = tf.data.experimental.make_csv_dataset(train_file_path, batch_size=100, num_epochs=1)

But I don't want to share the real file... How can I keep the file confidential (in Google Drive) and still get IO from it into a tf.data.Dataset in Colab? (Preferably with the shortest possible code - the real project tested in Colab will have much more of it.)

drive.CreateFile helped (link) - as far as I understand, Colab works in a separate environment (separate from my PC & I'net env)... So I tried (following the link):

!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# https://drive.google.com/file/d/1S1w0Z7g3bI1PGLPR49PW5VBRo7c_KYgU/view?usp=sharing
link = 'https://drive.google.com/open?id=1S1w0Z7g3bI1PGLPR49PW5VBRo7c_KYgU'

fluff, id = link.split('=')
print(id)  # verify that you have everything after '='

downloaded = drive.CreateFile({'id': id})
downloaded.GetContentFile('Filename.csv')  # save to the local Colab filesystem

import tensorflow as tf
ds = tf.data.experimental.make_csv_dataset('Filename.csv', batch_size=100, num_epochs=1) 

iterator = ds.as_numpy_iterator()
print(next(iterator))

It worked for me. Thanks for your interest in the topic (if anybody tried it).

Even simpler:

import tensorflow as tf

# Load the Drive helper and mount
from google.colab import drive
drive.mount('/content/drive')

_types = [float(), float(), float(), float(), str()]   # record defaults: 4 floats + label string
_lines = tf.data.TextLineDataset('/content/drive/My Drive/iris.csv')
ds = _lines.skip(1).map(lambda x: tf.io.decode_csv(x, record_defaults=_types))   # skip(1) drops the header row
ds0 = ds.take(2)
print(*ds0.as_numpy_iterator(), sep='\n')   # print the rows one per line
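
To actually train on this, the decoded columns usually have to be split into a feature tensor and a label; a minimal sketch (my addition, assuming the iris column order above - 4 numeric columns, then 'variety'):

# pack the 4 numeric columns into one feature vector, keep the last as label
def pack(*columns):
    return tf.stack(columns[:4]), columns[4]

train = ds.map(pack).batch(4)
print(next(iter(train.as_numpy_iterator())))   # -> (batch of (4,) feature vectors, batch of labels)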

OR from a df (with batching to save memory):

import numpy as np
import pandas as pd
import tensorflow as tf

# Load the Drive helper and mount (flush first in case it is already mounted)
from google.colab import drive
drive.flush_and_unmount()
drive.mount('/content/drive')

df = pd.read_csv('/content/drive/My Drive/iris.csv', dtype='float32',
                 converters={'variety': str}, nrows=20, decimal='.')
ds = tf.data.Dataset.from_tensor_slices(dict(df))   # dict(df) keeps the mixed column types
ds = ds.shuffle(20, reshuffle_each_iteration=False)   # for the train ds ONLY!
ds = ds.batch(batch_size=4)
ds = ds.prefetch(4)

# labels
label = ds.map(lambda x: x['variety'])

print(list(label.as_numpy_iterator()))

# features
#features = ds.map(lambda x: (x['sepal.length'], x['sepal.width']))
# Or with dynamic keys (every column except the label):
features = ds.map(lambda x: list(map(x.get, np.setdiff1d(list(x.keys()), ['variety']))))

print(list(features.as_numpy_iterator()))

Any transformations in map()...
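
For model.fit() the features and labels are usually combined into (features, label) pairs inside one dataset; a minimal sketch (my addition, assuming the same iris df as above), stacking the numeric columns before batching:

# build (features, label) pairs directly - the element structure Keras expects
feature_keys = [k for k in df.columns if k != 'variety']

pairs = tf.data.Dataset.from_tensor_slices(dict(df))
pairs = pairs.map(lambda x: (tf.stack([x[k] for k in feature_keys]), x['variety']))
pairs = pairs.batch(4).prefetch(4)

print(next(iter(pairs.as_numpy_iterator())))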