Creating a pandas DataFrame based on what I have done with PySpark
I want to perform the equivalent of the following in pandas, without using Spark.
This is how I generate some random data in Spark using the class UsedFunctions (not the point of the question).
import math
import random
import string

class UsedFunctions:
    def randomString(self, length):
        # random string of ASCII letters of the given length
        letters = string.ascii_letters
        result_str = ''.join(random.choice(letters) for i in range(length))
        return result_str
    def clustered(self, x, numRows):
        return math.floor(x - 1) / numRows
    def scattered(self, x, numRows):
        return abs((x - 1) % numRows) * 1.0
    def randomised(self, seed, numRows):
        random.seed(seed)
        return abs(random.randint(0, numRows) % numRows) * 1.0
    def padString(self, x, chars, length):
        # left-pad str(x) with random picks from chars up to the given length
        n = int(math.log10(x) + 1)
        result_str = ''.join(random.choice(chars) for i in range(length - n)) + str(x)
        return result_str
    def padSingleChar(self, chars, length):
        result_str = ''.join(chars for i in range(length))
        return result_str
    def println(self, lst):
        for ll in lst:
            print(ll[0])
from pyspark.sql import SparkSession
from pyspark import SparkContext

usedFunctions = UsedFunctions()

spark = SparkSession.builder \
    .enableHiveSupport() \
    .getOrCreate()
sc = SparkContext.getOrCreate()

numRows = 10
start = 1
end = start + 9
print("starting at ID = ", start, ", ending on = ", end)
Range = range(start, end)
rdd = sc.parallelize(Range). \
    map(lambda x: (x, usedFunctions.clustered(x, numRows), \
                   usedFunctions.scattered(x, numRows), \
                   usedFunctions.randomised(x, numRows), \
                   usedFunctions.randomString(50), \
                   usedFunctions.padString(x, " ", 50), \
                   usedFunctions.padSingleChar("x", 4000)))
df = rdd.toDF()
OK, so how do I create the pandas DataFrame df without using Spark?
I know the following conversion from a Spark DataFrame to pandas would work, but Spark cannot be used here.
p_dfm = df.toPandas()  # converting spark DF to Pandas DF
Thanks
I tried to keep most of your code and syntax from Spark.
# your class and functions on top as is ...
import pandas as pd

usedFunctions = UsedFunctions()

numRows = 10
start = 1
end = start + 9
print("starting at ID = ", start, ", ending on = ", end)
Range = range(start, end)
df = pd.DataFrame(map(lambda x: (x, usedFunctions.clustered(x, numRows), \
                                 usedFunctions.scattered(x, numRows), \
                                 usedFunctions.randomised(x, numRows), \
                                 usedFunctions.randomString(50), \
                                 usedFunctions.padString(x, " ", 50), \
                                 usedFunctions.padSingleChar("x", 4000)), Range))
Output:
0 1 2 3 4 5 6
0 1 0.0 0.0 2.0 KZWeqhFWCEPyYngFbyBM... ... xxxxxxxxxxxxxxx...
1 2 0.1 1.0 0.0 ffxkVZQtqMnMcLRkBOzZ... ... xxxxxxxxxxxxxxx...
2 3 0.2 2.0 3.0 LIixMEOLeMaEqJomTEIJ... ... xxxxxxxxxxxxxxx...
3 4 0.3 3.0 3.0 tgUzEjfebzJsZWdoHIxr... ... xxxxxxxxxxxxxxx...
4 5 0.4 4.0 9.0 qVwYSVPHbDXpPdkhxEpy... ... xxxxxxxxxxxxxxx...
5 6 0.5 5.0 9.0 fFWqcajQLEWVxuXbrFZm... ... xxxxxxxxxxxxxxx...
6 7 0.6 6.0 5.0 jzPdeIgxLdGncfBAepfJ... ... xxxxxxxxxxxxxxx...
7 8 0.7 7.0 3.0 xyimTcfipZGnzPbDFDyF... ... xxxxxxxxxxxxxxx...
8 9 0.8 8.0 7.0 NxrilRavGDMfvJNScUyk... ... xxxxxxxxxxxxxxx...
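Note that pd.DataFrame numbers the columns 0..6 by default, whereas Spark's toDF() would name them _1.._7. If you want explicit column names, here is a minimal sketch; the names in the columns list are illustrative, not taken from your original code:

rows = [(x,
         usedFunctions.clustered(x, numRows),
         usedFunctions.scattered(x, numRows),
         usedFunctions.randomised(x, numRows),
         usedFunctions.randomString(50),
         usedFunctions.padString(x, " ", 50),
         usedFunctions.padSingleChar("x", 4000)) for x in Range]
df = pd.DataFrame(rows, columns=["ID", "CLUSTERED", "SCATTERED",
                                 "RANDOMISED", "RANDOM_STRING",
                                 "SMALL_VC", "PADDING"])

The list comprehension also avoids the lambda and reads a bit more naturally in plain Python; either way you get the same DataFrame.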