Loading data into MariaDB via Python: INT column is being set to 0
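I am loading a dataset into my MariaDB database via Python. When I look at a subset of the DataFrame in Python the data looks correct, but by the time it reaches the table in MariaDB something has gone wrong:
- the last 2 values of the date column are missing
- the price column is set to 0
and there may be other problems.
I am loading this dataset, looping over the half-yearly files:
https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads
My approach is to create a database table and then, in a for loop, load the data from a locally saved CSV file.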
import mysql.connector
import pandas as pd

#define fn to run queries
def run(query):
    connection = mysql.connector.connect(user=USR, password=PASSWORD, database=DBNAME, host=ENDPOINT, port=PORT, allow_local_infile=True)
    cursor = connection.cursor()
    cursor.execute(query, multi=True)
#create database table
query = """
DROP TABLE IF EXISTS `pp_data`;
CREATE TABLE IF NOT EXISTS `pp_data` (
`transaction_unique_identifier` tinytext COLLATE utf8_bin NOT NULL,
`price` int(10) unsigned NOT NULL,
`date_of_transfer` date NOT NULL,
`postcode` varchar(8) COLLATE utf8_bin NOT NULL,
`property_type` varchar(1) COLLATE utf8_bin NOT NULL,
`new_build_flag` varchar(1) COLLATE utf8_bin NOT NULL,
`tenure_type` varchar(1) COLLATE utf8_bin NOT NULL,
`primary_addressable_object_name` tinytext COLLATE utf8_bin NOT NULL,
`secondary_addressable_object_name` tinytext COLLATE utf8_bin NOT NULL,
`street` tinytext COLLATE utf8_bin NOT NULL,
`locality` tinytext COLLATE utf8_bin NOT NULL,
`town_city` tinytext COLLATE utf8_bin NOT NULL,
`district` tinytext COLLATE utf8_bin NOT NULL,
`county` tinytext COLLATE utf8_bin NOT NULL,
`ppd_category_type` varchar(2) COLLATE utf8_bin NOT NULL,
`record_status` varchar(2) COLLATE utf8_bin NOT NULL,
`db_id` bigint(20) unsigned NOT NULL
) DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1 ;
"""
cursor.execute(query, multi = True)
#add primary key
query=""" ALTER TABLE `pp_data`
ADD PRIMARY KEY (`db_id`);
MODIFY `db_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,AUTO_INCREMENT=1;
CREATE INDEX `pp.postcode` USING HASH
ON `pp_data`
(postcode);
CREATE INDEX `pp.date` USING HASH
ON `pp_data`
(date_of_transfer);"""
run(query)
#pull data via for loop
year_list = range(1995,2000)
filename1 = ["http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-{0}-part1.csv".format(year) for year in year_list]
filename2 = ["http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-{0}-part2.csv".format(year) for year in year_list]
filenames= filename1 + filename2
connection = mysql.connector.connect( user=USR, password=PASSWORD, database=DBNAME, host=ENDPOINT , port =PORT, allow_local_infile = True)
cursor = connection.cursor()
query = """LOAD DATA local INFILE 'out.csv'
INTO TABLE property_prices.pp_data
FIELDS TERMINATED BY ','
ENCLOSED BY ''
LINES TERMINATED BY '\n';
"""
for url in filenames:
    data = pd.read_csv(url)
    df = pd.DataFrame(data)
    df.to_csv('out.csv')
    #print(df.head())
    cursor.execute(query)
When I look at one of the DataFrames (df) in Python, the data looks correct:
{36F3EF94-82A0-418E-8A00-19D504CFC03B} 33000 1995-11-17 00:00 PL24 2NH \
0 {4D99BC24-0AD1-454E-96BB-19D51B3AD118} 54000 1995-04-28 00:00 CB23 7XB
1 {0B64C330-4E83-427D-9C98-20F0FE8697F0} 27000 1995-02-24 00:00 DL3 9TU
2 {AC7D0429-9801-4584-8E5E-20F10CF3DD36} 56950 1995-10-20 00:00 CV7 7LY
3 {8EB62929-D245-4BB9-ADB5-20F10E847F1A} 55000 1995-01-31 00:00 MK41 7LS
4 {BCB01E86-FB6A-4031-A67D-20F11887223B} 62995 1995-04-07 00:00 E3 2UR
T N F 41 Unnamed: 8 FORE STREET ST BLAZEY \
0 T N F 19 NaN LAMBOURNE ROAD HARDWICK
1 T N F 11 NaN ATHERSTONE WAY DARLINGTON
2 T N F 10 NaN HIGHFIELD MERIDEN
3 S N F 12 NaN LELY CLOSE BEDFORD
4 F Y L PARK WEST BUILDING FLAT 38 FAIRFIELD ROAD LONDON
PAR RESTORMEL CORNWALL A A.1
0 CAMBRIDGE SOUTH CAMBRIDGESHIRE CAMBRIDGESHIRE A A
1 DARLINGTON DARLINGTON DARLINGTON A A
2 COVENTRY SOLIHULL WEST MIDLANDS A A
3 BEDFORD NORTH BEDFORDSHIRE BEDFORDSHIRE A A
4 LONDON TOWER HAMLETS GREATER LONDON A A
But when I look at the table in SQL it looks very different. In particular, the date column is truncated and the price column is 0:
query = "select * from pp_data limit 5 ;"
cursor.execute(query)
records = cursor.fetchall()
print(records)
[(b'', 0, None, '1995-11-', 'P', 'T', 'N', b'F', b'41', b'Unnamed: 8', b'FORE STREET', b'ST BLAZEY', b'PAR', b'RESTORMEL', 'CO', 'A', 0), (b'0', 0, None, '1995-04-', 'C', 'T', 'N', b'F', b'19', b'', b'LAMBOURNE ROAD', b'HARDWICK', b'CAMBRIDGE', b'SOUTH CAMBRIDGESHIRE', 'CA', 'A', 0), (b'1', 0, None, '1995-02-', 'D', 'T', 'N', b'F', b'11', b'', b'ATHERSTONE WAY', b'DARLINGTON', b'DARLINGTON', b'DARLINGTON', 'DA', 'A', 0), (b'2', 0, None, '1995-10-', 'C', 'T', 'N', b'F', b'10', b'', b'HIGHFIELD', b'MERIDEN', b'COVENTRY', b'SOLIHULL', 'WE', 'A', 0), (b'3', 0, None, '1995-01-', 'M', 'S', 'N', b'F', b'12', b'', b'LELY CLOSE', b'BEDFORD', b'BEDFORD', b'NORTH BEDFORDSHIRE', 'BE', 'A', 0)]
I don't know what to do about this, especially since the table schema was provided by my uni!
Thanks :)
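You're off by one. Your table doesn't have a field for that GUID, so the date field is being loaded into postcode, where it is truncated to 8 characters. You need to remove the GUID column before importing.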
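To see how one extra leading field shifts every later value by a column, here is a minimal sketch. It assumes (this is not stated in the thread) that the extra field is the row index that pandas `to_csv` writes by default, which would be consistent with the `b'0'`, `b'1'`, ... values in the first column of the query output. The two sample rows are copied from the question; everything else is illustrative.

```python
import io
import pandas as pd

# Two rows copied from the question's data (first four fields only).
# Assumption: the source files have no header row.
raw = (
    "{36F3EF94-82A0-418E-8A00-19D504CFC03B},33000,1995-11-17 00:00,PL24 2NH\n"
    "{4D99BC24-0AD1-454E-96BB-19D51B3AD118},54000,1995-04-28 00:00,CB23 7XB\n"
)

# header=None keeps the first record as data instead of using it as column names.
df = pd.read_csv(io.StringIO(raw), header=None)

# Default to_csv: writes a header line plus the index as an extra first column,
# so every real field sits one position to the right of where the table expects it.
shifted = df.to_csv()

# index=False, header=False: only the original fields, in their original positions.
aligned = df.to_csv(index=False, header=False)

print(shifted.splitlines()[1])  # first data line, with the index in front
print(aligned.splitlines()[0])  # first data line, fields unshifted
```

If that assumption holds, writing the file with `index=False, header=False` would keep each value in the column the table expects, so the price would land in `price` and the date in `date_of_transfer`.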