Python 3: writing large (300+ MB) XML with lxml
I have been googling for the past few days, but I simply cannot find any similar question :(
My script in Python 3 has a simple objective:
- connect to a MySQL database and fetch the data
- create an XML with lxml
- save the XML to a file
Normally I have no problems with XML files of 5000+ elements, but in this case my VPS (an Amazon EC2 micro instance) hits its maximum memory usage. My code (the core part):
from datetime import datetime

from decouple import config  # assumption: config() comes from python-decouple
from lxml import etree
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine(config('DB_URI'))
Session = sessionmaker(bind=engine)
session = Session()

# .all() materializes every row into a single in-memory list
query = session.query(Trips.Country,
                      Trips.Region,
                      Trips.Name,
                      Trips.Rebate,
                      Trips.Stars,
                      Trips.PromotionName,
                      Trips.ProductURL,
                      Trips.SubProductURL,
                      Trips.Date,
                      Trips.City,
                      Trips.Type,
                      Trips.Price,
                      TripsImages.ImageURL) \
    .join(TripsImages) \
    .all()
# define namespace xmlns:g
XMLNS = "{http://base.google.com/ns/1.0}"
NSMAP = {"g": "http://base.google.com/ns/1.0"}
# create root rss and channel
rss = etree.Element("rss", nsmap=NSMAP, attrib={"version": "2.0"})
channel = etree.SubElement(rss, "channel", attrib={"generated": str(datetime.now())})
# add <channel> title and description
channel_title = etree.SubElement(channel, "title")
channel_link = etree.SubElement(channel, "link")
channel_description = etree.SubElement(channel, "description")
channel_title.text = "Trips"
channel_link.text = "https://example.com"
channel_description.text = "Description"
# generate xml elements
for count, elem in enumerate(query):
    item = etree.SubElement(channel, "item")
    url = "/".join(["https://example.com",
                    elem.ProductURL,
                    elem.SubProductURL,
                    datetime.strftime(elem.Date, '%Y%m%d')])
    price_discounted = round(elem.Price - elem.Price * (elem.Rebate / 100))
    etree.SubElement(item, XMLNS + "id").text = str(count)
    etree.SubElement(item, XMLNS + "title").text = elem.Country
    # the query selects Trips.Name, not Product, so use elem.Name here
    etree.SubElement(item, XMLNS + "description").text = elem.Name
    etree.SubElement(item, XMLNS + "link").text = url
    etree.SubElement(item, XMLNS + "image_link").text = elem.ImageURL
    etree.SubElement(item, XMLNS + "condition").text = "new"
    etree.SubElement(item, XMLNS + "availability").text = "in stock"
    etree.SubElement(item, XMLNS + "price").text = str(elem.Price)
    etree.SubElement(item, XMLNS + "sale_price").text = str(price_discounted)
    etree.SubElement(item, XMLNS + "brand").text = "Brand"
    etree.SubElement(item, XMLNS + "additional_image_link").text = elem.ImageURL
    etree.SubElement(item, XMLNS + "custom_label_0").text = elem.Date.strftime("%Y-%m-%d")
    etree.SubElement(item, XMLNS + "custom_label_1").text = elem.Type
    etree.SubElement(item, XMLNS + "custom_label_2").text = str(elem.Stars / 10)
    etree.SubElement(item, XMLNS + "custom_label_3").text = elem.City
    etree.SubElement(item, XMLNS + "custom_label_4").text = elem.Country
    etree.SubElement(item, XMLNS + "custom_label_5").text = elem.PromotionName
# finally, serialize XML and save as file
with open(target_xml, "wb") as file:
    file.write(etree.tostring(rss, encoding="utf-8", pretty_print=True))
I query the database with SQLAlchemy and generate the XML file with lxml. Fetching the data from the database already creates a list of 228,890 elements, which takes a lot of memory. Creating the XML then builds more objects in memory, for a total of about 1.5 GB of RAM.
This code runs fine on my laptop with 8 GB of RAM, but when it executes on Amazon EC2 with 1 GB of RAM and 1 GB of swap, it reaches the write() operation and the process gets 'Killed'.
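One detail worth noting at that step: etree.tostring() builds the complete serialized byte string in memory on top of the existing tree, so peak usage during write() is roughly the tree plus its full serialization. A sketch of serializing straight to the file with lxml's ElementTree.write(), which avoids that second copy (the tree itself still has to fit in memory):

# Serialize directly to the file instead of building the whole
# byte string with tostring() first -- saves one full copy in memory.
etree.ElementTree(rss).write(target_xml,
                             encoding="utf-8",
                             xml_declaration=True,
                             pretty_print=True)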
There is plenty on Stack Overflow about parsing large XML files, but I couldn't find anything about writing large files in Python, other than avoiding multiple I/O operations :(
I think what you need is yield_per(), so that you don't have to process all the results at once but handle them in chunks. That way you save memory. You can read more about this function in the SQLAlchemy documentation.
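A minimal sketch of the query above switched over to yield_per(), assuming the same Trips/TripsImages models; the chunk size of 1000 is an arbitrary choice:

# Stream rows in chunks instead of materializing one giant list;
# note there is no .all() -- that would build the full list again.
query = session.query(Trips.Country,
                      Trips.Name,
                      Trips.Price,
                      # ...the remaining columns as in the original query
                      TripsImages.ImageURL) \
    .join(TripsImages) \
    .yield_per(1000)

for count, elem in enumerate(query):
    # only about one chunk of rows is buffered at a time;
    # build each <item> exactly as in the original loop
    item = etree.SubElement(channel, "item")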
Note, however, that yield_per() may skip some of your query rows; the answer to this linked question provides a detailed explanation. If, after reading it, you decide you do not want to use yield_per(), you may also refer to the answers posted on this Stack Overflow question.
Another trick for handling large lists is to use yield, so that you don't load all the entries into memory at once but process them one by one, as in the sketch below. Hope this helps.
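For illustration, a hypothetical build_items() generator along those lines: it turns rows into <item> elements lazily, one per iteration, instead of collecting them in a list first (only a couple of the fields are shown):

def build_items(rows):
    # Hypothetical helper: yields one finished <item> at a time, so no
    # intermediate list of items is ever held alongside the tree.
    for count, elem in enumerate(rows):
        item = etree.Element("item")
        etree.SubElement(item, XMLNS + "id").text = str(count)
        etree.SubElement(item, XMLNS + "title").text = elem.Country
        yield item

for item in build_items(query):
    channel.append(item)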