Excel to text conversion: properly handling formulas and empty cells
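I am trying to convert an Excel file to a tab-delimited text file with Apache POI. Some cells of the Excel file contain formulas, and some cells are empty.
This is a sample of the original Excel file:
This is an excerpt of the final output: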
'US' 'USORACLEAP' SYSTEMREFERENCE SUPPLIERID SUPPLIERNAME CLASSIFICATION VENDOR_SITE_CODE SUPPLIERADDRESS1 SUPPLIERADDRESS2 STATE ZIPCODE COUNTRY SOURCE INVOICENUM INVOICEDATE PAYMENTDATE LINE_DESC GL_COMPANY GL_CODE GL_DESCR COSTCENTER CC_DESCR CURRENCY_CODE CHECK_NUMBER NUM_DOCS SPEND TERM PAYMENT_METHOD SYSTEM_APPROVED PO_DISTRIBUTION_ID WALKER_COST_CENTER RGL_LEDGER_ENTITY
US US Oracle AP RANDBETWEEN(3000,100000) "TEXT "&D2 VENDOR "TEXT "&D3 "TEXT "&D3 "TEXT "&D3 ONTARIO RIGHT(D2,5) US "TEXT "&D3 "TEXT "&D3 RANDBETWEEN(43831, 44150) RANDBETWEEN(44105,44135) "TEXT "&D3 RIGHT("000"&RANDBETWEEN(1,999),3) RANDBETWEEN(55000, 60000) "TEXT "&D3 "TEXT "&D3 "TEXT "&D3 USD RANDBETWEEN(2000000,2100000) RANDBETWEEN(1,4) RANDBETWEEN(1,100000)/100 IMMEDIATE Check "TEXT"&D2 X2
US US Oracle AP 31836 "TEXT "&D3 1099 "TEXT "&D4 "TEXT "&D4 "TEXT "&D4 NY RIGHT(D3,5) US "TEXT "&D4 "TEXT "&D4 RANDBETWEEN(43831,44150) RANDBETWEEN(44105,44135) "TEXT "&D4 RIGHT("000"&RANDBETWEEN(1,999),3) RANDBETWEEN(55000,60000) "TEXT "&D4 "TEXT "&D4 "TEXT "&D4 USD RANDBETWEEN(2000000,2100000) RANDBETWEEN(1,4) RANDBETWEEN(1,100000)/100 IMMEDIATE Check GSUEDCM03 AF2
US US Oracle AP 3504 "TEXT "&D4 VENDOR "TEXT "&D5 "TEXT "&D5 "TEXT "&D5 NY RIGHT(D4,5) US "TEXT "&D5 "TEXT "&D5 RANDBETWEEN(43831,44150) RANDBETWEEN(44105,44135) "TEXT "&D5 RIGHT("000"&RANDBETWEEN(1,999),3) RANDBETWEEN(55000,60000) "TEXT "&D5 "TEXT "&D5 "TEXT "&D5 USD RANDBETWEEN(2000000,2100000) RANDBETWEEN(1,4) RANDBETWEEN(1,100000)/100 IMMEDIATE ACH GSUEIT001 AF3
US US Oracle AP 3504 "TEXT "&D5 VENDOR "TEXT "&D6 "TEXT "&D6 "TEXT "&D6 NY RIGHT(D5,5) US "TEXT "&D6 "TEXT "&D6 RANDBETWEEN(43831,44150) RANDBETWEEN(44105,44135) "TEXT "&D6 RIGHT("000"&RANDBETWEEN(1,999),3) RANDBETWEEN(55000,60000) "TEXT "&D6 "TEXT "&D6 "TEXT "&D6 USD RANDBETWEEN(2000000,2100000) RANDBETWEEN(1,4) RANDBETWEEN(1,100000)/100 IMMEDIATE ACH GSUEIT001 AF4
US US Oracle AP 3504 "TEXT "&D6 VENDOR "TEXT "&D7 "TEXT "&D7 "TEXT "&D7 NY RIGHT(D6,5) US "TEXT "&D7 "TEXT "&D7 RANDBETWEEN(43831,44150) RANDBETWEEN(44105,44135) "TEXT "&D7 RIGHT("000"&RANDBETWEEN(1,999),3) RANDBETWEEN(55000,60000) "TEXT "&D7 "TEXT "&D7 "TEXT "&D7 USD RANDBETWEEN(2000000,2100000) RANDBETWEEN(1,4) RANDBETWEEN(1,100000)/100 IMMEDIATE ACH GSUEIT001 AF5
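As you can see, the first row holds the column headers. Some cells (D1) have been written out as the formula itself instead of its result. The 3rd column has no value in some rows, so everything after it shifts one position to the left in the text file.
The code is as follows: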
private void convertXlsToText(InputStream inputStream, String delimiter, File targetFile) throws IOException {
    StringBuilder sb = new StringBuilder();
    setMinInflateRatio(0);
    try (Workbook wb = create(inputStream)) {
        Sheet firstSheet = wb.getSheetAt(0);
        for (Row nextRow : firstSheet) {
            Iterator<Cell> cellIterator = nextRow.cellIterator();
            while (cellIterator.hasNext()) {
                Cell cell = cellIterator.next();
                switch (cell.getCellType()) {
                    case STRING:
                        sb.append(cell.getStringCellValue()).append(delimiter);
                        break;
                    case BOOLEAN:
                        sb.append(cell.getBooleanCellValue()).append(delimiter);
                        break;
                    case NUMERIC:
                        sb.append(cell.getNumericCellValue()).append(delimiter);
                        break;
                    case FORMULA:
                        sb.append(cell.getCellFormula()).append(delimiter);
                        break;
                    default:
                        sb.append(EMPTY).append(delimiter);
                }
            }
            sb.append(DEFAULT_LINE_END);
        }
    }
    dumpStringBuilderToFile(sb, targetFile);
}
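Can anyone point out what I should change in my code to fix the alignment and formula problems?
PS: I am using TAB (\t) as the delimiter.
Update:
Here is the code, updated according to the suggestions.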
private void convertXlsToText(InputStream inputStream, String delimiter, File targetFile) throws IOException {
    StringBuilder sb = new StringBuilder();
    setMinInflateRatio(0);
    try (Workbook wb = create(inputStream)) {
        Sheet firstSheet = wb.getSheetAt(0);
        FormulaEvaluator evaluator = wb.getCreationHelper().createFormulaEvaluator();
        DataFormatter formatter = new DataFormatter();
        for (Row nextRow : firstSheet) {
            Iterator<Cell> cellIterator = nextRow.cellIterator();
            while (cellIterator.hasNext()) {
                Cell cell = cellIterator.next();
                if (cell != null) {
                    sb.append(format("%-20s", formatter.formatCellValue(cell, evaluator))).append(delimiter);
                } else {
                    sb.append(format("%-20s", EMPTY)).append(delimiter);
                }
            }
            sb.append(DEFAULT_LINE_END);
        }
    }
    dumpStringBuilderToFile(sb, targetFile);
}
To get the value of a formula cell rather than the formula itself, have a look at the following implementation:
FormulaEvaluator evaluator = myWorkbook.getCreationHelper().createFormulaEvaluator();
CellValue cellValue = evaluator.evaluate(cell); // cell is your formula cell
// Assumes Apache POI 4.0 or later, where getCellType() returns the CellType enum
// (matching the enum-style case labels used in the question).
switch (cellValue.getCellType()) {
    case BOOLEAN:
        System.out.println(cellValue.getBooleanValue());
        break;
    case NUMERIC:
        System.out.println(cellValue.getNumberValue());
        break;
    case STRING:
        System.out.println(cellValue.getStringValue());
        break;
    case BLANK:
        break;
    case ERROR:
        break;
    default:
        break;
}
Edit:
Regarding the alignment problem, see: How can I pad a String in Java?
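As a rough illustration of the kind of padding the linked question is about, here is a minimal sketch that uses only `String.format`. The class and method names (`PadDemo`, `padRight`, `padLeft`) are made up for this example, and the width is assumed to be a positive number.

```java
// Hypothetical helper class, for illustration only.
class PadDemo {

    // Left-aligns the text and fills it with spaces on the right up to `width` characters.
    // Text that is already longer than `width` is returned unchanged (it is not truncated).
    static String padRight(String text, int width) {
        return String.format("%-" + width + "s", text);
    }

    // Right-aligns the text and fills it with spaces on the left up to `width` characters.
    static String padLeft(String text, int width) {
        return String.format("%" + width + "s", text);
    }

    public static void main(String[] args) {
        System.out.println("[" + padRight("NY", 6) + "]"); // prints "[NY    ]"
        System.out.println("[" + padLeft("NY", 6) + "]");  // prints "[    NY]"
    }
}
```

Note that padding of this kind only lines up the columns when the text is viewed in a fixed-width font and no value is longer than the chosen width.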
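If the requirement is to write Excel data into a text file, then all cell values need to be obtained as String. A convenient way to do this is Apache POI's DataFormatter. With DataFormatter you get the cell values as they are displayed in the Excel sheet, for example with number formats and date formats applied. If you use DataFormatter together with a FormulaEvaluator, formulas are evaluated and the evaluated values are converted to String.
To avoid losing empty cells, you need to determine the number of cells per row up front, because the cell iterator skips cells that do not exist in the row. For example, the number of cells in the header row can be used as the number of cells for every following row.
So the whole code is as simple as this: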
import org.apache.poi.ss.usermodel.*;
import java.io.*;

class ExcelToText {

    static final String DEFAULT_LINE_END = System.getProperty("line.separator");

    static void convertXlsToText(InputStream inputStream, String delimiter, OutputStream outputStream) throws Exception {
        StringBuilder sb = new StringBuilder();
        Workbook workbook = WorkbookFactory.create(inputStream);
        // DataFormatter returns each cell value as the String that Excel displays.
        DataFormatter dataFormatter = new DataFormatter(java.util.Locale.US);
        // Passing the evaluator to formatCellValue makes formula cells yield their results.
        FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
        String cellValue = "";
        Sheet sheet = workbook.getSheetAt(0);
        Row headerRow = sheet.getRow(0);
        int cellCount = 0;
        if (headerRow != null) {
            // The header row determines the number of columns written for every row.
            cellCount = headerRow.getLastCellNum();
        }
        if (cellCount > 0) {
            for (Row row : sheet) {
                for (int c = 0; c < cellCount; c++) {
                    // Missing cells are returned as blank cells, so no column is skipped.
                    Cell cell = row.getCell(c, Row.MissingCellPolicy.CREATE_NULL_AS_BLANK);
                    cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
                    sb.append(cellValue);
                    // No delimiter after the last column.
                    if (c < cellCount - 1) sb.append(delimiter);
                }
                sb.append(DEFAULT_LINE_END);
            }
        }
        workbook.close();
        BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(outputStream, java.nio.charset.StandardCharsets.UTF_8));
        bw.append(sb);
        bw.flush();
        bw.close();
    }

    public static void main(String[] args) throws Exception {
        convertXlsToText(new FileInputStream("./Excel.xlsx"), "\t", new FileOutputStream("./Data.txt"));
    }
}
No CellType checks and no separate formula evaluation are needed.
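To see why iterating only over the cells that exist shifts the columns, here is a small POI-free sketch. It models a row as a sorted map from column index to value, which is an assumption made purely for illustration and not a description of how POI stores rows. `SparseRowDemo`, `joinStoredCells`, and `joinByIndex` are hypothetical names.

```java
import java.util.TreeMap;

// POI-free model of a sparse row: only cells that exist are stored, keyed by column index.
class SparseRowDemo {

    // Mimics an iterator over the stored cells: columns without a cell are silently skipped.
    static String joinStoredCells(TreeMap<Integer, String> row, String delimiter) {
        return String.join(delimiter, row.values());
    }

    // Mimics the index-based loop: every column from 0 to cellCount - 1 produces a field,
    // and a missing cell becomes an empty string.
    static String joinByIndex(TreeMap<Integer, String> row, int cellCount, String delimiter) {
        StringBuilder sb = new StringBuilder();
        for (int c = 0; c < cellCount; c++) {
            sb.append(row.getOrDefault(c, ""));
            if (c < cellCount - 1) sb.append(delimiter);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> row = new TreeMap<>();
        row.put(0, "US");
        row.put(1, "US Oracle AP");
        // Column 2 has no cell at all.
        row.put(3, "31836");

        System.out.println(joinStoredCells(row, "|"));  // prints "US|US Oracle AP|31836"
        System.out.println(joinByIndex(row, 4, "|"));   // prints "US|US Oracle AP||31836"
    }
}
```

The first line has only three fields, so the value from column 3 ends up in the position of column 2. The second line keeps an empty field for the missing cell, so every value stays under its header.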
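Regarding your other requirement: a delimited text file should contain only the real content, separated by the delimiter. The content should not be manipulated. So, in my opinion, prepending spaces or padding the content with spaces to a fixed width is not a good idea. For example, if tab is the delimiter, then only the tab stops set in the text viewer affect how the file looks. Additional padding spaces only get in the way.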
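The effect of padding on the data itself can be shown with a short round trip, again using only the Java standard library. This is a sketch under the assumption that a consumer splits each line on the tab character. `PaddedFieldDemo` and its methods are hypothetical names.

```java
// Hypothetical demo class: writes one line with and without padding, then parses it back.
class PaddedFieldDemo {

    // Joins the values with tabs and leaves the content untouched.
    static String joinPlain(String[] values) {
        return String.join("\t", values);
    }

    // Pads every value to `width` characters (as "%-20s" does for width 20) before joining.
    static String joinPadded(String[] values, int width) {
        String[] padded = new String[values.length];
        for (int i = 0; i < values.length; i++) {
            padded[i] = String.format("%-" + width + "s", values[i]);
        }
        return String.join("\t", padded);
    }

    // Splits a line on tabs; the limit -1 keeps trailing empty fields.
    static String[] parseLine(String line) {
        return line.split("\t", -1);
    }

    public static void main(String[] args) {
        String[] values = {"US", "", "NY"};

        String[] plain = parseLine(joinPlain(values));
        String[] padded = parseLine(joinPadded(values, 20));

        System.out.println(plain[0].equals("US"));  // prints "true"
        System.out.println(padded[0].equals("US")); // prints "false"
        System.out.println(padded[0].length());     // prints "20"
    }
}
```

With padding, every consumer of the file has to trim each field to get the original value back, and an empty cell is no longer an empty field.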