12 releases
Version | Released |
---|---|
0.2.8 | Nov 7, 2023 |
0.2.7 | Aug 16, 2023 |
0.2.6 | May 17, 2022 |
0.2.5 | Apr 6, 2022 |
0.1.2 | Dec 10, 2020 |
OOXML - Office OpenXML parser in Rust
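This crate started as a private-purpose project with limited knowledge of Office Open XML; use it with caution!
Office Open XML is an XML-based, ZIP-compressed electronic file specification developed by Microsoft; it supports file formats such as documents, spreadsheets, memos, and presentations.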
Office Open XML (also informally known as OOXML or Microsoft Open XML (MOX)) is a zipped, XML-based file format developed by Microsoft for representing spreadsheets, charts, presentations and word processing documents. The format was initially standardized by Ecma (as ECMA-376), and by the ISO and IEC (as ISO/IEC 29500) in later versions.
OOXML, as its name suggests, aims to be a pure Rust implementation of an Office Open XML parser, reading and writing OOXML components efficiently in Rust. For now, only xlsx parsing is supported.
TL;DR
Example code in examples/xlsx.rs:
use ooxml::document::SpreadsheetDocument;

fn main() {
    // Open the sample workbook shipped with the examples.
    let xlsx =
        SpreadsheetDocument::open("examples/simple-spreadsheet/data-image-demo.xlsx").unwrap();
    let workbook = xlsx.get_workbook();
    //println!("{:?}", xlsx);
    let _sheet_names = workbook.worksheet_names();
    for (sheet_idx, sheet) in workbook.worksheets().iter().enumerate() {
        println!("worksheet {}", sheet_idx);
        println!("worksheet dimension: {:?}", sheet.dimenstion());
        println!("---------DATA---------");
        for rows in sheet.rows() {
            // get cell values
            let cols: Vec<_> = rows
                .into_iter()
                .map(|cell| cell.value().unwrap_or_default())
                .collect();
            println!("{}", itertools::join(&cols, ","));
        }
        // Closing separator, matching the sample output shown for each sheet.
        println!("----------------------");
    }
}
Run cargo run --example xlsx:
worksheet 0
worksheet dimension: Some((1, 1))
---------DATA---------
----------------------
worksheet 1
worksheet dimension: Some((4, 4))
---------DATA---------
name,age,birthday,last edited
bob,17,1983/12/12,2020/10/11 19:59
tom,18,1982/12/12,2020/10/11 19:59
cury,20,1980-12-12,2020-10-11 19:59
----------------------
Library Design
The main idea comes from the DotNet OpenXML SDK.
- Implement OpenXML Package Convention for any OOXML format (docx/xlsx/pptx...), including:
  - package read and write
  - content type parsing
  - relationship common types
- Implement shared OpenXML parts
  - content type
  - core properties
  - app properties
  - file properties
  - embedded package
  - image
  - theme
  - style
- Implement Excel/SpreadsheetML specifications
  - Calculation Chain
  - Chartsheet
  - Comments
  - Connections
  - Custom Property
  - Custom XML Mappings
  - Dialogsheet
  - Drawings
  - External Workbook References
  - Metadata
  - Pivot Table
  - Pivot Table Cache Definition
  - Pivot Table Cache Records
  - Query Table
  - Shared String Table
  - Shared Workbook Revision Log
  - Shared Workbook User Data
  - Single Cell Table Definition
  - Table Definition
  - Volatile Dependencies
  - Workbook
  - Worksheet
- Other OpenXML formats (docx, pptx)
The codebase tree structure looks like this.
src
├── document
│   ├── mod.rs
│   ├── presentation
│   │   └── mod.rs
│   ├── spreadsheet
│   │   ├── cell.rs
│   │   ├── chart.rs
│   │   ├── document_type.rs
│   │   ├── drawing.rs
│   │   ├── media.rs
│   │   ├── mod.rs
│   │   ├── shared_string.rs
│   │   ├── style.rs
│   │   ├── workbook.rs
│   │   └── worksheet.rs
│   └── wordprocessing
│       └── mod.rs
├── drawing
│   └── mod.rs
├── error.rs
├── lib.rs
├── math
│   └── mod.rs
└── packaging
    ├── app_property.rs
    ├── content_type.rs
    ├── custom_property.rs
    ├── element.rs
    ├── mod.rs
    ├── namespace.rs
    ├── package.rs
    ├── part
    │   ├── container.rs
    │   ├── mod.rs
    │   └── pair.rs
    ├── property.rs
    ├── relationship
    │   ├── mod.rs
    │   └── reference.rs
    ├── variant.rs
    ├── xml.rs
    └── zip.rs
Definitions For the Crate
The main design principle is "typed everything".
- Package: A Package is a zipped OpenXML document, which could be a wordprocessing, spreadsheet, or presentation document.
- Element: An Element is an OpenXML element representing data details in each XML file.
- Part: A Part is a collection of Elements or pure data that should be serialized to a file in the package.
- Component: A Component is the bridge between behaviors and the internal OpenXML stuff, including Package, Element, and Part.
- Property: A Property represents attributes for an element.
- Document: A Document is the entry Component for a real document, e.g. SpreadsheetDocument.
- Relationship: A Relationship is a link between an element and other resources from a Part.
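The definitions above can be pictured as plain Rust types. This is only a minimal sketch under stated assumptions: every type, field, and function name below is hypothetical and is not the crate's actual API, the zip layer is omitted, and relationship targets are stored as absolute part names for simplicity.

```rust
use std::collections::BTreeMap;

/// An OpenXML element: a typed view of one XML node (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct Element {
    name: String,
    /// Each `Property` is modeled here as an attribute name/value pair.
    properties: Vec<(String, String)>,
}

/// A link from a part to another resource in the package.
#[derive(Debug, Clone, PartialEq)]
struct Relationship {
    id: String,
    /// Simplification: an absolute part name rather than a relative target.
    target: String,
}

/// A part holds either parsed elements or pure data (e.g. image bytes).
#[derive(Debug, Clone, PartialEq)]
enum PartData {
    Elements(Vec<Element>),
    Raw(Vec<u8>),
}

#[derive(Debug, Clone, PartialEq)]
struct Part {
    data: PartData,
    relationships: Vec<Relationship>,
}

/// A package maps part names to parts (the zip layer is omitted).
#[derive(Debug, Default)]
struct Package {
    parts: BTreeMap<String, Part>,
}

impl Package {
    /// Follow relationship `rel_id` from part `from` to its target part.
    fn resolve(&self, from: &str, rel_id: &str) -> Option<&Part> {
        let source = self.parts.get(from)?;
        let rel = source.relationships.iter().find(|r| r.id == rel_id)?;
        self.parts.get(&rel.target)
    }
}

/// Build a tiny two-part package: a drawing that links to one image.
fn sample_package() -> Package {
    let drawing = Part {
        data: PartData::Elements(vec![Element {
            name: "pic".to_string(),
            properties: vec![("name".to_string(), "Picture 1".to_string())],
        }]),
        relationships: vec![Relationship {
            id: "rId1".to_string(),
            target: "/xl/media/image1.png".to_string(),
        }],
    };
    let image = Part {
        data: PartData::Raw(vec![0x89, b'P', b'N', b'G']),
        relationships: vec![],
    };

    let mut pkg = Package::default();
    pkg.parts.insert("/xl/drawings/drawing1.xml".to_string(), drawing);
    pkg.parts.insert("/xl/media/image1.png".to_string(), image);
    pkg
}

fn main() {
    let pkg = sample_package();
    match pkg.resolve("/xl/drawings/drawing1.xml", "rId1") {
        Some(part) => println!("rId1 resolves to: {:?}", part.data),
        None => println!("rId1 not found"),
    }
}
```

Keeping `Part` data as an enum is one way to honor "typed everything": callers must handle parsed elements and raw bytes explicitly instead of guessing from a file extension.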
The data flow when opening or creating a document looks like this.
Document -> Package : open/parse from
Package -> Parts : parse to parts
Parts -> Components: build components tree
Components -> Elements: elements one-to-one map
Elements -> Components: elements changes
Components -> Parts: components write back
Parts -> Package: serialize to package
Package <- Document: flush, save or others
Document -> Components: create new document. add or remove components
Components <-> Elements: operations
Components -> Parts: component add/remove
Parts -> Package: serialize to package
Document -> Package: flush, save or others
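The open, modify, and write-back flow listed above might be sketched as follows. This is an illustration only: `Package`, `Workbook`, and `round_trip` are hypothetical names rather than the crate's API, the package is an in-memory map instead of a zip file, and the part body is a newline-separated stand-in, not real SpreadsheetML.

```rust
use std::collections::BTreeMap;

/// Stand-in for a zipped package: part name -> serialized part content.
type Package = BTreeMap<String, String>;

/// Hypothetical component holding typed data parsed from one part.
#[derive(Debug, PartialEq)]
struct Workbook {
    sheet_names: Vec<String>,
}

impl Workbook {
    const PART: &'static str = "/xl/workbook.xml";

    /// Parts -> Components: build the typed component from its part.
    /// The body is one sheet name per line (a stand-in format, not XML).
    fn from_package(pkg: &Package) -> Option<Workbook> {
        let body = pkg.get(Self::PART)?;
        Some(Workbook {
            sheet_names: body.lines().map(str::to_string).collect(),
        })
    }

    /// Components -> Parts: serialize the component back into the package.
    fn write_back(&self, pkg: &mut Package) {
        pkg.insert(Self::PART.to_string(), self.sheet_names.join("\n"));
    }
}

/// Open, change one component, and write the change back to the package.
fn round_trip(mut pkg: Package, new_sheet: &str) -> Package {
    let mut wb = Workbook::from_package(&pkg).unwrap_or(Workbook {
        sheet_names: Vec::new(),
    });
    wb.sheet_names.push(new_sheet.to_string());
    wb.write_back(&mut pkg);
    pkg
}

fn main() {
    let mut pkg = Package::new();
    pkg.insert(Workbook::PART.to_string(), "Sheet1\nSheet2".to_string());
    let pkg = round_trip(pkg, "Sheet3");
    println!("{}", pkg[Workbook::PART]);
}
```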
Initial Implementation Features
- OPC parsing, including read and write
- Shared components
  - content type
  - core properties
  - app properties
  - file properties (not in schedule)
  - embedded package (not in schedule)
  - image
  - theme
  - style
- SpreadsheetML
  - Workbook
  - Worksheet
TODOs:
- create marker traits for OpenXML elements to make them more general.
- use minidom in an XML part, tracking changes and writing them back to the DOM tree.
- lazily parse some OpenXML parts to speed up the first start.
- implement helper macros for component generation.
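As an illustration of the lazy-parsing item, one possible approach is to keep the raw part text and parse it only on first access, using `std::cell::OnceCell` (stable since Rust 1.70). This is a sketch, not the crate's design: `LazyPart` is a hypothetical type, and the comma-separated body stands in for real XML parsing.

```rust
use std::cell::{Cell, OnceCell};

/// Hypothetical part wrapper that defers parsing until first use.
struct LazyPart {
    raw: String,
    parsed: OnceCell<Vec<String>>,
    /// Counts how often the parser actually ran (for demonstration only).
    parse_count: Cell<usize>,
}

impl LazyPart {
    fn new(raw: &str) -> Self {
        LazyPart {
            raw: raw.to_string(),
            parsed: OnceCell::new(),
            parse_count: Cell::new(0),
        }
    }

    /// Parse on first call, then reuse the cached result.
    fn items(&self) -> &[String] {
        self.parsed.get_or_init(|| {
            self.parse_count.set(self.parse_count.get() + 1);
            // Stand-in "parser": split a comma-separated body.
            self.raw.split(',').map(|s| s.trim().to_string()).collect()
        })
    }
}

fn main() {
    let part = LazyPart::new("name, age, birthday");
    println!("parsed before access: {}", part.parse_count.get());
    println!("items: {:?}", part.items());
    println!("items again: {:?}", part.items());
    println!("parser runs: {}", part.parse_count.get());
}
```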
Tokei - 2020-11-04-11:35:51
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Markdown                1          272            0          230           42
 Plain Text              1            1            0            1            0
 TOML                    1           23           21            1            1
 XML                    52          164          164            0            0
-------------------------------------------------------------------------------
 Rust                   34         2721         2189          194          338
 |- Markdown            14          106            7           90            9
 (Total)                           2827         2196          284          347
===============================================================================
 Total                  89         3287         2381          516          390
===============================================================================
Concepts
Office Open XML, or OpenXML
Office Open XML (also informally known as OOXML or Microsoft Open XML (MOX)) is a zipped, XML-based file format developed by Microsoft for representing spreadsheets, charts, presentations and word processing documents. The format was initially standardized by Ecma (as ECMA-376), and by the ISO and IEC (as ISO/IEC 29500) in later versions.
Microsoft Office 2010 provides read support for ECMA-376, read/write support for ISO/IEC 29500 Transitional, and read support for ISO/IEC 29500 Strict. Microsoft Office 2013 and Microsoft Office 2016 additionally support both reading and writing of ISO/IEC 29500 Strict. While Office 2013 and onward have full read/write support for ISO/IEC 29500 Strict, Microsoft has not yet made the strict non-transitional, or original standard, the default file format due to remaining interoperability concerns.
OpenXML Package Convention
The Open Packaging Conventions (OPC) is a container-file technology initially created by Microsoft to store a combination of XML and non-XML files that together form a single entity such as an Open XML Paper Specification (OpenXPS) document. OPC-based file formats combine the advantages of leaving the independent file entities embedded in the document intact and resulting in much smaller files compared to normal use of XML.
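Two OPC naming rules come up constantly when walking a package: the relationships of a part are stored under a sibling `_rels` folder in a file named after the part plus `.rels`, and relative relationship targets are resolved against the folder of the source part. The helpers below are a minimal sketch of those two rules; the function names are hypothetical and not taken from this crate, and URI escaping is ignored.

```rust
/// Name of the relationships part for `part_name`, e.g.
/// "/xl/workbook.xml" -> "/xl/_rels/workbook.xml.rels".
/// Passing "/" (the package root) yields "/_rels/.rels".
fn rels_part_name(part_name: &str) -> String {
    match part_name.rfind('/') {
        Some(idx) => format!(
            "{}/_rels/{}.rels",
            &part_name[..idx],
            &part_name[idx + 1..]
        ),
        None => format!("_rels/{}.rels", part_name),
    }
}

/// Resolve a relationship target against the part that declares it.
/// Targets starting with '/' are treated as absolute part names.
fn resolve_target(source_part: &str, target: &str) -> String {
    let mut segments: Vec<&str> = if target.starts_with('/') {
        Vec::new()
    } else {
        // Start from the folder that contains the source part.
        let dir = source_part.rfind('/').map_or("", |i| &source_part[..i]);
        dir.split('/').filter(|s| !s.is_empty()).collect()
    };
    for seg in target.split('/') {
        match seg {
            "" | "." => {}
            ".." => {
                segments.pop();
            }
            other => segments.push(other),
        }
    }
    format!("/{}", segments.join("/"))
}

fn main() {
    println!("{}", rels_part_name("/xl/workbook.xml"));
    println!("{}", resolve_target("/xl/workbook.xml", "worksheets/sheet1.xml"));
    println!(
        "{}",
        resolve_target("/xl/worksheets/sheet1.xml", "../drawings/drawing1.xml")
    );
}
```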
Standard ECMA-376
Standard ECMA-376 - The Office Open XML File Formats standard.
1st edition (December 2006), 2nd edition (December 2008), 3rd edition (June 2011), 4th edition (December 2012) and 5th edition (Part 3, December 2015; and Parts 1 & 4, December 2016).
Edition downloads:
The current edition referenced here is the 4th edition, which is technically aligned with ISO/IEC 29500; the 5th edition is ongoing. There is an Office Open XML Overview introduction PDF file.
SpreadsheetML
A SpreadsheetML or .xlsx file is a zip file (a package) containing a number of "parts" (typically UTF-8 or UTF-16 encoded) or XML files. The package may also contain other media files such as images. The structure is organized according to the Open Packaging Conventions as outlined in Part 2 of the OOXML standard ECMA-376.
You can look at the file structure and the files that comprise a SpreadsheetML file by simply unzipping the .xlsx file.
├── [Content_Types].xml
├── docProps
│   ├── app.xml
│   ├── core.xml
│   └── custom.xml
├── _rels
└── xl
    ├── charts
    │   ├── chart1.xml
    │   ├── colors1.xml
    │   ├── _rels
    │   │   └── chart1.xml.rels
    │   └── style1.xml
    ├── drawings
    │   ├── drawing1.xml
    │   ├── drawing2.xml
    │   └── _rels
    │       ├── drawing1.xml.rels
    │       └── drawing2.xml.rels
    ├── media
    │   └── image1.png
    ├── _rels
    │   └── workbook.xml.rels
    ├── sharedStrings.xml
    ├── styles.xml
    ├── theme
    │   └── theme1.xml
    ├── workbook.xml
    └── worksheets
        ├── _rels
        │   ├── sheet1.xml.rels
        │   └── sheet2.xml.rels
        ├── sheet1.xml
        └── sheet2.xml
The number and types of parts will vary based on what is in the spreadsheet, but there will always be a [Content_Types].xml, one or more relationship parts, a workbook part, and at least one worksheet. The core data of the spreadsheet is contained within the worksheet part(s), discussed in more detail at xlsx Content Overview.
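The "always present" parts described above can be checked against the entry names of an unzipped package. The sketch below is an assumption-laden illustration: `check_xlsx_layout` and `LayoutError` are hypothetical names, and it relies on the conventional `xl/workbook.xml` and `xl/worksheets/` locations, whereas a strict reader would locate those parts through the relationship files instead of fixed paths.

```rust
#[derive(Debug, PartialEq)]
enum LayoutError {
    MissingContentTypes,
    MissingRelationships,
    MissingWorkbook,
    MissingWorksheet,
}

/// Check the minimal set of parts and return the number of worksheets.
/// `entries` are zip entry names without a leading slash.
fn check_xlsx_layout(entries: &[&str]) -> Result<usize, LayoutError> {
    if !entries.contains(&"[Content_Types].xml") {
        return Err(LayoutError::MissingContentTypes);
    }
    if !entries.iter().any(|e| e.ends_with(".rels")) {
        return Err(LayoutError::MissingRelationships);
    }
    if !entries.contains(&"xl/workbook.xml") {
        return Err(LayoutError::MissingWorkbook);
    }
    // Count worksheet parts; the `.rels` files under `_rels` are excluded
    // because they do not end with ".xml".
    let sheets = entries
        .iter()
        .filter(|e| e.starts_with("xl/worksheets/") && e.ends_with(".xml"))
        .count();
    if sheets == 0 {
        Err(LayoutError::MissingWorksheet)
    } else {
        Ok(sheets)
    }
}

fn main() {
    let entries = [
        "[Content_Types].xml",
        "_rels/.rels",
        "xl/_rels/workbook.xml.rels",
        "xl/workbook.xml",
        "xl/worksheets/_rels/sheet1.xml.rels",
        "xl/worksheets/sheet1.xml",
        "xl/worksheets/sheet2.xml",
    ];
    println!("{:?}", check_xlsx_layout(&entries));
}
```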
Resources
- Wikipedia Office OpenXML: English, Chinese.
- Microsoft DotNet OpenXML SDK documents and source code.
- Wikipedia OpenXML Package Convention: English, Chinese (开放打包约定).
- What is OOXML: http://officeopenxml.com/
- SpreadsheetML: http://officeopenxml.com/anatomyofOOXML-xlsx.php
- Rust quick-xml documents.
- Rust docx-rs documents and source code on github.
- Go Excel file parser excelize.
- Standard ECMA-376.