Commit

feat<table model>: add tablemaster with paddleocr to detect and recognize table (#493)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url

Co-authored-by: sfk <[email protected]>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in #418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url

Co-authored-by: sfk <[email protected]>

* add dockerfile (#189)

Co-authored-by: drunkpig <[email protected]>

* Update cla.yml

* Update cla.yml

---------

Co-authored-by: drunkpig <[email protected]>
Co-authored-by: sfk <[email protected]>
Co-authored-by: Aoyang Fang <[email protected]>
Co-authored-by: Xiaomeng Zhao <[email protected]>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)

Co-authored-by: liukaiwen <[email protected]>

* @Matthijz98 has signed the CLA in #467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in #487

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

---------

Co-authored-by: Xiaomeng Zhao <[email protected]>
Co-authored-by: sfk <[email protected]>
Co-authored-by: drunkpig <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <[email protected]>
Co-authored-by: liukaiwen <[email protected]>
7 people authored Aug 28, 2024
1 parent f4316f0 commit cd64b81
Showing 22 changed files with 306 additions and 67 deletions.
5 changes: 2 additions & 3 deletions .github/ISSUE_TEMPLATE/bug_report.yml
@@ -78,8 +78,8 @@ body:
#multiple: false
options:
-
- "0.5.x"
- "0.6.x"
- "0.7.x"
validations:
required: true

@@ -92,6 +92,5 @@ body:
-
- cpu
- cuda
- mps
validations:
required: true
required: true
1 change: 1 addition & 0 deletions README.md
@@ -200,6 +200,7 @@ Find the `magic-pdf.json` file in your user directory and configure the "models-
// other config
"models-dir": "D:/models",
"table-config": {
"model": "TableMaster", // Another option of this value is 'struct_eqtable'
"is_table_recog_enable": false, // Table recognition is disabled by default, modify this value to enable it
"max_time": 400
}
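As a rough illustration of how these three keys might be consumed, the sketch below reads a `table-config` mapping and falls back to the defaults visible elsewhere in this commit (`TableMaster`, disabled, 400 seconds). The helper name `resolve_table_config` is hypothetical and not part of MinerU, and the snippet assumes a comment-free JSON file, since the `//` annotations above are not valid JSON.

```python
import json

# Defaults mirrored from the values shown in this commit.
DEFAULT_TABLE_MODEL = "TableMaster"
DEFAULT_MAX_TIME = 400


def resolve_table_config(config_text):
    """Hypothetical helper: parse config text and fill in table defaults."""
    config = json.loads(config_text)
    table_config = config.get("table-config", {})
    return {
        "model": table_config.get("model", DEFAULT_TABLE_MODEL),
        "is_table_recog_enable": table_config.get("is_table_recog_enable", False),
        "max_time": table_config.get("max_time", DEFAULT_MAX_TIME),
    }


example = (
    '{"models-dir": "D:/models", '
    '"table-config": {"model": "struct_eqtable", "is_table_recog_enable": true}}'
)
print(resolve_table_config(example))
```

Omitting `table-config` entirely yields the TableMaster defaults with recognition switched off.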
3 changes: 2 additions & 1 deletion README_zh-CN.md
@@ -200,14 +200,15 @@ cp magic-pdf.template.json ~/magic-pdf.json
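Find the magic-pdf.json file in your user directory and set "models-dir" to the directory containing the model weight files downloaded in [2. Download model weight files](#2-下载模型权重文件)
> ❗️Be sure to configure the **absolute path** of the model weights directory correctly; otherwise the program cannot run because the model files will not be found
>
> On Windows this path should include the drive letter, and every "\" in the path must be replaced with "/"; otherwise escaping will cause a JSON syntax error.
> On Windows this path should include the drive letter, and every `"\"` in the path must be replaced with `"/"`; otherwise escaping will cause a JSON syntax error.
>
> For example: if the models are in the models directory at the root of drive D, the value of models-dir should be "D:/models"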
```json
{
// other config
"models-dir": "D:/models",
"table-config": {
"model": "TableMaster", // to use structEqTable, change this to 'struct_eqtable'
"is_table_recog_enable": false, // table recognition is disabled by default; modify this value to enable it
"max_time": 400
}
8 changes: 8 additions & 0 deletions docs/FAQ_en_us.md
@@ -36,3 +36,11 @@ sudo apt-get install libgl1-mesa-glx
```

Reference: https://github.com/opendatalab/MinerU/issues/388

### 5. Encountered error `ModuleNotFoundError: No module named 'fairscale'`
You need to uninstall the module and reinstall it:
```bash
pip uninstall fairscale
pip install fairscale
```
Reference: https://github.com/opendatalab/MinerU/issues/411
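Before reinstalling, it can help to confirm that the interpreter in use really cannot locate the module. The following is a minimal sketch using only the standard library; the helper name `is_module_available` is made up for illustration, and the check only tells you whether the module can be found, not whether the installed copy is healthy.

```python
import importlib.util


def is_module_available(name):
    """Return True if a top-level module can be located by the import system."""
    return importlib.util.find_spec(name) is not None


if not is_module_available("fairscale"):
    print("fairscale is not importable; try: pip uninstall fairscale, then pip install fairscale")
```

Run it with the same Python environment that runs `magic-pdf`, otherwise the result may describe a different installation.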
8 changes: 8 additions & 0 deletions docs/FAQ_zh_cn.md
@@ -33,3 +33,11 @@ The `libgl` library is missing in Ubuntu 22.04 on WSL2; install it with the following command
sudo apt-get install libgl1-mesa-glx
```
Reference: https://github.com/opendatalab/MinerU/issues/388

### 5. Encountered error `ModuleNotFoundError: No module named 'fairscale'`
You need to uninstall the module and reinstall it:
```bash
pip uninstall fairscale
pip install fairscale
```
Reference: https://github.com/opendatalab/MinerU/issues/411
2 changes: 1 addition & 1 deletion docs/README_Windows_CUDA_Acceleration_zh_CN.md
@@ -27,7 +27,7 @@ pip install magic-pdf[full]==0.7.0b1 --extra-index-url https://wheels.myhloli.co
> ```bash
> magic-pdf --version
>```
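> If the version number is lower than 0.6.2, please report it to us in an issue
> If the version number is lower than 0.7.0, please report it to us in an issue
## 5. Download models
For details, see [How to download model files](how_to_download_models_zh_cn.md)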
4 changes: 4 additions & 0 deletions docs/download_models.py
@@ -0,0 +1,4 @@
# use modelscope sdk download models
from modelscope import snapshot_download
model_dir = snapshot_download('wanderkid/PDF-Extract-Kit')
print(f"model dir is: {model_dir}/models")
15 changes: 15 additions & 0 deletions docs/how_to_download_models_en.md
@@ -44,6 +44,21 @@ The structure of the model folder is as follows, including configuration files a
│ ├── spiece.model
│ ├── tokenizer.json
│ └── tokenizer_config.json
│ └─ TableMaster
│ └─ ch_PP-OCRv3_det_infer
│ ├── inference.pdiparams
│ ├── inference.pdiparams.info
│ └── inference.pdmodel
│ └─ ch_PP-OCRv3_rec_infer
│ ├── inference.pdiparams
│ ├── inference.pdiparams.info
│ └── inference.pdmodel
│ └─ table_structure_tablemaster_infer
│ ├── inference.pdiparams
│ ├── inference.pdiparams.info
│ └── inference.pdmodel
│ ├── ppocr_keys_v1.txt
│ └── table_master_structure_dict.txt
└── README.md
```
#### 2. Check whether the model file is fully downloaded.
15 changes: 15 additions & 0 deletions docs/how_to_download_models_zh_cn.md
@@ -74,6 +74,21 @@ print(f"模型文件下载路径为:{model_dir}/models")
│ ├── spiece.model
│ ├── tokenizer.json
│ └── tokenizer_config.json
│ └─ TableMaster
│ └─ ch_PP-OCRv3_det_infer
│ ├── inference.pdiparams
│ ├── inference.pdiparams.info
│ └── inference.pdmodel
│ └─ ch_PP-OCRv3_rec_infer
│ ├── inference.pdiparams
│ ├── inference.pdiparams.info
│ └── inference.pdmodel
│ └─ table_structure_tablemaster_infer
│ ├── inference.pdiparams
│ ├── inference.pdiparams.info
│ └── inference.pdmodel
│ ├── ppocr_keys_v1.txt
│ └── table_master_structure_dict.txt
└── README.md
```

1 change: 1 addition & 0 deletions magic-pdf.template.json
@@ -6,6 +6,7 @@
"models-dir":"/tmp/models",
"device-mode":"cpu",
"table-config": {
"model": "TableMaster",
"is_table_recog_enable": false,
"max_time": 400
}
4 changes: 4 additions & 0 deletions magic_pdf/dict2md/ocr_mkcontent.py
@@ -132,6 +132,8 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout, mode, img_buket_path=""):
# if processed by table model
if span.get('latex', ''):
para_text += f"\n\n$\n {span['latex']}\n$\n\n"
elif span.get('html', ''):
para_text += f"\n\n{span['html']}\n\n"
else:
para_text += f"\n![{table_caption}]({join_path(img_buket_path, span['image_path'])}) \n"
for block in para_block['blocks']: # 3rd: append table_footnote
@@ -256,6 +258,8 @@ def para_to_standard_format_v2(para_block, img_buket_path, page_idx):
if block['type'] == BlockType.TableBody:
if block["lines"][0]["spans"][0].get('latex', ''):
para_content['table_body'] = f"\n\n$\n {block['lines'][0]['spans'][0]['latex']}\n$\n\n"
elif block["lines"][0]["spans"][0].get('html', ''):
para_content['table_body'] = f"\n\n{block['lines'][0]['spans'][0]['html']}\n\n"
para_content['img_path'] = join_path(img_buket_path, block["lines"][0]["spans"][0]['image_path'])
if block['type'] == BlockType.TableCaption:
para_content['table_caption'] = merge_para_with_text(block)
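Both hunks in this file follow the same priority: LaTeX if the span has it, then HTML, and in the Markdown path only, an image link as the last resort. A standalone sketch of that ordering is shown below. The function name `render_table_body` and the simple string join for the image path are assumptions made for illustration; the real code uses `join_path`, whose behavior is not shown in this diff.

```python
def render_table_body(span, img_bucket_path="", caption=""):
    """Sketch of the fallback order: latex, then html, then an image link."""
    if span.get("latex", ""):
        # LaTeX output (StructEqTable path) is wrapped in a math block.
        return f"\n\n$\n {span['latex']}\n$\n\n"
    if span.get("html", ""):
        # HTML output (TableMaster path) is embedded as-is.
        return f"\n\n{span['html']}\n\n"
    # Assumed stand-in for join_path: a plain "/" join.
    image_path = span["image_path"]
    if img_bucket_path:
        image_path = f"{img_bucket_path}/{image_path}"
    return f"\n![{caption}]({image_path}) \n"


print(render_table_body({"html": "<table><tr><td>1</td></tr></table>"}))
```

Because the `latex` check comes first, a span carrying both keys would render as LaTeX; in this commit `magic_model.py` only sets one of the two.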
28 changes: 27 additions & 1 deletion magic_pdf/libs/Constants.py
@@ -10,5 +10,31 @@
# whether the lines in a block have been deleted
LINES_DELETED = "lines_deleted"

# struct eqtable
STRUCT_EQTABLE = "struct_eqtable"

# table recognition max time default value
TABLE_MAX_TIME_VALUE = 400
TABLE_MAX_TIME_VALUE = 400

# pp_table_result_max_length
TABLE_MAX_LEN = 480

# pp table structure algorithm
TABLE_MASTER = "TableMaster"

# table master structure dict
TABLE_MASTER_DICT = "table_master_structure_dict.txt"

# table master dir
TABLE_MASTER_DIR = "table_structure_tablemaster_infer/"

# pp detect model dir
DETECT_MODEL_DIR = "ch_PP-OCRv3_det_infer"

# pp rec model dir
REC_MODEL_DIR = "ch_PP-OCRv3_rec_infer"

# pp rec char dict path
REC_CHAR_DICT = "ppocr_keys_v1.txt"
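These constants name the sub-directories and dictionary files that the download guides list under `TableMaster`. A small sketch for checking that a local copy is complete might look like the following. The file list is taken from the directory tree in `how_to_download_models_en.md`; the function name `missing_tablemaster_files` is hypothetical.

```python
from pathlib import Path

# The three Paddle inference directories and the files each one should contain.
INFER_DIRS = [
    "ch_PP-OCRv3_det_infer",
    "ch_PP-OCRv3_rec_infer",
    "table_structure_tablemaster_infer",
]
INFER_FILES = ["inference.pdiparams", "inference.pdiparams.info", "inference.pdmodel"]
DICT_FILES = ["ppocr_keys_v1.txt", "table_master_structure_dict.txt"]

EXPECTED_FILES = [f"{d}/{f}" for d in INFER_DIRS for f in INFER_FILES] + DICT_FILES


def missing_tablemaster_files(model_dir):
    """Return the expected relative paths that are absent under model_dir."""
    root = Path(model_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


# Example: point this at <models-dir>/TableMaster on your machine.
print(missing_tablemaster_files("/tmp/models/TableMaster"))
```

An empty list means every file from the documented layout was found.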


3 changes: 3 additions & 0 deletions magic_pdf/model/magic_model.py
@@ -562,8 +562,11 @@ def remove_duplicate_spans(spans):
elif category_id == 5:
# get the table model result
latex = layout_det.get("latex", None)
html = layout_det.get("html", None)
if latex:
span["latex"] = latex
elif html:
span["html"] = html
span["type"] = ContentType.Table
elif category_id == 13:
span["content"] = layout_det["latex"]
47 changes: 35 additions & 12 deletions magic_pdf/model/pdf_extract_kit.py
@@ -2,7 +2,7 @@
import os
import time

from magic_pdf.libs.Constants import TABLE_MAX_TIME_VALUE
from magic_pdf.libs.Constants import *

os.environ['NO_ALBUMENTATIONS_UPDATE'] = '1' # disable the albumentations update check
try:
@@ -34,10 +34,18 @@
from magic_pdf.model.pek_sub_modules.post_process import get_croped_image, latex_rm_whitespace
from magic_pdf.model.pek_sub_modules.self_modify import ModifiedPaddleOCR
from magic_pdf.model.pek_sub_modules.structeqtable.StructTableModel import StructTableModel


def table_model_init(model_path, max_time, _device_='cpu'):
table_model = StructTableModel(model_path, max_time=max_time, device=_device_)
from magic_pdf.model.ppTableModel import ppTableModel


def table_model_init(table_model_type, model_path, max_time, _device_='cpu'):
if table_model_type == STRUCT_EQTABLE:
table_model = StructTableModel(model_path, max_time=max_time, device=_device_)
else:
config = {
"model_dir": model_path,
"device": _device_
}
table_model = ppTableModel(config)
return table_model
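To make the branching in the new `table_model_init` easier to see in isolation, here is a sketch with stand-in classes. `FakeStructTableModel` and `FakePaddleTableModel` are placeholders invented for this example so it runs without the real models; only the control flow mirrors the diff.

```python
STRUCT_EQTABLE = "struct_eqtable"
TABLE_MASTER = "TableMaster"


class FakeStructTableModel:
    """Placeholder for StructTableModel; just records its arguments."""

    def __init__(self, model_path, max_time, device):
        self.model_path, self.max_time, self.device = model_path, max_time, device


class FakePaddleTableModel:
    """Placeholder for ppTableModel; receives a config dict."""

    def __init__(self, config):
        self.config = config


def table_model_init(table_model_type, model_path, max_time, _device_="cpu"):
    if table_model_type == STRUCT_EQTABLE:
        return FakeStructTableModel(model_path, max_time, _device_)
    # Every other value takes the Paddle branch; max_time is not forwarded here.
    return FakePaddleTableModel({"model_dir": model_path, "device": _device_})


model = table_model_init(TABLE_MASTER, "/tmp/models/TableMaster", 400)
print(type(model).__name__, model.config)
```

Two details stand out from the flow as written in the diff: any type other than `struct_eqtable` selects the Paddle branch, and the `max_time` limit is passed only to the StructEqTable model.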


@@ -104,9 +112,11 @@ def __init__(self, ocr: bool = False, show_log: bool = False, **kwargs):
# initialize parsing config
self.apply_layout = kwargs.get("apply_layout", self.configs["config"]["layout"])
self.apply_formula = kwargs.get("apply_formula", self.configs["config"]["formula"])
# table config
self.table_config = kwargs.get("table_config", self.configs["config"]["table_config"])
self.apply_table = self.table_config.get("is_table_recog_enable", False)
self.table_max_time = self.table_config.get("max_time", TABLE_MAX_TIME_VALUE)
self.table_model_type = self.table_config.get("model", TABLE_MASTER)
self.apply_ocr = ocr
logger.info(
"DocAnalysis init, this may take some times. apply_layout: {}, apply_formula: {}, apply_ocr: {}, apply_table: {}".format(
@@ -141,10 +151,11 @@ def __init__(self, ocr: bool = False, show_log: bool = False, **kwargs):
if self.apply_ocr:
self.ocr_model = ModifiedPaddleOCR(show_log=show_log)

# init structeqtable
# init table model
if self.apply_table:
self.table_model = table_model_init(str(os.path.join(models_dir, self.configs["weights"]["table"])),
max_time = self.table_max_time, _device_=self.device)
table_model_dir = self.configs["weights"][self.table_model_type]
self.table_model = table_model_init(self.table_model_type, str(os.path.join(models_dir, table_model_dir)),
max_time=self.table_max_time, _device_=self.device)
logger.info('DocAnalysis init done!')

def __call__(self, image):
@@ -278,16 +289,28 @@ def crop_img(input_res, input_pil_img, crop_paste_x=0, crop_paste_y=0):
new_image, _ = crop_img(res, pil_img)
single_table_start_time = time.time()
logger.info("------------------table recognition processing begins-----------------")
latex_code = None
html_code = None
with torch.no_grad():
latex_code = self.table_model.image2latex(new_image)[0]
if self.table_model_type == STRUCT_EQTABLE:
latex_code = self.table_model.image2latex(new_image)[0]
else:
html_code = self.table_model.img2html(new_image)
run_time = time.time() - single_table_start_time
logger.info(f"------------table recognition processing ends within {run_time}s-----")
if run_time > self.table_max_time:
logger.warning(f"------------table recognition processing exceeds max time {self.table_max_time}s----------")
# check whether the result was returned normally
expected_ending = latex_code.strip().endswith('end{tabular}') or latex_code.strip().endswith('end{table}')
if latex_code and expected_ending:
res["latex"] = latex_code

if latex_code:
expected_ending = latex_code.strip().endswith('end{tabular}') or latex_code.strip().endswith(
'end{table}')
if expected_ending:
res["latex"] = latex_code
else:
logger.warning(f"------------table recognition processing fails----------")
elif html_code:
res["html"] = html_code
else:
logger.warning(f"------------table recognition processing fails----------")
table_cost = round(time.time() - table_start, 2)
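The acceptance rule in the new code can be summarized in a few lines: LaTeX output is kept only if it ends with `end{tabular}` or `end{table}`, HTML output is kept whenever it is non-empty, and anything else counts as a failure. The sketch below restates that rule as a pure function; the name `select_table_result` is an assumption for illustration, and logging is left out.

```python
def select_table_result(latex_code=None, html_code=None):
    """Return the keys to store on the result, or an empty dict on failure."""
    result = {}
    if latex_code:
        # str.endswith accepts a tuple, which covers both accepted endings.
        if latex_code.strip().endswith(("end{tabular}", "end{table}")):
            result["latex"] = latex_code
    elif html_code:
        result["html"] = html_code
    return result


print(select_table_result(latex_code="\\begin{tabular}{c} 1 \\end{tabular}"))
print(select_table_result(html_code="<table></table>"))
print(select_table_result(latex_code="\\begin{tabular}{c} 1"))
```

Initializing both variables to `None` before the model call, as the diff does, is what keeps the check from raising when the HTML branch runs and no LaTeX was produced.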
@@ -12,7 +12,6 @@ def __init__(self, model_path, max_new_tokens=2048, max_time=400, device = 'cpu'
self.model = StructTable(self.model_path, self.max_new_tokens, self.max_time)

def image2latex(self, image) -> str:
#
table_latex = self.model.forward(image)
return table_latex

67 changes: 67 additions & 0 deletions magic_pdf/model/ppTableModel.py
@@ -0,0 +1,67 @@
from paddleocr.ppstructure.table.predict_table import TableSystem
from paddleocr.ppstructure.utility import init_args
from magic_pdf.libs.Constants import *
import os
from PIL import Image
import numpy as np


class ppTableModel(object):
"""
This class is responsible for converting image of table into HTML format using a pre-trained model.
Attributes:
- table_sys: An instance of TableSystem initialized with parsed arguments.
Methods:
- __init__(config): Initializes the model with configuration parameters.
- img2html(image): Converts a PIL Image or NumPy array to HTML string.
- parse_args(**kwargs): Parses configuration arguments.
"""

def __init__(self, config):
"""
Parameters:
- config (dict): Configuration dictionary containing model_dir and device.
"""
args = self.parse_args(**config)
self.table_sys = TableSystem(args)

def img2html(self, image):
"""
Parameters:
- image (PIL.Image or np.ndarray): The image of the table to be converted.
Return:
- HTML (str): A string representing the HTML structure with content of the table.
"""
if isinstance(image, Image.Image):
image = np.array(image)
pred_res, _ = self.table_sys(image)
pred_html = pred_res["html"]
res = '<td><table border="1">' + pred_html.replace("<html><body><table>", "").replace("</table></body></html>",
"") + "</table></td>\n"
return res

def parse_args(self, **kwargs):
parser = init_args()
model_dir = kwargs.get("model_dir")
table_model_dir = os.path.join(model_dir, TABLE_MASTER_DIR)
table_char_dict_path = os.path.join(model_dir, TABLE_MASTER_DICT)
det_model_dir = os.path.join(model_dir, DETECT_MODEL_DIR)
rec_model_dir = os.path.join(model_dir, REC_MODEL_DIR)
rec_char_dict_path = os.path.join(model_dir, REC_CHAR_DICT)
device = kwargs.get("device", "cpu")
use_gpu = True if device == "cuda" else False
config = {
"use_gpu": use_gpu,
"table_max_len": kwargs.get("table_max_len", TABLE_MAX_LEN),
"table_algorithm": TABLE_MASTER,
"table_model_dir": table_model_dir,
"table_char_dict_path": table_char_dict_path,
"det_model_dir": det_model_dir,
"rec_model_dir": rec_model_dir,
"rec_char_dict_path": rec_char_dict_path,
}
parser.set_defaults(**config)
return parser.parse_args([])
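The `parse_args` method relies on a standard `argparse` pattern: override the parser's defaults with `set_defaults`, then call `parse_args([])` so that nothing is read from the real command line. The sketch below shows the pattern with the standard library only. The three arguments and their placeholder defaults are assumptions standing in for the much larger parser that PaddleOCR's `init_args()` returns.

```python
import argparse
import os


def build_table_args(model_dir, device="cpu"):
    """Sketch: build a Namespace from config values without touching sys.argv."""
    parser = argparse.ArgumentParser()
    # Placeholder arguments; the real parser defines many more.
    parser.add_argument("--use_gpu", action="store_true")
    parser.add_argument("--table_model_dir", type=str, default=None)
    parser.add_argument("--table_max_len", type=int, default=0)

    # Parser-level defaults take precedence over the argument-level ones above.
    parser.set_defaults(
        use_gpu=(device == "cuda"),
        table_model_dir=os.path.join(model_dir, "table_structure_tablemaster_infer/"),
        table_max_len=480,
    )
    # An empty list means "no CLI arguments", so only the defaults apply.
    return parser.parse_args([])


args = build_table_args("/tmp/models/TableMaster", device="cuda")
print(args.use_gpu, args.table_max_len, args.table_model_dir)
```

Passing `[]` matters when the model is created inside another command-line tool such as `magic-pdf`: without it, `argparse` would try to interpret that tool's own arguments.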