generate_synthetic_table

Converts Korean table images into synthetic data using a LangGraph-based agentic flow. The input image is first converted to an HTML table; its content is then analyzed to generate license-free synthetic data with the same structure, which is finally parsed into JSON.

Structure

The flow consists of the following 7 steps.

  1. Image2HTML – Reconstructs the table image as an HTML <table> structure.
  2. Validate PyMuPDF – Verifies that the result parsed by PyMuPDF is valid.
  3. Analyze Table – Analyzes and summarizes the structure and data patterns of the extracted HTML table.
  4. Generate Synthetic Dataset – Based on the summary, generates an HTML table that keeps the same structure but is filled with synthetic data.
  5. Self-Reflection – Checks the generated table for license and personal-information issues and requests regeneration when needed.
  6. Parse Synthetic Table – Converts the final synthetic HTML table into a structured JSON format.
  7. Generate QA – Generates question-answer (QA) pairs from the synthetic table data.

Flow Diagram

graph TD
    START[Start] -->|Provider: OpenAI/Gemini| generate_synthetic_table_from_image[Generate Synthetic Table From Image]
    START[Start] -->|Other Providers| pymupdf_parse[PyMuPDF Parse]

    generate_synthetic_table_from_image --> self_reflection[Self Reflection]

    pymupdf_parse --> validate_parsed_table[Validate PyMuPDF]
    
    validate_parsed_table -->|Valid| analyze_table[Analyze Table]
    validate_parsed_table -->|Invalid| image_to_html[Image to HTML]
    
    image_to_html --> analyze_table
    analyze_table --> generate_synthetic_table[Generate Synthetic Table]
    generate_synthetic_table --> self_reflection
    
    self_reflection -->|Passed| parse_synthetic_table[Parse Synthetic Table]
    self_reflection -->|Failed| revise_synthetic_table[Revise Synthetic Table]
    
    revise_synthetic_table --> self_reflection
    parse_synthetic_table --> generate_qa[Generate QA]
    generate_qa --> END[End]
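
The routing in the diagram can be traced with a small, dependency-free sketch. The node names are copied from the diagram; the function `trace_path`, its parameters, and the provider identifiers `"openai"` and `"gemini"` are assumptions made for illustration and are not taken from the project's code.

```python
from typing import List

# Assumed provider identifiers; the diagram only labels the edge "OpenAI/Gemini".
VISION_PROVIDERS = {"openai", "gemini"}


def trace_path(provider: str, pymupdf_valid: bool, reflection_results: List[bool]) -> List[str]:
    """Return the node sequence visited for one run of the diagrammed flow.

    reflection_results holds the outcome of each self_reflection check in
    order; the last entry is expected to be True (a passing check).
    """
    path = ["START"]
    if provider.lower() in VISION_PROVIDERS:
        # Direct branch: the synthetic table is generated straight from the image.
        path.append("generate_synthetic_table_from_image")
    else:
        path += ["pymupdf_parse", "validate_parsed_table"]
        if not pymupdf_valid:
            # Invalid PyMuPDF output falls back to image-to-HTML conversion.
            path.append("image_to_html")
        path += ["analyze_table", "generate_synthetic_table"]

    # Reflection loop: every failed check routes through revise_synthetic_table.
    for passed in reflection_results:
        path.append("self_reflection")
        if passed:
            break
        path.append("revise_synthetic_table")
    else:
        raise ValueError("reflection_results must contain a passing check")

    path += ["parse_synthetic_table", "generate_qa", "END"]
    return path


if __name__ == "__main__":
    print(" -> ".join(trace_path("other", pymupdf_valid=False, reflection_results=[False, True])))
```

Both entry branches converge on `self_reflection`, so the revise loop applies regardless of how the synthetic table was first produced.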

Key Code Overview (Code Review Guide)

This section describes the main files and core logic to refer to during code review.

1. generate_synthetic_table/flow.py

This file defines the LangGraph flow, which is the core logic.

  • TableState (TypedDict):

    • State object shared across the entire flow.
    • image_path: path to the input image
    • html_table: original HTML extracted from the image
    • table_summary: summary of the table structure and data patterns
    • synthetic_table: generated synthetic-data HTML
    • synthetic_json: final parsed JSON data
    • reflection, passed, attempts: fields for the self-check and retry logic
  • build_synthetic_table_graph:

    • Connects LangGraph nodes and edges to build the pipeline.
    • Proceeds in the order image_to_html -> parse_contents -> generate_synthetic_table -> self_reflection.
    • Depending on the self_reflection result, moves to revise_synthetic_table to retry or, on success, to parse_synthetic_table to finish.
  • Nodes:

    • image_to_html_node: uses a VLM to convert the image to HTML.
    • parse_synthetic_table_node: parses the synthetic HTML into JSON as the final step so that it is easy to consume.
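
The state fields and the retry decision described above could look roughly like the following sketch. The field names come from the list; the field types, the `MAX_ATTEMPTS` cap, the function name `route_after_reflection`, and the choice to proceed to parsing once the cap is reached are all assumptions, since the actual definitions in flow.py are not shown here.

```python
from typing import TypedDict


class TableState(TypedDict, total=False):
    """Shared flow state; field types are assumed, only the names are documented."""

    image_path: str
    html_table: str
    table_summary: str
    synthetic_table: str
    synthetic_json: dict
    reflection: str
    passed: bool
    attempts: int


# Hypothetical retry cap; the real limit (if any) is not stated in this README.
MAX_ATTEMPTS = 3


def route_after_reflection(state: TableState) -> str:
    """Pick the next node name after self_reflection."""
    if state.get("passed", False):
        return "parse_synthetic_table"
    if state.get("attempts", 0) >= MAX_ATTEMPTS:
        # Assumed behavior: stop retrying and parse whatever was produced.
        return "parse_synthetic_table"
    return "revise_synthetic_table"


if __name__ == "__main__":
    print(route_after_reflection({"passed": False, "attempts": 1}))
```

A function of this shape is the kind of callable that LangGraph's `StateGraph.add_conditional_edges` accepts for routing out of a node, which is presumably how the Passed/Failed branch is wired.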

2. generate_synthetic_table/runner.py

Handles CLI execution and file input/output.

  • run_with_args:
    • Processes the arguments received through argparse and runs the flow.
    • Includes the logic that saves each of the results (html_table, synthetic_table, synthetic_json) to its own file.
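
As a rough illustration of the saving step, the sketch below writes the full state to the requested JSON path and each of the three documented result keys to a sibling file. The function name `save_outputs` and the file naming scheme (`<stem>_<key>.<ext>`) are hypothetical; the names that runner.py actually uses are not given in this README.

```python
import json
from pathlib import Path
from typing import List


def save_outputs(state: dict, save_json: str) -> List[Path]:
    """Write the final state plus per-key derived files; return the written paths."""
    target = Path(save_json)
    target.parent.mkdir(parents=True, exist_ok=True)

    # Full state as JSON; ensure_ascii=False keeps Korean text readable.
    target.write_text(json.dumps(state, ensure_ascii=False, indent=2), encoding="utf-8")
    written = [target]

    # Hypothetical naming: result.json -> result_html_table.html, and so on.
    derived = (
        ("html_table", ".html"),
        ("synthetic_table", ".html"),
        ("synthetic_json", ".json"),
    )
    for key, suffix in derived:
        value = state.get(key)
        if value is None:
            continue  # skip results the flow did not produce
        out = target.with_name(f"{target.stem}_{key}{suffix}")
        text = value if isinstance(value, str) else json.dumps(value, ensure_ascii=False, indent=2)
        out.write_text(text, encoding="utf-8")
        written.append(out)
    return written
```

This assumes the state is JSON-serializable as a whole; a state holding non-serializable objects would need to be filtered first.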

3. Prompts (generate_synthetic_table/prompts/)

These are the instructions passed to the LLM. They are written in English to optimize model performance.

  • image_to_html.txt: instructs the model to extract the table structure accurately from the image, including rowspan and colspan.
  • generate_synthetic_table.txt: instructs the model to keep the original structure while replacing the content entirely with synthetic data.
  • self_reflection.txt: a QA prompt that verifies the quality and structural accuracy of the generated data.
  • parse_synthetic_table.txt: defines the rules for converting HTML to JSON.
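
To make the HTML-to-JSON step concrete, here is a deterministic stand-in built on Python's standard `html.parser`. It is not the project's method (the project delegates this conversion to an LLM prompt), and the output schema, with `rows` holding cells that carry `text`, `rowspan`, `colspan`, and `header`, is a hypothetical one chosen for illustration.

```python
import json
from html.parser import HTMLParser


class TableToRows(HTMLParser):
    """Collect <tr> rows and their <td>/<th> cells, keeping span attributes."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            a = dict(attrs)
            self._cell = {
                "text": "",
                "rowspan": int(a.get("rowspan") or 1),
                "colspan": int(a.get("colspan") or 1),
                "header": tag == "th",
            }

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._cell["text"] = self._cell["text"].strip()
            self._row.append(self._cell)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed inside the row


def html_table_to_json(html: str) -> str:
    """Return a JSON string describing the table's rows and cells."""
    parser = TableToRows()
    parser.feed(html)
    parser.close()
    return json.dumps({"rows": parser.rows}, ensure_ascii=False)
```

Keeping `rowspan` and `colspan` on each cell, rather than expanding merged cells into a grid, preserves the structure that the image_to_html prompt is asked to extract.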

Installation

Create a .env file in the project root and set your OpenAI key.

echo "OPENAI_API_KEY=sk-..." > .env

Dependencies are managed through pyproject.toml, so install them with the package manager of your choice. If you use uv:

uv sync

Or, if you use plain pip:

pip install .

Usage

You can run the flow through the command-line interface.

# Basic run (saves the result to result.json and related files)
python main.py I_table_78.png --save-json result.json

# Specify the model and options
python main.py I_table_78.png --model gpt-4o --temperature 0.1 --save-json output.json

  • image_path: (required) path to the table image file to convert
  • --model: name of the OpenAI model to use (default: gpt-4.1-mini)
  • --temperature: model temperature (default: 0.2)
  • --save-json: saves the final state to a JSON file. (The derived HTML and JSON files are saved along with it.)
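
The options above map onto an argparse definition along these lines. The flag names and defaults are taken from the list; the function name `build_parser`, the help strings, and the `None` default for `--save-json` are assumptions and may differ from what runner.py defines.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented CLI options."""
    parser = argparse.ArgumentParser(
        description="Generate a synthetic table from a table image."
    )
    parser.add_argument("image_path", help="path to the table image file to convert")
    parser.add_argument("--model", default="gpt-4.1-mini", help="OpenAI model name")
    parser.add_argument("--temperature", type=float, default=0.2, help="model temperature")
    # Assumed: nothing is saved unless a path is given.
    parser.add_argument("--save-json", default=None, metavar="PATH",
                        help="save the final state to this JSON file")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.image_path, args.model, args.temperature, args.save_json)
```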

The output includes the HTML table, the content summary, the synthetic table, the self-reflection result, and the parsed JSON data.