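Preprocessing Statutes, Enforcement Decrees, and Enforcement Rules

Collect statutes (법령), enforcement decrees (시행령), and enforcement rules (시행규칙), then preprocess them to build a dataset.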

0. Law Process

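doc: download the relevant statute files from the statute website and save them as .doc
raw docx: convert doc to docx (Python)
raw jsonl: convert docx to jsonl
vector DB: embed the jsonl records and load them into a vector DB
SQL DB: load the jsonl records into a SQL DB for caching
(TODO): generate questions from the raw jsonl with the OpenAI API
- input:
- output: question, context, options, answer, explanation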
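
The question-generation step is still a TODO, but the listed output fields already suggest a record shape. Below is a minimal sketch of such a record; the class name `QuestionRecord`, the 0-based integer `answer`, and the validation rules are assumptions for illustration, not part of the pipeline.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class QuestionRecord:
    """One generated question; field names follow the output list above."""
    question: str
    context: str          # article text the question is based on
    options: list[str]    # candidate answers
    answer: int           # assumption: 0-based index into options
    explanation: str

    def validate(self) -> None:
        # Reject records that could not be rendered as a multiple-choice item
        if len(self.options) < 2:
            raise ValueError("a question needs at least two options")
        if not 0 <= self.answer < len(self.options):
            raise ValueError("answer must be an index into options")

    def to_jsonl(self) -> str:
        # ensure_ascii=False keeps Korean text readable in the output file
        self.validate()
        return json.dumps(asdict(self), ensure_ascii=False)
```

Validating before serialization keeps malformed model output out of the questions list.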

1. params format

input: statutes, enforcement decrees, enforcement rules
(final) output:
1. vector db
2. sql db
3. questions list: generated definition-type questions

2. Raw docx to record jsonl

source: statutes, enforcement decrees, enforcement rules
record: jsonl

{
  "text": "제2조(정의) 이 법에서 사용하는 용어의 뜻은 다음과 같다. 1. “담보계약”이란 「민법」 제608조에 따라 그 효력이 상실되는 대물반환(代物返還)의 예약[환매(還買), 양도담보(讓渡擔保) 등 명목(名目)이 어떠하든 그 모두를 포함한다]에 포함되거나 병존(竝存)하는 채권담보(債權擔保) 계약을 말한다. 2. “채무자등”이란 다음 각 목의 자를 말한다. 가. 채무자 나. 담보가등기목적 부동산의 물상보증인(物上保證人) 다. 담보가등기 후 소유권을 취득한 제삼자 3. “담보가등기(擔保假登記)”란 채권담보의 목적으로 마친 가등기를 말한다. 4. “강제경매등”이란 강제경매(强制競賣)와 담보권의 실행 등을 위한 경매를 말한다. 5. “후순위권리자(後順位權利者)”란 담보가등기 후에 등기된 저당권자ㆍ전세권자 및 담보가등기권리자를 말한다. [전문개정 2008. 3. 21.]",
  "metadata": {
    "source": "가등기담보 등에 관한 법률(법률)(제14474호)(20170328)",
    "article_num": "제2조",
    "article_title": "정의",
    "enforcement_date": "2017. 3. 28."
  }
}
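
As a rough illustration of how such a record could be assembled from one article's text, here is a sketch. The helper `build_record` and both regular expressions are hypothetical. It also assumes that the enforcement date can be read from the trailing `(YYYYMMDD)` group of the source name, which holds for the samples in this document but is not confirmed by the actual parser.

```python
import re

# Assumed header shape: "제2조(정의)" or "제10조의2(특례)" at the start of the text
ARTICLE_RE = re.compile(r"^(제\d+조(?:의\d+)?)\(([^)]*)\)")
# Assumed: the source name ends with "(YYYYMMDD)"
DATE_RE = re.compile(r"\((\d{4})(\d{2})(\d{2})\)$")


def build_record(text: str, source: str) -> dict:
    """Turn one article's text into a {"text", "metadata"} record."""
    text = " ".join(text.split())  # collapse newlines and repeated spaces
    article = ARTICLE_RE.match(text)
    date = DATE_RE.search(source)
    enforcement_date = ""
    if date:
        year, month, day = date.groups()
        # "20170328" -> "2017. 3. 28."
        enforcement_date = f"{year}. {int(month)}. {int(day)}."
    return {
        "text": text,
        "metadata": {
            "source": source,
            "article_num": article.group(1) if article else "",
            "article_title": article.group(2) if article else "",
            "enforcement_date": enforcement_date,
        },
    }
```

Text that does not start with an article header (for example, addenda) gets empty `article_num` and `article_title` values rather than raising an error.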

3. Record jsonl to vector db

Chroma DB
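text field → embedded, then stored in documents
metadata field → stored in metadatas
- "subject", a value extracted from the file name, is also added here
md5 hash of the text value → stored in ids

The data stored in Chroma therefore has the following structure: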

{
  "ids": ["<md5 hash of text>"],
  "documents": ["제2조(정의) 이 법에서 사용하는 용어의 뜻은 다음과 같다. ..."],
  "metadatas": [{
    "subject": "tax",  # e.g. "tax" for records that came from tax_records.jsonl
    "source": "가등기담보 등에 관한 법률(법률)(제14474호)(20170328)",
    "article_num": "제2조",
    "article_title": "정의",
    "enforcement_date": "2017. 3. 28."
  }]
}
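
One consequence of hashing only the text is that two articles with identical wording in different sources map to the same id, so only one of them can be stored. The sketch below contrasts the text-only id with one possible alternative that also hashes the source and article number. `scoped_id` is a hypothetical helper, not what the pipeline currently does.

```python
import hashlib


def text_id(text: str) -> str:
    """Id scheme described above: md5 of the text alone."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def scoped_id(record: dict) -> str:
    """Hypothetical alternative: md5 over source, article number, and text."""
    meta = record.get("metadata", {})
    # "\x1f" (unit separator) keeps the three parts from running together
    key = "\x1f".join([
        meta.get("source", ""),
        meta.get("article_num", ""),
        record["text"],
    ])
    return hashlib.md5(key.encode("utf-8")).hexdigest()
```

Which scheme fits depends on whether identical text from different sources should be treated as one document or two.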

4. Parse Law Data and save to Chroma DB

import os
import json
import hashlib
from tqdm import tqdm
from pprint import pprint
import chromadb
from chromadb.utils import embedding_functions
from constants.embedding_models import embedding_models


# Configuration
JSON_FILES = [
    os.path.abspath("data/real_estate_agent/refined/law/civil_records.jsonl"),
    os.path.abspath("data/real_estate_agent/refined/law/brokerage_records.jsonl"),
    os.path.abspath("data/real_estate_agent/refined/law/disclosure_records.jsonl"),
    os.path.abspath("data/real_estate_agent/refined/law/public_records.jsonl"),
    os.path.abspath("data/real_estate_agent/refined/law/tax_records.jsonl")
]

CHROMA_PATH = os.path.abspath("data/index/law_db")
COLLECTION_NAME = "law_all"
BATCH_SIZE = 64



# Embedding model and embedding function
# EMBEDDING_MODEL_NAME = "upskyy/bge-m3-korean"
EMBEDDING_MODEL_NAME = embedding_models[1]  # pick the model to use
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=EMBEDDING_MODEL_NAME)

# Initialize the Chroma client and collection
client = chromadb.PersistentClient(path=CHROMA_PATH)
collection = client.get_or_create_collection(
    name=COLLECTION_NAME,
    embedding_function=embedding_fn,
    metadata={"hnsw:space": "cosine"}
)
print(f"✅ Collection '{COLLECTION_NAME}' loaded")

# Process every JSONL file
for path in JSON_FILES:
    if not os.path.isfile(path) or not path.endswith(".jsonl"):
        print(f"⚠️ Skipping missing or non-JSONL file: {path}")
        continue

    with open(path, "r", encoding="utf-8") as f:
        # Skip blank lines so a trailing newline does not break json.loads
        records = [json.loads(line) for line in f if line.strip()]

    # "tax_records.jsonl" -> "tax"
    subject_name = os.path.basename(path).replace("_records.jsonl", "")

    # Key by the md5 of the text and keep the first occurrence;
    # Chroma rejects duplicate ids within a single batch
    unique = {}
    for rec in records:
        text = (rec.get("text") or "").strip()
        if not text:
            continue
        doc_id = hashlib.md5(text.encode("utf-8")).hexdigest()
        if doc_id not in unique:
            unique[doc_id] = (text, {"subject": subject_name, **rec.get("metadata", {})})

    ids = list(unique)
    texts = [text for text, _ in unique.values()]
    metas = [meta for _, meta in unique.values()]

    for i in tqdm(range(0, len(texts), BATCH_SIZE), desc=f"Saving {subject_name}"):
        collection.add(
            documents=texts[i:i+BATCH_SIZE],
            metadatas=metas[i:i+BATCH_SIZE],
            ids=ids[i:i+BATCH_SIZE]
        )

print(f"✅ Saved. Chroma DB path: {CHROMA_PATH}")

# Print collection info
print("\n=== Collection info ===")
print(f"Document count: {collection.count()}")
print(f"Collection metadata: {collection.metadata}")

# Peek at sample documents (up to 3)
print("\n=== Sample documents ===")
sample_docs = collection.peek(limit=3)
for i, (doc, meta) in enumerate(zip(sample_docs['documents'], sample_docs['metadatas'])):
    print(f"\n📄 Document {i+1}:")
    print(f"Content: {doc[:100]}...")
    print("Metadata:")
    pprint(meta)

# Search test
print("\n=== Search test ===")
# Korean queries: "contract termination", "real estate sale", "damages"
test_queries = ["계약 해지", "부동산 매매", "손해배상"]
for query in test_queries:
    print(f"\n🔍 Query: '{query}'")
    results = collection.query(
        query_texts=[query],
        n_results=2,
        include=["documents", "metadatas", "distances"]
    )

    for i, (doc, meta, dist) in enumerate(zip(
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0]
    )):
        # The collection uses cosine distance, so similarity = 1 - distance
        print(f"\nResult {i+1} (similarity: {1-dist:.2f}):")
        print(f"Document: {doc[:80]}...")
        print("Metadata:")
        pprint(meta)

print("\n✅ Verification complete")

5. Search Chroma DB

import os
from chromadb import PersistentClient
from chromadb.utils import embedding_functions
from constants.embedding_models import embedding_models


# Configuration
CHROMA_PATH = os.path.abspath("data/index/law_db")
COLLECTION_NAME = "law_all"
# EMBEDDING_MODEL_NAME = "upskyy/bge-m3-korean"
EMBEDDING_MODEL_NAME = embedding_models[1]  # pick the model to use

# Embedding function (should be the same model used to build the collection)
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=EMBEDDING_MODEL_NAME)

# Load the Chroma client and collection
client = PersistentClient(path=CHROMA_PATH)
collection = client.get_collection(name=COLLECTION_NAME, embedding_function=embedding_fn)

def search_law_db(query: str, top_k: int = 3):
    """Print the top_k articles most similar to the query."""
    print(f"\n🔍 Query: '{query}'")
    results = collection.query(
        query_texts=[query],
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )

    for i, (doc, meta, dist) in enumerate(zip(
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0]
    )):
        # The collection uses cosine distance, so similarity = 1 - distance
        print(f"\n📄 Result {i+1} (similarity: {1 - dist:.2f})")
        print(f"Document: {doc[:150]}...")
        print("Metadata:")
        print(meta)

# Interactive test loop
if __name__ == "__main__":
    while True:
        query = input("\nEnter a query ('exit' to quit): ").strip()
        if query.lower() in ["exit", "quit"]:
            print("✅ Exiting.")
            break
        if not query:
            continue
        search_law_db(query)

Example output:

📄 Result 3 (similarity: 0.92)
Document: 제8조(응시원서 등) ①시험에 응시하고자 하는 자는 국토교통부령이 정하는 바에 따라 응시원서를 제출하여야 한다. <개정 2008. 2. 29., 2013. 3. 23.> ②시험시행기관장은 응시수수료를 납부한 자가 다음 각 호의 어느 하나에 해당하는 경우에는 국토교통부 령...
Metadata:
{'source': '공인중개사법 시행령(대통령령)(제34401호)(20240710)', 'subject': 'brokerage', 'article_title': '응시원서 등', 'enforcement_date': '2024. 7. 10.', 'article_num': '제8조'}
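
When the results feed another step instead of being printed, it can help to flatten the nested query response into one dict per hit. The sketch below assumes the response shape used in the script above (`documents`, `metadatas`, `distances`, each a list with one inner list per query) and a collection built with cosine distance. `flatten_hits` is a hypothetical helper name.

```python
def flatten_hits(results: dict, query_index: int = 0) -> list[dict]:
    """Convert one query's slice of a Chroma response into a list of hits."""
    docs = results["documents"][query_index]
    metas = results["metadatas"][query_index]
    dists = results["distances"][query_index]
    hits = []
    for doc, meta, dist in zip(docs, metas, dists):
        hits.append({
            "text": doc,
            "metadata": meta or {},
            # cosine distance -> cosine similarity
            "similarity": 1.0 - dist,
        })
    return hits
```

It could be called as `flatten_hits(collection.query(query_texts=[query], n_results=3, include=["documents", "metadatas", "distances"]))`. To restrict a search to one subject, `collection.query` also accepts a metadata filter such as `where={"subject": "brokerage"}`.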

6. Caching

SQLite
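
The caching step is not specified beyond the choice of SQLite, so the following is only a sketch of one way the jsonl records could be cached for exact lookups by source and article number. The table name `law_records`, the schema, and the `(source, article_num)` primary key are assumptions. If a source repeats an article number (for example, in addenda), a different key would be needed.

```python
import json
import sqlite3


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the cache; the schema here is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS law_records (
               source TEXT NOT NULL,
               article_num TEXT NOT NULL,
               article_title TEXT,
               enforcement_date TEXT,
               text TEXT NOT NULL,
               PRIMARY KEY (source, article_num)
           )"""
    )
    return conn


def cache_records(conn: sqlite3.Connection, jsonl_lines) -> int:
    """Insert records from an iterable of jsonl lines; returns the row count."""
    rows = []
    for line in jsonl_lines:
        if not line.strip():
            continue  # ignore blank lines
        rec = json.loads(line)
        meta = rec.get("metadata", {})
        rows.append((
            meta.get("source", ""),
            meta.get("article_num", ""),
            meta.get("article_title"),
            meta.get("enforcement_date"),
            rec["text"],
        ))
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO law_records VALUES (?, ?, ?, ?, ?)", rows
        )
    return len(rows)


def get_article(conn: sqlite3.Connection, source: str, article_num: str):
    """Return the cached article text, or None when it is not cached."""
    row = conn.execute(
        "SELECT text FROM law_records WHERE source = ? AND article_num = ?",
        (source, article_num),
    ).fetchone()
    return row[0] if row else None
```

Passing an open file object to `cache_records` loads a whole jsonl file, and the default `":memory:"` path is convenient for trying the functions without touching disk.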