fix: use Unicode-aware keyword extraction in InMemoryMemoryService #5502
yodeee9 wants to merge 2 commits into google:main
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up-to-date status, view the checks section at the bottom of the pull request.
Response from ADK Triaging Agent
Hello @yodeee9, thank you for your contribution! It looks like you have not yet signed the Contributor License Agreement (CLA). Please visit https://cla.developers.google.com/ to sign the agreement. Once you have done so, the CLA check will pass and we will be able to proceed with reviewing your PR. Thanks!
Replace [A-Za-z]+ with \w+ so token extraction includes Unicode word characters. Add a non-ASCII containment fallback in search_memory() for scripts without whitespace word boundaries (Japanese, Chinese). Fixes google#5501
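The two changes in this commit message can be illustrated with a minimal sketch. This is not the actual ADK code: the helper names `extract_words_lower`, `has_non_ascii`, and `matches` are hypothetical, and the real `search_memory()` may structure the fallback differently. It only shows the general idea of `\w+` tokenization plus a substring check for non-ASCII query tokens.

```python
import re


def extract_words_lower(text: str) -> set[str]:
    # On str patterns, \w matches Unicode word characters by default,
    # so Cyrillic, Hangul, kana, and Han characters are kept as tokens.
    return {word.lower() for word in re.findall(r"\w+", text)}


def has_non_ascii(text: str) -> bool:
    # True if any character falls outside the 7-bit ASCII range.
    return any(ord(ch) > 127 for ch in text)


def matches(query: str, memory_text: str) -> bool:
    """Hypothetical matcher: token overlap, then a containment fallback."""
    query_words = extract_words_lower(query)
    if query_words & extract_words_lower(memory_text):
        return True
    # Assumed fallback: scripts without whitespace word boundaries produce
    # one long token per sentence, so check non-ASCII query tokens as
    # substrings. ASCII tokens are excluded to avoid partial-word hits.
    memory_lower = memory_text.lower()
    return any(has_non_ascii(w) and w in memory_lower for w in query_words)
```

Under these assumptions, a space-delimited query such as "Привет" matches by token overlap, while "太郎" matches "私の名前は太郎です" only through the containment fallback. An ASCII query such as "cat" does not match "concatenate", because the fallback is limited to non-ASCII tokens.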
f538de1 to d940392
Hi @yodeee9, thank you for your contribution! We appreciate you taking the time to submit this pull request. Could you please fix the failing tests so that we can proceed with the review?
Fixes #5501
Related: #2438
Summary
_extract_words_lower() uses re.findall(r'[A-Za-z]+', text), which only extracts ASCII letters. As a result, search_memory() returns no results for non-Latin text: Korean, Cyrillic, Japanese, Chinese, etc. are all excluded from token extraction.
Changes
- Replace [A-Za-z]+ with \w+ for Unicode-aware token extraction. This fixes search for space-delimited scripts (Korean, Cyrillic, Arabic, etc.).
- Add a non-ASCII containment fallback in search_memory() for scripts without whitespace word boundaries (Japanese, Chinese). This enables direct keyword lookups but does not implement language-aware segmentation.
This keeps the service lightweight and dependency-free. Languages without whitespace word boundaries work for direct keyword searches (e.g. "太郎" matches "私の名前は太郎です") but not for natural language queries. Production multilingual retrieval should use VertexAiMemoryBankService.
Testing plan
9 parametrized cases added to tests/unittests/memory/test_in_memory_memory_service.py: