fix: use Unicode-aware keyword extraction in InMemoryMemoryService#5502

Open
yodeee9 wants to merge 2 commits into google:main from yodeee9:fix/memory-non-latin-search

@yodeee9 yodeee9 commented Apr 27, 2026

Fixes #5501
Related: #2438

Summary

_extract_words_lower() calls re.findall(r'[A-Za-z]+', text), which extracts only ASCII letters. As a result, search_memory() returns no results for non-Latin text: Korean, Cyrillic, Japanese, Chinese, and other scripts are all dropped during token extraction.
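To make the difference concrete, here is a minimal sketch comparing the two patterns. The function names extract_words_ascii and extract_words_unicode are hypothetical and exist only for this illustration; they are not the service's actual helpers.

```python
import re


def extract_words_ascii(text: str) -> set[str]:
    # Pattern described in the summary: only runs of ASCII letters become tokens.
    return {w.lower() for w in re.findall(r'[A-Za-z]+', text)}


def extract_words_unicode(text: str) -> set[str]:
    # For str patterns in Python 3, \w matches Unicode word characters by default.
    return {w.lower() for w in re.findall(r'\w+', text)}


print(sorted(extract_words_ascii("Привет мир")))    # []
print(sorted(extract_words_unicode("Привет мир")))  # ['мир', 'привет']

# Without spaces, a whole Japanese sentence becomes a single token,
# so token equality alone cannot find a keyword inside it.
print(sorted(extract_words_unicode("私の名前は太郎です")))  # ['私の名前は太郎です']
```

One side effect to be aware of: \w also matches digits and underscores, so a string such as "python3_tips" yields one token under \w+ where [A-Za-z]+ would yield "python" and "tips".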

Changes

  • Replace [A-Za-z]+ with \w+ for Unicode-aware token extraction. Fixes search for space-delimited scripts (Korean, Cyrillic, Arabic, etc.)
  • Add a non-ASCII containment fallback in search_memory() for scripts without whitespace word boundaries (Japanese, Chinese). This enables direct keyword lookups but does not implement language-aware segmentation.

This keeps the service lightweight and dependency-free. Languages without whitespace word boundaries work for direct keyword searches (e.g. "太郎" matches "私の名前は太郎です") but not for natural language queries. Production multilingual retrieval should use VertexAiMemoryBankService.
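The two-stage matching described above might look roughly like the following. This is a sketch under assumptions, not the code in this PR: the matches() helper and its exact fallback condition are hypothetical, and the real search_memory() operates on session events rather than plain strings.

```python
import re


def extract_words_lower(text: str) -> set[str]:
    # Unicode-aware tokenization, as proposed in the changes above.
    return {w.lower() for w in re.findall(r'\w+', text)}


def matches(query: str, memory_text: str) -> bool:
    # Hypothetical helper: True if the stored text should be returned for the query.
    query_words = extract_words_lower(query)
    memory_words = extract_words_lower(memory_text)

    # Stage 1: exact token match (works for space-delimited scripts).
    if query_words & memory_words:
        return True

    # Stage 2 (assumed shape of the fallback): substring containment,
    # applied only to non-ASCII query tokens so that Latin partial words
    # such as "thon" still do not match "Python".
    lowered = memory_text.lower()
    return any(not w.isascii() and w in lowered for w in query_words)


print(matches("太郎", "私の名前は太郎です"))                # True
print(matches("太郎の名前は何ですか", "私の名前は太郎です"))  # False
print(matches("thon", "I like Python"))                   # False
```

The second call shows the stated limitation: a natural language query is tokenized as one long string, which is not contained in the stored text, so it finds nothing without real segmentation.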

Testing plan

9 parametrized cases added to tests/unittests/memory/test_in_memory_memory_service.py:

  • Japanese: substring match, no-match
  • Chinese: substring match, no-match
  • Korean: token match
  • Cyrillic: token match
  • Mixed Japanese + English: both scripts searchable
  • Latin partial-word regression: "thon" does not match "Python"

pytest tests/unittests/memory/test_in_memory_memory_service.py -v
======================== 21 passed in 1.01s =========================
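For readers who want to try the listed scenarios without the repository, the cases can be expressed as a small table-driven check in plain Python. Everything here is illustrative: the matches() stand-in, the case labels, and the sample sentences are assumptions, not the data used in tests/unittests/memory/test_in_memory_memory_service.py, where the cases are pytest parametrized.

```python
import re


def _tokens(text: str) -> set[str]:
    return {w.lower() for w in re.findall(r'\w+', text)}


def matches(query: str, memory_text: str) -> bool:
    # Hypothetical stand-in for the service's matching logic.
    query_words = _tokens(query)
    if query_words & _tokens(memory_text):
        return True
    lowered = memory_text.lower()
    return any(not w.isascii() and w in lowered for w in query_words)


# (label, query, stored text, expected result) -- illustrative data only
CASES = [
    ("japanese substring", "太郎", "私の名前は太郎です", True),
    ("japanese no-match", "花子", "私の名前は太郎です", False),
    ("chinese substring", "北京", "我住在北京", True),
    ("chinese no-match", "上海", "我住在北京", False),
    ("korean token", "철수", "내 친구 철수", True),
    ("cyrillic token", "Иван", "Меня зовут Иван", True),
    ("mixed: english", "Python", "私は Python が好きです", True),
    ("mixed: japanese", "好き", "私は Python が好きです", True),
    ("latin partial-word", "thon", "I like Python", False),
]

# Collect the labels of any cases whose result differs from the expectation.
failures = [
    label
    for label, query, text, expected in CASES
    if matches(query, text) != expected
]
print(failures)  # []
```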

google-cla Bot commented Apr 27, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@adk-bot adk-bot added the services [Component] This issue is related to runtime services, e.g. sessions, memory, artifacts, etc label Apr 27, 2026
adk-bot (Collaborator) commented Apr 27, 2026

Response from ADK Triaging Agent

Hello @yodeee9, thank you for your contribution!

It looks like you have not yet signed the Contributor License Agreement (CLA). Please visit https://cla.developers.google.com/ to sign the agreement. Once you have done so, the CLA check will pass and we will be able to proceed with reviewing your PR.

Thanks!

Replace [A-Za-z]+ with \w+ so token extraction includes Unicode word
characters. Add a non-ASCII containment fallback in search_memory() for
scripts without whitespace word boundaries (Japanese, Chinese).

Fixes google#5501
@yodeee9 yodeee9 force-pushed the fix/memory-non-latin-search branch from f538de1 to d940392 Compare April 27, 2026 08:55
@rohityan rohityan self-assigned this Apr 27, 2026
@rohityan rohityan added the request clarification [Status] The maintainer need clarification or more information from the author label Apr 27, 2026
rohityan (Collaborator) commented

Hi @yodeee9, thank you for your contribution! We appreciate you taking the time to submit this pull request. Could you please fix the failing tests so that we can proceed with the review?

Development

Successfully merging this pull request may close these issues.

InMemoryMemoryService search doesn't work with non-Latin text (Japanese, CJK, etc.)