fix: use Unicode-aware keyword extraction in InMemoryMemoryService by yodeee9 · Pull Request #5502 · google/adk-python

yodeee9 · 2026-04-27T08:49:37Z

Fixes #5501
Related: #2438

Summary

_extract_words_lower() uses re.findall(r'[A-Za-z]+', text), which extracts only ASCII letters. As a result, search_memory() returns no results for non-Latin text: Korean, Cyrillic, Japanese, Chinese, and other non-Latin scripts are all excluded from token extraction.
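To make the failure mode concrete, here is a minimal sketch of ASCII-only extraction using the pattern quoted above. The function name extract_words_ascii and the sample sentences are illustrative assumptions, not code or data from the repository.

```python
import re


def extract_words_ascii(text: str) -> set[str]:
    """Hypothetical stand-in that mirrors the ASCII-only pattern quoted above."""
    return {word.lower() for word in re.findall(r'[A-Za-z]+', text)}


# Latin text is tokenized as expected.
print(sorted(extract_words_ascii("I like Python")))  # ['i', 'like', 'python']

# Korean text yields no tokens at all, so nothing can ever match it.
print(extract_words_ascii("저는 파이썬을 좋아합니다"))  # set()

# In mixed text only the ASCII word survives; the Cyrillic words are dropped.
print(extract_words_ascii("Я люблю Python"))  # {'python'}
```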

Changes

Replace [A-Za-z]+ with \w+ for Unicode-aware token extraction. This fixes search for space-delimited scripts (Korean, Cyrillic, Arabic, etc.).
Add a non-ASCII containment fallback in search_memory() for scripts without whitespace word boundaries (Japanese, Chinese). This enables direct keyword lookups but does not implement language-aware segmentation.

This keeps the service lightweight and dependency-free. Languages without whitespace word boundaries work for direct keyword searches (e.g. "太郎" matches "私の名前は太郎です") but not for natural language queries. Production multilingual retrieval should use VertexAiMemoryBankService.
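The two changes can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's actual diff: the helper names extract_words_lower and is_match are hypothetical, and the real search_memory() iterates over stored session events rather than comparing two strings.

```python
import re


def extract_words_lower(text: str) -> set[str]:
    # For str patterns in Python 3, \w matches Unicode word characters
    # by default, so Cyrillic, Hangul, kana, and CJK ideographs are kept.
    return {word.lower() for word in re.findall(r'\w+', text)}


def is_match(query: str, memory_text: str) -> bool:
    """Hypothetical matcher: token overlap, then a non-ASCII substring fallback."""
    query_words = extract_words_lower(query)
    memory_words = extract_words_lower(memory_text)

    # Primary path: whole-token overlap (works for space-delimited scripts).
    if query_words & memory_words:
        return True

    # Fallback (assumed shape): a non-ASCII query token may match as a
    # substring, since scripts such as Japanese or Chinese have no
    # whitespace word boundaries. ASCII tokens are excluded so that
    # "thon" still does not match "Python".
    haystack = memory_text.lower()
    return any(not word.isascii() and word in haystack for word in query_words)


print(is_match("太郎", "私の名前は太郎です"))            # True: direct keyword lookup
print(is_match("太郎は誰ですか", "私の名前は太郎です"))  # False: no segmentation
print(is_match("thon", "I love Python"))                 # False: no ASCII substring match
```

The second call shows the limitation described above: without language-aware segmentation, a whole natural-language query is treated as a single token and is not found in the stored text.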

Testing plan

9 parametrized cases added to tests/unittests/memory/test_in_memory_memory_service.py:

Japanese: substring match, no-match
Chinese: substring match, no-match
Korean: token match
Cyrillic: token match
Mixed Japanese + English: both scripts searchable
Latin partial-word regression: "thon" does not match "Python"
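The cases listed above could be laid out as a table like the one below. This is a sketch only: the sample strings, the case ids, and the is_match helper are assumptions made for illustration and are not the PR's actual test data. The tuples are shaped so that they could be passed to pytest.mark.parametrize, but the snippet runs on its own without pytest.

```python
import re


def is_match(query: str, memory_text: str) -> bool:
    """Hypothetical stand-in for the matching behaviour described in this PR."""
    def tokens(text: str) -> set[str]:
        return {word.lower() for word in re.findall(r'\w+', text)}

    query_words = tokens(query)
    if query_words & tokens(memory_text):
        return True
    haystack = memory_text.lower()
    return any(not word.isascii() and word in haystack for word in query_words)


# (case id, stored text, query, expected match); all strings are illustrative.
CASES = [
    ("japanese_substring", "私の名前は太郎です", "太郎", True),
    ("japanese_no_match", "私の名前は太郎です", "花子", False),
    ("chinese_substring", "我喜欢编程", "编程", True),
    ("chinese_no_match", "我喜欢编程", "音乐", False),
    ("korean_token", "저는 파이썬 개발자입니다", "파이썬", True),
    ("cyrillic_token", "Я люблю программирование", "программирование", True),
    ("mixed_english_part", "Python は楽しい", "Python", True),
    ("mixed_japanese_part", "Python は楽しい", "楽しい", True),
    ("latin_partial_word", "I love Python", "thon", False),
]

# Collect the ids of any cases whose result differs from the expectation.
failures = [
    case_id
    for case_id, stored_text, query, expected in CASES
    if is_match(query, stored_text) != expected
]
print(failures)  # []
```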

pytest tests/unittests/memory/test_in_memory_memory_service.py -v
======================== 21 passed in 1.01s =========================

google-cla · 2026-04-27T08:49:48Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

adk-bot · 2026-04-27T08:50:38Z

Response from ADK Triaging Agent

Hello @yodeee9, thank you for your contribution!

It looks like you have not yet signed the Contributor License Agreement (CLA). Please visit https://cla.developers.google.com/ to sign the agreement. Once you have done so, the CLA check will pass and we will be able to proceed with reviewing your PR.

Thanks!

Replace [A-Za-z]+ with \w+ so token extraction includes Unicode word characters. Add a non-ASCII containment fallback in search_memory() for scripts without whitespace word boundaries (Japanese, Chinese). Fixes google#5501

rohityan · 2026-04-27T22:36:12Z

Hi @yodeee9, thank you for your contribution! We appreciate you taking the time to submit this pull request. Could you please fix the failing tests so that we can proceed with the review?

adk-bot added the "services" label ([Component] This issue is related to runtime services, e.g. sessions, memory, artifacts, etc.) Apr 27, 2026

yodeee9 force-pushed the fix/memory-non-latin-search branch from f538de1 to d940392 April 27, 2026 08:55

rohityan self-assigned this Apr 27, 2026

Merge branch 'main' into fix/memory-non-latin-search

2f708f2

rohityan added the "request clarification" label ([Status] The maintainer needs clarification or more information from the author) Apr 27, 2026

yodeee9 wants to merge 2 commits into google:main from yodeee9:fix/memory-non-latin-search

3 participants
