Skip to content

fix: Redact SSO PII before deletion#38425

Open
ktyagiapphelix2u wants to merge 18 commits intoopenedx:masterfrom
ktyagiapphelix2u:ktyagi/SSOPII
Open

fix: Redact SSO PII before deletion#38425
ktyagiapphelix2u wants to merge 18 commits intoopenedx:masterfrom
ktyagiapphelix2u:ktyagi/SSOPII

Conversation

@ktyagiapphelix2u
Copy link
Copy Markdown
Contributor

@ktyagiapphelix2u ktyagiapphelix2u commented Apr 23, 2026

Description

Implements automatic PII redaction for UserSocialAuth records before deletion to prevent personally identifiable information from persisting after records are removed.

Jira Ticket

https://2u-internal.atlassian.net/browse/BOMS-514

@ktyagiapphelix2u ktyagiapphelix2u marked this pull request as ready for review April 23, 2026 11:29
@ktyagiapphelix2u ktyagiapphelix2u requested a review from a team as a code owner April 23, 2026 11:29
Comment thread openedx/core/djangoapps/user_api/accounts/tests/test_utils.py Outdated
Comment thread openedx/core/djangoapps/user_api/accounts/tests/test_utils.py Outdated
Comment thread openedx/core/djangoapps/user_api/accounts/utils.py Outdated
Comment thread openedx/core/djangoapps/user_api/accounts/signals.py Outdated
Comment thread openedx/core/djangoapps/user_api/management/commands/retire_user.py Outdated
Comment thread openedx/core/djangoapps/user_api/accounts/signals.py Outdated
@ktyagiapphelix2u
Copy link
Copy Markdown
Contributor Author

@robrap We’re dealing with multiple ways SSO records can get deleted through Django admin, user actions like unlinking accounts, bulk retirement scripts. The challenge is that we don’t control all of these paths, so we can’t reliably add PII redaction directly into each one.

Instead, we’ve set up a two-layer approach.

The first layer is a Django signal that runs automatically right before any SSO record is deleted. This acts as a safety net. No matter how the deletion is triggered whether it’s from admin, user action, the signal ensures sensitive fields like the UID and extra data are redacted. It’s centralized, consistent, so it won’t cause issues if it runs more than once.

The second layer is used only in cases we fully control, like user retirement flows. There, we proactively run a bulk redaction step before deleting records. This is much faster because it uses efficient database operations. When the delete happens afterward, the signal still fires, but it detects that the data is already redacted and simply exits without doing extra work.

Together, these two layers cover both safety and performance. The signal guarantees we never miss redaction, even in code we don’t control, while the explicit bulk step keeps large-scale operations efficient.


try:
update_fields = {}
redacted_uid = f'redacted_{instance.pk}@retired.invalid'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We should use something more generic, like redact-before-delete@safe.com (or whatever we've come up with). This is not a retired email in all cases.
  2. Since we are using this email as a flag between the bulk retirement and here, we should be using a constant that both pieces of code make use of, to ensure they stay in sync.
  3. I'd comment the short-circuit code below to mention why we are doing it. Something like:
These fields may have already been redacted as part of a bulk retirement,
so we skip the update if it is already done to reduce query count.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[for future] I wonder if we should have generic code for models with annotated PII that automatically introduce this redaction into the pre_delete signal?

Comment on lines +149 to +153
social_auth_records = list(UserSocialAuth.objects.filter(user_id=user.id))
for auth in social_auth_records:
auth.uid = get_redacted_social_auth_uid(auth.pk)
auth.extra_data = {}
UserSocialAuth.objects.bulk_update(social_auth_records, ['uid', 'extra_data'])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. See "Using F() Expressions" in https://websitehurdles.com/django-bulk-update/ as an example of how you could refer to the pk without an extra call.
  2. Doing so may make it impossible to use the shared method, but you could use the constants you had started with.

UserRetirementStatus.create_retirement(user)
# Unlink LMS social auth accounts
# Redact and unlink LMS social auth accounts.
social_auth_records = list(UserSocialAuth.objects.filter(user_id=user.id))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to put UserSocialAuth.objects.filter(user_id=user.id) in a variable and call the bulk_update and delete off of this? It might make it more clear that we're working from the same set.

Comment on lines +211 to +220
social_auth_queryset = UserSocialAuth.objects.filter(user_id=user.id)
social_auth_queryset.update(
uid=Concat(
Value(REDACTED_SOCIAL_AUTH_UID_PREFIX),
Cast('id', output_field=CharField()),
Value(REDACTED_SOCIAL_AUTH_UID_SUFFIX),
),
extra_data={},
)
social_auth_queryset.delete()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move to new method in utils called redact_and_delete_social_auth(user_id). This can be called from both locations, rather than duplicating code. Docstring can remind why we are redacting before deleting.

Comment on lines +150 to +160
# Redact and unlink LMS social auth accounts.
social_auth_queryset = UserSocialAuth.objects.filter(user_id=user.id)
social_auth_queryset.update(
uid=Concat(
Value(REDACTED_SOCIAL_AUTH_UID_PREFIX),
Cast('id', output_field=CharField()),
Value(REDACTED_SOCIAL_AUTH_UID_SUFFIX),
),
extra_data={},
)
social_auth_queryset.delete()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See related comment about redact_and_delete_social_auth.

and clears extra_data.
Blocks deletion if redaction fails to prevent PII leaks to downstream systems.
"""
if not instance or not instance.pk:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to greatly simplify this safety-net method to something like the following:

redacted_uid = get_redacted_social_auth_uid(instance.pk)

# safety-net in case the record wasn't redacted before delete.
if instance.extra_data or instance.uid != redacted_uid:
    logger.warn('Social auth link for ... was deleted without first being redacted.')
    redact_and_delete_social_auth(instance.user_id, skip_delete=True)
  • Reuses existing call (see other comments) and sends optional argument to skip the delete step.
  • The optional skip_delete is a little hacky, but the docstring for the method could note that it is only to be used with the delete signal, where delete was already called.
  • It is also a little hacky that the first link for the user will trigger redaction for all the remaining links (because it takes user_id and not pk, but it greatly simplifies the code to reuse.

I'd also hope that we wouldn't need any additional exception handling.

Comment on lines +13 to +15
# Prefix and suffix used to build a per-record redacted uid for UserSocialAuth.
REDACTED_SOCIAL_AUTH_UID_PREFIX = 'redacted-before-delete-'
REDACTED_SOCIAL_AUTH_UID_SUFFIX = '@safe.com'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may make more sense for these to live in utils.py instead.


def get_redacted_social_auth_uid(pk):
"""
Return the redacted uid for a UserSocialAuth record. Single source of truth for this format.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following is a possible update for this docstring.

Suggested change
Return the redacted uid for a UserSocialAuth record. Single source of truth for this format.
Return the redacted uid for a UserSocialAuth record.
This must match the format used in redact_and_delete_social_auth.

# Safety-net in case the record wasn't redacted before delete.
if instance.extra_data or instance.uid != redacted_uid:
logger.warning(
'Social auth link for user_id=%s, provider=%s was deleted without first being redacted.',
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we should make it less scary, since we are fixing the issue.

Suggested change
'Social auth link for user_id=%s, provider=%s was deleted without first being redacted.',
'Social auth link for user_id=%s, provider=%s was deleted without first being redacted. Redacting in pre_delete.',

Copy link
Copy Markdown
Contributor

@robrap robrap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to look at tests, but some minor comments. Looking good so far.

"""
Redact PII from all UserSocialAuth records for the given user, then delete them.

Redaction happens before deletion so that any observers see only sanitised data.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like when you include something like the following when documenting redact_and_delete functionality, because it is non-obvious:

Downstream copies of data may use soft-deletes, and redacting before deleting ensures PII for retired users (or future retirements) is not retained.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants