UMUTextStats Python

Work in progress Python port of UMUTextStats, a linguistic feature extraction framework inspired by LIWC and stylometric analysis systems.

The project analyzes text datasets and extracts lexical, morphosyntactic, stylistic, pragmatic, psycholinguistic, social-media and error-related features from an XML configuration file.

Status

This is an active migration/adaptation of UMUTextStats to Python.

Current features include:

CSV input loading
XML configuration loading
Text preprocessing
Cached pipeline stages
Common feature computation
Stanza POS/NER annotation
Dictionary-based dimensions
Pattern-based dimensions
Composite dimensions
Feature summarization
Feature ranking
Sparse/zero-only feature detection
Aggregate statistics by metadata groups
CSV/JSON output
Optional profiling stats

Installation

pip install -e .

CLI Usage

Analyze

Extract linguistic features from a dataset.

umutextstats analyze dataset.csv -t tweet -o features.csv

With profiling stats:

umutextstats analyze dataset.csv \
  -t tweet \
  -o features.csv \
  --stats stats.csv

Using a custom XML configuration:

umutextstats analyze dataset.csv \
  -t tweet \
  -c path/to/config.xml \
  -o features.csv

Disable cache:

umutextstats analyze dataset.csv \
  -t tweet \
  -o features.csv \
  --no-cache

Summarize

Compute descriptive statistics from extracted features.

umutextstats summarize features.csv

Generate JSON summary:

umutextstats summarize features.csv \
  -o summary.json

Force long/tidy CSV format:

umutextstats summarize features.csv \
  --format long \
  -o summary.csv

Force nested JSON format:

umutextstats summarize features.csv \
  --format nested \
  -o summary.json

Aggregate

Compute grouped statistics using metadata columns.

Example dataset:

id	text	split	author
1	...	train	alice
2	...	test	bob

Example features:

id	feature_1	feature_2
1	...	...
2	...	...

Aggregate by split:

umutextstats aggregate \
  features.csv \
  --metadata dataset.csv \
  --group-by split \
  --id-column id \
  -o aggregate_by_split.json

Aggregate by author:

umutextstats aggregate \
  features.csv \
  --metadata dataset.csv \
  --group-by author \
  --id-column id \
  -o aggregate_by_author.json

Output Formats

Supported output formats:

.csv
.json

Outputs include:

Feature matrices
Dataset summaries
Feature rankings
Aggregate/group statistics

UNIX Pipes and Standard Input

UMUTextStats supports UNIX pipes and standard streams, making it easy to integrate into shell workflows.

Analyze from standard input

echo 'tweet
hola mundo
esto es una prueba' \
| umutextstats analyze - -t tweet --only stylometry --no-stanza

Analyze and summarize in a single pipeline

cat dataset.csv \
| umutextstats analyze - -t tweet \
    --only stylometry \
    --head 100 \
    --no-stanza \
| umutextstats summarize - \
    --rank-by mean \
    --top 10

Combine multiple feature families

cat dataset.csv \
| umutextstats analyze - -t tweet \
    --only 'stylometry|errors|register' \
| umutextstats summarize - \
    --rank-by nonzero_ratio

Save intermediate features while still using pipes

cat dataset.csv \
| umutextstats analyze - -t tweet -o features.csv

umutextstats summarize features.csv \
    --rank-by mean \
    --top 20

Quick exploratory analysis

cat dataset.csv \
| umutextstats analyze - -t tweet \
    --head 20 \
    --only psycholinguistic-processes \
| umutextstats summarize - \
    --sparse-threshold 0.01

Cache Management

umutextstats includes a local cache system to avoid recomputing expensive NLP pipelines, embeddings, and feature extraction steps.

The cache is automatically invalidated when relevant processing parameters change (for example processors, GPU usage, or batch sizes).

Show cache information

Display cache directory, total files, and size:

umutextstats cache info

Example output:

Cache dir: .umutextstats_cache
Exists: True
Files: 182
Size: 4.2 GB

List cache files

Show cached files ordered by most recent usage:

umutextstats cache list

Limit the number of displayed files:

umutextstats cache list --limit 100

Example output:

2025-08-12 11:30:42      12.4 MB  stanza/abc123.pkl
2025-08-12 11:28:01      84.1 MB  embeddings/def456.parquet

Prune old cache files

Delete cache files older than a given number of days:

umutextstats cache prune --older-than-days 30

Skip confirmation:

umutextstats cache prune --older-than-days 30 --yes

Clear the entire cache

Delete the full cache directory:

umutextstats cache clear

Skip confirmation:

umutextstats cache clear --yes

Custom cache directory

All cache commands support a custom cache directory:

umutextstats cache info --cache-dir .cache_dev

Notes

Cache keys are parameter-aware.
Different NLP processor configurations generate different cache entries.
Cache cleanup is safe and only removes cached artifacts.
Recommended for large NLP pipelines and embedding workflows.

Feature Analysis

UMUTextStats includes utilities for:

Descriptive statistics
Feature ranking
Sparse feature detection
Zero-only feature detection

Future analysis modules may include:

Mutual Information
Information Gain
Effect Size
Discriminative feature analysis
Stylometric profiling

Development

Run tests with:

pytest

Notes

This project is still under development.

Results, APIs and internal feature implementations may change while the Python version is being aligned with the original UMUTextStats behavior.

Acknowledgements

This work is part of the research project AIInFunds (PDC2021-121112-I00), funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR. This work is also part of the research project LaTe4PSP (PID2019-107652RB-I00), funded by MCIN/AEI/10.13039/501100011033. Besides, it was partially supported by Fundacion Séneca -the Regional Agency for Science and Technology of Murcia (Spain)- through project 20963/PI/18. In addition, José Antonio García-Díaz is supported by Banco Santander and the University of Murcia through the Doctorado Industrial programme

Citation

@inproceedings{garcia2022umutextstats,
  title={Umutextstats: A linguistic feature extraction tool for spanish},
  author={Garc{\'\i}a-D{\'\i}az, Jos{\'e} Antonio and Vivancos-Vicente, Pedro Jos{\'e} and Almela, Angela and Valencia-Garc{\'\i}a, Rafael},
  booktitle={Proceedings of the thirteenth language resources and evaluation conference},
  pages={6035--6044},
  year={2022}
}

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
.github/workflows		.github/workflows
docs		docs
scripts		scripts
src		src
tests		tests
.gitignore		.gitignore
README.md		README.md
dataset.csv		dataset.csv
debug_dict.py		debug_dict.py
debug_gender_feats.py		debug_gender_feats.py
debug_pos.py		debug_pos.py
head		head
pyproject.toml		pyproject.toml
sample.py		sample.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

UMUTextStats Python

Status

Installation

CLI Usage

Analyze

Summarize

Aggregate

Output Formats

UNIX Pipes and Standard Input

Analyze from standard input

Analyze and summarize in a single pipeline

Combine multiple feature families

Save intermediate features while still using pipes

Quick exploratory analysis

Cache Management

Show cache information

List cache files

Prune old cache files

Clear the entire cache

Custom cache directory

Notes

Feature Analysis

Development

Notes

Acknowledgements

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

UMUTextStats Python

Status

Installation

CLI Usage

Analyze

Summarize

Aggregate

Output Formats

UNIX Pipes and Standard Input

Analyze from standard input

Analyze and summarize in a single pipeline

Combine multiple feature families

Save intermediate features while still using pipes

Quick exploratory analysis

Cache Management

Show cache information

List cache files

Prune old cache files

Clear the entire cache

Custom cache directory

Notes

Feature Analysis

Development

Notes

Acknowledgements

Citation

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages