Work in progress Python port of UMUTextStats, a linguistic feature extraction framework inspired by LIWC and stylometric analysis systems.
The project analyzes text datasets and extracts lexical, morphosyntactic, stylistic, pragmatic, psycholinguistic, social-media and error-related features from an XML configuration file.
This is an active migration/adaptation of UMUTextStats to Python.
Current features include:
- CSV input loading
- XML configuration loading
- Text preprocessing
- Cached pipeline stages
- Common feature computation
- Stanza POS/NER annotation
- Dictionary-based dimensions
- Pattern-based dimensions
- Composite dimensions
- Feature summarization
- Feature ranking
- Sparse/zero-only feature detection
- Aggregate statistics by metadata groups
- CSV/JSON output
- Optional profiling stats
pip install -e .Extract linguistic features from a dataset.
umutextstats analyze dataset.csv -t tweet -o features.csvWith profiling stats:
umutextstats analyze dataset.csv \
-t tweet \
-o features.csv \
--stats stats.csvUsing a custom XML configuration:
umutextstats analyze dataset.csv \
-t tweet \
-c path/to/config.xml \
-o features.csvDisable cache:
umutextstats analyze dataset.csv \
-t tweet \
-o features.csv \
--no-cacheCompute descriptive statistics from extracted features.
umutextstats summarize features.csvGenerate JSON summary:
umutextstats summarize features.csv \
-o summary.jsonForce long/tidy CSV format:
umutextstats summarize features.csv \
--format long \
-o summary.csvForce nested JSON format:
umutextstats summarize features.csv \
--format nested \
-o summary.jsonCompute grouped statistics using metadata columns.
Example dataset:
| id | text | split | author |
|---|---|---|---|
| 1 | ... | train | alice |
| 2 | ... | test | bob |
Example features:
| id | feature_1 | feature_2 |
|---|---|---|
| 1 | ... | ... |
| 2 | ... | ... |
Aggregate by split:
umutextstats aggregate \
features.csv \
--metadata dataset.csv \
--group-by split \
--id-column id \
-o aggregate_by_split.jsonAggregate by author:
umutextstats aggregate \
features.csv \
--metadata dataset.csv \
--group-by author \
--id-column id \
-o aggregate_by_author.jsonSupported output formats:
.csv
.json
Outputs include:
- Feature matrices
- Dataset summaries
- Feature rankings
- Aggregate/group statistics
UMUTextStats supports UNIX pipes and standard streams, making it easy to integrate into shell workflows.
echo 'tweet
hola mundo
esto es una prueba' \
| umutextstats analyze - -t tweet --only stylometry --no-stanzacat dataset.csv \
| umutextstats analyze - -t tweet \
--only stylometry \
--head 100 \
--no-stanza \
| umutextstats summarize - \
--rank-by mean \
--top 10cat dataset.csv \
| umutextstats analyze - -t tweet \
--only 'stylometry|errors|register' \
| umutextstats summarize - \
--rank-by nonzero_ratiocat dataset.csv \
| umutextstats analyze - -t tweet -o features.csv
umutextstats summarize features.csv \
--rank-by mean \
--top 20cat dataset.csv \
| umutextstats analyze - -t tweet \
--head 20 \
--only psycholinguistic-processes \
| umutextstats summarize - \
--sparse-threshold 0.01umutextstats includes a local cache system to avoid recomputing
expensive NLP pipelines, embeddings, and feature extraction steps.
The cache is automatically invalidated when relevant processing parameters change (for example processors, GPU usage, or batch sizes).
Display cache directory, total files, and size:
umutextstats cache infoExample output:
Cache dir: .umutextstats_cache
Exists: True
Files: 182
Size: 4.2 GB
Show cached files ordered by most recent usage:
umutextstats cache listLimit the number of displayed files:
umutextstats cache list --limit 100Example output:
2025-08-12 11:30:42 12.4 MB stanza/abc123.pkl
2025-08-12 11:28:01 84.1 MB embeddings/def456.parquet
Delete cache files older than a given number of days:
umutextstats cache prune --older-than-days 30Skip confirmation:
umutextstats cache prune --older-than-days 30 --yesDelete the full cache directory:
umutextstats cache clearSkip confirmation:
umutextstats cache clear --yesAll cache commands support a custom cache directory:
umutextstats cache info --cache-dir .cache_dev- Cache keys are parameter-aware.
- Different NLP processor configurations generate different cache entries.
- Cache cleanup is safe and only removes cached artifacts.
- Recommended for large NLP pipelines and embedding workflows.
UMUTextStats includes utilities for:
- Descriptive statistics
- Feature ranking
- Sparse feature detection
- Zero-only feature detection
Future analysis modules may include:
- Mutual Information
- Information Gain
- Effect Size
- Discriminative feature analysis
- Stylometric profiling
Run tests with:
pytestThis project is still under development.
Results, APIs and internal feature implementations may change while the Python version is being aligned with the original UMUTextStats behavior.
This work is part of the research project AIInFunds (PDC2021-121112-I00), funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR. This work is also part of the research project LaTe4PSP (PID2019-107652RB-I00), funded by MCIN/AEI/10.13039/501100011033. Besides, it was partially supported by Fundacion Séneca -the Regional Agency for Science and Technology of Murcia (Spain)- through project 20963/PI/18. In addition, José Antonio García-Díaz is supported by Banco Santander and the University of Murcia through the Doctorado Industrial programme
@inproceedings{garcia2022umutextstats,
title={Umutextstats: A linguistic feature extraction tool for spanish},
author={Garc{\'\i}a-D{\'\i}az, Jos{\'e} Antonio and Vivancos-Vicente, Pedro Jos{\'e} and Almela, Angela and Valencia-Garc{\'\i}a, Rafael},
booktitle={Proceedings of the thirteenth language resources and evaluation conference},
pages={6035--6044},
year={2022}
}
