timeparser

Java 21 library for turning textual date expressions into machine-readable time spans, sortable day ranges, and DDB facet ids.

What it does

Given input such as Mai 2010, 15. Jh., or vor 500 Mio. Jahren, the parser can produce:

a compact legacy string like time_62100|time_62110 2455318|2455348
a structured result object with normalization and error metadata
an HTTP JSON response via the embedded demo server

The time_XXX identifiers refer to the DDB Zeitvokabular and are publicly browsable here:

DDB Zeitvokabular: https://xtree-public.digicult-verbund.de/vocnet/?uriVocItem=http://ddb.vocnet.org/zeitvokabular/&startNode=dat00113&lang=de&d=n

Requirements

Java 21
Maven Wrapper included, or Maven 3.9+

Maven

<dependency>
    <groupId>de.ddb.labs</groupId>
    <artifactId>timeparser</artifactId>
    <version>2.0.0-SNAPSHOT</version>
</dependency>

Quick start

import de.ddb.labs.timeparser.TimeParser;
import de.ddb.labs.timeparser.TimeParser.IndexDaysMode;

TimeParser parser = TimeParser.getInstance();

String julian = parser.parseTime("Mai 2010");
String legacy = parser.parseTime("Mai 2010", IndexDaysMode.LEGACY);

parseTime(...) is the fail-safe API: it returns "" on parse failure. For debugging or integration code, prefer parseTimeResult(...).

Output model

The compact output format is:

<facetString> <startIndexDay>|<endIndexDay>

Example:

time_62100|time_62110 2455318|2455348

Meaning:

facetString: pipe-separated DDB Zeitvokabular ids
startIndexDay / endIndexDay: sortable numeric day bounds
default index mode is JULIAN_DAY; LEGACY is kept for compatibility
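
Consumers of the compact string often need its parts separately. The following is a minimal sketch of how that split could be done on the client side; the `CompactOutput` record and its `parse` method are hypothetical helpers for illustration and are not part of the library.

```java
import java.util.List;

/** Hypothetical holder for the parts of "<facetString> <startIndexDay>|<endIndexDay>". */
record CompactOutput(List<String> facetIds, long startIndexDay, long endIndexDay) {

    /** Splits a compact output string; rejects "" (the fail-safe error value). */
    static CompactOutput parse(String compact) {
        if (compact == null || compact.isEmpty()) {
            throw new IllegalArgumentException("empty output, the input could not be parsed");
        }
        // The facet string and the day bounds are separated by the last space.
        int sep = compact.lastIndexOf(' ');
        if (sep < 0) {
            throw new IllegalArgumentException("missing day bounds: " + compact);
        }
        String facetString = compact.substring(0, sep);
        String[] days = compact.substring(sep + 1).split("\\|");
        if (days.length != 2) {
            throw new IllegalArgumentException("expected start|end: " + compact);
        }
        // Facet ids are pipe-separated; an empty facet string yields no ids.
        List<String> facetIds = facetString.isEmpty()
                ? List.of()
                : List.of(facetString.split("\\|"));
        return new CompactOutput(facetIds, Long.parseLong(days[0]), Long.parseLong(days[1]));
    }
}
```

For the example above, `CompactOutput.parse("time_62100|time_62110 2455318|2455348")` yields the two facet ids and the bounds 2455318 and 2455348.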

Parsing pipeline

Every input passes through six steps. Steps 1 to 4 and 6 are public methods on TimeParser, and step 5 is TimeSpanParser.parse(...), so you can call, test, or inspect each one individually.

Raw input string
       │
       ▼  Step 1: applyNormalizationRules(input)
       │  normalizations.csv — expands abbreviations, spelling variants,
       │  century/millennium expressions
       │  e.g. "200000000 v. Chr."  →  "-200000000"
       │       "15. Jh."            →  "15. Jahrhundert"
       ▼
       │  Step 2: tokenizeMonthsAndWeekdays(preprocessed)
       │  Replaces month and weekday names with match tokens
       │  e.g. "Mai 2010"  →  "MM 2010"
       ▼
       │  Step 3: findMatchingRules(tokenized)    ← rules.csv
       │  Must yield exactly one matching input mask; an empty list (→ error)
       │  or more than one match (→ MULTIPLE_RULES error) fails the parse
       ▼
       │  Step 4: applyRule(preprocessed, rule)
       │  Applies the matched output pattern
       │  e.g. "1923 ?"  →  "ca. 1923"
       │       "Mai 2010" → "2010-05"           (transformedInput)
       ▼
       │  Step 5: new TimeSpanParser().parse(transformedInput)
       │  Converts the canonical expression to a concrete date range
       │  e.g. startDate=2010-05-01, endDate=2010-05-31   (TimeSpan)
       ▼
       │  Step 6a: resolveFacetNotations(timeSpan) ← facets.csv
       │           buildFacetString(facetNotations)
       │           e.g. "time_62100|time_62110"
       │
       │  Step 6b: computeIndexDay(date, indexDaysMode)
       │           Julian Day Number (default) or legacy algorithm
       │           e.g. startIndexDay=2455318, endIndexDay=2455348
       ▼
  "time_62100|time_62110 2455318|2455348"

Calling individual steps

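import java.time.LocalDate;
import java.util.List;
// Imports for the library types (TimeParser, Rule, TimeSpan, TimeSpanParser, FacetNotation) are omitted here.

TimeParser p = TimeParser.getInstance();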

// Step 1: applyNormalizationRules — normalizations.csv
String preprocessed = p.applyNormalizationRules("März 2010");
// → "März 2010"  (no normalization rule matches this input)

// Step 2: tokenizeMonthsAndWeekdays — month/weekday tokenization
String tokenized = p.tokenizeMonthsAndWeekdays(preprocessed);
// → "MM 2010"

// Step 3: findMatchingRules — rule matching
List<Rule> rules = p.findMatchingRules(tokenized);
// A successful parse requires exactly one match; an empty or ambiguous list is an error.
Rule rule = rules.get(0);

// Step 4: applyRule — rule application (uses preprocessed, not tokenized)
String transformedInput = p.applyRule(preprocessed, rule);
// → "2010-03"

// Step 5: TimeSpanParser.parse — time span parsing
TimeSpan span = new TimeSpanParser().parse(transformedInput);
LocalDate start = span.getStartDate();  // 2010-03-01
LocalDate end   = span.getEndDate();    // 2010-03-31

// Step 6a: resolveFacetNotations + buildFacetString
List<FacetNotation> notations = p.resolveFacetNotations(span);
String facetString = p.buildFacetString(notations);
// → "time_62100|time_62110"

// Step 6b: computeIndexDay — index day
long startDay = p.computeIndexDay(start, TimeParser.IndexDaysMode.JULIAN_DAY);
long endDay   = p.computeIndexDay(end,   TimeParser.IndexDaysMode.JULIAN_DAY);

For end-to-end parsing without inspecting intermediate steps, use parseTimeResult(...), which returns all of the above fields precomputed in a single ParseResult.
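
The Julian Day Numbers from step 6b can be cross-checked with the JDK alone. The sketch below uses `java.time.temporal.JulianFields.JULIAN_DAY`; whether `computeIndexDay` is implemented this way internally is an assumption, and `JulianDaySketch` is a hypothetical helper name. It says nothing about the LEGACY mode.

```java
import java.time.LocalDate;
import java.time.temporal.JulianFields;

/** Hypothetical helper: Julian Day Number of a calendar date via the JDK. */
final class JulianDaySketch {

    private JulianDaySketch() {}

    /** Returns the integer Julian Day Number for the given ISO date. */
    static long julianDay(LocalDate date) {
        // JulianFields.JULIAN_DAY is defined as epoch day + 2440588.
        return date.getLong(JulianFields.JULIAN_DAY);
    }
}
```

With this helper, `julianDay(LocalDate.of(2010, 5, 1))` gives 2455318 and `julianDay(LocalDate.of(2010, 5, 31))` gives 2455348, which matches the index days shown for Mai 2010.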

Structured API

Available overloads:

String parseTime(String input)
String parseTime(String input, IndexDaysMode mode)
String parseTime(String input, String contextId)
String parseTime(String input, String contextId, IndexDaysMode mode)

ParseResult parseTimeResult(String input)
ParseResult parseTimeResult(String input, IndexDaysMode mode)
ParseResult parseTimeResult(String input, String contextId)
ParseResult parseTimeResult(String input, String contextId, IndexDaysMode mode)

Useful ParseResult fields:

normalizedInput — after step 1
matchingRules / matchedRule — after step 3
transformedInput — after step 4
timeSpan — after step 5
facetString / facetNotations — after step 6a
startIndexDay / endIndexDay — after step 6b
successful / errorType / errorMessage

Errors are aggregated internally and can be inspected via getErrorStats().

CSV-driven knowledge base

The parser behavior is data-driven:

src/main/resources/conf/timeparser/rules.csv — maps input masks and patterns to normalized parser expressions
src/main/resources/conf/timeparser/normalizations.csv — regex pre-normalization plus literal month/weekday token replacements
src/main/resources/conf/timeparser/facets.csv — maps year ranges to DDB facet ids and labels

In short, the code provides the parsing engine, and the CSV files provide most of the vocabulary and transformation knowledge.

All rules.csv examples are regression-tested in src/test/java/de/ddb/labs/timeparser/TimeParserTest.java.

rules.csv format

Each row has eight columns (the first row is a header and is skipped):

Column	Name	Description
0	Input mask	Character-by-character type annotation for the input string
1	Input pattern	Concrete variable names aligned with the mask
2	Input example	A sample input string that must match and parse correctly
3	Output mask	Character-by-character type annotation for the output string
4	Output pattern	Concrete variable names aligned with the output mask
5	Output example	Expected result of applying this rule to the input example
6	Test	`NA` (currently unused)
7	Output example ISO	Optional. If non-empty: expected `startDate/endDate` from the full pipeline as ISO-8601 dates, separated by `/`
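
Column 7 can be read with the JDK's ISO date parser. This is only a sketch of how a test or tool might interpret such a cell; the `DateRange` record and `parseCell` method are hypothetical, and how the library's own tests read the column is not shown here.

```java
import java.time.LocalDate;
import java.util.Optional;

/** Hypothetical holder for the expected start and end date of a rule example. */
record DateRange(LocalDate start, LocalDate end) {

    /** Parses "startDate/endDate"; an empty cell means no expectation is given. */
    static Optional<DateRange> parseCell(String cell) {
        if (cell == null || cell.isBlank()) {
            return Optional.empty();
        }
        String[] parts = cell.trim().split("/");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected startDate/endDate: " + cell);
        }
        // LocalDate.parse uses the ISO-8601 calendar date format (uuuu-MM-dd).
        return Optional.of(new DateRange(LocalDate.parse(parts[0]), LocalDate.parse(parts[1])));
    }
}
```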

Mask and pattern syntax

Every rule defines its input and output through a mask and a pattern of identical length. The mask character at each position determines the token type; the pattern character at the same position names the variable or reproduces the literal text.

Token types

Mask char	Pattern char	Token type	Description
`#`	any letter (except `M`, `G`)	Generic variable	Matches exactly one digit in the input. All consecutive `#` positions with the same pattern letter form one variable. No two variables in the same input specification may share the same initial letter.
`M` + `M`	two identical letters	Month variable	A two-character pair of `M` in the mask captures the two-character month token produced by Step 2.
`G` + `G`	two identical letters	Weekday variable	A two-character pair of `G` in the mask captures the two-character weekday token produced by Step 2.
any other char	same char	Literal text	Mask and pattern character must be identical. The mask character is used verbatim during input matching.

Key constraint — mask matching: # accepts any digit; literal characters must match exactly; spaces must align with spaces. This is enforced by isMatching() in TimeParser.

Key constraint — duplicate variable initials: Within a single input specification, no two variables may begin with the same letter. This is enforced when the pattern is parsed (output specifications may repeat variable initials, e.g. the same year variable JJJJ used in both start and end of a range).
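
The two constraints above can be illustrated with a short sketch. It is not the library's isMatching() implementation; `MaskSketch` and its methods are hypothetical, and it assumes that mask, pattern, and tokenized input all have the same length and that `M` and `G` in a mask always belong to a month or weekday variable.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration of mask matching and the duplicate-initial check. */
final class MaskSketch {

    private MaskSketch() {}

    /** '#' accepts any digit; every other mask character must equal the input character. */
    static boolean matches(String mask, String tokenized) {
        if (mask.length() != tokenized.length()) {
            return false;
        }
        for (int i = 0; i < mask.length(); i++) {
            char m = mask.charAt(i);
            char c = tokenized.charAt(i);
            if (m == '#') {
                if (c < '0' || c > '9') {
                    return false;
                }
            } else if (m != c) {
                return false; // literals, spaces, and the MM/GG tokens must match exactly
            }
        }
        return true;
    }

    /** True if two separate variables in an input specification start with the same letter. */
    static boolean hasDuplicateInitials(String mask, String pattern) {
        Set<Character> seen = new HashSet<>();
        char previous = 0; // letter of the variable run we are in, 0 when outside a variable
        for (int i = 0; i < mask.length(); i++) {
            char m = mask.charAt(i);
            boolean isVariable = m == '#' || m == 'M' || m == 'G';
            char letter = isVariable ? pattern.charAt(i) : 0;
            // A new run starts when the letter changes; its letter must not have been used before.
            if (isVariable && letter != previous && !seen.add(letter)) {
                return true;
            }
            previous = letter;
        }
        return false;
    }
}
```

Under these assumptions, `matches("MM ####", "MM 2010")` is true, while an input specification with mask `####-####` and pattern `JJJJ-JJJJ` is flagged by `hasDuplicateInitials`.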

Variable naming conventions (the specific letters are free to choose, but the following names are used consistently throughout rules.csv):

Pattern letters	Meaning
`JJJJ`	4-digit year
`XXXX`, `ZZZZ`, `YYYY`	Second/third year in a range
`TT`	2-digit day
`XX`	Second day in a range
`MM`	Month variable (2-char month token, must use mask `MM`)
`YY`	Second month in a range (must use mask `MM`)
`GG`	Weekday variable (2-char weekday token, must use mask `GG`)

Example

Input März 2010 is tokenized by Step 2 to MM 2010. The matching rule is:

input  mask:    MM ####
input  pattern: MM JJJJ
output mask:    ####-##
output pattern: JJJJ-MM

The parser reads two tokens from the input: MM → month variable with value März (resolved to 03 during output), JJJJ → year variable with value 2010. The output template produces 2010-03.
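
The example can be traced with a simplified sketch of variable extraction and output templating. This is an illustration under assumptions, not the library's applyRule: `RuleSketch` is a hypothetical class, it works on the tokenized input, and the caller supplies the resolved month number instead of resolving the month name itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of how mask and pattern drive extraction and output. */
final class RuleSketch {

    private RuleSketch() {}

    /** Collects variable values: positions with the same pattern letter form one variable. */
    static Map<Character, String> extract(String mask, String pattern, String tokenized) {
        Map<Character, String> vars = new LinkedHashMap<>();
        for (int i = 0; i < mask.length(); i++) {
            char m = mask.charAt(i);
            if (m == '#' || m == 'M' || m == 'G') {
                vars.merge(pattern.charAt(i), String.valueOf(tokenized.charAt(i)), String::concat);
            }
        }
        return vars;
    }

    /** Fills the output template; '#' positions take variable characters, others are literals. */
    static String fill(String outMask, String outPattern, Map<Character, String> vars) {
        StringBuilder out = new StringBuilder();
        int runPos = 0; // position inside the current run of one variable
        for (int i = 0; i < outMask.length(); i++) {
            if (outMask.charAt(i) != '#') {
                out.append(outMask.charAt(i));
                continue;
            }
            char letter = outPattern.charAt(i);
            boolean continuesRun = i > 0
                    && outMask.charAt(i - 1) == '#'
                    && outPattern.charAt(i - 1) == letter;
            runPos = continuesRun ? runPos + 1 : 0; // restart so a variable may be reused
            String value = vars.get(letter);
            if (value == null || runPos >= value.length()) {
                throw new IllegalArgumentException("no value for variable " + letter);
            }
            out.append(value.charAt(runPos));
        }
        return out.toString();
    }
}
```

For the rule above, `extract("MM ####", "MM JJJJ", "MM 2010")` yields `M = "MM"` and `J = "2010"`. After replacing the month token with its number (`vars.put('M', "03")`), `fill("####-##", "JJJJ-MM", vars)` returns `2010-03`.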

Semantics and limits

A few important caveats:

the parser is optimized for historical and catalog-style date strings, not arbitrary natural language
exactly one rule must match after normalization; ambiguous inputs are rejected
disjoint expressions such as 1944/1945,1949 are currently rejected
very large years are bounded by java.time.LocalDate
input years from -1000000000 to 999999999 are supported in the current implementation; larger magnitudes return "" through the fail-safe API
the fail-safe parseTime(...) methods return "" on errors; if you need diagnostics, use parseTimeResult(...)
request input is intentionally capped at 2048 characters to protect the parser from excessive memory and CPU pressure
diagnostic logging and stored error-stat values are abbreviated to keep monitoring data bounded
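
The java.time bound mentioned above can be checked directly against the JDK constants. The sketch only shows the year range that `LocalDate` itself can represent; how the parser maps input years (for example years v. Chr.) onto `LocalDate` years is not covered, and `YearBoundsSketch` is a hypothetical helper.

```java
import java.time.Year;

/** Hypothetical helper: checks whether a year fits into java.time.LocalDate. */
final class YearBoundsSketch {

    private YearBoundsSketch() {}

    /** LocalDate supports years from Year.MIN_VALUE (-999999999) to Year.MAX_VALUE (999999999). */
    static boolean isRepresentable(long year) {
        return year >= Year.MIN_VALUE && year <= Year.MAX_VALUE;
    }
}
```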

HTTP demo server

The project ships with a minimal Javalin-based HTTP wrapper in de.ddb.labs.timeparser.TimeParserHttpServer.

Build and run:

./mvnw -q -DskipTests package
java -jar target/timeparser-2.0.0-SNAPSHOT-shaded.jar

Optional environment variables:

TIMEPARSER_HOST — default 127.0.0.1
TIMEPARSER_PORT — default 8080

Request:

GET /?date=Mai%202010&indexDaysMode=JULIAN_DAY

Successful response shape:

{
  "successful": true,
  "input": "Mai 2010",
  "indexDaysMode": "JULIAN_DAY",
  "normalizedInput": "MM 2010",
  "transformedInput": "2010-05",
  "timeSpan": {
    "parsedInputString": "2010-05",
    "startISODate": "2010-05-01",
    "endISODate": "2010-05-31"
  },
  "facetString": "time_62100|time_62110",
  "startIndexDay": 2455318,
  "endIndexDay": 2455348,
  "output": "time_62100|time_62110 2455318|2455348"
}

Notes:

startISODate and endISODate make the wire format explicit: these are ISO-serialized calendar dates
empty strings and empty arrays are omitted from failure JSON
errorType and errorMessage are only present on failures
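
A client call against the demo server might look like the following sketch, which uses the JDK's built-in `java.net.http.HttpClient`. The class name `DemoClientSketch` and its methods are hypothetical; only the `date` and `indexDaysMode` query parameters are taken from the request shown above.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

/** Hypothetical client for the demo server's GET endpoint. */
final class DemoClientSketch {

    private DemoClientSketch() {}

    /** Builds the request URI; spaces are sent as %20 as in the example request. */
    static URI buildUri(String baseUrl, String date, String indexDaysMode) {
        // URLEncoder emits '+' for spaces; a literal '+' in the input is already encoded as %2B.
        String encoded = URLEncoder.encode(date, StandardCharsets.UTF_8).replace("+", "%20");
        return URI.create(baseUrl + "/?date=" + encoded + "&indexDaysMode=" + indexDaysMode);
    }

    /** Sends the GET request and returns the raw JSON body. */
    static String fetch(URI uri) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

With the server running on the defaults, `fetch(buildUri("http://127.0.0.1:8080", "Mai 2010", "JULIAN_DAY"))` should return a JSON document of the shape shown above.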

Build, test, package

./mvnw test
./mvnw -DskipTests package

The package phase also produces a runnable shaded jar:

target/timeparser-2.0.0-SNAPSHOT-shaded.jar

Docker

docker build -t timeparser .
docker run --rm -p 8080:8080 -e TIMEPARSER_HOST=0.0.0.0 -e TIMEPARSER_PORT=8080 timeparser
