feat(operator): Add Case Sensitivity to Keyword Search Operator#5600
Open
SarahAsad23 wants to merge 3495 commits into
Open
feat(operator): Add Case Sensitivity to Keyword Search Operator#5600SarahAsad23 wants to merge 3495 commits into
SarahAsad23 wants to merge 3495 commits into
Conversation
### Purpose: Since "Hub" has been chosen as the more formal name for the community, the previous name needs to be replaced with "Hub." ### Change: The 'Community' in the left sidebar has been replaced with 'Hub'. ### Demo: Before:  After: 
…bles. (apache#3168) ### Purpose: Although the column indicating publication already exists in both the dataset and workflow tables, they have different names and types. ### Change: Modified the dataset and workflow tables to unify the format of is_public across different tables. **To complete the database update, please execute `18.sql` located in the update folder.**
This PR unifies the design of the output mode, and make it a property on
an output port.
Previously we had two modes:
1. SET_SNAPSHOT (return a snapshot of a table)
2. SET_DELTA (return a delta of tuples)
And different chart types (e.g., HTML, bar chart, line chart). However,
we are only using HTML type after switching to pyplot.
Additionally, the output mode and chart type are associated with a
logical operator, and passed along to the downstream sink operator, this
does not support multiple output ports operators.
### New design
1. Move OutputMode onto an output port's property.
2. Unify to three modes:
a. SET_SNAPSHOT (return a snapshot of a table)
b. SET_DELTA (return a delta of tuples)
c. SINGLE_SNAPSHOT (only used for visualizations to return a html)
The SINGLE_SNAPSHOT is needed now as we need a way to differenciate a
HTML output vs a normal data table output. This is due to the storage
with mongo is limited by 16 mb, and HTMLs are usually larger than 16 mb.
After we remove this limitation on the storage, we will remove the
SINGLE_SNAPSHOT and fall back to SET_SNAPSHOT.
This PR refactors the handling of sink operators in Texera by removing the sink descriptor and introducing a streamlined approach to creating physical sink operators during compilation and scheduling. Additionally, it shifts the storage assignment logic from the logical layer to the physical layer. 1. **Sink Descriptor Removal:** Removed the sink descriptor, physical sink operators are no longer created through descriptors. In the future, we will remove physical sink operators. 2. **Sink Operator Creation:** - Introduced a temporary factory for creating physical sink operators without relying on a descriptor. - Physical sink operators are now considered part of the sub-plan of their upstream logical operator. For example: If the HashJoin logical operator requires a sink, its physical sub-plan includes the building physicalOp, probing physicalOp, and the sink physicalOp. 3. **Storage Assignment Refactor:** - Merged the storage assignment logic into the physical layer, removing it from the logical layer. - When a physical sink operator is created (either during compilation or scheduling), its associated storage is also created at the same moment.
This PR fixes the issue that, when MongoDB is used as the result storage, the execution keeps failing. The root cause is: when using MongoDB as the result storage, the schema has to be extracted from the physical op the code implementation uses the physical ops in subPlan to extract the schema out, which is incorrect as subPlan's physical ops are not propagated for output schemas How the fix is done is: using physicalPlan.getOperator, instead of subPlan.getOperator. In this way, the operator that is flowing to the downstream is from `physicalPlan`, not `subPlan`
This PR fixes an issue where operators marked for viewing results did not display the expected results due to a mismatch caused by string-based comparison instead of using operator identities. By updating the matching logic to rely on operator identities, this change ensures accurate identification of marked operators and correct display of their results.
Schema propagation is now handled in the physical plan to ensure all ports, both external and internal, have an associated schema. As a result, schema propagation in the logical plan is no longer necessary. In addition, workflow context was used to give user information so that schema propagation can resolve file names. This is also no longer needed as we are resolving files through an explicit call as a step of compiling. WorkflowContext is no longer needed to be set to Logical Operator.
The cache logic will be rewritten at the physical plan layer, using ports as the caching unit. This version is being removed temporarily.
Previously, all results were tied to logical operators. This PR modifies the engine to associate results with output ports, enabling better granularity and support for operators with multiple outputs. ### Key Change: StorageKey with PortIdentity The most significant update in this PR is the adjustment of the storage key format to include both the logical operator ID and the port ID. This ensures that logical operators with multiple output ports (e.g., Split) can have distinct storages created for each output port. For now, the frontend retrieves results from the default output port (port 0). In future updates, the frontend will be enhanced to support retrieving results from additional output ports, providing more flexibility in how results are accessed and displayed.
There are a few operator descriptors written in Java, which makes them difficult to use and maintain. This PR converts all such descriptors to Scala to streamline the migration process to new APIs and facilitate future work, such as operator offloading. Changed Operators: - PythonUDFSourceOpDescV2 - RUDFSourceOpDesc - SentimentAnalysisOpDesc - SpecializedFilterOpDesc - TypeCastingOpDesc
…de-based IDEs (apache#3180) Add Scala([Metals](https://github.com/scalameta/metals)) generated folders to gitignore to support VSCode and Cursor IDE. Metals is a popular plugin for VSCode-based IDE to develop scala applications.
This PR moves all definitions of protobuf under workflow-core to be within the core package name.
The scalapb proto definition both present in workflow-core and amber. This PR removes the second copy.
The Python protobuf-generated code was outdated. This PR updates the generation script to include all protobuf definitions from the workflow-core and amber sub-projects, ensuring that the latest Python code is generated and aligned with the current protobuf definitions.
Previously, creating a physical operator during compilation also required the creation of its corresponding executor instances. To delay this process so that executor instances were created within workers, we used a lambda function (in `OpExecInitInfo`). However, the lambda approach had a critical limitation: it was not serializable and it is language dependent. This PR addresses this issue by replacing the lambda functions in `OpExecInitInfo` with fully serializable Protobuf entities. The serialized information now ensures compatibility with distributed environments and is language-independent. Two primary types of `OpExecInitInfo` are introduced: 1. **`OpExecWithClassName`**: - **Fields**: `className: String`, `descString: String`. - **Behavior**: The language compiler dynamically loads the class specified by `className` and uses `descString` as its initialization argument. 2. **`OpExecWithCode`**: - **Fields**: `code: String`, `language: String`. - **Behavior**: The language compiler compiles the provided `code` based on the specified `language`. The arguments are already pre-populated into the code string. ### Special Cases The `ProgressiveSink` and `CacheSource` executors are treated as special cases. These executors require additional unique information (e.g., `storageKey`, `workflowIdentity`, `outputMode`) to initialize their executor instances. While this PR preserves the handling of these special cases, these executors will eventually be refactored or removed as part of the plan to move storage management to the port layer.
This PR improves exception handling in AsyncRPCServer to unwrap the actual exception from InvocationTargetException. Old: <img width="889" alt="截屏2024-12-31 上午2 38 34" src="https://github.com/user-attachments/assets/8ec40cce-1b7a-4ecc-8518-a67b7e79888b" /> New: <img width="1055" alt="截屏2024-12-31 上午2 33 18" src="https://github.com/user-attachments/assets/de1c3e42-0dbb-4dbe-8b76-dee486d5bbb9" />
This PR removes all schema propagation functions from the logical plan. Developers are now required to implement `SchemaPropagationFunc` directly within the PhysicalPlan. This ensures that each PhysicalOp has its own distinct schema propagation logic, aligning schema handling more closely with the execution layer. To accommodate the need for schema propagation in the logical plan (primarily for testing purposes), a new method, `getExternalOutputSchemas`, has been introduced. This method facilitates the propagation of schemas across all PhysicalOps within a logical operator, ensuring compatibility with existing testing workflows.
Each port must have exactly one schema. If multiple links are connected to the same port, they are required to share the same schema. This PR introduces a validation step during schema propagation to ensure this constraint is enforced as part of the compilation process. For example, consider a Union operator with a single input port that supports multiple links. If upstream operators produce differing output schemas, the validation will fail with an appropriate error message: 
…he#3156) #### This PR introduces the `CostEstimator` trait which estimates the cost of a region, given some resource units. - The cost estimator is used by `CostBasedScheduleGenerator` to calculate the cost of a schedule during search. - Currently we only consider one type of schedule for each region plan, which is a total order of the regions. The cost of the schedule (and also the cost of the region plan) is thus the summation of the cost of each region. - The resource units are currently passed as placeholders because we assume a region will have all the resources when doing the estimation. The units may be used in the future if we consider different methods of schedule-generation. For example, if we allow two regions to run concurrently, the units will be split in half for each region. #### A `DefaultCostEstimator` implementation is also added, which uses past execution statistics to estimate the wall-clock runtime of a region: - The runtime of each region is represented by the runtime of its longest-running operator. - The runtime of operators are estimated using the statistics from the **latest successful execution** of the workflow. - If such statistics do not exist (e.g., if it is the first execution, or if past executions all failed), we fall back to using number of materialized edges as the cost. - Added test cases using mock mysql data.
To simplify schema creation, this PR removes the Schema.builder() pattern and makes Schema immutable. All modifications now result in the creation of a new Schema instance.
PhysicalOp relies on the input port number to determine if an operator is a source operator. For Python UDF, from the changes in apache#3183, the input ports are not correctly associated with the PhysicalOp, causing all the Python UDFs to be recognized as source operators. This PR fixes the issue.
) The ubuntu-latest image has been updated to 24.04 from 22.04 in recent days. However, the new image is incompatible with libncurses5, requiring an upgrade to libncurses6. Unfortunately, after upgrading, sbt no longer functions as expected, an issue also documented here: [actions/setup-java#712](actions/setup-java#712). It appears that the 24.04 image does not include sbt by default. This PR addresses the issue by pinning the image to ubuntu-22.04. We can revisit and update the version when the 24.04 image becomes more stable and resolves these compatibility problems.
In this PR, we add the user avatar to the execution history panel. <img width="462" alt="Screenshot 2025-01-06 at 3 53 23 PM" src="https://github.com/user-attachments/assets/e4e662af-c1a9-4686-90f4-b6bef155a36b" />
…e#3147) # Implement Apache Iceberg for Result Storage <img width="556" alt="Screenshot 2025-01-06 at 3 18 19 PM" src="https://github.com/user-attachments/assets/4edadb64-ee28-48ee-8d3c-1d1891d69d6a" /> ## How to Enable Iceberg Result Storage 1. Update `storage-config.yaml`: - Set `result-storage-mode` to `iceberg`. ## Major Changes - **Introduced `IcebergDocument`**: A thread-safe `VirtualDocument` implementation for storing and reading results in Iceberg tables. - **Introduced `IcebergTableWriter`**: Append-only writer for Iceberg tables with configurable buffer size. - **Catalog and Data storage for Iceberg**: Uses a local file system (`file:/`) via `HadoopCatalog` and `HadoopFileIO`. This ensures Iceberg operates without relying on external storage services. - `ProgressiveSinkOpExec` with a new parameter `workerId` is added. Each writer of the result storage will take this `workerId` as one new parameter. ## Dependencies - Added Apache Iceberg-related libraries. - Introduced Hadoop-related libraries to support Iceberg's `HadoopCatalog` and `HadoopFileIO`. These libraries are used for placeholder configuration but do not enforce runtime dependency on HDFS. ## Overview of Iceberg Components ### `IcebergDocument` - Manages reading and organizing data in Iceberg tables. - Supports iterator-based incremental reads with thread-safe operations for reading and clearing data. - Initializes or overrides the Iceberg table during construction. ### `IcebergTableWriter` - Writes data as immutable Parquet files in an append-only manner. - Each writer uniquely prefixes its files to avoid conflicts (`workerIndex_fileIndex` format). - Not thread-safe—single-thread access is recommended. ## Data Storage via Iceberg Tables - **Write**: - Tables are created per `storage key`. - Writers append Parquet files to the table, ensuring immutability. - **Read**: - Readers use `IcebergDocument.get` to fetch data via an iterator. - The iterator reads data incrementally while ensuring data order matches the commit sequence of the data files. ## Data Reading Using File Metadata - Data files are read using `getUsingFileSequenceOrder`, which: - Retrieves and sorts metadata files (`FileScanTask`) by sequence numbers. - Reads records sequentially, skipping files or records as needed. - Supports range-based reading (`from`, `until`) and incremental reads. - Sorting ensures data consistency and order preservation. ## Hadoop Usage Without HDFS - The `HadoopCatalog` uses an empty Hadoop configuration, defaulting to the local file system (`file:/`). - This enables efficient management of Iceberg tables in local or network file systems without requiring HDFS infrastructure. --------- Co-authored-by: Shengquan Ni <13672781+shengquan-ni@users.noreply.github.com>
This PR removes the `MemoryDocument` class and its usage. Additionally, it updates the fallback mechanism for `MongoDocument`, changing it from `Memory` to `Iceberg`.
…t functioning properly (apache#3200) ### Purpose: Currently, a 400 error occurs when calling `persistWorkflow`. The reason is that the parameter sent from the frontend to the backend is `isPublished` instead of `isPublic`. ### Change: Change the name of the parameter sent to the backend. ### Demos: Before:  After: 
Remove the redundant Flarum user registration service from the Texera backend, as it is unnecessary. User registration for Flarum can be handled directly by calling the Flarum user registration API from the Texera frontend. This PR does not affect the functionality or lifecycle of either Texera or Flarum.
This PR addresses schema normalization and logic improvements for tracking operator runtime statistics in a workflow execution system. It introduces changes to the database schema, migration scripts, and Scala code responsible for inserting and managing runtime statistics. The goal is to reduce redundancy, improve maintainability, and ensure data consistency between `operator_executions` and `operator_runtime_statistics`. ### Schema Design 1. New Table Design: - `operator_executions`: Tracks execution metadata for each operator in a workflow execution. Each row contains `operator_execution_id`, `workflow_execution_id`, and `operator_id`. This table ensures that operator executions are uniquely identifiable. - `operator_runtime_statistics`: Tracks runtime statistics for each operator execution at specific timestamps. It includes `operator_execution_id` as a foreign key, ensuring a direct reference to `operator_executions`. 2. Normalization Improvements: - Replaced repeated `execution_id` and `operator_id` in `workflow_runtime_statistics` with a single foreign key `operator_execution_id`, pointing to `operator_executions`. - Split the previous large `workflow_runtime_statistics` table into smaller, more manageable tables, eliminating redundancy and improving data integrity. 3. Indexes and Keys: - Added a composite index on `operator_execution_id` and `time` in `operator_runtime_statistics` to speed up joins and queries ordered by time. ### Testing The `core/scripts/sql/update/19.sql` will create the two new tables, `operator_executions` and `operator_runtime_statistics`, and migrate the data from `workflow_runtime_statistics` to those two tables. --------- Co-authored-by: Kunwoo Park <kunwoopark@dhcp-10-8-059-073.mobile.reshsg.uci.edu> Co-authored-by: Kunwoo Park <kunwoopark@Kunwoos-MacBook-Pro.local> Co-authored-by: Kunwoo Park <kunwoopark@dhcp-10-8-034-248.mobile.reshsg.uci.edu> Co-authored-by: Kunwoo Park <kunwoopark@vcv070211.vpn.uci.edu> Co-authored-by: Kunwoo Park <kunwoopark@dhcp-10-8-012-059.mobile.reshsg.uci.edu> Co-authored-by: Kunwoo Park <kunwoopark@dhcp-172-31-252-248.mobile.uci.edu>
…3208) ### Purpose It looks to restricted/inactive users as they can clone a workflow, when they actually cannot. To reduce the confusion, the clone button is disabled in the GUI. fix apache#3066 ### Changes Added variables isAdminOrRegularUser in hub-workflow-detail-component.ts. Check the user role. If the user is not an admin or regular user, disable the clone button. https://github.com/user-attachments/assets/c426c50d-fb31-4377-a2b7-3de2a24da512 --------- Co-authored-by: Texera <texera@MacBook-Air.local>
…nel (apache#3479) ## Change Previously, we had operator descriptions but we did not use them in front end. This PR include the operator description defined in `OperatorInfo` in backend in operator editing panel.   
| Before | After | |--------------|--------------| | ||
This PR fix the issue that README.md displays the logo in the incorrect ratio due to the update of the logo source file. | Before | After | |--------------|--------------| |||
…ing Behavior (apache#3489) ### **Purpose:** - Introduce three independent site‐wide settings, logo, mini logo, and favicon, admin can customize the expanded sidebar logo, the collapsed sidebar icon, and the browser tab icon. - Provide a default fallback for each asset and support individual resets. - Eliminate the brief “flash” of default assets on page reload by applying both logo and mini logo before the main dashboard renders. ### **Changes** - In admin-settings.service.ts: moved all logo-path and favicon-path retrieval into the service, with default fallbacks. - In admin-settings.component.ts/html: Split out three file inputs and previews for logo, mini logo, and favicon, update changes and reset methods. - In dashboard.component.ts/html: Simply subscribe to loadLogos()), ensuring the sidebar only displays once the final URLs are available. ### **Demonstration** Default logo, mini logo, and favicon. <img width="1667" alt="logo1" src="https://github.com/user-attachments/assets/57c0e47e-d58d-46f9-a298-a6d84db314fb" /> <img width="43" alt="mini1" src="https://github.com/user-attachments/assets/a80adef0-6344-4fde-898c-5c7386a0c2c5" /> <img width="110" alt="favicon1" src="https://github.com/user-attachments/assets/5c7964d0-dace-44c4-9054-cbaf42bf619a" /> Admin update in Settings <img width="882" alt="change" src="https://github.com/user-attachments/assets/f0422fef-cfbb-423d-923f-82ffe4c500cb" /> Dashboard after Save <img width="1293" alt="logo2" src="https://github.com/user-attachments/assets/1b3fee1b-8d23-4ce2-94eb-365a7028bc80" /> <img width="44" alt="mini2" src="https://github.com/user-attachments/assets/47aaa3d3-13cb-46c7-802d-e22b471afc80" /> <img width="113" alt="favicon2" src="https://github.com/user-attachments/assets/8e392a33-94b3-40b2-9f26-4186e3fae7e6" /> Hard Refresh https://github.com/user-attachments/assets/f0959dca-7ae9-4f7d-bd0a-12e09786a9ac --------- Co-authored-by: Xinyuan Lin <xinyual3@uci.edu>
### Implementation Issue There is an issue with the current Output Port Materialization Writer Thread's `closeOutputStorageWriterIfNeeded` method where we terminate all the writer threads for all the ports when calling `closeOutputStorageWriterIfNeeded` for only 1 port. ### Bug The consequence of this incorrect logic is that an operator with multiple output ports will not terminate all its output ports if there are multiple output ports having materialization reader threads. Before apache#3460 this issue was not exposed because we have no such use cases. After apache#3460, A split operator connecting to a training operator will have both its output ports materialized, causing Split operator to misbehave. ### Fix This PR fixes that issue by letting each output port only terminate its own thread.
## Fix When the property is being checked to see if it is valid or not, null is mapped to "" using `c.value ?? ""` to make sure when an optional property is deselected, it will cause no unintended error.
This PR introduces an alpha parameter (between 0 and 1) to the Scatter Plot operator, allowing users to control the opacity of points in the scatter plot visualization. Example: <img width="1284" alt="scatterPlot" src="https://github.com/user-attachments/assets/6310e033-3f34-45f2-b02f-c1d4efd09f94" />
## New feature Users may need to export result of one or multiple operators as Parquet format. This includes exporting to local or dataset. ## How implemented In `IcebergDocument` we implemented `asInputStream` as the underlying function to zip all Parquet files stored in Iceberg. By implementing this function in `IcebergDocument`, we are now able to directly access the files stored by storage service. In our case, since we need Parquet format, we can directly stream back the files saved by Iceberg without need to convert and consume extra computation. ## Feature behavior As shown in the picture, now users can select Parquet as export type. 
…f input schemas (apache#3501) ## Overview This PR refines the workflow compilation during editing time by: - `/api/compile` now returns the mapping of operator ID to its output schemas - Frontend only sends "valid" (json schema wise) operators and links to the `WorkflowCompilingService` - When calculating the input schema of some operators, i.e. invalid operators but still need schema information, use the output schema mapping to do the inference. ## User Experience Differences ### Before When users just drop the operator, without configuring the required properties, the compiler is already complaining about the error:  ### After When the operator is not fully configured, it will NOT be sent to the CompilingService to do the compilation. Meanwhile, users can still get the schema auto-completion of this operator.  Also, frontend can use the output schemas of the upstream operators to detect if users connect two output ports with different schemas to the same input port: 
…. Case-sensitive keyword search feature to be re-applied in follow-up commit at the new common/workflow-operator paths.
…com/SarahAsad23/texera into sarah-keyword-search-case-sensitive # Conflicts: # common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/keywordSearch/CaseSensitiveAnalyzer.scala
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5600 +/- ##
============================================
- Coverage 52.38% 52.34% -0.05%
+ Complexity 2484 2481 -3
============================================
Files 1070 1071 +1
Lines 41359 41365 +6
Branches 4441 4442 +1
============================================
- Hits 21666 21651 -15
- Misses 18427 18439 +12
- Partials 1266 1275 +9
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Contributor
|
Hi @SarahAsad23 could you please add test cases so that we know this feature is working as expected? Thanks |
Contributor
|
@SarahAsad23 it also might be better to squash your commit history. This PR currently contains 3000+ commits that are unrelated to the PR. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this PR?
Supersedes #3510. Since the PR is old and the repository structure has changed, this PR reapplies the relevant changes on top of the current master branch.
This PR adds an option for case sensitivity to the keyword search operator. Users can now use a checkbox to specify whether their search should be case sensitive or case insensitive. This functionality is enabled through the addition of a CaseSensitiveAnalyzer that extends the base Lucene Analyzer for case sensitive searches, while the original StandardAnalyzer is used for case insensitive searches.
Any related issues, documentation, discussions?
Closes #3045.
How was this PR tested?
Tested Manually.
Was this PR authored or co-authored using generative AI tooling?
Co-authored using: Claude Code