diff --git a/get-started/sample-datasets/index.mdx b/get-started/sample-datasets/index.mdx
index 7e65f005e..d1447644d 100644
--- a/get-started/sample-datasets/index.mdx
+++ b/get-started/sample-datasets/index.mdx
@@ -7,6 +7,8 @@ title: 'Tutorials and example datasets'
 doc_type: 'landing-page'
 ---
+import { SampleDatasetExplorer } from '/snippets/components/SampleDatasetExplorer/SampleDatasetExplorer.jsx'
+
 These tutorials work with any ClickHouse deployment, including [ClickHouse Cloud](/get-started/setup/cloud).
@@ -20,39 +22,4 @@ In addition, the sample datasets provide a great experience on working with Clic
 learning important techniques and tricks, and seeing how to take advantage of the many powerful functions in ClickHouse. The sample datasets include:
-{/* The following table is automatically generated at build time
-by https://github.com/ClickHouse/clickhouse-docs/blob/main/scripts/autogenerate-table-of-contents.sh */}
-
-{/*AUTOGENERATED_START*/}
-| Page | Description |
-|-----|-----|
-| [Amazon customer review](/get-started/sample-datasets/amazon-reviews) | Over 150M customer reviews of Amazon products |
-| [AMPLab Big Data Benchmark](/get-started/sample-datasets/amplab-benchmark) | A benchmark dataset used for comparing the performance of data warehousing solutions. |
-| [Analyzing Stack Overflow data with ClickHouse](/get-started/sample-datasets/stackoverflow) | Analyzing Stack Overflow data with ClickHouse |
-| [Anonymized web analytics](/get-started/sample-datasets/anon-web-analytics-metrica) | Dataset consisting of two tables containing anonymized web analytics data with hits and visits |
-| [Brown University Benchmark](/get-started/sample-datasets/brown-benchmark) | A new analytical benchmark for machine-generated log data |
-| [COVID-19 open data](/get-started/sample-datasets/covid19) | COVID-19 Open-Data is a large, open-source database of COVID-19 epidemiological data and related factors like demographics, economics, and government responses |
-| [dbpedia dataset](/get-started/sample-datasets/dbpedia) | Dataset containing 1 million articles from Wikipedia and their vector embeddings |
-| [Environmental sensors data](/get-started/sample-datasets/environmental-sensors) | Over 20 billion records of data from Sensor.Community, a contributors-driven global sensor network that creates Open Environmental Data. |
-| [Foursquare places](/get-started/sample-datasets/foursquare-os-places) | Dataset with over 100 million records containing information about places on a map, such as shops, restaurants, parks, playgrounds, and monuments. |
-| [Geo data using the cell tower dataset](/get-started/sample-datasets/cell-towers) | Learn how to load OpenCelliD data into ClickHouse, connect Apache Superset to ClickHouse and build a dashboard based on data |
-| [GitHub events dataset](/get-started/sample-datasets/github-events) | Dataset containing all events on GitHub from 2011 to Dec 6 2020, with a size of 3.1 billion records. |
-| [Hacker News dataset](/get-started/sample-datasets/hacker-news) | Dataset containing 28 million rows of hacker news data. |
-| [Hacker News vector search dataset](/get-started/sample-datasets/hacker-news-vector-search) | Dataset containing 28+ million Hacker News postings & their vector embeddings |
-| [LAION 5B dataset](/get-started/sample-datasets/laion5b) | Dataset containing 100 million vectors from the LAION 5B dataset |
-| [Laion-400M dataset](/get-started/sample-datasets/laion) | Dataset containing 400 million images with English image captions |
-| [New York Public Library "What's on the Menu?" dataset](/get-started/sample-datasets/menus) | Dataset containing 1.3 million records of historical data on the menus of hotels, restaurants and cafes with the dishes along with their prices. |
-| [New York taxi data](/get-started/sample-datasets/nyc-taxi) | Data for billions of taxi and for-hire vehicle (Uber, Lyft, etc.) trips originating in New York City since 2009 |
-| [NOAA Global Historical Climatology Network](/get-started/sample-datasets/noaa) | 2.5 billion rows of climate data for the last 120 yrs |
-| [NYPD complaint data](/get-started/sample-datasets/nypd-complaint-data) | Ingest and query Tab Separated Value data in 5 steps |
-| [OnTime](/get-started/sample-datasets/ontime) | Dataset containing the on-time performance of airline flights |
-| [Star Schema Benchmark (SSB, 2009)](/get-started/sample-datasets/star-schema) | The Star Schema Benchmark (SSB) data set and queries |
-| [Taiwan historical weather datasets](/get-started/sample-datasets/tw-weather) | 131 million rows of weather observation data for the last 128 yrs |
-| [Terabyte click logs from Criteo](/get-started/sample-datasets/criteo) | A terabyte of click logs from Criteo |
-| [The UK property prices dataset](/get-started/sample-datasets/uk-price-paid) | Learn how to use projections to improve the performance of queries that you run frequently using the UK property dataset, which contains data about prices paid for real-estate property in England and Wales |
-| [TPC-DS (2012)](/get-started/sample-datasets/tpcds) | The TPC-DS benchmark data set and queries. |
-| [TPC-H (1999)](/get-started/sample-datasets/tpch) | The TPC-H benchmark data set and queries. |
-| [WikiStat](/get-started/sample-datasets/wikistat) | Explore the WikiStat dataset containing 0.5 trillion records. |
-| [Writing queries in ClickHouse using GitHub data](/get-started/sample-datasets/github) | Dataset containing all of the commits and changes for the ClickHouse repository |
-| [YouTube dataset of dislikes](/get-started/sample-datasets/youtube-dislikes) | A collection of dislikes of YouTube videos. |
-{/*AUTOGENERATED_END*/}
+
diff --git a/images/sample-datasets-grid/benchmarks-dark.jpg b/images/sample-datasets-grid/benchmarks-dark.jpg
new file mode 100644
index 000000000..a28104aa0
Binary files /dev/null and b/images/sample-datasets-grid/benchmarks-dark.jpg differ
diff --git a/images/sample-datasets-grid/benchmarks-light.jpg b/images/sample-datasets-grid/benchmarks-light.jpg
new file mode 100644
index 000000000..f2d1840bb
Binary files /dev/null and b/images/sample-datasets-grid/benchmarks-light.jpg differ
diff --git a/images/sample-datasets-grid/geo-location-dark.jpg b/images/sample-datasets-grid/geo-location-dark.jpg
new file mode 100644
index 000000000..000687358
Binary files /dev/null and b/images/sample-datasets-grid/geo-location-dark.jpg differ
diff --git a/images/sample-datasets-grid/geo-location-light.jpg b/images/sample-datasets-grid/geo-location-light.jpg
new file mode 100644
index 000000000..d05d15e97
Binary files /dev/null and b/images/sample-datasets-grid/geo-location-light.jpg differ
diff --git a/images/sample-datasets-grid/public-records-dark.jpg b/images/sample-datasets-grid/public-records-dark.jpg
new file mode 100644
index 000000000..dcb7ae9c8
Binary files /dev/null and b/images/sample-datasets-grid/public-records-dark.jpg differ
diff --git a/images/sample-datasets-grid/public-records-light.jpg b/images/sample-datasets-grid/public-records-light.jpg
new file mode 100644
index 000000000..56bb5e1e6
Binary files /dev/null and b/images/sample-datasets-grid/public-records-light.jpg differ
diff --git a/images/sample-datasets-grid/time-series-sensors-dark.jpg b/images/sample-datasets-grid/time-series-sensors-dark.jpg
new file mode 100644
index 000000000..82b83c404
Binary files /dev/null and b/images/sample-datasets-grid/time-series-sensors-dark.jpg differ
diff --git a/images/sample-datasets-grid/time-series-sensors-light.jpg b/images/sample-datasets-grid/time-series-sensors-light.jpg
new file mode 100644
index 000000000..1c6550330
Binary files /dev/null and b/images/sample-datasets-grid/time-series-sensors-light.jpg differ
diff --git a/images/sample-datasets-grid/vector-search-dark.jpg b/images/sample-datasets-grid/vector-search-dark.jpg
new file mode 100644
index 000000000..d4a0ae256
Binary files /dev/null and b/images/sample-datasets-grid/vector-search-dark.jpg differ
diff --git a/images/sample-datasets-grid/vector-search-light.jpg b/images/sample-datasets-grid/vector-search-light.jpg
new file mode 100644
index 000000000..45cc12dce
Binary files /dev/null and b/images/sample-datasets-grid/vector-search-light.jpg differ
diff --git a/images/sample-datasets-grid/web-social-analytics-dark.jpg b/images/sample-datasets-grid/web-social-analytics-dark.jpg
new file mode 100644
index 000000000..da4d635ac
Binary files /dev/null and b/images/sample-datasets-grid/web-social-analytics-dark.jpg differ
diff --git a/images/sample-datasets-grid/web-social-analytics-light.jpg b/images/sample-datasets-grid/web-social-analytics-light.jpg
new file mode 100644
index 000000000..1fc7c46de
Binary files /dev/null and b/images/sample-datasets-grid/web-social-analytics-light.jpg differ
diff --git a/snippets/components/SampleDatasetExplorer/SampleDatasetExplorer.jsx b/snippets/components/SampleDatasetExplorer/SampleDatasetExplorer.jsx
new file mode 100644
index 000000000..0bde32521
--- /dev/null
+++ b/snippets/components/SampleDatasetExplorer/SampleDatasetExplorer.jsx
@@ -0,0 +1,273 @@
+// SampleDatasetExplorer
+// A 3x2 grid of sample-dataset *categories*. Clicking a category expands it into
+// a grid of cards for that category's child dataset pages, with an animated
+// (staggered fade/scale) transition between the two views.
+//
+// Child pages don't have their own images yet, so they render as icon Cards.
+//
+// NOTE: Mintlify evals ONLY the exported component function, so every constant
+// (ACCENT, CATEGORIES) and helper MUST live inside the component body; module-level
+// declarations are not in scope at render time and throw "X is not defined".
+
+export const SampleDatasetExplorer = ({ categories }) => {
+  const ACCENT = '#FAFF69';
+
+  // Each category: id, title (also baked into the banner image), an icon used for
+  // its child cards, the two banner images, and the child dataset pages.
+  const CATEGORIES = [
+    {
+      id: 'benchmarks',
+      title: 'Benchmarks',
+      icon: 'gauge',
+      imgLight: '/images/sample-datasets-grid/benchmarks-light.jpg',
+      imgDark: '/images/sample-datasets-grid/benchmarks-dark.jpg',
+      datasets: [
+        { title: 'AMPLab Big Data Benchmark', href: '/get-started/sample-datasets/amplab-benchmark' },
+        { title: 'Brown University Benchmark', href: '/get-started/sample-datasets/brown-benchmark' },
+        { title: 'Star Schema Benchmark (SSB)', href: '/get-started/sample-datasets/star-schema' },
+        { title: 'TPC-DS', href: '/get-started/sample-datasets/tpcds' },
+        { title: 'TPC-H', href: '/get-started/sample-datasets/tpch' },
+      ],
+    },
+    {
+      id: 'geo-location',
+      title: 'Geo & location',
+      icon: 'map-pin',
+      imgLight: '/images/sample-datasets-grid/geo-location-light.jpg',
+      imgDark: '/images/sample-datasets-grid/geo-location-dark.jpg',
+      datasets: [
+        { title: 'Cell towers (OpenCelliD)', href: '/get-started/sample-datasets/cell-towers' },
+        { title: 'Foursquare places', href: '/get-started/sample-datasets/foursquare-os-places' },
+        { title: 'New York taxi data', href: '/get-started/sample-datasets/nyc-taxi' },
+      ],
+    },
+    {
+      id: 'public-records',
+      title: 'Public records & open data',
+      icon: 'landmark',
+      imgLight: '/images/sample-datasets-grid/public-records-light.jpg',
+      imgDark: '/images/sample-datasets-grid/public-records-dark.jpg',
+      datasets: [
+        { title: 'COVID-19 open data', href: '/get-started/sample-datasets/covid19' },
+        { title: 'NYPD complaint data', href: '/get-started/sample-datasets/nypd-complaint-data' },
+        { title: 'OnTime (airline flights)', href: '/get-started/sample-datasets/ontime' },
+        { title: 'UK property prices', href: '/get-started/sample-datasets/uk-price-paid' },
+        { title: "What's on the Menu? (NYPL)", href: '/get-started/sample-datasets/menus' },
+      ],
+    },
+    {
+      id: 'time-series-sensors',
+      title: 'Time series & sensors',
+      icon: 'activity',
+      imgLight: '/images/sample-datasets-grid/time-series-sensors-light.jpg',
+      imgDark: '/images/sample-datasets-grid/time-series-sensors-dark.jpg',
+      datasets: [
+        { title: 'Environmental sensors data', href: '/get-started/sample-datasets/environmental-sensors' },
+        { title: 'NOAA Global Historical Climatology Network', href: '/get-started/sample-datasets/noaa' },
+        { title: 'Taiwan historical weather', href: '/get-started/sample-datasets/tw-weather' },
+      ],
+    },
+    {
+      id: 'vector-search',
+      title: 'Vector search and embeddings',
+      icon: 'search',
+      imgLight: '/images/sample-datasets-grid/vector-search-light.jpg',
+      imgDark: '/images/sample-datasets-grid/vector-search-dark.jpg',
+      datasets: [
+        { title: 'dbpedia dataset', href: '/get-started/sample-datasets/dbpedia' },
+        { title: 'Hacker News vector search', href: '/get-started/sample-datasets/hacker-news-vector-search' },
+        { title: 'LAION 5B dataset', href: '/get-started/sample-datasets/laion5b' },
+        { title: 'Laion-400M dataset', href: '/get-started/sample-datasets/laion' },
+      ],
+    },
+    {
+      id: 'web-social',
+      title: 'Web and social analytics',
+      icon: 'globe',
+      imgLight: '/images/sample-datasets-grid/web-social-analytics-light.jpg',
+      imgDark: '/images/sample-datasets-grid/web-social-analytics-dark.jpg',
+      datasets: [
+        { title: 'Amazon customer reviews', href: '/get-started/sample-datasets/amazon-reviews' },
+        { title: 'Analyzing Stack Overflow data', href: '/get-started/sample-datasets/stackoverflow' },
+        { title: 'Anonymized web analytics', href: '/get-started/sample-datasets/anon-web-analytics-metrica' },
+        { title: 'Criteo terabyte click logs', href: '/get-started/sample-datasets/criteo' },
+        { title: 'GitHub events dataset', href: '/get-started/sample-datasets/github-events' },
+        { title: 'Hacker News dataset', href: '/get-started/sample-datasets/hacker-news' },
+        { title: 'Querying GitHub data', href: '/get-started/sample-datasets/github' },
+        { title: 'WikiStat', href: '/get-started/sample-datasets/wikistat' },
+        { title: 'YouTube dataset of dislikes', href: '/get-started/sample-datasets/youtube-dislikes' },
+      ],
+    },
+  ];
+
+  const cats = categories || CATEGORIES;
+
+  const [selectedId, setSelectedId] = useState(null);
+  const selected = cats.find((c) => c.id === selectedId) || null;
+
+  // Theme visibility is handled by explicit `.dark` descendant selectors in the
+  //