39 changes: 3 additions & 36 deletions get-started/sample-datasets/index.mdx
@@ -7,6 +7,8 @@ title: 'Tutorials and example datasets'
doc_type: 'landing-page'
---

import { SampleDatasetExplorer } from '/snippets/components/SampleDatasetExplorer/SampleDatasetExplorer.jsx'

<Tip>
These tutorials work with any ClickHouse deployment, including [ClickHouse Cloud](/get-started/setup/cloud).
</Tip>
@@ -20,39 +22,4 @@ In addition, the sample datasets provide a great experience on working with Clic
learning important techniques and tricks, and seeing how to take advantage of the many powerful
functions in ClickHouse. The sample datasets include:

{/* The following table is automatically generated at build time
by https://github.com/ClickHouse/clickhouse-docs/blob/main/scripts/autogenerate-table-of-contents.sh */}

{/*AUTOGENERATED_START*/}
| Page | Description |
|-----|-----|
| [Amazon customer review](/get-started/sample-datasets/amazon-reviews) | Over 150M customer reviews of Amazon products |
| [AMPLab Big Data Benchmark](/get-started/sample-datasets/amplab-benchmark) | A benchmark dataset used for comparing the performance of data warehousing solutions. |
| [Analyzing Stack Overflow data with ClickHouse](/get-started/sample-datasets/stackoverflow) | Analyzing Stack Overflow data with ClickHouse |
| [Anonymized web analytics](/get-started/sample-datasets/anon-web-analytics-metrica) | Dataset consisting of two tables containing anonymized web analytics data with hits and visits |
| [Brown University Benchmark](/get-started/sample-datasets/brown-benchmark) | A new analytical benchmark for machine-generated log data |
| [COVID-19 open data](/get-started/sample-datasets/covid19) | COVID-19 Open-Data is a large, open-source database of COVID-19 epidemiological data and related factors like demographics, economics, and government responses |
| [dbpedia dataset](/get-started/sample-datasets/dbpedia) | Dataset containing 1 million articles from Wikipedia and their vector embeddings |
| [Environmental sensors data](/get-started/sample-datasets/environmental-sensors) | Over 20 billion records of data from Sensor.Community, a contributors-driven global sensor network that creates Open Environmental Data. |
| [Foursquare places](/get-started/sample-datasets/foursquare-os-places) | Dataset with over 100 million records containing information about places on a map, such as shops, restaurants, parks, playgrounds, and monuments. |
| [Geo data using the cell tower dataset](/get-started/sample-datasets/cell-towers) | Learn how to load OpenCelliD data into ClickHouse, connect Apache Superset to ClickHouse and build a dashboard based on data |
| [GitHub events dataset](/get-started/sample-datasets/github-events) | Dataset containing all events on GitHub from 2011 to Dec 6 2020, with a size of 3.1 billion records. |
| [Hacker News dataset](/get-started/sample-datasets/hacker-news) | Dataset containing 28 million rows of hacker news data. |
| [Hacker News vector search dataset](/get-started/sample-datasets/hacker-news-vector-search) | Dataset containing 28+ million Hacker News postings & their vector embeddings |
| [LAION 5B dataset](/get-started/sample-datasets/laion5b) | Dataset containing 100 million vectors from the LAION 5B dataset |
| [Laion-400M dataset](/get-started/sample-datasets/laion) | Dataset containing 400 million images with English image captions |
| [New York Public Library "What's on the Menu?" dataset](/get-started/sample-datasets/menus) | Dataset containing 1.3 million records of historical data on the menus of hotels, restaurants and cafes with the dishes along with their prices. |
| [New York taxi data](/get-started/sample-datasets/nyc-taxi) | Data for billions of taxi and for-hire vehicle (Uber, Lyft, etc.) trips originating in New York City since 2009 |
| [NOAA Global Historical Climatology Network](/get-started/sample-datasets/noaa) | 2.5 billion rows of climate data for the last 120 yrs |
| [NYPD complaint data](/get-started/sample-datasets/nypd-complaint-data) | Ingest and query Tab Separated Value data in 5 steps |
| [OnTime](/get-started/sample-datasets/ontime) | Dataset containing the on-time performance of airline flights |
| [Star Schema Benchmark (SSB, 2009)](/get-started/sample-datasets/star-schema) | The Star Schema Benchmark (SSB) data set and queries |
| [Taiwan historical weather datasets](/get-started/sample-datasets/tw-weather) | 131 million rows of weather observation data for the last 128 yrs |
| [Terabyte click logs from Criteo](/get-started/sample-datasets/criteo) | A terabyte of click logs from Criteo |
| [The UK property prices dataset](/get-started/sample-datasets/uk-price-paid) | Learn how to use projections to improve the performance of queries that you run frequently using the UK property dataset, which contains data about prices paid for real-estate property in England and Wales |
| [TPC-DS (2012)](/get-started/sample-datasets/tpcds) | The TPC-DS benchmark data set and queries. |
| [TPC-H (1999)](/get-started/sample-datasets/tpch) | The TPC-H benchmark data set and queries. |
| [WikiStat](/get-started/sample-datasets/wikistat) | Explore the WikiStat dataset containing 0.5 trillion records. |
| [Writing queries in ClickHouse using GitHub data](/get-started/sample-datasets/github) | Dataset containing all of the commits and changes for the ClickHouse repository |
| [YouTube dataset of dislikes](/get-started/sample-datasets/youtube-dislikes) | A collection of dislikes of YouTube videos. |
{/*AUTOGENERATED_END*/}
<SampleDatasetExplorer />
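
Because this hunk replaces the autogenerated table with the `SampleDatasetExplorer` component, it may be worth confirming that no page was dropped in the move. The following is a minimal sketch and not part of this PR: it compares the slugs from the removed table rows with the slugs grouped under the component's `CATEGORIES` list, both copied by hand from this diff. The names `tableSlugs`, `groupedSlugs`, and `diffSlugs` are hypothetical.

```javascript
// Slugs (last path segment) of the table rows removed from index.mdx.
const tableSlugs = [
  'amazon-reviews', 'amplab-benchmark', 'stackoverflow', 'anon-web-analytics-metrica',
  'brown-benchmark', 'covid19', 'dbpedia', 'environmental-sensors',
  'foursquare-os-places', 'cell-towers', 'github-events', 'hacker-news',
  'hacker-news-vector-search', 'laion5b', 'laion', 'menus', 'nyc-taxi', 'noaa',
  'nypd-complaint-data', 'ontime', 'star-schema', 'tw-weather', 'criteo',
  'uk-price-paid', 'tpcds', 'tpch', 'wikistat', 'github', 'youtube-dislikes',
];

// Slugs grouped by category id, as listed in SampleDatasetExplorer.jsx.
const groupedSlugs = {
  'benchmarks': ['amplab-benchmark', 'brown-benchmark', 'star-schema', 'tpcds', 'tpch'],
  'geo-location': ['cell-towers', 'foursquare-os-places', 'nyc-taxi'],
  'public-records': ['covid19', 'nypd-complaint-data', 'ontime', 'uk-price-paid', 'menus'],
  'time-series-sensors': ['environmental-sensors', 'noaa', 'tw-weather'],
  'vector-search': ['dbpedia', 'hacker-news-vector-search', 'laion5b', 'laion'],
  'web-social': [
    'amazon-reviews', 'stackoverflow', 'anon-web-analytics-metrica', 'criteo',
    'github-events', 'hacker-news', 'github', 'wikistat', 'youtube-dislikes',
  ],
};

// Report pages present in the old table but absent from the grid, and vice versa.
function diffSlugs(expected, grouped) {
  const actual = Object.values(grouped).flat();
  const missing = expected.filter((slug) => !actual.includes(slug));
  const extra = actual.filter((slug) => !expected.includes(slug));
  return { missing, extra, total: actual.length };
}

const report = diffSlugs(tableSlugs, groupedSlugs);
console.log(report);
```

With the data in this diff, `missing` and `extra` both come back empty and `total` is 29, which suggests every page from the old table is reachable from one of the six categories.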
Binary file added images/sample-datasets-grid/benchmarks-dark.jpg
Binary file added images/sample-datasets-grid/benchmarks-light.jpg
273 changes: 273 additions & 0 deletions snippets/components/SampleDatasetExplorer/SampleDatasetExplorer.jsx
@@ -0,0 +1,273 @@
// SampleDatasetExplorer
// A 3x2 grid of sample-dataset *categories*. Clicking a category expands it into
// a grid of cards for that category's child dataset pages, with an animated
// (staggered fade/scale) transition between the two views.
//
// Child pages don't have their own images yet, so they render as icon Cards.
//
// NOTE: Mintlify evaluates ONLY the exported component function, so every constant
// (ACCENT, CATEGORIES) and helper MUST live inside the component body; module-level
// declarations are not in scope at render time and throw "X is not defined".

export const SampleDatasetExplorer = ({ categories }) => {
  const ACCENT = '#FAFF69';

  // Each category: id, title (also baked into the banner image), an icon used for
  // its child cards, the two banner images, and the child dataset pages.
  const CATEGORIES = [
    {
      id: 'benchmarks',
      title: 'Benchmarks',
      icon: 'gauge',
      imgLight: '/images/sample-datasets-grid/benchmarks-light.jpg',
      imgDark: '/images/sample-datasets-grid/benchmarks-dark.jpg',
      datasets: [
        { title: 'AMPLab Big Data Benchmark', href: '/get-started/sample-datasets/amplab-benchmark' },
        { title: 'Brown University Benchmark', href: '/get-started/sample-datasets/brown-benchmark' },
        { title: 'Star Schema Benchmark (SSB)', href: '/get-started/sample-datasets/star-schema' },
        { title: 'TPC-DS', href: '/get-started/sample-datasets/tpcds' },
        { title: 'TPC-H', href: '/get-started/sample-datasets/tpch' },
      ],
    },
    {
      id: 'geo-location',
      title: 'Geo & location',
      icon: 'map-pin',
      imgLight: '/images/sample-datasets-grid/geo-location-light.jpg',
      imgDark: '/images/sample-datasets-grid/geo-location-dark.jpg',
      datasets: [
        { title: 'Cell towers (OpenCelliD)', href: '/get-started/sample-datasets/cell-towers' },
        { title: 'Foursquare places', href: '/get-started/sample-datasets/foursquare-os-places' },
        { title: 'New York taxi data', href: '/get-started/sample-datasets/nyc-taxi' },
      ],
    },
    {
      id: 'public-records',
      title: 'Public records & open data',
      icon: 'landmark',
      imgLight: '/images/sample-datasets-grid/public-records-light.jpg',
      imgDark: '/images/sample-datasets-grid/public-records-dark.jpg',
      datasets: [
        { title: 'COVID-19 open data', href: '/get-started/sample-datasets/covid19' },
        { title: 'NYPD complaint data', href: '/get-started/sample-datasets/nypd-complaint-data' },
        { title: 'OnTime (airline flights)', href: '/get-started/sample-datasets/ontime' },
        { title: 'UK property prices', href: '/get-started/sample-datasets/uk-price-paid' },
        { title: "What's on the Menu? (NYPL)", href: '/get-started/sample-datasets/menus' },
      ],
    },
    {
      id: 'time-series-sensors',
      title: 'Time series & sensors',
      icon: 'activity',
      imgLight: '/images/sample-datasets-grid/time-series-sensors-light.jpg',
      imgDark: '/images/sample-datasets-grid/time-series-sensors-dark.jpg',
      datasets: [
        { title: 'Environmental sensors data', href: '/get-started/sample-datasets/environmental-sensors' },
        { title: 'NOAA Global Historical Climatology Network', href: '/get-started/sample-datasets/noaa' },
        { title: 'Taiwan historical weather', href: '/get-started/sample-datasets/tw-weather' },
      ],
    },
    {
      id: 'vector-search',
      title: 'Vector search and embeddings',
      icon: 'search',
      imgLight: '/images/sample-datasets-grid/vector-search-light.jpg',
      imgDark: '/images/sample-datasets-grid/vector-search-dark.jpg',
      datasets: [
        { title: 'dbpedia dataset', href: '/get-started/sample-datasets/dbpedia' },
        { title: 'Hacker News vector search', href: '/get-started/sample-datasets/hacker-news-vector-search' },
        { title: 'LAION 5B dataset', href: '/get-started/sample-datasets/laion5b' },
        { title: 'Laion-400M dataset', href: '/get-started/sample-datasets/laion' },
      ],
    },
    {
      id: 'web-social',
      title: 'Web and social analytics',
      icon: 'globe',
      imgLight: '/images/sample-datasets-grid/web-social-analytics-light.jpg',
      imgDark: '/images/sample-datasets-grid/web-social-analytics-dark.jpg',
      datasets: [
        { title: 'Amazon customer reviews', href: '/get-started/sample-datasets/amazon-reviews' },
        { title: 'Analyzing Stack Overflow data', href: '/get-started/sample-datasets/stackoverflow' },
        { title: 'Anonymized web analytics', href: '/get-started/sample-datasets/anon-web-analytics-metrica' },
        { title: 'Criteo terabyte click logs', href: '/get-started/sample-datasets/criteo' },
        { title: 'GitHub events dataset', href: '/get-started/sample-datasets/github-events' },
        { title: 'Hacker News dataset', href: '/get-started/sample-datasets/hacker-news' },
        { title: 'Querying GitHub data', href: '/get-started/sample-datasets/github' },
        { title: 'WikiStat', href: '/get-started/sample-datasets/wikistat' },
        { title: 'YouTube dataset of dislikes', href: '/get-started/sample-datasets/youtube-dislikes' },
      ],
    },
  ];

  const cats = categories || CATEGORIES;

  const [selectedId, setSelectedId] = useState(null);
  const selected = cats.find((c) => c.id === selectedId) || null;

  // Theme visibility is handled by explicit `.dark` descendant selectors in the
  // <style> block below (Mintlify's class strategy, the same approach as
  // IntegrationGrid). Tailwind `dark:` utilities are NOT reliable here: they
  // compile against the OS media query, so they'd ignore the in-app light/dark
  // toggle. Note the reversed-colour scheme: light mode shows the *dark* (black)
  // banner art, dark mode shows the *light* (yellow) art.
  const Banner = ({ cat, className }) => (
    <>
      <img className={`sde-img-dark ${className || ''}`} src={cat.imgDark} alt={cat.title} />
      <img className={`sde-img-light ${className || ''}`} src={cat.imgLight} alt={cat.title} />
    </>
  );
  return (
    <div className="sde-root my-8">
      <style dangerouslySetInnerHTML={{ __html: `
        @keyframes sde-pop {
          from { opacity: 0; transform: translateY(14px) scale(0.96); }
          to { opacity: 1; transform: translateY(0) scale(1); }
        }
        @keyframes sde-fade {
          from { opacity: 0; }
          to { opacity: 1; }
        }
        .sde-view { animation: sde-fade 0.25s ease both; }
        /* Reversed scheme: dark (black) art in light mode, light (yellow) art in dark mode.
           Use explicit .dark selectors; Tailwind dark: utilities follow the OS here. */
        .sde-root .sde-img-dark { display: block; }
        .sde-root .sde-img-light { display: none; }
        .dark .sde-root .sde-img-dark { display: none; }
        .dark .sde-root .sde-img-light { display: block; }
        .sde-tile {
          position: relative;
          display: block;
          width: 100%;
          padding: 0;
          border: none;
          background: transparent;
          border-radius: 0.9rem;
          overflow: hidden;
          cursor: pointer;
          animation: sde-pop 0.4s cubic-bezier(0.22, 1, 0.36, 1) both;
          transition: transform 0.25s cubic-bezier(0.22, 1, 0.36, 1), box-shadow 0.25s ease;
          box-shadow: 0 1px 3px rgba(0,0,0,0.12);
        }
        .sde-tile:hover {
          transform: translateY(-4px) scale(1.015);
          box-shadow: 0 12px 28px rgba(0,0,0,0.22);
        }
        .sde-tile:active { transform: translateY(-1px) scale(0.995); }
        .sde-tile img {
          width: 100%;
          height: 100%;
          object-fit: cover;
          margin: 0;
          transition: transform 0.4s cubic-bezier(0.22, 1, 0.36, 1);
        }
        .sde-tile:hover img { transform: scale(1.04); }
        /* hover hint overlay */
        .sde-tile-hint {
          position: absolute;
          inset: 0;
          display: flex;
          align-items: flex-end;
          justify-content: space-between;
          gap: 8px;
          padding: 12px 14px;
          background: linear-gradient(to top, rgba(0,0,0,0.55), rgba(0,0,0,0) 55%);
          opacity: 0;
          transition: opacity 0.25s ease;
          pointer-events: none;
        }
        .sde-tile:hover .sde-tile-hint { opacity: 1; }
        .sde-count {
          font-size: 0.78rem;
          font-weight: 600;
          color: #fff;
        }
        .sde-explore {
          font-size: 0.78rem;
          font-weight: 700;
          color: ${ACCENT};
          display: inline-flex;
          align-items: center;
          gap: 4px;
        }
        .sde-child { animation: sde-pop 0.45s cubic-bezier(0.22, 1, 0.36, 1) both; }
        .sde-back {
          display: inline-flex;
          align-items: center;
          gap: 6px;
          font-size: 0.875rem;
          font-weight: 600;
          padding: 6px 12px;
          border-radius: 9999px;
          cursor: pointer;
          background: transparent;
          border: 1px solid rgba(156,163,175,0.5);
          color: inherit;
          transition: all 0.2s ease;
        }
        .sde-back:hover { border-color: ${ACCENT}; }
        .sde-detail-banner {
          width: 100%;
          max-height: 220px;
          object-fit: cover;
          border-radius: 0.9rem;
          margin: 0 0 1.5rem 0;
          box-shadow: 0 8px 24px rgba(0,0,0,0.18);
          animation: sde-pop 0.4s cubic-bezier(0.22, 1, 0.36, 1) both;
        }
      `}} />

      {!selected ? (
        <div className="sde-view">
          <div className="grid grid-cols-1 sm:grid-cols-2 lg:grid-cols-3 gap-6">
            {cats.map((cat, i) => (
              <button
                key={cat.id}
                type="button"
                className="sde-tile"
                style={{ animationDelay: `${i * 60}ms`, aspectRatio: '4 / 3' }}
                onClick={() => setSelectedId(cat.id)}
                aria-label={`Explore ${cat.title} datasets`}
              >
                <Banner cat={cat} />
                <span className="sde-tile-hint">
                  <span className="sde-count">
                    {cat.datasets.length} dataset{cat.datasets.length === 1 ? '' : 's'}
                  </span>
                  <span className="sde-explore">
                    Explore
                    <svg width="14" height="14" viewBox="0 0 16 16" fill="none" xmlns="http://www.w3.org/2000/svg">
                      <path d="M6 4l4 4-4 4" stroke="currentColor" strokeWidth="1.6" strokeLinecap="round" strokeLinejoin="round" />
                    </svg>
                  </span>
                </span>
              </button>
            ))}
          </div>
        </div>
      ) : (
        <div className="sde-view" key={selected.id}>
          <div className="mb-6">
            <button type="button" className="sde-back" onClick={() => setSelectedId(null)}>
              <svg width="14" height="14" viewBox="0 0 16 16" fill="none" xmlns="http://www.w3.org/2000/svg">
                <path d="M10 4L6 8l4 4" stroke="currentColor" strokeWidth="1.6" strokeLinecap="round" strokeLinejoin="round" />
              </svg>
              All categories
            </button>
          </div>

          <Banner cat={selected} className="sde-detail-banner" />

          <div className="grid grid-cols-1 sm:grid-cols-2 lg:grid-cols-3 gap-6">
            {selected.datasets.map((ds, i) => (
              <div className="sde-child" key={ds.href} style={{ animationDelay: `${i * 50}ms` }}>
                <Card title={ds.title} icon={selected.icon} href={ds.href} />
              </div>
            ))}
          </div>
        </div>
      )}
    </div>
  );
};
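
The component accepts an optional `categories` prop that replaces the built-in `CATEGORIES` list. The render code reads `id`, `title`, `icon`, `imgLight`, `imgDark`, and each dataset's `title` and `href`, and it uses `id` and `href` as React keys, so a malformed override could fail in ways that are hard to spot. The sketch below shows one way such data might be pre-checked. `validateCategories` is a hypothetical helper that is not part of this PR, and the rules it enforces are inferred from how the component uses the data, not from any documented contract.

```javascript
// Hypothetical helper: returns a list of human-readable problems (empty means OK).
function validateCategories(categories) {
  const problems = [];
  const seenIds = new Set();
  const requiredStrings = ['id', 'title', 'icon', 'imgLight', 'imgDark'];

  categories.forEach((cat, i) => {
    // Fall back to the array position when the id itself is unusable.
    const label = typeof cat.id === 'string' && cat.id ? cat.id : `#${i}`;

    for (const key of requiredStrings) {
      if (typeof cat[key] !== 'string' || cat[key] === '') {
        problems.push(`${label}: missing ${key}`);
      }
    }

    // `id` is the React key for tiles and the lookup key for the selected view.
    if (seenIds.has(cat.id)) problems.push(`${label}: duplicate id`);
    seenIds.add(cat.id);

    if (!Array.isArray(cat.datasets)) {
      problems.push(`${label}: datasets must be an array`);
      return;
    }

    // `href` is the React key for child cards, so it must be unique per category.
    const seenHrefs = new Set();
    for (const ds of cat.datasets) {
      if (typeof ds.title !== 'string' || typeof ds.href !== 'string') {
        problems.push(`${label}: dataset needs title and href`);
        continue;
      }
      if (seenHrefs.has(ds.href)) problems.push(`${label}: duplicate href ${ds.href}`);
      seenHrefs.add(ds.href);
    }
  });

  return problems;
}

// Example data: one well-formed category, then a copy with a broken `datasets` field.
const good = {
  id: 'benchmarks',
  title: 'Benchmarks',
  icon: 'gauge',
  imgLight: '/images/sample-datasets-grid/benchmarks-light.jpg',
  imgDark: '/images/sample-datasets-grid/benchmarks-dark.jpg',
  datasets: [{ title: 'TPC-H', href: '/get-started/sample-datasets/tpch' }],
};
const okProblems = validateCategories([good]);
const badProblems = validateCategories([good, { ...good, datasets: 'oops' }]);

console.log(okProblems);
console.log(badProblems);
```

Returning a list of problems instead of throwing lets the caller decide whether to log, fail a build step, or fall back to the built-in list.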