简单说,DuckDB 是一款在类似于 SQLite 的嵌入到应用内使用的 OLAP 数据库,没有一般的 OLAP 数据库那么“重”,也能够为各种非大数据但是数据量也蛮大的 Dashboard 型的数据场景提供助力。
![[Quack, World!20240930155201.png]]
### 方便
马上开始使用,比安装各种 \*sql 数据库方便多了
```shell
brew install duckdb
duckdb
```
### SQL in Python
- DuckDB can also directly query Pandas DataFrames, Polars DataFrames and Arrow tables.
- 我一直用不顺 Pandas(虽然也有 sql 型的查询方式),极大阻碍了我的技能发展,这一次,可能真的不同。
```python
import duckdb
# directly query a Pandas DataFrame
import pandas as pd
pandas_df = pd.DataFrame({"a": [42]})
duckdb.sql("SELECT * FROM pandas_df")
# directly query a Polars DataFrame
import polars as pl
polars_df = pl.DataFrame({"a": [42]})
duckdb.sql("SELECT * FROM polars_df")
# directly query a pyarrow table
import pyarrow as pa
arrow_table = pa.Table.from_pydict({"a": [42]})
duckdb.sql("SELECT * FROM arrow_table")
```
### SQL 新特性
- USING 设置关联条件。
```sql
两个表里的 regionkey 是同一个字段
-- return the regions for the nations
SELECT n.*, r.*
FROM l_nations n JOIN l_regions r USING (regionkey);
```
- Semi and Anti Joins
- Semi joins return rows from the left table that have at least one match in the right table.
- Anti joins return rows from the left table that have no matches in the right table.
- 少写一个 where 啊!
- Positional Joins 按记录的顺序拼接列
- Always full outer joins
```sql
-- treat two data frames as a single table
SELECT df1.*, df2.*
FROM df1
POSITIONAL JOIN df2;
```
- GROUP BY ALL 和 ORDER BY ALL
- 少做一次 Copy 啊!代码也短一些。
```sql
-- Group by city and street_name to remove any duplicate values
SELECT city, street_name
FROM addresses
GROUP BY ALL;
-- GROUP BY city, street_name
-- compute the average income per city per street_name
-- Since income is wrapped in an aggregate function, do not include it in the GROUP BY
SELECT city, street_name, avg(income)
FROM addresses
GROUP BY ALL;
-- GROUP BY city, street_name
```
- 按拼音顺序排列。
![[Quack, World!20240930155308.png]]
- 抽样跑数
- 之前我都是自己 `rand()` 一列,均匀分布地抽取一些数据来处理的。
- 啊,这个“不努力”分布到底张啥样来的?
```sql
-- select a sample of 1% of the addresses table using default (system) sampling
SELECT *
FROM addresses
USING SAMPLE 1%;
-- select a sample of 1% of the addresses table using bernoulli sampling
SELECT *
FROM addresses
USING SAMPLE 1% (bernoulli);
```
- 展开 list 或者一个 struct,直接 Unnest `SELECT unnest({'a': 42, 'b': 84});`
- Qualify 过滤 window function 的结果
- 少套一个 where
```sql
-- Filter based on a WINDOW function defined in the SELECT clause
SELECT
...
row_number() OVER (...) AS function_rank
FROM ...
QUALIFY
function_rank < 3;
```
- filter 写指标更加优雅了。
- 好像也差不多?
```sql
-- "Pivot" the data out by year (move each year out to a separate column)
SELECT
count(i) FILTER (year = 2022) AS "2022",
count(CASE WHEN year = 2023 THEN i END) AS "2023"
FROM stacked_data;
```
- 一口气取一堆列,懒到家了
```sql
-- 不取 col1
SELECT * EXCLUDE (col1) FROM tbl;
-- 只改 col1
SELECT * REPLACE (col1 / 1000 AS col1) FROM tbl;
-- 一起来 aggregate
SELECT min(COLUMNS(*)), count(COLUMNS(*)) FROM numbers;
-- 懒得写字段名,那就 regex 一下
SELECT COLUMNS('(id|numbers?)') FROM numbers;
```