This tutorial walks you through using Scroll for data analysis and visualization, from basic concepts to more advanced techniques.
Scroll combines the simplicity of markdown-style syntax with data transformation and visualization capabilities. As the examples below show, you can load sample datasets, summarize and filter them, group rows and compute new columns, and draw charts.
Let's dive in!
Scroll comes with several sample datasets. Let's start with the famous iris dataset:
iris
 printTable
| sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|
| 6.1 | 3 | 4.9 | 1.8 | virginica |
| 5.6 | 2.7 | 4.2 | 1.3 | versicolor |
| 5.6 | 2.8 | 4.9 | 2 | virginica |
| 6.2 | 2.8 | 4.8 | 1.8 | virginica |
| 7.7 | 3.8 | 6.7 | 2.2 | virginica |
| 5.3 | 3.7 | 1.5 | 0.2 | setosa |
| 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 4.9 | 2.5 | 4.5 | 1.7 | virginica |
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 5 | 3.4 | 1.5 | 0.2 | setosa |
You can also load datasets from Vega's collection:
sampleData zipcodes.csv
 limit 0 5
  printTable
| zip_code | latitude | longitude | city | state | county |
|---|---|---|---|---|---|
| 501 | 40.922326 | -72.637078 | Holtsville | NY | Suffolk |
| 544 | 40.922326 | -72.637078 | Holtsville | NY | Suffolk |
| 601 | 18.165273 | -66.722583 | Adjuntas | PR | Adjuntas |
| 602 | 18.393103 | -67.180953 | Aguada | PR | Aguada |
| 603 | 18.455913 | -67.14578 | Aguadilla | PR | Aguadilla |
Let's explore some basic operations on the iris dataset:
iris
 summarize
  printTable
| name | type | incompleteCount | uniqueCount | count | sum | median | mean | min | max | mode |
|---|---|---|---|---|---|---|---|---|---|---|
| sepal_length | number | 0 | 8 | 10 | 57.699999999999996 | 5.6 | 5.77 | 4.9 | 7.7 | 5.6 |
| sepal_width | number | 0 | 8 | 10 | 31.599999999999998 | 3.2 | 3.1599999999999997 | 2.5 | 3.8 | 2.8 |
| petal_length | number | 0 | 8 | 10 | 39.8 | 4.65 | 3.9799999999999995 | 1.4 | 6.7 | 4.9 |
| petal_width | number | 0 | 7 | 10 | 13.699999999999996 | 1.75 | 1.3699999999999997 | 0.2 | 2.3 | 0.2 |
| species | string | 0 | 3 | 10 |  |  |  |  |  | virginica |
This gives us summary statistics for each column.
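To make the columns of that summary concrete, here is a minimal Python sketch (Python, not Scroll) that recomputes the sepal_length row from the ten sample rows printed earlier. The helper name summarize_column is hypothetical, and breaking mode ties by first appearance is an assumption that happens to agree with the table above.

```python
from collections import Counter
from statistics import mean, median

# sepal_length values copied from the iris table printed earlier
sepal_length = [6.1, 5.6, 5.6, 6.2, 7.7, 5.3, 6.2, 4.9, 5.1, 5.0]

def summarize_column(values):
    """Return summary statistics for one numeric column (hypothetical helper)."""
    counts = Counter(values)
    return {
        "count": len(values),
        "uniqueCount": len(counts),
        "sum": sum(values),
        "median": median(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        # Assumption: when several values tie, the first one seen wins
        "mode": counts.most_common(1)[0][0],
    }

stats = summarize_column(sepal_length)
for name, value in stats.items():
    print(name, value)
```

The long decimals in the sum column of the table (for example 57.699999999999996) are ordinary floating-point rounding, not a data problem.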
Let's look at filtering:
iris
 where species = setosa
  printTable
 where species oneOf setosa virginica
  printTable
| sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|
| 5.3 | 3.7 | 1.5 | 0.2 | setosa |
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 5 | 3.4 | 1.5 | 0.2 | setosa |
| sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|
| 6.1 | 3 | 4.9 | 1.8 | virginica |
| 5.6 | 2.8 | 4.9 | 2 | virginica |
| 6.2 | 2.8 | 4.8 | 1.8 | virginica |
| 7.7 | 3.8 | 6.7 | 2.2 | virginica |
| 5.3 | 3.7 | 1.5 | 0.2 | setosa |
| 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 4.9 | 2.5 | 4.5 | 1.7 | virginica |
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 5 | 3.4 | 1.5 | 0.2 | setosa |
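The two tables above come from two independent filters over the same ten rows: an equality test and a membership test. A rough Python equivalent (not Scroll, and only a sketch over the sample rows shown in this tutorial) looks like this:

```python
# (sepal_length, species) pairs copied from the iris table printed earlier
rows = [
    (6.1, "virginica"), (5.6, "versicolor"), (5.6, "virginica"),
    (6.2, "virginica"), (7.7, "virginica"), (5.3, "setosa"),
    (6.2, "virginica"), (4.9, "virginica"), (5.1, "setosa"),
    (5.0, "setosa"),
]

# Equality filter: keep rows whose species is exactly "setosa"
setosa_only = [row for row in rows if row[1] == "setosa"]

# Membership filter: keep rows whose species is any of the listed values
wanted = {"setosa", "virginica"}
setosa_or_virginica = [row for row in rows if row[1] in wanted]

print(len(setosa_only), len(setosa_or_virginica))  # prints: 3 9
```

Both filters start from the full dataset, which is why the second table has nine rows rather than being a subset of the first.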
Let's start with a simple scatterplot of the iris data:
iris
 scatterplot
  x sepal_width
  y sepal_length
  title Sepal Length vs Width
  fill species
Let's look at some time series data:
sampleData seattle-weather.csv
 parseDate date
  linechart
   x date
   y temp_max
   title Maximum Temperature in Seattle
   stroke steelblue
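The date column has to be parsed before plotting so that the x axis is ordered by time rather than by text. As an illustration only, here is the same idea in Python (not Scroll). The three rows are hypothetical, and the ISO-style date format is an assumption about how the file stores dates.

```python
from datetime import datetime

# Hypothetical (date string, temp_max) rows; the YYYY-MM-DD format is an assumption
raw = [("2012-01-03", 11.7), ("2012-01-01", 12.8), ("2012-01-02", 10.6)]

# Convert each date string into a real date object
parsed = [(datetime.strptime(text, "%Y-%m-%d").date(), temp) for text, temp in raw]

# Order the points chronologically, as a line chart expects
parsed.sort(key=lambda point: point[0])

for day, temp in parsed:
    print(day.isoformat(), temp)
```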
Let's create a bar chart showing precipitation:
sampleData seattle-weather.csv
 groupBy weather
  reduce precipitation mean precip_avg
  barchart
   x weather
   y precip_avg
   fill teal
   title Average Precipitation by Weather Type
Let's look at some more complex transformations:
sampleData weather.csv
 groupBy weather
  reduce temp_max mean avg_max_temp
  reduce temp_min mean avg_min_temp
  orderBy -avg_max_temp
   printTable
| count | weather | avg_max_temp | avg_min_temp |
|---|---|---|---|
| 129 | drizzle | 18.555813953488368 | 10.143410852713178 |
| 1674 | sun | 18.064157706093184 | 8.87275985663083 |
| 459 | rain | 15.535294117647041 | 9.04727668845315 |
| 582 | fog | 15.261855670103111 | 8.527319587628869 |
| 78 | snow | 4.528205128205127 | -1.4346153846153844 |
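The group, reduce, and order steps can be sketched outside Scroll too. The Python below (not Scroll) groups rows by a key, takes the mean of one column per group, and sorts the groups in descending order of that mean. The weather file is not reproduced in this tutorial, so the ten iris rows shown earlier stand in for it. The helper group_mean and the column name avg_sepal_length are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# (sepal_length, species) pairs copied from the iris table printed earlier
rows = [
    (6.1, "virginica"), (5.6, "versicolor"), (5.6, "virginica"),
    (6.2, "virginica"), (7.7, "virginica"), (5.3, "setosa"),
    (6.2, "virginica"), (4.9, "virginica"), (5.1, "setosa"),
    (5.0, "setosa"),
]

def group_mean(pairs):
    """Group (value, key) pairs by key and average the values (hypothetical helper)."""
    groups = defaultdict(list)
    for value, key in pairs:
        groups[key].append(value)
    return [
        {"species": key, "count": len(values), "avg_sepal_length": mean(values)}
        for key, values in groups.items()
    ]

# Descending sort on the aggregated column, like a leading minus sign in orderBy
summary = sorted(group_mean(rows), key=lambda r: r["avg_sepal_length"], reverse=True)

for row in summary:
    print(row["count"], row["species"], row["avg_sepal_length"])
```

As in the table above, each output row carries the group size alongside the aggregated value.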
Let's add some computed columns:
iris
 compute ratio {sepal_length}/{sepal_width}
  where ratio > 2
   printTable
| sepal_length | sepal_width | petal_length | petal_width | species | ratio |
|---|---|---|---|---|---|
| 6.1 | 3 | 4.9 | 1.8 | virginica | 2.033333333333333 |
| 5.6 | 2.7 | 4.2 | 1.3 | versicolor | 2.074074074074074 |
| 6.2 | 2.8 | 4.8 | 1.8 | virginica | 2.2142857142857144 |
| 7.7 | 3.8 | 6.7 | 2.2 | virginica | 2.0263157894736845 |
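To check the result, the same computed column and filter can be reproduced in a few lines of Python (not Scroll). This is a sketch over the ten sample rows shown earlier, with only the columns the calculation needs.

```python
# (sepal_length, sepal_width, species) copied from the iris table printed earlier
rows = [
    (6.1, 3.0, "virginica"), (5.6, 2.7, "versicolor"), (5.6, 2.8, "virginica"),
    (6.2, 2.8, "virginica"), (7.7, 3.8, "virginica"), (5.3, 3.7, "setosa"),
    (6.2, 3.4, "virginica"), (4.9, 2.5, "virginica"), (5.1, 3.5, "setosa"),
    (5.0, 3.4, "setosa"),
]

# Add the computed column: ratio = sepal_length / sepal_width
with_ratio = [
    {"sepal_length": length, "sepal_width": width, "species": species,
     "ratio": length / width}
    for length, width, species in rows
]

# Keep only rows whose ratio is strictly greater than 2
kept = [row for row in with_ratio if row["ratio"] > 2]

for row in kept:
    print(row["species"], row["ratio"])
```

The row with sepal_length 5.6 and sepal_width 2.8 has a ratio of exactly 2, so the strict comparison drops it, which is why only four rows remain.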
Let's create a heatmap of annual precipitation values:
sampleData seattle-weather.csv
 splitYear
  groupBy year
   reduce precipitation mean precipitation_mean
   select year precipitation_mean
    transpose
     heatrix
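The pipeline above extracts a year from each date, averages precipitation per year, and then transposes the result so the years run across instead of down. Here is a minimal Python sketch of those three steps (not Scroll). The four rows are hypothetical and are not taken from seattle-weather.csv.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical (date, precipitation) rows, for illustration only
rows = [
    (date(2012, 1, 1), 0.0),
    (date(2012, 6, 1), 4.0),
    (date(2013, 1, 1), 2.0),
    (date(2013, 6, 1), 6.0),
]

# Step 1 and 2: derive the year and collect precipitation values per year
by_year = defaultdict(list)
for day, precipitation in rows:
    by_year[day.year].append(precipitation)

# One (year, mean precipitation) row per year, in chronological order
long_rows = [(year, mean(values)) for year, values in sorted(by_year.items())]

# Step 3: transpose, turning the two columns into two rows
wide = [list(column) for column in zip(*long_rows)]

print(wide)  # prints: [[2012, 2013], [2.0, 4.0]]
```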
You can create multiple visualizations from the same dataset:
iris
 scatterplot
  x sepal_length
  y sepal_width
  fill species
 barchart
  x species
  y sepal_length
  fill teal
  title Sepal Length by Species
This tutorial covered the basics of data science with Scroll: loading sample datasets, summarizing and filtering rows, grouping and computing columns, and building charts.