
qsv: Blazing-fast Data-Wrangling toolkit

[Linux build status](https://github.com/dathere/qsv/actions/workflows/rust.yml) [Windows build status](https://github.com/dathere/qsv/actions/workflows/rust-windows.yml) [macOS build status](https://github.com/dathere/qsv/actions/workflows/rust-macos.yml) [Security audit](https://github.com/dathere/qsv/actions/workflows/security-audit.yml) [Crates.io](https://crates.io/crates/qsv) [Discussions](https://github.com/dathere/qsv/discussions) [Minimum supported Rust version](#minimum-supported-rust-version) [FOSSA Status](https://app.fossa.com/projects/git%2Bgithub.com%2Fjqnatividad%2Fqsv?ref=badge_shield) [DOI](https://doi.org/10.5281/zenodo.17851335) [Wiki](https://github.com/dathere/qsv/wiki)

| | Table of Contents |
|:--------------------------|:-------------------------|
| qsv logo<br>_Hi-ho "Quicksilver" away!_<br>original logo details * Base AI-reimagined logo * Event logo archive | qsv is a data-wrangling toolkit for querying, slicing, sorting, analyzing, filtering, enriching, transforming, validating, joining, formatting, converting, chatting, FAIRifying & documenting tabular data (CSV, Excel, etc). Commands are simple, composable & ___"blazing fast"___.<br><br>* Commands<br>* Installation: CLI • MCP Server • Cowork Plugin<br>* Whirlwind Tour / Notebooks / Lessons & Exercises<br>* FAQ<br>* Performance Tuning<br>* 👉 Benchmarks 🚀<br>* Environment Variables<br>* Feature Flags<br>* Goals/Non-goals<br>* Testing<br>* NYC SOD 2022/csv,conf,v8/PyConUS 2025/csv,conf,v9/NYC SOD 2026<br>* _"Have we achieved ACI?"_ series - 1 • 2 • 3<br>* Sponsor |

</div>
<div align="center">

Try it out at qsv.dathere.com!

</div>

| <a name="available-commands">Command</a> | Description |
| --- | --- |
| apply✨ 📇🧠🤖🚀🔣👆⛩️ | Apply a series of string, date, math & currency transformations to given CSV column/s. It also has some basic NLP functions (similarity, sentiment analysis, profanity, eudex, language & name gender detection). Its summarize subcommand condenses a column or group of columns using an OpenAI API-compatible LLM (local or commercial) with customizable, Mini Jinja-templated per-record prompts. |
| applydp✨ 📇🚀🔣👆 !CKAN | <a name="applydp_deeplink"></a>applydp is a slimmed-down version of apply with only Datapusher+ relevant subcommands/operations (qsvdp binary variant only). |
| behead | Drop headers from a CSV. |
| blake3 🚀 | Compute or check BLAKE3 hashes of files. |
| cat 🗄️ | Concatenate CSV files by row or by column. |
| clean | Remove qsv-generated cache files (.idx index, stats & frequency caches) to reduce clutter & simplify data packaging. With --stale, only removes caches whose source changed or is gone. Opt-in flags also clean schema, validate & moarstats outputs. |
| clipboard✨ 🖥️ | Provide input from the clipboard or save output to the clipboard. |
| color✨ 🤯🐻‍❄️🖥️ | Outputs tabular data as a pretty, colorized table that always fits into the terminal. Apart from CSV and its dialects, Arrow, Avro/IPC, Parquet, JSON array & JSONL formats are supported with the "polars" feature. |
| count 📇🐻‍❄️🏎️ | Count the rows and optionally compile record width statistics of a CSV file. (11.87 seconds for a 15gb, 28m row NYC 311 dataset without an index. Instantaneous with an index.) If the polars feature is enabled, uses Polars' multithreaded, mem-mapped CSV reader for fast counts even without an index. |
| datefmt 📇🚀👆 | Formats recognized date fields (19 formats recognized) to a specified date format using strftime date format specifiers. |
| dedup 🤯🚀👆 | Remove duplicate rows (See also extdedup, extsort, sort & sortcheck commands). |
| describegpt 📇🗃️🤖🌐🪄📚⛩️ !CKAN | <a name="describegpt_deeplink"></a>Infer a "neuro-symbolic" Data Dictionary, Description & Tags or ask questions about a CSV with a configurable, Mini Jinja prompt file, using any OpenAI API-compatible LLM, including local LLMs. (e.g. Markdown, JSON, TOON, JSON Schema, Semantic Markdown, OKF, Everything, Spanish, Mandarin, Controlled Tags; --prompt "What are the top 10 complaint types by community board & borough by year?" - deterministic, hallucination-free SQL RAG result; iterative, session-based SQL RAG refinement - refined SQL RAG result) |
| diff 🚀 | Find the difference between two CSVs with ludicrous speed! e.g. _compare two CSVs with 1M rows x 9 columns in under 600ms!_ |
| edit | Replace the value of a cell specified by its row and column. |
| enum 👆 | Add a new column enumerating rows by adding a column of incremental or uuid identifiers. Can also be used to copy a column or fill a new column with a constant value. |
| excel 🚀 | Exports a specified Excel/ODS sheet to a CSV file. |
| exclude 📇👆 | Removes a set of CSV data from another set based on the specified columns. |
| explode 🔣👆 | Explode rows into multiple ones by splitting a column value based on the given separator. The inverse of implode. |
| extdedup 👆 | Remove duplicate rows from an arbitrarily large CSV/text file using a memory-mapped, on-disk hash table. Unlike the dedup command, this command does not load the entire file into memory nor does it sort the deduped file. |
| extsort 📇🚀👆 | Sort an arbitrarily large CSV/text file using a multithreaded external merge sort algorithm. |
| fetch✨ 📇🧠🌐 | Send/Fetch data to/from web services for every row using HTTP Get. Comes with HTTP/2 adaptive flow control, jaq JSON query language support, dynamic throttling (RateLimit) & caching with available persistent caching using Redis or a disk-cache. |
| fetchpost✨ 📇🧠🌐⛩️ | Similar to fetch, but uses HTTP Post (HTTP GET vs POST methods). Supports HTML form (application/x-www-form-urlencoded), JSON (application/json) and custom content types - with the ability to render payloads using CSV data using the Mini Jinja template engine. |
| fill 👆 | Fill empty values. |
| fixlengths | Force a CSV to have same-length records by either padding or truncating them. |
| flatten | A flattened view of CSV records. Useful for viewing one record at a time. e.g. qsv slice -i 5 data.csv \| qsv flatten. |
| fmt | Reformat a CSV with different delimiters, record terminators or quoting rules. (Supports ASCII delimited data.) |
| foreach✨ | Execute a shell command once per record in a given CSV file. |
| frequency 📇😣🏎️👆🪄!Luau | Build frequency distribution tables of each column. Uses multithreading to go faster if an index is present (Examples: CSV JSON TOON). |
| get✨ 📇🧠🌐 !CKAN | <a name="get_deeplink"></a>Get tabular data from local files, URLs (http/https & dathere://) & CKAN (ckan://) into a managed, queryable disk cache - with conditional revalidation (ETag/Last-Modified), transparent zstd compression, BLAKE3 hashing & automatic indexing. Cached resources are reusable by ANY qsv command via the dc: prefix (e.g. qsv stats dc:data.csv), with stale entries auto-refreshed. Efficiently seeds luau lookup tables, validate dynamicEnum reference data & speeds up Datapusher+ harvesting. |
| geocode✨ 📇🧠🚀🌐🔣👆🌎 | Geocodes a location against an updatable local copy of the Geonames cities & the Maxmind GeoLite2 databases - with caching and multi-threading, this offline path geocodes up to 360,000 records/sec! Can also geocode online (forward & reverse) via the OpenCage geocoder. |
| geoconvert✨ 🌎 | Convert between various spatial formats and CSV/SVG including GeoJSON, SHP, and more. |
| headers 🗄️ | Show the headers of a CSV. Or show the intersection of all headers between many CSV files. |
| implode 😣👆 | Implode rows by grouping on key column(s) and joining a value column with a given separator. The inverse of explode. |
| index | Create an index (📇) for a CSV. This is very quick (even the 15gb, 28m row NYC 311 dataset takes all of 14 seconds to index) & provides constant time indexing/random access into the CSV. With an index, count, sample & slice work instantaneously; random access mode is enabled in luau; and multithreading (🏎️) is enabled for the frequency, split, stats & schema commands. |
| input | Read CSV data with special commenting, quoting, trimming, line-skipping & non-UTF8 encoding handling rules. Typically used to "normalize" a CSV for further processing with other qsv commands. |
| join 📇😣👆 | Inner, outer, right, cross, anti & semi joins. Automatically creates a simple, in-memory hash index to make it fast. |
| joinp✨ 🐻‍❄️🚀🪄 | Inner, outer, right, cross, anti, semi, non-equi & asof joins using the Pola.rs engine. Unlike the join command, joinp can process files larger than RAM, is multithreaded, has join key validation, a maintain row order option, pre and post-join filtering, join keys unicode normalization, supports "special" non-equi joins and asof joins (which is particularly useful for time series data) & its output columns can be coalesced. |
| json 👆 | Convert JSON array to CSV. |
| jsonl 🚀🔣 | Convert newline-delimited JSON (JSONL/NDJSON) to CSV. See tojsonl command to convert CSV to JSONL. |
| lens✨ 🗃️🐻‍❄️🖥️ | Interactively view, search & filter tabular data files using the csvlens engine. Apart from CSV and its dialects, Arrow, Avro/IPC, Parquet, JSON array & JSONL formats are supported with the "polars" feature. |
| luau✨ 📇🌐🔣📚 !CKAN !Luau | <a name="luau_deeplink"></a>Create multiple new computed columns, filter rows, compute aggregations and build complex data pipelines by executing a Luau 0.725 expression/script for every row of a CSV file (sequential mode), or using random access with an index (random access mode). Can process a single Luau expression or full-fledged data-wrangling scripts using lookup tables with discrete BEGIN, MAIN and END sections. It is not just another qsv command, it is qsv's Domain-specific Language (DSL) with numerous qsv-specific helper functions to build production data pipelines. |
| moarstats 📇🏎️ | Add up to an additional 55 statistical measures, including extended outlier, robust & bivariate statistics to an existing stats CSV file. (example). |
| partition 👆 | Partition a CSV based on a column value. |
| pivotp✨ 🐻‍❄️🚀🪄 | Pivot CSV data. Features "smart" aggregation auto-selection based on data type & stats. |
| pragmastat 📇🤯🎲🪄 | Compute pragmatic statistics using the Pragmastat library. Uses the stats cache to auto-filter non-numeric columns and support Date/DateTime columns. |
| pro | Interact with the qsv pro API. |
| profile✨ 📇🧠🤖📚⛩️ !CKAN | Extract, derive & infer metadata from a CSV (local path or URL) - using the statistical profile of a dataset, mapped and driven by a configurable metadata scheming YAML spec (DCAT-US v3, DCAT-AP v3 and Croissant 1.1 bundled; Geoconnex when built with the geoconnex feature), with optional CKAN/DCAT metadata discovery for URL inputs. This enables FAIRification at scale. |
| prompt✨ 🐻‍❄️🖥️ | Open a file dialog to either pick a file as input or save output to a file. |
| pseudo 🔣👆 | Pseudonymise the values of the given column by replacing them with an incremental identifier. |
| py✨ 📇🔣 | Create a new computed column or filter rows by evaluating a Python expression on every row of a CSV file. Python's f-strings are particularly useful for extended formatting, with the ability to evaluate Python expressions as well. Requires Python 3.11 or greater. |
| rename | Rename the columns of a CSV efficiently. |
| replace 📇🏎️👆 | Replace CSV data using a regex. Applies the regex to each field individually. |
| reverse 📇🤯 | Reverse order of rows in a CSV. Unlike the sort --reverse command, it preserves the order of rows with the same key. If an index is present, it works with constant memory. Otherwise, it will load all the data into memory. |
| safenames !CKAN | <a name="safenames_deeplink"></a>Modify headers of a CSV to only have "safe" names - guaranteed "database-ready"/"CKAN-ready" names. |
| sample 📇🏎️🌐🎲🪄 | Randomly draw rows (with optional seed) from a CSV using ten different sampling methods - reservoir (default), indexed, bernoulli, systematic, stratified, weighted, varopt, mergeable-reservoir, cluster & timeseries sampling. The --varopt & --mergeable-reservoir modes support mergeable sketch I/O (--sketch-out/--sketch-in) so sharded inputs can be sampled and combined without re-reading the corpus. Supports sampling from CSVs on remote URLs. Uses the stats cache to skip unnecessary scanning and inform its sampling strategies. |
| schema 📇😣🐻‍❄️🏎️👆🪄 | <a name="schema_deeplink"></a>Infer either a JSON Schema Validation Draft 2020-12 (Example) or Polars Schema (Example) from CSV data. In JSON Schema Validation mode, it produces a .schema.json file replete with inferred data type & domain/range validation rules derived from stats. Uses multithreading to go faster if an index is present. See validate command to use the generated JSON Schema to validate if similar CSVs comply with the schema. With the --polars option, it produces a .pschema.json file that all polars commands (sqlp, joinp & pivotp) use to determine the data type of each column & to optimize performance. Both schemas are editable and can be fine-tuned: for JSON Schema, to refine the inferred validation rules; for Polars Schema, to change the inferred Polars data types. |
| scoresql✨ 🐻‍❄️🪄 | Analyze a SQL query against CSV file caches (stats, moarstats, frequency) to produce a performance score with actionable optimization suggestions BEFORE running the query. Supports Polars (default) and DuckDB modes. |
| search 📇🏎️👆 | Run a regex over a CSV. Applies the regex to selected fields & shows only matching rows. |
| searchset 📇🏎️👆 | _Run multiple regexes over a CSV in a single pass._ Applies the regexes to each field individually & shows only matching rows. |
| select 👆 | Select, re-order, reverse, duplicate or drop columns. |
| slice 📇🗃️🏎️ | Slice rows from any part of a CSV. When an index is present, this only has to parse the rows in the slice (instead of all rows leading up to the start of the slice). |
| snappy 🚀🌐 | <a name="snappy_deeplink"></a>Does streaming compression/decompression of the input using Google's Snappy framing format (more info). |
| sniff 📇🤖🌐 !CKAN | Quickly sniff & infer CSV metadata (delimiter, header row, preamble rows, quote character, flexible, is_utf8, average record length, number of records, content length & estimated number of records if sniffing a CSV on a URL, number of fields, field names & data types). It is also a general mime type detector. |
| sort 🤯🚀👆🎲 | Sorts CSV data in lexicographical, natural, numerical, reverse, unique or random (with optional seed) order (Also see extsort & sortcheck commands). |
| sortcheck 👆 | Check if a CSV is sorted. With the --json option, also retrieve record count, sort breaks & duplicate count. |
| split 📇🏎️ | Split one CSV file into many CSV files. It can split by number of rows, number of chunks or file size. Uses multithreading to go faster if an index is present when splitting by rows or chunks. |
| sqlp✨ 📇🗄️🐻‍❄️🚀🪄 | <a name="sqlp_deeplink"></a>Run Polars SQL (a PostgreSQL dialect) queries against several CSVs, Parquet, JSONL and Arrow files - converting queries to blazing-fast Polars LazyFrame expressions, processing larger than memory CSV files. Query results can be saved in CSV, JSON, JSONL, Parquet, Apache Arrow IPC and Apache Avro formats. |
| stats 📇🤯🏎️👆🪄 | <a name="stats_deeplink"></a>Compute up to 48 summary statistics & make GUARANTEED data type inferences (Null, String, Float, Integer, Date, DateTime, Boolean) for each column in a CSV (Example). Uses multithreading to go faster if an index is present. With an index, can compile "streaming" stats on a 1M row sample of NYC's 311 data in less than 0.25 seconds vs 2.24 seconds without one. |
| synthesize 📇🎲🤖 | <a name="synthesize_deeplink"></a>Generate a synthetic CSV that is statistically faithful to a source CSV. Runs stats + frequency on the source so synthesized columns reproduce its per-column attributes - frequency-weighted sampling for categorical columns, quartile-bucketed numeric/date generation, null-ratio preservation. With a Data Dictionary from describegpt --dictionary --infer-content-type, semantic Content Types pick realistic fake-rs fakers (names, emails, addresses, UUIDs, etc.) for non-enumerable columns. A dictionary relationships array preserves inter-column structure within each row - joint (functional dependencies like city/state/zip), ordered (monotonic chains like created_date ≤ closed_date) and correlated (numeric correlation via a Gaussian copula). Fully reproducible with --seed. |
| table 🤯 | Align output of a CSV using elastic tabstops for viewing; or to create an "aligned TSV" file or Fixed Width Format file. To interactively view a CSV, use the lens command. |
| template 📇🚀🔣📚⛩️ !CKAN | Renders a template using CSV data with the Mini Jinja template engine (Example). |
| to✨ 🗄️🐻‍❄️🚀 | Convert CSV files to Parquet, PostgreSQL, SQLite, Excel (XLSX), LibreOffice Calc (ODS) and Data Package. |
| tojsonl 📇😣🗃️🚀🔣🪄 | Smartly converts CSV to a newline-delimited JSON (JSONL/NDJSON). By scanning the CSV first, it "smartly" infers the appropriate JSON data type for each column. See jsonl command to convert JSONL to CSV. |
| transpose 🤯👆 | Transpose rows/columns of a CSV. |
| validate 📇🗄️🚀🌐📚 !CKAN | <a name="validate_deeplink"></a>Validate CSV data _blazingly-fast_ using JSON Schema Validation (Draft 2020-12) (e.g. _up to 780,031 rows/second_[^1] using NYC's 311 schema generated by the schema command) & put invalid records into a separate file along with a detailed validation error report. Supports several custom JSON Schema formats & keywords:<br>* currency custom format with ISO-4217 validation<br>* dynamicEnum custom keyword that supports enum validation against a CSV on the filesystem or a URL (http/https/ckan & dathere URL schemes supported)<br>* uniqueCombinedWith custom keyword to validate uniqueness across multiple columns for composite key validation.<br>If no JSON schema file is provided, validates if a CSV conforms to the RFC 4180 standard and is UTF-8 encoded. |
| viz✨ 🪄👆 | <a name="viz_deeplink"></a>Generate interactive charts (bar, line, scatter, histogram, box, pie, heatmap, candlestick/ohlc, sankey, radar, geographic maps) and an auto-dashboard (viz smart) from CSV data using plotly. viz smart "automagically" picks an appropriate chart per column from the dataset's statistics & frequency distributions (box plots for continuous columns from precomputed quartiles; frequency bars for low-cardinality/boolean columns; a correlation heatmap when there are 2+ eligible continuous numeric columns; a map panel when a lat/lon column pair is detected). Outputs self-contained, interactive HTML (charts work offline; map basemaps fetch their tiles over the network unless the white-bg style is used) - or static PNG/SVG/PDF/JPEG/WebP with the viz_static feature - and can --open the result in your browser. |

<div style="text-align: right">Performance metrics compiled on an M2 Pro 12-core Mac Mini with 32gb RAM</div>

* <a name="legend_deeplink">✨</a>: enabled by a feature flag.
* 📇: uses an index when available.
* 🤯: loads entire CSV into memory, though dedup, stats & transpose have "streaming" modes as well.
* 😣: uses additional memory proportional to the cardinality of the columns in the CSV.
* 🧠: expensive operations are memoized with available inter-session Redis/Disk caching for fetch commands.
* 🗄️: Extended input support.
* 🗃️: Limited Extended input support.
* 🐻‍❄️: command powered/accelerated by the [polars 0.54.4:py-1.42.0](https://github.com/pola-rs/polars/releases/tag/py-1.42.0) vectorized query engine.
* 🤖: command uses Natural Language Processing or Generative AI.
* 🏎️: multithreaded and/or faster when an index (📇) is available.
* 🚀: multithreaded even without an index.
* !CKAN : has CKAN-aware integration options.
* 🌐: has web-aware options.
* 🔣: requires UTF-8 encoded input.
* 👆: has powerful column selector support. See select for syntax.
* 🪄: "automagical" commands that use stats and/or frequency tables to work "smarter" & "faster".
* 📚: has lookup table support, enabling runtime "lookups" against local or remote reference CSVs.
* 🌎: has geospatial capabilities.
* ⛩️: uses Mini Jinja template engine.
* !Luau : uses Luau 0.725 as an embedded scripting DSL.
* 🎲: randomly generated or randomized output with a --seed option for reproducibility.
* 🖥️: part of the User Interface (UI) feature group.

[^1]: see validate_index benchmark

Installation Options

> [!TIP]
> To install the qsv MCP Server and/or the qsv Claude Cowork plugin, see the Getting Started guide.

Option 0: qsv pro

If you prefer to explore your data using a graphical interface instead of the command-line, feel free to try out qsv pro. Leveraging qsv, qsv pro can help you quickly analyze spreadsheet data by just dropping a file, along with many other interactive features. Learn more at qsvpro.dathere.com or download qsv pro directly by clicking one of the badges below.

Option 1: Download Prebuilt Binaries

Full-featured prebuilt binary variants of the latest qsv version for Linux, macOS & Windows are available for download, including binaries compiled with Rust Nightly (more info). You may click a badge below based on your platform to download a ZIP with pre-built binaries.

Prebuilt binaries for Apple Silicon, Windows for ARM, IBM Power Servers (PowerPC64 LE Linux) and IBM Z mainframes (s390x) have CPU optimizations enabled (target-cpu=native). The macOS Apple Silicon, Linux x86_64 (GNU), Linux ARM64 (GNU) and Windows x86_64 (MSVC) prebuilts are also compiled with Profile Guided Optimization (PGO) for even more performance gains.

We do not enable CPU optimizations on prebuilt binaries for x86_64 platforms, as there are too many CPU variants and the optimizations often lead to Illegal Instruction (SIGILL) faults. If you still get SIGILL faults, "portable" binaries (all CPU optimizations disabled) are also included in the release zip archives (qsv with a "p for portable" suffix - e.g. qsvp, qsvplite & qsvpdp).
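The fallback described above can be scripted. The following is a minimal sketch, not part of qsv: `pick_runnable` is a hypothetical helper that returns the first binary able to run `--version`. It assumes both `qsv` and `qsvp` were unpacked somewhere on your PATH, and it treats any non-zero exit (including a SIGILL crash) as "not runnable".

```shell
# Hypothetical helper (not shipped with qsv): print the first candidate
# binary that runs `--version` successfully.
pick_runnable() {
  for candidate in "$@"; do
    # A binary using CPU instructions this machine lacks typically dies
    # with SIGILL here, which surfaces as a non-zero exit status.
    if "$candidate" --version >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Prefer the regular build and fall back to the portable one.
# QSV_BIN stays empty if neither binary can be run.
QSV_BIN=$(pick_runnable qsv qsvp || true)
```

Scripts can then call `"$QSV_BIN"` instead of hard-coding a variant name.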

For Windows, an MSI "Easy installer" for the x86_64 MSVC qsvp binary is also available. After downloading and installing the Easy installer, launch the Easy installer and click "Install qsv" to download the latest qsvp pre-built binary to a folder that is added to your PATH. Afterwards qsv should be installed and you may launch a new terminal to use qsv.

For macOS, "ad-hoc" signatures are used to sign our binaries, so you will need to set appropriate Gatekeeper security settings or run the following command to remove the quarantine attribute from qsv before you run it for the first time:

# replace qsv with qsvmcp, qsvlite, qsvdp, qsvpy* if you installed those binary variants
sudo xattr -d com.apple.quarantine qsv

An additional benefit of using the prebuilt binaries is that they have the self_update feature enabled, allowing you to quickly update qsv to the latest version with a simple qsv --update. For further security, the self_update feature only fetches releases from this GitHub repo and automatically verifies the signature of the downloaded zip archive before installing the update.

> [!NOTE]
> The luau feature is not available in musl prebuilt binaries[^3].

Manually verifying the Integrity of the Prebuilt Binaries Zip Archives

All prebuilt binaries zip archives are signed with zipsign with the following public key qsv-zipsign-public.key. To verify the integrity of the downloaded zip archives:

# if you don't have zipsign installed yet
cargo install zipsign

# verify the integrity of the downloaded prebuilt binary zip archive
# after downloading the zip archive and the qsv-zipsign-public.key file.
# replace <PREBUILT-BINARY-ARCHIVE.zip> with the name of the downloaded zip archive
# e.g. zipsign verify zip qsv-0.118.0-aarch64-apple-darwin.zip qsv-zipsign-public.key
zipsign verify zip <PREBUILT-BINARY-ARCHIVE.zip> qsv-zipsign-public.key

Option 2: Package Managers & Distributions

qsv is also distributed by several package managers and distros.

![Packaging status](https://repology.org/project/qsv/versions)

Here are the relevant commands for installing qsv using the various package managers and distros:

# Arch Linux Extra Repository (https://archlinux.org/packages/extra/x86_64/qsv/)
pacman -S qsv

# Homebrew on macOS/Linux (https://formulae.brew.sh/formula/qsv#default)
brew install qsv

# MacPorts on macOS (https://ports.macports.org/port/qsv/)
sudo port install qsv

# Mise on Linux/macOS/Windows (https://mise.jdx.dev)
mise use -g qsv@latest

# Nixpkgs on Linux/macOS (https://search.nixos.org/packages?channel=unstable&show=qsv&from=0&size=50&sort=relevance&type=packages&query=qsv)
nix-shell -p qsv

# Scoop on Windows (https://scoop.sh/#/apps?q=qsv)
scoop install qsv

# Void Linux (https://voidlinux.org/packages/?arch=x86_64&q=qsv)
sudo xbps-install qsv

# Conda-forge (https://anaconda.org/conda-forge/qsv)
conda install conda-forge::qsv

Note that the qsv builds provided by these package managers/distros enable different features (Homebrew, for instance, only enables the apply, fetch, foreach, geocode, lens, luau and to features. However, it does automatically install shell completion for bash, fish and zsh shells).

To find out what features are enabled in a package/distro's qsv, run qsv --version (more info).

In the true spirit of open source, these packages are maintained by volunteers who wanted to make qsv easier to install in various environments. They are much appreciated, and we loosely collaborate with the package maintainers through GitHub, but know that these packages are maintained by third parties.

Debian package

datHere also maintains a Debian package targeting the latest Ubuntu LTS on x86_64 architecture to make it easier to install qsv with DataPusher+.

To install qsv on Ubuntu/Debian:

wget -O - https://dathere.github.io/qsv-deb-releases/qsv-deb.gpg | sudo gpg --dearmor -o /usr/share/keyrings/qsv-deb.gpg
echo "deb [signed-by=/usr/share/keyrings/qsv-deb.gpg] https://dathere.github.io/qsv-deb-releases ./" | sudo tee /etc/apt/sources.list.d/qsv.list
sudo apt update
sudo apt install qsv

Option 3: Compile from Source

If you have Rust installed, you can compile from source[^2]:

git clone https://github.com/dathere/qsv.git
cd qsv
cargo build --release --locked --bin qsv --features all_features

The compiled binary will end up in ./target/release/.

To compile different variants and enable optional features:

# to compile qsv with all features enabled
cargo build --release --locked --bin qsv --features feature_capable,apply,fetch,foreach,geocode,luau,mcp,magika,polars,python,self_update,to,ui
# shorthand
cargo build --release --locked --bin qsv -F all_features
# enable all CPU optimizations for the current CPU (warning: creates non-portable binary)
CARGO_BUILD_RUSTFLAGS='-C target-cpu=native' cargo build --release --locked --bin qsv -F all_features

# or build qsv with only the fetch and foreach features enabled
cargo build --release --locked --bin qsv -F feature_capable,fetch,foreach

# for qsvmcp - MCP server optimized variant
cargo build --release --locked --bin qsvmcp -F qsvmcp

# for qsvlite
cargo build --release --locked --bin qsvlite -F lite

# for qsvdp
cargo build --release --locked --bin qsvdp -F datapusher_plus

[^2]: Of course, you'll also need a linker & a C compiler. Linux users should generally install GCC or Clang, according to their distribution’s documentation. For example, if you use Ubuntu, you can install the build-essential package. On macOS, you can get a C compiler by running $ xcode-select --install. For Windows, this means installing Visual Studio 2022. When prompted for workloads, include "Desktop Development with C++", the Windows 10 or 11 SDK & the English language pack, along with any other language packs you require.

> [!NOTE]
> To build with Rust nightly, see Nightly Release Builds. The feature_capable, qsvmcp, lite and datapusher_plus features are MUTUALLY EXCLUSIVE. See Special Build Features for more info.

Variants

There are five binary variants of qsv:

* qsv - feature-capable(✨), with the prebuilt binaries enabling all applicable features except Python [^3]
* qsvpy - same as qsv but with the Python feature enabled. Three subvariants are available - qsvpy311, qsvpy312 & qsvpy313 - which are compiled with the latest patch version of Python 3.11, 3.12 & 3.13 respectively. We need to have a binary for each Python version as Python is dynamically linked (more info).
* qsvmcp - optimized for MCP (Model Context Protocol) server use with geocode, get, get_cloud, mcp, polars, profile, self_update, synthesize, to, and viz_static features enabled. Shares src/main.rs with qsv.
* qsvlite - all features disabled (~16% of the size of qsv). If you are migrating from xsv and want the same experience and feature set, this is the variant for you.
* qsvdp - optimized for use with DataPusher+ with only DataPusher+ relevant commands; an embedded luau interpreter; applydp, a slimmed-down version of the apply feature; the --progressbar option disabled; and the self-update only checking for new releases, requiring an explicit --update (~16% of the size of qsv).

> [!NOTE]
> There are "portable" subvariants of qsv available with the "p" suffix - qsvp, qsvplite and qsvpdp. These subvariants are compiled without any CPU features enabled. Use these subvariants if you have an old CPU architecture or are getting "Illegal instruction (SIGILL)" errors when running the regular qsv binaries.

[^3]: The luau feature is NOT enabled by default on the prebuilt binaries for musl platforms. This is because we build them on GitHub Action Runners using Ubuntu 20.04 LTS with the musl libc toolchain. Since Ubuntu is a glibc-based, not a musl-based distro, we get around this by cross-compiling. Unfortunately, this prevents us from cross-compiling binaries with the luau feature enabled as doing so requires statically linking the host OS libc library. If you need the luau feature on musl, you will need to compile from source on your own musl-based Linux Distro (e.g. Alpine, Void, etc.).

Shell Completion

qsv has extensive, extendable shell completion support. It currently supports the following shells: bash, zsh, powershell, fish, nushell, fig & elvish. You may download a shell completions script for your shell by clicking one of the badges below:

To customize shell completions, see the Shell Completion documentation. If you're using Bash, you can also follow the step-by-step tutorial at 100.dathere.com to learn how to enable the Bash shell completions.

Regular Expression Syntax

The --select option and several commands (apply, applydp, datefmt, exclude, fetchpost, replace, schema, search, searchset, select, sqlp, stats & validate) allow the user to specify regular expressions. We use the regex crate to parse, compile and execute these expressions. [^4]

[^4]: This is the same regex engine used by ripgrep - the blazingly fast grep replacement that powers Visual Studio Code's magical "Find in Files" feature.

Its syntax can be found here and "is similar to other regex engines, but it lacks several features that are not known how to implement efficiently. This includes, but is not limited to, look-around and backreferences. In exchange, all regex searches in this crate have worst case O(m * n) time complexity, where m is proportional to the size of the regex and n is proportional to the size of the string being searched."

If you want to test your regular expressions, regex101 supports the syntax used by the regex crate. Just select the "Rust" flavor.

> [!CAUTION]
> JSON SCHEMA VALIDATION REGEX: The schema command, when inferring a JSON Schema Validation file, will derive a regex expression for the selected columns when the --pattern-columns option is used. Though the derived regex is guaranteed to work, it may not be the most efficient. Before using the generated JSON Schema file in production with the validate command, it is recommended that users inspect and optimize the derived regex as required. While doing so, note that the validate command, in JSON Schema Validation mode, can also support "fancy" regex expressions with look-around and backreferences using the --fancy-regex option.

File formats

qsv recognizes UTF-8/ASCII encoded CSV (.csv), SSV (.ssv) and TSV files (.tsv & .tab). CSV files are assumed to have "," (comma) as a delimiter, SSV files have ";" (semicolon) as a delimiter and TSV files, "\t" (tab) as a delimiter. The delimiter is a single ASCII character that can be set with the --delimiter command-line option or the QSV_DEFAULT_DELIMITER environment variable, or automatically detected when QSV_SNIFF_DELIMITER is set.

When using the --output option, qsv will UTF-8 encode the file & automatically change the delimiter used in the generated file based on the file extension - i.e. comma for .csv, semicolon for .ssv, tab for .tsv & .tab files.
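
The extension-to-delimiter rule above can be illustrated with a small Python sketch. This is not qsv's implementation; `DELIMITERS`, `delimiter_for` and `render` are hypothetical names used only for illustration, and falling back to a comma for unknown extensions is an assumption.

```python
import csv
import io
from pathlib import Path

# Mapping taken from the text above; the helper names are hypothetical.
DELIMITERS = {".csv": ",", ".ssv": ";", ".tsv": "\t", ".tab": "\t"}


def delimiter_for(path: str, default: str = ",") -> str:
    """Return the delimiter implied by the file extension.

    Falling back to a comma for unknown extensions is an assumption.
    """
    return DELIMITERS.get(Path(path).suffix.lower(), default)


def render(rows, path: str) -> str:
    """Serialize rows as text using the delimiter implied by `path`."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter_for(path), lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


rows = [["id", "name"], ["1", "Ada"]]
print(render(rows, "out.ssv"), end="")
```

With an `.ssv` target the fields are joined with semicolons; switching the name to `out.tsv` would produce tab-separated output from the same rows.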

JSON files are recognized & converted to CSV with the json command. JSONL/NDJSON files are also recognized & converted to/from CSV with the jsonl and tojsonl commands respectively.

The fetch & fetchpost commands also produce JSONL files when they're invoked without the --new-column option & TSV files with the --report option.

The excel, safenames, sniff, sortcheck & validate commands produce JSON files with their JSON options following the JSON API 1.1 specification, so they can return detailed machine-friendly metadata that can be used by other systems.

The schema command produces a JSON Schema Validation (Draft 2020-12) file with the ".schema.json" file extension, which can be used with the validate command to validate other CSV files with an identical schema.

The describegpt and frequency commands also produce TOON files. TOON is a compact, human-readable encoding of the JSON data model for LLM prompts.

The excel command recognizes Excel & Open Document Spreadsheet (ODS) files (.xls, .xlsx, .xlsm, .xlsb & .ods files).

Speaking of Excel, if you're having trouble opening qsv-generated CSV files in Excel, set the QSV_OUTPUT_BOM environment variable to add a Byte Order Mark to the beginning of the generated CSV file. This is a workaround for Excel's UTF-8 encoding detection bug.
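
For readers unfamiliar with the Byte Order Mark, the following Python sketch shows what it looks like at the byte level. It does not use qsv; `csv_bytes` is a hypothetical helper, and Python's `utf-8-sig` codec stands in for whatever qsv does internally when QSV_OUTPUT_BOM is set.

```python
import codecs
import csv
import io


def csv_bytes(rows, with_bom: bool) -> bytes:
    """Encode rows as UTF-8 CSV bytes, optionally prefixed with a BOM."""
    text = io.StringIO()
    csv.writer(text, lineterminator="\n").writerows(rows)
    # Python's "utf-8-sig" codec prepends the three BOM bytes EF BB BF.
    encoding = "utf-8-sig" if with_bom else "utf-8"
    return text.getvalue().encode(encoding)


rows = [["city"], ["Zürich"]]
plain = csv_bytes(rows, with_bom=False)
marked = csv_bytes(rows, with_bom=True)

print(marked[:3] == codecs.BOM_UTF8)  # the BOM is the first three bytes
print(marked[3:] == plain)            # the rest of the file is unchanged
```

The BOM only adds a three-byte prefix; the CSV content itself is identical, which is why it works as a hint for Excel without affecting other UTF-8 aware readers that skip it.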

The to command converts CSVs to Parquet, Excel .xlsx, LibreOffice/OpenOffice Calc .ods & Data Package formats, and populates PostgreSQL and SQLite databases.

The sqlp command returns query results in CSV, JSON, JSONL, Parquet, Apache Arrow IPC & Apache Avro formats. Polars SQL also supports reading external files directly in various formats with its read_csv, read_ndjson, read_parquet & read_ipc table functions.

The sniff command can also detect the mime type of any file, whether local or remote (http and https schemes supported), with the --no-infer or --just-mime options. It can detect more than 130 file formats, including MS Office/Open Document files, JSON, XML, PDF, PNG, JPEG and specialized geospatial formats like GPX, GML, KML, TML, TMX, TSX, TTML. Click here for a complete list.

> [!TIP]
> When the polars feature is enabled, qsv can also natively read .parquet, .ipc, .arrow, .json & .jsonl files.

Extended Input Support

The cat, headers, sqlp, to & validate commands have extended input support (🗄️). If the input is - or empty, the command will try to use stdin as input. If it's not, it will check if it's a directory, and if so, add all the files in the directory as input files.

If it's a file, it will first check if it has an .infile-list extension. If it does, it will load the text file and parse each line as an input file path. This is a much faster and more convenient way to process a large number of input files, without having to pass them all as separate command-line arguments. Further, the file paths can be anywhere in the file system, even on separate volumes. If an input file path is not fully qualified, it will be treated as relative to the current working directory. Empty lines and lines starting with # are ignored. Invalid file paths will be logged as warnings and skipped.
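
The .infile-list rules above can be sketched in Python as follows. This is an illustration of the described behavior, not qsv's code: `read_infile_list` and `demo` are hypothetical names, and stripping surrounding whitespace before checking for `#` is an assumption.

```python
import logging
import tempfile
from pathlib import Path

log = logging.getLogger("infile_list_sketch")


def read_infile_list(list_path, cwd=None):
    """Return the valid input paths named in an .infile-list file."""
    base = Path(cwd) if cwd is not None else Path.cwd()
    inputs = []
    for raw in Path(list_path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()  # assumption: surrounding whitespace is ignored
        if not line or line.startswith("#"):
            continue  # empty lines and comments are ignored
        candidate = Path(line)
        if not candidate.is_absolute():
            candidate = base / candidate  # relative to the working directory
        if candidate.is_file():
            inputs.append(candidate)
        else:
            log.warning("skipping invalid input path: %s", candidate)
    return inputs


def demo():
    """Build a throwaway directory and resolve a small list file."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.csv").write_text("id\n1\n", encoding="utf-8")
        (root / "b.csv").write_text("id\n2\n", encoding="utf-8")
        listing = root / "inputs.infile-list"
        listing.write_text(
            "# monthly extracts\n"
            "\n"
            "a.csv\n"
            f"{root / 'b.csv'}\n"
            "missing.csv\n",
            encoding="utf-8",
        )
        return [p.name for p in read_infile_list(listing, cwd=root)]


print(demo())
```

In the demo, the comment and blank line are ignored, the relative and absolute paths are both accepted, and the nonexistent `missing.csv` is logged as a warning and skipped.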

For both directory and .infile-list input, Snappy-compressed .sz files and .zip archives will be automatically decompressed.

Finally, if it's just a regular file, it will be treated as a regular input file.
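
The order of checks described in this section (stdin, then directory, then .infile-list, then regular file) can be summarized with a short Python sketch. `classify_input` is a hypothetical name, and the returned labels are illustrative only; qsv's actual logic may differ in detail.

```python
import tempfile
from pathlib import Path


def classify_input(arg: str) -> str:
    """Classify an input argument following the order described above."""
    if arg in ("", "-"):
        return "stdin"
    path = Path(arg)
    if path.is_dir():
        return "directory"  # every file inside becomes an input
    if path.suffix == ".infile-list":
        return "infile-list"  # each line names an input file
    return "file"  # treated as a regular input file


with tempfile.TemporaryDirectory() as tmp:
    for arg in ("-", tmp, "inputs.infile-list", "nyc311.csv"):
        print(classify_input(arg))
```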

Limited Extended Input Support

The describegpt, lens, slice & tojsonl commands have limited extended input support (🗃️). They are different in that they only process one file. If provided an .infile-list or a compressed .sz or .zip file, they will only process the first file.

> Note on .zip inputs. qsv treats a .zip archive as a container of delimited-text files (CSV/TSV/TAB/SSV). The first such entry (in archive order) is used; commands with full Extended Input Support use _all_ of them. Directory and system entries (__MACOSX, .DS_Store, …) are skipped, and path-traversal ("zip-slip") entries are rejected. Nesting a special binary format (Parquet, Avro, or Arrow) inside a .zip is not a recommended workflow — those formats are already compressed, so zipping them serves no purpose — and it is not handled uniformly: most commands select only the first CSV/TSV/TAB/SSV entry (and error if a .zip contains none), while the commands with full Extended Input Support (cat, headers, sqlp, to & validate) _do_ extract non-tabular supported entries. For predictable results, provide .parquet/.avro/.arrow files _directly_ instead (qsv reads them natively). See #3988.
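
The entry-selection rules in the note can be sketched with Python's standard `zipfile` module. This is a minimal illustration under stated assumptions, not qsv's implementation: `tabular_entries`, `is_unsafe` and `build_demo_zip` are hypothetical names, and the sketch silently skips path-traversal entries, whereas the note only says they are "rejected".

```python
import io
import zipfile
from pathlib import PurePosixPath

TABULAR_EXTS = {".csv", ".tsv", ".tab", ".ssv"}
SYSTEM_NAMES = {".DS_Store"}


def is_unsafe(name: str) -> bool:
    """Flag absolute paths and '..' components (zip-slip)."""
    path = PurePosixPath(name.replace("\\", "/"))
    return path.is_absolute() or ".." in path.parts


def tabular_entries(archive: zipfile.ZipFile) -> list:
    """Return delimited-text entry names in archive order."""
    selected = []
    for info in archive.infolist():  # infolist() preserves archive order
        name = info.filename
        if is_unsafe(name):
            continue  # assumption: unsafe entries are skipped, not fatal
        parts = PurePosixPath(name).parts
        leaf = parts[-1] if parts else ""
        if info.is_dir() or "__MACOSX" in parts or leaf in SYSTEM_NAMES:
            continue  # directory and system entries are skipped
        if PurePosixPath(name).suffix.lower() in TABULAR_EXTS:
            selected.append(name)
    return selected


def build_demo_zip() -> bytes:
    """Create a small in-memory archive with a mix of entry kinds."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in ("__MACOSX/._data.csv", ".DS_Store", "docs/",
                     "notes.txt", "../evil.csv", "data/b.tsv", "a.csv"):
            zf.writestr(name, "" if name.endswith("/") else "id\n1\n")
    return buf.getvalue()


with zipfile.ZipFile(io.BytesIO(build_demo_zip())) as zf:
    entries = tabular_entries(zf)

print(entries[0])  # what a single-file command would pick
print(entries)     # what a command with full Extended Input Support would use
```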

Automatic Compression/Decompression

qsv supports _automatic compression/decompression_ using the Snappy frame format. Snappy was chosen instead of more popular compression formats like gzip because it was designed for high-performance streaming compression & decompression (up to 2.58 gb/sec compression, 0.89 gb/sec decompression).

For all commands except the index, extdedup & extsort commands, if the input file has an ".sz" extension, qsv will _automatically_ do streaming decompression as it reads it. Further, if the input file has an extended CSV/TSV ".sz" extension (e.g., nyc311.csv.sz/nyc311.tsv.sz/nyc311.tab.sz), qsv will also use the file extension to determine the delimiter to use.

Similarly, if the --output file has an ".sz" extension, qsv will _automatically_ do streaming compression as it writes it. If the output file has an extended CSV/TSV ".sz" extension, qsv will also use the file extension to determine the delimiter to use.

Note however that compressed files cannot be indexed, so index-accelerated commands (frequency, schema, split, stats, tojsonl) will not be multithreaded. Random access is also disabled without an index, so slice will not be instantaneous and luau's random-access mode will not be available.

There is also a dedicated snappy command with four subcommands for direct snappy file operations — a multithreaded compress subcommand (4-5x faster than the built-in, single-threaded auto-compression); a decompress subcommand with detailed compression metadata; a check subcommand to quickly inspect if a file has a Snappy header; and a validate subcommand to confirm if a Snappy file is valid.
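
To give a sense of what a Snappy header check involves, here is a minimal Python sketch based on the Snappy framing format, in which a stream begins with a 10-byte stream identifier chunk. Whether qsv's check subcommand works exactly this way is an assumption, and `has_snappy_header` is a hypothetical name. Note that a matching header does not prove the rest of the file is valid, which is why a separate validation step is useful.

```python
import tempfile
from pathlib import Path

# Stream identifier chunk from the Snappy framing format:
# chunk type 0xff, 3-byte little-endian length 6, then the ASCII "sNaPpY".
SNAPPY_STREAM_ID = b"\xff\x06\x00\x00sNaPpY"


def has_snappy_header(path) -> bool:
    """Return True if the file starts with the Snappy stream identifier."""
    with open(path, "rb") as handle:
        return handle.read(len(SNAPPY_STREAM_ID)) == SNAPPY_STREAM_ID


with tempfile.TemporaryDirectory() as tmp:
    framed = Path(tmp) / "framed.csv.sz"
    framed.write_bytes(SNAPPY_STREAM_ID + b"placeholder, not real chunks")
    plain = Path(tmp) / "plain.csv"
    plain.write_bytes(b"id,name\n1,Ada\n")
    results = (has_snappy_header(framed), has_snappy_header(plain))

print(results)
```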

The snappy command

Summary

Install to Claude Code