# Deploying the SRA metagenomes index

The [deployment for the demo database](deploy.md) is useful when a new search index is built from scratch,
but if you want to use an existing branchwater database there is a more direct way
that skips downloading the `wort` signatures and indexing them into a branchwater database.

To demonstrate the process we will use the SRA metagenomes index from the [main instance](https://branchwater.sourmash.bio),
although at a larger scaled value.
The `s=1000` index from the main instance is ~1.3 TiB as of 2024-11-28,
but we provide versions downsampled to `s=100,000` for `k={21,31,51}`,
containing the same `1,161,119` SRA metagenomes,
which are easier to download and run locally:

- [`20241128-k21-s100000`](https://farm.cse.ucdavis.edu/~irber/branchwater/20241128-k21-s100000/) (14.1 GiB)
- [`20241128-k31-s100000`](https://farm.cse.ucdavis.edu/~irber/branchwater/20241128-k31-s100000/) (32.2 GiB)
- [`20241128-k51-s100000`](https://farm.cse.ucdavis.edu/~irber/branchwater/20241128-k51-s100000/) (35.8 GiB)

:::{note}
These are branchwater search indices; they don't contain the signatures used to build them.
This is mostly because

- we only need the index to run the branchwater service
- adding the signatures would increase the download size significantly
  (it's ~5 TiB of data for a specific _k_-size, ~15 TiB for all three).

But the signatures are available and can be individually downloaded from [wort](https://wort.sourmash.bio)
:::

For this example we will use `k=21,s=100,000` to bring up a new local instance of branchwater.

## Clone the repo

```bash
git clone https://github.com/sourmash-bio/branchwater
cd branchwater
```

## Set up dependencies

In case you haven't done the [deployment for the demo database](deploy.md) tutorial,
follow the [setup instructions](deploy.md#set-up-dependencies)
to download `pixi` and a docker runtime.
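Before downloading anything, it can help to confirm that the tools are on your `PATH` and that there is room for the index (14.1 GiB for `k=21`). The following is a minimal sketch and not part of branchwater: the function names `have_tool`, `free_gib` and `preflight` are made up for illustration, and the `REQUIRED_GIB` default of 20 is an assumed safety margin rather than a documented requirement.

```bash
#!/usr/bin/env bash
# Preflight sketch: check for required tools and free disk space.
# REQUIRED_GIB is an illustrative threshold, not a value from the branchwater docs.
REQUIRED_GIB="${REQUIRED_GIB:-20}"

# Return 0 if the named command is available on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Print the free space, in whole GiB, on the filesystem holding the given directory.
free_gib() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Report every problem found; return non-zero if anything is missing.
preflight() {
  local dir="${1:-.}" status=0 tool free
  for tool in git pixi docker; do
    if ! have_tool "$tool"; then
      echo "missing: $tool"
      status=1
    fi
  done
  free="$(free_gib "$dir")"
  if [ "$free" -lt "$REQUIRED_GIB" ]; then
    echo "low disk space: ${free} GiB free, want at least ${REQUIRED_GIB} GiB"
    status=1
  fi
  return "$status"
}
```

After sourcing the file, `preflight . && echo "ready to download"` runs all checks against the current directory. Set `REQUIRED_GIB` higher if you plan to use the `k=31` or `k=51` index.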
## Create a new directory to hold the index and metadata

Let's create a new directory to hold the data for this service:

```bash
mkdir -p bw_k21
```

By the time we are ready to start the service it will look like this:

```
bw_k21
├── bqKey.json        # Only if building metadata yourself using BigQuery
├── index/            # branchwater index
├── metadata.duckdb   # metadata for accessions, either prepared or built from index. Loaded from parquet.
├── metadata.parquet  # metadata for accessions, either prepared or built from index
└── sraids            # Only if building metadata yourself
```

## Edit `docker-compose.yml`

In the `volumes` section for the `index` and `app` services,
replace `bw_db` with `bw_k21`:

```diff
     volumes:
-      - ./bw_db:/data/
+      - ./bw_k21:/data/
```

And in the `index` service, add the following lines to pass new parameters to the command that initializes the service:

```yaml
    command: >
      /app/bin/branchwater-server --port 80
      -k 21 --scaled 100000
      --location /dev/null
      /data/index
```

:::{note}
Since we don't have the signatures, we point `--location` to `/dev/null`.
Weird, but it works =]
:::

## Download the `k=21` index

Let's start by downloading the index.
Here is a `wget` invocation to save it into `index/`:

```bash
pixi exec wget \
  -c --recursive \
  --no-parent -nH \
  --cut-dirs=3 --reject "index.html*" \
  -P bw_k21/index/ \
  https://farm.cse.ucdavis.edu/~irber/branchwater/20241128-k21-s100000/
```

## Prepared metadata for index (or build your own!)

::::{tab-set}

:::{tab-item} Use prepared metadata

The metadata for these indices is already available, so you don't need to build it locally.
You can download it and put it in the correct place with

```bash
pixi exec wget -c -O bw_k21/metadata.parquet \
  https://farm.cse.ucdavis.edu/~irber/branchwater/20241128-metadata.parquet
```

:::

:::{tab-item} Build your own metadata

### Extract accessions from index

`sraids` is a list of all the SRA accessions used to build the index,
and we need it to retrieve the SRA metadata that is presented in the frontend.

This information is contained in the manifest used to build the index,
so we can extract it from the index and manifest by running

```bash
pixi run cargo run --release -p branchwater-index metadata bw_k21/index --acc-only -o bw_k21/sraids
```

### BigQuery credentials

If you did the demo deployment you can copy and reuse its credentials file:

```
cp bw_db/bqKey.json bw_k21/bqKey.json
```

Otherwise, follow [these instructions](deploy.md#prepare-a-bigquery-access-key)
to create the BigQuery credential file.

### Build metadata

The final file we need is `metadata.parquet`,
and we have all the pieces in place to generate it.
Run

```bash
pixi run metadata_bq bw_k21
```

to create it.

:::

::::

## Load metadata into duckdb

We can load `metadata.parquet` into duckdb by running

```bash
pixi run load_duckdb bw_k21
```

This will be printed after the command finishes:

```bash
1,160,375 accessions imported to duckdb
Full duckdb size is 350.7 MiB, average document size is 316 bytes
```

:::{note}
Did you notice that 1,160,375 != 1,161,119?
What is up with the missing metadata?

There is a longer discussion of this elsewhere,
but it is mostly because the metadata changes over time:
datasets that were in previous versions of branchwater might not have metadata anymore (due to retractions),
and we are downloading the most up-to-date metadata to serve.
We avoid removing the datasets from the search index to keep maintenance easier,
but they won't show up in the frontend without the metadata.
It is left as an exercise to the reader to figure out how to retrieve the ghost matches =]
:::

## Bring up the search index and frontend

Finally, we can bring up the index and frontend app:

```
pixi run deploy up -d
```

The website is now available at