Deploying the SRA metagenomes index

The deployment for the demo database is useful when a new search index is built from scratch. If you want to use an existing branchwater database, there is a more direct route that skips downloading the signatures from wort and indexing them into a branchwater database.

To demonstrate the process we will use the SRA metagenomes index from the main instance, although at a larger scaled value. The s=1000 index from the main instance is ~1.3 TiB as of 2024-11-28, but we provide versions downsampled to s=100,000 for k={21,31,51}. They contain the same 1,161,119 SRA metagenomes and are easier to download and run locally.

Note

These are branchwater search indices; they don’t contain the signatures used to build them. This is mostly because

  • we only need the index to run the branchwater service

  • adding the signatures would increase the download size significantly (it’s ~5 TiB of data for a specific k-size, ~15 TiB for all three).

But the signatures are available and can be individually downloaded from wort.

For this example we will use k=21,s=100,000 to bring up a new local instance of branchwater.

Clone the repo

git clone https://github.com/sourmash-bio/branchwater
cd branchwater

Set up dependencies

If you haven’t completed the deployment for the demo database tutorial, follow its setup instructions to install pixi and a docker runtime.

Create a new directory to hold the index and metadata

Let’s create a new directory to hold the data for this service:

mkdir -p bw_k21

By the time we are ready to start the service, it will look like this:

bw_k21
├── bqKey.json        # Only if building metadata yourself using BigQuery
├── index/            # branchwater index
├── metadata.duckdb   # metadata for accessions, either prepared or built from index. Loaded from parquet.
├── metadata.parquet  # metadata for accessions, either prepared or built from index
└── sraids            # Only if building metadata yourself
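As a quick check before starting the service, the layout above can be compared against what is actually on disk. This is a minimal sketch: the entry names come from the tree above, and the function name check_bw_layout is hypothetical.

```shell
# Report which of the expected entries exist under a data directory.
# Entry names are taken from the directory tree above.
check_bw_layout() {
    dir=$1
    for entry in index metadata.parquet metadata.duckdb sraids bqKey.json; do
        if [ -e "$dir/$entry" ]; then
            printf 'ok       %s\n' "$entry"
        else
            printf 'missing  %s\n' "$entry"
        fi
    done
}
```

Running `check_bw_layout bw_k21` prints one line per entry. A "missing" line for bqKey.json or sraids is expected if you use the prepared metadata instead of building it yourself.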

Edit docker-compose.yml

In the volumes sections for the index and app services, replace bw_db with bw_k21:

     volumes:
-      - ./bw_db:/data/
+      - ./bw_k21:/data/
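If you prefer not to edit the file by hand, the same substitution can be sketched with sed. This assumes the volume entries are written exactly as ./bw_db:/data/, as in the diff above; the function name retarget_data_volume is hypothetical. It writes to standard output rather than editing in place, so you can review the result first.

```shell
# Rewrite the bw_db volume mapping to point at another data directory.
# $1: new directory name (for example bw_k21); the compose file is read from stdin.
retarget_data_volume() {
    sed "s#\./bw_db:/data/#./$1:/data/#"
}
```

For example, `retarget_data_volume bw_k21 < docker-compose.yml | diff docker-compose.yml -` shows which lines would change without touching the original file.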

And in the index service, add the following lines to pass new parameters to the command that initializes the service:

    command: >
      /app/bin/branchwater-server
        --port 80
        -k 21
        --scaled 100000
        --location /dev/null
        /data/index

Note

Since we don’t have the signatures, we point --location to /dev/null. Weird, but it works =]

Download the k=21 index

Let’s start by downloading the index. Here is a wget invocation to save it into index/:

pixi exec wget \
    -c --recursive \
    --no-parent -nH \
    --cut-dirs=3 --reject "index.html*" \
    -P bw_k21/index/ \
    https://farm.cse.ucdavis.edu/~irber/branchwater/20241128-k21-s100000/
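The -nH and --cut-dirs=3 options are what make the files land directly in bw_k21/index/: -nH drops the host name, and --cut-dirs=3 drops the three leading directories of the remote path (~irber, branchwater, and 20241128-k21-s100000). The sketch below mimics that path mapping in plain shell to make the effect visible. The function name wget_local_path is hypothetical, and it only approximates how wget derives local paths.

```shell
# Approximate where a remote path ends up locally with `wget -nH --cut-dirs=N -P PREFIX`.
# $1: prefix directory, $2: number of leading directories to cut, $3: remote URL path
wget_local_path() {
    prefix=$1
    cut=$2
    path=${3#/}                 # drop the leading slash of the URL path
    i=0
    while [ "$i" -lt "$cut" ]; do
        case $path in
            */*) path=${path#*/} ;;   # remove one leading directory component
            *) break ;;               # only the file name is left
        esac
        i=$((i + 1))
    done
    printf '%s/%s\n' "${prefix%/}" "$path"
}
```

With a placeholder file name, `wget_local_path bw_k21/index/ 3 '/~irber/branchwater/20241128-k21-s100000/CURRENT'` prints bw_k21/index/CURRENT. The file name CURRENT is only an illustration, not a statement about the contents of the download.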

Prepared metadata for index (or build your own!)

The metadata for these indices is already available at https://farm.cse.ucdavis.edu/~irber/branchwater/20241128-metadata.parquet so you don’t need to build it locally. You can download it and put it in the correct place with

pixi exec wget -c -O bw_k21/metadata.parquet \
    https://farm.cse.ucdavis.edu/~irber/branchwater/20241128-metadata.parquet

Extract accessions from index

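This step and the next two (BigQuery credentials and Build metadata) are only needed if you build the metadata yourself instead of downloading the prepared file. sraids is a list of all the SRA accessions used to build the index, and we need it to retrieve the SRA metadata that is presented in the frontend. This information is contained in the manifest used to build the index, so we can extract it from the index by running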

pixi run cargo run --release -p branchwater-index metadata bw_k21/index --acc-only -o bw_k21/sraids
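Before querying BigQuery it can be worth checking that the extracted list looks sane. The sketch below assumes sraids holds one accession per line, which is an assumption about the output format rather than something documented here; the function name summarize_accessions is hypothetical.

```shell
# Print "<non-empty lines> <unique accessions>" for an accession list.
# Assumes one accession per line; blank lines are ignored.
summarize_accessions() {
    awk 'NF { n++; if (!seen[$1]++) u++ } END { printf "%d %d\n", n, u }' "$1"
}
```

For this index, `summarize_accessions bw_k21/sraids` should report a count close to the 1,161,119 metagenomes mentioned earlier if the one-per-line assumption holds.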

BigQuery credentials

If you did the demo deployment, you can copy and reuse its credentials file:

cp bw_db/bqKey.json bw_k21/bqKey.json

Otherwise, follow these instructions to create the BigQuery credential file.

Build metadata

The final file we need is metadata.parquet, and we have all the pieces in place to generate it. Run

pixi run metadata_bq bw_k21

to create it.

Load metadata into duckdb

We can load metadata.parquet into duckdb by running

pixi run load_duckdb bw_k21

This will be printed after the command finishes:

1,160,375 accessions imported to duckdb
Full duckdb size is 350.7 MiB, average document size is 316 bytes

Note

Did you notice that 1,160,375 != 1,161,119? What is up with the missing metadata?

There is a longer discussion in https://github.com/sourmash-bio/branchwater/issues/24#issuecomment-2067814713 but it is mostly because metadata changes over time: datasets that were in previous versions of branchwater might no longer have metadata (due to retractions), and we download the most up-to-date metadata to serve. We avoid removing these datasets from the search index to keep maintenance easier, but they won’t show up in the frontend without the metadata.

It is left as an exercise to the reader to figure out how to retrieve the ghost matches =]
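One possible starting point, as a sketch only: if you have sraids and a second file listing the accessions that do have metadata, the difference between the two is the set of datasets without metadata. The second file is an assumption. How to export it from metadata.duckdb depends on the database schema, which is not described here, so the file name metadata_accessions.txt and the function name missing_metadata are hypothetical.

```shell
# List accessions that are in the index but have no metadata.
# $1: sraids (one accession per line, assumed)
# $2: accessions present in the metadata, one per line (assumed export)
missing_metadata() {
    sraids_sorted=$(mktemp)
    meta_sorted=$(mktemp)
    sort -u "$1" > "$sraids_sorted"
    sort -u "$2" > "$meta_sorted"
    # comm -23 keeps lines that appear only in the first sorted file
    comm -23 "$sraids_sorted" "$meta_sorted"
    rm -f "$sraids_sorted" "$meta_sorted"
}
```

With the numbers above, `missing_metadata bw_k21/sraids metadata_accessions.txt | wc -l` would be expected to report about 1,161,119 - 1,160,375 = 744 accessions, provided every imported accession also appears in sraids.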

Bring up the search index and frontend

Finally, we can bring up the index and frontend app:

pixi run deploy up -d

The website is now available at http://localhost:8000