Deploying the SRA metagenomes index¶
The deployment for the demo database is useful when a
new search index is built from scratch,
but if you want to use an existing branchwater database there is
a more direct way that skips downloading the wort signatures and indexing them into a branchwater database.
To demonstrate the process we will use the SRA metagenomes index from the main instance,
although at a larger scaled value.
The s=1000 index from the main instance is ~1.3 TiB as of 2024-11-28,
but we provide versions downsampled to s=100,000 for k={21,31,51}.
They contain the same 1,161,119 SRA metagenomes
and are easier to download and run locally:
20241128-k21-s100000 (14.1 GiB)
20241128-k31-s100000 (32.2 GiB)
20241128-k51-s100000 (35.8 GiB)
Note
These are branchwater search indices; they don’t contain the signatures used to build them. This is mostly because
we only need the index to run the branchwater service,
and adding the signatures would increase the download size significantly (it’s ~5 TiB of data for a specific k-size, ~15 TiB for all three).
The signatures are still available and can be downloaded individually from wort.
For this example we will use k=21,s=100,000 to bring up a new local instance of branchwater.
Clone the repo¶
git clone https://github.com/sourmash-bio/branchwater
cd branchwater
Set up dependencies¶
In case you haven’t done the deployment for the demo database tutorial,
follow the setup instructions to download pixi and a docker runtime.
Create a new directory to hold the index and metadata¶
Let’s create a new directory to hold the data for this service:
mkdir -p bw_k21
By the time we are ready to start the service, it will look like this:
bw_k21
├── bqKey.json # Only if building metadata yourself using BigQuery
├── index/ # branchwater index
├── metadata.duckdb # metadata for accessions, loaded from metadata.parquet
├── metadata.parquet # metadata for accessions, either prepared or built from index
└── sraids # Only if building metadata yourself
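To see how far along the directory is at any step, it can be compared against the listing above. The following is a small sketch, not part of branchwater: `check_layout` is a hypothetical helper name, and the split between required and optional entries simply follows the comments in the listing.

```shell
# Report which of the expected entries exist in a branchwater data directory.
# bqKey.json and sraids are only needed when building the metadata yourself,
# so they are reported as optional rather than missing.
check_layout() {
    dir="${1:-bw_k21}"
    for entry in index metadata.parquet metadata.duckdb; do
        if [ -e "$dir/$entry" ]; then
            echo "ok       $entry"
        else
            echo "MISSING  $entry"
        fi
    done
    for entry in bqKey.json sraids; do
        if [ -e "$dir/$entry" ]; then
            echo "ok       $entry"
        else
            echo "optional $entry (not present)"
        fi
    done
}
```

Calling `check_layout bw_k21` right after the `mkdir` step should list everything as missing or optional; the entries fill in as the following sections are completed.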
Edit docker-compose.yml¶
In the volumes section for the index and app services,
replace bw_db with bw_k21:
volumes:
- - ./bw_db:/data/
+ - ./bw_k21:/data/
And in the index service, add the following lines to pass new parameters to the command that initializes the service:
command: >
/app/bin/branchwater-server
--port 80
-k 21
--scaled 100000
--location /dev/null
/data/index
Note
Since we don’t have the signatures,
we point --location to /dev/null.
Weird, but it works =]
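The volumes change from the start of this section can also be applied without opening an editor. This is a minimal sketch, assuming the volume entries are spelled exactly `./bw_db:/data/` as in the diff; `switch_data_dir` is a hypothetical helper name. The `command` block still has to be added by hand.

```shell
# Point every ./bw_db:/data/ volume at ./bw_k21:/data/ instead.
# A .bak copy of the original file is kept next to it.
switch_data_dir() {
    compose_file="${1:-docker-compose.yml}"
    sed -i.bak 's#\./bw_db:/data/#./bw_k21:/data/#' "$compose_file"
    # Print the resulting volume lines so the change can be reviewed.
    # grep exits non-zero if nothing matched, i.e. nothing was replaced.
    grep -n 'bw_k21:/data/' "$compose_file"
}
```

Run `switch_data_dir docker-compose.yml` from the repository root; if the output does not show one line per service, restore `docker-compose.yml.bak` and edit the file manually.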
Download the k=21 index¶
Let’s start by downloading the index. Here is a wget invocation to save it into bw_k21/index/:
pixi exec wget \
-c --recursive \
--no-parent -nH \
--cut-dirs=3 --reject "index.html*" \
-P bw_k21/index/ \
https://farm.cse.ucdavis.edu/~irber/branchwater/20241128-k21-s100000/
Prepared metadata for index (or build your own!)¶
The metadata for these indices is already available at https://farm.cse.ucdavis.edu/~irber/branchwater/20241128-metadata.parquet so you don’t need to build it locally. You can download it and put it in the correct place with:
pixi exec wget -c -O bw_k21/metadata.parquet \
https://farm.cse.ucdavis.edu/~irber/branchwater/20241128-metadata.parquet
Extract accessions from index
sraids is a list of all the SRA accessions used to build the index,
and we need it to retrieve the SRA metadata that is presented in the frontend.
This information is contained in the manifest used to build the index,
so we can extract it from the index and manifest by running
pixi run cargo run --release -p branchwater-index metadata bw_k21/index --acc-only -o bw_k21/sraids
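Before spending BigQuery quota on the next steps, the extracted list can be sanity-checked. This is a minimal sketch that assumes sraids holds one accession per line (the output format is not described here); `count_accessions` is a hypothetical helper name.

```shell
# Count total and distinct lines in an accession list
# (assumption: one accession per line).
count_accessions() {
    ids_file="${1:-bw_k21/sraids}"
    total="$(wc -l < "$ids_file" | tr -d ' ')"
    unique="$(sort -u "$ids_file" | wc -l | tr -d ' ')"
    echo "total=$total unique=$unique"
}
```

If the one-per-line assumption holds and there are no duplicates, `count_accessions bw_k21/sraids` should report the 1,161,119 metagenomes mentioned at the top of this page for both numbers.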
BigQuery credentials
If you did the demo deployment you can copy and reuse its BigQuery credentials file:
cp bw_db/bqKey.json bw_k21/bqKey.json
Otherwise, follow these instructions to create the BigQuery credential file.
Build metadata
The final file we need is metadata.parquet,
and we have all the pieces in place to generate it.
Run
pixi run metadata_bq bw_k21
to create it.
Load metadata into duckdb¶
We can load metadata.parquet into duckdb by running
pixi run load_duckdb bw_k21
This will be printed after the command finishes:
1,160,375 accessions imported to duckdb
Full duckdb size is 350.7 MiB, average document size is 316 bytes
Note
Did you notice that 1,160,375 != 1,161,119? What is up with the missing metadata?
There is a longer discussion in https://github.com/sourmash-bio/branchwater/issues/24#issuecomment-2067814713. In short, SRA metadata changes over time: datasets that were in previous versions of branchwater might no longer have metadata (due to retractions), and we download the most up-to-date metadata to serve. We avoid removing these datasets from the search index to keep maintenance easier, but they won’t show up in the frontend without the metadata.
It is left as an exercise to the reader to figure out how to retrieve the ghost matches =]
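The gap is 1,161,119 - 1,160,375 = 744 datasets. One way to see which accessions fall into it is to compare sraids against the accessions that do have metadata. The sketch below assumes both inputs are plain text with one accession per line; the second file is hypothetical and would have to be exported from the metadata first, which is not covered here. `missing_metadata` is likewise a made-up helper name.

```shell
# List accessions present in the first file but absent from the second.
# $1: sraids file; $2: hypothetical export of accessions that have metadata.
missing_metadata() {
    ids_file="$1"
    meta_file="$2"
    tmp_dir="$(mktemp -d)"
    sort -u "$ids_file" > "$tmp_dir/ids"
    sort -u "$meta_file" > "$tmp_dir/meta"
    # comm -23 keeps lines that appear only in the first sorted input.
    comm -23 "$tmp_dir/ids" "$tmp_dir/meta"
    rm -rf "$tmp_dir"
}
```

Piping the output through `wc -l` gives a number to compare against the 744 above. This only identifies the datasets without metadata; retrieving their matches from the index is still the exercise.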
Bring up the search index and frontend¶
Finally, we can bring up the index and frontend app:
pixi run deploy up -d
The website is now available at http://localhost:8000