3.1. Index and search
Haystack
Haystack lets Django use third-party search engines such as Elasticsearch and Solr. Although any search engine supported by Haystack can be used, machado was tested with Elasticsearch.
3.1.1. Install Elasticsearch
Elasticsearch 7.x is required. Install Java first:
sudo apt install openjdk-11-jdk
Then install Elasticsearch:
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.17.26-amd64.deb
sudo dpkg -i elasticsearch-7.17.26-amd64.deb
sudo systemctl daemon-reload
sudo systemctl enable elasticsearch.service
sudo systemctl start elasticsearch.service
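Before continuing, it can help to confirm that the server is reachable and reports a 7.x version. The following is a minimal sketch using only the Python standard library; the helper names (parse_es_version, is_supported, check_elasticsearch) are hypothetical and not part of machado, and the URL assumes the default local install shown above.

```python
import json
import urllib.request


def parse_es_version(body: bytes) -> str:
    """Extract the version string from the JSON served at the root endpoint."""
    info = json.loads(body)
    return info["version"]["number"]


def is_supported(version: str) -> bool:
    """Return True for any 7.x release, the series this guide requires."""
    return version.split(".")[0] == "7"


def check_elasticsearch(url: str = "http://127.0.0.1:9200/") -> str:
    """Query the server and return its version, raising if it is not 7.x.

    The default URL is an assumption (local install, default port).
    """
    with urllib.request.urlopen(url, timeout=5) as response:
        version = parse_es_version(response.read())
    if not is_supported(version):
        raise RuntimeError(f"Elasticsearch 7.x is required, found {version}")
    return version
```

Calling check_elasticsearch() while the service is running returns the version string. A connection error usually means the service has not finished starting or is listening on a different address.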
Install the Python client inside your virtualenv:
pip install 'elasticsearch>=7,<8'
3.1.2. Enable search in machado
Uncomment the Elasticsearch settings in your .env file:
ELASTICSEARCH_URL=http://127.0.0.1:9200/
HAYSTACK_INDEX_NAME=haystack
When ELASTICSEARCH_URL is set, machado automatically adds haystack to
INSTALLED_APPS and configures HAYSTACK_CONNECTIONS, so no manual editing
of settings.py is needed.
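To illustrate what this automatic configuration might look like, here is a rough sketch of deriving the Haystack settings from the environment. It is not machado's actual settings code: the function name haystack_settings and the fallback index name are assumptions, and the engine path is the Elasticsearch 7 backend shipped with django-haystack.

```python
import os


def haystack_settings(env=None):
    """Return (extra_apps, connections) derived from environment variables.

    Hypothetical helper for illustration; machado's real settings may differ.
    """
    env = os.environ if env is None else env
    url = env.get("ELASTICSEARCH_URL")
    if not url:
        # Search stays disabled when the URL is not configured.
        return [], {}
    connections = {
        "default": {
            "ENGINE": (
                "haystack.backends.elasticsearch7_backend."
                "Elasticsearch7SearchEngine"
            ),
            "URL": url,
            # Assumed fallback; the .env example sets this value explicitly.
            "INDEX_NAME": env.get("HAYSTACK_INDEX_NAME", "haystack"),
        }
    }
    return ["haystack"], connections
```

In a settings module, the returned list would be appended to INSTALLED_APPS and the dictionary assigned to HAYSTACK_CONNECTIONS.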
You can also configure which feature types are indexed:
MACHADO_VALID_TYPES=gene,mRNA,polypeptide
If MACHADO_VALID_TYPES is not set, the default is gene,mRNA,polypeptide.
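The fallback behaviour can be sketched as follows. This is an illustration only: parse_valid_types is a hypothetical name, and the tolerance for stray whitespace around commas is an assumption rather than documented machado behaviour.

```python
DEFAULT_VALID_TYPES = "gene,mRNA,polypeptide"


def parse_valid_types(raw=None):
    """Split a comma-separated MACHADO_VALID_TYPES value into a list.

    Falls back to the documented default when the value is unset or empty.
    """
    value = raw if raw else DEFAULT_VALID_TYPES
    # Strip whitespace and drop empty entries (assumed leniency).
    return [item.strip() for item in value.split(",") if item.strip()]
```

For example, parse_valid_types("gene,mRNA") restricts the list to those two feature types, while parse_valid_types(None) yields the three default types.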
3.1.3. Indexing the data
After loading data into the database, build the search index:
python manage.py rebuild_index
Note
Run rebuild_index again whenever additional data is loaded into
the database or search-related settings change.
Rebuilding the index can be faster if you increase the number of workers:
python manage.py rebuild_index -k 4
3.1.4. Increasing the results limit
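By default, Elasticsearch returns at most 10,000 results per query (the index setting max_result_window). In most cases this does not affect searches, since results are paginated. However, files exported through the .tsv or .fasta links might be truncated because of this limit. You can increase it for the index named in HAYSTACK_INDEX_NAME (haystack in the example above) with: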
curl -XPUT "http://localhost:9200/haystack/_settings" \
-d '{ "index" : { "max_result_window" : 500000 } }' \
-H "Content-Type: application/json"
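The same request can be issued from Python, which may be convenient in a deployment script. The sketch below uses only the standard library; the function names build_window_request and raise_result_window are hypothetical, and the defaults simply mirror the curl command above.

```python
import json
import urllib.request


def build_window_request(base_url, index_name, max_results):
    """Build the PUT request that updates index.max_result_window."""
    body = json.dumps({"index": {"max_result_window": max_results}})
    url = f"{base_url.rstrip('/')}/{index_name}/_settings"
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def raise_result_window(
    base_url="http://localhost:9200",
    index_name="haystack",
    max_results=500000,
):
    """Send the request and return the decoded JSON response."""
    request = build_window_request(base_url, index_name, max_results)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read())
```

Pass the value of HAYSTACK_INDEX_NAME as index_name if it differs from haystack. Larger windows increase memory use on the Elasticsearch server, so it is reasonable to choose a value only as high as the largest export you expect.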