
How do you connect to and query data in Elasticsearch efficiently with Python?

Here is a comprehensive guide to using Elasticsearch with Python, covering everything from setup and connection to indexing, searching, and best practices.


We will focus on the official Elasticsearch Python client (the elasticsearch package on PyPI), which is the recommended way to interact with Elasticsearch from Python.


Prerequisites

Before you start, make sure you have:

  1. Python 3.6+ installed on your system.
  2. An Elasticsearch instance running.
    • The easiest way is via Docker:
      docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    • Or, you can download and run it directly from the Elasticsearch website.
  3. Verify your cluster is running by visiting http://localhost:9200 in your browser. You should see a JSON response with cluster information.

Installation

First, you need to install the official Elasticsearch Python client. It's recommended to use a virtual environment.

# Create and activate a virtual environment (optional but good practice)
python -m venv es-env
source es-env/bin/activate  # On Windows: es-env\Scripts\activate
# Install the elasticsearch client library
pip install elasticsearch

Connecting to Elasticsearch

The first step in any Python script is to establish a connection to your Elasticsearch cluster.

from elasticsearch import Elasticsearch
# The 8.x client requires an explicit URL; connect to the local, unsecured cluster
es = Elasticsearch("http://localhost:9200")
# You can also pass a list of nodes
# es = Elasticsearch(["http://localhost:9200"])
# To verify the connection, you can ping the cluster
if es.ping():
    print("Successfully connected to Elasticsearch!")
else:
    print("Could not connect to Elasticsearch!")
# To see the cluster's information
# print(es.info())

For production, you should use environment variables for configuration (e.g., ELASTICSEARCH_URL).
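
A minimal sketch of that pattern (ELASTICSEARCH_URL is a variable name you export yourself; the client does not read it automatically):

import os
from elasticsearch import Elasticsearch

# Fall back to the local development cluster if the variable is not set
es_url = os.environ.get("ELASTICSEARCH_URL", "http://localhost:9200")
es = Elasticsearch(es_url)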


Indexing Data (Creating Documents)

In Elasticsearch, you store data in indices (similar to databases in SQL). Within an index, you store documents (similar to rows/records).

There are two main ways to index data:

a) Indexing a Single Document

You use the index() method. If the document ID is not provided, Elasticsearch will generate one automatically.

# Define the document data
doc = {
    "author": "John Doe",
    "text": "Elasticsearch is a powerful search and analytics engine.",
    "timestamp": "2025-10-27T10:00:00",
    "tags": ["search", "database", "nosql"]
}
# Index the document into the 'articles' index with ID 1
# The 'refresh' parameter makes the document searchable immediately (good for testing)
response = es.index(index="articles", id=1, document=doc, refresh="wait_for")
print(f"Document indexed with ID: {response['_id']}")
print(f"Version: {response['_version']}")

b) Indexing Multiple Documents (Bulk Indexing)

For better performance, it's highly recommended to use the bulk() helper function when indexing many documents.

from elasticsearch.helpers import bulk
# Define a list of documents to index
docs = [
    {
        "_index": "articles",
        "_id": 2,
        "_source": {
            "author": "Jane Smith",
            "text": "Python is a versatile programming language.",
            "timestamp": "2025-10-27T11:00:00",
            "tags": ["python", "programming"]
        }
    },
    {
        "_index": "articles",
        "_id": 3,
        "_source": {
            "author": "John Doe",
            "text": "Data analysis is made easy with Python libraries like Pandas.",
            "timestamp": "2025-10-27T12:00:00",
            "tags": ["python", "data", "analysis"]
        }
    }
]
# Use the bulk helper to index all documents at once
# (by default, bulk() raises BulkIndexError if any action fails; 'errors' is a list of per-item errors)
success, errors = bulk(es, docs)
print(f"Successfully indexed {success} documents.")
print(f"Errors: {len(errors)}")

Searching Data

This is where Elasticsearch shines. You can search using a simple query string or a powerful JSON-based query language (Query DSL).

a) Simple Query String Search

Good for quick, simple searches.

# Search for the term 'python' in all fields
# (in the 8.x client, pass the query directly via the 'query' parameter; 'body' is deprecated)
query = {
    "query_string": {
        "query": "python"
    }
}
# Execute the search
response = es.search(index="articles", query=query)
# Print the results
print(f"Found {response['hits']['total']['value']} documents.")
for hit in response['hits']['hits']:
    print(f"ID: {hit['_id']}, Author: {hit['_source']['author']}, Text: {hit['_source']['text']}")

b) Using the Query DSL (More Powerful & Recommended)

This gives you full control over your search. Let's search for documents where the author is "John Doe" AND the text contains "search".

# Define a more complex query
query = {
    "bool": {
        "must": [  # All clauses must match
            { "match": { "author": "John Doe" } },
            { "match": { "text": "search" } }
        ]
    }
}
response = es.search(index="articles", query=query)
print(f"Found {response['hits']['total']['value']} documents matching the query.")
for hit in response['hits']['hits']:
    print(f"Score: {hit['_score']} -> ID: {hit['_id']}, Text: {hit['_source']['text']}")

Common Operations

a) Getting a Document by ID

from elasticsearch import NotFoundError

# Get the document with ID '1'
# (es.get raises NotFoundError for a missing document rather than returning found=False)
try:
    response = es.get(index="articles", id=1)
    print(f"Found document: {response['_source']}")
except NotFoundError:
    print("Document not found.")

b) Updating a Document

You can update a document entirely or use scripts for partial updates.

# Update the entire document with ID '1'
updated_doc = {
    "author": "John Doe (Updated)",
    "text": "Elasticsearch is a powerful search and analytics engine. It scales well!",
    "timestamp": "2025-10-27T10:00:00",
    "tags": ["search", "database", "nosql", "updated"]
}
es.index(index="articles", id=1, document=updated_doc, refresh="wait_for")
# Partial update using a script (e.g., increment a counter), assuming a 'views' field exists
# script = {
#     "source": "ctx._source.views += 1",
#     "lang": "painless"
# }
# es.update(index="articles", id=1, script=script)

c) Deleting a Document

# Delete the document with ID '2'
response = es.delete(index="articles", id=2)
if response['result'] == 'deleted':
    print("Document deleted successfully.")

d) Deleting an Index

Warning: This is a destructive operation and will delete all data in the index.

# Delete the entire 'articles' index
if es.indices.exists(index="articles"):
    es.indices.delete(index="articles")
    print("Index 'articles' deleted.")
else:
    print("Index 'articles' does not exist.")

Working with Mappings (Data Types)

Mappings define the schema of your index, including the data type of each field. It's good practice to define mappings beforehand to ensure correct data handling and enable powerful features like full-text search.

# Define the mapping for the 'articles' index
mapping = {
    "mappings": {
        "properties": {
            "author": {
                "type": "text"  # Full-text search field
            },
            "text": {
                "type": "text",
                "analyzer": "english" # Use the English analyzer for better stemming
            },
            "timestamp": {
                "type": "date",
                "format": "strict_date_optional_time||epoch_millis"
            },
            "tags": {
                "type": "keyword"  # Exact value field, good for filtering and aggregations
            }
        }
    }
}
# Create the index with the mapping
if not es.indices.exists(index="articles"):
    es.indices.create(index="articles", body=mapping)
    print("Index 'articles' created with mapping.")
else:
    print("Index 'articles' already exists.")

Best Practices

  1. Use Bulk Operations: Always use elasticsearch.helpers.bulk for indexing, updating, or deleting large numbers of documents. It's significantly faster than making individual requests.
  2. Manage Connections: For long-running applications (like web servers), create a single Elasticsearch client instance and reuse it. Don't create a new client for every request.
  3. Handle Timeouts: Network issues can cause requests to hang. Bound client-side waits with request_timeout, either on the client (Elasticsearch("http://localhost:9200", request_timeout=30)) or per call via es.options(request_timeout=30).search(...); see the sketch after this list.
  4. Error Handling: Elasticsearch operations can raise exceptions (e.g., connection errors, or NotFoundError for a missing document). Use try...except blocks to handle them gracefully, as shown below.
  5. Use Async for High-Performance Apps: If you're building a high-performance application (e.g., an API), use the AsyncElasticsearch client that ships with the official library (installed via pip install "elasticsearch[async]") for non-blocking, asynchronous operations; see the async sketch below.
  6. Security: In production, always enable security features (TLS/SSL, authentication). The client supports this by passing an https:// URL together with api_key or basic_auth credentials (and ca_certs for your certificate authority).
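
A minimal sketch of points 3 and 4, reusing the es client and the 'articles' index from earlier (ID 999 is a deliberately missing document):

from elasticsearch import ApiError, NotFoundError

try:
    # Bound the client-side wait for this single call to 5 seconds
    response = es.options(request_timeout=5).get(index="articles", id=999)
    print(response["_source"])
except NotFoundError:
    # The document (or the index) does not exist
    print("Document 999 was not found.")
except ApiError as err:
    # Any other error reported by the cluster (bad request, auth failure, ...)
    print(f"Elasticsearch returned an error: {err}")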
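
And a minimal async sketch for point 5 (requires pip install "elasticsearch[async]"; the URL assumes the local development cluster used throughout this guide):

import asyncio
from elasticsearch import AsyncElasticsearch

async def main():
    async_es = AsyncElasticsearch("http://localhost:9200")
    try:
        # Same API surface as the synchronous client, but every call is awaitable
        response = await async_es.search(index="articles", query={"match_all": {}})
        print(f"Async search found {response['hits']['total']['value']} documents.")
    finally:
        await async_es.close()

asyncio.run(main())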