杰瑞科技汇

How Do You Connect to HBase from Python Using Thrift?

Using Python with HBase through the Thrift API is a common pattern, especially in environments where the native Java HBase client is not an option (as in many data science or web service stacks).

Here's a comprehensive guide covering everything from the setup to a complete, runnable Python example.

Overview: How it Works

  1. HBase Server: You have a running HBase cluster.
  2. HBase Thrift Server: This is a separate process that runs on your HBase cluster (or a gateway machine). It acts as a proxy or gateway, translating Thrift RPC calls into native HBase Java API calls.
  3. Python Client: Your Python application uses a Thrift library (like thrift) to make network calls to the HBase Thrift Server.
  4. Thrift IDL (Interface Definition Language): This file (Hbase.thrift) defines all the possible operations (like get, put, scan, etc.). The Thrift compiler uses this file to generate Python stub code that makes the network communication easy.

Step 1: Prerequisites

  1. A Running HBase Cluster: You need an active HBase installation. This can be a standalone setup for development or a full distributed cluster for production.
  2. Java Development Kit (JDK): The Thrift server runs on the JVM, so you need Java installed.
  3. Thrift: You need the Thrift compiler installed on your machine to generate the Python bindings from Hbase.thrift. HBase binary releases do not normally include generated Python code, so plan on generating it yourself unless you use a package that bundles the bindings.

Step 2: Setting up the HBase Thrift Server

This is the most critical step. If the server isn't running, your Python client will fail to connect.

  1. Download HBase: If you haven't already, download the latest stable HBase release from the Apache HBase website.

  2. Start the Thrift Server: Navigate to your HBase directory and run the thrift server script.

    # Go to your HBase installation directory
    cd /path/to/hbase
    # Start the Thrift server
    # The 'nonblocking' option can perform better under many concurrent connections
    bin/hbase-daemon.sh start thrift --infoport 9095 -nonblocking
    • --infoport 9095: Sets the port for the Thrift server's informational web UI.
    • -nonblocking: Runs the server in non-blocking mode, which can be more efficient under high load. This mode implies the framed transport, so clients must connect with TFramedTransport.
  3. Verify the Server is Running:

    • Check the HBase logs for any startup errors.
    • Use jps on the server machine to see the Java processes. You should see a ThriftServer process.
    jps
    # Output should include:
    # 1234 HRegionServer
    # 5678 HMaster
    # 9012 ThriftServer
  4. Check the Port: The default Thrift port is 9090. You can check if it's listening with netstat or ss.

    # On Linux
    sudo netstat -tuln | grep 9090
    # Or
    ss -tuln | grep 9090
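
The same port check can be done from the client machine in Python, which also confirms that no firewall sits between your application and the Thrift server. This is a minimal sketch using only the standard library; the function name thrift_port_open and the 2-second timeout are illustrative choices, not part of HBase or Thrift.

```python
import socket


def thrift_port_open(host="localhost", port=9090, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures
        return False
```

Calling thrift_port_open("localhost", 9090) only tells you that something is listening on the port; it does not prove that the listener is the HBase Thrift server.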

Step 3: Setting up the Python Environment

You need to install the Python Thrift library. The thrift-sasl library is only needed when the cluster uses SASL authentication (for example Kerberos), which is common in secure Hadoop clusters; installing it up front does no harm.

pip install thrift
pip install thrift-sasl

Step 4: The Python Client Code

Now for the fun part! Here is a complete, commented Python script that connects to the HBase Thrift server and performs basic CRUD (Create, Read, Update, Delete) operations.

First, make sure you have a table named python_test in HBase with a column family named cf. You can create it with the HBase shell:

# In the HBase shell
create 'python_test', 'cf'
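
Once a client is connected, you can also confirm from Python that the table is visible through the Thrift gateway. The sketch below assumes a Thrift1 client generated from Hbase.thrift, whose getTableNames() call returns the table names (as bytes with recent bindings); table_exists is a hypothetical helper name.

```python
def table_exists(client, table_name):
    """Return True if table_name is among the tables the Thrift server reports."""
    def to_bytes(name):
        # Older bindings may return str, newer ones bytes; compare as bytes
        return name.encode("utf-8") if isinstance(name, str) else name

    wanted = to_bytes(table_name)
    return wanted in [to_bytes(n) for n in client.getTableNames()]
```

With an open client you would call table_exists(client, "python_test") before running any reads or writes.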

Now, save the following code as hbase_thrift_client.py:

import sys

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol

# Python bindings generated from Hbase.thrift (the Thrift1 API that
# "hbase-daemon.sh start thrift" serves), e.g. with: thrift --gen py Hbase.thrift
# The generated hbase package must be in the same directory or on PYTHONPATH.
from hbase import Hbase
from hbase.ttypes import Mutation, TScan

# --- Configuration ---
# Host and port of your HBase Thrift server
HOST = 'localhost'
PORT = 9090
# Name of the table and column family you want to use.
# Bindings from recent Thrift compilers expect bytes for names, row keys, and values.
TABLE_NAME = b'python_test'
COLUMN_FAMILY = b'cf'
# Recent Hbase.thrift versions take a trailing "attributes" map on most calls.
# Drop this argument if your generated client does not accept it.
NO_ATTRS = {}


def print_rows(rows):
    """Print a list of TRowResult objects (a row key plus a column -> TCell map)."""
    for r in rows:
        print(f"  Row: {r.row.decode('utf-8')}")
        for column, cell in (r.columns or {}).items():
            print(f"    {column.decode('utf-8')}: {cell.value.decode('utf-8')}")


def main():
    """
    Main function to demonstrate HBase Thrift operations from Python.
    """
    # 1. Setup the Thrift transport and protocol
    sock = TSocket.TSocket(HOST, PORT)
    # A server started with -nonblocking requires the framed transport.
    # For the default server mode, use TTransport.TBufferedTransport(sock) instead.
    # On a Kerberos-secured cluster, wrap the socket in a SASL transport, for example
    # with thrift_sasl (a sketch; sasl_client_factory is your own function returning
    # a configured sasl.Client, and you need a valid Kerberos ticket):
    #   transport = TSaslClientTransport(sasl_client_factory, 'GSSAPI', sock)
    transport = TTransport.TFramedTransport(sock)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = Hbase.Client(protocol)

    # 2. Open the transport
    try:
        transport.open()
        print("Successfully connected to HBase Thrift server.")
    except TTransport.TTransportException as e:
        print(f"Error connecting to HBase: {e}")
        sys.exit(1)

    try:
        # --- Perform Operations ---
        # 3. Put (Insert/Update) data
        print("\n--- Performing PUT operations ---")
        row_key1 = b'row1'
        # Each Mutation sets one 'family:qualifier' column to a value
        mutations1 = [
            Mutation(column=COLUMN_FAMILY + b':data1', value=b'value1_for_row1'),
            Mutation(column=COLUMN_FAMILY + b':data2', value=b'value2_for_row1'),
        ]
        client.mutateRow(TABLE_NAME, row_key1, mutations1, NO_ATTRS)
        print(f"Put data for row '{row_key1.decode()}'")

        row_key2 = b'row2'
        mutations2 = [Mutation(column=COLUMN_FAMILY + b':data1', value=b'value1_for_row2')]
        client.mutateRow(TABLE_NAME, row_key2, mutations2, NO_ATTRS)
        print(f"Put data for row '{row_key2.decode()}'")

        # 4. Get (Read) a single row
        print("\n--- Performing GET operation ---")
        # Returns a list of TRowResult; the list is empty if the row does not exist
        rows = client.getRowWithColumns(
            TABLE_NAME, row_key1, [COLUMN_FAMILY + b':data1'], NO_ATTRS)
        if rows:
            print(f"Got row '{row_key1.decode()}':")
            print_rows(rows)
        else:
            print(f"Row '{row_key1.decode()}' not found.")

        # 5. Scan (Read multiple rows)
        print("\n--- Performing SCAN operation ---")
        scan = TScan(
            startRow=row_key1,  # Optional: start scanning from this row (inclusive)
            stopRow=row_key2,   # Optional: stop scanning before this row (exclusive)
            columns=[COLUMN_FAMILY + b':data1'],  # Optional: columns to fetch
        )
        scanner = client.scannerOpenWithScan(TABLE_NAME, scan, NO_ATTRS)
        results = []
        try:
            while True:
                # scannerGetList() returns up to N TRowResult objects per call
                batch = client.scannerGetList(scanner, 100)
                if not batch:
                    break
                results.extend(batch)
        except Exception as e:
            print(f"Error during scan: {e}")
        finally:
            # It's important to close the scanner!
            client.scannerClose(scanner)
        print(f"Scan found {len(results)} rows:")
        print_rows(results)

        # 6. Delete data
        print("\n--- Performing DELETE operation ---")
        # Delete a specific column
        client.deleteAll(TABLE_NAME, row_key1, COLUMN_FAMILY + b':data2', NO_ATTRS)
        print(f"Deleted column 'cf:data2' from row '{row_key1.decode()}'")
        # Verify the delete
        rows_after_delete = client.getRow(TABLE_NAME, row_key1, NO_ATTRS)
        print(f"Row '{row_key1.decode()}' after delete:")
        if rows_after_delete:
            print_rows(rows_after_delete)
        else:
            print("  No columns found.")
    finally:
        # 7. Close the transport
        transport.close()
        print("\nConnection closed.")


if __name__ == '__main__':
    main()
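
The rows returned by the get and scan calls are objects with a row key and a columns map whose values carry the cell contents in a value field. With bindings generated by recent Thrift compilers under Python 3, keys and values typically arrive as bytes, which is awkward to pass around. A small converter like the one below may help; rows_to_dicts is a hypothetical helper, and it assumes every stored value is UTF-8 text.

```python
def rows_to_dicts(rows, encoding="utf-8"):
    """Convert row results into {row_key: {column: value}} using str throughout."""
    out = {}
    for r in rows:
        # columns may be None when the server filled sortedColumns instead
        columns = r.columns or {}
        out[r.row.decode(encoding)] = {
            name.decode(encoding): cell.value.decode(encoding)
            for name, cell in columns.items()
        }
    return out
```

If your table stores binary values (packed integers, serialized objects), decode those with struct or your serializer instead of treating them as text.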

How to Run the Code

  1. Make sure your HBase Thrift server is running (localhost:9090 by default).

  2. Make sure you have the python_test table created.

  3. Save the code above as hbase_thrift_client.py.

  4. Crucially, you need the generated Python stubs. Generate them from the Hbase.thrift file in the HBase source tree (under the hbase-thrift module) by running thrift --gen py Hbase.thrift, which writes a gen-py directory. You'll need to copy the hbase directory inside it (which contains Hbase.py, ttypes.py, etc.) into the same directory as your script or add gen-py to your PYTHONPATH.

  5. Run the script from your terminal:

    python hbase_thrift_client.py

You should see output showing the connection, the operations being performed, and the data being read.
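
If you would rather not copy the stubs or set PYTHONPATH (step 4), the script can add the stub directory to sys.path at startup. The sketch below uses only the standard library; add_stub_dir is a hypothetical helper, and the directory you pass in is whatever folder contains the generated hbase package (gen-py is the Thrift compiler's default output name).

```python
import importlib
import importlib.util
import os
import sys


def add_stub_dir(gen_py_dir):
    """Put gen_py_dir on sys.path and report whether the 'hbase' package is importable."""
    path = os.path.abspath(gen_py_dir)
    if path not in sys.path:
        sys.path.insert(0, path)
    # Make sure the import system notices directories created after startup
    importlib.invalidate_caches()
    return importlib.util.find_spec("hbase") is not None
```

Call add_stub_dir("gen-py") before the first from hbase import ... line; a False result means the directory does not contain an hbase package.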

Troubleshooting Common Issues

  • Thrift.TException: TTransportException: Could not connect to localhost:9090

    • Cause: The HBase Thrift server is not running, or it's running on a different host/port.
    • Solution: Check if the server is running with jps and netstat. Verify the HOST and PORT variables in your Python script.
  • ImportError: No module named hbase

    • Cause: The Python generated stubs (Hbase.py, ttypes.py, etc.) are not in your Python's search path.
    • Solution: Make sure you've copied the generated hbase directory into your project or added its parent directory to PYTHONPATH.
  • TTransportException: TSaslTransport: ... No authentication mechanism specified

    • Cause: You are trying to connect to a secure (SASL/Kerberos enabled) HBase cluster from a non-secure client.
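    • Solution: Wrap the socket in a SASL transport such as TSaslClientTransport from thrift-sasl, using the mechanism and service name the server expects (typically GSSAPI and hbase on a Kerberos cluster), and make sure your environment holds a valid Kerberos ticket if required.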
  • Performance is slow.

    • Cause: Thrift has some overhead. Making many individual get or put calls in a loop is very inefficient.
    • Solution: Use batch operations whenever possible. For bulk inserts, use client.mutateRows() in the Thrift1 API (the Thrift2 equivalent is client.putMultiple()). For reading, use client.scannerOpenWithScan() and iterate through the results instead of making individual get calls for each row.
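
As a rough illustration of the batching advice, the sketch below sends pre-built row batches to the server in fixed-size chunks, so each network round trip carries many rows. It assumes a Thrift1 client whose mutateRows(tableName, rowBatches, attributes) call accepts a list of BatchMutation objects; chunked and put_in_batches are hypothetical helper names, and the default batch size of 500 is an arbitrary starting point to tune.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def put_in_batches(client, table, row_batches, batch_size=500, attributes=None):
    """Send row batches through client.mutateRows() in chunks; return the call count."""
    calls = 0
    for chunk in chunked(row_batches, batch_size):
        # One round trip per chunk instead of one per row.
        # Drop the attributes argument if your generated client does not accept it.
        client.mutateRows(table, chunk, attributes or {})
        calls += 1
    return calls
```

Each element of row_batches would be a BatchMutation(row=..., mutations=[...]) built from the generated ttypes module. Very large chunks can hit message size limits on the server, so increase batch_size gradually.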