Oozie Python Action

Running a Python script from an Oozie workflow is a common way to execute custom logic as one step of a larger pipeline. This guide covers the basic concepts and then walks through a complete working example.


Core Concepts

When you use a Python action in Oozie, you are essentially telling Oozie to:

Prepare a Node: Oozie creates a working directory on the machine where the action will run (the YARN Application Master).
Upload Files: It uploads all the files you've specified (your Python script, any .py dependencies, data files, etc.) into that working directory.
Execute the Command: It runs a shell command on that machine. The default command for a python action is python <main-script.py> <args...>.
Capture Output: It captures the stdout and stderr of your script. If the script exits with a non-zero code, Oozie considers the action to have failed.
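The exit-code rule in the last step can be tried locally before involving Oozie at all. The sketch below is only a local illustration, not Oozie's launcher code: it runs a child Python process with `subprocess` and maps the return code to the OK/ERROR outcome described above. The helper name `run_and_classify` is hypothetical.

```python
import subprocess
import sys


def run_and_classify(code):
    """Run a Python snippet in a child process and classify the result:
    exit code 0 counts as OK, any other exit code counts as ERROR."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,
    )
    status = "OK" if result.returncode == 0 else "ERROR"
    return status, result.stdout, result.stderr


if __name__ == "__main__":
    print(run_and_classify("print('status=SUCCESS')"))
    print(run_and_classify("import sys; sys.exit(1)"))
```

A script that ends with `sys.exit(1)` is classified as ERROR here, which mirrors how a non-zero exit code sends the workflow down the error transition.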

Key Oozie XML Elements for a Python Action

Here are the main XML tags you'll use inside the <workflow-app>:

<action name="...">: Defines a single step in your workflow.
<python>: The specific action type for running a Python script.
- name: A descriptive name for the action.
- main: (Required) The path to your main Python script. This path is relative to the workflow's app.path.
- python: (Optional) The path to the Python interpreter on the cluster nodes (e.g., /usr/bin/python3). If omitted, it defaults to python.
<file>: (Crucial) Specifies a file that needs to be uploaded to the working directory. You must include your main script and any other .py files it imports.
<arg>: Passes a command-line argument to your Python script.
<capture-output>: Captures the stdout of your script and makes it available to subsequent actions via ${wf:actionData('action_name')}.
<ok to="...">: The name of the workflow node to transition to if the action succeeds (exit code 0).
<error to="...">: The name of the workflow node to transition to if the action fails (exit code non-zero).

Complete Step-by-Step Example

Let's create a simple workflow that runs a Python script to process a file.

Step 1: The Python Script (`process_data.py`)

This script will take an input file and an output directory as arguments, read the file, and write a processed version. It will also print a value to stdout that we'll capture.


Directory Structure:

my_oozie_project/
├── lib/
│   └── my_helper.py      # A dependency
├── process_data.py       # The main Python script
└── workflow.xml          # The Oozie workflow definition

lib/my_helper.py

# A simple helper module to be imported
def process_line(line):
    """A simple processing function."""
    return line.strip().upper()
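Because the helper is a pure function, it can be checked locally before anything is uploaded. The snippet below is a minimal sketch that repeats the one-line helper so it runs on its own; the sample lines are made up for illustration.

```python
def process_line(line):
    """Same logic as lib/my_helper.py: trim whitespace, then upper-case."""
    return line.strip().upper()


# Sample lines with and without surrounding whitespace or a trailing newline
sample_lines = ["hello world\n", "  foo bar  \n", "baz qux"]
processed = [process_line(line) for line in sample_lines]
print(processed)  # ['HELLO WORLD', 'FOO BAR', 'BAZ QUX']
```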

process_data.py

#!/usr/bin/env python3
import os
import sys

# This import works because Oozie puts 'my_helper.py' in the same directory
from my_helper import process_line


def main():
    # Oozie passes arguments via sys.argv; sys.argv[0] is the script name itself
    if len(sys.argv) != 3:
        print(f"Usage: {sys.argv[0]} <input_file> <output_dir>", file=sys.stderr)
        sys.exit(2)
    input_file_path = sys.argv[1]
    output_dir = sys.argv[2]

    # Log to stderr so that stdout carries only the key=value lines meant for <capture-output>
    print(f"INFO: Starting processing. Input: {input_file_path}, Output: {output_dir}", file=sys.stderr)

    # Ensure the output directory exists
    os.makedirs(output_dir, exist_ok=True)
    output_file_path = os.path.join(output_dir, "processed_data.txt")

    try:
        # open() works on the local filesystem of the node running the action,
        # so both paths must be reachable there
        with open(input_file_path, 'r') as infile, open(output_file_path, 'w') as outfile:
            for line in infile:
                outfile.write(process_line(line) + "\n")
        # Print key-value pairs to stdout. These will be captured by <capture-output>
        print("status=SUCCESS")
        print(f"output_file={output_file_path}")
    except Exception as e:
        # Print an error message to stderr
        print(f"ERROR: Failed to process file. {e}", file=sys.stderr)
        sys.exit(1)  # Exit with a non-zero code to fail the Oozie action


if __name__ == "__main__":
    main()

Step 2: The Oozie Workflow (`workflow.xml`)

This XML file defines the workflow. It tells Oozie to run our Python script.


workflow.xml

<workflow-app name="python-action-demo" xmlns="uri:oozie:workflow:0.5">
    <start to="run-python-script"/>
    <!-- This is our Python action -->
    <action name="run-python-script">
        <python>
            <name>Python Data Processor</name>
            <main>process_data.py</main>
            <!-- Optional: specify a python interpreter if not 'python' -->
            <!-- <python>/usr/bin/python3</python> -->
            <!-- Files to be uploaded to the working directory -->
            <file>process_data.py</file>
            <file>lib/my_helper.py</file>
            <!-- Command-line arguments for the script -->
            <arg>${inputData}</arg>         <!-- Will be passed as the first argument -->
            <arg>${outputDir}</arg>         <!-- Will be passed as the second argument -->
            <!-- Capture the output of the script -->
            <capture-output/>
        </python>
        <!-- Define what to do on success or failure -->
        <ok to="end-workflow"/>
        <error to="fail-workflow"/>
    </action>
    <!-- Failure node: stops the workflow and reports the error -->
    <kill name="fail-workflow">
        <message>Python action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <!-- Success end node -->
    <end name="end-workflow"/>
</workflow-app>

Note: The workflow has one <kill> node for the failure path and one <end> node for the success path. Node names must be unique within a workflow.
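A quick structural check before uploading can catch a malformed file or a node name that appears twice. The following is a minimal sketch using only the standard library; it is not a substitute for Oozie's own schema validation, and the function names and the embedded sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Top-level workflow elements that carry a "name" attribute
NODE_TAGS = {"action", "kill", "end", "decision", "fork", "join"}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def duplicate_node_names(workflow_xml):
    """Return the node names that occur more than once, sorted alphabetically."""
    root = ET.fromstring(workflow_xml)  # raises ParseError if the XML is malformed
    names = [
        child.get("name")
        for child in root
        if local_name(child.tag) in NODE_TAGS and child.get("name")
    ]
    return sorted(name for name, count in Counter(names).items() if count > 1)


# Deliberately broken sample: the kill node "fail" is declared twice
sample = """<workflow-app name="demo" xmlns="uri:oozie:workflow:0.5">
    <start to="step"/>
    <kill name="fail"><message>failed</message></kill>
    <end name="done"/>
    <kill name="fail"><message>failed</message></kill>
</workflow-app>"""

print(duplicate_node_names(sample))  # ['fail']
```

An empty list means every named node is unique; any entry in the list points to a node that needs to be renamed or removed.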

Step 3: Prepare the Job Directory on HDFS

Oozie needs to find your workflow definition and all its dependencies (like the Python files) in HDFS.

Create a directory on HDFS for your application:

hdfs dfs -mkdir -p /user/<your-username>/oozie_app/python_demo

Upload the files:

# Upload the workflow definition
hdfs dfs -put workflow.xml /user/<your-username>/oozie_app/python_demo/
# Create a 'lib' directory in HDFS and upload the Python files
hdfs dfs -mkdir -p /user/<your-username>/oozie_app/python_demo/lib
hdfs dfs -put process_data.py /user/<your-username>/oozie_app/python_demo/
hdfs dfs -put lib/my_helper.py /user/<your-username>/oozie_app/python_demo/lib/

Step 4: Prepare Input Data

Let's create a sample input file and upload it.

Create a local input file:

echo -e "hello world\nfoo bar\nbaz qux" > my_input.txt

Upload it to HDFS:

hdfs dfs -mkdir -p /user/<your-username>/oozie_data
hdfs dfs -put my_input.txt /user/<your-username>/oozie_data/

Step 5: Submit the Oozie Job

You'll need a properties file to define the variables and the queue.

job.properties

# HDFS path of the workflow application directory (contains workflow.xml)
oozie.wf.application.path=hdfs:///user/<your-username>/oozie_app/python_demo
# Input data file path
inputData=/user/<your-username>/oozie_data/my_input.txt
# Output directory path (will be created by the script)
outputDir=/user/<your-username>/oozie_data/output
# Make the Oozie system share library available to the job
oozie.use.system.libpath=true
# Queue to submit the job to
oozie.job.queue.name=default
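The properties file supplies the values for the `${inputData}` and `${outputDir}` placeholders in the workflow. The sketch below imitates only that plain variable substitution in Python, to make the mapping visible; it is not Oozie's expression-language engine, and the function name `resolve` and the user name `alice` are assumptions for illustration.

```python
import re


def resolve(template, props):
    """Replace each ${name} with props['name']; leave unknown names untouched."""
    return re.sub(
        r"\$\{(\w+)\}",
        lambda match: props.get(match.group(1), match.group(0)),
        template,
    )


# Hypothetical values, following the layout of job.properties
props = {
    "inputData": "/user/alice/oozie_data/my_input.txt",
    "outputDir": "/user/alice/oozie_data/output",
}

args = [resolve(arg, props) for arg in ["${inputData}", "${outputDir}"]]
print(args)
```

Placeholders that have no matching property are left as they are, which makes a missing entry in the properties file easy to spot in the output.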

Submit the job:

oozie job -oozie http://<your-oozie-server>:11000/oozie -config job.properties -run

Step 6: Check the Results

Check the Oozie job status:
```
oozie job -info <job-id>
```

Check the output directory in HDFS:

hdfs dfs -cat /user/<your-username>/oozie_data/output/processed_data.txt

You should see:

HELLO WORLD
FOO BAR
BAZ QUX

Check the captured output: The status=SUCCESS and output_file=... lines printed by the script are captured by <capture-output>. They appear in the action details in the Oozie web console, and later actions in the same workflow can read them with an expression such as ${wf:actionData('run-python-script')['status']}.
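Captured output is exposed as key-value pairs, which is why the script prints lines of the form `key=value`. The sketch below shows, in simplified form, how such lines turn into a lookup table; it handles only plain `key=value` lines rather than the full Java properties syntax, and the function name and the sample path are hypothetical.

```python
def parse_captured_output(stdout_text):
    """Collect key=value lines into a dict and ignore lines without '='."""
    data = {}
    for line in stdout_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip():
            data[key.strip()] = value.strip()
    return data


# Hypothetical stdout of the script from Step 1
stdout_text = "status=SUCCESS\noutput_file=/tmp/out/processed_data.txt\n"
captured = parse_captured_output(stdout_text)
print(captured["status"])  # SUCCESS
```

Keeping stdout limited to such lines, and sending log messages to stderr, avoids stray entries in the captured data.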

Advanced Topics & Best Practices

Using Virtual Environments (`venv`)

If your script has complex dependencies (e.g., pandas, numpy), you should use a Python virtual environment.

Locally:
- Create a virtual environment: python3 -m venv my_venv
- Activate it: source my_venv/bin/activate
- Install your packages: pip install pandas
- Zip the entire my_venv directory: zip -r venv.zip my_venv/

In Oozie:

Upload venv.zip to HDFS.
Use a <shell> action that unzips the archive and runs the Python executable from inside the environment:

<action name="run-python-with-venv">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>bash</exec>
        <argument>-c</argument>
        <argument>
            set -x
            # Unzip the venv
            unzip -q venv.zip
            # Run the script using the python from the venv
            ./my_venv/bin/python process_data.py "${inputData}" "${outputDir}"
        </argument>
        <file>process_data.py</file>
        <file>venv.zip#venv.zip</file> <!-- The name after '#' is the link name in the working directory -->
        ...
    </shell>
    ...
</action>

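Note: This example uses a <shell> action instead of a <python> action because it gives you more control over the execution environment. The <shell> action is part of the standard Oozie action set, so this variant also works on clusters where a <python> action type is not available.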

Using Conda Environments

For more complex dependency management, Conda is an excellent choice. The process is similar to using venv: you would package the Conda environment (e.g., as a .tar.gz file) and then extract and activate it within a <shell> action before running your Python script.
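The package-then-extract round trip described above can be sketched with Python's standard `tarfile` module. This is only an illustration of the archive handling, under the assumption that the environment is a plain directory tree; it does not build or activate a real Conda environment, and the directory name `my_env` and both function names are hypothetical.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_directory(src_dir, archive_path):
    """Write src_dir into a gzip-compressed tar archive."""
    src_dir = Path(src_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir, arcname=src_dir.name)


def unpack_archive(archive_path, dest_dir):
    """Extract the archive into dest_dir (use only with archives you built)."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a real environment directory
    env_dir = Path(tmp) / "my_env"
    (env_dir / "bin").mkdir(parents=True)
    (env_dir / "bin" / "python").write_text("placeholder\n")

    archive = Path(tmp) / "my_env.tar.gz"
    pack_directory(env_dir, archive)

    target = Path(tmp) / "extracted"
    target.mkdir()
    unpack_archive(archive, target)

    # The interpreter path inside the extracted tree is what the action would call
    restored = (target / "my_env" / "bin" / "python").exists()
    print(restored)
```

In a workflow, the extraction step would run inside the <shell> action, followed by a call to the interpreter inside the extracted directory.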
