杰瑞科技汇

python stanford ner

Using Stanford's Named Entity Recognizer (NER) from Python is a common task. The process has evolved over time: the Stanford CoreNLP server is the recommended modern approach for production systems, while the older standalone Stanford NER tagger is simpler for basic scripts.

I'll guide you through both methods, starting with the recommended one.


Method 1: The Modern & Recommended Way - Stanford CoreNLP Server

This method involves running the Stanford CoreNLP server as a background process and then sending text to it via a simple HTTP request. This is ideal for production applications because it's fast, scalable, and decouples the Java processing from your Python code.

Step 1: Download and Run the CoreNLP Server

  1. Download the CoreNLP package: Go to the Stanford CoreNLP download page and download the latest version. You'll get a .zip file.

  2. Unzip the file: Unzip the downloaded file to a convenient location, for example, C:\stanford-corenlp-4.5.5 (Windows) or /home/user/stanford-corenlp-4.5.5 (Linux/macOS).

  3. Start the server: The main download already bundles the English models (models for other languages are separate downloads), so nothing extra is needed for English. Run this command from your terminal inside the unzipped directory:

    # Navigate to the CoreNLP directory
    cd /path/to/stanford-corenlp-4.5.5
    # Start the server on port 9000 (the timeout is given in milliseconds)
    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

    The -mx4g part caps the Java heap at 4 GB. Raise it if you run into out-of-memory errors. Java must be installed for this command to work.

  4. Verify the server is running: Open your web browser and go to http://localhost:9000. You should see a page with a text box and some information about the server. If you see this, the server is running correctly.
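
The browser check in step 4 can also be scripted. The following is a minimal sketch using only the standard library, assuming the server listens on localhost:9000 as in the command above. The helper name is_server_up is illustrative and not part of CoreNLP.

```python
import urllib.error
import urllib.request


def is_server_up(url="http://localhost:9000", timeout=2.0):
    """Return True if an HTTP server answers at `url` with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("CoreNLP server reachable:", is_server_up())
```

The first request after startup can be slow because the server loads its models lazily, so a short timeout here only tells you that the port answers, not that the pipeline is warmed up.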

Step 2: Python Code to Query the Server

Now, you can write a simple Python script to send text to the server and get the NER results. We'll use the popular requests library.

First, install it if you haven't already:

pip install requests

Now, here's the Python script:

import json
import requests
# The URL of the running Stanford CoreNLP server
url = "http://localhost:9000/"
# Server properties: which annotators to run and the output format
# requests URL-encodes this JSON string into the ?properties=... query parameter
properties = {"annotators": "tokenize,ssplit,ner", "outputFormat": "json"}
# The text you want to analyze
text_to_analyze = "Barack Obama was born in Hawaii. He was the 44th President of the United States."
# Send the POST request
# The server expects the raw text in the request body, encoded here as UTF-8
response = requests.post(
    url,
    params={"properties": json.dumps(properties)},
    data=text_to_analyze.encode("utf-8"),
    timeout=30,
)
# Check if the request was successful
if response.status_code == 200:
    # The response is in JSON format
    result = response.json()
    # The result contains a list of sentences
    for sentence in result['sentences']:
        # Each sentence has a list of tokens (words)
        for token in sentence['tokens']:
            # Each token has the original text, its character offsets, and NER tag
            original_text = token['originalText']
            ner_tag = token['ner']
            print(f"Word: {original_text:<15} NER Tag: {ner_tag}")
else:
    print(f"Error: {response.status_code}")
    print(response.text)
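
The properties query parameter that the server reads is simply URL-encoded JSON. As a rough sketch of how such a URL can be assembled and decoded with the standard library (the function name build_corenlp_url is an assumption made for illustration):

```python
import json
from urllib.parse import quote, unquote


def build_corenlp_url(base_url, annotators, output_format="json"):
    """Return base_url with a URL-encoded `properties` query parameter."""
    properties = {"annotators": ",".join(annotators), "outputFormat": output_format}
    return f"{base_url}/?properties={quote(json.dumps(properties), safe='')}"


url = build_corenlp_url("http://localhost:9000", ["tokenize", "ssplit", "ner"])
print(url)
# Decoding the query value recovers the plain JSON that the server parses.
print(unquote(url.split("properties=", 1)[1]))
```

Building the parameter from a dictionary makes it easier to add or remove annotators than editing a percent-encoded string by hand.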

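Output of the script (illustrative: the exact tags depend on the CoreNLP version and models, and recent releases with fine-grained NER enabled may report more specific labels such as STATE_OR_PROVINCE, COUNTRY, TITLE or ORDINAL):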

Word: Barack          NER Tag: PERSON
Word: Obama           NER Tag: PERSON
Word: was             NER Tag: O
Word: born            NER Tag: O
Word: in              NER Tag: O
Word: Hawaii          NER Tag: LOCATION
Word: .               NER Tag: O
Word: He              NER Tag: O
Word: was             NER Tag: O
Word: the             NER Tag: O
Word: 44th            NER Tag: O
Word: President       NER Tag: O
Word: of              NER Tag: O
Word: the             NER Tag: O
Word: United          NER Tag: LOCATION
Word: States          NER Tag: LOCATION
Word: .               NER Tag: O
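
Per-token tags like the ones above are often more useful once adjacent tokens with the same tag are merged into whole entities. A minimal sketch, assuming the input is a list of (word, tag) pairs and that "O" marks tokens outside any entity; group_entities is a hypothetical helper, and it will also merge two distinct entities of the same type if they happen to be adjacent:

```python
from itertools import groupby


def group_entities(tagged_tokens, outside_tag="O"):
    """Merge runs of adjacent tokens sharing a tag into (text, tag) entities."""
    entities = []
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != outside_tag:
            entities.append((" ".join(word for word, _ in run), tag))
    return entities


tagged = [("Barack", "PERSON"), ("Obama", "PERSON"), ("was", "O"),
          ("born", "O"), ("in", "O"), ("Hawaii", "LOCATION"), (".", "O")]
print(group_entities(tagged))
# → [('Barack Obama', 'PERSON'), ('Hawaii', 'LOCATION')]
```

With the server response, the input pairs can be built per sentence as [(t['originalText'], t['ner']) for t in sentence['tokens']].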

Method 2: The Classic Way - Standalone Stanford NER Tagger

This method uses the StanfordNERTagger class from the NLTK Python package, which acts as a wrapper for the older, standalone stanford-ner.jar file. It's simpler for a one-off script but less flexible for production. Recent NLTK releases flag this wrapper as deprecated in favor of the CoreNLP server, which is one more reason to prefer Method 1 for new projects.

Step 1: Download the Stanford NER Package

  1. Download the NER package: Go to the Stanford NER download page and download the "Full Stanford NER distribution".
  2. Unzip the file. You will find a stanford-ner.jar file and a folder named classifiers, which contains the pre-trained model files (e.g., english.muc.7class.distsim.crf.ser.gz).

Step 2: Set Up Your Python Environment

You need two things:

  1. The nltk Python library.
  2. A Java runtime installed on your system and available on your PATH (a full JDK also works), as the library runs the Java JAR file.

Install the Python library:

pip install nltk

Step 3: Python Code to Use the Tagger

You need to tell the StanfordNERTagger class where to find the model file you want to use and the stanford-ner.jar file.

import re
from nltk.tag import StanfordNERTagger
# Path to the stanford-ner.jar file
# IMPORTANT: Change this path to where you unzipped the file
jar_path = '/path/to/stanford-ner.jar'
# Path to the pre-trained model file
# IMPORTANT: Change this path to where you unzipped the file
model_path = '/path/to/classifiers/english.muc.7class.distsim.crf.ser.gz'
# Create the tagger: the model file comes first, then the JAR
# Each call to tag() launches a Java process, so it can take a few seconds
tagger = StanfordNERTagger(model_path, jar_path, encoding='utf8')
text_to_analyze = "Barack Obama was born in Hawaii. He was the 44th President of the United States."
# The tagger expects a list of tokens, not a raw string
# This simple regex separates words from punctuation; use a proper tokenizer for serious work
tokens = re.findall(r"\w+|[^\w\s]", text_to_analyze)
# The output is a list of (token, tag) tuples
ner_tags = tagger.tag(tokens)
# Print the results
for token, tag in ner_tags:
    print(f"Word: {token:<15} NER Tag: {tag}")

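Output of the script (illustrative: the label names depend on the model file. The 7-class MUC model is documented with labels such as PERSON, LOCATION and ORGANIZATION, so expect those in place of the CoNLL-style I-PER, I-LOC and I-ORG shown here, and the tags may also differ from the CoNLL server's because the models vary):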

Word: Barack          NER Tag: I-PER
Word: Obama           NER Tag: I-PER
Word: was             NER Tag: O
Word: born            NER Tag: O
Word: in              NER Tag: O
Word: Hawaii          NER Tag: I-LOC
Word: .               NER Tag: O
Word: He              NER Tag: O
Word: was             NER Tag: O
Word: the             NER Tag: O
Word: 44th            NER Tag: O
Word: President       NER Tag: O
Word: of              NER Tag: O
Word: the             NER Tag: O
Word: United          NER Tag: I-ORG
Word: States          NER Tag: I-ORG
Word: .               NER Tag: O
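
Because the two methods can report different label schemes (for example PERSON versus CoNLL-style I-PER, depending on the model), it can help to normalize tags before comparing results. A small sketch under the assumption that prefixed short labels should map onto the long names; the mapping table and the normalize_tag helper are hypothetical and should be adjusted to the models actually in use:

```python
# Hypothetical mapping from CoNLL-style short labels to the long label names.
SHORT_TO_LONG = {"PER": "PERSON", "LOC": "LOCATION", "ORG": "ORGANIZATION", "MISC": "MISC"}


def normalize_tag(tag):
    """Strip a B-/I- prefix and expand short labels; leave other tags as-is."""
    if tag[:2] in ("B-", "I-"):
        tag = tag[2:]
    return SHORT_TO_LONG.get(tag, tag)


for raw in ["I-PER", "I-LOC", "I-ORG", "PERSON", "O"]:
    print(f"{raw:<8} -> {normalize_tag(raw)}")
```

Tags that are already in long form, and the outside tag O, pass through unchanged, so the same function can be applied to the output of either method.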

Comparison and Recommendation

| Feature | Stanford CoreNLP Server (Method 1) | Standalone Stanford NER Tagger (Method 2) |
| --- | --- | --- |
| How it works | Client-server model. Python sends HTTP requests. | Python library directly calls a Java JAR file. |
| Scalability | Good. Multiple Python scripts can connect to one server. | Poor. Each script starts its own Java process, which is resource-intensive. |
| Performance | Fast for repeated requests, since the JVM and models are loaded once. | Slower. Starting the JVM for each script is costly. |
| Ease of use | Simple once the server is running. The client code is short. | Simple for a single script, but requires managing paths to JAR and model files. |
| Flexibility | High. Can perform many NLP tasks (POS tagging, parsing, etc.) by changing the annotators in the request properties. | Low. Primarily for NER. Other tasks require different wrappers. |
| Best for | Production applications, web services, and any repeated use. | Quick experiments, learning, or simple, one-off scripts. |

Final Recommendation:

For any serious project, use Method 1 (Stanford CoreNLP Server). The setup is a one-time effort, and because the JVM and models are loaded once and shared by every client, repeated requests are much cheaper. Use Method 2 only if you need a quick test and don't want to run a server.
