
How to create a custom LangExtract provider plugin from scratch

Purpose

I wanted to use Claude with LangExtract, but there’s no built-in Claude provider. The official provider documentation runs to 580+ lines, mixes concepts with implementation details, and buries the entry point configuration in the middle. I couldn’t find a complete, working, from-scratch example of a provider that is ready to publish.

So I figured out how to create a custom provider plugin by reading the source code and testing locally. This post shows the complete process.

Environment

  • Python 3.11
  • langextract 1.0.0+
  • anthropic 0.18.0+

The Provider System

LangExtract uses Python entry points to discover providers dynamically. When you install a provider package, it registers itself through pyproject.toml, and LangExtract can use it immediately.

Here’s how it works:

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  User Code  │ ──→ │ LangExtract  │ ──→ │  Discovery  │
│             │     │   Registry   │     │ (Entry Pt.) │
└─────────────┘     └──────────────┘     └─────────────┘
                                         ┌─────────────┐
                                         │ Your Plugin │
                                         └─────────────┘

The key parts are:

  • Entry point: Registers your provider in pyproject.toml
  • Registry pattern: Decorator that matches model IDs
  • BaseLanguageModel: Interface you implement

Step 1: Create Package Structure

I created this directory layout:

langextract-claude/
├── pyproject.toml          # Entry points + metadata
├── README.md
├── LICENSE
└── langextract_claude/
    ├── __init__.py         # Exports provider
    └── provider.py         # Main implementation

The structure is minimal. You only need pyproject.toml for metadata and entry points, plus the provider implementation.

Step 2: Configure pyproject.toml

Here’s the complete configuration file:

pyproject.toml
[build-system]
requires = ["setuptools>=61.0", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "langextract-claude"
version = "0.1.0"
description = "Claude provider for LangExtract"
dependencies = [
    "langextract>=1.0.0",
    "anthropic>=0.18.0"
]

# CRITICAL: This registers your provider
[project.entry-points."langextract.providers"]
claude = "langextract_claude:ClaudeLanguageModel"

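The most important part is the [project.entry-points."langextract.providers"] section. It tells LangExtract where the provider lives:

  • The entry point named claude points at the ClaudeLanguageModel class in the langextract_claude package
  • When someone uses model_id="claude-3-5-sonnet", LangExtract loads that class, and the patterns registered in Step 3 match the model ID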

I missed this at first. I thought just importing the module would work, but LangExtract needs the entry point to discover providers automatically.
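
The value on the right-hand side of an entry point is a "module:attribute" string. As a small sketch of how such a string gets resolved, the example below builds an entry point by hand with the standard library's `importlib.metadata.EntryPoint`. The target `json:dumps` and the group name `demo.group` are stand-ins chosen so the snippet runs without any plugin installed.

```python
import json
from importlib.metadata import EntryPoint

# Stand-in entry point that targets a standard-library function.
ep = EntryPoint(name="demo", value="json:dumps", group="demo.group")

# The part before the colon is the module, the part after is the attribute.
print(ep.module, ep.attr)

# load() imports the module and returns the named attribute.
loaded = ep.load()
print(loaded is json.dumps)
```

For the plugin in this post, the same mechanism would import langextract_claude and return its ClaudeLanguageModel attribute, which is why Step 4 exports the class from the package root.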

Step 3: Implement Provider

Here’s the minimal provider implementation:

langextract_claude/provider.py
import os
from typing import Generator, List, Optional

import langextract as lx
from langextract.inference import ScoredOutput


@lx.providers.registry.register(
    r'^claude',     # Matches: claude-3-5-sonnet
    r'^anthropic',  # Matches: anthropic-custom
    priority=10     # Higher than built-ins
)
class ClaudeLanguageModel(lx.inference.BaseLanguageModel):
    """Claude AI provider for LangExtract."""

    def __init__(
        self,
        model_id: str,
        api_key: Optional[str] = None,
        **kwargs
    ):
        super().__init__()
        self.model_id = model_id

        # API key: check the Claude-specific variable, then the generic one
        self.api_key = api_key or os.environ.get(
            'ANTHROPIC_API_KEY',
            os.environ.get('LANGEXTRACT_API_KEY')
        )
        if not self.api_key:
            raise ValueError(
                "ANTHROPIC_API_KEY or LANGEXTRACT_API_KEY required"
            )

        # Initialize Anthropic client
        from anthropic import Anthropic
        self.client = Anthropic(api_key=self.api_key)

        # Claude-specific settings
        self.max_tokens = kwargs.get('max_tokens', 4096)
        self.temperature = kwargs.get('temperature', 0.0)

    def infer(
        self,
        batch_prompts: List[str],
        **kwargs
    ) -> Generator[List[ScoredOutput], None, None]:
        """Run inference on a batch of prompts."""
        for prompt in batch_prompts:
            try:
                # Call Claude API
                response = self.client.messages.create(
                    model=self.model_id,
                    max_tokens=self.max_tokens,
                    temperature=self.temperature,
                    messages=[{"role": "user", "content": prompt}]
                )
                # Take the text of the first content block
                output_text = response.content[0].text
                # LangExtract expects ScoredOutput
                yield [ScoredOutput(
                    score=1.0,
                    output=output_text
                )]
            except Exception as e:
                # Wrap errors for LangExtract
                raise lx.InferenceError(
                    f"Claude API error: {e}"
                ) from e

The key parts:

  • @lx.providers.registry.register(): Registers patterns that match model IDs
  • infer() method: Must yield List[ScoredOutput] for each prompt
  • Error handling: Wrap exceptions in lx.InferenceError
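
The infer() contract can be exercised without an API key. The sketch below is an offline stand-in, not LangExtract code: `FakeScoredOutput` and `EchoModel` are hypothetical names, and the only assumption carried over from the provider above is that each prompt yields one list of outputs with `score` and `output` fields.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence


@dataclass
class FakeScoredOutput:
    """Stand-in for ScoredOutput (assumption: it carries score and output)."""
    score: float
    output: str


class EchoModel:
    """Hypothetical offline provider that mimics the infer() contract."""

    def infer(
        self, batch_prompts: Sequence[str], **kwargs
    ) -> Iterator[List[FakeScoredOutput]]:
        for prompt in batch_prompts:
            # One list per prompt, yielded in the same order as the batch.
            yield [FakeScoredOutput(score=1.0, output=f"echo: {prompt}")]


results = list(EchoModel().infer(["first", "second"]))
print(len(results))
print(results[1][0].output)
```

A stub like this is handy in unit tests: it checks that the generator yields exactly one list per prompt before any real API call is involved.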

The regex patterns r'^claude' and r'^anthropic' mean any model ID starting with “claude” or “anthropic” will use this provider.
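
A quick way to check what those patterns accept is to run them through Python's `re` module directly. This sketch assumes the registry matches patterns against the start of the model ID, as the `^` anchors suggest; `matches_provider` is a hypothetical helper, not part of LangExtract.

```python
import re

PATTERNS = [r"^claude", r"^anthropic"]


def matches_provider(model_id: str) -> bool:
    """Return True if any pattern matches the start of model_id."""
    return any(re.match(p, model_id) for p in PATTERNS)


for model_id in ["claude-3-5-sonnet", "anthropic-custom", "gpt-4o", "my-claude"]:
    print(model_id, matches_provider(model_id))
```

Note that the match is case-sensitive and anchored, so IDs such as "Claude-3" or "my-claude" would not be routed to this provider.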

Step 4: Export from __init__.py

I made sure to export the provider class:

langextract_claude/__init__.py
from langextract_claude.provider import ClaudeLanguageModel
__all__ = ["ClaudeLanguageModel"]

Step 5: Test Locally

First, install in editable mode:

Terminal window
cd langextract-claude
pip install -e .

Then verify the provider is registered:

import langextract as lx
print('Registered providers:', lx.providers.registry.list_entries())

I got this output:

Registered providers: [('openai', 0), ('claude', 10), ('anthropic', 10)]

You can see that my claude provider shows up with priority 10, which is higher than the built-in openai provider (priority 0).
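
To illustrate why priority matters, here is a toy registry written from scratch. It is a simplified sketch of the decorator-plus-priority idea, assuming the highest priority wins among matching patterns; it is not LangExtract's implementation, and `register`, `resolve`, `BuiltinStub`, and `PluginStub` are hypothetical names.

```python
import re
from typing import Callable, List, Tuple, Type

# Each entry: (compiled patterns, provider class, priority).
_ENTRIES: List[Tuple[List[re.Pattern], Type, int]] = []


def register(*patterns: str, priority: int = 0) -> Callable[[Type], Type]:
    """Class decorator that records patterns and priority for a provider."""
    def decorator(cls: Type) -> Type:
        _ENTRIES.append(([re.compile(p) for p in patterns], cls, priority))
        return cls
    return decorator


def resolve(model_id: str) -> Type:
    """Return the highest-priority provider whose pattern matches model_id."""
    candidates = [
        (priority, cls)
        for compiled, cls, priority in _ENTRIES
        if any(p.match(model_id) for p in compiled)
    ]
    if not candidates:
        raise ValueError(f"No provider registered for {model_id!r}")
    return max(candidates, key=lambda item: item[0])[1]


@register(r"^claude", priority=0)
class BuiltinStub:
    """Hypothetical lower-priority provider."""


@register(r"^claude", r"^anthropic", priority=10)
class PluginStub:
    """Hypothetical plugin provider that should win on priority."""


print(resolve("claude-3-5-sonnet").__name__)
```

Both stubs match "claude-3-5-sonnet", and the one registered with priority 10 is selected.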

Step 6: Test Extraction

Here’s a test script:

test_claude.py
import langextract as lx

result = lx.extract(
    text_or_documents="Jane Doe, age 32, prescribed Lisinopril 10mg",
    prompt_description="Extract patient name, age, and medications",
    examples=[
        lx.data.ExampleData(
            text="John Smith, 45, takes Metformin 500mg",
            extractions=[
                lx.data.Extraction(
                    extraction_class="patient",
                    extraction_text="John Smith",
                    attributes={"age": "45"}
                ),
                lx.data.Extraction(
                    extraction_class="medication",
                    extraction_text="Metformin 500mg",
                    attributes={}
                )
            ]
        )
    ],
    model_id="claude-3-5-sonnet-20241022"
)
print("Extractions:", result.extractions)

When I ran this, it worked:

Extractions: [Extraction(extraction_class='patient', extraction_text='Jane Doe', attributes={'age': '32'}), Extraction(extraction_class='medication', extraction_text='Lisinopril 10mg', attributes={})]

Step 7: Build and Publish

To publish to PyPI:

Terminal window
pip install build twine
python -m build
ls dist/
# dist/langextract-claude-0.1.0.tar.gz
# dist/langextract_claude-0.1.0-py3-none-any.whl
twine upload dist/*

Now users can install it:

Terminal window
pip install langextract-claude

Common Issues

I ran into some issues while testing:

Issue: Provider not discovered

I checked that the entry point name was lowercase (claude, not Claude). Then I verified that the installed package ships its entry point metadata (the file is named entry_points.txt):

Terminal window
pip show -f langextract-claude | grep entry_points
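
The same check can be done from Python, which also shows the entry point's name and target rather than just confirming that the metadata file exists. This is a sketch using the standard library's `importlib.metadata`; `provider_entry_points` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, distribution


def provider_entry_points(
    dist_name: str, group: str = "langextract.providers"
) -> list[tuple[str, str]]:
    """Return (name, target) pairs a distribution registers under group."""
    try:
        dist = distribution(dist_name)
    except PackageNotFoundError:
        # The package is not installed in this environment.
        return []
    return [(ep.name, ep.value) for ep in dist.entry_points if ep.group == group]


if __name__ == "__main__":
    print(provider_entry_points("langextract-claude"))
```

An empty list means either the package is not installed in the active environment or its pyproject.toml does not declare an entry point in that group.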

Issue: Import error for dependencies

I had anthropic in dev-dependencies at first. I moved it to dependencies because users need it to use the provider.
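
One way to confirm which dependencies get installed for users is to read the installed package's declared requirements. The sketch below uses `importlib.metadata.requires` and assumes optional dependencies are declared as extras (which show up with an `extra ==` marker); `runtime_requirements` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, requires


def runtime_requirements(dist_name: str) -> list[str]:
    """List requirements that are installed unconditionally with dist_name."""
    try:
        declared = requires(dist_name) or []
    except PackageNotFoundError:
        # The package is not installed in this environment.
        return []
    # Requirements gated behind an extra carry an 'extra ==' marker.
    return [req for req in declared if "extra ==" not in req]


if __name__ == "__main__":
    print(runtime_requirements("langextract-claude"))
```

If anthropic is missing from the printed list, a plain pip install of the plugin would not pull it in.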

Issue: Pattern doesn’t match

I listed the registry entries from Python to debug:

import langextract as lx
lx.providers.registry.list_entries()
# Should show your new patterns

Summary

In this post, I showed how to create a custom LangExtract provider plugin from scratch. The key points are configuring entry points in pyproject.toml and implementing the BaseLanguageModel.infer() method to yield ScoredOutput objects.

Once you understand the entry point system and registry pattern, you can add support for any LLM (Mistral, local models, custom APIs) to LangExtract.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.

