How to create a custom LangExtract provider plugin from scratch
Purpose
I wanted to use Claude with LangExtract, but there’s no built-in Claude provider. The official provider documentation runs to 580+ lines that mix concepts with implementation details, and the entry point configuration is buried in the middle. I couldn’t find a complete working example that goes from scratch to a published package.
So I figured out how to create a custom provider plugin by reading the source code and testing locally. This post shows the complete process.
Environment
- Python 3.11
- langextract 1.0.0+
- anthropic 0.18.0+
The Provider System
LangExtract uses Python entry points to discover providers dynamically. When you install a provider package, it registers itself through pyproject.toml, and LangExtract can use it immediately.
Here’s how it works:
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  User Code  │ ──→ │ LangExtract  │ ──→ │  Discovery  │
│             │     │   Registry   │     │ (Entry Pt.) │
└─────────────┘     └──────────────┘     └─────────────┘
                                                ↓
                                         ┌─────────────┐
                                         │ Your Plugin │
                                         └─────────────┘

The key parts are:
- Entry point: Registers your provider in pyproject.toml
- Registry pattern: Decorator that matches model IDs
- BaseLanguageModel: Interface you implement
Step 1: Create Package Structure
I created this directory layout:
langextract-claude/
├── pyproject.toml          # Entry points + metadata
├── README.md
├── LICENSE
└── langextract_claude/
    ├── __init__.py         # Exports provider
    └── provider.py         # Main implementation

The structure is minimal. You only need pyproject.toml for metadata and entry points, plus the provider implementation.
Step 2: Configure pyproject.toml
Here’s the complete configuration file:
[build-system]
requires = ["setuptools>=61.0", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "langextract-claude"
version = "0.1.0"
description = "Claude provider for LangExtract"
dependencies = [
    "langextract>=1.0.0",
    "anthropic>=0.18.0"
]

# CRITICAL: This registers your provider
[project.entry-points."langextract.providers"]
claude = "langextract_claude:ClaudeLanguageModel"

The most important part is the [project.entry-points."langextract.providers"] section. This tells LangExtract:
- When someone uses model_id="claude-3-5-sonnet"
- Load the ClaudeLanguageModel class from langextract_claude
I missed this at first. I thought just importing the module would work, but LangExtract needs the entry point to discover providers automatically.
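The entry point value follows the "module:attribute" convention. As a small sketch of how such a string gets resolved, the snippet below builds an EntryPoint by hand and loads it. It points at collections:OrderedDict as a stand-in so it runs without the plugin installed; the real value would be langextract_claude:ClaudeLanguageModel.

```python
from importlib.metadata import EntryPoint

# Stand-in target so the sketch runs without the plugin installed.
# The real entry would be "langextract_claude:ClaudeLanguageModel".
ep = EntryPoint(
    name="claude",
    value="collections:OrderedDict",
    group="langextract.providers",
)

print(ep.module)  # part before the colon
print(ep.attr)    # part after the colon

# load() imports the module and returns the named attribute.
loaded = ep.load()
print(loaded.__name__)
```

Nothing is imported until load() is called, which is why a plain import in your own code does not make the provider discoverable.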
Step 3: Implement Provider
Here’s the minimal provider implementation:
import os
from typing import Generator, List

import langextract as lx
from langextract.inference import ScoredOutput


@lx.providers.registry.register(
    r'^claude',     # Matches: claude-3-5-sonnet
    r'^anthropic',  # Matches: anthropic-custom
    priority=10     # Higher than built-ins
)
class ClaudeLanguageModel(lx.inference.BaseLanguageModel):
    """Claude AI provider for LangExtract."""

    def __init__(
        self,
        model_id: str,
        api_key: str = None,
        **kwargs
    ):
        super().__init__()
        self.model_id = model_id

        # API key: Check Claude-specific then generic
        self.api_key = api_key or os.environ.get(
            'ANTHROPIC_API_KEY',
            os.environ.get('LANGEXTRACT_API_KEY')
        )

        if not self.api_key:
            raise ValueError(
                "ANTHROPIC_API_KEY or LANGEXTRACT_API_KEY required"
            )

        # Initialize Anthropic client
        from anthropic import Anthropic
        self.client = Anthropic(api_key=self.api_key)

        # Claude-specific settings
        self.max_tokens = kwargs.get('max_tokens', 4096)
        self.temperature = kwargs.get('temperature', 0.0)

    def infer(
        self,
        batch_prompts: List[str],
        **kwargs
    ) -> Generator[List[ScoredOutput], None, None]:
        """Run inference on batch of prompts."""

        for prompt in batch_prompts:
            try:
                # Call Claude API
                response = self.client.messages.create(
                    model=self.model_id,
                    max_tokens=self.max_tokens,
                    temperature=self.temperature,
                    messages=[{"role": "user", "content": prompt}]
                )

                # Extract JSON from response
                output_text = response.content[0].text

                # LangExtract expects ScoredOutput
                yield [ScoredOutput(
                    score=1.0,
                    output=output_text
                )]

            except Exception as e:
                # Wrap errors for LangExtract
                raise lx.InferenceError(
                    f"Claude API error: {e}"
                ) from e

The key parts:
- @lx.providers.registry.register(): Registers patterns that match model IDs
- infer() method: Must yield List[ScoredOutput] for each prompt
- Error handling: Wrap exceptions in lx.InferenceError
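The infer() contract (one list of outputs yielded per prompt) can be exercised without any network calls by swapping the client for a stub. The sketch below is self-contained and does not import langextract or anthropic: ScoredOutput is a stand-in dataclass with the two fields used in this post, and FakeClient, FakeMessages and the free function infer are hypothetical names for illustration.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Generator, List


@dataclass
class ScoredOutput:
    """Stand-in for LangExtract's ScoredOutput (assumed fields: score, output)."""
    score: float
    output: str


class FakeMessages:
    """Mimics the shape of client.messages.create() used in the provider."""

    def create(self, *, model, max_tokens, temperature, messages):
        # Echo the prompt back inside a response-like object.
        text = f"echo: {messages[0]['content']}"
        return SimpleNamespace(content=[SimpleNamespace(text=text)])


class FakeClient:
    def __init__(self):
        self.messages = FakeMessages()


def infer(client, model_id: str, batch_prompts: List[str]
          ) -> Generator[List[ScoredOutput], None, None]:
    """Same loop shape as the provider's infer(): one list per prompt."""
    for prompt in batch_prompts:
        response = client.messages.create(
            model=model_id,
            max_tokens=4096,
            temperature=0.0,
            messages=[{"role": "user", "content": prompt}],
        )
        yield [ScoredOutput(score=1.0, output=response.content[0].text)]


results = list(infer(FakeClient(), "claude-test", ["first", "second"]))
print(len(results))          # one entry per prompt
print(results[0][0].output)
```

A stub like this is handy for checking the generator shape before spending API credits on a real call.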
The regex patterns r'^claude' and r'^anthropic' mean any model ID starting with “claude” or “anthropic” will use this provider.
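To illustrate how pattern matching and priority interact, here is a toy registry written from scratch. It is not LangExtract's implementation, only a sketch of the same idea; register, resolve, ClaudeStub and LowPriorityStub are hypothetical names.

```python
import re
from typing import Callable, List, Tuple

# Each entry: (compiled patterns, provider class, priority)
_ENTRIES: List[Tuple[List[re.Pattern], type, int]] = []


def register(*patterns: str, priority: int = 0) -> Callable[[type], type]:
    """Class decorator that records regex patterns for a provider class."""
    def decorator(cls: type) -> type:
        _ENTRIES.append(([re.compile(p) for p in patterns], cls, priority))
        return cls
    return decorator


def resolve(model_id: str) -> type:
    """Return the highest-priority class whose pattern matches model_id."""
    matches = [
        (priority, cls)
        for compiled, cls, priority in _ENTRIES
        if any(p.search(model_id) for p in compiled)
    ]
    if not matches:
        raise ValueError(f"No provider registered for {model_id!r}")
    return max(matches, key=lambda m: m[0])[1]


@register(r"^claude", r"^anthropic", priority=10)
class ClaudeStub:
    pass


@register(r"^cl", priority=0)
class LowPriorityStub:
    pass


print(resolve("claude-3-5-sonnet").__name__)   # ClaudeStub
print(resolve("anthropic-custom").__name__)    # ClaudeStub
print(resolve("clip-model").__name__)          # LowPriorityStub
```

Both stubs match "claude-3-5-sonnet", and the higher priority wins, which is the reason for registering the plugin above the built-ins.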
Step 4: Export from __init__.py
I made sure to export the provider class:
from langextract_claude.provider import ClaudeLanguageModel

__all__ = ["ClaudeLanguageModel"]

Step 5: Test Locally
First, install in editable mode:
cd langextract-claude
pip install -e .

Then verify the provider is registered:

import langextract as lx
print('Registered providers:', lx.providers.registry.list_entries())

I got this output:

Registered providers: [('openai', 0), ('claude', 10), ('anthropic', 10)]

You can see that my claude provider shows up with priority 10, which is higher than the built-in openai provider (priority 0).
Step 6: Test Extraction
Here’s a test script:
import langextract as lx

result = lx.extract(
    text="Jane Doe, age 32, prescribed Lisinopril 10mg",
    prompt_description="Extract patient name, age, and medications",
    examples=[
        lx.data.ExampleData(
            text="John Smith, 45, takes Metformin 500mg",
            extractions=[
                lx.data.Extraction(
                    extraction_class="patient",
                    extraction_text="John Smith",
                    attributes={"age": "45"}
                ),
                lx.data.Extraction(
                    extraction_class="medication",
                    extraction_text="Metformin 500mg",
                    attributes={}
                )
            ]
        )
    ],
    model_id="claude-3-5-sonnet-20241022"
)

print("Extractions:", result.extractions)

When I ran this, it worked:

Extractions: [Extraction(extraction_class='patient', extraction_text='Jane Doe', attributes={'age': '32'}), Extraction(extraction_class='medication', extraction_text='Lisinopril 10mg', attributes={})]

Step 7: Build and Publish
To publish to PyPI:
pip install build twine
python -m build
ls dist/
# dist/langextract_claude-0.1.0-py3-none-any.whl

twine upload dist/*

Now users can install it:

pip install langextract-claude

Common Issues
I ran into some issues while testing:
Issue: Provider not discovered
I checked that the entry point name was lowercase (claude, not Claude). Then I verified that the installed package includes its entry_points.txt file:

pip show -f langextract-claude | grep entry_points

Issue: Import error for dependencies
I had anthropic in dev-dependencies at first. I moved it to dependencies because users need it to use the provider.
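One way to double-check what an installed distribution actually declares is to read its metadata from Python. This is a small sketch using importlib.metadata; the helper name describe_distribution is hypothetical, and it returns None when the package is not installed.

```python
from importlib.metadata import PackageNotFoundError, distribution
from typing import Optional


def describe_distribution(name: str) -> Optional[dict]:
    """Return declared dependencies and entry points, or None if not installed."""
    try:
        dist = distribution(name)
    except PackageNotFoundError:
        return None
    return {
        # requires is None when the package declares no dependencies.
        "requires": dist.requires or [],
        "entry_points": [
            (ep.group, ep.name, ep.value) for ep in dist.entry_points
        ],
    }


info = describe_distribution("langextract-claude")
if info is None:
    print("langextract-claude is not installed in this environment")
else:
    print("Dependencies:", info["requires"])
    print("Entry points:", info["entry_points"])
```

If anthropic is missing from the printed dependencies, users who install the package will hit the same import error.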
Issue: Pattern doesn’t match
I listed the registry entries from Python to debug:

import langextract as lx
print(lx.providers.registry.list_entries())
# Should show your new patterns

Summary
In this post, I showed how to create a custom LangExtract provider plugin from scratch. The key point is configuring entry points in pyproject.toml and implementing the BaseLanguageModel.infer() method to yield ScoredOutput objects.
Once you understand the entry point system and registry pattern, you can add support for any LLM (Mistral, local models, custom APIs) to LangExtract.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 LangExtract Provider Documentation
- 👨💻 Python Entry Points Guide
- 👨💻 Anthropic API Documentation
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!