Spaces: megharudushi/agentic-api

MiniMax Agent committed on Jan 1

Commit

c126015

1 Parent(s): 9604400

Add OpenAI API compatible endpoints for OpenELM models

Files changed (4)

README.md +218 -33
app.py +475 -11
examples/curl_examples.sh +139 -27
examples/openai_sdk_example.py +148 -0

README.md CHANGED

@@ -1,5 +1,5 @@
 ---
-title: OpenELM Anthropic API
 emoji: 🤖
 colorFrom: blue
 colorTo: purple
@@ -7,21 +7,23 @@ sdk: docker
 pinned: false
 ---
-# OpenELM Anthropic API Compatible Wrapper
-A FastAPI-based service that provides an Anthropic-compatible API for Apple's OpenELM models, allowing you to use the Anthropic SDK with OpenELM for text generation tasks.
 ## Overview
-This project creates a REST API that mimics the Anthropic Messages API format, enabling developers to use OpenELM models with existing Anthropic SDK code with minimal modifications. The API supports both streaming and non-streaming responses, multi-turn conversations, system prompts, and various generation parameters.
-The OpenELM (Open Efficient Language Model) family from Apple uses a layer-wise scaling strategy to efficiently allocate parameters within each transformer layer, resulting in enhanced accuracy while maintaining computational efficiency. This wrapper makes these powerful models accessible through a familiar API interface.
 ## Features
-The API provides comprehensive support for Anthropic-style message generation with several key capabilities. First, it offers full Anthropic API compatibility, including endpoints that match the Anthropic Messages API structure, making it easy to integrate with existing codebases. Second, it supports streaming responses through Server-Sent Events (SSE), enabling real-time output display as tokens are generated. Third, the API handles multi-turn conversations by maintaining conversation history and formatting prompts appropriately for OpenELM models.
-Additionally, the wrapper properly handles system prompts by prepending them to the conversation context, which is essential for defining assistant behavior. The API also provides flexible generation parameters, allowing control over temperature, top-p sampling, maximum tokens, and other generation settings. Finally, comprehensive token usage statistics are included in responses, matching the Anthropic response format exactly.
 ## Quick Start
@@ -29,8 +31,8 @@ Additionally, the wrapper properly handles system prompts by prepending them to
 ```bash
 # Build and run with Docker
-docker build -t openelm-anthropic-api .
-docker run -p 8000:8000 openelm-anthropic-api
 ```
 ### Local Development
@@ -43,7 +45,20 @@ pip install -r requirements.txt
 python -m uvicorn app:app --host 0.0.0.0 --port 8000
 ```
-### Test the API
 ```bash
 # Basic message generation
@@ -58,17 +73,69 @@ curl -X POST http://localhost:8000/v1/messages \
 ## API Reference
-### Endpoints
 | Method | Endpoint | Description |
 |--------|----------|-------------|
-| GET | / | API information |
-| GET | /health | Health check |
-| GET | /v1/models | List available models |
 | POST | /v1/messages | Create message (non-streaming) |
 | POST | /v1/messages/stream | Create message (streaming) |
-### Request Format
 ```json
 {
@@ -79,19 +146,20 @@ curl -X POST http://localhost:8000/v1/messages \
   "system": "Optional system prompt",
   "max_tokens": 1024,
   "temperature": 0.7,
-  "top_p": 0.9,
   "stream": false
 }
 ```
-### Response Format
 ```json
 {
   "id": "msg_abc123",
   "type": "message",
   "role": "assistant",
-  "content": [{"type": "text", "text": "Generated response"}],
   "model": "openelm-450m-instruct",
   "stop_reason": "end_turn",
   "usage": {
@@ -101,25 +169,90 @@ curl -X POST http://localhost:8000/v1/messages \
 }
 ```
 ## Using with Anthropic SDK
 ```python
-from anthropic import Anthropic
 # Point to your local API
-client = Anthropic(
     base_url="http://localhost:8000",  # the Anthropic SDK appends /v1/messages itself
     api_key="dummy"  # Any string works
 )
 # Use the same API you use with Claude!
-response = client.messages.create(
     model="openelm-450m-instruct",
     messages=[{"role": "user", "content": "Hello!"}],
     max_tokens=100
 )
-print(response.content[0].text)
 ```
 ## Model Information
@@ -129,14 +262,16 @@ print(response.content[0].text)
 - **Context Window**: 2048 tokens
 - **Weight Format**: Safetensors (secure and efficient)
 - **Quantization**: FP16 for optimal performance
 ## Architecture
-- **Framework**: FastAPI with async support
-- **ML Backend**: PyTorch + HuggingFace Transformers
-- **Model Loading**: Lazy loading on startup with caching
-- **Streaming**: Server-Sent Events (SSE)
-- **Response Format**: 100% Anthropic API compatible
 ## Configuration
@@ -147,20 +282,68 @@ Environment variables can be used to customize the deployment:
 | PORT | 8000 | API server port |
 | HF_HOME | ~/.cache/huggingface | Model cache directory |
 | TRANSFORMERS_CACHE | ~/.cache/transformers | Transformers cache |
 ## Examples
 See the `examples/` directory for complete usage examples:
-- `anthropic_sdk_example.py` - Python SDK usage
-- `curl_examples.sh` - Command-line examples
 ## Troubleshooting
-- **Model not loading**: Check internet connection for HuggingFace download
-- **Out of memory**: Reduce max_tokens or use CPU inference
-- **Slow responses**: First request downloads model (subsequent requests are faster)
-- **Port conflicts**: Change PORT environment variable
 ## License
@@ -169,6 +352,8 @@ This project is provided for educational and research purposes. The OpenELM mode
 ## Resources
 - [OpenELM Model Card](https://huggingface.co/apple/OpenELM-450M-Instruct)
 - [Anthropic API Documentation](https://docs.anthropic.com)
 - [FastAPI Documentation](https://fastapi.tiangolo.com)
 - [HuggingFace Transformers](https://huggingface.co/docs/transformers)

 ---
+title: OpenELM OpenAI API
 emoji: 🤖
 colorFrom: blue
 colorTo: purple
 pinned: false
 ---
+# OpenELM OpenAI & Anthropic API Compatible Wrapper
+A FastAPI-based service that provides both OpenAI- and Anthropic-compatible APIs for Apple's OpenELM models, allowing you to use the OpenAI SDK or the Anthropic SDK with OpenELM for text generation tasks.
 ## Overview
+This project creates a REST API that mimics both the OpenAI Chat Completions API and Anthropic Messages API formats, enabling developers to use OpenELM models with existing SDK code with minimal modifications. The API supports both streaming and non-streaming responses, multi-turn conversations, system prompts, and various generation parameters. This dual compatibility means you can use the same underlying OpenELM model whether your codebase is built for OpenAI or Anthropic APIs.
+The OpenELM (Open Efficient Language Model) family from Apple uses a layer-wise scaling strategy to efficiently allocate parameters within each transformer layer, resulting in enhanced accuracy while maintaining computational efficiency. This wrapper makes these models accessible through familiar API interfaces, bridging the gap between Apple's architecture and the widely adopted API standards used in the industry.
 ## Features
+The API supports both OpenAI-style and Anthropic-style generation. First, it exposes endpoints that follow the structure of the OpenAI Chat Completions API and the Anthropic Messages API, so existing codebases can integrate with it regardless of which provider they currently use. Second, it streams responses through Server-Sent Events (SSE) in both API formats, so output can be displayed as tokens are generated.
+Third, the API handles multi-turn conversations by formatting the message history supplied in each request into a prompt suited to OpenELM models, regardless of which API format you choose. Additionally, the wrapper handles system prompts by prepending them to the conversation context, which is essential for defining assistant behavior. The API also accepts generation parameters for temperature, top-p sampling, and maximum tokens in both API styles.
+Finally, responses include token usage statistics in the field layout each API uses (`prompt_tokens`, `completion_tokens`, and `total_tokens` for OpenAI; `input_tokens` and `output_tokens` for Anthropic), so tools and dashboards that expect standard usage reporting keep working.
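The system-prompt and multi-turn handling described in the Features text can be sketched as follows. This is a minimal illustration, not the wrapper's actual `format_prompt_for_openelm` (which this diff does not show): the function name `build_prompt` is hypothetical, and the `System:`/`User:`/`Assistant:` labels are an assumption inferred from the prefixes that `extract_assistant_response` in app.py looks for.

```python
from typing import Dict, List, Optional

# Hypothetical role labels; the real prompt template is not visible in this diff.
ROLE_LABELS = {"system": "System", "user": "User", "assistant": "Assistant"}


def build_prompt(messages: List[Dict[str, str]], system: Optional[str] = None) -> str:
    """Flatten a chat history into a single text-completion prompt."""
    lines = []
    if system:
        # The system prompt goes first so it frames every later turn.
        lines.append(f"System: {system}")
    for msg in messages:
        label = ROLE_LABELS.get(msg["role"], msg["role"].capitalize())
        lines.append(f"{label}: {msg['content']}")
    # A trailing "Assistant:" cue asks the model to continue as the assistant.
    lines.append("Assistant:")
    return "\n".join(lines)


prompt = build_prompt(
    [{"role": "user", "content": "Hello!"}],
    system="You are a helpful assistant.",
)
print(prompt)
```

Ending the prompt on the assistant label is a common way to steer a plain text-completion model toward answering in the assistant role; whether the wrapper does exactly this is not confirmed by the diff.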
 ## Quick Start
 ```bash
 # Build and run with Docker
+docker build -t openelm-api .
+docker run -p 8000:8000 openelm-api
 ```
 ### Local Development
 python -m uvicorn app:app --host 0.0.0.0 --port 8000
 ```
+### Test the API (OpenAI Format)
+```bash
+# Basic chat completion
+curl -X POST http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "openelm-450m-instruct",
+    "messages": [{"role": "user", "content": "Say hello!"}],
+    "max_tokens": 100
+  }'
+```
+### Test the API (Anthropic Format)
 ```bash
 # Basic message generation
 ## API Reference
+### OpenAI API Endpoints
+The OpenAI-compatible endpoints follow the standard Chat Completions API format used by OpenAI's GPT models. These endpoints accept message arrays with roles and content, and return completion responses in the standard OpenAI format.
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| GET | /v1/models | List available models (OpenAI format) |
+| POST | /v1/chat/completions | Create chat completion (non-streaming) |
+| POST | /v1/chat/completions (with stream=true) | Create chat completion (streaming) |
+#### OpenAI Request Format
+```json
+{
+  "model": "openelm-450m-instruct",
+  "messages": [
+    {"role": "system", "content": "You are a helpful assistant."},
+    {"role": "user", "content": "Your prompt here"}
+  ],
+  "temperature": 0.7,
+  "top_p": 0.9,
+  "max_tokens": 1024,
+  "stream": false
+}
+```
+#### OpenAI Response Format
+```json
+{
+  "id": "chatcmpl-abc123",
+  "object": "chat.completion",
+  "created": 1677858242,
+  "model": "openelm-450m-instruct",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": "Generated response"
+      },
+      "finish_reason": "stop"
+    }
+  ],
+  "usage": {
+    "prompt_tokens": 13,
+    "completion_tokens": 25,
+    "total_tokens": 38
+  }
+}
+```
+### Anthropic API Endpoints
+The Anthropic-compatible endpoints follow the Messages API format used by Claude. These endpoints accept message arrays with roles and content, and support both streaming and non-streaming responses.
 | Method | Endpoint | Description |
 |--------|----------|-------------|
+| GET | /v1/models | List available models (Anthropic format) |
 | POST | /v1/messages | Create message (non-streaming) |
 | POST | /v1/messages/stream | Create message (streaming) |
+#### Anthropic Request Format
 ```json
 {
   "system": "Optional system prompt",
   "max_tokens": 1024,
   "temperature": 0.7,
   "stream": false
 }
 ```
+#### Anthropic Response Format
 ```json
 {
   "id": "msg_abc123",
   "type": "message",
   "role": "assistant",
+  "content": [
+    {"type": "text", "text": "Generated response"}
+  ],
   "model": "openelm-450m-instruct",
   "stop_reason": "end_turn",
   "usage": {
 }
 ```
+## Using with OpenAI SDK
+```python
+from openai import OpenAI
+# Point to your local API
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="dummy"  # Any string works
+)
+# Use the same API you use with GPT!
+response = client.chat.completions.create(
+    model="openelm-450m-instruct",
+    messages=[
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": "Hello!"}
+    ],
+    max_tokens=100
+)
+print(response.choices[0].message.content)
+```
+### Streaming with OpenAI SDK
+```python
+from openai import OpenAI
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="dummy"
+)
+stream = client.chat.completions.create(
+    model="openelm-450m-instruct",
+    messages=[{"role": "user", "content": "Tell me a story."}],
+    max_tokens=100,
+    stream=True
+)
+for chunk in stream:
+    if chunk.choices[0].delta.content:
+        print(chunk.choices[0].delta.content, end="", flush=True)
+```
 ## Using with Anthropic SDK
 ```python
+import anthropic
 # Point to your local API
+client = anthropic.Anthropic(
     base_url="http://localhost:8000",  # the Anthropic SDK appends /v1/messages itself
     api_key="dummy"  # Any string works
 )
 # Use the same API you use with Claude!
+message = client.messages.create(
     model="openelm-450m-instruct",
     messages=[{"role": "user", "content": "Hello!"}],
     max_tokens=100
 )
+print(message.content[0].text)
+```
+### Streaming with Anthropic SDK
+```python
+import anthropic
+client = anthropic.Anthropic(
+    base_url="http://localhost:8000",  # the Anthropic SDK appends /v1/messages itself
+    api_key="dummy"
+)
+with client.messages.stream(
+    model="openelm-450m-instruct",
+    messages=[{"role": "user", "content": "Tell me a story."}],
+    max_tokens=100
+) as stream:
+    for text in stream.text_stream:
+        print(text, end="", flush=True)
 ```
 ## Model Information
 - **Context Window**: 2048 tokens
 - **Weight Format**: Safetensors (secure and efficient)
 - **Quantization**: FP16 for optimal performance
+- **Layer-wise Scaling**: Efficient parameter allocation within transformer layers
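Because the prompt and the generated tokens share the 2048-token context window listed above, a request's `max_tokens` determines how much room is left for the prompt. The arithmetic can be sketched as below; the helper `split_token_budget` and its `min_prompt_tokens` reserve are hypothetical names chosen for illustration and are not part of the wrapper.

```python
from typing import Tuple

CONTEXT_WINDOW = 2048  # context window from the Model Information list


def split_token_budget(requested_max_tokens: int, min_prompt_tokens: int = 256) -> Tuple[int, int]:
    """Return (max_new_tokens, prompt_budget) that together fit the context window."""
    # Cap generation so that at least `min_prompt_tokens` remain for the prompt.
    max_new_tokens = max(1, min(requested_max_tokens, CONTEXT_WINDOW - min_prompt_tokens))
    prompt_budget = CONTEXT_WINDOW - max_new_tokens
    return max_new_tokens, prompt_budget


print(split_token_budget(1024))  # request fits: half the window each
print(split_token_budget(4096))  # request exceeds the window: generation is capped
```

Without a cap of this kind, a `max_tokens` larger than the window would leave a zero or negative prompt budget.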
 ## Architecture
+- **Framework**: FastAPI with async support for high concurrency
+- **ML Backend**: PyTorch + HuggingFace Transformers for model inference
+- **Model Loading**: Lazy loading on startup with caching for fast restarts
+- **Streaming**: Server-Sent Events (SSE) for real-time token delivery
+- **Dual Compatibility**: Full OpenAI and Anthropic API format support
+- **Prompt Engineering**: Custom formatting for OpenELM's text completion interface
 ## Configuration
 | PORT | 8000 | API server port |
 | HF_HOME | ~/.cache/huggingface | Model cache directory |
 | TRANSFORMERS_CACHE | ~/.cache/transformers | Transformers cache |
+| CUDA_VISIBLE_DEVICES | unset (all GPUs visible) | GPU device selection, e.g. `0` or `0,1` |
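One way to read the variables in this table is shown below. It is a sketch under stated assumptions: `load_settings` is a hypothetical helper, the defaults are copied from the table, and the wrapper's own configuration code is not shown in this diff.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect deployment settings from environment variables with table defaults."""
    env = os.environ if env is None else env
    return {
        # PORT arrives as a string and must be converted for the server.
        "port": int(env.get("PORT", "8000")),
        # Expand "~" so the cache path is usable as a real directory.
        "hf_home": os.path.expanduser(env.get("HF_HOME", "~/.cache/huggingface")),
        # None means the variable is unset, in which case CUDA exposes every GPU.
        "cuda_visible_devices": env.get("CUDA_VISIBLE_DEVICES"),
    }


print(load_settings({"PORT": "9000", "CUDA_VISIBLE_DEVICES": "0"}))
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test.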
 ## Examples
 See the `examples/` directory for complete usage examples:
+- `openai_sdk_example.py` - OpenAI SDK usage with streaming support
+- `anthropic_sdk_example.py` - Anthropic SDK usage with streaming support
+- `curl_examples.sh` - Command-line examples for both APIs
+## Streaming Response Format
+### OpenAI Streaming (SSE)
+```
+data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1677858242,"model":"openelm-450m-instruct","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
+data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1677858242,"model":"openelm-450m-instruct","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}
+data: [DONE]
+```
+### Anthropic Streaming (SSE)
+```
+event: message_start
+data: {"id":"msg_abc123","type":"message","role":"assistant","content":[],"model":"openelm-450m-instruct"}
+event: content_block_start
+data: {"type":"text","text":""}
+event: content_block_delta
+data: {"type":"text_delta","text":"Hello"}
+event: content_block_stop
+data: {}
+event: message_delta
+data: {"delta":{"stop_reason":"end_turn"},"usage":{"input_tokens":10,"output_tokens":5}}
+event: message_stop
+data: {}
+```
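For clients that do not use an SDK, the OpenAI-style stream shown above can be consumed by reading each `data:` line, stopping at the `[DONE]` sentinel, and concatenating the `delta.content` fragments. The sketch below assumes the lines have already been read from the HTTP response; `iter_openai_deltas` is a hypothetical helper name, and the sample lines are taken from the example stream.

```python
import json
from typing import Iterable, Iterator


def iter_openai_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by OpenAI-style SSE `data:` lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


sample = [
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1677858242,'
    '"model":"openelm-450m-instruct","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1677858242,'
    '"model":"openelm-450m-instruct","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_openai_deltas(sample)))
```

The role-only first chunk carries no `content`, so it contributes nothing to the assembled text.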
 ## Troubleshooting
+- **Model not loading**: Check internet connection for HuggingFace download, ensure sufficient disk space for model cache
+- **Out of memory**: Reduce max_tokens, use smaller context windows, or switch to CPU inference by removing GPU-specific settings
+- **Slow responses**: First request downloads model from HuggingFace (subsequent requests use cached model and are much faster)
+- **Port conflicts**: Change PORT environment variable to use a different port
+- **Streaming not working**: Ensure you're using the correct endpoint (with stream=true for OpenAI) and proper SSE parsing
+- **Format errors**: Verify your request matches the expected format for the API you're using (OpenAI vs Anthropic have different schemas)
+## Migration Guide
+### Migrating from OpenAI to OpenELM
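+If you're currently using OpenAI's API and want to switch to OpenELM, change the `base_url` to point to your local OpenELM API server and update the model name. Response handling stays the same, and `temperature`, `top_p`, and `max_tokens` are passed through to generation. Other request fields (`n`, `presence_penalty`, `frequency_penalty`, `logit_bias`, `user`) are accepted by the request schema but are not forwarded to generation in this version, so code that relies on them will behave differently. With that caveat, it is easy to toggle between providers for testing or A/B comparisons.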
+### Migrating from Anthropic to OpenELM
+Similarly, if you're using Anthropic's API, you can migrate by updating the `base_url` and model name; the separate `system` parameter is part of the Anthropic request format here as well. If you move code between the two formats, adjust how you pass system prompts: OpenAI puts them inline as a `system`-role message, while Anthropic takes a separate `system` parameter.
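The system-prompt difference mentioned above can be bridged with a small conversion step. The sketch below is illustrative only: `to_anthropic_request` is a hypothetical helper, and it mirrors the approach in app.py's `create_chat_completion`, where the first `system` message is pulled out and every other message is passed through.

```python
from typing import Any, Dict, List


def to_anthropic_request(openai_messages: List[Dict[str, str]], **params: Any) -> Dict[str, Any]:
    """Convert OpenAI-style messages into Anthropic-style request fields."""
    system = None
    messages = []
    for msg in openai_messages:
        if msg["role"] == "system" and system is None:
            # Only the first system message becomes the `system` parameter.
            system = msg["content"]
        else:
            messages.append({"role": msg["role"], "content": msg["content"]})
    request: Dict[str, Any] = {"messages": messages, **params}
    if system is not None:
        request["system"] = system
    return request


converted = to_anthropic_request(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    model="openelm-450m-instruct",
    max_tokens=100,
)
print(converted)
```

The resulting dictionary can be unpacked into an Anthropic-style call such as `client.messages.create(**converted)`.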
 ## License
 ## Resources
 - [OpenELM Model Card](https://huggingface.co/apple/OpenELM-450M-Instruct)
+- [OpenAI API Documentation](https://platform.openai.com/docs/api-reference)
 - [Anthropic API Documentation](https://docs.anthropic.com)
 - [FastAPI Documentation](https://fastapi.tiangolo.com)
 - [HuggingFace Transformers](https://huggingface.co/docs/transformers)
+- [Apple OpenELM Research Paper](https://machinelearning.apple.com/research/openelm)

app.py CHANGED

@@ -1,8 +1,13 @@
 """
-OpenELM Anthropic API Compatible Wrapper
-This FastAPI application provides an Anthropic-compatible API for the OpenELM model,
-allowing users to call OpenELM models using the Anthropic SDK with minimal code changes.
 """
 import asyncio
@@ -80,9 +85,9 @@ async def lifespan(app: FastAPI) -> AsyncIterator:
 # Create FastAPI app
 app = FastAPI(
-    title="OpenELM Anthropic API",
-    description="Anthropic API compatible wrapper for OpenELM models",
-    version="1.0.0",
     lifespan=lifespan
 )
@@ -115,6 +120,7 @@ class Usage(BaseModel):
     """Token usage statistics."""
     input_tokens: int = 0
     output_tokens: int = 0
 class ContentBlock(BaseModel):
@@ -162,6 +168,88 @@ class ModelListResponse(BaseModel):
     data: List[ModelInfo]
 # ==================== Helper Functions ====================
 def format_prompt_for_openelm(
@@ -283,12 +371,14 @@ def map_anthropic_params_to_transformers(
 async def root():
     """Root endpoint with API information."""
     return {
-        "name": "OpenELM Anthropic API",
-        "version": "1.0.0",
-        "description": "Anthropic API compatible wrapper for OpenELM models",
         "endpoints": {
-            "messages": "POST /v1/messages",
-            "models": "GET /v1/models",
             "health": "GET /health"
         }
     }
@@ -643,6 +733,380 @@ class MessageResource:
         return response.json()
 # ==================== Main Entry Point ====================
 if __name__ == "__main__":

 """
+OpenELM OpenAI & Anthropic API Compatible Wrapper
+This FastAPI application provides both OpenAI and Anthropic-compatible APIs for the OpenELM model,
+allowing users to call OpenELM models using either SDK with minimal code changes.
+Supported APIs:
+- OpenAI Chat Completions API (v1/chat/completions)
+- Anthropic Messages API (v1/messages)
+- Both support streaming and non-streaming responses
 """
 import asyncio
 # Create FastAPI app
 app = FastAPI(
+    title="OpenELM OpenAI API",
+    description="OpenAI and Anthropic API compatible wrapper for OpenELM models",
+    version="1.1.0",
     lifespan=lifespan
 )
     """Token usage statistics."""
     input_tokens: int = 0
     output_tokens: int = 0
+    total_tokens: int = 0
 class ContentBlock(BaseModel):
     data: List[ModelInfo]
+# ==================== OpenAI API Models ====================
+class ChatMessage(BaseModel):
+    """A chat message (OpenAI format)."""
+    role: str
+    content: str
+    name: Optional[str] = None
+class ChatCompletionRequest(BaseModel):
+    """Chat completion request (OpenAI API compatible)."""
+    model: str = "openelm-450m-instruct"
+    messages: List[ChatMessage]
+    temperature: Optional[float] = Field(default=None, ge=0.0, le=2.0)
+    top_p: Optional[float] = Field(default=None, ge=0.0, le=1.0)
+    n: Optional[int] = Field(default=1, ge=1)
+    max_tokens: Optional[int] = Field(default=None, ge=1, le=4096)
+    stream: Optional[bool] = False
+    presence_penalty: Optional[float] = Field(default=None, ge=-2.0, le=2.0)
+    frequency_penalty: Optional[float] = Field(default=None, ge=-2.0, le=2.0)
+    logit_bias: Optional[Dict[str, float]] = None
+    user: Optional[str] = None
+class ChatCompletionChoice(BaseModel):
+    """Choice in a chat completion response."""
+    index: int
+    message: ChatMessage
+    finish_reason: Optional[str] = None
+    logprobs: Optional[Any] = None
+class ChatCompletionUsage(BaseModel):
+    """Token usage in chat completion."""
+    prompt_tokens: int
+    completion_tokens: int
+    total_tokens: int
+class ChatCompletionResponse(BaseModel):
+    """Chat completion response (OpenAI API compatible)."""
+    id: str
+    object: str = "chat.completion"
+    created: int
+    model: str
+    choices: List[ChatCompletionChoice]
+    usage: ChatCompletionUsage
+    system_fingerprint: Optional[str] = None
+class ChatCompletionChunkChoice(BaseModel):
+    """Choice in a streaming chunk."""
+    index: int
+    delta: Dict[str, Any]
+    finish_reason: Optional[str] = None
+    logprobs: Optional[Any] = None
+class ChatCompletionChunk(BaseModel):
+    """Streaming chunk (OpenAI API compatible)."""
+    id: str
+    object: str = "chat.completion.chunk"
+    created: int
+    model: str
+    choices: List[ChatCompletionChunkChoice]
+class OpenAIModelInfo(BaseModel):
+    """Model information (OpenAI format)."""
+    id: str
+    object: str = "model"
+    created: int = 0
+    owned_by: str = "openelm"
+    permission: List[Any] = []
+class OpenAIModelListResponse(BaseModel):
+    """Model list response (OpenAI format)."""
+    object: str = "list"
+    data: List[OpenAIModelInfo]
 # ==================== Helper Functions ====================
 def format_prompt_for_openelm(
 async def root():
     """Root endpoint with API information."""
     return {
+        "name": "OpenELM OpenAI API",
+        "version": "1.1.0",
+        "description": "OpenAI and Anthropic API compatible wrapper for OpenELM models",
         "endpoints": {
+            "openai_chat": "POST /v1/chat/completions",
+            "openai_models": "GET /v1/models",
+            "anthropic_messages": "POST /v1/messages",
+            "anthropic_models": "GET /v1/models",
             "health": "GET /health"
         }
     }
         return response.json()
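+# ==================== OpenAI API Endpoints ====================
+# NOTE: FastAPI dispatches to the first route registered for a path. If GET /v1/models
+# is already registered for the Anthropic format earlier in this file, that handler
+# serves the request and the one below is never reached.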
+@app.get("/v1/models", response_model=OpenAIModelListResponse, tags=["OpenAI"])
+async def list_openai_models():
+    """List available models (OpenAI API format)."""
+    return OpenAIModelListResponse(
+        data=[
+            OpenAIModelInfo(
+                id="openelm-450m-instruct",
+                owned_by="apple",
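+                # uuid1().time counts 100 ns ticks since 1582-10-15; convert to Unix seconds
+                created=(uuid.uuid1().time - 0x01B21DD213814000) // 10_000_000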
+            )
+        ]
+    )
+@app.post("/v1/chat/completions", tags=["OpenAI"])
+async def create_chat_completion(
+    request: ChatCompletionRequest,
+    raw_request: Request = None
+):
+    """
+    Create chat completion (OpenAI API compatible).
+    This endpoint accepts OpenAI-style chat completion requests and returns
+    responses in the same format, allowing existing code to work with OpenELM.
+    """
+    # Check if model is loaded
+    if model is None or tokenizer is None:
+        raise HTTPException(
+            status_code=503,
+            detail="Model not loaded. Please wait for model to initialize."
+        )
+    # Handle streaming
+    if request.stream:
+        return await create_chat_completion_stream(request)
+    try:
+        # Extract system message if present
+        system_message = None
+        formatted_messages = []
+        for msg in request.messages:
+            if msg.role == "system" and system_message is None:
+                system_message = msg.content
+            else:
+                formatted_messages.append(Message(
+                    role=msg.role,
+                    content=msg.content
+                ))
+        # Format prompt for OpenELM
+        prompt = format_prompt_for_openelm(formatted_messages, system_message)
+        # Calculate max_tokens, capped so the prompt keeps part of the 2048-token context window
+        max_tokens = min(request.max_tokens or 1024, 2047)
+        max_context_tokens = 2048 - max_tokens
+        prompt = truncate_prompt(prompt, max_context_tokens, system_message)
+        # Tokenize input
+        inputs = tokenizer(prompt, return_tensors="pt")
+        input_tokens = len(inputs.input_ids[0])
+        # Move to same device as model
+        if hasattr(model, 'device'):
+            inputs = {k: v.to(model.device) for k, v in inputs.items()}
+        # Map parameters
+        gen_params = map_anthropic_params_to_transformers(
+            request.temperature,
+            request.top_p,
+            None,
+            max_tokens
+        )
+        # Generate
+        with torch.no_grad():
+            outputs = model.generate(
+                **inputs,
+                **gen_params,
+                pad_token_id=tokenizer.eos_token_id,
+                eos_token_id=tokenizer.eos_token_id,
+            )
+        # Decode output
+        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+        # Extract the assistant's response
+        response_text = extract_assistant_response(generated_text)
+        output_tokens = count_tokens(response_text)
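+        # Build response matching OpenAI format
+        response_id = f"chatcmpl-{uuid.uuid4().hex[:12]}"
+        # `created` must be Unix seconds; uuid1().time counts 100 ns ticks since 1582-10-15
+        timestamp = (uuid.uuid1().time - 0x01B21DD213814000) // 10_000_000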
+        return ChatCompletionResponse(
+            id=response_id,
+            created=timestamp,
+            model="openelm-450m-instruct",
+            choices=[
+                ChatCompletionChoice(
+                    index=0,
+                    message=ChatMessage(role="assistant", content=response_text),
+                    finish_reason="stop"
+                )
+            ],
+            usage=ChatCompletionUsage(
+                prompt_tokens=input_tokens,
+                completion_tokens=output_tokens,
+                total_tokens=input_tokens + output_tokens
+            )
+        )
+    except Exception as e:
+        raise HTTPException(
+            status_code=500,
+            detail=f"Generation failed: {str(e)}"
+        )
+async def create_chat_completion_stream(request: ChatCompletionRequest):
+    """Create streaming chat completion (OpenAI API compatible)."""
+    async def generate_stream():
+        """Generate streaming response in OpenAI format."""
+        try:
+            # Extract system message if present
+            system_message = None
+            formatted_messages = []
+            for msg in request.messages:
+                if msg.role == "system" and system_message is None:
+                    system_message = msg.content
+                else:
+                    formatted_messages.append(Message(
+                        role=msg.role,
+                        content=msg.content
+                    ))
+            # Format prompt for OpenELM
+            prompt = format_prompt_for_openelm(formatted_messages, system_message)
+            # Calculate max_tokens, capped so the prompt keeps part of the 2048-token context window
+            max_tokens = min(request.max_tokens or 1024, 2047)
+            max_context_tokens = 2048 - max_tokens
+            prompt = truncate_prompt(prompt, max_context_tokens, system_message)
+            # Tokenize
+            inputs = tokenizer(prompt, return_tensors="pt")
+            input_tokens = len(inputs.input_ids[0])
+            # Move to same device as model
+            if hasattr(model, 'device'):
+                inputs = {k: v.to(model.device) for k, v in inputs.items()}
+            # Map parameters
+            gen_params = map_anthropic_params_to_transformers(
+                request.temperature,
+                request.top_p,
+                None,
+                max_tokens
+            )
+            # Set up streaming
+            gen_params["stopping_criteria"] = []
+            # Use TextIteratorStreamer for streaming
+            streamer = TextIteratorStreamer(
+                tokenizer,
+                skip_prompt=True,
+                skip_special_tokens=True
+            )
+            gen_params["streamer"] = streamer
+            # Run generation in a separate thread
+            def generate():
+                with torch.no_grad():
+                    model.generate(**inputs, **gen_params)
+            thread = Thread(target=generate)
+            thread.start()
+            # Send streaming chunks in OpenAI format
+            import json  # local import keeps this block self-contained
+            chunk_id = f"chatcmpl-{uuid.uuid4().hex[:12]}"
+            # `created` must be Unix seconds; uuid1().time counts 100 ns ticks since 1582-10-15
+            timestamp = (uuid.uuid1().time - 0x01B21DD213814000) // 10_000_000
+            def sse_chunk(delta, finish_reason=None):
+                """Serialize one chat.completion.chunk as an SSE data line."""
+                # json.dumps emits valid JSON (double quotes, null); str(dict) does not
+                return "data: " + json.dumps({
+                    "id": chunk_id,
+                    "object": "chat.completion.chunk",
+                    "created": timestamp,
+                    "model": "openelm-450m-instruct",
+                    "choices": [
+                        {"index": 0, "delta": delta, "finish_reason": finish_reason}
+                    ]
+                }) + "\n\n"
+            # Send role first
+            yield sse_chunk({"role": "assistant"})
+            # Stream the generated text
+            full_text = ""
+            for text in streamer:
+                full_text += text
+                yield sse_chunk({"content": text})
+            # Send stop chunk
+            output_tokens = count_tokens(full_text)
+            yield sse_chunk({}, finish_reason="stop")
+            # Send usage data (OpenAI format)
+            usage_data = {
+                "id": chunk_id,
+                "object": "chat.completion",
+                "created": timestamp,
+                "model": "openelm-450m-instruct",
+                "choices": [
+                    {
+                        "index": 0,
+                        "message": {"role": "assistant", "content": full_text},
+                        "finish_reason": "stop"
+                    }
+                ],
+                "usage": {
+                    "prompt_tokens": input_tokens,
+                    "completion_tokens": output_tokens,
+                    "total_tokens": input_tokens + output_tokens
+                }
+            }
+            yield f"data: {json.dumps(usage_data)}\n\n"
+            # Signal end of stream
+            yield "data: [DONE]\n\n"
+            thread.join()
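+        except Exception as e:
+            import json  # local import keeps this block self-contained
+            # json.dumps escapes quotes and newlines in the error message
+            yield "data: " + json.dumps({"error": {"message": str(e), "type": "server_error"}}) + "\n\n"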
+    return StreamingResponse(
+        generate_stream(),
+        media_type="text/event-stream",
+        headers={
+            "Cache-Control": "no-cache",
+            "Connection": "keep-alive",
+            "X-Accel-Buffering": "no",
+        }
+    )
+def extract_assistant_response(generated_text: str) -> str:
+    """Extract assistant response from generated text."""
+    if "Assistant:" in generated_text:
+        # Keep only the text after the last "Assistant:" marker
+        return generated_text.split("Assistant:")[-1].strip()
+    # No role marker present: return the text unchanged rather than discarding it
+    return generated_text
+# ==================== OpenAI SDK Compatibility ====================
+class OpenAIClient:
+    """
+    Simple OpenAI SDK compatible client for testing.
+    Usage:
+        client = OpenAIClient(base_url="http://localhost:8000", api_key="dummy")
+        response = client.chat.completions.create(
+            model="openelm-450m-instruct",
+            messages=[{"role": "user", "content": "Hello!"}],
+            max_tokens=100
+        )
+    """
+    def __init__(self, base_url: str = "http://localhost:8000", api_key: str = "dummy"):
+        self.base_url = base_url.rstrip("/")
+        self.api_key = api_key
+        self.session = None
+    def _get_session(self):
+        """Get or create a requests session."""
+        import requests
+        if self.session is None:
+            self.session = requests.Session()
+            self.session.headers.update({
+                "Authorization": f"Bearer {self.api_key}",
+                "Content-Type": "application/json"
+            })
+        return self.session
+    @property
+    def chat(self) -> "ChatResource":
+        """Access chat operations."""
+        return ChatResource(self)
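+class ChatResource:
+    """Resource for chat completion operations."""
+    def __init__(self, client: OpenAIClient):
+        self.client = client
+    @property
+    def completions(self) -> "ChatResource":
+        """Allow the SDK-style call client.chat.completions.create(...)."""
+        return self
+    def create(
+        self,
+        model: str,
+        messages: List[Dict[str, str]],
+        temperature: Optional[float] = None,
+        top_p: Optional[float] = None,
+        max_tokens: Optional[int] = None,
+        stream: bool = False,
+        **kwargs
+    ) -> Any:
+        """Create a chat completion.
+        Returns the parsed JSON response, or an iterator of parsed chunks
+        when stream=True.
+        """
+        url = f"{self.client.base_url}/v1/chat/completions"
+        payload = {
+            "model": model,
+            "messages": messages,
+        }
+        if temperature is not None:
+            payload["temperature"] = temperature
+        if top_p is not None:
+            payload["top_p"] = top_p
+        if max_tokens is not None:
+            payload["max_tokens"] = max_tokens
+        if stream:
+            payload["stream"] = True
+        # Add any extra kwargs
+        payload.update(kwargs)
+        response = self.client._get_session().post(url, json=payload, stream=stream)
+        if response.status_code != 200:
+            raise Exception(f"API request failed: {response.text}")
+        if stream:
+            return self._iter_sse(response)
+        return response.json()
+    @staticmethod
+    def _iter_sse(response):
+        """Yield parsed JSON chunks from a Server-Sent Events response."""
+        import json
+        for raw_line in response.iter_lines():
+            line = raw_line.decode("utf-8")
+            if not line.startswith("data: "):
+                continue  # skip blank separators between events
+            data = line[len("data: "):]
+            if data == "[DONE]":
+                break
+            yield json.loads(data)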
 # ==================== Main Entry Point ====================
 if __name__ == "__main__":
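The streaming handler above writes each chunk as a `data:` line. OpenAI-style clients generally parse every such line as JSON, so the payload has to be serialized with `json.dumps` rather than interpolated as a Python dict, whose repr uses single quotes. A minimal sketch, assuming a hypothetical helper named `sse_event` that is not part of app.py:

```python
import json
from typing import Any, Dict


def sse_event(payload: Dict[str, Any]) -> str:
    """Format a dict as one Server-Sent Events data line (hypothetical helper)."""
    # json.dumps produces double-quoted, escaped JSON that clients can parse
    return f"data: {json.dumps(payload)}\n\n"


# Illustrative stop chunk, shaped like the one built in the handler
stop_chunk = {
    "object": "chat.completion.chunk",
    "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}],
}
event = sse_event(stop_chunk)

# The text after "data: " round-trips through json.loads
parsed = json.loads(event[len("data: "):])
```

A helper like this would also cover the error path, since `json.dumps` escapes any quotes or newlines in an exception message.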

examples/curl_examples.sh CHANGED Viewed

@@ -1,8 +1,8 @@
 #!/bin/bash
-# OpenELM Anthropic API - Curl Examples
 #
-# This script demonstrates how to call the OpenELM Anthropic API
-# using curl commands directly.
 #
 # Usage:
 #     chmod +x examples/curl_examples.sh
@@ -13,7 +13,7 @@ API_URL="${OPENELM_API_URL:-http://localhost:8000}"
 API_URL="${API_URL%/}"  # Remove trailing slash
 echo "=============================================="
-echo "OpenELM Anthropic API - Curl Examples"
 echo "=============================================="
 echo "API URL: $API_URL"
 echo ""
@@ -24,16 +24,25 @@ echo "------------------------"
 curl -s "$API_URL/health" | python3 -m json.tool
 echo ""
-# Example 2: List Available Models
-echo "Example 2: List Available Models"
-echo "---------------------------------"
 curl -s "$API_URL/v1/models" | python3 -m json.tool
 echo ""
-# Example 3: Basic Message Generation
-echo "Example 3: Basic Message Generation"
-echo "------------------------------------"
-curl -s -X POST "$API_URL/v1/messages" \
   -H "Content-Type: application/json" \
   -d '{
     "model": "openelm-450m-instruct",
@@ -48,10 +57,10 @@ curl -s -X POST "$API_URL/v1/messages" \
   }' | python3 -m json.tool
 echo ""
-# Example 4: Multi-turn Conversation
-echo "Example 4: Multi-turn Conversation"
-echo "-----------------------------------"
-curl -s -X POST "$API_URL/v1/messages" \
   -H "Content-Type: application/json" \
   -d '{
     "model": "openelm-450m-instruct",
@@ -62,7 +71,7 @@ curl -s -X POST "$API_URL/v1/messages" \
       },
       {
         "role": "assistant",
-        "content": "Python is a high-level programming language known for its simplicity and readability."
       },
       {
         "role": "user",
@@ -74,36 +83,39 @@ curl -s -X POST "$API_URL/v1/messages" \
   }' | python3 -m json.tool
 echo ""
-# Example 5: Using System Prompt
-echo "Example 5: Using System Prompt"
-echo "-------------------------------"
-curl -s -X POST "$API_URL/v1/messages" \
   -H "Content-Type: application/json" \
   -d '{
     "model": "openelm-450m-instruct",
     "messages": [
       {
         "role": "user",
-        "content": "Explain the concept simply."
       }
     ],
-    "system": "You are a helpful tutor who explains things simply.",
     "max_tokens": 200,
     "temperature": 0.8
   }' | python3 -m json.tool
 echo ""
-# Example 6: Deterministic Generation (temperature=0)
-echo "Example 6: Deterministic Generation"
-echo "------------------------------------"
-curl -s -X POST "$API_URL/v1/messages" \
   -H "Content-Type: application/json" \
   -d '{
     "model": "openelm-450m-instruct",
     "messages": [
       {
         "role": "user",
-        "content": "What is the capital of France?"
       }
     ],
     "max_tokens": 50,
@@ -111,6 +123,106 @@ curl -s -X POST "$API_URL/v1/messages" \
   }' | python3 -m json.tool
 echo ""
 echo "=============================================="
 echo "All curl examples completed!"
 echo "=============================================="

 #!/bin/bash
+# OpenELM OpenAI & Anthropic API - Curl Examples
 #
+# This script demonstrates how to call the OpenELM API using both
+# OpenAI and Anthropic compatible endpoints with curl commands.
 #
 # Usage:
 #     chmod +x examples/curl_examples.sh
 API_URL="${API_URL%/}"  # Remove trailing slash
 echo "=============================================="
+echo "OpenELM OpenAI & Anthropic API - Curl Examples"
 echo "=============================================="
 echo "API URL: $API_URL"
 echo ""
 curl -s "$API_URL/health" | python3 -m json.tool
 echo ""
+# ============================================
+# OpenAI API Examples
+# ============================================
+echo "##########################################"
+echo "# OpenAI API Examples                    #"
+echo "##########################################"
+echo ""
+# Example 2: OpenAI - List Available Models
+echo "Example 2: OpenAI - List Available Models"
+echo "-------------------------------------------"
 curl -s "$API_URL/v1/models" | python3 -m json.tool
 echo ""
+# Example 3: OpenAI - Basic Chat Completion
+echo "Example 3: OpenAI - Basic Chat Completion"
+echo "--------------------------------------------"
+curl -s -X POST "$API_URL/v1/chat/completions" \
   -H "Content-Type: application/json" \
   -d '{
     "model": "openelm-450m-instruct",
   }' | python3 -m json.tool
 echo ""
+# Example 4: OpenAI - Multi-turn Conversation
+echo "Example 4: OpenAI - Multi-turn Conversation"
+echo "--------------------------------------------"
+curl -s -X POST "$API_URL/v1/chat/completions" \
   -H "Content-Type: application/json" \
   -d '{
     "model": "openelm-450m-instruct",
       },
       {
         "role": "assistant",
+        "content": "Python is a high-level programming language."
       },
       {
         "role": "user",
   }' | python3 -m json.tool
 echo ""
+# Example 5: OpenAI - Using System Message
+echo "Example 5: OpenAI - Using System Message"
+echo "------------------------------------------"
+curl -s -X POST "$API_URL/v1/chat/completions" \
   -H "Content-Type: application/json" \
   -d '{
     "model": "openelm-450m-instruct",
     "messages": [
+      {
+        "role": "system",
+        "content": "You are a helpful coding assistant."
+      },
       {
         "role": "user",
+        "content": "What is a decorator?"
       }
     ],
     "max_tokens": 200,
     "temperature": 0.8
   }' | python3 -m json.tool
 echo ""
+# Example 6: OpenAI - Deterministic Generation
+echo "Example 6: OpenAI - Deterministic Generation"
+echo "----------------------------------------------"
+curl -s -X POST "$API_URL/v1/chat/completions" \
   -H "Content-Type: application/json" \
   -d '{
     "model": "openelm-450m-instruct",
     "messages": [
       {
         "role": "user",
+        "content": "What is 2 + 2?"
       }
     ],
     "max_tokens": 50,
   }' | python3 -m json.tool
 echo ""
+# Example 7: OpenAI - Streaming Response
+echo "Example 7: OpenAI - Streaming Response"
+echo "----------------------------------------"
+echo "Streaming output:"
+curl -s -X POST "$API_URL/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -H "Accept: text/event-stream" \
+  -d '{
+    "model": "openelm-450m-instruct",
+    "messages": [
+      {
+        "role": "user",
+        "content": "Count to 3, one per line."
+      }
+    ],
+    "max_tokens": 100,
+    "temperature": 0.7,
+    "stream": true
+  }' | head -20
+echo ""
+echo ""
+# ============================================
+# Anthropic API Examples
+# ============================================
+echo "##########################################"
+echo "# Anthropic API Examples                 #"
+echo "##########################################"
+echo ""
+# Example 8: Anthropic - List Available Models
+echo "Example 8: Anthropic - List Available Models"
+echo "----------------------------------------------"
+curl -s "$API_URL/v1/models" | python3 -m json.tool
+echo ""
+# Example 9: Anthropic - Basic Message Generation
+echo "Example 9: Anthropic - Basic Message Generation"
+echo "-------------------------------------------------"
+curl -s -X POST "$API_URL/v1/messages" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "openelm-450m-instruct",
+    "messages": [
+      {
+        "role": "user",
+        "content": "Say hello in a friendly way!"
+      }
+    ],
+    "max_tokens": 100,
+    "temperature": 0.7
+  }' | python3 -m json.tool
+echo ""
+# Example 10: Anthropic - Multi-turn Conversation
+echo "Example 10: Anthropic - Multi-turn Conversation"
+echo "-------------------------------------------------"
+curl -s -X POST "$API_URL/v1/messages" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "openelm-450m-instruct",
+    "messages": [
+      {
+        "role": "user",
+        "content": "What is AI?"
+      },
+      {
+        "role": "assistant",
+        "content": "AI stands for Artificial Intelligence."
+      },
+      {
+        "role": "user",
+        "content": "Tell me more."
+      }
+    ],
+    "max_tokens": 150,
+    "temperature": 0.5
+  }' | python3 -m json.tool
+echo ""
+# Example 11: Anthropic - Using System Prompt
+echo "Example 11: Anthropic - Using System Prompt"
+echo "----------------------------------------------"
+curl -s -X POST "$API_URL/v1/messages" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "openelm-450m-instruct",
+    "messages": [
+      {
+        "role": "user",
+        "content": "Explain quantum computing."
+      }
+    ],
+    "system": "You are a science educator who explains complex topics simply.",
+    "max_tokens": 200,
+    "temperature": 0.8
+  }' | python3 -m json.tool
+echo ""
 echo "=============================================="
 echo "All curl examples completed!"
 echo "=============================================="
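The streaming curl example prints raw `data:` lines. A rough sketch of how a client might turn those lines into chunks is shown below; the function name `parse_sse_lines` and the sample lines are hypothetical, and the sketch assumes the server emits valid JSON on each `data:` line and ends the stream with `[DONE]`.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def parse_sse_lines(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield the JSON payload of each data line until the [DONE] sentinel."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker, not JSON
        yield json.loads(data)


# Hand-written sample lines, not captured server output
sample = [
    'data: {"choices": [{"index": 0, "delta": {"content": "1"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}',
    "",
    "data: [DONE]",
]
chunks = list(parse_sse_lines(sample))
```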

examples/openai_sdk_example.py ADDED Viewed

	@@ -0,0 +1,148 @@

+"""
+Example: Using OpenAI SDK with OpenELM API
+This example demonstrates how to use the OpenAI SDK (or compatible client)
+to call OpenELM models through our OpenAI API compatible wrapper.
+Note: The official openai Python package requires the API server to have
+proper authentication. For testing, use the included OpenAIClient helper.
+Usage:
+    python examples/openai_sdk_example.py
+"""
+import sys
+import os
+# Add parent directory to path for imports
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from app import OpenAIClient
+def main():
+    """Example usage of the OpenAI-compatible OpenELM API."""
+    # Create client pointing to our local API
+    base_url = os.environ.get("OPENELM_API_URL", "http://localhost:8000")
+    client = OpenAIClient(base_url=base_url, api_key="dummy-key")
+    print("=" * 60)
+    print("OpenELM OpenAI API - Usage Example")
+    print("=" * 60)
+    print(f"API URL: {base_url}")
+    print()
+    # Example 1: Basic chat completion
+    print("Example 1: Basic Chat Completion")
+    print("-" * 40)
+    response = client.chat.completions.create(
+        model="openelm-450m-instruct",
+        messages=[
+            {"role": "user", "content": "Say hello in a friendly way!"}
+        ],
+        max_tokens=100,
+        temperature=0.7
+    )
+    print(f"Response ID: {response['id']}")
+    print(f"Model: {response['model']}")
+    print(f"Content: {response['choices'][0]['message']['content']}")
+    print(f"Usage: {response['usage']}")
+    print()
+    # Example 2: Multi-turn conversation
+    print("Example 2: Multi-turn Conversation")
+    print("-" * 40)
+    response = client.chat.completions.create(
+        model="openelm-450m-instruct",
+        messages=[
+            {"role": "user", "content": "What is artificial intelligence?"},
+            {"role": "assistant", "content": "Artificial intelligence (AI) refers to systems that can perform tasks that typically require human intelligence."},
+            {"role": "user", "content": "What are some examples?"}
+        ],
+        max_tokens=150,
+        temperature=0.5
+    )
+    print(f"Content: {response['choices'][0]['message']['content']}")
+    print(f"Usage: {response['usage']}")
+    print()
+    # Example 3: Using system message
+    print("Example 3: Using System Message")
+    print("-" * 40)
+    response = client.chat.completions.create(
+        model="openelm-450m-instruct",
+        messages=[
+            {"role": "system", "content": "You are a helpful coding assistant."},
+            {"role": "user", "content": "What is a Python decorator?"}
+        ],
+        max_tokens=200,
+        temperature=0.8
+    )
+    print(f"Content: {response['choices'][0]['message']['content']}")
+    print(f"Usage: {response['usage']}")
+    print()
+    # Example 4: Deterministic generation (temperature=0)
+    print("Example 4: Deterministic Generation (temperature=0)")
+    print("-" * 40)
+    response = client.chat.completions.create(
+        model="openelm-450m-instruct",
+        messages=[
+            {"role": "user", "content": "What is 2 + 2?"}
+        ],
+        max_tokens=50,
+        temperature=0.0  # Deterministic output
+    )
+    print(f"Content: {response['choices'][0]['message']['content']}")
+    print(f"Usage: {response['usage']}")
+    print()
+    # Example 5: Streaming response
+    print("Example 5: Streaming Response")
+    print("-" * 40)
+    print("Streaming response:")
+    response = client.chat.completions.create(
+        model="openelm-450m-instruct",
+        messages=[
+            {"role": "user", "content": "Count to 5, one number per line."}
+        ],
+        max_tokens=100,
+        temperature=0.7,
+        stream=True
+    )
+    # For streaming, response is a generator
+    chunk_count = 0
+    for chunk in response:
+        if 'choices' in chunk and chunk['choices']:
+            delta = chunk['choices'][0].get('delta', {})
+            if 'content' in delta:
+                content = delta['content']
+                if content:
+                    print(content, end="", flush=True)
+                    chunk_count += 1
+        elif 'error' in chunk:
+            print(f"Error: {chunk['error']}")
+            break
+    print("\n")
+    print(f"Received {chunk_count} chunks")
+    print()
+    print("=" * 60)
+    print("All examples completed successfully!")
+    print("=" * 60)
+if __name__ == "__main__":
+    main()
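Example 5 prints each delta as it arrives. When the full reply is needed afterwards, the deltas can be accumulated instead. The sketch below assumes chunks shaped like the ones the example loop reads (`choices[0].delta.content`); `collect_stream_text` and the sample chunks are hypothetical and not part of the repository.

```python
from typing import Any, Dict, Iterable


def collect_stream_text(chunks: Iterable[Dict[str, Any]]) -> str:
    """Concatenate the delta content of streamed chat completion chunks."""
    parts = []
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue  # e.g. an error payload without choices
        content = choices[0].get("delta", {}).get("content")
        if content:
            parts.append(content)
    return "".join(parts)


# Hand-written sample chunks for illustration
sample_chunks = [
    {"choices": [{"index": 0, "delta": {"content": "1\n"}}]},
    {"choices": [{"index": 0, "delta": {"content": "2\n"}}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
]
text = collect_stream_text(sample_chunks)
```

Chunks without a `delta` key, such as a final message-style payload, are skipped rather than appended, so the text is not duplicated.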