experiments#
Run experiments to test different models, prompts, and parameters for your LLM apps. Read our quickstart guide for more information.
functions#
These are used to run experiments and evaluations. To import functions, use the following:
from arize.experimental.datasets.experiments.functions import ...
- evaluate_experiment(examples, experiment_results, evaluators, *, rate_limit_errors=None, concurrency=3, tracer=None, resource=None, exit_on_error=False)#
Evaluate the results of an experiment using the provided evaluators.
- Parameters:
examples (Sequence[Example]) – The examples to evaluate.
experiment_results (Sequence[ExperimentRun]) – The results of the experiment.
evaluators (Evaluators) – The evaluators to use for assessment.
rate_limit_errors (Optional[RateLimitErrors]) – Optional exception type(s) to treat as rate limit errors.
concurrency (int) – The number of concurrent tasks to run. Default is 3.
tracer (Optional[Tracer]) – Optional tracer for tracing the evaluation.
resource (Optional[Resource]) – Optional resource for the evaluation.
exit_on_error (bool) – Whether to exit on error. Default is False.
- Returns:
The evaluation results.
- Return type:
List[ExperimentEvaluationRun]
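For illustration, a minimal sketch of re-scoring a finished experiment with a new evaluator. It assumes you already hold examples (Sequence[Example]) and runs (Sequence[ExperimentRun]) from a prior experiment, that plain callables are accepted as evaluators (the class-based Evaluator is documented below), and that the dataset has an "expected_answer" column; none of these names come from the API itself.
from arize.experimental.datasets.experiments.functions import evaluate_experiment

def contains_answer(output, dataset_row, **kwargs) -> float:
    # Hypothetical check: 1.0 when the expected answer appears in the task output.
    return float(str(dataset_row.get("expected_answer", "")) in str(output))

# `examples` and `runs` were kept from an earlier experiment.
eval_runs = evaluate_experiment(
    examples,
    runs,
    evaluators=[contains_answer],
    concurrency=3,
)
print(f"{len(eval_runs)} evaluation runs produced")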
- run_experiment(dataset, task, evaluators=None, *, experiment_name=None, tracer=None, resource=None, rate_limit_errors=None, concurrency=3, exit_on_error=False)#
Run an experiment on a dataset.
- Parameters:
dataset (pd.DataFrame) – The dataset to run the experiment on.
task (ExperimentTask) – The task to be executed on the dataset.
evaluators (Optional[Evaluators]) – Optional evaluators to assess the task.
experiment_name (Optional[str]) – Optional name for the experiment.
tracer (Optional[Tracer]) – Optional tracer for tracing the experiment.
resource (Optional[Resource]) – Optional resource for the experiment.
rate_limit_errors (Optional[RateLimitErrors]) – Optional exception type(s) to treat as rate limit errors.
concurrency (int) – The number of concurrent tasks to run. Default is 3.
exit_on_error (bool) – Whether to exit on error. Default is False.
- Returns:
The results of the experiment.
- Return type:
pd.DataFrame
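For illustration, a minimal sketch of running an experiment over a small in-memory dataset. The column names, the stub task, and the function-style evaluator are assumptions for the example, and it assumes the task callable receives the dataset row; swap in your real model or prompt call.
import pandas as pd

from arize.experimental.datasets.experiments.functions import run_experiment

dataset = pd.DataFrame(
    {
        "question": ["What is 2 + 2?", "What is the capital of France?"],
        "expected_answer": ["4", "Paris"],
    }
)

def task(dataset_row) -> str:
    # Stand-in for a real model or prompt call.
    return "4" if "2 + 2" in dataset_row["question"] else "Paris"

def exact_match(output, dataset_row, **kwargs) -> float:
    # 1.0 when the task output equals the expected answer for this row.
    return float(output == dataset_row["expected_answer"])

results_df = run_experiment(
    dataset,
    task,
    evaluators=[exact_match],
    experiment_name="baseline-prompt",
    concurrency=3,
)
print(results_df.head())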
evaluators#
These are used to create class-based evaluators. See our docs for more information.
To import evaluators, use the following:
from arize.experimental.datasets.experiments.evaluators.base import ...
- class Evaluator(*args, **kwargs)#
Bases:
ABC
A helper superclass that guides the implementation of an Evaluator object. Subclasses must implement either the evaluate or async_evaluate method. Implementing both methods is recommended, but not required.
This class is intended to be subclassed and should not be instantiated directly.
- async async_evaluate(*, output=None, expected=None, dataset_row=None, metadata=MappingProxyType({}), input=MappingProxyType({}), **kwargs)#
Asynchronously evaluate the given inputs and produce an evaluation result.
This method should be implemented by subclasses to perform the actual evaluation logic. It is recommended to implement both this asynchronous method and the synchronous evaluate method, but it is not required.
- Parameters:
output (Optional[TaskOutput]) – The output produced by the task.
expected (Optional[ExampleOutput]) – The expected output for comparison.
dataset_row (Optional[Mapping[str, JSONSerializable]]) – A row from the dataset.
metadata (ExampleMetadata) – Metadata associated with the example.
input (ExampleInput) – The input provided for evaluation.
**kwargs (Any) – Additional keyword arguments.
- Returns:
The result of the evaluation.
- Return type:
EvaluationResult
- Raises:
NotImplementedError – If the method is not implemented by the subclass.
- evaluate(*, output=None, expected=None, dataset_row=None, metadata=MappingProxyType({}), input=MappingProxyType({}), **kwargs)#
Evaluate the given inputs and produce an evaluation result.
This method should be implemented by subclasses to perform the actual evaluation logic. It is recommended to implement both this synchronous method and the asynchronous async_evaluate method, but it is not required.
- Parameters:
output (Optional[TaskOutput]) – The output produced by the task.
expected (Optional[ExampleOutput]) – The expected output for comparison.
dataset_row (Optional[Mapping[str, JSONSerializable]]) – A row from the dataset.
metadata (ExampleMetadata) – Metadata associated with the example.
input (ExampleInput) – The input provided for evaluation.
**kwargs (Any) – Additional keyword arguments.
- Returns:
The result of the evaluation.
- Return type:
EvaluationResult
- Raises:
NotImplementedError – If the method is not implemented by the subclass.
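A sketch of a subclass that implements both methods; the keyword names mirror the signatures above, and the keyword-matching logic is purely illustrative.
from arize.experimental.datasets.experiments.evaluators.base import Evaluator
from arize.experimental.datasets.experiments.types import EvaluationResult

class ContainsKeyword(Evaluator):
    def __init__(self, keyword: str):
        self.keyword = keyword

    def evaluate(self, *, output=None, dataset_row=None, **kwargs) -> EvaluationResult:
        # 1.0 when the keyword appears anywhere in the task output.
        found = self.keyword.lower() in str(output).lower()
        return EvaluationResult(
            score=float(found),
            label="contains_keyword" if found else "missing_keyword",
            explanation=f"Searched the output for {self.keyword!r}.",
        )

    async def async_evaluate(self, *, output=None, dataset_row=None, **kwargs) -> EvaluationResult:
        # The check itself is synchronous, so the async variant simply delegates.
        return self.evaluate(output=output, dataset_row=dataset_row, **kwargs)
An instance such as ContainsKeyword("Paris") can then be passed in the evaluators list of run_experiment or evaluate_experiment.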
types#
These are the classes used across the experiment functions.
To import types, use the following:
from arize.experimental.datasets.experiments.types import ...
- class Example(id=<factory>, updated_at=<factory>, input=<factory>, output=<factory>, metadata=<factory>, dataset_row=<factory>)#
Bases:
object
Represents an example in an experiment dataset.
- Variables:
id (ExampleId) – The unique identifier for the example.
updated_at (datetime) – The timestamp when the example was last updated.
input (Mapping[str, JSONSerializable]) – The input data for the example.
output (Mapping[str, JSONSerializable]) – The output data for the example.
metadata (Mapping[str, JSONSerializable]) – Additional metadata for the example.
dataset_row – The original dataset row containing the example data.
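In normal use Example objects are created for you from the dataset, but a hand-constructed instance (with made-up values) shows the shape of the fields:
from datetime import datetime, timezone

from arize.experimental.datasets.experiments.types import Example

example = Example(
    id="example-1",
    updated_at=datetime.now(timezone.utc),
    input={"question": "What is 2 + 2?"},
    output={"answer": "4"},
    metadata={"source": "smoke-test"},
    dataset_row={"question": "What is 2 + 2?", "expected_answer": "4"},
)
print(example.input["question"], "->", example.output["answer"])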
- class EvaluationResult(score=None, label=None, explanation=None, metadata=<factory>)#
Bases:
object
Represents the result of an evaluation.
- Variables:
score (Optional[float]) – The score of the evaluation.
label (Optional[str]) – The label of the evaluation.
explanation (Optional[str]) – The explanation of the evaluation.
metadata (Mapping[str, JSONSerializable]) – Additional metadata for the evaluation.
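Evaluators produce these results; a small sketch with illustrative field values:
from arize.experimental.datasets.experiments.types import EvaluationResult

result = EvaluationResult(
    score=0.0,
    label="hallucinated",
    explanation="The answer cites a document that is not in the provided context.",
    metadata={"evaluator_version": "v1"},
)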
- class ExperimentRun(start_time, end_time, experiment_id, dataset_example_id, repetition_number, output, error=None, id=<factory>, trace_id=None)#
Bases:
object
Represents a single run of an experiment.
- Variables:
start_time (datetime) – The start time of the experiment run.
end_time (datetime) – The end time of the experiment run.
experiment_id (str) – The unique identifier for the experiment.
dataset_example_id (str) – The unique identifier for the dataset example.
repetition_number (int) – The repetition number of the experiment run.
output (JSONSerializable) – The output of the experiment run.
error (Optional[str]) – The error message if the experiment run failed.
id (str) – The unique identifier for the experiment run.
trace_id (Optional[str]) – The trace identifier for the experiment run.
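ExperimentRun records describe individual task executions and are what evaluate_experiment accepts as experiment_results; a hypothetical helper for inspecting them:
def summarize_runs(runs):
    # `runs` is a Sequence[ExperimentRun]; report status, latency, and output per run.
    for run in runs:
        status = "failed" if run.error else "ok"
        latency_s = (run.end_time - run.start_time).total_seconds()
        print(f"{run.id} [{status}] {latency_s:.2f}s -> {run.output!r}")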
- class ExperimentEvaluationRun(experiment_run_id, start_time, end_time, name, annotator_kind, error=None, result=None, id=<factory>, trace_id=None)#
Bases:
object
Represents a single evaluation run of an experiment.
- Variables:
experiment_run_id (ExperimentRunId) – The unique identifier for the experiment run.
start_time (datetime) – The start time of the evaluation run.
end_time (datetime) – The end time of the evaluation run.
name (str) – The name of the evaluation run.
annotator_kind (str) – The kind of annotator used in the evaluation run.
error (Optional[str]) – The error message if the evaluation run failed.
result (Optional[EvaluationResult]) – The result of the evaluation run.
id (str) – The unique identifier for the evaluation run.
trace_id (Optional[TraceId]) – The trace identifier for the evaluation run.
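ExperimentEvaluationRun records are what evaluate_experiment returns; a hypothetical helper for aggregating their scores:
def mean_score(eval_runs):
    # Ignore failed runs and results without a numeric score.
    scores = [
        run.result.score
        for run in eval_runs
        if run.error is None and run.result is not None and run.result.score is not None
    ]
    return sum(scores) / len(scores) if scores else None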