Skip to content

Commit

Permalink
docs: update loader part (#1160)
Browse files Browse the repository at this point in the history
Co-authored-by: Wendong <[email protected]>
Co-authored-by: Wendong-Fan <[email protected]>
  • Loading branch information
3 people authored Nov 8, 2024
1 parent 7468dde commit 62f0c90
Showing 1 changed file with 243 additions and 1 deletion.
244 changes: 243 additions & 1 deletion docs/key_modules/loaders.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

## 1. Concept
CAMEL introduced two IO modules, `Base IO` and `Unstructured IO` which are designed for handling various file types and unstructured data processing.

Additionally, four new data readers were added, `Apify Reader`,`Chunkr Reader`, `Firecrawl Reader`, and `Jina_url Reader`, which enable retrieval of external data for improved data integration and analysis.

## 2. Types

Expand All @@ -12,6 +12,19 @@ Base IO module is focused on fundamental input/output operations related to file
### 2.2. Unstructured IO
Unstructured IO module deals with the handling, parsing, and processing of unstructured data. It provides tools for parsing files or URLs, cleaning data, extracting specific information, staging elements for different platforms, and chunking elements. The core of this module lies in its advanced ETL capabilities to manipulate unstructured data to make it usable for various applications like Retrieval-Augmented Generation(RAG).

### 2.3. Apify Reader
Apify Reader provides a Python interface to interact with the Apify platform for automating web workflows. It allows users to authenticate via an API key and offers methods to execute and manage actors (automated web tasks) and datasets on the platform.
It includes functionalities for client initialization, actor management, dataset operation.

### 2.4. Chunkr Reader
Chunkr Reader is a Python client for interacting with the Chunkr API, which processes documents and returns content in various formats. It includes functionalities for client intialization, task management, formating response.

### 2.5. Firecrawl Reader
Firecrawl Reader provides a Python interface to interact with the Firecrawl API, allowing users to turn websites into large language model (LLM)-ready markdown format.

### 2.6. Jina_url Reader
JinaURL Reader is a Python client for Jina AI's URL reading service, optimized to provide cleaner, LLM-friendly content from URLs.

## 3. Get Started

### 3.1. Using `Base IO`
Expand Down Expand Up @@ -121,3 +134,232 @@ print(staged_element)
>>> {'rows': [{'data': {'type': 'UncategorizedText', 'element_id': 'e78902d05b0cb1e4c38fc7a79db450d5', 'text': 'CNN\n \xa0—'}, 'metadata': {'filetype': 'text/html', 'languages': ['eng'], 'page_number': 1, 'url': 'https://www.cnn.com/2023/01/30/sport/empire-state-building-green-philadelphia-eagles-spt-intl/index.html', 'emphasized_text_contents': ['CNN'], 'emphasized_text_tags': ['span']}}, ...
```
This is a basic guide to get you started with the `Unstructured IO` module. For more advanced usage, refer to the specific method documentation and the [Unstructured IO Documentation](https://unstructured-io.github.io/unstructured/).

### 3.3. Using `Apify Reader`

Initialize the client, set up the required actors and parameters.
```python
apify = Apify()

run_input = {
"startUrls": [{"url": "https://www.camel-ai.org/"}],
"maxCrawlDepth": 0,
"maxCrawlPages": 1,
}
actor_result = apify.run_actor(
actor_id="apify/website-content-crawler", run_input=run_input
)
```

Retrieve the result database ID and access it using the get_dataset_items method.
```python
dataset_result = apify.get_dataset_items(
dataset_id=actor_result["defaultDatasetId"]
)
```
```markdown
>>>[{'url': 'https://www.camel-ai.org/', 'crawl': {'loadedUrl': 'https://www.camel
-ai.org/', 'loadedTime': '2024-10-27T04:51:16.651Z', 'referrerUrl': 'https://ww
w.camel-ai.org/', 'depth': 0, 'httpStatusCode': 200}, 'metadata': {'canonicalUr
l': 'https://www.camel-ai.org/', 'title': 'CAMEL-AI', 'description': 'CAMEL-AI.
org is the 1st LLM multi-agent framework and an open-source community dedicated
to finding the scaling law of agents.', 'author': None, 'keywords': None, 'lan
guageCode': 'en', 'openGraph': [{'property': 'og:title', 'content': 'CAMEL-AI'
}, {'property': 'og:description', 'content': 'CAMEL-AI.org is the 1st LLM mult
i-agent framework and an open-source community dedicated to finding the scalin
g law of agents.'}, {'property': 'twitter:title', 'content': 'CAMEL-AI'}, {'pr
operty': 'twitter:description', 'content': 'CAMEL-AI.org is the 1st LLM multi-
agent framework and an open-source community dedicated to finding the scaling
law of agents.'}, {'property': 'og:type', 'content': 'website'}], 'jsonLd': No
ne, 'headers': {'date': 'Sun, 27 Oct 2024 04:50:18 GMT', 'content-type': 'text
/html', 'cf-ray': '8d901082dae7efbe-PDX', 'cf-cache-status': 'HIT', 'age': '10
81', 'content-encoding': 'gzip', 'last-modified': 'Sat, 26 Oct 2024 11:51:32 G
MT', 'strict-transport-security': 'max-age=31536000', 'surrogate-control': 'ma
x-age=432000', 'surrogate-key': 'www.camel-ai.org 6659a154491a54a40551bc78 pag
eId:6686a2bcb7ece5fb40457491 668181a0a818ade34e653b24 6659a155491a54a40551bd7e
', 'x-lambda-id': 'd6c4424b-ac67-4c54-b52a-cb2a22ca09f0', 'vary': 'Accept-Enco
ding', 'set-cookie': '__cf_bm=oX5EmZ2SNJDOBQXI8dL_reCYlCpp1FMzu40qCNxiopU-1730
004618-1.0.1.1-3teEeqUoemzHWAeGCtlPJVqdmAbiFkyu3JxopKfQFFndSCi_Z56RR.UDjLGZiq.
L_4LvAZYmNKxQ.k6VRhbA7g; path=/; expires=Sun, 27-Oct-24 05:20:18 GMT; domain=.
cdn.webflow.com; HttpOnly; Secure; SameSite=None\n_cfuvid=om_8lj9jNMIh.HEIxEAq
gszhEWaKlyS2kdXKwqGedSM-1730004618924-0.0.1.1-604800000; path=/; domain=.cdn.w
ebflow.com; HttpOnly; Secure; SameSite=None', 'alt-svc': 'h3=":443"; ma=86400'
, 'x-cluster-name': 'us-west-2-prod-hosting-red', 'x-firefox-spdy': 'h2'}}, 's
creenshotUrl': None, 'text': 'Build Multi-Agent Systems for _\nFEATURES & Inte
grations\nSeamless integrations with\npopular platforms \nScroll to explore ou
r features & integrations.', 'markdown': '# Build Multi-Agent Systems for \\_
\n\nFEATURES & Integrations\n\n## Seamless integrations with \npopular platfo
rms\n\nScroll to explore our features & integrations.'}]
```

### 3.4. Using `Firecrawl Reader`

Initialize the client and set the URL from which we want to retrieve information. When the status is "completed," the information retrieval is finished and ready for reading.
```python
firecrawl = Firecrawl()

response = firecrawl.crawl(url="https://www.camel-ai.org/about")
print(response["status"])
```
```markdown
>>>completed
```

Directly retrieve information from the returned results.
```python
print(response["data"][0]["markdown"])
```
```markdown
>>>Camel-AI Team

We are finding the
scaling law of agent

🐫 CAMEL is an open-source library designed for the study of autonomous and
communicative agents. We believe that studying these agents on a large scale
offers valuable insights into their behaviors, capabilities, and potential
risks. To facilitate research in this field, we implement and support various
types of agents, tasks, prompts, models, and simulated environments.

**We are** always looking for more **contributors** and **collaborators**.
Contact us to join forces via [Slack](https://join.slack.com/t/camel-kwr1314/
shared_invite/zt-1vy8u9lbo-ZQmhIAyWSEfSwLCl2r2eKA)
or [Discord](https://discord.gg/CNcNpquyDc)...
```

### 3.5. Using `Chunkr Reader`

Initialize the client and set the local PDF file path, then use the generated task id to featch the output.

```python
from camel.loaders import ChunkrReader

chunkr = ChunkrReader()

task_id = chunkr.submit_task(
file_path="/Users/enrei/Large Language Model based Multi-Agents- A Survey of Progress and Challenges.pdf",
model="Fast",
ocr_strategy="Auto",
target_chunk_length=512,
)

print(task_id)
```
```markdown
>>>"7becf001-6f07-4f63-bddf-5633df363bbb"
```

Featch the output.
```python
task_output = chunkr.get_task_output(
task_id="7becf001-6f07-4f63-bddf-5633df363bbb"
)

print(task_output)
```
```markdown
>>>{
"task_id": "7becf001-6f07-4f63-bddf-5633df363bbb",
"status": "Succeeded",
"created_at": "2024-11-08T12:45:04.260765Z",
"finished_at": "2024-11-08T12:45:48.942365Z",
"expires_at": null,
"message": "Task succeeded",
"output": {
"chunks": [
{
"segments": [
{
"segment_id": "d53ec931-3779-41be-a220-3fe4da2770c5",
"bbox": {
"left": 224.16666,
"top": 370.0,
"width": 2101.6665,
"height": 64.166664
},
"page_number": 1,
"page_width": 2550.0,
"page_height": 3300.0,
"content": "Large Language Model based Multi-Agents: A Survey of Progress and Challenges",
"segment_type": "Title",
"ocr": null,
"image": "https://chunkmydocs-bucket-prod.storage.googleapis.com/e68161bd-5d39-4cbf-afb2-5c6640a05522/7becf001-6f07-4f63-bddf-5633df363bbb/images/d53ec931-3779-41be-a220-3fe4da2770c5.jpg?x-id=GetObject&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=GOOG1E67ULNM7PPHKQDVSRZD64OWC4CJTKOHXCOIDKI5QCMJK4U6ROEJQSOJM%2F20241108%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20241108T125023Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=7d2854d7d87ef0b152ac342dfdc0caa7c2ac10418e9445b73c584a9b34736fbf",
"html": "<h1>Large Language Model based Multi-Agents: A Survey of Progress and Challenges</h1>",
"markdown": "# Large Language Model based Multi-Agents: A Survey of Progress and Challenges\n\n"
}
],
"chunk_length": 11
},
{
"segments": [
{
"segment_id": "7bb38fc7-c1b3-4153-a3cc-116c0b9caa0a",
"bbox": {
"left": 432.49997,
"top": 474.16666,
"width": 1659.9999,
"height": 122.49999
},
"page_number": 1,
"page_width": 2550.0,
"page_height": 3300.0,
"content": "Taicheng Guo 1 , Xiuying Chen 2 , Yaqi Wang 3 \u2217 , Ruidi Chang , Shichao Pei 4 , Nitesh V. Chawla 1 , Olaf Wiest 1 , Xiangliang Zhang 1 \u2020",
"segment_type": "Text",
"ocr": null,
"image": "https://chunkmydocs-bucket-prod.storage.googleapis.com/e68161bd-5d39-4cbf-afb2-5c6640a05522/7becf001-6f07-4f63-bddf-5633df363bbb/images/7bb38fc7-c1b3-4153-a3cc-116c0b9caa0a.jpg?x-id=GetObject&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=GOOG1E67ULNM7PPHKQDVSRZD64OWC4CJTKOHXCOIDKI5QCMJK4U6ROEJQSOJM%2F20241108%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20241108T125023Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=14b9a5048ff052a3641d100ec60da1b1a26e512e9863d09619bfe8237a2bdf80",
"html": "<p>Taicheng Guo 1 , Xiuying Chen 2 , Yaqi Wang 3 \u2217 , Ruidi Chang , Shichao Pei 4 , Nitesh V. Chawla 1 , Olaf Wiest 1 , Xiangliang Zhang 1 \u2020</p>",
"markdown": "Taicheng Guo 1 , Xiuying Chen 2 , Yaqi Wang 3 \u2217 , Ruidi Chang , Shichao Pei 4 , Nitesh V. Chawla 1 , Olaf Wiest 1 , Xiangliang Zhang 1 \u2020\n\n"
},
{
"segment_id": "35c9e3ab-399c-43af-8f9e-60f5d9263c80",
"bbox": {
"left": 440.8333,
"top": 599.1666,
"width": 1668.3333,
"height": 64.166664
},
"page_number": 1,...
}]}]}}
```

### 3.6. Using `Jina Reader`

Initialize the client and set the URL from which we want to retrieve information, then print the response.

```python
from camel.loaders import JinaURLReader
from camel.types.enums import JinaReturnFormat

jina_reader = JinaURLReader(return_format=JinaReturnFormat.MARKDOWN)
response = jina_reader.read_content("https://docs.camel-ai.org/")

print(response)
```
```markdown
>>>Welcome to CAMEL’s documentation! — CAMEL 0.2.6 documentation
===============

[Skip to main content](https://docs.camel-ai.org/#main-content)

Back to top Ctrl+K

[![Image 1](https://raw.githubusercontent.com/camel-ai/camel/master/misc/logo_light.png) ![Image 2](https://raw.githubusercontent.com/camel-ai/camel/master/misc/logo_light.png)CAMEL 0.2.6](https://docs.camel-ai.org/#)

Search Ctrl+K

Get Started

* [Installation](https://docs.camel-ai.org/get_started/installation.html)
* [API Setup](https://docs.camel-ai.org/get_started/setup.html)

Agents

* [Creating Your First Agent](https://docs.camel-ai.org/cookbooks/create_your_first_agent.html)
* [Creating Your First Agent Society](https://docs.camel-ai.org/cookbooks/create_your_first_agents_society.html)
* [Embodied Agents](https://docs.camel-ai.org/cookbooks/embodied_agents.html)
* [Critic Agents and Tree Search](https://docs.camel-ai.org/cookbooks/critic_agents_and_tree_search.html)

Key Modules

* [Models](https://docs.camel-ai...)
```

0 comments on commit 62f0c90

Please sign in to comment.