
community[minor]: Spider Document Loader #5415

Merged
21 commits merged on May 20, 2024

Changes from 1 commit
Commit: format
WilliamEspegren committed Apr 29, 2024
commit c189f0904a79becf94a043ba110ea4faba9fbdbd
20 changes: 10 additions & 10 deletions examples/src/document_loaders/spider.ts
@@ -1,13 +1,13 @@
import { SpiderLoader } from "langchain/document_loaders/web/spider";

Hey there! 👋 I've reviewed the code and flagged a change in the PR for your attention. The addition of apiKey: process.env.SPIDER_API_KEY accesses an environment variable, so please take a look when you get a chance. Let me know if you need further assistance!


const loader = new SpiderLoader({
-  url: "https://spider.cloud", // The URL to scrape
-  apiKey: process.env.SPIDER_API_KEY, // Optional, defaults to `SPIDER_API_KEY` in your env.
-  mode: "scrape", // The mode to run the crawler in. Can be "scrape" for single urls or "crawl" for deeper scraping following subpages
-  params: {
-    // optional parameters based on Spider API docs
-    // For API documentation, visit https://spider.cloud/docs/api
-  },
-});
-const docs = await loader.load();
+  url: "https://spider.cloud", // The URL to scrape
+  apiKey: process.env.SPIDER_API_KEY, // Optional, defaults to `SPIDER_API_KEY` in your env.
+  mode: "scrape", // The mode to run the crawler in. Can be "scrape" for single urls or "crawl" for deeper scraping following subpages
+  params: {
+    // optional parameters based on Spider API docs
+    // For API documentation, visit https://spider.cloud/docs/api
+  },
+});
+
+const docs = await loader.load();
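
For context on the "scrape" vs. "crawl" comment in the example, a crawl-mode variant would differ only in the `mode` option. The following sketch is illustrative and not part of the diff; it assumes the same SpiderLoader options shown above, and the variable names are made up.

import { SpiderLoader } from "langchain/document_loaders/web/spider";

// Illustrative sketch (not in this PR): crawl mode follows subpages
// instead of scraping a single URL; all other options are unchanged.
const crawlLoader = new SpiderLoader({
  url: "https://spider.cloud", // The starting URL to crawl
  apiKey: process.env.SPIDER_API_KEY, // Optional, defaults to `SPIDER_API_KEY` in your env.
  mode: "crawl", // Follow subpages rather than scraping one page
  params: {
    // Optional Spider API parameters, see https://spider.cloud/docs/api
  },
});

const crawledDocs = await crawlLoader.load();
console.log(`Loaded ${crawledDocs.length} documents`);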
2 changes: 1 addition & 1 deletion langchain/package.json
@@ -1241,7 +1241,7 @@
"@mendable/firecrawl-js": "^0.0.13",

Hey there! 👋 I noticed that a new package "@spider-cloud/spider-client" has been added to the dependencies in the package.json file. This change is flagged for your review to ensure it aligns with the project's dependency management strategy. Keep up the great work! 🚀

"@notionhq/client": "^2.2.10",
"@pinecone-database/pinecone": "^1.1.0",
"@spider-cloud/spider-client": "^0.0.10",
"@spider-cloud/spider-client": "^0.0.11",
"@supabase/supabase-js": "^2.10.0",
"@swc/core": "^1.3.90",
"@swc/jest": "^0.2.29",
2 changes: 1 addition & 1 deletion langchain/src/document_loaders/tests/spider.int.test.ts
@@ -31,4 +31,4 @@ test("Test SpiderLoader load method with crawl mode", async () => {
expect(document).toBeInstanceOf(Document);
expect(document.pageContent).toBeTruthy();
expect(document.metadata).toBeTruthy();
-}, 15000);
+}, 15000);
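
The hunk above shows only the tail of the crawl-mode integration test. For orientation, the surrounding test body presumably looks something like the sketch below; the import paths and loader options are assumptions for illustration, not lines from the PR.

import { expect, test } from "@jest/globals";
import { Document } from "@langchain/core/documents";
import { SpiderLoader } from "../web/spider.js";

// Illustrative reconstruction of the surrounding test (not copied from the PR).
test("Test SpiderLoader load method with crawl mode", async () => {
  const loader = new SpiderLoader({
    url: "https://spider.cloud",
    mode: "crawl",
  });
  const documents = await loader.load();
  const document = documents[0];
  expect(document).toBeInstanceOf(Document);
  expect(document.pageContent).toBeTruthy();
  expect(document.metadata).toBeTruthy();
}, 15000);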
4 changes: 2 additions & 2 deletions langchain/src/document_loaders/web/spider.ts
@@ -24,7 +24,7 @@ interface SpiderLoaderParameters {
mode?: "crawl" | "scrape";
params?: Record<string, unknown>;
}
-interface SpiderDocument {
+interface SpiderDocument {
markdown: string;
metadata: Record<string, unknown>;
}
@@ -105,4 +105,4 @@ export class SpiderLoader extends BaseDocumentLoader {
})
);
}
-}
+}
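
The closing braces above end the load() method's mapping from Spider results to LangChain documents. As a hedged sketch of that mapping (the toDocuments helper name is made up; the loader does this inline), the SpiderDocument shape from the first hunk feeds a Document roughly like this:

import { Document } from "@langchain/core/documents";

interface SpiderDocument {
  markdown: string;
  metadata: Record<string, unknown>;
}

// Illustrative sketch: the scraped markdown becomes the document's
// pageContent and the Spider metadata is carried over unchanged.
const toDocuments = (results: SpiderDocument[]): Document[] =>
  results.map(
    (doc) =>
      new Document({
        pageContent: doc.markdown,
        metadata: doc.metadata,
      })
  );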