Dataset Design
FastGPT dataset file and data design
Relationship Between Files and Data
In FastGPT, files are stored using MongoDB's GridFS, while the actual data is stored in PostgreSQL (PG). Each row in PG has a file_id column that references the corresponding file. For backward compatibility and to support manual input and annotated data, file_id has some special values:
- manual: Manually entered data
- mark: Manually annotated data
Note: file_id is only written at data insertion time and cannot be modified afterward.
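The special values and the write-once rule can be illustrated with a minimal sketch. It is not FastGPT's actual code: the names DataRow and describe_source are hypothetical, and a frozen dataclass merely stands in for the rule that file_id is set at insertion and never changed.

```python
from dataclasses import dataclass

# Special file_id values named in the text above; any other value is
# assumed to be the id of a file stored in GridFS.
SPECIAL_FILE_IDS = {
    "manual": "Manually entered data",
    "mark": "Manually annotated data",
}


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class DataRow:
    """Hypothetical stand-in for one row of the PG data table."""
    file_id: str
    text: str


def describe_source(row: DataRow) -> str:
    """Return a readable origin for a row based on its file_id."""
    return SPECIAL_FILE_IDS.get(row.file_id, f"Imported from file {row.file_id}")
```

Attempting to assign to row.file_id on an existing DataRow raises an error, which mirrors the note that the column cannot be modified after insertion.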
File Import Process
- Upload the file to MongoDB GridFS and obtain a file_id. The file is marked as unused at this point.
- The browser parses the file to extract text and chunks.
- Each chunk is tagged with the file_id.
- Click upload: the file status changes to used, and the data is pushed to the mongo training collection to await processing.
- The training thread pulls data from mongo, generates vectors, and inserts them into PG.
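The steps above can be sketched as a small in-memory simulation. This is an illustration under stated assumptions, not FastGPT's implementation: plain Python containers stand in for GridFS, the mongo training collection, and the PG table, and every function name here (upload_file, split_into_chunks, confirm_upload, fake_embed, run_training_worker) is hypothetical.

```python
import uuid

files = {}           # stands in for GridFS: file_id -> {"status", "content"}
training_queue = []  # stands in for the mongo training collection
pg_rows = []         # stands in for the PG data table


def upload_file(content: str) -> str:
    """Step 1: store the file and mark it as unused."""
    file_id = uuid.uuid4().hex
    files[file_id] = {"status": "unused", "content": content}
    return file_id


def split_into_chunks(text: str, size: int = 20) -> list:
    """Step 2: naive fixed-size chunking (the real parser is not described)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def confirm_upload(file_id: str, chunks: list) -> None:
    """Steps 3-4: tag each chunk with file_id, mark the file used, enqueue."""
    files[file_id]["status"] = "used"
    for chunk in chunks:
        training_queue.append({"file_id": file_id, "text": chunk})


def fake_embed(text: str) -> list:
    """Placeholder vector; a real system would call an embedding model."""
    return [float(len(text))]


def run_training_worker() -> None:
    """Step 5: drain the queue, generate vectors, insert rows into PG."""
    while training_queue:
        item = training_queue.pop(0)
        pg_rows.append({
            "file_id": item["file_id"],
            "text": item["text"],
            "vector": fake_embed(item["text"]),
        })
```

Keeping the file in the unused state until the upload is confirmed means a file that was parsed but never submitted can be told apart from one whose chunks were actually queued.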