FlagOpen · 545999961 · Nov 16, 2024 · Nov 16, 2024
diff --git a/FlagEmbedding/evaluation/mteb/arguments.py b/FlagEmbedding/evaluation/mteb/arguments.py
@@ -18,6 +18,6 @@ class MTEBEvalArgs(AbsEvalArgs):
     use_special_instructions: bool = field(
         default=False, metadata={"help": "Whether to use specific instructions in `prompts.py` for evaluation. Default: False"}
     )
-    use_special_examples: bool = field(
-        default=False, metadata={"help": "Whether to use specific examples in `examples` for evaluation. Default: False"}
-    )
+    examples_path: str = field(
+        default=None, metadata={"help": "Use specific examples in the path. Default: None"}
+    )
diff --git a/FlagEmbedding/evaluation/mteb/examples.py b/FlagEmbedding/evaluation/mteb/examples.py
diff --git a/FlagEmbedding/evaluation/mteb/runner.py b/FlagEmbedding/evaluation/mteb/runner.py
@@ -134,10 +134,9 @@ def run(self):
                 except:
                     logger.logger.info(f"No instruction found for {task_name}")
 
-            if self.eval_args.use_special_examples:
+            if self.eval_args.examples_path is not None:
                 try:
-                    eg_pairs = examples_dict[task_name]
-                    self.retriever.set_examples(eg_pairs)
+                    eg_pairs = json.load(open(os.path.join(self.eval_args.examples_path, task_name + '.json')))
                 except:
                     logger.logger.info(f"No examples found for {task_name}")
 

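The `runner.py` hunk above replaces the in-module `examples_dict` lookup with a per-task JSON file read from `examples_path`. As shown, the hunk also drops the `self.retriever.set_examples(eg_pairs)` call, so the sketch below covers only the loading step and leaves handing the result to the retriever to the caller. It is a minimal illustration and not the repository's implementation: the helper name `load_task_examples` and the module-level `logger` are hypothetical. The `<examples_path>/<task_name>.json` layout is taken from the diff.

```python
import json
import logging
import os
from typing import Any, Optional

logger = logging.getLogger(__name__)


def load_task_examples(examples_path: Optional[str], task_name: str) -> Optional[Any]:
    """Return the parsed contents of <examples_path>/<task_name>.json, or None.

    Hypothetical helper: mirrors the lookup in the diff, but closes the file
    with a context manager and catches only I/O and parsing errors instead of
    using a bare `except`.
    """
    if examples_path is None:
        # Matches the `examples_path is not None` guard in the diff.
        return None

    file_path = os.path.join(examples_path, task_name + ".json")
    try:
        with open(file_path, encoding="utf-8") as f:
            return json.load(f)
    except (OSError, ValueError):
        # OSError: file missing or unreadable.
        # ValueError: invalid JSON (json.JSONDecodeError) or bad encoding.
        logger.info(f"No examples found for {task_name}")
        return None
```

A caller would check the return value for `None` before passing it on, which keeps the "no examples for this task" case explicit without swallowing unrelated exceptions.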
diff --git a/examples/evaluation/README.md b/examples/evaluation/README.md
@@ -115,7 +115,7 @@ For MTEB, we primarily use the official [MTEB](https://github.com/embeddings-ben
 - **`tasks`**: Tasks to evaluate. Default: None
 - **`task_types`**: The task types to evaluate. Default: None
 - **`use_special_instructions`**: Whether to use specific instructions in `prompts.py` for evaluation. Default: False
-- **`use_special_examples`**: Whether to use specific examples in `examples.py` for evaluation. Default: False
+- **`examples_path`**: Use specific examples in the path. Default: None
 
 Here is an example for evaluation:
 

diff --git a/research/llm_dense_retriever/examples/bge-en-icl/AIR-Bench/long-doc/arxiv-gemini.jsonl b/research/llm_dense_retriever/examples/bge-en-icl/AIR-Bench/long-doc/arxiv-gemini.jsonl
@@ -0,0 +1,3 @@
+{"query": "So, which AI model did the best on the MMMU benchmark according to Yue and his team back in 2023?", "pos": "MMMU (val) Gemini Ultra (0-shot) GPT-4V (0-shot)\nMaj@32 pass@1 pass@1\nArt & Design 74.2 70.0 65.8\nBusiness 62.7 56.7 59.3\nScience 49.3 48.0 54.7\nHealth & Medicine 71.3 67.3 64.7\nHumanities & Social Science 78.3 78.3 72.5\nTechnology & Engineering 53.0 47.1 36.7\nOverall 62.4 59.4 56.8\nTable 8|Gemini Ultra performance on the MMMU benchmark (Yue et al., 2023) per discipline."}
+{"query": "The GSPMD partitioner, part of the XLA compiler, is responsible for dividing the training step calculation.", "pos": "The GSPMD partitioner (Xu et al., 2021) in the XLA compiler\npartitions the training step computation, and the MegaScale XLA compiler (XLA, 2019) pass statically\nschedules appropriate collectives so that they maximally overlap with the computation with very little\nvariation in step time.\nMaintaining a high goodput2at this scale would have been impossible using the conventional\napproach of periodic checkpointing of weights to persistent cluster storage. For Gemini models, we\ninstead made use of redundant in-memory copies of the model state, and on any unplanned hardware\nfailures, we rapidly recover directly from an intact model replica."}
+{"query": "What's the impact of where you live and your social status on how well AI image labeling tech works?", "pos": "Thoughwedo\nnot see large discrepancies across different groups, we note that this metric is imperfect as the human\nreference captions could be inherently biased. Additionally, we perform a zero-shot classification style\nevaluation with the Dollarstreet dataset (Rojas et al., 2022) to measure discrepancies in performance\nacross images which come from different geographic locations. As is seen in previous work, we find\nthat models work less effectively for images from lower socioeconomic regions and regions outside\nNorth America and Europe. This is an area where we need further research and work to improve in\nfuture iterations of our models.\nIn addition to comparing performance on tasks across groups, we also consider how people are\ndescribed in captions."}
diff --git a/research/llm_dense_retriever/examples/bge-en-icl/AIR-Bench/long-doc/arxiv-gpt3.jsonl b/research/llm_dense_retriever/examples/bge-en-icl/AIR-Bench/long-doc/arxiv-gpt3.jsonl
@@ -0,0 +1,3 @@
+{"query": "What gauges the effects of data contamination?", "pos": "We also undertake a systematic study of “data contamination” – a growing problem when training high capacity models\non datasets such as Common Crawl, which can potentially include content from test datasets simply because such\ncontent often exists on the web. In this paper we develop systematic tools to measure data contamination and quantify\nits distorting effects. Although we ﬁnd that data contamination has a minimal effect on GPT-3’s performance on most\ndatasets, we do identify a few datasets where it could be inﬂating results, and we either do not report results on these\ndatasets or we note them with an asterisk, depending on the severity."}
+{"query": "What strategies did the United States employ to convince Pakistan to exercise its influence over the Taliban?", "pos": "Direct\npressure on the Taliban had proved unsuccessful. As one NSC staff note\nput it, \"Under the Taliban, Afghanistan is not so much a state sponsor\nof terrorism as it is a state sponsored by terrorists.\" In early 2000,\nthe United States began a high-level effort to persuade Pakistan to use\nits influence over the Taliban. In January 2000, Assistant Secretary\nof State Karl Inderfurth and the State Department’s counterterrorism\ncoordinator, Michael Sheehan, met with General Musharraf in Islamabad,\ndangling before him the possibility of a presidential visit in March as a\nreward for Pakistani cooperation. Such a visit was coveted by Musharraf,\npartly as a sign of his government’s legitimacy."}
+{"query": "What does carrying rotten potatoes symbolize?", "pos": "The children started complaining about the\ntrouble loudly.\nThen Mrs. Smith told them why she asked them to play the game. She\nsaid,\"This is exactly the situation when you carry your hatred for somebody\ninside your heart. The terrible smell of the hatred will pollute your\nheart and you will carry something unnecessary with you all the time. If\nyou cannot stand the smell of the rotten potatoes for just two weeks, can\nyou imagine how heavy it would be to have the hatred in your heart for your\nlifetime? So throw away any hatred from your heart, and you’ll be really\nhappy.\"\nQ: Which of the following is True according to the passage?\nA: If a kid hated four people,he or she had to carry four potatoes."}
diff --git a/research/llm_dense_retriever/examples/bge-en-icl/AIR-Bench/long-doc/arxiv-llama2.jsonl b/research/llm_dense_retriever/examples/bge-en-icl/AIR-Bench/long-doc/arxiv-llama2.jsonl
@@ -0,0 +1,3 @@
+{"query": "Could you elucidate on the values of temperature and top-p that are utilized for pass@1 scores?", "pos": "8 62.8\n13B 18.3 60.2 30.6 69.0\n34B 22.6 77.2 33.0 76.1\n70B29.9 89.0 45.0 81.4\nTable 21: Code generation results on Human-Eval and MBPP . We report 0-shot and 3-shot results for\nHuman-Eval and MBPP respectively. For pass@100 and pass@80 scores, we use a temperature of 0.8 and\ntop-p=0.95. For pass@1 scores, we use a temperature of 0.1 and top- p=0.95.\n49"}
+{"query": "What do high safety scores and low helpfulness ratings suggest?", "pos": "Here we show more evidence and\nqualitative results to manifest this tension. Figure32 are two scatter plots of helpfulness and safety reward\nmodel scores on the safety test set for safe and unsafe responses. The tension can be observed at the bottom\nright corner (i.e., high safety score but low helpfulness score) in the safe response plot (left) and the top left\ncorner (i.e., low safety score but high helpfulness score) in the unsafe response plot (right). We also list two\nqualitative examples where safety and helpfulness reward models don’t agree with each other in Table 35."}
+{"query": "The process of carefully adjusting precautions relies on using challenging stimuli together with protected displays to make its operation run more smoothly.", "pos": "4.2 Safety Fine-Tuning\nIn this section, we describe our approach to safety fine-tuning, including safety categories, annotation\nguidelines,and the techniques we use to mitigate safety risks. We employ a process similar to the general\nfine-tuning methods as described in Section 3, with some notable differences related to safety concerns.\nSpecifically, we use the following techniques in safety fine-tuning:\n1.Supervised Safety Fine-Tuning : We initialize by gathering adversarial prompts and safe demonstra-\ntions that are then included in the general supervised fine-tuning process (Section 3.1). This teaches\nthe model to align with our safety guidelines even before RLHF,and thus lays the foundation for\nhigh-quality human preference data annotation."}
diff --git a/research/llm_dense_retriever/examples/bge-en-icl/AIR-Bench/long-doc/arxiv-llm-survey.jsonl b/research/llm_dense_retriever/examples/bge-en-icl/AIR-Bench/long-doc/arxiv-llm-survey.jsonl
@@ -0,0 +1,3 @@
+{"query": "What are the pre-training challenges for large language models?", "pos": "To make this survey more self-contained, we present the\ndetailed formulations for these configurations in Table 6.\nNormalization Methods. Training instability is a challeng-\ning issue for pre-training LLMs. To alleviate this issue,\nnormalization is a widely adopted strategy to stabilize the\ntraining of neural networks. In the vanilla Transformer [22],\nLayerNorm [256] is employed. Recently, several advanced\nnormalization techniques have been proposed as alterna-\ntives to LayerNorm, e.g., RMSNorm, and DeepNorm.\n•LayerNorm. In the early research, BatchNorm [265] is\na commonly used normalization method. However, it is\ndifficult to deal with sequence data of variable lengths and\nsmall-batch data."}
+{"query": "Language learning models seriously struggle to grasp complex symbols when they're thrown in scenarios they don't know jack about.", "pos": "For an example of\nthe out-of-domain test, LLMs could only see the examples\nwith two words in context, but it requires LLMs to concate-\nnate the last letters of three or more words. Typically, the\naccuracy of the generated symbols is adopted to evaluate\nthe performance of LLMs on these tasks. Thus, LLMs need\nto understand the semantic relations among the symbolic\noperations and their composition in complex scenarios.\nHowever, under the out-of-domain setting, as LLMs have\nnot seen the complex compositions of symbolic operations\nand rules ( e.g., twice the number of operations in context\nexamples), it is hard for LLMs to capture their accurate\nmeanings."}
+{"query": "Could you shed some light on the two primary ways that LLMs employ demonstrations as discussed in document 493?", "pos": "How LLMs Perform ICL? At the inference stage, researchers\nfocus on analyzing how the ICL capability operates based\non given demonstrations since no explicit learning or updat-\ning is involved. According to the discussion in [493], there\nare two main ways for LLMs to utilize demonstrations: task\nrecognition and task learning.\n•Task recognition. In the first way, LLMs recognize the\ntask from demonstrations and utilize the prior knowledge\nobtained from pre-training to solve new test tasks. A Proba-\nbly Approximately Correct (PAC) framework [494] has been\nproposed to assess the learnability of ICL."}
diff --git a/...examples/bge-en-icl/AIR-Bench/long-doc/book-a-brief-history-of-time_stephen-hawking.jsonl b/...examples/bge-en-icl/AIR-Bench/long-doc/book-a-brief-history-of-time_stephen-hawking.jsonl
@@ -0,0 +1,3 @@
+{"query": "Why is it believed the universe began at a particular time?", "pos": "According to a number of earlycosmologies and the Jewish/Christian/Muslim tradition, the universe started at a finite, and not very distant,time in the past. One argument for such a beginning was the feeling that it was necessary to have “First Cause”to explain the existence of the universe. (Within the universe, you always explained one event as being causedby some earlier event, but the existence of the universe itself could be explained in this way only if it had somebeginning.) Another argument was put forward by St. Augustine in his book The City of God. He pointed out\nthat civilization is progressing and we remember who performed this deed or developed that technique."}
+{"query": "Could you elucidate on the intricate procedure of stellar constitution?", "pos": "Andeven then it was a long time before the implications of the theory for massive stars were understood.To understand how a black hole might be formed, we first need an understanding of the life cycle of a star. A star isformed when a large amount of gas (mostly hydrogen) starts to collapse in on itself due to its gravitational attraction. Asit contracts, the atoms of the gas collide with each other more and more frequently and at greater and greater speeds –the gas heats up. Eventually, the gas will be so hot that when the hydrogen atoms collide they no longer bounce offeach other, but instead coalesce to form helium. The heat released in this reaction, which is like a controlled hydrogenbomb explosion, is what makes the star shine."}
+{"query": "Black hole existence evidence?", "pos": "the body that has collapsed must be lost when a black hole is formed, because afterward all we can possibly measureabout the body is its mass and rate of rotation. The significance of this will be seen in the next chapter.Black holes are one of only a fairly small number of cases in the history of science in which a theory was developed ingreat detail as a mathematical model before there was any evidence from observations that it was correct. Indeed, thisused to be the main argument of opponents of black holes: how could one believe in objects for which the onlyevidence was calculations based on the dubious theory of general relativity?"}
diff --git a/...ense_retriever/examples/bge-en-icl/AIR-Bench/long-doc/book-origin-of-species_darwin.jsonl b/...ense_retriever/examples/bge-en-icl/AIR-Bench/long-doc/book-origin-of-species_darwin.jsonl
@@ -0,0 +1,3 @@
+{"query": "So, like, what's the big deal about the top species from the bigger groups talked about in Chapter 4?", "pos": "In our second and fourth chapters, on Variation and on Natural Selection, I have attempted\nto show that it is the widely ranging, the much diffused and common, that is the dominant species\nbelonging to the larger genera, which vary most.  The varieties, or incipient species, thus produced\nultimately become converted, as I believe, into new and distinct species; and these, on the principle\nof inheritance, tend to produce other new and dominant species.  Consequently the groups which\nare now large, and which generally include many dominant species, tend to go on increasing\nindefinitely in size."}
+{"query": "Identify the unique species in the Chthamalinae subfamily of sessile cirripedes and the location of its fossil discovery.", "pos": "I suspect that but few of\nthe very many animals which live on the beach between high and low watermark are preserved.\nFor instance, the several species of the Chthamalinae (a sub-family of sessile cirripedes) coat the\nrocks all over the world in infinite numbers:  they are all strictly littoral, with the exception of a\nsingle Mediterranean species, which inhabits deep water and has been found fossil in Sicily,\nwhereas not one other species has hitherto been found in any tertiary formation:  yet it is now\nknown that the genus Chthamalus existed during the chalk period.  The molluscan genus Chiton\noffers a partially analogous case."}
+{"query": "Why are there flaws in the geological record?", "pos": "Nor is their rarity surprising, when we remember how large a proportion of the\nbones of tertiary mammals have been discovered either in caves or in lacustrine deposits; and that\nnot a cave or true lacustrine bed is known belonging to the age of our secondary or palaeozoic\nformations.\nBut the imperfection in the geological record mainly results from another and more important cause\nthan any of the foregoing; namely, from the several formations being separated from each other by\nwide intervals of time.  When we see the formations tabulated in written works, or when we follow\nthem in nature, it is difficult to avoid believing that they are closely consecutive."}
diff --git a/...ense_retriever/examples/bge-en-icl/AIR-Bench/long-doc/healthcare-pubmed_100k-200k_1.jsonl b/...ense_retriever/examples/bge-en-icl/AIR-Bench/long-doc/healthcare-pubmed_100k-200k_1.jsonl
@@ -0,0 +1,3 @@
+{"query": "What does 'O' represent in peptides?", "pos": "by changing the peptides to be amphiphilic or completely polar, they systematically synthesized several derived peptides. each of them has a different polar uncharged group : p11-8 (423, based on glutamine q, sequence ac-qqrfowofeqq-nh2 ; o represents ornithine), p11-12 (424, based on serine s, sequence ac-ssrfowofess- nh2), p11-16 (427, based on asparagine n, sequence ac-nnrfowofenn- nh2), and p11-18 (428, based on threonine t, sequence ac-ttrfowofett- nh2)."}
+{"query": "Could you elucidate on the system that was demonstrated by Van Esch and his team utilizing 1,3,5-triamide cyclohexane-based hydrogelators 67 for the alignment of nanofibers?", "pos": "used an electrical field to assist the alignment of the nanofibers and demonstrated that the application of a voltage bias, indeed, helps the directional orientation of the fibrils. using the 1,3,5-triamide cyclohexane-based hydrogelators 67, van esch et al. demonstrated an elegant system that forms well-defined nanostructures by the orthogonal self-assembly of hydrogelators and surfactants."}
+{"query": "What environmental factors influence peptide self-assembly?", "pos": "as pointed out by the authors, the hydrophobic effect between 267 molecules favors axial assembly and their electrostatic forces modulate lateral assembly. at a concentration of 0.05 wt %, the peptide self-assembles to form a filament consisting of about 120 molecules of 267. the authors also reported that various environmental factors (e.g., ph, salt, molecular crowding reagents, and   peptides) can regulate the self-assembled filaments in an assembly of predictable manner, which provides useful insights for developing coiled coils as peptide-based materials. it would be interesting to know the proteolytic stability of these self-assembled filaments. besides native peptides acting as hydrogelators, peptide derivatives can also self-assemble in water to form hydrogels."}
diff --git a/...ense_retriever/examples/bge-en-icl/AIR-Bench/long-doc/healthcare-pubmed_100k-200k_2.jsonl b/...ense_retriever/examples/bge-en-icl/AIR-Bench/long-doc/healthcare-pubmed_100k-200k_2.jsonl
@@ -0,0 +1,3 @@
+{"query": "Why do we get different parameter sets when looking at how ion and water-oxygen atoms interact?", "pos": "the problem here is which one to chose to obtain a consistent set of parameters. the multiple parameter sets arise because similar aij and bij terms can be obtained between the ion and water-oxygen atoms : for a certain rmin/2 and , there will be a corresponding bigger rmin/2 and smaller , or a smaller rmin/2 and bigger , which yield similar aij and bij terms (see eqs 54 and 55). when two different ion parameter sets, which give similar aij terms between an ion and oxygen in water, are applied to the same biomolecule, they may give quite different aij terms between the same atom type on the biomolecule and the metal ion after applying the combining rules."}
+{"query": "Chemical physicists managed to mock-up ions in common force fields using the 12-6 Lennard-Jones model without any direct bonding, which verified the structure traits of water-based potassium.", "pos": "the 12-6 lj nonbonded model remains a fast and practical way to simulate ions using classical force fields. the blyp functional was used for the system containing a k ion and 59 water molecules. in total, 0.168 ps of equilibration and 1.98 ps of sampling were performed in the nve ensemble. good agreement between the cpmd and classical md simulations was obtained for the structural properties of aqueous k, validating in part the classical representation of the k ion. moreover, it has also shown that it is possible to simultaneously simulate two or more experimental properties for some of the monovalent ions (e.g., na, k, rb, cs) using the 12-6 lj nonbonded model."}
+{"query": "Which concepts are encapsulated within classical models in AMOEBA?", "pos": "they also proposed that the ct effect may need to be included to improve the model. ponder, ren, and co-workers have created the atomic multipole optimized energetics for biomolecular simulation (amoeba) force field. it has bonded terms (bond, angle, dihedral, and improper torsion terms) represented using classical models. the bond and angle parameters are fit on the basis of qm-derived values (e.g., geometries and vibrational frequencies). the electrostatic interaction is represented by permanent monopoles (point charges), dipoles, and quadrupoles derived from the distributed multipole analysis (dma) procedure, along with the polarizable dipoles."}
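Each `.jsonl` file added above holds three records, one JSON object per line, with a `query` field and a `pos` field. The sketch below shows one way such a file could be read and checked. It is an assumption-laden illustration: the function name `iter_jsonl_examples` is hypothetical, and nothing in the diff shows how the repository itself consumes these files.

```python
import json
from typing import Dict, Iterator

REQUIRED_FIELDS = {"query", "pos"}  # the two keys present in every record above


def iter_jsonl_examples(path: str) -> Iterator[Dict[str, str]]:
    """Yield one example dict per non-empty line of a JSON Lines file.

    Hypothetical helper: raises ValueError if a record lacks `query` or `pos`.
    """
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                # Tolerate blank lines, e.g. a trailing newline at end of file.
                continue
            record = json.loads(line)
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                raise ValueError(
                    f"{path}:{line_number} is missing fields: {sorted(missing)}"
                )
            yield record
```

Reading line by line, rather than calling `json.load` on the whole file, is what distinguishes this `.jsonl` layout from the single-document `.json` files that the `runner.py` change loads.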