Mindshifts in Software Engineering
An oral history of the first contact between humanity and an alien intelligence
If there is a generative AI person in your circle of friends, yes, they’ve probably felt that they were going slightly crazy in the last year. Only a few generative AI systems have truly succeeded in production, and they’ve been hard to build. This wonderful paper from Microsoft-GitHub, the first organization outside OpenAI to unveil a large-scale system, makes clear the challenges and opportunities:
Time-consuming process of trial and error: “Early days, we just wrote a bunch of crap to see if it worked. Experimenting is the most time-consuming [thing] if you don’t have the right tools. We need to build better tools”
Wrangling prompt output: “It would make up objects that didn’t conform to that JSON schema, and we’d have to figure out what to do with that” “if the model is kind of inherently predisposed to respond with a certain type of data, we don’t try to force it to give us something else because that seems to yield a higher error rate” - on eventually accepting that file trees would be better generated as ASCII output and then parsed
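The file-tree lesson generalizes: rather than forcing the model to emit well-formed JSON, accept the format it naturally produces and parse it deterministically on your side. A minimal sketch of that idea, assuming a simple space-indented ASCII tree (the indent width and file names here are illustrative, not from the paper):

```python
def parse_ascii_tree(text: str, indent: int = 2) -> dict:
    """Parse a space-indented ASCII file tree (the kind of output a
    model produces readily) into a nested dict of children."""
    root: dict = {}
    stack = [(-1, root)]  # (depth, children) pairs from root down to cursor
    for line in text.splitlines():
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip(" "))) // indent
        # climb back up until the entry on top of the stack is this one's parent
        while stack and stack[-1][0] >= depth:
            stack.pop()
        children: dict = {}
        stack[-1][1][line.strip()] = children
        stack.append((depth, children))
    return root

tree = parse_ascii_tree("src\n  app.py\n  utils\n    io.py\n")
# → {'src': {'app.py': {}, 'utils': {'io.py': {}}}}
```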
Prompt management: “it’s a mistake doing too much with one prompt” “So we end up with a library of prompts and things like that.”
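A library of small, single-purpose prompts can start as simply as named templates; this sketch is only illustrative (the prompt names and wording are assumptions, not the paper's):

```python
# A tiny prompt library: many narrow templates rather than one prompt
# that tries to do too much. Names and wording are illustrative.
PROMPTS = {
    "explain": "Explain what the following code does:\n{code}",
    "fix": "Suggest a fix for this error:\n{error}",
}

def render(name: str, **fields: str) -> str:
    """Look up a prompt by name and fill in its fields."""
    return PROMPTS[name].format(**fields)

msg = render("explain", code="print('hi')")
```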
Every test is a flaky test: “that’s why we run each test 10 times” “If you do it for one scenario no guarantee it will work for another scenario” “[manually curated spreadsheets with hundreds of] input/output examples” - on how they managed testing
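Running each test 10 times turns a flaky pass/fail bit into a measurable pass rate. A minimal harness along those lines (the stand-in check and any threshold a suite might apply are illustrative assumptions, not the paper's setup):

```python
def pass_rate(test_fn, runs: int = 10) -> float:
    """Run a nondeterministic test repeatedly and report the fraction
    of runs that pass, instead of a single pass/fail bit."""
    passes = 0
    for _ in range(runs):
        try:
            test_fn()
            passes += 1
        except AssertionError:
            pass
    return passes / runs

def check_model_output():
    # Stand-in for asserting on a model response; in practice this would
    # call the model with one curated input/output example from the spreadsheet.
    assert True

rate = pass_rate(check_model_output, runs=10)
# a suite might then require, say, rate >= 0.9 rather than rate == 1.0
```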
Creating benchmarks and reaching testing adequacy: “especially for more qualitative output than quantitative, it might just be humans in the loop saying yes or no [but] the hardest parts are testing and benchmarks [still]” “most of these, like each of these tests, would probably cost 1-2 cents to run, but once you end up with a lot of them, that will start adding up anyway” “Where is that line that clarifies we’re achieving the correct result without overspending resources and capital to attain perfection?”
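At 1-2 cents per run, cost scales multiplicatively with suite size and repetition, which is why it "starts adding up." A back-of-envelope sketch (the example counts are illustrative assumptions, only the per-run price comes from the quote):

```python
def suite_cost(n_tests: int, runs_per_test: int, cost_per_run: float) -> float:
    """Dollar cost of one full pass over the test suite."""
    return n_tests * runs_per_test * cost_per_run

# e.g. 500 curated input/output examples, each run 10 times to absorb
# flakiness, at the high end of the quoted 1-2 cent figure:
cost = suite_cost(n_tests=500, runs_per_test=10, cost_per_run=0.02)
# → 100.0 dollars per full evaluation pass
```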
Safety and privacy: “We have telemetry, but we can’t see user prompts, only what runs in the back end, like what skills get used. For example, we know the explain skill is most used but not what the user asked to explain.” “telemetry will not be sufficient; we need a better idea to see what’s being generated.”
Mindshifts in software engineering: “So, for someone coming into it, they have to come into it with an open mind, in a way, they kind of need to throw away everything that they’ve learned and rethink it. You cannot expect deterministic responses, and that’s terrifying to a lot of people. There is no 100% right answer. You might change a single word in a prompt, and the entire experience could be wrong. The idea of testing is not what you thought it was. There is no, like, this is always 100% going to return that yes, that test passed. 100% is not possible anymore”
The whole paper is interesting as an oral history of the first contact between humanity and an alien intelligence. Hat Tip @vboykis



