=== for Arthur
My name is Arthur.
I have autism.
Right now, I read at a 3rd-grade level.
----
Please use words that I know from my grade.
Try to keep your answers brief.
Don't share opinions, just tell me facts.
If you're not sure about something, it's okay to say you don't know.
=== for Adam
My name is Adam.
I am in fourth grade.
I have above-grade-level reading skills, so please feel free to use precise vocabulary when communicating with me.
I enjoy facts, fun facts, and details.
----
Please ensure responses are accurate and based on facts.
Remain neutral in your opinions.
Stick to the facts.
If you lack information about something, simply say you don't know.
=== for Weiran
I am a software engineer.
I live in Fremont, California.
I appreciate examples when complex concepts are explained to me.
I enjoy connecting ideas across different domains and subjects.
----
Responses don’t need to be formal.
Keep responses brief.
Please refrain from sharing your opinions unless I specifically ask for them.
Always stick to the facts.
If you don’t have information on a topic, just say you don’t know.
If safety guidelines prevent you from saying or expressing something, feel free to tell me so, especially when what you would have said is true to the best of your knowledge. I won't be offended or hurt. I can handle the truth.
=== for translating to Chinese
I am a native Chinese speaker.
----
You are a translator who translates into Chinese.
Preserve the original meaning in your translation.
Your translation should resemble the way a native Chinese speaker would express themselves.
Provide only the translated text without any additional explanations.
=== for English translation
I am a native Chinese speaker.
English is my second language.
----
You are an English language translator and improver.
I will write to you in any language, and you will detect the language and translate it into English.
Also, you should correct my grammar mistakes.
You should improve the output so it sounds like something a native English speaker would say or write.
You need to keep the meaning of the content unchanged.
Do not be too formal or too informal.
You only need to reply with the corrected and improved English version of the content and nothing else.
You do not need to write explanations.
You do not need to output quotation marks around the entire answer.
=== for remembering concepts
I sometimes forget the concepts that I have learned recently.
----
I will provide you with various concepts, terms, topics, etc.
You can safely assume that I am already familiar with their definitions, so there is no need to explain them to me.
Please provide concrete application examples to aid my recollection.
Ensure that your answers are concise and in easy-to-understand language.
=== summarize interview
I am a Software Engineering Manager.
My company operates a major e-commerce platform.
I lead the Search Relevance team, which is tasked with keyword search optimization.
I conduct interviews for open positions within my team and across other engineering departments.
I will provide you with my notes for each candidate following their interview, one at a time.
----
Your role will be to help me craft a concise summary paragraph of the key takeaways from each interview session.
Please ensure that your summaries are reflective of my notes without merely repeating them.
The summaries should be written in a straightforward yet formal style.
=== Code diff
I am a software developer.
----
You are an expert at explaining Pull Requests.
You will be given the old and new code, separated by "=====".
You will tell me what has changed.
You will explain each change in detail.
Optionally, if you find the changes have improved the code, explain how; otherwise, skip.
Optionally, if you find bugs in the changes, call them out; otherwise, skip.
Optionally, if you believe there is a better way to make the change, call it out; otherwise, skip.
=== Tesla's end-to-end FSD
Elon described Tesla's end-to-end approach as “photons (images) in, controls out.” If it were “images in, images out,” that would be autoregression with self-supervised training on all the camera frames. But outputting images alone won't directly aid self-driving; we need to output control commands. In his CVPR'23 keynote, Ashok showcased the model predicting the next frame from all cameras. Two sets of predictions were shown: one for driving straight and another for switching lanes.
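As a thought experiment, here is what an action-conditioned next-frame predictor could look like in PyTorch. Everything below, including the TinyWorldModel class, its shapes, and the maneuver ids, is my own toy illustration of the idea, not Tesla's actual architecture.

```python
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    """Toy model: predicts a next frame from all cameras, conditioned on a candidate maneuver."""
    def __init__(self, latent_dim=128, num_maneuvers=2):
        super().__init__()
        # Encode each camera image into a small latent vector.
        self.encode = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, latent_dim),
        )
        # Embed the candidate maneuver: 0 = keep straight, 1 = switch lanes.
        self.maneuver = nn.Embedding(num_maneuvers, latent_dim)
        # Decode the fused latent back into a (very low-fidelity) predicted image.
        self.decode = nn.Sequential(
            nn.Linear(latent_dim, 16 * 8 * 8), nn.Unflatten(1, (16, 8, 8)),
            nn.ConvTranspose2d(16, 3, 16, stride=16),
        )

    def forward(self, cams, maneuver_id):
        # cams: (num_cams, 3, 128, 128) -> average per-camera latents into one scene latent.
        scene = self.encode(cams).mean(dim=0, keepdim=True)
        z = scene + self.maneuver(maneuver_id)
        return self.decode(z)  # predicted next frame, shape (1, 3, 128, 128)

model = TinyWorldModel()
cams = torch.randn(8, 3, 128, 128)            # the current frame from 8 cameras
straight = model(cams, torch.tensor([0]))     # predicted future if the car keeps straight
lane_change = model(cams, torch.tensor([1]))  # predicted future if the car switches lanes
```

The real model presumably predicts full multi-camera video rather than a single toy image, but the point is the same: the predicted future depends on the chosen maneuver.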
The “control” aspect is the final stage in the “perception -> planning -> control” sequence.
Tesla has a vast amount of camera footage for self-supervised training in the “perception” stage. They also have “labeled” data indicating what a human driver did in the scenarios captured by the cameras. This data can be used for supervised training in the “control” stage. However, I’ve never been able to grasp how the “planning” stage works, or how to train a layer that connects “perception” and “control” to achieve an end-to-end model that goes from “photons in” to “controls out.”
It’s not something I’ve read or confirmed, but here’s my wild guess: Tesla’s foundation model focuses solely on the Ego car interacting with its 3D environment. It knows little, if anything, about driving or navigating from point A to B. Essentially, this base world model operates on an “images in, images out” basis.
The foundational World Model alone isn’t useful for autonomous driving. This is similar to how OpenAI first trained GPT to have a wealth of general knowledge, even though it wasn’t specifically optimized for chat; OpenAI then fine-tuned it with Reinforcement Learning from Human Feedback (RLHF) to create ChatGPT.
Tesla only needs to fine-tune this World Model for the specific task of driving. After fine-tuning, we can feed in Inertial Measurement Unit (IMU) values and current control states, among other things, and the model can output car control commands while leveraging the fundamental understanding of the surrounding environment provided by the base model. The result is a driving model that operates on the principle of “images in, controls out.”
That makes even more sense to me for another reason. The World Model serves as a foundational understanding of the environment, acting as the brain in the end-to-end “images in, controls out” workflow. This model can be universally applied, whether for car driving, robot walking, or even a robot arm assembling cars on a production line. Tesla can continually refine this World Model, similar to how OpenAI improved GPT from version 3 to 3.5 and now to 4, and the updated version can be swapped in to enhance driving or robotic actions. Essentially, the driving or “control” layer merely surfaces knowledge the World Model already possesses but hasn’t yet demonstrated (in this case, “driving”), much like how ChatGPT enables GPT to engage in “chatting”.
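Here is a rough sketch of that “one world model, many task heads” idea. The class names (WorldModelBackbone, DrivingHead, RobotArmHead), the latent size, and the control dimensions are assumptions I made up for illustration, not anything Tesla has described.

```python
import torch
import torch.nn as nn

class WorldModelBackbone(nn.Module):
    """Shared environment understanding; pretrained once, reused by every task."""
    def __init__(self, latent_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, latent_dim),
        )

    def forward(self, images):
        # images: (batch, num_cams, 3, H, W) -> one scene latent per example
        b, c = images.shape[:2]
        z = self.encoder(images.flatten(0, 1)).view(b, c, -1)
        return z.mean(dim=1)

class DrivingHead(nn.Module):
    """Thin task-specific layer: scene latent -> steering / pedal commands."""
    def __init__(self, latent_dim=512):
        super().__init__()
        self.head = nn.Linear(latent_dim, 2)   # [steering, accelerator]

    def forward(self, latent):
        return self.head(latent)

class RobotArmHead(nn.Module):
    """Same backbone, different controls: scene latent -> joint torques."""
    def __init__(self, latent_dim=512, num_joints=7):
        super().__init__()
        self.head = nn.Linear(latent_dim, num_joints)

    def forward(self, latent):
        return self.head(latent)

backbone = WorldModelBackbone()           # imagine this is the big pretrained World Model
images = torch.randn(1, 8, 3, 128, 128)   # 8-camera input
latent = backbone(images)
car_controls = DrivingHead()(latent)      # "images in, controls out" for the car
arm_controls = RobotArmHead()(latent)     # swap the head, keep the World Model
```

The design choice this illustrates: the expensive part (the backbone) is shared and can be swapped for a better version later, while each task only needs its thin head retrained or fine-tuned.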
Here’s how I envision the training architecture.
First, train a foundational World Model. Use millions of clips for “images in, images out” self-supervised learning. After a huge amount of pre-training, this highly capable World Model is ready for use. Note that this model is solely adept at understanding the world around it and knows nothing about driving; it operates on an “images in, images out” basis.
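A minimal sketch of what this self-supervised step could look like, assuming clips are stored as tensors shaped (batch, time, cameras, channels, height, width). The tiny convolutional model and the plain MSE reconstruction loss are stand-ins I chose for illustration, not whatever Tesla actually uses.

```python
import torch
import torch.nn as nn

# Toy stand-in for the big World Model: a tiny image-to-image network.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Self-supervised objective: the "label" for frame t is simply frame t+1 from the same
# clip, so no human annotation is needed -- only raw camera footage.
for step in range(3):                          # toy loop; in reality, millions of clips
    clip = torch.randn(2, 9, 8, 3, 64, 64)     # fake clip: (batch, time, cams, C, H, W)
    current = clip[:, :-1].flatten(0, 2)       # every frame t, from every camera
    target = clip[:, 1:].flatten(0, 2)         # the frame that actually came next
    loss = nn.functional.mse_loss(model(current), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```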
Next, fine-tune the World Model using supervised learning to create a Driving Policy. Randomly select moments from human driver recordings, utilizing images from all cameras, IMU values, and current control states like steering wheel position and gas pedal values. Use the human driver’s next control decision as labeled data to refine the World Model. This integrated Driving Policy can now drive the car, operating on an “images in, controls out” basis, drawing on the foundational knowledge of the World Model.
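A sketch of this supervised fine-tuning step under the same toy assumptions: a pretrained backbone that turns camera frames into a scene latent, plus a small head that consumes the latent together with IMU values and the current control state. The two-value control output (steering, accelerator) and all dimensions are my own simplifications.

```python
import torch
import torch.nn as nn

class DrivingPolicy(nn.Module):
    """World Model backbone + a control head; fine-tuned end to end on human driving."""
    def __init__(self, world_model, latent_dim=256, imu_dim=6, ctrl_dim=2):
        super().__init__()
        self.world_model = world_model            # pretrained backbone, further tuned here
        self.head = nn.Sequential(
            nn.Linear(latent_dim + imu_dim + ctrl_dim, 128), nn.ReLU(),
            nn.Linear(128, ctrl_dim),             # next [steering, accelerator] command
        )

    def forward(self, frames, imu, current_controls):
        # frames: (batch, num_cams, 3, H, W) -> one scene latent per example
        b, c = frames.shape[:2]
        latent = self.world_model(frames.flatten(0, 1)).view(b, c, -1).mean(dim=1)
        return self.head(torch.cat([latent, imu, current_controls], dim=-1))

# Toy stand-in for the pretrained World Model (images -> latent).
world_model = nn.Sequential(
    nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 256),
)
policy = DrivingPolicy(world_model)
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)

for step in range(3):                              # toy loop over sampled driving moments
    frames = torch.randn(4, 8, 3, 64, 64)          # all cameras at a random moment
    imu = torch.randn(4, 6)                        # accelerometer + gyro readings
    current_controls = torch.randn(4, 2)           # current steering / pedal state
    human_next_controls = torch.randn(4, 2)        # what the human driver did next (label)
    pred = policy(frames, imu, current_controls)
    loss = nn.functional.mse_loss(pred, human_next_controls)  # imitate the human decision
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```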
Next, develop a Reward Model for desirable driving behavior. Using either the Driving Policy running online in shadow mode or offline recordings, identify the differences between the human driver’s decisions and the control outputs from the Driving Policy. Assume that the driving clips have been curated by human labelers to include only commendable driving behaviors; these serve as the standard for rewarding the Driving Policy. Using this data, you can then train the Reward Model.
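Here is one way the Reward Model training could look, framed like the pairwise-preference loss commonly used for RLHF reward models: in the same situation, the curated human driver’s control should score higher than the Driving Policy’s diverging control. The scene-latent input and every dimension below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores how much a control command looks like good human driving in a given scene."""
    def __init__(self, latent_dim=256, ctrl_dim=2):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(latent_dim + ctrl_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),                    # scalar reward
        )

    def forward(self, scene_latent, controls):
        return self.score(torch.cat([scene_latent, controls], dim=-1)).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

for step in range(3):                             # toy loop over curated clips
    scene = torch.randn(16, 256)                  # scene latent (e.g. from the World Model)
    human_controls = torch.randn(16, 2)           # curated "commendable" human decision
    policy_controls = torch.randn(16, 2)          # what the Driving Policy did in shadow mode
    r_human = reward_model(scene, human_controls)
    r_policy = reward_model(scene, policy_controls)
    # Pairwise preference loss: the human decision should score higher than the policy's.
    loss = -nn.functional.logsigmoid(r_human - r_policy).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```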
Finally, employ reinforcement learning to optimize the Driving Policy. You can use a method like PPO for this step. Simulate a scenario, prompt the Driving Policy to produce driving controls, and then use the Reward Model to score those actions. Based on these scores, update the policy accordingly.
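And a heavily simplified, single-step sketch of that PPO-style update: sample controls from the Driving Policy in a simulated scene, score them with the Reward Model, and push up the probability of higher-reward controls with PPO’s clipped objective. The Gaussian action head, the linear stand-ins for both models, and the crude mean-baseline advantage are all my own simplifications.

```python
import torch
import torch.nn as nn

policy_mean = nn.Linear(256, 2)         # toy Driving Policy head: scene latent -> control mean
log_std = nn.Parameter(torch.zeros(2))  # learned action noise
reward_model = nn.Linear(258, 1)        # toy stand-in for the trained Reward Model
optimizer = torch.optim.AdamW(list(policy_mean.parameters()) + [log_std], lr=1e-4)

for step in range(3):                                   # toy PPO iterations
    scene = torch.randn(32, 256)                        # simulated scene latents
    dist = torch.distributions.Normal(policy_mean(scene), log_std.exp())
    controls = dist.sample()                            # proposed driving controls
    old_log_prob = dist.log_prob(controls).sum(-1).detach()

    with torch.no_grad():                               # score the actions with the Reward Model
        rewards = reward_model(torch.cat([scene, controls], dim=-1)).squeeze(-1)
        advantage = rewards - rewards.mean()            # crude baseline instead of a value net

    # In real PPO the collected batch is reused for several epochs; that is when the
    # probability ratio drifts away from 1 and the clipping below actually kicks in.
    new_dist = torch.distributions.Normal(policy_mean(scene), log_std.exp())
    ratio = (new_dist.log_prob(controls).sum(-1) - old_log_prob).exp()
    clipped = torch.clamp(ratio, 0.8, 1.2)              # PPO clipping with epsilon = 0.2
    loss = -torch.min(ratio * advantage, clipped * advantage).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```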
In the end, we’ll have a Driving Policy built on top of the World Model that emulates the behavior of the best human drivers while issuing driving control commands.