{"version":8,"graph":{"viewport":{"xmin":-3.2782285850133155,"ymin":-1107.6760677260318,"xmax":38.119571250577245,"ymax":29249.941252184937},"squareAxes":false},"randomSeed":"9ed0bb3573147f5f52ab5e64eb28061c","expressions":{"list":[{"id":"4","type":"table","columns":[{"values":["3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30"],"hidden":true,"id":"2","color":"#2d70b3","latex":"x_{1}"},{"values":["7693","9524","11557","14019","14382","12914","10350","7600","4701","2874","1738","886","526","250","173","95","73","62","36","25","31","35","33","29","28","0","29","0"],"id":"3","color":"#388c46","latex":"y_{1}"}]},{"type":"text","id":"9","text":"Extracted from a Google-associated wordlist that I came across online"},{"type":"text","id":"11","text":"It had somewhere on the order of 5 million words"},{"type":"text","id":"13","text":"This analysis only looks at the first 100 thousand words"},{"type":"text","id":"15","text":"It appears to be sorted by frequency (it starts with \"the\" and other common words, and ends with miscellaneous domain-specific terms and fake words)"},{"type":"text","id":"17","text":"In the data here, x = word length (in letters) and y = number of words of that length among the top 100 thousand"},{"type":"text","id":"19","text":"At x = 1 and x = 2, the counts are 26 and 676, respectively, which produce a discontinuity (the table starts at x = 3). This is due to encoding-space saturation: there are only 26 letters and 26^2 = 676 possible letter pairs"},{"type":"text","id":"21","text":"Modeling this with an exponential impulse (found at https://iquilezles.org/www/articles/functions/functions.htm )"},{"type":"expression","id":"23","color":"#000000","latex":"y_{1}\\sim ax_{1}e^{\\left(1-bx_{1}\\right)}","residualVariable":"e_{1}","regressionParameters":{"a":3160.0033867898637,"b":0.2566436781232041}},{"type":"text","id":"26","text":"Modeling this with a bell curve"},{"type":"expression","id":"27","color":"#388c46","latex":"y_{1}\\sim ce^{\\left(-d\\left(x_{1}+f\\right)^{2}\\right)}","residualVariable":"e_{2}","regressionParameters":{"c":14134.656327375462,"d":0.05365228121812108,"f":-6.611853364421262}},{"type":"expression","id":"28","color":"#6042a6","latex":"g\\left(x\\right)=ce^{\\left(-d\\left(x+f\\right)^{2}\\right)}","hidden":true},{"type":"text","id":"32","text":"These are how many one- and two-letter words the bell-curve fit predicts we'd see in the top 100K if encoding saturation weren't a limitation"},{"type":"expression","id":"29","color":"#000000","latex":"g\\left(1\\right)"},{"type":"expression","id":"30","color":"#c74440","latex":"g\\left(2\\right)"}]}}